If you have read one blog post about retrieval-augmented generation, you have read twenty. They all cover the same four things: chunking strategies, embedding models, vector databases, and a clever prompt template. None of these are what will take your RAG system to production, and none of these are where your real bugs will come from.
The things that actually matter, in rough order of how much pain they will cause you: evaluation harnesses, latency budgets, prompt versioning, retrieval debugging, and grounding verification. Let me walk through each.
Evaluation harnesses. Without an eval suite, you cannot tell whether a change to your prompt helped or hurt. You will end up running the same five test queries by hand and shipping on vibes. That works for a demo. It does not work for a product. Invest in evals before you invest in anything else. Start with fifty handwritten golden queries and grow from there.
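A minimal sketch of what such a harness can look like. The `answer_query` function is a hypothetical stand-in for your actual retrieval-plus-generation pipeline, and scoring by a required substring is just the simplest check that gets you off vibes:

```python
# Minimal eval harness: run golden queries through the pipeline and score
# each answer by whether it contains a fact it must mention to pass.
from dataclasses import dataclass


@dataclass
class GoldenQuery:
    question: str
    must_contain: str  # a fact the answer must mention to pass


def answer_query(question: str) -> str:
    # Placeholder: call your retrieval + generation stack here.
    return "Our refund window is 30 days from purchase."


def run_evals(goldens: list[GoldenQuery]) -> float:
    passed = sum(
        1 for g in goldens
        if g.must_contain.lower() in answer_query(g.question).lower()
    )
    return passed / len(goldens)


goldens = [
    GoldenQuery("How long is the refund window?", "30 days"),
    GoldenQuery("When do refunds expire?", "30 days"),
]
score = run_evals(goldens)  # fraction of golden queries that passed
```

Substring checks are crude, but they are deterministic and cheap, which is exactly what you want for the first fifty queries. Graduate to model-graded evals once the harness itself is trustworthy.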
Latency budgets. A retrieval step plus a generation step takes somewhere between eight hundred milliseconds and four seconds depending on your stack. For interactive UX, that is a lifetime. Design your interface for the waiting — streaming, optimistic UI, partial results. Don't pretend it will be fast.
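One way to make a latency budget real rather than aspirational is to measure every stage against an explicit per-stage allowance and record overruns. This is a sketch; the budget numbers and stage names are illustrative assumptions, not recommendations:

```python
# Per-stage latency budgets, enforced by timing each stage and recording
# any overrun so it shows up in monitoring instead of in user complaints.
import time
from contextlib import contextmanager

BUDGET_MS = {"retrieval": 300, "generation": 2500}  # assumed budgets
overruns: list[str] = []


@contextmanager
def stage(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        if elapsed_ms > BUDGET_MS[name]:
            overruns.append(f"{name} over budget: {elapsed_ms:.0f}ms")


with stage("retrieval"):
    time.sleep(0.01)  # stand-in for a vector search call
```

In production you would emit the overruns to your metrics pipeline rather than a list, but the discipline is the same: a number you can alert on, not a feeling that things got slow.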
Prompt versioning. You will iterate on your system prompt for two years. Put it in git. Tag every production deploy. Rollback is a feature. A/B testing prompts requires infrastructure, and most teams start building it three months too late.
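Even before full A/B infrastructure, you can make every production answer traceable to an exact prompt. One lightweight approach, sketched here, is to content-address the prompt text and log the hash with every request; the hash maps back to a git commit:

```python
# Content-addressed prompt version: log this hash with every request so any
# production answer can be traced to the exact prompt text in git history.
import hashlib

SYSTEM_PROMPT = "You are a support assistant. Answer only from the provided context."


def prompt_version(prompt: str) -> str:
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


version = prompt_version(SYSTEM_PROMPT)
```

Twelve hex characters is plenty to disambiguate every prompt you will ever ship, and the scheme costs nothing to adopt on day one.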
Retrieval debugging. When the model says something wrong, the first question is always: did we retrieve the right thing? Build a logging path that captures exactly what was retrieved for any given query. Without it, you are blind.
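The logging path can be as simple as one JSON line per request capturing the query, the chunk identifiers, and their scores. A sketch, with the log sink abstracted as a list of lines for illustration:

```python
# Retrieval logging: one JSON line per request with the query and exactly
# which chunks were retrieved at what score, so any bad answer can be
# traced back to what the model actually saw.
import json
import time


def log_retrieval(query: str, chunks: list[dict], sink: list[str]) -> None:
    record = {
        "ts": time.time(),
        "query": query,
        "chunks": [{"id": c["id"], "score": c["score"]} for c in chunks],
    }
    sink.append(json.dumps(record))


log_lines: list[str] = []
log_retrieval("refund policy", [{"id": "doc-7#3", "score": 0.82}], log_lines)
```

In a real system the sink is your structured-logging pipeline; the point is that the record is queryable after the fact, not buried in free-text logs.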
Grounding verification. Even with perfect retrieval, models can hallucinate. For high-stakes outputs, run a second model as a grounding check against the retrieved context. It is not expensive, and it catches the most embarrassing failures.
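The shape of such a check, sketched below. In production, `is_grounded` would be a call to a second model prompted to judge whether the answer is supported by the retrieved context; here a token-overlap heuristic stands in so the example is runnable, and the threshold is an illustrative assumption:

```python
# Grounding check sketch: flag answers whose content is not supported by
# the retrieved context. A token-overlap heuristic stands in for the
# second-model judgment described in the text.
def is_grounded(answer: str, context: str, threshold: float = 0.5) -> bool:
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return False
    overlap = len(answer_tokens & context_tokens) / len(answer_tokens)
    return overlap >= threshold


context = "refunds are accepted within 30 days of purchase"
grounded = is_grounded("refunds accepted within 30 days", context)
```

Gate only your high-stakes outputs through the check, and route failures to a fallback ("I could not verify this") rather than silently shipping the ungrounded answer.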
None of this is the fun part. All of it is the part that earns the right to call your system production-ready.