What Matters in Production RAG
Arpit Bhayani
Most of us build RAG the same way: follow a tutorial that embeds a handful of PDFs, stores the vectors in a local Chroma instance, and chains everything together with LangChain (if that’s still a thing). The demo works. The answer looks reasonable. Then you take it to production and it falls apart in quiet, hard-to-diagnose ways.
This article is about what comes after the demo. It covers the fundamentals of how RAG actually works under the hood, the engineering challenges of keeping an index fresh and correct over time, and how to build the observability layer that lets you answer “why did the system retrieve that?” when things go wrong. None of these topics are exotic. All of them are consistently underbuilt in practice.
RAG Basics
The core idea is simple: instead of asking an LLM to answer from memory, you retrieve relevant documents at query time and inject them into the prompt as context. The model’s role shifts from “know everything” to “reason over what you are given.” This architectural choice has made RAG the dominant pattern for grounding LLMs in specific, current, or proprietary knowledge.
A RAG system has two distinct pipelines that run at different times.
The indexing pipeline runs offline (or in the background). It ingests raw documents, splits them into chunks, converts each chunk into a dense vector embedding, and stores those vectors in a vector database alongside metadata and the original text. This pipeline populates the knowledge base the retriever will search at query time.
The query pipeline runs online, per user request. It takes the user’s question, embeds it using the same model used during indexing, searches the vector database for the nearest chunks, assembles those chunks into a context window, and sends the whole thing to the LLM as a prompt.
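The two pipelines can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not a production design: `toy_embed` is a hypothetical stand-in for a real embedding model (a hashed bag-of-words), the index is a plain Python list rather than a vector database, and the LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import zlib

DIM = 64  # toy embedding size, chosen arbitrarily for this sketch


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # crc32 is deterministic across runs, unlike the built-in hash() for strings
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_index(chunks: list[str]) -> list[dict]:
    """Indexing pipeline: embed each chunk and store the vector next to the text."""
    return [
        {"id": i, "text": chunk, "vector": toy_embed(chunk)}
        for i, chunk in enumerate(chunks)
    ]


def retrieve(index: list[dict], question: str, k: int = 2) -> list[dict]:
    """Query pipeline, step 1: embed the question with the same model and rank chunks."""
    q = toy_embed(question)
    # Vectors are unit length, so the dot product equals cosine similarity.
    ranked = sorted(
        index,
        key=lambda rec: sum(a * b for a, b in zip(q, rec["vector"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, retrieved: list[dict]) -> str:
    """Query pipeline, step 2: assemble retrieved chunks into the context window."""
    context = "\n\n".join(rec["text"] for rec in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


chunks = [
    "refunds are processed within five days",
    "the office is closed on sunday",
]
index = build_index(chunks)
prompt = build_prompt("when is the office closed", retrieve(index, "when is the office closed", k=1))
```

Note that `toy_embed` is called in both `build_index` and `retrieve`. That shared call is the whole contract between the two pipelines: whatever produces the stored vectors must also produce the query vector.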
The math underlying the retrieval step is cosine similarity. Two vectors are considered close if the angle between them is small:
$$\text{similarity}(q, d) = \frac{q \cdot d}{\|q\| \cdot \|d\|}$$
Where $q$ is the query embedding and $d$ is a document chunk embedding. In practice, most vector databases use approximate nearest neighbor (ANN) search rather than exact exhaustive search, because scanning billions of vectors at query time is prohibitively slow. HNSW (Hierarchical Navigable Small World) is the dominant algorithm: it builds a layered proximity graph during indexing that allows retrieval in $O(\log n)$ time at the cost of a small, tunable recall loss.
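For reference, here is the exact, exhaustive version of the search that ANN indexes approximate. It is a small pure-Python sketch: the function names are illustrative, and returning 0.0 for a zero-length vector is a convention chosen here, not something the formula defines.

```python
import math


def cosine_similarity(q: list[float], d: list[float]) -> float:
    """Cosine of the angle between q and d: (q . d) / (|q| * |d|)."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm_q * norm_d)


def exact_top_k(query: list[float], vectors: list[list[float]], k: int) -> list[int]:
    """Brute-force search: score every stored vector, return indices of the k best."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

`exact_top_k` touches every stored vector on every query, so its cost grows linearly with the size of the index. That linear scan is the work an HNSW graph avoids, in exchange for occasionally missing a true nearest neighbor.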
Chunking
Chunking is where most RAG systems silently fail. The intuition is straightforward: chunks need to be small enough that retrieved text is specific and relevant, but large enough that they contain complete thoughts. In practice, getting this right requires understanding your document corpus.
The naive approach is fixed-size chunking at some character or token count, say 512 tokens with a 128-token overlap. It is simple and fast. It is also routinely wrong. Fixed-size chunking cuts sentences in half, separates questions from their answers in FAQ documents, and splits code across function boundaries.
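To make the failure mode concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words as a rough stand-in for model tokens; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(tokens: list, size: int = 512, overlap: int = 128) -> list[list]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Whitespace "tokens" are an assumption made to keep the example readable.
faq = "Q: How do refunds work? A: Refunds take five days.".split()
for chunk in fixed_size_chunks(faq, size=4, overlap=1):
    print(" ".join(chunk))
```

With a window of 4 and an overlap of 1, this ten-token FAQ entry becomes three chunks, and the first one ends mid-question with no part of the answer in it. The window has no idea where a sentence or a question-answer pair begins or ends.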
The approaches that actually work in production:
Recursive splitting: split on paragraphs first, then sentences, then characters as a fallback. This preserves semantic structure far better than character counting.
Semantic chunking: embed consecutive sentences and insert chunk boundaries where cosine similarity between adjacent sentences drops below a threshold. This identifies genuine topic shifts rather than arbitrary position boundaries.
Structure-aware splitting: for code, split at function or class boundaries using AST parsing. For legal documents, split at clause boundaries. For contracts, include the parent section heading with every child chunk.
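Of the three approaches, recursive splitting is the easiest to sketch without external dependencies. The version below is an illustration under assumptions: the character budget `max_chars`, the regex used to detect sentence ends, and the function name are all choices made for this example, and libraries that offer recursive splitters may behave differently.

```python
import re

# (split pattern, joiner) pairs, coarsest first: paragraphs, then sentences.
LEVELS = [
    (re.compile(r"\n\s*\n"), "\n\n"),
    (re.compile(r"(?<=[.!?])\s+"), " "),
]


def recursive_split(text: str, max_chars: int = 200, level: int = 0) -> list[str]:
    """Split on the coarsest boundary that fits; fall back to finer ones."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if level >= len(LEVELS):
        # Last resort: hard character cut, as in fixed-size chunking.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    pattern, joiner = LEVELS[level]
    pieces = [p for p in pattern.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{joiner}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too large: retry with the next finer boundary.
            chunks.extend(recursive_split(piece, max_chars, level + 1))
    if current:
        chunks.append(current)
    return chunks
```

Adjacent pieces are merged greedily up to the budget, so short paragraphs end up sharing a chunk while an oversized paragraph is broken at sentence boundaries instead of at an arbitrary offset.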
Always store metadata with each chunk: the source document ID, section heading, page number, creation timestamp, and a content hash. You will need all of these later, both for filtering and for keeping the index current.
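One possible shape for that per-chunk record is sketched below. The field names, the choice of SHA-256, and the whitespace normalization applied before hashing are assumptions for illustration; the only requirement from the text is that each of these pieces of metadata is stored.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ChunkRecord:
    doc_id: str        # source document ID
    section: str       # nearest section heading
    page: int          # page number in the source
    text: str          # original chunk text
    content_hash: str  # fingerprint of the chunk content
    created_at: str    # ISO 8601 timestamp, UTC


def make_chunk_record(
    doc_id: str,
    section: str,
    page: int,
    text: str,
    now: Optional[datetime] = None,
) -> ChunkRecord:
    # Assumption: collapse whitespace first so reflowed text hashes identically.
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    created = (now or datetime.now(timezone.utc)).isoformat()
    return ChunkRecord(doc_id, section, page, text, digest, created)
```

Comparing `content_hash` values between an old and a new version of a document is a cheap way to tell which chunks actually changed, which is what makes the hash useful for keeping the index current.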
Embedding Models and the Model-Lock Problem
The embedding model you choose during indexing is a ‘long-term commitment’ (sorry, could not come up with a better wording here). Every vector in your index was produced by that model. If you switch models, every existing vector becomes incommensurable with the new query embeddings, and you must re-embed the entire corpus.
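One cheap safeguard is to record the model identity on the index itself and reject anything embedded with a different model. The sketch below is hypothetical: `VectorIndex` is an in-memory stand-in invented for this example, not the API of any vector database, and the model names in the usage are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers which model produced it."""
    model_name: str
    dim: int
    vectors: dict = field(default_factory=dict)

    def _check(self, vector: list, model_name: str) -> None:
        # The name check matters even when dimensions match: two models with
        # the same output size still place text in unrelated vector spaces.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got {model_name!r}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, chunk_id: str, vector: list, model_name: str) -> None:
        self._check(vector, model_name)
        self.vectors[chunk_id] = vector

    def check_query(self, vector: list, model_name: str) -> None:
        """Call before searching so a mismatched query fails loudly."""
        self._check(vector, model_name)


index = VectorIndex(model_name="model-a", dim=3)
index.add("chunk-1", [0.1, 0.2, 0.3], model_name="model-a")
```

A guard like this does not remove the need to re-embed after a model switch. It only turns a silent drop in retrieval quality into an immediate error.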
Production-grade options as of mid-2026:
text-embedding-3-large (OpenAI): 3072-dimensional, best general-purpose recall, but API-dependent
embed-v3 (Cohere): strong multilingual performance, supports truncation modes
bge-large-en-v1.5 (BAAI): open-source, deployable locally, competitive with the above for English
e5-mistral-7b-instruct: instruction-tuned, excellent for asymmetric retrieval tasks
RAG Indexing Pipelines
Here is where most tutorials stop...