Scaling RAG pipelines from a prototype to a production system handling thousands of queries per second (QPS) reveals a harsh reality: default configurations rarely meet sub-second service level agreements (SLAs).
Achieving consistent low latency at scale requires a fundamental shift in perspective. Speed is not merely a function of a faster vector database. Instead, latency must be treated as a tightly managed, end-to-end budget across retrieval, infrastructure, and generation.
This article focuses on the engineering discipline behind production-ready RAG. In practice, that comes down to a few principles that consistently separate prototypes from reliable systems.
1. The latency budget
For interactive workloads, users perceive a system as “slow” if the initial response takes longer than ~800ms. To comfortably meet a P95 latency target within this range, engineers must enforce strict budgets on the pre-generation phases.
A practical rule is that retrieval plus prompt construction must be completed in under 250ms (P95). If this phase bleeds into 400ms+, the time remaining for the LLM to generate the first token becomes dangerously thin.
This budget forces hard caps on every stage: tokenization (negligible), Approximate Nearest Neighbor (ANN) search (<100ms), and re-ranking (<80ms).
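To make the budget enforceable rather than aspirational, each stage needs its own deadline. Below is a minimal asyncio sketch, assuming hypothetical async `ann_search` and `rerank` callables; the budget numbers mirror the caps above:

```python
import asyncio

# Illustrative per-stage budgets (ms) for the pre-generation phase.
STAGE_BUDGETS_MS = {"ann_search": 100, "rerank": 80}

async def run_with_deadline(stage_name, coro):
    """Run a pipeline stage, failing fast if it exceeds its latency budget."""
    budget_s = STAGE_BUDGETS_MS[stage_name] / 1000
    try:
        return await asyncio.wait_for(coro, timeout=budget_s)
    except asyncio.TimeoutError:
        # In production you would likely fall back to cached or partial results
        # rather than failing the whole request.
        raise RuntimeError(f"{stage_name} exceeded its {budget_s * 1000:.0f}ms budget")

async def pre_generation(query_embedding, ann_search, rerank):
    # ann_search and rerank are assumed async callables, passed in by the caller.
    candidates = await run_with_deadline("ann_search", ann_search(query_embedding))
    return await run_with_deadline("rerank", rerank(candidates))
```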
2. Multi-layer caching
The most effective way to reduce latency is to avoid performing work altogether. A robust RAG pipeline requires three distinct caching layers.
i) Query & semantic caching
Before hitting the retrieval stack, check if the question has been asked before.
- Exact cache: Hash normalized queries (lowercased, punctuation stripped) to serve instant answers for identical requests.
- Semantic cache: Use a lightweight, in-memory ANN index to find highly similar previous queries. If a match exceeds a high similarity threshold, return the previously retrieved documents or the final answer.
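A minimal sketch of both layers, assuming a hypothetical `embed(text)` function that returns an L2-normalized NumPy vector; the semantic layer here is a brute-force in-memory scan, which a production system would replace with a lightweight ANN index:

```python
import hashlib
import re
import numpy as np

exact_cache: dict[str, str] = {}        # hash of normalized query -> cached answer
semantic_keys: list[np.ndarray] = []    # embeddings of previously seen queries
semantic_values: list[str] = []         # cached answers (or retrieved doc IDs)
SIM_THRESHOLD = 0.95                    # assumed similarity cutoff; tune per model

def normalize(query: str) -> str:
    # Lowercase and strip punctuation so trivially different queries collide.
    return re.sub(r"[^\w\s]", "", query.lower()).strip()

def cache_lookup(query: str, embed) -> str | None:
    # Layer 1: exact cache keyed on a hash of the normalized query.
    key = hashlib.sha256(normalize(query).encode()).hexdigest()
    if key in exact_cache:
        return exact_cache[key]
    # Layer 2: semantic cache via cosine similarity against past query embeddings.
    if semantic_keys:
        q = embed(query)                           # assumed L2-normalized
        sims = np.stack(semantic_keys) @ q
        best = int(np.argmax(sims))
        if sims[best] >= SIM_THRESHOLD:
            return semantic_values[best]
    return None
```

Populating both layers on a cache miss is omitted for brevity; the write path is symmetric.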
ii) Retrieval-level caching
Instead of caching the full text response, cache the intermediate retrieval results. Map a query’s embedding bucket (using techniques like Locality-Sensitive Hashing or quantized centroids) to a list of document IDs. This allows similar queries to reuse the same set of retrieved candidates, drastically reducing expensive vector database calls.
- Engineering note: Ensure cache keys include tenant IDs and filter signatures to prevent data leakage in multi-tenant systems. Implement strict Time-To-Live (TTL) eviction, as invalidating approximate buckets based on document updates is notoriously difficult.
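One way to construct such keys, sketched below under the assumption that the caller has already mapped the query embedding to a coarse centroid ID (via LSH or a quantizer, not shown); tenant IDs and filter signatures are part of the key, and eviction is purely TTL-based:

```python
import hashlib
import json
import time

retrieval_cache: dict[str, tuple[float, list[str]]] = {}  # key -> (expiry, doc IDs)
TTL_SECONDS = 300  # assumed TTL; tune to how quickly the corpus changes

def retrieval_cache_key(tenant_id: str, filters: dict, centroid_id: int) -> str:
    # Filters are serialized deterministically so logically identical
    # filter sets produce the same signature.
    filter_sig = hashlib.sha1(json.dumps(filters, sort_keys=True).encode()).hexdigest()
    return f"{tenant_id}:{filter_sig}:{centroid_id}"

def get_cached_candidates(key: str) -> list[str] | None:
    entry = retrieval_cache.get(key)
    if entry is None:
        return None
    expiry, doc_ids = entry
    if time.monotonic() > expiry:          # strict TTL eviction
        del retrieval_cache[key]
        return None
    return doc_ids

def put_cached_candidates(key: str, doc_ids: list[str]) -> None:
    retrieval_cache[key] = (time.monotonic() + TTL_SECONDS, doc_ids)
```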
iii) Model KV caching
Leverage Key-Value (KV) caching at the LLM inference provider layer. By reusing the pre-computed attention tensors for shared prompt prefixes (such as system instructions and context templates), you only “pay” the latency cost for processing new, incremental tokens.
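Prefix reuse only pays off when the shared prompt portion is byte-identical across requests, so prompt assembly should put static content first. A minimal sketch (the template text is an assumption, not any particular provider's format):

```python
# Keep the static prefix byte-identical across requests so the inference
# server can reuse cached KV tensors for these tokens.
SYSTEM_PREFIX = (
    "You are a support assistant. Answer strictly from the provided context.\n"
    "If the context does not contain the answer, say so.\n"
)

def build_prompt(context_chunks: list[str], question: str) -> str:
    # Static prefix first, then retrieved context, then the per-request question.
    context = "\n\n".join(context_chunks)
    return f"{SYSTEM_PREFIX}\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
```

Serving stacks with automatic prefix caching (vLLM, for example) can then reuse the KV tensors for the static prefix and only compute attention over the per-request suffix.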
3. Redefining retrieval architecture
A naive RAG pipeline runs serially: embed → search → re-rank → generate. Sub-second architectures must exploit parallelism and pipelining.
i) Parallel “fan-out” retrieval
Treat retrieval as a graph, not a single step. Initiate multiple retrieval paths simultaneously:
- Dense retrieval: Semantic vector search for conceptual matches.
- Sparse retrieval: BM25/keyword search for exact term matching (crucial for acronyms or product IDs).
- Knowledge graph: Optional lookups for structured relationships.
These branches execute in parallel with strict timeouts. Their results are merged using reciprocal rank fusion (RRF) or a lightweight re-ranker. This ensures robustness against “cold spots” in vector indices without incurring a latency penalty.
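A minimal asyncio sketch of the fan-out and RRF merge; `dense_search`, `sparse_search`, and `kg_lookup` are placeholder async functions that each return a ranked list of document IDs, and the 120ms branch timeout is an illustrative value:

```python
import asyncio

RRF_K = 60  # standard reciprocal rank fusion constant

async def fan_out_retrieve(query, branches, timeout_s=0.12):
    """Run retrieval branches in parallel; drop any branch that misses its deadline."""
    results = await asyncio.gather(
        *(asyncio.wait_for(branch(query), timeout=timeout_s) for branch in branches),
        return_exceptions=True,
    )
    ranked_lists = [r for r in results if isinstance(r, list)]  # skip timeouts/errors
    return reciprocal_rank_fusion(ranked_lists)

def reciprocal_rank_fusion(ranked_lists, top_n=20):
    """Merge ranked lists by summing 1 / (k + rank) contributions per document."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (RRF_K + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage (branch functions are hypothetical):
# merged = asyncio.run(fan_out_retrieve(query, [dense_search, sparse_search, kg_lookup]))
```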
ii) Pipelining generation
Do not wait for the entire retrieval and re-ranking process to finish before engaging the LLM. Start the LLM inference connection as soon as the top-N candidates are identified. The retrieved context can be streamed into the prompt, allowing the model to begin processing immediately.
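A rough sketch of this overlap, assuming a hypothetical async LLM client exposing `warm_connection()` and `stream_generate()`: the connection is established while retrieval and re-ranking run, and the final prompt is sent the moment the top-N list is ready.

```python
import asyncio

async def answer(query, retrieve, rerank, llm_client, build_prompt):
    # Warm the inference connection while retrieval is still in flight.
    warmup = asyncio.create_task(llm_client.warm_connection())

    candidates = await retrieve(query)
    top_n = await rerank(query, candidates)

    await warmup  # connection is typically ready well before this point
    prompt = build_prompt(top_n, query)
    async for token in llm_client.stream_generate(prompt):
        yield token  # stream tokens to the user as they arrive
```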
4. Vector database and index optimization
The vector database is often the perceived bottleneck. Optimizing it requires balancing recall, latency, and memory footprint.
i) Index selection & memory
For sub-second requirements, HNSW (Hierarchical Navigable Small World) graphs are the standard due to their superior query speed compared to inverted file indexes (IVF). However, HNSW is memory-intensive. To maintain speed, hot index partitions must reside in RAM. Sharding data by time or tenant ensures that active data fits in memory while colder data can be tiered to SSD-based solutions (like DiskANN).
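For reference, a minimal HNSW setup with FAISS; the parameter values (M, efConstruction, efSearch) are illustrative starting points rather than recommendations, and each trades memory or build time against recall and query latency:

```python
import faiss
import numpy as np

d = 768                                   # embedding dimensionality
index = faiss.IndexHNSWFlat(d, 32)        # M=32: graph degree, drives memory footprint
index.hnsw.efConstruction = 200           # build-time quality/speed trade-off
index.hnsw.efSearch = 64                  # query-time recall/latency trade-off

vectors = np.random.rand(10_000, d).astype("float32")  # stand-in for real embeddings
index.add(vectors)

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 10)  # top-10 nearest neighbors
```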
ii) Modern dimensionality reduction (Matryoshka embeddings)
Reducing vector dimensions (e.g., from 768 to 384) via generic methods like PCA often harms accuracy. A superior approach is using embedding models trained with Matryoshka Representation Learning (MRL).
MRL models (such as OpenAI’s text-embedding-3 or recent open-source alternatives) allow embeddings to be explicitly truncated. You can perform an ultra-fast initial search using the first 64 or 128 dimensions, then re-score the top candidates using the full 768 dimensions. This yields significant speedups with minimal loss in recall.
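A minimal NumPy sketch of the truncate-then-rescore pattern; it assumes MRL-style embeddings whose leading dimensions remain meaningful after truncation, and re-normalizes after slicing so cosine similarity stays valid. In production, stage one would be an ANN index built over the truncated vectors rather than a brute-force scan:

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def two_stage_search(query_vec, doc_vecs, coarse_dims=128, shortlist=200, k=10):
    """Coarse search on truncated dimensions, then exact re-scoring on full vectors."""
    shortlist = min(shortlist, doc_vecs.shape[0] - 1)
    # Stage 1: cheap scan over the first `coarse_dims` dimensions only.
    q_small = normalize(query_vec[:coarse_dims])
    d_small = normalize(doc_vecs[:, :coarse_dims])
    shortlist_ids = np.argpartition(-(d_small @ q_small), shortlist)[:shortlist]
    # Stage 2: re-score just the shortlist with the full-dimensional vectors.
    full_scores = normalize(doc_vecs[shortlist_ids]) @ normalize(query_vec)
    order = np.argsort(-full_scores)[:k]
    return shortlist_ids[order], full_scores[order]
```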
5. Managing real-world infrastructure constraints
A fast architecture must remain stable under pressure.
i) Handling traffic spikes
Under heavy load, maintaining P95 latency is more important than perfect answer quality. Implement “degraded modes” that automatically shed expensive load during traffic spikes. For example, if CPU usage hits 85%, the system can temporarily bypass the re-ranker stage or reduce the number of documents retrieved (top-k), sacrificing marginal quality to preserve system responsiveness.
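A minimal sketch of such a policy gate, using `psutil` for CPU sampling; the thresholds and the specific knobs being relaxed (skipping the re-ranker, shrinking top-k) mirror the example above and are assumptions to tune per system:

```python
from dataclasses import dataclass
import psutil

@dataclass
class RetrievalPlan:
    top_k: int = 20
    use_reranker: bool = True

def plan_for_current_load() -> RetrievalPlan:
    cpu = psutil.cpu_percent(interval=None)  # utilization since the last sample
    if cpu >= 95:
        return RetrievalPlan(top_k=5, use_reranker=False)   # aggressive shedding
    if cpu >= 85:
        return RetrievalPlan(top_k=10, use_reranker=False)  # bypass the re-ranker
    return RetrievalPlan()                                   # full-quality path
```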
ii) Request collapsing
In high-concurrency scenarios, multiple users often submit identical queries simultaneously (e.g., during a live event). Implement request collapsing at the API gateway layer. The system should identify duplicate in-flight requests, execute the retrieval once, and broadcast the single result to all waiting connection threads.
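A minimal single-process sketch using asyncio futures (a real deployment would collapse requests at the gateway or via a shared store such as Redis); `run_retrieval` is a hypothetical async callable:

```python
import asyncio

_in_flight: dict[str, asyncio.Future] = {}

async def collapsed_retrieve(cache_key: str, run_retrieval):
    """Ensure identical concurrent requests trigger only one retrieval."""
    if cache_key in _in_flight:
        return await _in_flight[cache_key]       # piggyback on the in-flight call

    future = asyncio.get_running_loop().create_future()
    _in_flight[cache_key] = future
    try:
        result = await run_retrieval()
        future.set_result(result)                # broadcast to all waiters
        return result
    except Exception as exc:
        future.set_exception(exc)                # waiters see the same failure
        raise
    finally:
        _in_flight.pop(cache_key, None)          # next identical query starts fresh
```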
iii) LLM optimization
To accelerate the generation phase without changing the model, employ speculative decoding. This technique uses a smaller, faster “draft” model to propose the next few tokens, which the main “target” model then verifies in a single batched forward pass. This can increase token generation speed by 2x-3x, significantly improving the user-perceived latency of streaming responses.
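For illustration, Hugging Face Transformers ships a closely related technique, assisted generation, behind the `assistant_model` argument of `generate()`; the checkpoints below are placeholders (the draft and target models must share a tokenizer), and the realized speedup depends on how often the draft's guesses are accepted:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: swap in your own target/draft pair with a shared tokenizer.
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
draft = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Summarize the retrieved context:", return_tensors="pt")
# assistant_model enables assisted (speculative-style) decoding: the draft model
# proposes candidate tokens and the target model accepts or rejects them in bulk.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```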
Building a sub-second RAG pipeline at scale is less about discovering a single “magic bullet” component and more about rigorous architectural discipline. It requires shifting focus from raw model performance to a holistic engineering strategy that prioritizes aggressive caching, parallel execution, and infrastructure that can gracefully degrade under pressure.
By treating latency as a non-negotiable budget and implementing the layers of optimization discussed—from Matryoshka embeddings to speculative decoding—engineers can bridge the gap between impressive prototypes and reliable, high-performance production systems. Ultimately, the difference between a fragile demo and a robust product lies in these “boring” engineering details: how well you cache, how efficiently you retrieve, and how resiliently you handle the inevitable spikes in real-world traffic.
