Why Most RAG Tutorials Leave You Stranded in Production
Retrieval-augmented generation works in every demo. It breaks in production. The gap isn’t conceptual — it’s architectural. Most tutorials show you how to embed a few PDFs and run a similarity search. They skip chunking strategy, hybrid retrieval, reranking, and the feedback loops that separate a working system from one that hallucinates half the time.
This guide is different. You’ll build a complete, production-ready RAG pipeline — from raw documents to grounded LLM answers — using Claude Code as your coding agent. By the end, you’ll have a running system with hybrid search, a reranker, evaluation hooks, and a clear path to scaling it. Estimated hands-on time: 30 minutes, assuming you can type fast.
What You’ll Need Before You Start
Before running a single command, make sure you have the following:
- Claude Code installed and authenticated (npm install -g @anthropic-ai/claude-code)
- Python 3.11+ with pip
- Docker (for running Qdrant locally — no account required)
- An Anthropic API key (for Claude as the generator) or an OpenAI key — we’ll use Claude here
- A Cohere API key (free tier is enough) for the reranker
- 10–20 representative documents in PDF or plain text format — your own data, or download the RAG survey paper as a test corpus
This stack deliberately avoids paid managed services. Everything runs locally or on a $30/month VPS. You can swap Qdrant Cloud or Pinecone in later — the interface is the same.
The Five Layers of a Production RAG Pipeline
Before touching code, understand the architecture. Every production RAG system has five distinct layers, and each one has failure modes that don’t show up in toy examples.
Layer 1: Ingestion and Chunking
Your documents need to be split into chunks before embedding. The naive approach — split every 500 tokens — works poorly on structured documents. A 2026 production baseline: 512–1024 tokens per chunk with 20–25% overlap, using a sentence-aware splitter that never cuts mid-sentence. For technical docs with code blocks, use a code-aware splitter that preserves function boundaries.
The overlap is non-negotiable. Without it, a key sentence that straddles a chunk boundary becomes irretrievable. A 20% overlap on a 512-token chunk costs you ~100 tokens of redundancy per chunk — cheap insurance against broken context.
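The mechanics are simple enough to sketch in plain Python. Here is an illustrative, dependency-free sentence-aware splitter that approximates tokens with word counts (the actual pipeline below uses LangChain’s splitter and a real tokenizer); chunk_sentences and its parameters are names invented for this sketch:

```python
import re

def chunk_sentences(text, max_words=400, overlap_words=80):
    """Split text into sentence-aware chunks with overlapping context.

    Word counts stand in for tokens here; a production version would
    measure with the embedding model's own tokenizer.
    """
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > max_words:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward until we have enough overlap,
            # so a sentence near the boundary appears in both chunks.
            carried, carried_len = [], 0
            for prev in reversed(current):
                carried_len += len(prev.split())
                carried.insert(0, prev)
                if carried_len >= overlap_words:
                    break
            current, current_len = carried, carried_len
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that each chunk starts with the tail of the previous one — that repeated tail is exactly the “cheap insurance” the overlap buys.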
Layer 2: Embedding
In 2026, Cohere embed-v4 edges out OpenAI text-embedding-3-large on MTEB benchmarks (65.2 vs 64.6) and adds multimodal support and a 128k token context window — meaning you can embed an entire document as a single chunk if needed. For multilingual corpora, Cohere’s multilingual support across 100+ languages is a practical advantage. OpenAI’s text-embedding-3-small remains the cheapest option at $0.02/million tokens and is perfectly adequate for English-only workloads under 5M vectors.
Layer 3: Vector Store
For most teams, pgvector on PostgreSQL is the right first choice — it’s zero incremental cost if you already run Postgres, handles up to 10 million vectors without tuning, and keeps your data in a system your team already knows how to operate. If you need complex metadata filtering (e.g., filter by department, date range, and document type before vector search), switch to Qdrant: it applies filters before the vector search rather than after, which is both faster and more accurate. Qdrant’s p50 latency sits around 4ms in recent benchmarks. Pinecone is the zero-ops managed option — but at 5M+ vectors, you’re looking at $500–$1,500/month, versus $30–$80 for self-hosted alternatives.
Layer 4: Retrieval
Single-vector similarity search misses too much. The 2026 production standard is hybrid search: dense vector retrieval combined with sparse BM25 keyword matching, then merged with Reciprocal Rank Fusion (RRF). According to benchmarks from DEV Community’s production RAG analysis, 72% of enterprise production RAG systems use hybrid search, hitting 91% recall@10 versus ~78% for dense-only. After retrieval, pass the top-20 candidates through a cross-encoder reranker (Cohere Rerank or a local BGE-reranker-v2) to re-score and return the top-5. This two-stage approach is where the largest single precision improvement comes from.
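RRF itself is only a few lines: each document’s fused score is the sum of 1/(k + rank) over every ranking it appears in, with k = 60 as the conventional constant. A minimal sketch (the function name and top_n default are illustrative):

```python
def rrf_merge(dense_ranking, sparse_ranking, k=60, top_n=5):
    """Merge two ranked lists of document ids with Reciprocal Rank Fusion.

    A document appearing near the top of either ranking gets a high
    fused score; appearing in both rankings compounds the score.
    """
    scores = {}
    for ranking in (dense_ranking, sparse_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because RRF works on ranks rather than raw scores, it merges dense cosine similarities and BM25 scores without any normalization step — that is the whole reason it beats naive score averaging here.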
Layer 5: Generation with Agentic Feedback
The basic pattern — stuff retrieved chunks into a prompt, call the LLM — works for demos. Production systems need an agentic feedback loop: if the retrieval score is low or the LLM signals uncertainty, the agent reformulates the query using HyDE (Hypothetical Document Embeddings) and retrieves again. HyDE generates a hypothetical answer to your question, embeds that, and uses it as the query vector. On knowledge-dense corpora, this improves precision by 20–40%.
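The control flow is worth pinning down before handing it to the agent. A hedged sketch, with the retriever and both LLM calls injected as hypothetical callables rather than real API bindings:

```python
def answer_with_feedback(question, retrieve, generate_hypothetical,
                         generate_answer, score_threshold=0.7):
    """Agentic retrieval loop: retry once via HyDE when confidence is low.

    `retrieve`, `generate_hypothetical`, and `generate_answer` are
    stand-in callables for the retriever and the LLM calls.
    """
    chunks = retrieve(question)  # list of (text, source, score) tuples
    best = max((score for _, _, score in chunks), default=0.0)
    if best < score_threshold:
        # HyDE: generate a hypothetical answer and retrieve with it,
        # so the query vector speaks the documents' vocabulary.
        hypothetical = generate_hypothetical(question)
        chunks = retrieve(hypothetical)
    context = "\n\n".join(text for text, _, _ in chunks)
    return generate_answer(question, context)
```

The single-retry cap matters: unbounded reformulation loops are a common way to burn tokens on unanswerable questions.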
Building It: Step-by-Step with Claude Code
Open your terminal in a fresh project directory and launch Claude Code: claude. You’re now working with an agent that will write, run, and debug your code. Here’s the sequence of prompts to feed it.
Step 1: Scaffold the Project
Prompt Claude Code: “Create a Python project for a production RAG pipeline. Use poetry for dependency management. Dependencies: qdrant-client, langchain-community, langchain-anthropic, sentence-transformers, cohere, pypdf, python-dotenv. Create a .env.example with ANTHROPIC_API_KEY, COHERE_API_KEY placeholders. Add a README with setup instructions.”
What you should see: a pyproject.toml, a src/ directory, and a .env.example. Run poetry install to confirm no dependency errors before proceeding.
Step 2: Start Qdrant Locally
Run: docker run -p 6333:6333 qdrant/qdrant. Verify it’s alive: curl http://localhost:6333/healthz should return a plain-text health confirmation. This is your vector store.
Step 3: Build the Ingestion Pipeline
Prompt Claude Code: “Write src/ingest.py. It should: load PDF and .txt files from a ./documents/ directory, split with RecursiveCharacterTextSplitter at 768 tokens with 20% overlap, embed using Cohere embed-v4 (input_type='search_document'), and upsert into Qdrant collection named 'production_rag' with payload fields: source filename, chunk index, and character offset. Log progress. Handle errors gracefully.”
What you should see: a complete ingestion script. Drop a few test documents into ./documents/ and run python src/ingest.py. Verify in Qdrant’s dashboard at http://localhost:6333/dashboard that vectors are indexed.
Step 4: Build the Hybrid Retriever
Prompt: “Write src/retriever.py. Implement hybrid search: dense retrieval from Qdrant (top 20 candidates) combined with BM25 keyword search over chunk text. Use Reciprocal Rank Fusion to merge results. Then re-rank the merged top-20 using Cohere Rerank API, return top-5. Accept a query string, return a list of (chunk_text, source, score) tuples.”
What you should see: a retriever that calls both Qdrant and a BM25 index (built in-memory at startup from stored payloads). Test it: python -c "from src.retriever import retrieve; print(retrieve('your test question here'))". You should get 5 ranked chunks with source citations.
Step 5: Add the Agentic Generation Loop
Prompt: “Write src/agent.py. It should: call the retriever, check if the max retrieval score is above 0.7 — if not, use HyDE to reformulate the query (ask Claude to write a hypothetical answer, embed that answer, retry retrieval once), then call Claude (model claude-sonnet-4-6) with the retrieved context and question in a structured prompt that asks for an answer with inline citations. Stream the response. Log which chunks were used.”
This is where Claude Code earns its place — the HyDE reformulation and citation logic involves non-trivial prompt engineering that the agent handles cleanly. As noted in our earlier analysis of Claude Code’s rise, the tool’s strength is exactly this kind of multi-file, interconnected code generation.
Step 6: Add Evaluation Hooks
Prompt: “Write src/eval.py. For a set of (question, expected_answer) pairs in eval_set.json, run the full pipeline and compute: context recall (did retrieved chunks contain the answer?), answer faithfulness (does the answer cite only retrieved content?), and answer relevancy (does the answer address the question?). Output a score table and flag any questions with context recall below 0.6.”
What you should see: a file you can run weekly to catch retrieval regressions. This is the step most RAG tutorials skip. Without it, you won’t notice when document updates silently break retrieval quality.
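Context recall, the most load-bearing of the three metrics, can be approximated without any framework. A rough sketch using case-insensitive substring matching as the hit test (the function and field names are illustrative; evaluation frameworks such as RAGAS use an LLM judge instead of substring matching):

```python
def context_recall(eval_set, retrieve):
    """Score each eval question by whether retrieval surfaced the answer.

    `eval_set` is a list of {"question", "expected_answer"} dicts and
    `retrieve` returns a list of chunk strings — both stand-ins for the
    real pipeline's interfaces.
    """
    results = []
    for item in eval_set:
        chunks = retrieve(item["question"])
        hit = any(item["expected_answer"].lower() in c.lower() for c in chunks)
        results.append({"question": item["question"],
                        "recall": 1.0 if hit else 0.0})
    # Flag questions below the 0.6 threshold for manual inspection.
    flagged = [r["question"] for r in results if r["recall"] < 0.6]
    return results, flagged
```

Crude as it is, a substring-based recall check run weekly will catch the most damaging regression class: a document update that silently drops the chunk containing a known answer.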
Failure Modes to Watch For
Production RAG systems fail in predictable ways. The most common: chunk boundary hallucinations — the answer spans two adjacent chunks but only one was retrieved. Fix: increase overlap to 25% or use a parent-document retriever that returns the full section containing a matching chunk.
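The parent-document fix is mostly bookkeeping at ingestion time: store each chunk’s section id in the payload, then expand matches back to full sections at query time. A sketch with illustrative data structures standing in for the stored payloads:

```python
def expand_to_parents(retrieved_chunk_ids, chunk_to_parent, parent_sections):
    """Parent-document retrieval: swap matched chunks for full sections.

    `chunk_to_parent` maps chunk id -> section id and `parent_sections`
    maps section id -> full section text; both stand in for payload
    fields recorded during ingestion.
    """
    seen, sections = set(), []
    for chunk_id in retrieved_chunk_ids:
        parent_id = chunk_to_parent[chunk_id]
        if parent_id not in seen:  # sibling chunks share a parent; dedupe
            seen.add(parent_id)
            sections.append(parent_sections[parent_id])
    return sections
```

The deduplication step matters: when two adjacent chunks from the same section both match, you want the section once, not twice.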
Second most common: query-document vocabulary mismatch — the user says “cost” but your docs say “expenditure”. This is why BM25 alone fails and why HyDE helps: a hypothetical answer uses the document’s vocabulary, not the user’s paraphrase.
Third: stale embeddings. When you update a document, you must re-ingest and re-embed that document’s chunks. A common production pattern is to store a content hash in the Qdrant payload and skip re-ingestion if the hash matches — reducing incremental ingestion time by 80–90% on large corpora.
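The hash check is a few lines with hashlib; the dict of stored hashes below stands in for values read back from Qdrant payloads, and the function name is invented for this sketch:

```python
import hashlib

def needs_reingestion(doc_text, stored_hashes, doc_id):
    """Decide whether a document must be re-embedded.

    Returns (changed, digest): skip the expensive embed + upsert when
    the stored content hash matches, otherwise re-ingest and store the
    new digest alongside the vectors.
    """
    digest = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()
    if stored_hashes.get(doc_id) == digest:
        return False, digest  # unchanged: skip
    return True, digest       # new or modified: re-embed
```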
For teams deploying longer-running AI agents, RAG is often the bottleneck — not the generation step. Optimizing retrieval latency matters more than shaving milliseconds off the LLM call.
Choosing Your Vector Database for Scale
The decision tree is straightforward. Under 5 million vectors and you already run PostgreSQL: use pgvector — it costs nothing extra and your ops team knows it. Between 5M and 100M vectors with complex metadata filtering: Qdrant self-hosted ($30–$50/month on an 8GB RAM VPS). Over 100M vectors with a team that doesn’t want to manage infrastructure: Pinecone or Qdrant Cloud — but budget accordingly. Here’s a quick reference:
| Database | Best for | p50 Latency | Monthly cost (5M vectors) | Metadata filtering |
|---|---|---|---|---|
| pgvector | Existing Postgres stacks, <10M vectors | ~15ms | $0 incremental | Good |
| Qdrant (self-hosted) | Complex filters, performance-critical | 4ms | $30–50 | Excellent (pre-filter) |
| Qdrant Cloud | Managed, mid-scale | 4–8ms | $100–300 | Excellent |
| Pinecone | Zero-ops, billion-scale | 10–30ms (serverless) | $70–300+ | Good (post-filter) |
What to Build Next
With the base pipeline running, the next three improvements give the biggest return. First: document-level metadata filtering — if users work in specific departments or time ranges, filtering before vector search cuts both latency and irrelevant results. Second: query caching — 20–30% of production queries are repeated or near-identical; caching embeddings and results at the query layer cuts API costs meaningfully. Third: streaming responses with source attribution UI — users trust answers more when they can see which document they came from, and streaming keeps the interface responsive on longer answers.
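The caching layer can start as an in-process dict keyed on a normalized query hash, so trivial variants (case, stray whitespace) hit the same entry. A minimal sketch (make_query_cache is an invented name; a production version would add TTLs and a shared store such as Redis):

```python
import hashlib

def make_query_cache():
    """Return a caching wrapper for expensive query-time computations.

    `compute` is whatever expensive call you want cached — the
    embedding request, or the full retrieve-and-generate pipeline.
    """
    cache = {}
    def cached(query, compute):
        # Normalize before hashing so near-identical queries share a key.
        key = hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = compute(query)
        return cache[key]
    return cached
```

With 20–30% of production queries repeated, even this naive version pays for itself on the first day.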
The architecture here also maps directly to more complex multi-agent patterns: each specialized agent in a network can have its own RAG retriever scoped to its domain, with a router agent selecting which retriever to query first. That’s where production agentic systems are heading in 2026.
Further Reading
- How to Build a Production RAG Pipeline in 2026 (RoboRhythms) — good deep-dive on the five architectural layers with concrete benchmark numbers
- LangChain vs LlamaIndex 2026: Complete Production RAG Comparison — covers the framework decision in depth; useful if you’re choosing your orchestration layer
- Vector Database Comparison 2026 (4xxi) — the most thorough head-to-head of pgvector, Qdrant, Pinecone, LanceDB, and ChromaDB for production RAG workloads