RAG: Retrieval-Augmented Generation from Scratch

You've built AI agents. You've deployed LangChain pipelines. But here's the honest truth: most production RAG systems fail not because the LLM is bad, but because the retrieval pipeline is broken. Documents get chunked wrong. Embeddings drift. Similarity scores mislead. And when you stack a fancy language model on top of garbage retrieval, you get garbage answers faster.
This article walks you through building RAG from scratch, not with abstractions, but with primitives. You'll implement text chunking, embedding, vector storage, and similarity search yourself. Then we'll show you what LangChain buys you (and what it doesn't). By the end, you'll know exactly where the failures happen and how to fix them.
Table of Contents
- Why You Need to Build This Yourself First
- The RAG Problem: Why LLMs Alone Aren't Enough
- Why RAG Over Fine-Tuning?
- Architecture: The Five Layers
- Embedding Space Intuition
- Layer 1: Loading and Chunking Documents
- Fixed-Size Chunking
- Sentence-Level Chunking
- Recursive Character Chunking
- Chunking Strategies: When to Use What
- Layer 2: Embedding Models
- Closed-Source: OpenAI's text-embedding-3-small
- Open-Source: Sentence-Transformers
- Layer 3: Vector Storage and Retrieval
- In-Memory: FAISS
- Persistent: Chroma
- Layer 4: Retrieval Quality
- Metric 1: Precision@K
- Metric 2: Mean Reciprocal Rank (MRR)
- Building a Ground-Truth Eval Set
- Layer 5: Generation
- Putting It All Together: End-to-End RAG
- Common RAG Mistakes
- Common Failure Modes (And How to Fix Them)
- Problem 1: Search Returns Irrelevant Results
- Problem 2: LLM Still Hallucinates Despite Good Retrieval
- Problem 3: Slow Search on Large Datasets
- Advanced RAG: Beyond Basic Retrieval
- Technique 1: Hypothetical Document Embeddings (HyDE)
- Technique 2: Multi-Query Retrieval
- Technique 3: Reranking with a Cross-Encoder
- Comparing to LangChain
- Key Takeaways
- Quick Reference: RAG Checklist
- Summary
Why You Need to Build This Yourself First
Before we dive into code, let's talk about why building RAG from scratch matters even when frameworks like LangChain exist. When your retrieval quality is poor and your answers are wrong, you need to know exactly which layer is failing. Is it your chunk boundaries? Your embedding model? Your similarity metric? Your prompt structure? If you only ever use abstractions, every failure looks the same: "the AI gave a bad answer." When you understand each layer, you can pinpoint the problem in minutes instead of days.
There's also a second reason: performance. Frameworks add overhead. Every abstraction has a cost. In production, where you're serving thousands of queries per hour and every 50ms of latency matters, knowing how to bypass the abstraction layer and talk directly to FAISS or manipulate embeddings directly can mean the difference between a responsive product and a sluggish one. The knowledge compounds: once you understand what the primitives do, you can use frameworks intelligently instead of blindly.
Finally, building from scratch forces you to make decisions. What chunk size works best for your documents? Which embedding model is fast enough for your latency requirements? What similarity threshold separates "relevant" from "noise"? Frameworks make default choices for you, and those defaults are often wrong for your specific use case. By building from scratch first, you develop the intuition to know when to override defaults and what to change. That intuition is worth far more than any abstraction.
The RAG Problem: Why LLMs Alone Aren't Enough
Let's set the scene. You have an LLM. It's smart. It's trained on the internet up to April 2024. But your company has:
- Internal documentation updated weekly
- Customer-specific knowledge bases
- Proprietary research papers
- Regulatory compliance docs
Your LLM has no access to any of that. It'll hallucinate. Confidently. So what do we do?
We give the LLM the documents it needs before asking it to answer. That's RAG: Retrieval-Augmented Generation.
The pipeline looks like this:
- Load documents from files, databases, APIs
- Chunk them into manageable pieces
- Embed each chunk into a vector space
- Store vectors in a searchable database
- Retrieve relevant chunks when a user asks a question
- Generate an answer using the LLM + retrieved context
Simple in theory. Nightmarish in practice. Let's build it.
Why RAG Over Fine-Tuning?
This is the question every ML engineer asks when they first encounter RAG. You have domain-specific documents. You want your LLM to know about them. Why not just fine-tune the model on your data and be done with it?
The answer comes down to four practical realities. First, knowledge currency: your documents change. Regulations get updated, products ship new features, internal policies evolve. Fine-tuning is a batch process: you retrain, you redeploy. With RAG, you update your vector store and the change is live immediately. Second, cost: fine-tuning a 7B parameter model costs hundreds of dollars per run, and a 70B model costs thousands. A RAG update is just re-embedding new documents, which costs pennies. When your knowledge base changes weekly, those costs compound fast.
Third, interpretability: RAG systems can show you which chunks they retrieved to generate an answer. That's debuggable. That's auditable. In regulated industries, being able to say "the model answered this because it found these three specific paragraphs in your compliance manual" is not a nice-to-have, it's a requirement. A fine-tuned model that has absorbed knowledge into its weights cannot tell you why it believes what it believes. Fourth, overfitting risk: fine-tuning on a small domain-specific corpus tends to degrade general capability. Your model gets better at answering questions about your product and worse at everything else. RAG sidesteps this entirely, your base model stays unchanged, and you layer domain knowledge on top at inference time. Fine-tuning does have its place: when you need to change the model's behavior, tone, or output format. But for grounding responses in specific facts and documents, RAG wins almost every time.
Architecture: The Five Layers
Before we code, understand the layers:
User Query
↓
[Embedding Layer] → Convert query to vector
↓
[Retrieval Layer] → Find similar chunks in vector store
↓
[Reranking Layer] (optional) → Filter & rank results
↓
[Context Assembly] → Build prompt with top-k chunks
↓
[LLM Layer] → Generate answer with context
↓
Final Response
Each layer has failure modes. Each layer has knobs you'll twist. Let's implement each one.
Embedding Space Intuition
Before you can reason about why retrieval succeeds or fails, you need an intuitive understanding of what embeddings actually are. An embedding is a function that maps text to a point in high-dimensional space, typically 384, 768, or 1536 dimensions. The magic is in the geometry: semantically similar text ends up at nearby points, and semantically different text ends up far apart.
Think of it this way. "The dog ran through the park" and "A puppy sprinted across the garden" will land very close together in embedding space, even though they share almost no words. Meanwhile, "The dog ran through the park" and "Quarterly revenue increased by 12%" will land in completely different regions. The embedding model has learned, from billions of training examples, what topics and concepts cluster together in human language.
This has immediate practical consequences for your RAG system. Your query embedding needs to land near the document chunks that answer it. If your query is "What's the refund policy?" and your document chunk says "Returns are accepted within 30 days for a full refund," those should be near each other in embedding space. And for a good general-purpose model like all-MiniLM-L6-v2, they will be. But here's where it gets subtle: the model's notion of "similar" is based on its training data. A model trained on general web text may not understand that "FedWire instructions" and "ACH routing" are similar concepts in banking, even though a domain expert knows they're closely related. That's why domain-specific embedding models exist, and why you should evaluate your embedding model on your actual data before committing to it. The geometry of embedding space is not neutral, it reflects the biases and knowledge of the training corpus.
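To make the geometry concrete, here's a minimal sketch using hand-picked toy 3-dimensional vectors (real embedding models produce hundreds of dimensions; these vectors are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings": the first two point in similar directions,
# the third points somewhere else entirely.
dog_park = [0.9, 0.4, 0.1]      # "The dog ran through the park"
puppy_garden = [0.8, 0.5, 0.2]  # "A puppy sprinted across the garden"
revenue = [0.1, 0.2, 0.9]       # "Quarterly revenue increased by 12%"

print(cosine_similarity(dog_park, puppy_garden))  # high: nearby in embedding space
print(cosine_similarity(dog_park, revenue))       # low: a different region entirely
```

A real embedding model does exactly this comparison, just with vectors it learned from training data rather than ones we invented.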
Layer 1: Loading and Chunking Documents
You can't embed a 50-page PDF. You need chunks. But chunks that are too small lose context. Chunks that are too big become noise. Let's implement three chunking strategies.
Fixed-Size Chunking
This is the simplest (and often the worst). It works by slicing the document into pieces of exactly N characters, with an optional overlap between adjacent chunks so you don't lose context at boundaries. The overlap is critical: without it, a sentence that straddles a chunk boundary gets split in half, and neither half makes sense on its own.
def chunk_fixed_size(text, chunk_size=512, overlap=50):
    """
    Split text into fixed-size chunks with overlap.

    Args:
        text: Full document text
        chunk_size: Characters per chunk
        overlap: Characters to overlap between chunks

    Returns:
        List of text chunks
    """
    chunks = []
    stride = chunk_size - overlap
    for i in range(0, len(text), stride):
        chunk = text[i:i + chunk_size]
        if chunk.strip():  # Skip empty chunks
            chunks.append(chunk)
    return chunks

# Example
doc = "Natural Language Processing is a subfield of linguistics..."
chunks = chunk_fixed_size(doc, chunk_size=256, overlap=32)
print(f"Created {len(chunks)} chunks")

The overlap parameter is doing important work here. With chunk_size=512 and overlap=50, your stride is 462 characters, meaning each chunk shares 50 characters with the next one. That 50-character overlap ensures that sentences near chunk boundaries appear in at least one complete chunk. Tune it up when your sentences are long, down when your documents are tight.
Why this is mediocre: Fixed-size chunks don't respect sentence or paragraph boundaries. You'll split mid-thought and create noise. But it's fast and deterministic, good for prototyping.
Sentence-Level Chunking
Smarter: break at sentence boundaries. This approach uses NLTK's sentence tokenizer to find natural stopping points in the text, then groups those sentences together until you hit a size limit. The result is chunks that always contain complete thoughts, which tends to produce more meaningful embeddings.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # Run once

def chunk_by_sentence(text, max_chunk_size=512):
    """
    Group sentences into chunks until chunk size is exceeded.

    Args:
        text: Full document text
        max_chunk_size: Maximum characters per chunk

    Returns:
        List of text chunks
    """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        # If adding this sentence would exceed the limit, flush the current chunk
        if len(current_chunk) + len(sentence) > max_chunk_size:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            current_chunk = sentence
        else:
            current_chunk += " " + sentence
    # Don't forget the last chunk
    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

# Example
chunks = chunk_by_sentence(doc, max_chunk_size=256)
for i, chunk in enumerate(chunks[:2]):
    print(f"Chunk {i}: {chunk[:80]}...")

Notice the trailing-chunk logic at the end of the function. It's easy to forget that the last batch of sentences never triggers the size-overflow condition; you have to flush it manually. This is a common bug in naive implementations, and it means the last paragraph of your document silently disappears from your index.
Why this is better: You respect semantic boundaries. Each chunk (mostly) contains complete thoughts. Cosine similarity will be more meaningful.
Recursive Character Chunking
The gold standard for most documents. The insight here is that documents have natural structure at multiple levels: paragraphs, sentences, clauses, words. This function respects that hierarchy by trying the broadest separator first (double newline = paragraph break), then falling back to narrower ones if the resulting pieces are still too large.
def chunk_recursive(
    text,
    chunk_size=512,
    overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
):
    """
    Recursively split text using different separators.
    Tries to preserve structure: paragraphs → sentences → words → chars.

    Args:
        text: Full document text
        chunk_size: Target characters per chunk
        overlap: Characters to overlap between chunks
        separators: List of separators to try in order

    Returns:
        List of text chunks
    """
    def _split(text, separators):
        # Base case: no separators left, return the text as-is
        if not separators:
            return [text]
        separator = separators[0]
        remaining = separators[1:]
        # Try splitting with this separator ("" means character-level split)
        splits = text.split(separator) if separator else list(text)
        good_splits = []
        for split in splits:
            if len(split) <= chunk_size:
                if split:
                    good_splits.append(split)
            else:
                # This split is still too large: recurse with narrower separators
                good_splits.extend(_split(split, remaining))
        return good_splits

    splits = _split(text, separators)
    # Apply fixed-size windowing with overlap to each split
    chunks = []
    stride = chunk_size - overlap
    for split in splits:
        for i in range(0, len(split), stride):
            chunk = split[i:i + chunk_size]
            if chunk.strip():
                chunks.append(chunk.strip())
    return chunks

# Example
chunks = chunk_recursive(doc, chunk_size=256, overlap=32)
print(f"Recursive chunking: {len(chunks)} chunks")

The separator list ["\n\n", "\n", ". ", " ", ""] is doing heavy lifting here. For a well-formatted technical document, most splits will happen at paragraph breaks (\n\n), preserving entire paragraphs as single chunks. Only when a paragraph exceeds chunk_size does the function fall back to sentence breaks, then word breaks, then raw characters. This graceful degradation means you get the best possible structure at every level.
Why this works: You try paragraph breaks first, then sentences, then words, then characters. You preserve structure as much as possible while hitting your size targets. This is what LangChain uses under the hood.
Critical detail: All three methods create chunks. But which is right depends on your data. Legal documents? Recursive character. Twitter threads? Sentence. Chat logs? Fixed-size with large overlap. You need to experiment.
Chunking Strategies: When to Use What
The three chunking methods above cover most use cases, but the decision of which to use is more nuanced than it first appears. The right strategy depends on your document structure, query patterns, and latency requirements, not on which sounds most sophisticated.
Fixed-size chunking is the right choice when your documents have no meaningful structure: raw crawled web text, log files, transcripts without speaker labels, or any content where paragraph and sentence boundaries aren't reliable. It's also the right choice when you're prototyping and need something deterministic to test with. The overlap parameter does most of the heavy lifting for you, and the simplicity makes debugging easy.
Sentence-level chunking shines when your content is dense with specific facts: FAQ documents, product specs, legal clauses, medical literature. Each sentence typically carries a complete assertion, and you want each chunk to contain exactly one or two clean assertions so your embeddings are focused. The risk is chunk size variance: some sentences are five words, some are sixty. Wide variance means your vector store contains a mix of highly specific narrow chunks and broad context-heavy chunks, and your retrieval results will be inconsistent.
Recursive character chunking is the workhorse for general-purpose document processing: PDFs, markdown files, HTML articles, internal wikis. Most documents are organized into paragraphs for a reason, each paragraph develops a single idea, and this method respects that. The one situation where it fails is heavily nested content like code files or XML, where the structure is hierarchical rather than linear. For those, consider building a structure-aware chunker that understands the document's schema. Experiment with chunk sizes between 256 and 1024 characters before committing. A quick precision@5 measurement on a small labeled set will tell you more than any rule of thumb.
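One way to run that chunk-size sweep is the sketch below. It re-implements the fixed-size chunker from Layer 1 so the snippet stands alone, and reports chunk counts and average length per setting; in a real sweep you'd report precision@5 on your labeled set instead of raw counts:

```python
def chunk_fixed_size(text, chunk_size=512, overlap=50):
    # Same fixed-size chunker as in Layer 1
    chunks, stride = [], chunk_size - overlap
    for i in range(0, len(text), stride):
        chunk = text[i:i + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks

doc = "Lorem ipsum dolor sit amet. " * 200  # stand-in for a real document

# Sweep candidate chunk sizes, keeping overlap proportional (10% of chunk size)
for size in (256, 512, 1024):
    chunks = chunk_fixed_size(doc, chunk_size=size, overlap=size // 10)
    avg_len = sum(len(c) for c in chunks) / len(chunks)
    print(f"chunk_size={size}: {len(chunks)} chunks, avg {avg_len:.0f} chars")
```

Smaller chunks mean more vectors to store and search but more focused embeddings; the sweep makes that trade-off visible before you commit.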
Layer 2: Embedding Models
Now you have chunks. Convert them to vectors. There are two main paths: closed-source and open-source.
Closed-Source: OpenAI's text-embedding-3-small
Fast, good quality, costs money. The API is straightforward: you send a list of strings, you get back a list of floating-point vectors. The key thing to understand is that the text-embedding-3-small model produces 1536-dimensional vectors by default, but you can request smaller dimensions (like 384) to save storage and speed up search, at a small quality cost.
from openai import OpenAI

client = OpenAI(api_key="sk-...")

def embed_with_openai(texts):
    """
    Embed a list of texts using OpenAI's embedding API.

    Args:
        texts: List of strings to embed

    Returns:
        List of embedding vectors (1536-dim by default for text-embedding-3-small;
        pass dimensions=384 in the API call to get smaller vectors)
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    # Extract embeddings from response (order matches the input order)
    embeddings = [item.embedding for item in response.data]
    return embeddings

# Example
chunks = ["Machine learning is...", "Deep learning is..."]
embeddings = embed_with_openai(chunks)
print(f"Embedded {len(embeddings)} chunks")
print(f"Embedding shape: {len(embeddings[0])}")  # 1536 dims

Batch your API calls aggressively. OpenAI's API accepts up to 2048 inputs per request, and the per-request overhead is significant. If you send 10,000 chunks as individual requests, you'll wait a long time and burn through more of your rate limit than necessary. Send them in batches of 500 to 1000, and you'll see embedding throughput improve dramatically.
Cost: ~$0.02 per 1M tokens. A 1M-token knowledge base (roughly 8k chunks at 512 characters each) costs about $0.02 to embed; even a 1B-token corpus is only about $20. Cheap.
Latency: API round-trip. Slow for large batches unless you batch carefully.
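The batching loop itself is simple enough to sketch. Here the embedding call is passed in as a parameter (e.g. the embed_with_openai function above) so the helper stays provider-agnostic; the batch_size of 500 follows the advice above and is well under the 2048-input API limit:

```python
def embed_in_batches(texts, embed_fn, batch_size=500):
    """
    Embed texts in batches to amortize per-request overhead.

    embed_fn: any function mapping a list of strings to a list of vectors
              (for example, the embed_with_openai function above).
    """
    all_embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        all_embeddings.extend(embed_fn(batch))
    return all_embeddings

# Usage (with the OpenAI embedder from above):
# embeddings = embed_in_batches(chunks, embed_with_openai, batch_size=500)
```

Because embed_fn is injected, the same helper works unchanged with a local sentence-transformers model or a stub in tests.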
Open-Source: Sentence-Transformers
Free, good quality, runs locally. The sentence-transformers library wraps dozens of pre-trained models with a consistent interface. For general English text, all-MiniLM-L6-v2 is the go-to: it's 22MB, produces 384-dimensional vectors, and runs at roughly 1000 sentences per second on a modern CPU. For multilingual content, use paraphrase-multilingual-MiniLM-L12-v2. For academic or scientific content, allenai-specter was trained on scientific paper abstracts and will outperform general-purpose models significantly.
from sentence_transformers import SentenceTransformer

# Download and cache the model locally
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dims, 22MB

def embed_with_sentence_transformers(texts):
    """
    Embed texts using open-source sentence transformers.

    Args:
        texts: List of strings to embed

    Returns:
        numpy array of embedding vectors (384-dim for all-MiniLM-L6-v2)
    """
    embeddings = model.encode(texts, convert_to_numpy=True)
    return embeddings

# Example
chunks = ["Machine learning is...", "Deep learning is..."]
embeddings = embed_with_sentence_transformers(chunks)
print(f"Embedded {len(embeddings)} chunks")
print(f"Embedding shape: {embeddings[0].shape}")  # (384,)

The convert_to_numpy=True parameter is important here. By default, model.encode() returns a PyTorch tensor, which is not directly compatible with FAISS. Converting to numpy immediately keeps your code clean and avoids silent type errors downstream. If you're running on a machine with a GPU, also pass device='cuda' to the SentenceTransformer constructor to see 10-50x speedups on large batches.
Cost: Free (compute cost is on you).
Latency: Fast. All-MiniLM-L6-v2 embeds ~1000 chunks/sec on CPU.
Quality: Slightly lower than OpenAI's, but good enough for 90% of use cases.
When to pick which:
- OpenAI: Need the absolute best quality + don't want to manage compute
- Sentence-Transformers: Need it fast, cheap, and under your control
For this article, we'll use sentence-transformers (it's open-source and runs offline).
Layer 3: Vector Storage and Retrieval
You have embeddings. Now you need to store and search them fast. Two main options: in-memory and persistent.
In-Memory: FAISS
Facebook's AI Similarity Search. Lightning fast, no server needed. The key conceptual point is that FAISS is not a database, it's an index. It doesn't store your text, your metadata, or any payload. It stores vectors and returns indices. You're responsible for mapping those indices back to your actual chunks. Keep your chunks list in sync with your FAISS index, and you'll have no trouble.
import numpy as np
import faiss

def build_faiss_index(embeddings, use_gpu=False):
    """
    Build a FAISS index from embeddings.

    Args:
        embeddings: numpy array of shape (n_chunks, embedding_dim)
        use_gpu: Whether to use GPU (requires FAISS GPU build)

    Returns:
        Tuple of (faiss_index, embeddings_array)
    """
    embeddings = np.array(embeddings).astype('float32')
    # Normalize embeddings so inner product equals cosine similarity
    faiss.normalize_L2(embeddings)
    # Create index: flat (no compression, brute-force search)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # IP = inner product; on normalized vectors this is cosine
    # Add vectors
    index.add(embeddings)
    return index, embeddings

def search_faiss(index, query_embedding, k=5):
    """
    Search FAISS index for top-k similar vectors.

    Args:
        index: FAISS index object
        query_embedding: Single embedding vector (1, embedding_dim)
        k: Number of results to return

    Returns:
        Tuple of (distances, indices)
    """
    query_embedding = np.array([query_embedding]).astype('float32')
    faiss.normalize_L2(query_embedding)
    distances, indices = index.search(query_embedding, k)
    return distances[0], indices[0]

# Example: Build index from our chunks
chunks = ["Machine learning...", "Deep learning...", "Neural networks..."]
embeddings = embed_with_sentence_transformers(chunks)
index, _ = build_faiss_index(embeddings)

# Search
query = "What is machine learning?"
query_embedding = embed_with_sentence_transformers([query])[0]
distances, indices = search_faiss(index, query_embedding, k=2)
print("Top results:")
for dist, idx in zip(distances, indices):
    print(f" - {chunks[idx][:50]}... (similarity: {dist:.4f})")

The faiss.normalize_L2() call is doing something critical: it rescales every vector to unit length, and on unit vectors the inner product of two vectors equals their cosine similarity. IndexFlatIP computes inner products (IP = inner product), so normalizing + inner product gives you cosine similarity. Always normalize unless you have a specific reason not to.
Why FAISS: Sub-millisecond searches on millions of vectors. Production-grade. Used by Google, Meta, Microsoft. No external service needed.
Tradeoff: Ephemeral. Index lives in memory. Lose it on restart. Good for demos and in-process systems.
Persistent: Chroma
SQLite-backed vector store. Survives restarts. Chroma handles both the vector storage and the document storage together, which makes it much more convenient for building complete applications. You store chunks and embeddings in one call, and you get chunks back on retrieval without maintaining a separate mapping.
import chromadb

def build_chroma_index(chunks, embeddings, collection_name="documents"):
    """
    Build a Chroma collection (persistent vector store).

    Args:
        chunks: List of text chunks
        embeddings: List of embedding vectors
        collection_name: Name of the collection

    Returns:
        Chroma collection object
    """
    # PersistentClient writes to disk; the default chromadb.Client()
    # is in-memory only and would NOT survive restarts
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # Use cosine similarity
    )
    # Add documents with embeddings
    collection.add(
        ids=[f"chunk_{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": f"doc_{i}"} for i in range(len(chunks))]
    )
    return collection

def search_chroma(collection, query_embedding, k=5):
    """
    Search Chroma collection.

    Args:
        collection: Chroma collection object
        query_embedding: Single embedding vector
        k: Number of results

    Returns:
        Dict with documents, distances, metadatas
    """
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    return results

# Example
chunks = ["Machine learning...", "Deep learning...", "Neural networks..."]
embeddings = embed_with_sentence_transformers(chunks).tolist()  # plain lists for portability
collection = build_chroma_index(chunks, embeddings)

# Search
query = "What is machine learning?"
query_embedding = embed_with_sentence_transformers([query])[0].tolist()
results = search_chroma(collection, query_embedding, k=2)
print("Top results:")
for doc, distance in zip(results['documents'][0], results['distances'][0]):
    print(f" - {doc[:50]}... (distance: {distance:.4f})")

The metadata field is underused by most developers but invaluable in production. Store the source document name, the page number, the creation date, the author, anything that lets you trace a retrieved chunk back to its origin. When your system gives a wrong answer, that provenance metadata is how you diagnose whether the problem is a bad chunk, a stale document, or a retrieval ranking issue.
Why Chroma: Easy persistent storage. Built-in similarity search. SQLite backend (zero setup). Good for applications that need to survive restarts.
Tradeoff: Slower than FAISS on large datasets. But still milliseconds.
Choose between them:
- FAISS: Production at scale (millions of vectors), need blazing speed, don't mind in-memory
- Chroma: Smaller datasets (<100k vectors), need persistence, prefer simplicity
We'll use FAISS for the rest of this article.
Layer 4: Retrieval Quality
You've built the pipeline. Now, does it work?
This is where most RAG systems fail. Bad chunking + bad embeddings = bad retrieval. And bad retrieval makes any LLM look stupid.
Metric 1: Precision@K
Simple question: Of your top-k results, how many are actually relevant? This metric forces you to build a labeled evaluation set, a list of (query, correct_chunk_id) pairs, which feels like extra work but is genuinely irreplaceable. Without it, you're flying blind when you change your chunking strategy or swap embedding models.
def precision_at_k(retrieved_indices, relevant_indices, k=5):
    """
    Calculate precision@k: fraction of top-k results that are relevant.

    Args:
        retrieved_indices: List of indices returned by search (sorted by relevance)
        relevant_indices: Set of indices that are actually relevant
        k: Evaluate top-k results

    Returns:
        Precision@k as float (0 to 1)
    """
    top_k = retrieved_indices[:k]
    relevant_in_top_k = sum(1 for idx in top_k if idx in relevant_indices)
    return relevant_in_top_k / k

# Example
retrieved = [0, 2, 1, 5, 3]  # Search result indices
relevant = {0, 1}  # Ground-truth relevant indices
precision_at_5 = precision_at_k(retrieved, relevant, k=5)
print(f"Precision@5: {precision_at_5:.2f}")  # 0.40 (2 of 5 are relevant)

A precision@5 of 0.40 means two of your top five results are actually relevant. That sounds low, but it depends on how many relevant chunks exist in your corpus. If only two chunks in your entire 10,000-chunk corpus are relevant to this query, then precision@5 = 0.40 is actually perfect retrieval. Context matters: always compare your precision@k to a random baseline (relevant chunks / total chunks) before drawing conclusions.
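That baseline comparison takes three lines to compute. A sketch, re-defining precision_at_k so the snippet stands alone:

```python
def precision_at_k(retrieved_indices, relevant_indices, k=5):
    top_k = retrieved_indices[:k]
    return sum(1 for idx in top_k if idx in relevant_indices) / k

total_chunks = 10_000
relevant = {0, 1}  # only 2 relevant chunks exist in the whole corpus
random_baseline = len(relevant) / total_chunks  # expected precision of random guessing

observed = precision_at_k([0, 2, 1, 5, 3], relevant, k=5)
print(f"observed precision@5: {observed:.2f}")         # 0.40
print(f"random baseline:      {random_baseline:.4f}")  # 0.0002
```

An observed 0.40 against a 0.0002 baseline is a 2000x lift over random, which is the more honest way to read the number.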
Metric 2: Mean Reciprocal Rank (MRR)
Where's the first relevant result? MRR penalizes systems that bury the most relevant result at position 4 instead of position 1. It's especially important when your application puts the first retrieved chunk into the prompt context, if the right answer is sixth in the list, it might not make it into the context window at all.
def mean_reciprocal_rank(retrieved_indices, relevant_indices):
    """
    Calculate the reciprocal rank for one query: 1 / rank of first relevant result.
    (Averaging this value over many queries gives the MRR.)

    Args:
        retrieved_indices: List of indices (sorted by relevance)
        relevant_indices: Set of relevant indices

    Returns:
        Reciprocal rank as float (0 to 1)
    """
    for rank, idx in enumerate(retrieved_indices, start=1):
        if idx in relevant_indices:
            return 1.0 / rank
    return 0.0

# Example
mrr = mean_reciprocal_rank(retrieved, relevant)
print(f"MRR: {mrr:.2f}")  # 1.0 (first result is relevant)

Why these matter: You want high precision@5 (top results are good) and high MRR (relevant results are near the top). If precision@5 is 0.3, your chunking or embeddings are broken.
Building a Ground-Truth Eval Set
Here's the hard part: you need labels. Which chunks are relevant to which queries? You have two realistic options. Option one: hire a subject-matter expert to manually label 50-100 query-chunk pairs. Tedious, expensive, but ground truth. Option two: use an LLM to auto-label a larger set, then spot-check a sample manually. The LLM approach scales better but introduces label noise, so verify that your LLM-generated labels match human judgment on at least 20% of samples before trusting them.
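The LLM auto-labeling loop can be sketched in a few lines. Here ask_llm is a stand-in for whatever chat call you use (e.g. OpenAI's chat completions), passed in as a parameter so the loop stays provider-agnostic; the assumption baked in is that a question written from a chunk is answered by that chunk:

```python
def build_synthetic_eval_set(chunks, ask_llm):
    """
    Auto-generate (query, relevant_chunk_id) labels by asking an LLM to write
    one question per chunk. ask_llm: function mapping a prompt string to a string.
    """
    pairs = []
    for i, chunk in enumerate(chunks):
        prompt = (
            "Write one short question that the following passage answers. "
            "Return only the question.\n\nPassage:\n" + chunk
        )
        question = ask_llm(prompt).strip()
        pairs.append((question, i))  # the source chunk is the ground-truth answer
    return pairs

# Usage with a stub (swap in a real LLM call in production):
fake_llm = lambda prompt: "What does this passage cover?"
pairs = build_synthetic_eval_set(["Returns are accepted within 30 days."], fake_llm)
print(pairs)  # [('What does this passage cover?', 0)]
```

The output pairs feed directly into the create_eval_dataset function below; the spot-checking advice above still applies, since LLM-written questions can be too easy or miss the chunk's main point.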
def create_eval_dataset(query_chunk_pairs):
    """
    Create evaluation dataset from labeled (query, relevant_chunk_id) pairs.

    Args:
        query_chunk_pairs: List of (query, chunk_id) tuples

    Returns:
        Dict mapping queries to sets of relevant chunk IDs
    """
    eval_set = {}
    for query, chunk_id in query_chunk_pairs:
        if query not in eval_set:
            eval_set[query] = set()
        eval_set[query].add(chunk_id)
    return eval_set

def evaluate_retrieval(chunks, embeddings, index, eval_set, k=5):
    """
    Evaluate retrieval quality on labeled dataset.

    Args:
        chunks: List of chunk texts
        embeddings: List of embeddings
        index: FAISS index
        eval_set: Dict of {query: {relevant_chunk_ids}}
        k: Evaluate top-k

    Returns:
        Dict of metrics
    """
    precisions = []
    mrrs = []
    for query, relevant_ids in eval_set.items():
        # Embed query and search
        query_embedding = embed_with_sentence_transformers([query])[0]
        distances, indices = search_faiss(index, query_embedding, k=k)
        # Compute metrics
        p_at_k = precision_at_k(indices, relevant_ids, k=k)
        mrr = mean_reciprocal_rank(indices, relevant_ids)
        precisions.append(p_at_k)
        mrrs.append(mrr)
    return {
        "precision@k_mean": np.mean(precisions),
        "precision@k_std": np.std(precisions),
        "mrr_mean": np.mean(mrrs),
        "mrr_std": np.std(mrrs)
    }

# Example: Create tiny eval set
eval_set = {
    "What is machine learning?": {0},  # Chunk 0 is relevant
    "Tell me about neural networks": {2}  # Chunk 2 is relevant
}
metrics = evaluate_retrieval(chunks, embeddings, index, eval_set, k=5)
print(f"Precision@5: {metrics['precision@k_mean']:.2f}")
print(f"MRR: {metrics['mrr_mean']:.2f}")

Run this evaluation after every significant change to your pipeline: after changing chunk size, after switching embedding models, after adding or removing documents. The standard deviation fields tell you something important too: a high standard deviation means your system works well for some queries and poorly for others, pointing to systematic gaps in your corpus coverage.
Real talk: Getting from 0.4 precision@5 to 0.8 is 90% of the work in RAG. You'll tune chunking, embedding models, and search parameters endlessly. The LLM is the easy part.
Layer 5: Generation
Finally, your LLM. Take the top-k chunks, build a prompt, ask the model to answer. The prompt structure here is not incidental, it directly determines how well the LLM uses the context you've retrieved. The model needs to understand that the context is authoritative, that it should prefer context over its training data, and that it should say "I don't know" when the context doesn't answer the question.
from openai import OpenAI
client = OpenAI(api_key="sk-...")
def generate_answer(user_query, retrieved_chunks, model="gpt-4"):
"""
Generate an answer using retrieved context.
Args:
user_query: User's question
retrieved_chunks: List of relevant text chunks
model: LLM model to use
Returns:
Generated answer string
"""
# Build context string
context = "\n\n".join(retrieved_chunks)
# Build prompt
prompt = f"""You are a helpful assistant. Answer the user's question using the provided context.
Context:
{context}
Question: {user_query}
Answer:"""
# Call LLM
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": prompt}
],
temperature=0.7,
max_tokens=500
)
return response.choices[0].message.content
# Example: End-to-end RAG
user_query = "What is machine learning?"
query_embedding = embed_with_sentence_transformers([user_query])[0]
distances, indices = search_faiss(index, query_embedding, k=3)
retrieved_chunks = [chunks[idx] for idx in indices]
answer = generate_answer(user_query, retrieved_chunks)
print(f"Q: {user_query}")
print(f"A: {answer}")

The temperature=0.7 setting is worth thinking about. For factual Q&A over a knowledge base, you typically want lower temperatures, closer to 0.1 or 0.2, because you want the model to precisely extract and rephrase information from the context, not creatively elaborate on it. Higher temperatures are appropriate when you're using RAG to ground a more conversational or creative output, where some variation is acceptable. Match your temperature to your use case.
That's it. You've built RAG from scratch.
Putting It All Together: End-to-End RAG
Here's the complete pipeline. Everything we've built so far comes together in a single class that you can drop into any project. The add_documents method handles chunking and indexing. The search method handles embedding and retrieval. The answer method handles generation. It's clean, it's debuggable, and every line does exactly what it looks like it does.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from openai import OpenAI
class RAGSystem:
def __init__(self, model_name="all-MiniLM-L6-v2"):
self.embedding_model = SentenceTransformer(model_name)
self.index = None
self.chunks = []
self.llm_client = OpenAI(api_key="sk-...")
def add_documents(self, documents, chunk_size=512, overlap=50):
"""Load and chunk documents."""
all_chunks = []
for doc in documents:
chunks = chunk_recursive(doc, chunk_size=chunk_size, overlap=overlap)
all_chunks.extend(chunks)
# Embed all chunks
embeddings = self.embedding_model.encode(all_chunks, convert_to_numpy=True)
# Build FAISS index
embeddings = np.array(embeddings).astype('float32')
faiss.normalize_L2(embeddings)
self.index = faiss.IndexFlatIP(embeddings.shape[1])
self.index.add(embeddings)
self.chunks = all_chunks
print(f"Indexed {len(all_chunks)} chunks")
def search(self, query, k=5):
"""Search for relevant chunks."""
query_embedding = self.embedding_model.encode([query], convert_to_numpy=True)[0]
query_embedding = np.array([query_embedding]).astype('float32')
faiss.normalize_L2(query_embedding)
distances, indices = self.index.search(query_embedding, k)
return [self.chunks[idx] for idx in indices[0]]
def answer(self, query, k=5, model="gpt-4"):
"""Full RAG: retrieve + generate."""
# Retrieve
relevant_chunks = self.search(query, k=k)
context = "\n\n".join(relevant_chunks)
# Generate
prompt = f"""Context: {context}
Question: {query}
Answer:"""
response = self.llm_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": prompt}],
max_tokens=500
)
return response.choices[0].message.content
# Usage
rag = RAGSystem()
rag.add_documents([
"Machine learning is a branch of AI that learns from data.",
"Deep learning uses neural networks with many layers."
])
answer = rag.answer("What is machine learning?")
print(answer)

This class is intentionally not production-ready: it has no error handling, no logging, no batch processing for large document sets, and no caching. But it's a solid foundation. Every enhancement you add from here (retry logic, async embedding, metadata filtering, multi-collection support) should slot in cleanly because the architecture is clear.
Done. Not pretty, but it works. And you understand every line.
Common RAG Mistakes
Most RAG failures are not subtle. They're the same mistakes, made over and over, by developers who skipped the evaluation step or never benchmarked their retrieval quality. Here are the five that will burn you the most.
Mistake 1: Chunk size too large. You set chunk_size=2048 because you figure more context is better. Your precision@5 tanks to 0.2. The problem: large chunks produce noisy, diffuse embeddings that average over many topics. The embedding for a 2000-character paragraph covers three separate ideas, none of them well. The similarity score for "How do I reset my password?" gets confused with a chunk that mentions passwords briefly in the context of a broader security discussion. Drop your chunk size to 256-512 and measure again.
Mistake 2: Not evaluating retrieval before blaming the LLM. Your system gives wrong answers. You assume the LLM is the problem and start swapping models. But retrieval precision@5 is 0.15: the correct answer isn't even in the context 85% of the time. The LLM is doing the best it can with garbage input. Always benchmark retrieval first. If precision@5 is below 0.5, retrieval is broken, not generation.
Mistake 3: Using cosine similarity when Euclidean is more appropriate. For most text embeddings, cosine similarity is correct because documents vary wildly in length and you want to capture semantic direction, not magnitude. But for some specialized embeddings, particularly those trained with metric learning objectives, L2 distance is more appropriate. Check your embedding model's documentation before assuming cosine is always right.
Mistake 4: Not handling out-of-scope queries. Your knowledge base covers product documentation. A user asks "Who won the 1994 World Cup?" Your retriever finds the three most similar chunks, which talk about your product's global availability. Your LLM confidently generates an answer about your product. The user is confused. Build an out-of-scope detector: if the maximum similarity score from retrieval falls below a threshold (typically 0.4-0.5 for normalized cosine), return "I can only answer questions about [domain]" instead of hallucinating.
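That threshold check is a few lines. Here's a minimal sketch; the 0.45 cutoff and the `guarded_answer` helper are illustrative assumptions, not from any library, and the threshold should be tuned on your own eval set:

```python
# Out-of-scope guard sketch. Assumes similarity scores from normalized-cosine
# search (range roughly [-1, 1]). OOS_THRESHOLD and guarded_answer are
# illustrative names, not part of any library.
OOS_THRESHOLD = 0.45

def guarded_answer(scores, chunks, domain="our product docs"):
    """Return (chunks, None) if retrieval is confident, else (None, refusal)."""
    if not scores or max(scores) < OOS_THRESHOLD:
        return None, f"I can only answer questions about {domain}."
    return chunks, None

# A 1994-World-Cup query against product docs scores low everywhere:
kept, refusal = guarded_answer([0.12, 0.08, 0.05], ["chunk a", "chunk b", "chunk c"])
print(refusal)  # I can only answer questions about our product docs.
```

Wire this in between retrieval and generation: if the guard fires, skip the LLM call entirely and return the refusal string.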
Mistake 5: Indexing without re-indexing. You build your vector store in January. Your documents get updated in March. You never re-index. Now your retrieval is working against stale chunks that no longer match the current documents. Build a re-indexing pipeline from day one. Track document modification times. Re-embed and re-index changed documents automatically. Stale retrieval is a silent failure mode that only shows up when a user notices the answer doesn't match the document they're looking at.
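The mtime tracking can be sketched in a few lines. `docs_needing_reindex` is a hypothetical helper, not a library API; in a real pipeline the current timestamps would come from `os.path.getmtime` or your CMS:

```python
# Sketch of change detection for re-indexing. last_indexed maps doc id to the
# modification time recorded when the doc was last embedded; anything newer
# (or never seen before) needs re-embedding and re-indexing.
def docs_needing_reindex(current_mtimes, last_indexed):
    return [
        doc_id
        for doc_id, mtime in current_mtimes.items()
        if last_indexed.get(doc_id, 0.0) < mtime  # new or modified since last index
    ]

# Usage: faq.md changed after its last indexing run; intro.md did not.
current = {"faq.md": 1700000000.0, "intro.md": 1690000000.0}
indexed = {"faq.md": 1699000000.0, "intro.md": 1695000000.0}
print(docs_needing_reindex(current, indexed))  # ['faq.md']
```

Run this check on a schedule, re-embed only the stale documents, and update `last_indexed` after each pass.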
Common Failure Modes (And How to Fix Them)
Problem 1: Search Returns Irrelevant Results
Symptoms: Precision@5 < 0.4. Search gives you tangential results.
Causes:
- Chunks too large (noisy embedding)
- Chunks too small (lost context)
- Embedding model not good for your domain
- Wrong similarity metric
Fixes:
- Try smaller chunk size (256 instead of 512)
- Switch to a domain-appropriate embedding model (e.g., all-mpnet-base-v2 for general text, allenai/specter for academic papers)
- Use cosine similarity instead of Euclidean (we do this by normalizing)
- Try hybrid search: keyword + semantic (see advanced techniques below)
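The hybrid-search fix can be sketched without extra dependencies. This is an assumption-laden toy: real systems use BM25 (e.g. the rank_bm25 package) for the keyword side, and `hybrid_rank` and its blend weight `alpha` are illustrative names that need tuning on your eval set:

```python
# Toy hybrid ranking: blend a semantic score (e.g. normalized cosine from FAISS)
# with a naive keyword-overlap score. Higher blended score ranks first.
def keyword_score(query, chunk):
    q_tokens = set(query.lower().split())
    c_tokens = set(chunk.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

def hybrid_rank(query, chunks, semantic_scores, alpha=0.5, k=5):
    blended = [
        alpha * sem + (1 - alpha) * keyword_score(query, chunk)
        for sem, chunk in zip(semantic_scores, chunks)
    ]
    order = sorted(range(len(chunks)), key=lambda i: blended[i], reverse=True)
    return [chunks[i] for i in order[:k]]

# Usage: the exact-keyword match wins despite a slightly lower semantic score.
chunks = ["reset your password in settings", "security best practices overview"]
print(hybrid_rank("reset password", chunks, [0.60, 0.65], k=1))
# → ['reset your password in settings']
```

The point of the blend: pure semantic search can rank a topically adjacent chunk above the one containing the user's exact terms, and a small keyword component corrects that.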
Problem 2: LLM Still Hallucinates Despite Good Retrieval
Symptoms: Retrieved chunks are relevant, but LLM ignores them.
Causes:
- Chunks don't fit in context window with full prompt
- Model too small (< 70B params)
- Prompt not emphasizing to use context
Fixes:
# Force the model to use context
prompt = f"""You MUST answer using ONLY the provided context.
If the context doesn't contain the answer, say "I don't have that information."
Context:
{context}
Question: {query}
Answer (using context only):"""

When the LLM ignores your context, a harder prompt is often the right lever to pull before you reach for a larger or more expensive model. The instruction "MUST answer using ONLY the provided context" combined with a clear fallback instruction ("say 'I don't have that information'") significantly reduces hallucination rates across most models. Test this before spending money on GPT-4 when GPT-3.5 with better prompting might suffice.
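The context-window cause listed above is also easy to check before making the call. A rough sketch, assuming ~0.75 words per token (a crude heuristic; use tiktoken for exact counts, and note that `fits_context` and its default limits are illustrative):

```python
# Rough pre-flight check: will the retrieved chunks plus the question fit the
# model's context window? The 0.75 words-per-token ratio is an assumption.
def fits_context(chunks, question, max_tokens=8000, reserve_for_answer=500):
    words = sum(len(c.split()) for c in chunks) + len(question.split())
    est_tokens = int(words / 0.75)
    return est_tokens + reserve_for_answer <= max_tokens

# Three short chunks fit easily; 6000 words of context do not fit in 1000 tokens.
print(fits_context(["a short chunk"] * 3, "What is ML?"))              # True
print(fits_context(["word " * 6000], "What is ML?", max_tokens=1000))  # False
```

If the check fails, drop the lowest-ranked chunks until it passes; silently truncated context is one of the ways "good retrieval" still produces hallucinated answers.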
Problem 3: Slow Search on Large Datasets
Symptoms: Search takes >100ms on 100k chunks.
Causes:
- FAISS IndexFlatIP is brute-force (no compression)
- Embeddings not normalized
Fixes:
# Use IVF index (faster, approximate)
quantizer = faiss.IndexFlatIP(embedding_dim)
index = faiss.IndexIVFFlat(quantizer, embedding_dim, n_clusters)
index.train(embeddings)
index.add(embeddings)
index.nprobe = 10  # search the 10 nearest clusters per query (the default, 1, hurts recall)

The IVF (Inverted File) index partitions your vector space into n_clusters regions and only searches the nprobe regions nearest to your query. The rule of thumb is n_clusters = sqrt(n_vectors): roughly 300 for a 100k-vector corpus, 1000 for a 1M-vector corpus. You'll trade 1-5% retrieval accuracy for 10-50x search speed. In most production systems, that's an excellent trade.
Advanced RAG: Beyond Basic Retrieval
You now have a working system. Here are three ways to make it better.
Technique 1: Hypothetical Document Embeddings (HyDE)
Instead of embedding the user's question, generate a hypothetical answer and embed that. The intuition is that questions and answers have different linguistic patterns: a question has interrogative structure while a document has declarative structure. By generating a hypothetical document that would answer the question, you're embedding something that's structurally similar to your actual corpus.
def hyde_retrieval(query, embedding_model, index, chunks, llm_client):
"""
HyDE: Generate hypothetical document, embed it, search.
Args:
query: User question
embedding_model: Sentence transformer
index: FAISS index
chunks: List of chunks
llm_client: OpenAI client
Returns:
Retrieved chunks
"""
# Generate hypothetical answer
response = llm_client.chat.completions.create(
model="gpt-4",
messages=[{
"role": "user",
"content": f"Write a paragraph answering: {query}"
}],
max_tokens=200
)
hypothetical_answer = response.choices[0].message.content
# Embed the hypothetical answer instead of the query
hypo_embedding = embedding_model.encode(
[hypothetical_answer],
convert_to_numpy=True
)[0]
# Search with hypothetical embedding
hypo_embedding = np.array([hypo_embedding]).astype('float32')
faiss.normalize_L2(hypo_embedding)
distances, indices = index.search(hypo_embedding, k=5)
return [chunks[idx] for idx in indices[0]]
# Usage
retrieved = hyde_retrieval("What is machine learning?", embedding_model, index, chunks, client)  # don't assign back to chunks, or you'd overwrite the corpus

HyDE adds a full LLM call to every retrieval operation, so it roughly doubles your latency. That cost is worth it when your queries are short and ambiguous: single-word queries like "latency" or "authentication" could match many document types. For longer, specific queries, the benefit is smaller. Measure before committing.
Why it works: The hypothetical answer is likely closer in language to the training documents than the raw question. Better retrieval.
Technique 2: Multi-Query Retrieval
Ask the LLM to rephrase the question, search multiple times, combine results. The underlying insight is that the embedding space is not perfectly smooth: small differences in phrasing can lead to meaningfully different retrieval results. By searching with three to five different phrasings of the same question, you cast a wider net and catch more of the relevant chunks.
def multi_query_retrieval(query, embedding_model, index, chunks, llm_client, k=5):
"""
Generate multiple reformulations of the query, retrieve with each, combine.
"""
# Generate reformulations
response = llm_client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[{
"role": "user",
"content": f"""Generate 3 different ways to ask: {query}
Format as:
1. [rephrasing 1]
2. [rephrasing 2]
3. [rephrasing 3]"""
}],
max_tokens=200
)
reformulations_text = response.choices[0].message.content
# Parse reformulations (simple newline split; skip blanks and any line without a "N. " prefix)
reformulations = [
line.split(". ", 1)[1]
for line in reformulations_text.split("\n")
if ". " in line
]
# Search with each reformulation
all_retrieved = set()
for requery in [query] + reformulations:
emb = embedding_model.encode([requery], convert_to_numpy=True)[0]
emb = np.array([emb]).astype('float32')
faiss.normalize_L2(emb)
distances, indices = index.search(emb, k=k)
all_retrieved.update(indices[0])
return [chunks[idx] for idx in all_retrieved]

The deduplication via set() is simple but effective. One limitation: this approach treats all retrieved chunks equally, with no ranking. In practice you'll want to add a voting or scoring mechanism: chunks retrieved by multiple query variants are more likely to be truly relevant, so weight them higher. A simple count of how many variants retrieved each chunk is a good starting point.
Why it works: Different phrasings retrieve different chunks. You catch more relevant results.
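The vote-count ranking mentioned above is a few lines. `fuse_by_votes` is a hypothetical helper: each query variant contributes at most one vote per chunk index it retrieved, and higher vote counts rank first:

```python
from collections import Counter

# Vote-count fusion for multi-query retrieval (illustrative helper, not a
# library API). result_lists holds one list of retrieved chunk indices per
# query variant; chunks retrieved by more variants rank higher.
def fuse_by_votes(result_lists, k=5):
    votes = Counter()
    for indices in result_lists:
        votes.update(set(indices))  # at most one vote per variant
    return [idx for idx, _ in votes.most_common(k)]

# Usage: chunk 7 was retrieved by all three variants, chunk 2 by two of them.
runs = [[7, 2, 9], [7, 3], [2, 7]]
print(fuse_by_votes(runs, k=2))  # [7, 2]
```

This slots into multi_query_retrieval by collecting `indices[0]` per variant instead of merging everything into one unranked set.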
Technique 3: Reranking with a Cross-Encoder
Retrieve top-50 with semantic search (fast), then rerank with a cross-encoder (slower but smarter). The key difference between a bi-encoder (what we've been using) and a cross-encoder is that a bi-encoder embeds query and document independently, while a cross-encoder sees both simultaneously and models their interaction directly. That interaction is where precision comes from.
from sentence_transformers import CrossEncoder
def rerank_chunks(query, chunks, k=5):
"""
Use cross-encoder to rerank chunks.
Cross-encoder scores (query, chunk) pairs directly (smarter than embedding distance).
"""
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
# Score all chunks
pairs = [[query, chunk] for chunk in chunks]
scores = cross_encoder.predict(pairs)
# Sort by score descending, return top-k
ranked = sorted(zip(scores, chunks), reverse=True)
return [chunk for score, chunk in ranked[:k]]
# Usage: hybrid
query = "What is machine learning?"
query_emb = embedding_model.encode([query], convert_to_numpy=True)[0]
query_emb = np.array([query_emb]).astype('float32')
faiss.normalize_L2(query_emb)
_, indices = index.search(query_emb, k=50) # Retrieve top 50 fast
retrieved_chunks = [chunks[idx] for idx in indices[0]]
# Rerank top 50 to get top 5
reranked = rerank_chunks(query, retrieved_chunks, k=5)
print(f"Reranked results: {len(reranked)} chunks")

The retrieve-50-then-rerank-to-5 pattern is the standard production approach for high-quality RAG. FAISS handles the coarse filtering in sub-millisecond time, and the cross-encoder handles the fine-grained ranking with full query-document interaction. In practice, this combination regularly outperforms pure semantic search by 15-30% on precision@5. Load the cross-encoder once at startup (it's a small, ~33M-parameter model), not per request.
Why it works: Semantic search is fast but noisy. Cross-encoders are slow but precise. Use semantic to filter to top-50, then cross-encode to top-5.
Cost: Cross-encoders add latency (50-100ms for 50 chunks). Worth it for quality-critical applications.
Comparing to LangChain
We've built RAG from scratch. LangChain does the same thing behind a thinner interface. Here's what it abstracts:
# Our implementation
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(chunks)
faiss_index = build_faiss_index(embeddings)
distances, indices = search_faiss(faiss_index, query_embedding)
# LangChain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(documents, embeddings)
results = vectorstore.similarity_search(query, k=5)

Both approaches call the same underlying FAISS functions. LangChain's version is more concise for standard use cases, but our version is easier to debug when something goes wrong, and easier to extend when your needs go beyond what LangChain's abstraction anticipated.
LangChain buys you:
- Abstraction: One interface for FAISS, Chroma, Pinecone, Weaviate, etc.
- Chaining: Easy prompt templates + LLM calls
- Integration: Document loaders for PDF, CSV, web, etc.
What it doesn't buy:
- Better retrieval quality (still up to you)
- Hidden magic (same primitives underneath)
- Easier debugging (abstractions hide problems)
When to use LangChain:
- You need to swap vector stores later
- You want document loader integrations
- You're building a complex chain (RAG → summarization → Q&A)
When to build from scratch:
- You want full control and debugging visibility
- Performance is critical (less abstraction overhead)
- You're optimizing retrieval quality (need to tweak every layer)
Key Takeaways
- RAG is retrieval first, generation second. Bad retrieval breaks everything. Spend 80% of your time here.
- Chunking is hard. Recursive character chunking respects structure. Sentence chunking is safer. Fixed-size is acceptable for prototypes. There's no universal best; test on your data.
- Embeddings matter. Sentence-Transformers is good enough for most cases. OpenAI is slightly better. Domain-specific models (academic, legal, medical) exist for specialized use.
- FAISS for speed, Chroma for persistence. Both support cosine similarity. Both scale to millions of vectors.
- Evaluate retrieval quality with precision@k and MRR on labeled data. If precision@5 < 0.5, your pipeline is broken, not your LLM.
- Advanced techniques work: HyDE, multi-query, reranking. But only if basic retrieval is solid.
- LLMs are the easy part. A small language model with excellent retrieval beats a giant model with garbage retrieval.
Quick Reference: RAG Checklist
[ ] Load documents and chunk them (recursive character, 256-512 chars)
[ ] Embed chunks (sentence-transformers or OpenAI)
[ ] Store in vector database (FAISS or Chroma)
[ ] Build eval set with ground-truth labels
[ ] Measure precision@5 and MRR (target: >0.6)
[ ] Generate answers using top-k chunks
[ ] If precision < 0.5: adjust chunk size or embedding model
[ ] If retrieval is good but LLM hallucinates: improve prompt
[ ] Consider advanced techniques: HyDE, multi-query, reranking
[ ] Monitor live: log queries, retrieved chunks, answers
[ ] Iterate: user feedback → better eval set → better parameters
Summary
You've now built RAG from the ground up: chunking, embeddings, vector search, and generation. You understand the failure modes. You can measure retrieval quality. You can implement advanced techniques. And you know when to reach for LangChain and when to stay close to the metal.
The most important thing to internalize is that RAG quality is not a function of which LLM you use. It's a function of how well your retrieval pipeline surfaces the right information at the right time. The best RAG systems in production are not running the largest models; they're running well-engineered retrieval pipelines with solid evaluation infrastructure and continuous iteration based on real user feedback.
Start with a 50-query labeled eval set. Measure precision@5. If it's below 0.5, fix your chunking. If it's above 0.5, measure your end-to-end answer quality. If answers are still wrong, improve your prompt. Only then, if retrieval and prompting are both dialed in and you still need better performance, should you consider upgrading your embedding model or your LLM. Work from the ground up, not from the top down.
RAG is not magic. It's engineering. Get the retrieval right, and any competent LLM becomes powerful. Screw up the retrieval, and even GPT-4 becomes a hallucination machine. The future of LLMs isn't bigger models; it's smarter retrieval. Your documents, your rules, your context. Build it well, and you control the narrative.