TechTrends Now - Tech News for Builders and Operators

I built a RAG pipeline from scratch: no LangChain, just FastAPI + FAISS

Most RAG tutorials I found were either "pip install langchain and you're done" or 50-page academic papers. I wanted something in between — a pipeline I could actually explain in an interview, where I understood every line.

So I built one from scratch. No LangChain, no LlamaIndex, no frameworks. Just FastAPI, FAISS, sentence-transformers, and an LLM API.

Here's what I built, what worked, and what broke.

The architecture

PDF --> extract text (pypdf) --> chunk (500 char, 50 overlap) --> embed (MiniLM-L6-v2)
                                                                        |
                                                                        v
question --> embed --> FAISS top-k search --> build prompt with chunks --> LLM --> answer + sources

Five Python files, ~300 lines total:

| File | Responsibility |
| --- | --- |
| `main.py` | FastAPI app, 3 endpoints, prompt engineering |
| `pdf_loader.py` | PDF text extraction via pypdf |
| `rag.py` | Chunking + embedding |
| `store.py` | FAISS vector store wrapper |
| `llm.py` | Swappable LLM client (Groq / OpenAI / Anthropic) |

How the upload works

When you POST a PDF to /upload, three things happen:

1. Text extraction — pypdf reads each page and returns the raw text. Pages with no extractable text (scanned images) are skipped.

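2. Chunking: each page's text is split into chunks of up to 500 characters, with 50 characters of overlap between consecutive chunks. The overlap reduces the chance that a fact split across a chunk boundary is lost. Because pages are chunked separately, every chunk maps to exactly one page.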

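from dataclasses import dataclass

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50  # must stay smaller than CHUNK_SIZE, or the loop below never advances

# Chunk is not shown in the post; a minimal dataclass with the fields used below is assumed.
@dataclass
class Chunk:
    chunk_id: int
    text: str
    page: int

def chunk_pages(pages):
    """Split (text, page_num) pairs into overlapping character windows."""
    chunks = []
    chunk_id = 0
    for text, page_num in pages:
        start = 0
        while start < len(text):
            end = min(start + CHUNK_SIZE, len(text))
            chunk_text = text[start:end].strip()
            if chunk_text:  # skip windows that contain only whitespace
                chunks.append(Chunk(chunk_id=chunk_id, text=chunk_text, page=page_num))
                chunk_id += 1
            if end == len(text):
                break
            start = end - CHUNK_OVERLAP  # step back so neighbouring chunks share 50 characters
    return chunks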
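
To make the window arithmetic concrete, here is a small sketch that returns only the character offsets the loop above would visit. The helper name `chunk_spans` and the guard against `overlap >= chunk_size` are my additions for illustration, not part of the repo.

```python
def chunk_spans(text_len, chunk_size=500, overlap=50):
    """Return the (start, end) character offsets visited by the sliding window."""
    if overlap >= chunk_size:
        # Without this guard the window would never move forward.
        raise ValueError("overlap must be smaller than chunk_size")
    spans = []
    start = 0
    while start < text_len:
        end = min(start + chunk_size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        start = end - overlap  # each new window starts 50 characters before the last one ended
    return spans


spans = chunk_spans(1200)
print(spans)  # [(0, 500), (450, 950), (900, 1200)]
```

A 1,200-character page therefore yields three chunks, and the window advances by 450 characters each step (chunk size minus overlap).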

3. Embedding — each chunk is embedded into a 384-dimensional vector using all-MiniLM-L6-v2. This runs locally on CPU, no API call needed. Vectors are normalized so we can use inner product as cosine similarity.

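from functools import lru_cache
from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def get_embed_model():
    # Assumed helper (not shown in the post): load the model once, then reuse it.
    return SentenceTransformer("all-MiniLM-L6-v2")

def embed_texts(texts):
    model = get_embed_model()  # lazy-loaded singleton
    vectors = model.encode(
        texts,
        normalize_embeddings=True,  # unit-length vectors, so inner product equals cosine
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    return vectors.astype("float32")  # FAISS expects float32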
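
Why normalization matters is easy to check by hand. The sketch below uses plain Python lists instead of real embeddings (the two vectors are toy values I made up) and shows that the inner product of unit-length vectors matches the cosine similarity of the originals.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
long_way = cosine(a, b)
short_way = dot(normalize(a), normalize(b))
print(math.isclose(long_way, short_way))  # True
```

Normalizing once at embedding time means the index only ever has to compute dot products.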

The vectors and chunk metadata go into a FAISS IndexFlatIP index — brute-force exact search, which is fine for up to ~100k vectors.
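
"Brute-force exact search" means every query is scored against every stored vector. The following is a pure-Python stand-in for that idea, not FAISS itself; `top_k_inner_product` and the toy 2-dimensional vectors are hypothetical.

```python
import heapq


def top_k_inner_product(query, vectors, k=3):
    """Score every stored vector against the query and keep the k best."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    # nlargest sorts by score first, so the best match comes out first.
    return [(idx, score) for score, idx in heapq.nlargest(k, scored)]


# Unit-length toy vectors standing in for chunk embeddings.
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.8, 0.6]
results = top_k_inner_product(query, vectors, k=2)
print([idx for idx, _ in results])  # [1, 0]
```

The cost is linear in the number of vectors: at 384 dimensions and 100k chunks, one query is about 38.4 million multiply-adds, which is why a flat index stays practical at this scale.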

How the query works

When you POST a question to /query:

1. The question is embedded using the same model.
2. FAISS finds the top-k most similar chunks by cosine similarity.
3. The chunks are formatted into a prompt with labels like [Chunk 3 | Page 2].
4. The LLM generates an answer grounded in those chunks.
5. Both the answer and source chunks are returned.
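
The prompt-formatting step could look roughly like this. It is a sketch under assumptions: the function names, the dict-shaped chunks, and the "Context:" / "Question:" layout are mine; only the `[Chunk 3 | Page 2]` label format comes from the pipeline described above.

```python
def build_context(chunks):
    """Format retrieved chunks as labeled blocks the LLM can cite."""
    blocks = [
        f"[Chunk {c['chunk_id']} | Page {c['page']}]\n{c['text']}"
        for c in chunks
    ]
    return "\n\n".join(blocks)


def build_user_message(question, chunks):
    """Combine the labeled context and the question into one user message."""
    return f"Context:\n{build_context(chunks)}\n\nQuestion: {question}"


# Hypothetical retrieved chunk, shaped as a plain dict for the example.
retrieved = [{"chunk_id": 3, "page": 2, "text": "Standard tier includes email support."}]
message = build_user_message("What's included in the Standard tier?", retrieved)
print(message)
```

Keeping the labels in the prompt is what makes it possible to return sources alongside the answer.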

The system prompt is deliberately strict:

You are a careful assistant that answers questions strictly
from the provided document context.

Rules:
- Use ONLY the context below. Do not use outside knowledge.
- If the answer is not in the context, say:
  "I couldn't find that in the document."

Swappable LLM providers

One thing I'm happy with — the LLM is swappable via a single environment variable:

LLM_PROVIDER=groq      # or openai, or anthropic

All three providers share the same interface:

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

You only need an API key for the provider you pick. I used Groq with Llama 3.3 70B for development because it's fast and free-tier friendly.
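
One way to wire the environment variable to the interface is a small registry plus a factory. This is a sketch, not the code in `llm.py`: `EchoClient`, `PROVIDERS`, `get_llm_client`, and the default of `groq` are all hypothetical, and the stand-in client makes no network call.

```python
import os
from abc import ABC, abstractmethod


class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...


class EchoClient(LLMClient):
    """Stand-in for a real provider client; it only echoes the user message."""

    def __init__(self, name: str):
        self.name = name

    def generate(self, system: str, user: str) -> str:
        return f"[{self.name}] {user}"


# Hypothetical registry: real code would map each name to a class wrapping that provider's SDK.
PROVIDERS = {
    "groq": lambda: EchoClient("groq"),
    "openai": lambda: EchoClient("openai"),
    "anthropic": lambda: EchoClient("anthropic"),
}


def get_llm_client(provider=None) -> LLMClient:
    """Pick a client from the argument or the LLM_PROVIDER environment variable."""
    name = (provider or os.environ.get("LLM_PROVIDER", "groq")).lower()
    try:
        return PROVIDERS[name]()
    except KeyError:
        raise ValueError(
            f"Unknown LLM_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}"
        ) from None


client = get_llm_client("openai")
print(client.generate("You are careful.", "hello"))  # [openai] hello
```

Because the rest of the app only sees `generate(system, user)`, adding a provider means adding one registry entry.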

Testing it: what worked and what didn't

I created a fictional 5-page company document and threw 19 questions at the pipeline. Questions ranged from simple lookups to multi-hop reasoning to negative tests (questions the document can't answer).

What worked well:

- Direct lookups: "What is the list price of the Magpie-7?" was answered correctly.
- Table data: "What's included in the Standard tier?" was answered correctly.
- Negative tests: "What's Zentara's stock ticker?" correctly got a "not in the document" response.
- Multi-hop: "If I want 1-hour SLA support, what will it cost?" was answered by combining info from the pricing table.

What failed:

- "Who is the CEO?": the pipeline couldn't find it.
- "How many employees does Zentara have?": the pipeline couldn't find it.

Both answers were on page 1, in a dense "Company snapshot" table: CEO, CTO, HQ, employees, revenue — all packed together.

Why it failed (and what I learned)

The problem wasn't the LLM — it was the retriever. The Company snapshot table had 8+ different facts crammed into one chunk. The embedding for that chunk became a muddy average of all those topics, so it didn't rank highly for any specific question.

This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search (BM25) would find it instantly. But vector search relies on semantic similarity, and a short query like "Who is the CEO?" doesn't produce a strong enough match against a chunk that's 80% about revenue, headquarters, and employee count.

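The fix: hybrid retrieval, which combines BM25 (keyword matching) with vector search so that an exact term like "CEO" can surface a chunk even when its embedding is a weak match. This is a common pattern in production RAG systems. It's on my to-do list.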
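
Since I haven't built this yet, here is only a sketch of what hybrid retrieval might look like: a minimal BM25 scorer plus reciprocal rank fusion (RRF), which is one common way to merge rankings. The three chunk texts and the `vector_ranking` are invented toy values that mimic the failure above; nothing here is measured output from the pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_ranking(query, docs, k1=1.5, b=0.75):
    """Return document indices sorted by BM25 score, best first."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse best-first rankings; each one contributes 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


# Invented toy chunks; index 1 plays the role of the dense "Company snapshot" table.
docs = [
    "Pricing table: Standard tier includes email support.",
    "Company snapshot: CEO, CTO, HQ, employees, revenue.",
    "Magpie-7 list price and warranty terms.",
]
keyword_ranking = bm25_ranking("Who is the CEO?", docs)
vector_ranking = [0, 2, 1]  # hypothetical: the vector search ranks the snapshot chunk last
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(keyword_ranking, fused)  # [1, 0, 2] [0, 1, 2]
```

In this toy case the snapshot chunk moves from last place in the vector ranking to second place after fusion, which would be enough to reach the LLM with a top-k of 2 or more.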

Key design decisions (interview-ready)

If you're building this for interviews, these are the tradeoffs worth knowing:

| Decision | Why |
| --- | --- |
| Character-based chunking (not token-based) | Simpler, no tokenizer dependency. A production system would more likely count tokens, for example with tiktoken for OpenAI models or the embedding model's own tokenizer. |
| Local embeddings (not OpenAI) | Free, offline, no API latency. Retrieval quality is generally lower than with larger hosted models, but fine for demos. |
| FAISS IndexFlatIP (not HNSW) | Exact search, no approximation. Fine up to ~100k vectors. |
| Normalized embeddings | Inner product = cosine similarity. One less thing to configure. |
| No streaming | v1 simplification. Streaming is where LLM SDKs diverge the most. |
| No conversation memory | Each query is independent. Adding memory is straightforward but adds complexity. |

What I'd add next

- Hybrid retrieval (BM25 + vector): catches keyword matches that pure semantic search misses
- Reranker (cross-encoder): re-scores the top-k results for better precision
- Evaluation set: automated accuracy measurement instead of manual testing
- Streaming: better UX for longer answers
- Conversation memory: follow-up questions
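
For the evaluation set, a first step could be measuring the retriever alone: does the expected page show up in the top-k results? This is a sketch with hypothetical names (`hit_rate_at_k`, `fake_retrieve`) and canned page numbers. Only "the CEO is on page 1" comes from my test document; the rest is made up for the example.

```python
def hit_rate_at_k(eval_set, retrieve, k=3):
    """Fraction of questions whose expected page appears among the top-k retrieved pages."""
    hits = 0
    for case in eval_set:
        pages = retrieve(case["question"], k)
        if case["expected_page"] in pages:
            hits += 1
    return hits / len(eval_set)


# Canned results standing in for a real retriever (page numbers are invented).
canned_pages = {
    "Who is the CEO?": [3, 4, 2],
    "What is the list price of the Magpie-7?": [2, 3, 1],
}


def fake_retrieve(question, k):
    return canned_pages[question][:k]


eval_set = [
    {"question": "Who is the CEO?", "expected_page": 1},
    {"question": "What is the list price of the Magpie-7?", "expected_page": 2},
]
print(hit_rate_at_k(eval_set, fake_retrieve, k=3))  # 0.5
```

Scoring retrieval separately from generation would have pointed at the retriever right away in the CEO case, without reading LLM answers by hand.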

Try it yourself

The repo is here: github.com/santanu2908/chat-with-pdf-rag

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload a PDF, and start asking questions.

If you've built something similar or have suggestions (especially on hybrid retrieval), I'd love to hear about it in the comments.

I'm Santanu Mohanta — you can connect with me on LinkedIn or check out my other projects on GitHub.
