
Most RAG tutorials I found were either "pip install langchain and you're done" or 50-page academic papers. I wanted something in between — a pipeline I could actually explain in an interview, where I understood every line.
So I built one from scratch. No LangChain, no LlamaIndex, no frameworks. Just FastAPI, FAISS, sentence-transformers, and an LLM API.
Here's what I built, what worked, and what broke.
The architecture
PDF --> extract text (pypdf) --> chunk (500 char, 50 overlap) --> embed (MiniLM-L6-v2)
|
v
question --> embed --> FAISS top-k search --> build prompt with chunks --> LLM --> answer + sources
Five Python files, ~300 lines total:
| File | Responsibility |
|---|---|
| main.py | FastAPI app, 3 endpoints, prompt engineering |
| pdf_loader.py | PDF text extraction via pypdf |
| rag.py | Chunking + embedding |
| store.py | FAISS vector store wrapper |
| llm.py | Swappable LLM client (Groq / OpenAI / Anthropic) |
How the upload works
When you POST a PDF to /upload, three things happen:
1. Text extraction — pypdf reads each page and returns the raw text. Pages with no extractable text (scanned images) are skipped.
2. Chunking — each page is split into ~500-character chunks with 50 characters of overlap. The overlap prevents losing context at chunk boundaries.
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

# Chunk is a small record with chunk_id, text, and page fields.
def chunk_pages(pages):
    chunks = []
    chunk_id = 0
    for text, page_num in pages:
        start = 0
        while start < len(text):
            end = min(start + CHUNK_SIZE, len(text))
            chunk_text = text[start:end].strip()
            if chunk_text:  # skip whitespace-only slices
                chunks.append(Chunk(chunk_id=chunk_id, text=chunk_text, page=page_num))
                chunk_id += 1
            if end == len(text):
                break  # reached the end of this page
            start = end - CHUNK_OVERLAP  # step back so neighbors overlap
    return chunks
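To see what the overlap does in practice, here is a small standalone sketch that computes only the (start, end) offsets the loop above would produce for one page. The helper name `chunk_spans` is hypothetical and not part of the repo. It mirrors the stepping logic but does not strip or skip empty slices.

```python
def chunk_spans(text_len, size=500, overlap=50):
    """Return (start, end) character offsets for a text of text_len chars.

    Hypothetical helper: same stepping logic as chunk_pages, but it
    returns offsets instead of Chunk objects.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    spans = []
    start = 0
    while start < text_len:
        end = min(start + size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        start = end - overlap  # next chunk re-reads the last `overlap` chars
    return spans


print(chunk_spans(1200))  # [(0, 500), (450, 950), (900, 1200)]
```

A 1,200-character page becomes three chunks, and characters 450-499 and 900-949 each appear in two of them.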
3. Embedding — each chunk is embedded into a 384-dimensional vector using all-MiniLM-L6-v2. This runs locally on CPU, no API call needed. Vectors are normalized so we can use inner product as cosine similarity.
def embed_texts(texts):
    model = get_embed_model()  # lazy-loaded singleton
    vectors = model.encode(
        texts,
        normalize_embeddings=True,
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    return vectors.astype("float32")
The vectors and chunk metadata go into a FAISS IndexFlatIP index — brute-force exact search, which is fine for up to ~100k vectors.
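Conceptually, an exact inner-product index scores the query against every stored vector and keeps the best k. The sketch below shows that idea in pure Python with toy 3-dimensional vectors. It is an illustration of the technique, not FAISS itself, and the function names are made up for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_inner_product(query, vectors, k=3):
    """Exact search: score every stored vector, return the k best (score, index)."""
    scores = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    # Highest inner product first; ties broken by lower index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return scores[:k]


# Toy "embeddings" (the real ones from MiniLM have 384 dimensions).
stored = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query = normalize([1.0, 1.0, 0.0])

for score, idx in top_k_inner_product(query, stored, k=2):
    print(idx, round(score, 3))
```

With the real library, the same operation is `faiss.IndexFlatIP(384)`, followed by `index.add(vectors)` and `index.search(query_vectors, k)`. The cost grows linearly with the number of vectors, which is why a flat index stops being practical at larger scales.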
How the query works
When you POST a question to /query:
- The question is embedded using the same model
- FAISS finds the top-k most similar chunks by cosine similarity
- The chunks are formatted into a prompt with labels like [Chunk 3 | Page 2]
- The LLM generates an answer grounded in those chunks
- Both the answer and source chunks are returned
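The prompt-assembly step can be sketched roughly as follows. The `RetrievedChunk` record, the function names, and the "Context / Question" layout are assumptions for illustration. Only the `[Chunk N | Page M]` label format comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    """Hypothetical record for one search hit; field names are assumptions."""
    chunk_id: int
    page: int
    text: str


def build_context(chunks):
    """Join retrieved chunks into one string with [Chunk N | Page M] labels."""
    blocks = [f"[Chunk {c.chunk_id} | Page {c.page}]\n{c.text}" for c in chunks]
    return "\n\n".join(blocks)


def build_user_prompt(question, chunks):
    """Combine the labeled context and the question into the user message."""
    return f"Context:\n{build_context(chunks)}\n\nQuestion: {question}"


# Placeholder chunk text, not taken from the real test document.
hits = [
    RetrievedChunk(chunk_id=3, page=2, text="(text of chunk 3)"),
    RetrievedChunk(chunk_id=7, page=4, text="(text of chunk 7)"),
]
print(build_user_prompt("What's included in the Standard tier?", hits))
```

Keeping the labels in the prompt makes it possible to map the answer back to the source chunks that are returned alongside it.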
The system prompt is deliberately strict:
You are a careful assistant that answers questions strictly
from the provided document context.
Rules:
- Use ONLY the context below. Do not use outside knowledge.
- If the answer is not in the context, say:
"I couldn't find that in the document."
Swappable LLM providers
One thing I'm happy with — the LLM is swappable via a single environment variable:
LLM_PROVIDER=groq # or openai, or anthropic
All three providers share the same interface:
from abc import ABC, abstractmethod

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...
You only need an API key for the provider you pick. I used Groq with Llama 3.3 70B for development because it's fast and free-tier friendly.
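One way to wire the environment variable to a concrete client is a small registry plus a factory function. This is a sketch under assumptions: the class names, the `get_llm_client` function, and the "groq" default are hypothetical, and the provider classes are stubs where real SDK calls would go.

```python
import os
from abc import ABC, abstractmethod


class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...


class _StubClient(LLMClient):
    """Placeholder base: a real client would call its provider's SDK here."""

    def generate(self, system: str, user: str) -> str:
        raise NotImplementedError("wire up the provider SDK here")


class GroqClient(_StubClient): ...
class OpenAIClient(_StubClient): ...
class AnthropicClient(_StubClient): ...


# Maps the LLM_PROVIDER value to the class that implements it.
PROVIDERS = {
    "groq": GroqClient,
    "openai": OpenAIClient,
    "anthropic": AnthropicClient,
}


def get_llm_client(provider=None):
    """Pick a client from the argument or LLM_PROVIDER (default is assumed)."""
    name = (provider or os.environ.get("LLM_PROVIDER", "groq")).strip().lower()
    if name not in PROVIDERS:
        raise ValueError(
            f"Unknown LLM_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}"
        )
    return PROVIDERS[name]()
```

Because the rest of the app only calls `generate(system, user)`, adding a fourth provider means one new class and one new registry entry.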
Testing it: what worked and what didn't
I created a fictional 5-page company document and threw 19 questions at the pipeline. Questions ranged from simple lookups to multi-hop reasoning to negative tests (questions the document can't answer).
What worked well:
- Direct lookups: "What is the list price of the Magpie-7?" — nailed it
- Table data: "What's included in the Standard tier?" — correct
- Negative tests: "What's Zentara's stock ticker?" — correctly said "not in the document"
- Multi-hop: "If I want 1-hour SLA support, what will it cost?" — combined info from the pricing table
What failed:
- "Who is the CEO?" — couldn't find it
- "How many employees does Zentara have?" — couldn't find it
Both answers were on page 1, in a dense "Company snapshot" table: CEO, CTO, HQ, employees, revenue — all packed together.
Why it failed (and what I learned)
The problem wasn't the LLM — it was the retriever. The Company snapshot table had 8+ different facts crammed into one chunk. The embedding for that chunk became a muddy average of all those topics, so it didn't rank highly for any specific question.
This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search (BM25) would find it instantly. But vector search relies on semantic similarity, and a short query like "Who is the CEO?" doesn't produce a strong enough match against a chunk that's 80% about revenue, headquarters, and employee count.
The fix is hybrid retrieval: run BM25 (keyword matching) alongside vector search and merge the two result lists. This is a common pattern in production RAG systems. It's on my to-do list.
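One common way to merge a keyword ranking with a vector ranking is Reciprocal Rank Fusion (RRF). The sketch below is not the approach this project has settled on, since hybrid retrieval isn't implemented yet. It only shows how fusion could rescue a chunk that one retriever ranks first and the other buries. The rankings are made up, and `k=60` is the conventional RRF constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per chunk (rank starts at 1),
    so a chunk ranked highly by either retriever moves up.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Best fused score first; ties broken by lower chunk id.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Made-up rankings: chunk 0 stands in for the dense "Company snapshot" chunk.
bm25_ranking = [0, 4, 2]        # keyword search finds "CEO" and puts chunk 0 first
vector_ranking = [4, 2, 5, 0]   # semantic search ranks chunk 0 last

print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))  # [4, 0, 2, 5]
```

In this toy case chunk 0 moves from fourth place in the vector ranking to second place in the fused list, which would put it inside a typical top-k window.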
Key design decisions (interview-ready)
If you're building this for interviews, these are the tradeoffs worth knowing:
| Decision | Why |
|---|---|
| Character-based chunking (not token-based) | Simpler, no tokenizer dependency. Production would use tiktoken. |
| Local embeddings (not OpenAI) | Free, offline, no API latency. Lower quality but fine for demos. |
| FAISS IndexFlatIP (not HNSW) | Exact search, no approximation. Fine up to ~100k vectors. |
| Normalized embeddings | Inner product = cosine similarity. One less thing to configure. |
| No streaming | v1 simplification. Streaming is where LLM SDKs diverge the most. |
| No conversation memory | Each query is independent. Adding memory is straightforward but adds complexity. |
What I'd add next
- Hybrid retrieval (BM25 + vector) — catches keyword matches that pure semantic search misses
- Reranker (cross-encoder) — re-scores the top-k results for better precision
- Evaluation set — automated accuracy measurement instead of manual testing
- Streaming — better UX for longer answers
- Conversation memory — follow-up questions
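For the evaluation set, a first version could be as simple as a list of (question, expected substring) pairs and a loop. The sketch below is hypothetical: `evaluate` is not in the repo, substring matching is a crude stand-in for real answer grading, and the stub pipeline just returns the refusal message so the example runs without a model.

```python
REFUSAL = "I couldn't find that in the document."


def evaluate(answer_fn, cases):
    """Run each question through answer_fn and check for an expected substring.

    cases: list of (question, expected_substring) pairs.
    Returns (accuracy, failed_questions).
    """
    failures = []
    for question, expected in cases:
        answer = answer_fn(question)
        if expected.lower() not in answer.lower():
            failures.append(question)
    accuracy = 1 - len(failures) / len(cases) if cases else 0.0
    return accuracy, failures


def stub_pipeline(question):
    """Stand-in for the real /query call; always refuses."""
    return REFUSAL


cases = [
    # Negative test: the refusal is the correct answer.
    ("What's Zentara's stock ticker?", "couldn't find that"),
    # Placeholder expectation; a real case would hold the name from the document.
    ("Who is the CEO?", "<CEO name from the document>"),
]

accuracy, failures = evaluate(stub_pipeline, cases)
print(accuracy, failures)
```

Running the same cases before and after a retrieval change gives a quick regression check, which manual testing of 19 questions does not.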
Try it yourself
The repo is here: github.com/santanu2908/chat-with-pdf-rag
uv sync
cp .env.example .env # set your API key
uv run uvicorn app.main:app --reload
Open http://localhost:8000/docs, upload a PDF, and start asking questions.
If you've built something similar or have suggestions (especially on hybrid retrieval), I'd love to hear about it in the comments.
I'm Santanu Mohanta — you can connect with me on LinkedIn or check out my other projects on GitHub.