When I started building my MCA capstone project, I had one non-negotiable requirement: user data must never leave the machine. That ruled out OpenAI, Anthropic, and every cloud LLM API. What followed was one of the most educational deep-dives of my dev career.
Here's a full breakdown of how I built an offline, privacy-first RAG (Retrieval-Augmented Generation) chatbot that lets users query their PDF documents using a locally-running LLM.
The Architecture
PDF Files → Ingestion Pipeline → ChromaDB (vector store) → Retriever → LLaMA 3.2 (via Ollama) → Answer
The five core components are:
- PDF Parser — extracts raw text from uploaded PDFs
- Text Chunker — splits text into overlapping chunks for better retrieval
- Embedding Model — converts chunks into vector embeddings
- ChromaDB — stores and indexes the embeddings for similarity search
- LLaMA 3.2 via Ollama — generates answers from retrieved context
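Before diving into the real implementation, here's a minimal sketch of how those five components compose, with every function name hypothetical (the actual project uses LangChain, ChromaDB, and Ollama for each of these roles):

```python
def build_index(pdf_paths, parse_pdf, chunk, embed):
    """Ingestion: PDF -> chunks -> embeddings -> (chunk, vector) index."""
    index = []
    for path in pdf_paths:
        text = parse_pdf(path)          # PDF Parser
        for c in chunk(text):           # Text Chunker
            index.append((c, embed(c))) # Embedding Model -> vector store
    return index

def answer(question, index, embed, retrieve, generate):
    """Query time: embed the question, retrieve top chunks, generate an answer."""
    context = retrieve(embed(question), index)  # Retriever over the vector store
    return generate(question, context)          # LLM answers from retrieved context
```

The key property this makes visible: the expensive work (parsing, chunking, embedding) happens once at ingestion, while each query only pays for one embedding, one similarity search, and one generation.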
Step 1: PDF Ingestion with LangChain
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def ingest_pdf(file_path: str):
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
    )
    chunks = splitter.split_documents(documents)
    return chunks
```
The RecursiveCharacterTextSplitter is my preferred splitter because it tries to keep paragraphs and sentences intact before breaking them up — this preserves context much better than a naive character split.
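To see what `chunk_overlap` actually buys you, here's a deliberately naive fixed-size chunker (a hypothetical `chunk_text`, not LangChain's API, which additionally respects paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Naive fixed-size chunking: each chunk repeats the last
    `chunk_overlap` characters of the previous one, so a sentence
    straddling a boundary still appears whole in at least one chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

Consecutive chunks share a 50-character window; without that, a fact split across a boundary could be unretrievable because neither half matches the query well on its own.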
Step 2: Storing Embeddings in ChromaDB
```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
```
I chose nomic-embed-text as the embedding model because it runs locally via Ollama and produces high-quality embeddings for semantic search — comparable to OpenAI's text-embedding-ada-002 in most benchmarks.
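Under the hood, "similarity search" is mostly cosine similarity between the query vector and every stored chunk vector. A pure-Python sketch of what ChromaDB does for you (hypothetical `top_k`, toy 2-dimensional vectors; real embeddings have hundreds of dimensions and real stores use approximate-nearest-neighbor indexes instead of a full sort):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product normalized by vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=4):
    """index: list of (chunk, vector). Return the k chunks closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```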
Step 3: Multi-Query Retriever
One of the most impactful improvements I made was switching from a simple similarity retriever to a multi-query retriever. Instead of searching for your exact question, it generates multiple paraphrased versions of the query and retrieves results for all of them.
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.llms import Ollama

llm = Ollama(model="llama3.2")

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)
```
This single change improved answer quality noticeably — especially for vague or complex questions.
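The core idea is simple enough to sketch in a few lines. Here `paraphrase` stands in for the LLM call that `MultiQueryRetriever` makes internally, and `retrieve` for the underlying vector search; both names are hypothetical:

```python
def multi_query_retrieve(question, paraphrase, retrieve, k=4):
    """Ask the LLM for paraphrases, retrieve for each variant,
    and return the deduplicated union of results in first-seen order."""
    queries = [question] + paraphrase(question)
    seen, results = set(), []
    for q in queries:
        for chunk in retrieve(q, k):
            if chunk not in seen:
                seen.add(chunk)
                results.append(chunk)
    return results
```

Because each paraphrase lands in a slightly different spot in embedding space, the union covers chunks that the original wording alone would have missed, which is exactly why recall improves on vague questions.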
Step 4: Running LLaMA 3.2 Locally via Ollama
Getting Ollama set up is surprisingly straightforward:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull LLaMA 3.2
ollama pull llama3.2

# Pull the embedding model
ollama pull nomic-embed-text
```
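Once the model is pulled, the last step is "stuffing" the retrieved chunks into the prompt the model sees. A minimal sketch of that prompt assembly (hypothetical `build_prompt`; chains like LangChain's RetrievalQA do a version of this for you):

```python
def build_prompt(question, chunks):
    """Stuff retrieved chunks into a grounded prompt for the local model.
    Numbering the chunks makes it easy to spot which one an answer came from."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The "say so if it's not in the context" instruction is doing real work here: it's the main lever for keeping a small local model from hallucinating answers that aren't in the user's documents.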
Key Takeaways
- Local LLMs are production-ready for specific use cases. LLaMA 3.2 handled document Q&A remarkably well.
- Chunking strategy matters more than you think. Spend time tuning chunk_size and chunk_overlap for your document types.
- Multi-query retrieval is a simple upgrade that significantly improves recall.
- ChromaDB's persistence means you only embed once — subsequent queries are fast.
The full project is on my GitHub. Feel free to open issues or questions!