Building a Privacy-First RAG Chatbot with LangChain and LLaMA 3

January 15, 2025

When I started building my MCA capstone project, I had one non-negotiable requirement: user data must never leave the machine. That ruled out OpenAI, Anthropic, and every cloud LLM API. What followed was one of the most educational deep-dives of my dev career.

Here's a full breakdown of how I built an offline, privacy-first RAG (Retrieval-Augmented Generation) chatbot that lets users query their PDF documents using a locally-running LLM.

The Architecture

PDF Files → Ingestion Pipeline → ChromaDB (vector store) → Retriever → LLaMA 3.2 (via Ollama) → Answer

The five core components are:

  1. PDF Parser — extracts raw text from uploaded PDFs
  2. Text Chunker — splits text into overlapping chunks for better retrieval
  3. Embedding Model — converts chunks into vector embeddings
  4. ChromaDB — stores and indexes the embeddings for similarity search
  5. LLaMA 3.2 via Ollama — generates answers from retrieved context

Step 1: PDF Ingestion with LangChain

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def ingest_pdf(file_path: str):
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
    )
    chunks = splitter.split_documents(documents)
    return chunks
```

The RecursiveCharacterTextSplitter is my preferred splitter because it tries to keep paragraphs and sentences intact before breaking them up — this preserves context much better than a naive character split.
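To make that "recursive" behavior concrete, here is a toy pure-Python sketch of the idea: try coarse separators first (paragraph breaks), then finer ones, falling back to a hard character split only as a last resort. The `recursive_split` helper below is my own simplified illustration, not LangChain's actual implementation (which also handles chunk overlap).

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ")) -> list[str]:
    """Toy sketch of recursive character splitting (no overlap handling)."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    # keep accumulating parts into the current chunk
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) <= chunk_size:
                        current = part
                    else:
                        # a single part is still too large: recurse with finer separators
                        chunks.extend(recursive_split(part, chunk_size, separators))
                        current = ""
            if current:
                chunks.append(current)
            return chunks
    # no separator present at all: hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = ("First paragraph about RAG.\n\n"
        "Second paragraph about chunking. It has two sentences.")
chunks = recursive_split(text, chunk_size=60)
```

Because the paragraph break is tried first, the two paragraphs land in separate chunks instead of being cut mid-sentence — that is the context preservation mentioned above.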

Step 2: Storing Embeddings in ChromaDB

```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=chunks,  # chunks produced by ingest_pdf() in Step 1
    embedding=embeddings,
    persist_directory="./chroma_db",
)
```

I chose nomic-embed-text as the embedding model because it runs locally via Ollama and produces high-quality embeddings for semantic search — comparable to OpenAI's text-embedding-ada-002 in most benchmarks.
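Under the hood, "semantic search" just means ranking stored chunk vectors by cosine similarity to the query vector. Here is a minimal sketch with hand-picked 3-dimensional toy vectors standing in for real nomic-embed-text output (which is 768-dimensional); the `store` and `query_vec` values are invented for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# toy "vector store": chunk text -> embedding
store = {
    "chunk about invoices": [0.9, 0.1, 0.0],
    "chunk about payroll":  [0.1, 0.9, 0.0],
    "chunk about travel":   [0.0, 0.2, 0.9],
}
# pretend embedding of "What do I owe on this invoice?"
query_vec = [0.85, 0.15, 0.05]

# rank chunks by similarity to the query and keep the top k
ranked = sorted(store, key=lambda doc: cosine(store[doc], query_vec), reverse=True)
top_k = ranked[:2]
```

ChromaDB does exactly this ranking (with an approximate index so it scales), which is why embedding quality directly determines retrieval quality.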

Step 3: Multi-Query Retriever

One of the most impactful improvements I made was switching from a simple similarity retriever to a multi-query retriever. Instead of searching for your exact question, it generates multiple paraphrased versions of the query and retrieves results for all of them.

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.llms import Ollama

llm = Ollama(model="llama3.2")
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)
```

This single change improved answer quality noticeably — especially for vague or complex questions.
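The recall boost comes from the merge step: retrieve for each paraphrase, then take the deduplicated union of the results. The sketch below illustrates just that merge logic; `retrieve` and its `fake_index` are hypothetical stand-ins for `vectorstore.as_retriever()` (in the real pipeline, the LLM generates the paraphrases too).

```python
def retrieve(query: str) -> list[str]:
    """Toy retriever: each phrasing of the question hits a slightly different set of chunks."""
    fake_index = {
        "What is the refund policy?": ["chunk-2", "chunk-5"],
        "How do I get my money back?": ["chunk-5", "chunk-7"],
        "Can purchases be returned?": ["chunk-1", "chunk-2"],
    }
    return fake_index.get(query, [])

def multi_query_retrieve(paraphrases: list[str]) -> list[str]:
    """Union of per-paraphrase results, deduplicated, order preserved."""
    seen, merged = set(), []
    for q in paraphrases:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

docs = multi_query_retrieve([
    "What is the refund policy?",
    "How do I get my money back?",
    "Can purchases be returned?",
])
```

No single phrasing surfaces all four chunks, but the union does — which is exactly why vague questions benefit most.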

Step 4: Running LLaMA 3.2 Locally via Ollama

Getting Ollama set up is surprisingly straightforward:

```shell
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull LLaMA 3.2
ollama pull llama3.2

# Pull the embedding model
ollama pull nomic-embed-text
```
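With the model served, the final glue step is prompt assembly: stuff the retrieved chunks into a grounded prompt before sending it to LLaMA 3.2. The `build_rag_prompt` helper and its template wording below are my own sketch, not a LangChain built-in.

```python
def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded RAG prompt from retrieved chunks."""
    # number the chunks so the model (and the user) can trace answers back
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What is the notice period?",
    ["Employees must give 30 days notice.", "Notice must be in writing."],
)
```

The "ONLY the context" instruction is what keeps a local model from hallucinating answers that aren't in the user's documents.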

Key Takeaways

  • Local LLMs are production-ready for specific use cases. LLaMA 3.2 handled document Q&A remarkably well.
  • Chunking strategy matters more than you think. Spend time tuning chunk_size and chunk_overlap for your document types.
  • Multi-query retrieval is a simple upgrade that significantly improves recall.
  • ChromaDB's persistence means you only embed once — subsequent queries are fast.

The full project is on my GitHub. Feel free to open issues or ask questions!
