When I started building my MCA capstone project, I had one non-negotiable requirement: user data must never leave the machine. That ruled out OpenAI, Anthropic, and every cloud LLM API. What followed was one of the most educational deep-dives of my dev career.
Here's a full breakdown of how I built an offline, privacy-first RAG (Retrieval-Augmented Generation) chatbot that lets users query their PDF documents using a locally-running LLM.
The Architecture
PDF Files → Ingestion Pipeline → ChromaDB (vector store) → Retriever → LLaMA 3.2 (via Ollama) → Answer
The five core components are:
- PDF Parser — extracts raw text from uploaded PDFs
- Text Chunker — splits text into overlapping chunks for better retrieval
- Embedding Model — converts chunks into vector embeddings
- ChromaDB — stores and indexes the embeddings for similarity search
- LLaMA 3.2 via Ollama — generates answers from retrieved context
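Before diving into the real implementation, here's a minimal sketch of how those five components compose, with every function name hypothetical (the actual project uses LangChain, ChromaDB, and Ollama for each of these roles):

```python
def build_index(pdf_paths, parse_pdf, chunk, embed):
    """Ingestion: PDF -> chunks -> embeddings -> (chunk, vector) index."""
    index = []
    for path in pdf_paths:
        text = parse_pdf(path)          # PDF Parser
        for c in chunk(text):           # Text Chunker
            index.append((c, embed(c))) # Embedding Model -> vector store
    return index

def answer(question, index, embed, retrieve, generate):
    """Query time: embed the question, retrieve top chunks, generate an answer."""
    context = retrieve(embed(question), index)  # Retriever over the vector store
    return generate(question, context)          # LLM answers from retrieved context
```

The key property this makes visible: the expensive work (parsing, chunking, embedding) happens once at ingestion, while each query only pays for one embedding, one similarity search, and one generation.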
Step 1: PDF Ingestion with LangChain
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def ingest_pdf(file_path: str):
    loader = PyPDFLoader(file_path)
    documents = loader.load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50,
    )
    chunks = splitter.split_documents(documents)
    return chunks
```
The RecursiveCharacterTextSplitter is my preferred splitter because it tries to keep paragraphs and sentences intact before breaking them up — this preserves context much better than a naive character split.
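To see what `chunk_overlap` actually buys you, here's a deliberately naive fixed-size chunker (a hypothetical `chunk_text`, not LangChain's API, which additionally respects paragraph and sentence boundaries):

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Naive fixed-size chunking: each chunk repeats the last
    `chunk_overlap` characters of the previous one, so a sentence
    straddling a boundary still appears whole in at least one chunk."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

Consecutive chunks share a 50-character window; without that, a fact split across a boundary could be unretrievable because neither half matches the query well on its own.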
Step 2: Storing Embeddings in ChromaDB
```python
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
```
I chose nomic-embed-text as the embedding model because it runs locally via Ollama and produces high-quality embeddings for semantic search — comparable to OpenAI's text-embedding-ada-002 in most benchmarks.
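Under the hood, "similarity search" is mostly cosine similarity between the query vector and every stored chunk vector. A pure-Python sketch of what ChromaDB does for you (hypothetical `top_k`, toy 2-dimensional vectors; real embeddings have hundreds of dimensions and real stores use approximate-nearest-neighbor indexes instead of a full sort):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product normalized by vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=4):
    """index: list of (chunk, vector). Return the k chunks closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```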
Step 3: Multi-Query Retriever
One of the most impactful improvements I made was switching from a simple similarity retriever to a multi-query retriever. Instead of searching for your exact question, it generates multiple paraphrased versions of the query and retrieves results for all of them.
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.llms import Ollama

llm = Ollama(model="llama3.2")

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
    llm=llm,
)
```
This single change improved answer quality noticeably — especially for vague or complex questions.
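The core idea is simple enough to sketch in a few lines. Here `paraphrase` stands in for the LLM call that `MultiQueryRetriever` makes internally, and `retrieve` for the underlying vector search; both names are hypothetical:

```python
def multi_query_retrieve(question, paraphrase, retrieve, k=4):
    """Ask the LLM for paraphrases, retrieve for each variant,
    and return the deduplicated union of results in first-seen order."""
    queries = [question] + paraphrase(question)
    seen, results = set(), []
    for q in queries:
        for chunk in retrieve(q, k):
            if chunk not in seen:
                seen.add(chunk)
                results.append(chunk)
    return results
```

Because each paraphrase lands in a slightly different spot in embedding space, the union covers chunks that the original wording alone would have missed, which is exactly why recall improves on vague questions.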
Step 4: Running LLaMA 3.2 Locally via Ollama
Getting Ollama set up is surprisingly straightforward:
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull LLaMA 3.2
ollama pull llama3.2

# Pull the embedding model
ollama pull nomic-embed-text
```
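Once the model is pulled, the last step is "stuffing" the retrieved chunks into the prompt the model sees. A minimal sketch of that prompt assembly (hypothetical `build_prompt`; chains like LangChain's RetrievalQA do a version of this for you):

```python
def build_prompt(question, chunks):
    """Stuff retrieved chunks into a grounded prompt for the local model.
    Numbering the chunks makes it easy to spot which one an answer came from."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The "say so if it's not in the context" instruction is doing real work here: it's the main lever for keeping a small local model from hallucinating answers that aren't in the user's documents.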
Key Takeaways
- Local LLMs are production-ready for specific use cases. LLaMA 3.2 handled document Q&A remarkably well.
- Chunking strategy matters more than you think. Spend time tuning chunk_size and chunk_overlap for your document types.
- Multi-query retrieval is a simple upgrade that significantly improves recall.
- ChromaDB's persistence means you only embed once — subsequent queries are fast.
The full project is on my GitHub. Feel free to open issues or questions!