RAG Prototype: LangChain + LangGraph + Chroma

This project provides a minimal Retrieval-Augmented Generation (RAG) search system using:

  • LangChain for document loading, splitting, embeddings, and retrieval
  • Chroma as a local vector database for similarity search
  • LangGraph-inspired orchestration (simple 2-node pipeline: retrieve -> synthesize)
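
To make the shape of that pipeline concrete, here is a minimal, dependency-free sketch of the retrieve -> synthesize flow. The names are illustrative assumptions; the actual orchestration lives in src/rag_system/graph.py.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class State:
        question: str
        documents: List[str] = field(default_factory=list)
        answer: str = ""

    def retrieve(state: State, retriever: Callable[[str], List[str]]) -> State:
        # Node 1: fetch the chunks most similar to the question from the vector store.
        state.documents = retriever(state.question)
        return state

    def synthesize(state: State, llm: Callable[[str], str]) -> State:
        # Node 2: ask the LLM to answer using only the retrieved context.
        context = "\n\n".join(state.documents)
        state.answer = llm(f"Answer using only this context:\n{context}\n\nQuestion: {state.question}")
        return state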

Features

  • Local-first: uses HuggingFace sentence-transformers by default; no API key required
  • Local LLM via Ollama by default (pull a model like llama3.1:8b). If Ollama is not available, it can fall back to OpenAI (if OPENAI_API_KEY is set) or to an extractive response.
  • Persistent vector store using Chroma
  • Simple CLI for ingestion and queries

Prerequisites

  • Python 3.10+

Setup

  1. Create and activate a virtual environment (optional):

     python -m venv .venv
     source .venv/bin/activate   # Windows: .venv\Scripts\activate

  2. Install dependencies: pip install -r requirements.txt

  3. Add some .txt files into data/ (some sample content is already in data/sample/).

Environment variables (optional)

  • EMBED_MODEL: HuggingFace embeddings model (default: sentence-transformers/all-MiniLM-L6-v2)
  • LLM_PROVIDER: choose 'ollama' (default) or 'openai'
  • OLLAMA_MODEL: Ollama model to use (default: llama3.1:8b)
  • OLLAMA_BASE_URL (or OLLAMA_HOST): e.g., http://localhost:11434
  • OPENAI_API_KEY: if set and provider=openai, the system will use OpenAI Chat Completions
  • OPENAI_MODEL: default gpt-4o-mini
  • CHROMA_URL: if set, the app connects to a Chroma Server (e.g., http://localhost:8000) instead of local .chroma
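
For reference, these variables map naturally onto a small settings block. The snippet below is only a sketch of how they could be read with the documented defaults, not the project's actual configuration code.

    import os

    EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
    LLM_PROVIDER = os.getenv("LLM_PROVIDER", "ollama")      # 'ollama' or 'openai'
    OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "llama3.1:8b")
    OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", os.getenv("OLLAMA_HOST", "http://localhost:11434"))
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")            # only needed when LLM_PROVIDER=openai
    OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
    CHROMA_URL = os.getenv("CHROMA_URL")                    # None -> local .chroma directory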

Usage

  0. Extract OCR from XML into .txt (if your data is XML with OCR):

Example: convert all XML files under data_sample/ into plain text files under data/

python -m src.rag_system.cli extract_ocr --input data_sample --output data --glob "**/*.xml"

If you know the exact XPath to the OCR node(s), provide it to improve accuracy (see the sketch at the end of this Usage section), e.g.:

python -m src.rag_system.cli extract_ocr --input data_sample --output data --xpath ".//OCR"

  1. Ingest documents into Chroma (local embedded): python -m src.rag_system.cli ingest --source data

    Or, with Chroma Server in Docker (recommended for shared access):

    Start server

    docker compose up -d chroma

    Point the app to the server and use a collection (default: corpus)

    export CHROMA_URL=http://localhost:8000
    python -m src.rag_system.cli ingest --source data --chroma_url $CHROMA_URL --collection corpus

  2. Run a query (Ollama by default):

    Make sure Ollama is installed and running: https://ollama.com/

    Pull a model, e.g.: ollama pull llama3.1:8b

    Local Chroma

    python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b "What is this repository about?"

    Chroma Server

    python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b --chroma_url $CHROMA_URL "What is this repository about?"

    Use OpenAI instead:

    export OPENAI_API_KEY=sk-...
    python -m src.rag_system.cli query --provider openai --model gpt-4o-mini --chroma_url $CHROMA_URL "What is this repository about?"
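
For reference, the XPath-based OCR extraction in step 0 boils down to logic like the following. This is only a sketch using the standard library; the function names and parser are assumptions, not the actual extract_ocr implementation in src/rag_system/cli.py.

    import pathlib
    import xml.etree.ElementTree as ET

    def extract_ocr_text(xml_path: str, xpath: str = ".//OCR") -> str:
        """Collect the text of every node matched by `xpath` (illustrative sketch)."""
        root = ET.parse(xml_path).getroot()
        parts = ["".join(node.itertext()).strip() for node in root.findall(xpath)]
        return "\n\n".join(p for p in parts if p)

    # Roughly mirrors: extract_ocr --input data_sample --output data --xpath ".//OCR"
    for xml_file in pathlib.Path("data_sample").glob("**/*.xml"):
        out_file = pathlib.Path("data") / (xml_file.stem + ".txt")
        out_file.parent.mkdir(parents=True, exist_ok=True)
        out_file.write_text(extract_ocr_text(str(xml_file)), encoding="utf-8")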

Advanced options

  • Choose a different embedding model: python -m src.rag_system.cli ingest --embed_model sentence-transformers/all-mpnet-base-v2

  • Configure top-k and model for queries:

    Ollama

    python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b --k 8 "Explain the stack used here"

    OpenAI

    python -m src.rag_system.cli query --provider openai --model gpt-4o-mini --k 8 "Explain the stack used here"

Project structure

  • src/rag_system/ingest.py -> Ingestion pipeline (load, split, embed, index)
  • src/rag_system/graph.py -> Retrieval + synthesis pipeline
  • src/rag_system/cli.py -> Command-line interface
  • docker-compose.yml -> Chroma Server (Docker) for remote vector DB
  • data/ -> Put your .txt files here

Notes

  • If no LLM is reachable (Ollama is not running and OPENAI_API_KEY is not set), answers come from a simple extractive fallback (a concatenation of the top retrieved documents), so everything stays offline.
  • If you set LLM_PROVIDER=openai and OPENAI_API_KEY, the system uses OpenAI's chat model configured via OPENAI_MODEL.
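
The extractive fallback can be pictured as little more than stitching the top retrieved chunks together. The function below is an assumption about its shape, not the code actually used in src/rag_system/graph.py.

    def extractive_answer(docs: list[str], max_chars: int = 2000) -> str:
        # No LLM available: return the top retrieved chunks verbatim, truncated.
        return "\n\n---\n\n".join(docs)[:max_chars]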

Troubleshooting: “Numpy is not available” on macOS

If you see an error like RuntimeError: Numpy is not available when running ingest or query, install NumPy explicitly before other packages and ensure pip/setuptools are recent:

  1. Upgrade build tooling: python -m pip install --upgrade pip setuptools wheel

  2. Install NumPy first (compatible range): python -m pip install "numpy>=1.26,<2.1"

  3. Install the project requirements: python -m pip install -r requirements.txt

Notes:

  • On Apple Silicon (M1/M2/M3), use Python 3.10+ from python.org or pyenv and a recent pip (>=23).
  • If you still hit issues, try recreating the venv and installing NumPy first, then requirements.

How RecursiveCharacterTextSplitter works

We use LangChain’s RecursiveCharacterTextSplitter during ingestion to break large documents into smaller, partially-overlapping chunks before embedding and indexing in Chroma.

Where it’s used here

  • File: src/rag_system/ingest.py
  • Code: splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, add_start_index=True)
  • You control chunk_size and chunk_overlap via CLI: --chunk_size and --chunk_overlap

What it does (high level)

  • It tries to split text using a prioritized list of separators to preserve natural boundaries: by default ["\n\n", "\n", " ", ""].
  • It recursively picks the first separator that actually appears in your text. If a segment is still too long (its size, as measured by the length function, exceeds chunk_size), it retries on that segment with the next, “finer” separator.
  • It merges segments back into chunks no larger than chunk_size, with chunk_overlap characters of overlap between consecutive chunks to improve retrieval recall.
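
A quick way to see this behaviour is to run the splitter on a small piece of text. The snippet assumes the langchain_text_splitters package (the import path can differ slightly across LangChain versions).

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text = "Paragraph one is about ingestion.\n\n" + "Paragraph two is about retrieval and synthesis. " * 20

    splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=40)
    chunks = splitter.split_text(text)

    for i, chunk in enumerate(chunks):
        print(i, len(chunk), repr(chunk[:60]))

In this example each chunk stays at or below 200 characters, and splits land on paragraph or word boundaries where possible.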

Key parameters you’ll care about

  • chunk_size: Target maximum size (in characters by default) of each chunk.
  • chunk_overlap: Number of characters to overlap between adjacent chunks, preserving context across boundaries.
  • separators: Optional custom list of separators to try in order (e.g., section headings, paragraphs, sentences, words, characters). Defaults to ["\n\n", "\n", " ", ""].
  • keep_separator: Whether to keep the separator in chunks; can be True, False, "start", or "end". Defaults to True.
  • is_separator_regex: Treat separators as regex patterns (advanced). Defaults to False.
  • add_start_index: When True, each output Document receives metadata["start_index"] with its starting character offset relative to the original text. We set this to True so you can trace chunks back to the source.
  • length_function: Function used to measure length (defaults to Python len on characters). You can customize (e.g., token counting) if you subclass or construct differently.
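
For example, the effect of add_start_index is easy to see with create_documents, which returns Document objects whose metadata carries each chunk's character offset (again assuming the langchain_text_splitters import path):

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=100,
        chunk_overlap=20,
        add_start_index=True,   # record each chunk's offset into the source text
    )

    docs = splitter.create_documents(["One short paragraph.\n\nA second, longer paragraph. " * 5])
    for d in docs:
        print(d.metadata["start_index"], repr(d.page_content[:40]))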

How the algorithm works (step-by-step)

  1. Choose a separator:
    • Scan the separators list in order and use the first one that occurs in the text. If none match, use the last entry (often the empty string), which falls back to character-level splitting.
  2. Split the text by that separator.
  3. For each resulting piece:
    • If piece length < chunk_size: add it to a temporary list of “good” splits.
    • If piece length >= chunk_size:
      • First, merge current “good” splits into final chunks (respecting chunk_size and chunk_overlap).
      • Then recursively call the same procedure on the long piece, but now with the remaining, finer separators.
  4. After processing all pieces, merge any remaining “good” splits into the final chunk list.
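
The recursion can be illustrated with a deliberately simplified, dependency-free sketch. It skips overlap handling and separator retention, so it is not the real implementation, only the control flow described in the steps above.

    def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
        # Simplified illustration of the recursion (no overlap, separators not kept verbatim).
        if len(text) <= chunk_size:
            return [text]
        # 1. Pick the first separator that occurs in the text; fall back to character level.
        sep = next((s for s in separators if s and s in text), "")
        finer = separators[separators.index(sep) + 1:] if sep in separators else []
        # 2. Split the text by that separator ("" means split into characters).
        pieces = text.split(sep) if sep else list(text)
        chunks, current = [], ""
        for piece in pieces:
            if len(current) + len(piece) <= chunk_size:
                current += piece + sep                      # 3a. accumulate "good" splits
            else:
                if current:
                    chunks.append(current)                  # 3b. flush accumulated splits
                if len(piece) > chunk_size:
                    chunks.extend(recursive_split(piece, finer, chunk_size))  # recurse with finer separators
                    current = ""
                else:
                    current = piece + sep
        if current:
            chunks.append(current)                          # 4. merge the remaining splits
        return chunks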

Merging and overlap

  • The splitter uses a sliding window over the accumulated splits to emit chunks whose size does not exceed chunk_size.
  • It ensures consecutive chunks overlap by chunk_overlap characters, which helps retrieval models maintain context when a relevant sentence lies near a boundary.
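
You can check the overlap directly by comparing the tail of one chunk with the head of the next (same langchain_text_splitters assumption as above):

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=30)
    chunks = splitter.split_text("word " * 200)

    for a, b in zip(chunks, chunks[1:]):
        # Consecutive chunks share roughly chunk_overlap characters, aligned to word boundaries.
        print(repr(a[-30:]), "->", repr(b[:30]))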

Why it’s good for RAG

  • Preserves semantic boundaries when possible (paragraphs, then lines, then words) while guaranteeing chunks are not too large for embedding/token limits.
  • Overlap improves recall and robustness to query variations.

Practical tuning advice

  • Start with chunk_size=800 and chunk_overlap=120 (our CLI defaults). Increase chunk_size for long-form technical docs; decrease for short notes.
  • If your documents have strong structure (e.g., Markdown, headings), consider providing custom separators, e.g.:
    • separators=["\n\n# ", "\n\n", "\n", " ", ""] with is_separator_regex=False
  • If you have very long tokens/words (e.g., base64 or code blobs) and chunks exceed the limit, the recursion eventually falls back to character-level splitting via the final "" separator.
  • keep_separator="end" can help keep phrase punctuation near the chunk end; "start" can help the next chunk’s beginning be self-contained.
  • For token-aware sizing (e.g., tiktoken), consider a custom length_function or LangChain’s token-based splitters for more precise control.
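
For token-aware sizing, LangChain's splitters expose a from_tiktoken_encoder constructor. The sketch below assumes the tiktoken package is installed and that cl100k_base is an acceptable proxy for your model's tokenizer.

    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # chunk_size and chunk_overlap are now measured in tokens rather than characters.
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",
        chunk_size=400,
        chunk_overlap=50,
    )
    chunks = splitter.split_text("Replace this with your document text. " * 300)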

Examples with this project

  • Default ingestion: python -m src.rag_system.cli ingest --source data --chunk_size 800 --chunk_overlap 120

  • Larger chunks for long technical reports: python -m src.rag_system.cli ingest --source data --chunk_size 1200 --chunk_overlap 150

  • Smaller, tighter chunks for noisy OCR text: python -m src.rag_system.cli ingest --source data --chunk_size 500 --chunk_overlap 100

Customizing separators (code snippet)

  • If you want custom separators, edit src/rag_system/ingest.py and pass separators in the splitter construction, for example:

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
        separators=["\n\n## ", "\n\n", "\n", " ", ""],  # try headings, then paragraphs, lines, words, chars
        keep_separator=True,
    )

Edge cases

  • Documents with no newlines: the splitter quickly falls back to splitting on spaces or characters.
  • Separator is regex: set is_separator_regex=True and pass patterns (ensure they appear in text, or recursion goes finer).
  • Extremely long single “words”: recursion will end up splitting at character level to respect chunk_size.

Logging during ingestion

You can enable runtime logs for the ingestion process to monitor progress and performance.

  • Set the log level via CLI: python -m src.rag_system.cli ingest --source data --log_level DEBUG

  • Or via environment variable (default is INFO if not provided):

    export LOG_LEVEL=DEBUG
    python -m src.rag_system.cli ingest --source data

The logs include:

  • Start parameters (source, glob, chunking, collection, target Chroma location)
  • Number of documents loaded
  • Chunking stats (number of chunks and average length)
  • Embedding model used
  • Where data is being written (local Chroma directory or Chroma Server)
  • Total ingestion time
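
A plausible shape for this wiring (an assumption, not necessarily the exact code in cli.py) is to let the --log_level flag override the LOG_LEVEL environment variable, with INFO as the final default:

    import logging
    import os

    def configure_logging(cli_level: str | None = None) -> None:
        # Priority: --log_level flag, then the LOG_LEVEL env var, then INFO.
        level_name = (cli_level or os.getenv("LOG_LEVEL", "INFO")).upper()
        logging.basicConfig(
            level=getattr(logging, level_name, logging.INFO),
            format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        )

    configure_logging("DEBUG")
    logging.getLogger("rag_system.ingest").info("ingestion started")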

Dockerized Ollama server

You can run the Ollama server via Docker using the provided docker-compose.yml.

  • Start Ollama (and optionally Chroma) in the background: docker compose up -d ollama

    or both services

    docker compose up -d ollama chroma

  • Point the app to the Dockerized Ollama server: export OLLAMA_BASE_URL=http://localhost:11434

  • Pull a model inside the container (one-time): docker exec -it ollama ollama pull llama3.1:8b

Then you can query with the CLI (provider=ollama), either against local Chroma or Chroma Server:

  • Local Chroma: python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b "What is this repository about?"
  • Chroma Server:

    export CHROMA_URL=http://localhost:8000
    python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b --chroma_url $CHROMA_URL "What is this repository about?"

How to pull a model from the Ollama server

You have three convenient options to download an Ollama model:

  1. If Ollama is installed on your host (macOS/Linux/WSL):

    • Ensure the Ollama daemon is running: ollama serve (usually started automatically)
    • Pull a model by name: ollama pull llama3.1:8b
  2. If you run Ollama via Docker Compose (this repo’s docker-compose.yml):

    • Start the service: docker compose up -d ollama
    • Pull the model inside the container: docker exec -it ollama ollama pull llama3.1:8b
    • Point the app to the Dockerized server: export OLLAMA_BASE_URL=http://localhost:11434
  3. Using this project’s CLI (talks to the Ollama HTTP API):

    • Host or Docker both work as long as the server is reachable.
    • Example (defaults to http://localhost:11434 if OLLAMA_BASE_URL is not set): python -m src.rag_system.cli ollama_pull --model llama3.1:8b --base_url $OLLAMA_BASE_URL
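
Under the hood, a pull is just a request to the Ollama HTTP API. The snippet below is a hedged sketch using requests against the documented POST /api/pull endpoint; it is not a copy of the CLI's ollama_pull command, and older Ollama versions expect "name" instead of "model" in the request body.

    import json
    import os
    import requests

    base_url = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

    # Stream pull progress from the Ollama server.
    with requests.post(f"{base_url}/api/pull", json={"model": "llama3.1:8b"}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                print(json.loads(line).get("status", ""))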

Notes

  • Model names are listed on https://ollama.com/library (e.g., llama3.1, mistral, codellama). Tags like :8b/:70b choose parameter sizes.
  • The first pull downloads the weights; subsequent pulls are fast.
  • If the server is remote, set OLLAMA_BASE_URL to that host, e.g., http://your-server:11434.
