A comprehensive Retrieval-Augmented Generation (RAG) system that integrates multiple data sources for enhanced contextual AI responses.
- Multi-Source Integration: Connect to emails, Odoo, Mattermost, SQL databases, and various document formats (PDFs, Markdown, TXT, JSON)
- Vector Database: Qdrant for efficient vector similarity search
- Graph-Based Retrieval: Advanced retrieval with relationship tracking
- Hybrid Search: Combines vector similarity, keyword matching, and semantic understanding
- Advanced Document Processing: Multi-stage pipeline with quality assessment, OCR, and intelligent chunking
- Multiple Embedding Options: Support for various embedding providers (SentenceTransformers, OpenAI, Cohere, HuggingFace)
- Re-ranking: Cross-encoder based re-ranking for higher quality results
- Source Attribution: Automatic citations and source tracking
- Multi-Query Expansion: Generates variations of queries for better recall
- Document Quality Analysis: Automatic assessment and enhancement of content quality
- OCR Integration: Extract text from images and scanned PDFs
- Docker Support: Full containerization for easy deployment
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Data Sources   │────▶│  RAG Pipeline   │────▶│    LLM Model    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ - Emails        │     │ - Indexing      │     │ - Generation    │
│ - Odoo ERP      │     │ - OCR           │     │ - Citations     │
│ - Mattermost    │     │ - Quality Check │     │ - Formatting    │
│ - Databases     │     │ - Chunking      │     │ - Streaming     │
│ - Documents     │     │ - Embedding     │     └─────────────────┘
└─────────────────┘     │ - Retrieval     │
                        │ - Re-ranking    │
                        └─────────────────┘
Each document passes through the following processing stages, in order:
- Document Validation: Initial check of file format and content
- Preprocessing: Prepare document for extraction
- Content Extraction: Extract text from various formats
- OCR Processing: Apply OCR to images and scanned documents
- Quality Assessment: Evaluate document quality and enhance if needed
- Chunking: Intelligently split document into manageable pieces
- Embedding: Generate vector representations for each chunk
- Indexing: Store in the vector database for retrieval
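The stages above can be read as a chain of functions applied in order. The following is a minimal sketch under that assumption, not the project's actual implementation: the `Document` class, the stage functions, and `run_pipeline` are hypothetical names, and only three of the stages are shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """Hypothetical container passed from one stage to the next."""
    path: str
    text: str = ""
    chunks: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def validate(doc: Document) -> Document:
    # Reject file formats this sketch does not know how to read.
    if not doc.path.lower().endswith((".txt", ".md", ".json", ".pdf")):
        raise ValueError(f"unsupported format: {doc.path}")
    doc.log.append("validate")
    return doc


def extract(doc: Document) -> Document:
    # Placeholder: a real extractor would read and parse the file here.
    doc.text = doc.text.strip()
    doc.log.append("extract")
    return doc


def chunk(doc: Document) -> Document:
    # Split on blank lines, the simplest possible chunking strategy.
    doc.chunks = [p.strip() for p in doc.text.split("\n\n") if p.strip()]
    doc.log.append("chunk")
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    # Apply each stage in order; any stage may raise to abort processing.
    for stage in stages:
        doc = stage(doc)
    return doc
```

Keeping each stage as a plain function makes it easy to skip or reorder stages per document type, for example running OCR only for images and scanned PDFs.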
- Docker and Docker Compose
- API keys for LLM providers (OpenAI, Anthropic, or Cohere)
- Credentials for data sources you want to connect
- Clone the repository:

      git clone https://github.com/yourusername/complex_rag.git
      cd complex_rag

- Create a .env file from the example:

      cp .env.example .env

- Edit the .env file to add your API keys and configuration.

- Start the system using Docker Compose:

      docker-compose up -d

- Access the UI at http://localhost:8501
The system can be configured through the .env file or by modifying config/settings.py. Key configuration options include:
- Vector Database: Connection details for Qdrant
- SQL Database: Connection details for PostgreSQL
- Data Sources: Credentials for emails, Odoo, Mattermost
- Embedding Model: Provider, model name and parameters
- Document Processing: OCR, quality thresholds, chunking strategies
- LLM Settings: Provider, model, and parameters
- Retrieval Settings: Chunk limits, thresholds, reranking
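As an illustration of how options like these might be read from the environment, here is a small sketch. The variable names (`QDRANT_URL`, `EMBEDDING_PROVIDER`, `CHUNK_SIZE`, `RERANK_ENABLED`), the defaults, and the `Settings` class are assumptions made for this example; check .env.example and config/settings.py for the names the project actually uses.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    qdrant_url: str
    embedding_provider: str
    chunk_size: int
    rerank_enabled: bool


def load_settings(env=os.environ) -> Settings:
    # Variable names and defaults are illustrative assumptions, not the project's.
    return Settings(
        qdrant_url=env.get("QDRANT_URL", "http://localhost:6333"),
        embedding_provider=env.get("EMBEDDING_PROVIDER", "sentence-transformers"),
        chunk_size=int(env.get("CHUNK_SIZE", "512")),
        # Accept a few common spellings of "true".
        rerank_enabled=env.get("RERANK_ENABLED", "true").lower() in ("1", "true", "yes"),
    )
```

Passing the mapping as a parameter lets tests supply a plain dict instead of modifying the real environment.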
Connect to email servers over IMAP, the Gmail API, or Exchange.
Retrieve records from Odoo modules including CRM, Sales, Projects, etc.
Connect to Mattermost teams and channels for retrieving communication history.
Query SQL databases (PostgreSQL, MySQL, SQLite) for structured data.
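To show what turning structured rows into retrievable text could look like, the sketch below uses SQLite from the Python standard library. The `rows_to_documents` helper, the `customers` table, and the output shape are hypothetical; the project's own SQL connector may work differently.

```python
import sqlite3
from typing import Dict, List


def rows_to_documents(conn: sqlite3.Connection, query: str) -> List[Dict[str, object]]:
    """Run a query and turn each result row into a text document with metadata."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    docs = []
    for row in conn.execute(query):
        fields = dict(row)
        # Flatten the row into "column: value" lines so it can be chunked and embedded.
        text = "\n".join(f"{key}: {value}" for key, value in fields.items())
        docs.append({"text": text, "metadata": {"source": "sql", "columns": list(fields)}})
    return docs


# Usage with an in-memory database and a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
docs = rows_to_documents(conn, "SELECT id, name FROM customers")
```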
Process various document formats including:
- PDFs
- Markdown files
- Text files
- JSON data
- Images (via OCR)
- Office documents
- And more...
Components that connect to and retrieve data from various sources.
Handle document parsing, OCR, quality assessment, chunking, and metadata extraction.
Manage vector embeddings and storage in the vector database.
Implement retrieval strategies including hybrid search and re-ranking.
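One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). This README does not say which fusion method the project uses, so treat the sketch below as an example of the general idea rather than a description of the actual retrieval code. The function name and the constant `k = 60` are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # e.g. from vector similarity search
keyword_hits = ["b", "d", "a"]  # e.g. from keyword matching
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["b", "a", "d", "c"]
```

A re-ranking step would then take the top of the fused list and reorder it with a more expensive model, such as a cross-encoder.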
Interface with language models for generating answers and embedding models for vectorization.
FastAPI-based REST API for interacting with the system.
Streamlit-based user interface for searching and administration.
The system automatically analyzes document quality, including:
- Content quality scoring
- Noise detection and filtering
- Structure analysis
- Language detection
- Content deduplication
- Entity identification
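Two of the checks above, content quality scoring and deduplication, can be approximated with very simple heuristics. The sketch below is only an illustration: the scoring rule (share of alphanumeric and whitespace characters) and the function names are assumptions, and the project's real analysis is likely more involved.

```python
import hashlib
import re
from typing import List


def quality_score(text: str) -> float:
    """Share of characters that are letters, digits, or whitespace (0.0 for empty text)."""
    if not text:
        return 0.0
    clean = sum(ch.isalnum() or ch.isspace() for ch in text)
    return clean / len(text)


def deduplicate(texts: List[str]) -> List[str]:
    """Drop texts that are identical after whitespace and case normalization."""
    seen = set()
    unique = []
    for text in texts:
        normalized = re.sub(r"\s+", " ", text).strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)  # keep the first occurrence, unmodified
    return unique
```

A low score can flag documents dominated by symbols or extraction debris, which could then be filtered or routed to a cleanup step.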
Extract text from images and scanned documents with:
- PDF text extraction with OCR fallback
- Image text extraction
- Table detection and processing
- Layout preservation
- Multi-language support
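The "PDF text extraction with OCR fallback" item can be sketched as a small decision: try the embedded text layer first and run OCR only when it yields too little text. In the example below, `extract_text` and `run_ocr` are placeholder callables standing in for whatever PDF and OCR libraries the project uses, and the `min_chars` threshold is an assumed value.

```python
from typing import Callable


def extract_with_ocr_fallback(
    path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 20,
) -> str:
    """Return the embedded text if there is enough of it, otherwise the OCR result."""
    text = extract_text(path).strip()
    if len(text) >= min_chars:
        return text
    # Scanned PDFs often have an empty or near-empty text layer.
    return run_ocr(path).strip()
```

Injecting the two callables keeps the fallback logic testable without installing an OCR engine.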
Support for various embedding providers:
- SentenceTransformers (local)
- OpenAI
- Cohere
- HuggingFace
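Supporting several providers usually means hiding them behind one common interface and selecting one by name. The sketch below shows that pattern with a registry and a toy, dependency-free embedder. Everything here is hypothetical: the `register` and `get_embedder` helpers, the `toy-hash` provider, and the 8-dimensional output are made up for the example and produce no semantically meaningful vectors.

```python
import hashlib
import math
from typing import Callable, Dict, List

Embedder = Callable[[List[str]], List[List[float]]]
_PROVIDERS: Dict[str, Embedder] = {}


def register(name: str) -> Callable[[Embedder], Embedder]:
    """Decorator that adds an embedder to the registry under the given name."""
    def decorator(fn: Embedder) -> Embedder:
        _PROVIDERS[name] = fn
        return fn
    return decorator


@register("toy-hash")
def toy_hash_embed(texts: List[str], dim: int = 8) -> List[List[float]]:
    # Toy stand-in for a real model: count tokens per hash bucket, then L2-normalize.
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for token in text.lower().split():
            bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def get_embedder(name: str) -> Embedder:
    """Look up a registered embedder, failing loudly on unknown names."""
    try:
        return _PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown embedding provider: {name}") from None
```

A real provider would be registered the same way, wrapping the relevant client library inside the decorated function.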
Different chunking strategies for different document types:
- Fixed size
- Sentence-based
- Paragraph-based
- Semantic-based
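The first three strategies can be illustrated with the standard library alone. The functions below are a minimal sketch with assumed names and parameters, not the project's chunkers. Semantic-based chunking is left out because it needs an embedding model to decide where topics shift.

```python
import re
from typing import List


def chunk_fixed(text: str, size: int, overlap: int = 0) -> List[str]:
    """Fixed-size character windows; consecutive chunks share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def chunk_sentences(text: str) -> List[str]:
    """Naive sentence split: break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_paragraphs(text: str) -> List[str]:
    """Split on blank lines, including lines that contain only whitespace."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```

For example, `chunk_fixed("abcdefghij", 4, overlap=1)` returns `["abcd", "defg", "ghij", "j"]`. The sentence splitter is deliberately simple and will mis-split abbreviations such as "e.g."; a production system would likely use a proper sentence tokenizer.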
The system uses Docker Compose with the following services:
- rag-api: Main RAG application
- qdrant: Vector database
- postgres: PostgreSQL relational database
- rag-ui: Streamlit UI
- minio: Object storage for documents
- document-processor: Asynchronous document processing service
- monitoring: Prometheus and Grafana for system monitoring
This project is licensed under the MIT License - see the LICENSE file for details.