
Commit e4f8b75

Add full DOCX and enhanced PDF support to RAG system
- Installed mammoth.js for DOCX file processing
- Enhanced PDF extraction with metadata (title, author, page count)
- Added structure-preserving chunking for both PDF and DOCX
- DOCX files preserve headings, lists, and paragraph structure
- Smart chunking maintains document context with overlap
- Added visual processing indicators showing file type icons
- Comprehensive test coverage for new document processing features
- Updated documentation to reflect new capabilities

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent 1e47ce7 commit e4f8b75
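
The commit message mentions chunking "with overlap" but the chunking code itself is not part of this excerpt. As a rough illustration only, overlapping fixed-size chunks can be produced as below. The function name `chunkWithOverlap` and its parameters are hypothetical and are not taken from the commit; the actual implementation may split on structure rather than on character counts.

```typescript
// Hypothetical helper, not from the commit: splits text into fixed-size
// character windows where each window repeats the last `overlap`
// characters of the previous one, so context carries across chunk borders.
function chunkWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) {
    throw new RangeError("chunkSize must be positive");
  }
  if (overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("overlap must be at least 0 and smaller than chunkSize");
  }

  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time

  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text,
    // otherwise the tail would be emitted again as a shorter duplicate.
    if (start + chunkSize >= text.length) {
      break;
    }
  }
  return chunks;
}
```

For example, `chunkWithOverlap("abcdefghij", 4, 2)` returns `["abcd", "cdef", "efgh", "ghij"]`: each chunk starts with the last two characters of the one before it.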

7 files changed: +718 -52 lines changed


README.md

Lines changed: 12 additions & 4 deletions
@@ -105,10 +105,10 @@ tests/
 The app includes a powerful client-side RAG system that enhances AI responses with your uploaded documents.

 ### Supported Document Formats
-- 📄 PDF files
-- 📝 Text files (.txt)
-- 📋 Markdown files (.md)
-- 📊 Word documents (.docx)
+- 📕 **PDF files** - Full text extraction with metadata (title, author, page count)
+- 📘 **Word documents (.docx)** - Preserves document structure (headings, lists, paragraphs)
+- 📝 **Text files (.txt)** - Plain text processing
+- 📋 **Markdown files (.md)** - Markdown content processing

 ### How to Use RAG

@@ -156,6 +156,14 @@ Or use natural language:
 - **Token Badge**: Shows when RAG context is used in responses
 - **Source Citations**: Responses end with "📚 Source: [filename]"
 - **Search Status**: "🔍 Searching through X documents..." appears during search
+- **Processing Status**: Shows file type icons (📕 PDF, 📘 DOCX, 📄 Text) during upload
+
+### Advanced Features
+
+- **Smart Chunking**: Documents are intelligently split preserving structure (headings, paragraphs)
+- **Metadata Extraction**: PDFs extract title, author, page count automatically
+- **Structure Preservation**: DOCX files maintain heading hierarchy and lists
+- **Page Tracking**: PDF chunks remember their source page numbers

 For more detailed RAG usage instructions, see [RAG_USAGE_GUIDE.md](RAG_USAGE_GUIDE.md)

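The "Smart Chunking" and "Page Tracking" entries added to the README describe behavior whose source is not shown in this excerpt. A minimal sketch of the idea, assuming the extracted PDF text is already available as one string per page and that paragraphs are separated by blank lines, might look like this. `PageChunk` and `chunkPages` are hypothetical names, not identifiers from the commit.

```typescript
// Hypothetical shape for a chunk that remembers where it came from.
interface PageChunk {
  text: string;
  page: number; // 1-based page number of the source page
}

// Hypothetical helper, not from the commit: groups whole paragraphs into
// chunks of at most `maxChars` characters and never merges text from
// different pages, so every chunk maps back to exactly one page.
function chunkPages(pages: string[], maxChars: number): PageChunk[] {
  if (maxChars <= 0) {
    throw new RangeError("maxChars must be positive");
  }

  const chunks: PageChunk[] = [];
  pages.forEach((pageText, index) => {
    let current = "";
    // Assumption: paragraphs are separated by one or more blank lines.
    for (const paragraph of pageText.split(/\n\s*\n/)) {
      const trimmed = paragraph.trim();
      if (trimmed === "") continue;

      const candidate = current === "" ? trimmed : current + "\n\n" + trimmed;
      if (candidate.length > maxChars && current !== "") {
        // Adding this paragraph would exceed the limit: close the chunk.
        chunks.push({ text: current, page: index + 1 });
        current = trimmed;
      } else {
        // A single paragraph longer than maxChars is kept whole.
        current = candidate;
      }
    }
    if (current !== "") {
      chunks.push({ text: current, page: index + 1 });
    }
  });
  return chunks;
}
```

Keeping paragraphs intact trades uniform chunk size for readability of each chunk; a real implementation would likely also need a fallback for very long paragraphs.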
package-lock.json

Lines changed: 180 additions & 1 deletion

package.json

Lines changed: 1 addition & 0 deletions
@@ -58,6 +58,7 @@
 "@types/highlight.js": "^9.12.4",
 "fs-minipass": "^3.0.3",
 "highlight.js": "^11.11.1",
+"mammoth": "^1.9.1",
 "pdfjs-dist": "^3.11.174"
 },
 "type": "module"
