Commit e5f5c41

readme
1 parent 187fad9 commit e5f5c41

7 files changed

+810
-476
lines changed

2206.01062v1.md

Lines changed: 400 additions & 0 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 196 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
# Vectorize Iris - Simple Text Extraction
1+
# Vectorize Iris
22

33
**Extract text from any document with AI-powered precision.**
44

@@ -9,19 +9,21 @@ Documentation: [docs.vectorize.io](https://docs.vectorize.io/build-deploy/extrac
99

1010
## Why Iris?
1111

12-
Traditional text extraction tools struggle with:
13-
- Complex layouts (multi-column documents, tables, forms)
14-
- Poor quality scans or images
15-
- Mixed content types (text, tables, images)
16-
- Structured data extraction
17-
- Preserving document semantics
12+
Traditional OCR tools struggle with complex layouts, poor scans, and structured data. **Iris uses advanced AI** to understand document structure and context, delivering:
1813

19-
**Iris solves these problems** by using advanced AI models that understand document structure and context, delivering:
20-
- **High accuracy** - Even with poor quality or complex documents
14+
- **High accuracy** - Handles poor quality scans and complex layouts
2115
- 📊 **Structure preservation** - Maintains tables, lists, and formatting
22-
- 🎯 **Smart chunking** - Splits documents at semantic boundaries
16+
- 🎯 **Smart chunking** - Semantic splitting for RAG pipelines
2317
- 🔍 **Metadata extraction** - Extract specific fields using natural language
2418
- 🚀 **Simple API** - One function call to extract text
19+
- **Parallel processing** - Process multiple documents simultaneously
20+
- 🌐 **URL support** - Extract directly from HTTP/HTTPS URLs
21+
- 📂 **Batch processing** - Process entire directories automatically
22+
- 🔧 **Multiple formats** - Output as JSON, YAML, or plain text
23+
- 🪶 **Lightweight** - Single binary CLI with no dependencies
24+
- ☁️ **Cloud-native** - Serverless-ready APIs
25+
- 🌍 **Multi-lingual** - 100+ languages including Hindi, Arabic, Chinese
26+
- 🔌 **Multi-platform** - Python, Node.js, and CLI support
2527

2628
## Quick Start
2729

@@ -47,13 +49,12 @@ console.log(result.text);
4749

4850
[→ See Node.js examples](nodejs-api/)
4951

50-
### ⚡ Rust CLI
52+
### ⚡ CLI
53+
5154
```bash
5255
vectorize-iris document.pdf
5356
```
5457

55-
[→ See CLI examples](rust-cli/)
56-
5758
## Installation
5859

5960
**CLI:**
@@ -105,6 +106,188 @@ result = extract_text_from_file('document.pdf', options=ExtractionOptions(
105106
))
106107
```
107108

109+
## CLI Examples
110+
111+
### Basic Extraction
112+
113+
Beautiful terminal output with progress indicators:
114+
115+
```bash
116+
vectorize-iris document.pdf
117+
```
118+
119+
**Output:**
120+
```
121+
✨ Vectorize Iris Extraction
122+
──────────────────────────────────────────────────
123+
124+
✓ Upload prepared
125+
✓ File uploaded successfully
126+
✓ Extraction started
127+
✓ Extraction completed in 7s
128+
129+
─────────────────────────────────────────────────────────
130+
📄 Extracted Text
131+
─────────────────────────────────────────────────────────
132+
133+
Stats: 5536 chars • 1245 words • 89 lines
134+
135+
This is the extracted text from your PDF document.
136+
All formatting and structure is preserved.
137+
138+
Tables, lists, and other elements are properly extracted.
139+
```
140+
141+
### Extract from URL
142+
143+
Download and extract files directly from HTTP/HTTPS URLs:
144+
145+
```bash
146+
vectorize-iris https://example.com/document.pdf
147+
```
148+
149+
**Output:**
150+
```
151+
🚀 Downloading file from URL
152+
──────────────────────────────────────────────────
153+
154+
✓ Downloaded 2.1 MB to temporary file
155+
156+
✨ Vectorize Iris Extraction
157+
──────────────────────────────────────────────────
158+
159+
✓ Upload prepared
160+
✓ File uploaded successfully
161+
✓ Extraction started
162+
✓ Extraction completed in 8s
163+
```
164+
165+
### JSON Output (for piping)
166+
167+
```bash
168+
vectorize-iris document.pdf -o json
169+
```
170+
171+
**Output:**
172+
```json
173+
{
174+
"success": true,
175+
"text": "This is the extracted text from your PDF document...",
176+
"chunks": null,
177+
"metadata": null
178+
}
179+
```
180+
181+
**Pipe to jq:**
182+
```bash
183+
vectorize-iris document.pdf -o json | jq -r '.text' > output.txt
184+
```
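
For scripts that need more than `jq`, the same JSON can be consumed from Python. The following is a minimal sketch that assumes only the four fields shown in the sample output above (`success`, `text`, `chunks`, `metadata`); the helper name `extract_text` and the shape of a failed payload are assumptions, since the source only shows a successful result.

```python
import json

# Sample payload copied from the JSON output shown above; in practice this
# string would be read from the CLI's stdout.
raw = """{
  "success": true,
  "text": "This is the extracted text from your PDF document...",
  "chunks": null,
  "metadata": null
}"""


def extract_text(payload: str) -> str:
    """Return the `text` field, raising if the payload does not report success."""
    result = json.loads(payload)
    if not result.get("success"):
        # Assumption: a failed run sets "success" to false (not shown in the source).
        raise RuntimeError("extraction did not succeed")
    # "text" may be null in JSON, so fall back to an empty string.
    return result.get("text") or ""


text = extract_text(raw)
print(len(text), "characters extracted")
```

Checking `success` before reading `text` keeps a failed run from silently producing an empty output file.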
185+
186+
### Plain Text Output
187+
188+
Get only the extracted text:
189+
190+
```bash
191+
vectorize-iris document.pdf -o text
192+
```
193+
194+
**Pipe directly:**
195+
```bash
196+
vectorize-iris document.pdf -o text > output.txt
197+
```
198+
199+
### Save to File
200+
201+
Use `-f` to save output directly:
202+
203+
```bash
204+
vectorize-iris document.pdf -o json -f output.json
205+
```
206+
207+
**Output:**
208+
```
209+
✨ Vectorize Iris Extraction
210+
──────────────────────────────────────────────────
211+
212+
✓ Upload prepared
213+
✓ File uploaded successfully
214+
✓ Extraction started
215+
✓ Extraction completed in 7s
216+
✓ Output written to output.json
217+
```
218+
219+
### Process Directory
220+
221+
Process all files in a directory automatically:
222+
223+
```bash
224+
vectorize-iris ./documents -f ./output
225+
```
226+
227+
**Output:**
228+
```
229+
📦 Processing Directory
230+
──────────────────────────────────────────────────
231+
232+
💡 Found 5 files to process
233+
234+
⚙️ Processing 1/5 - report-q1.pdf
235+
✨ Vectorize Iris Extraction
236+
──────────────────────────────────────────────────
237+
✓ Upload prepared
238+
✓ File uploaded successfully
239+
✓ Extraction started
240+
✓ Extraction completed in 8s
241+
✓ Output written to output/report-q1.txt
242+
243+
⚙️ Processing 2/5 - report-q2.pdf
244+
...
245+
246+
──────────────────────────────────────────────────
247+
✨ Batch Processing Complete
248+
249+
✓ Successful: 5
250+
```
251+
252+
**With custom output format:**
253+
```bash
254+
# Extract all PDFs to JSON
255+
vectorize-iris ./documents -o json -f ./output
256+
257+
# Extract all files to plain text
258+
vectorize-iris ./scans -o text -f ./extracted
259+
```
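
When scripting around directory mode, it can help to predict where each result will land. This is a rough sketch based on the single naming example in the sample output above (`report-q1.pdf` written to `output/report-q1.txt`). The `EXTENSIONS` table and the `planned_outputs` helper are hypothetical; in particular, the `.json` and `.yaml` extensions are guesses and are not confirmed by the source.

```python
from pathlib import Path

# Hypothetical mapping from output format to file extension. Only the ".txt"
# case appears in the sample output above; the other two are guesses.
EXTENSIONS = {"text": ".txt", "json": ".json", "yaml": ".yaml"}


def planned_outputs(inputs, output_dir, fmt="text"):
    """Map each input file name to the output path it would presumably get."""
    out = Path(output_dir)
    # Keep the input's base name and swap its extension for the format's.
    return [out / (Path(name).stem + EXTENSIONS[fmt]) for name in inputs]


for path in planned_outputs(["report-q1.pdf", "report-q2.pdf"], "output"):
    print(path)
```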
260+
261+
### Chunking for RAG
262+
263+
```bash
264+
vectorize-iris long-document.pdf --chunk-size 512
265+
```
266+
267+
Splits documents at semantic boundaries, perfect for RAG pipelines.
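
To illustrate what splitting at boundaries under a size limit means, here is a small sketch that greedily packs whole paragraphs into chunks. It is not how Iris chunks documents (the service's algorithm is not described here), and it assumes the size is measured in characters, which the source does not state. The function name `chunk_paragraphs` is made up for this example.

```python
def chunk_paragraphs(text: str, chunk_size: int) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size becomes its own oversized chunk
    rather than being cut in the middle.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > chunk_size:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph with more detail.\n\nClosing note."
for i, chunk in enumerate(chunk_paragraphs(sample, chunk_size=40)):
    print(i, repr(chunk))
```

Because chunks always end on a paragraph break, each one stays readable on its own, which is the property that matters when chunks are embedded and retrieved independently.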
268+
269+
### Custom Parsing Instructions
270+
271+
```bash
272+
vectorize-iris report.pdf --parsing-instructions "Extract only tables and numerical data, ignore narrative text"
273+
```
274+
275+
### Advanced Options
276+
277+
```bash
278+
# Custom chunk size with metadata extraction
279+
vectorize-iris document.pdf \
280+
--chunk-size 256 \
281+
--infer-metadata-schema \
282+
--parsing-instructions "Focus on extracting structured data" \
283+
-o yaml -f output.yaml
284+
285+
# Longer timeout for large documents
286+
vectorize-iris large-document.pdf \
287+
--timeout 600 \
288+
--poll-interval 5
289+
```
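
The `--timeout` and `--poll-interval` options suggest a poll-until-done loop with a deadline. The sketch below shows that general pattern only; it is an assumption about the behavior, not the CLI's actual implementation. The `wait_until_done` helper and the `"processing"` and `"done"` status strings are hypothetical.

```python
import time


def wait_until_done(check_status, timeout=600.0, poll_interval=5.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check_status() every poll_interval seconds until it returns "done".

    Raises TimeoutError once `timeout` seconds have elapsed without completion.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        status = check_status()
        if status == "done":  # hypothetical status value
            return status
        if clock() >= deadline:
            raise TimeoutError(f"not finished after {timeout} seconds")
        sleep(poll_interval)


# Simulated job that finishes on the third status check; sleeping is stubbed out.
states = iter(["processing", "processing", "done"])
result = wait_until_done(lambda: next(states), timeout=600, poll_interval=5,
                         sleep=lambda _seconds: None)
print(result)
```

A larger poll interval means fewer status requests at the cost of noticing completion slightly later, which is a reasonable trade for large documents.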
290+
108291
## Configuration
109292

110293
Set your API credentials:
@@ -122,10 +305,6 @@ For detailed documentation, API reference, and advanced features:
122305

123306
📚 **[docs.vectorize.io](https://docs.vectorize.io)**
124307

125-
## Examples
126-
127-
See the [examples](examples/) directory for sample documents and complete usage examples.
128-
129308
## License
130309

131310
MIT

nodejs-api/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
# Vectorize Iris Node.js SDK
22

3-
**AI-powered document text extraction for Node.js & TypeScript**
3+
**Document text extraction for Node.js & TypeScript**
44

55
Extract text, tables, and structured data from PDFs, images, and documents with a single async function. Built on Vectorize Iris, the industry-leading AI extraction service.
66

python-api/README.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
# Vectorize Iris Python SDK
22

3-
**AI-powered document text extraction for Python**
3+
**Document text extraction for Python**
44

55
Extract text, tables, and structured data from PDFs, images, and documents with a single function call. Built on Vectorize Iris, the industry-leading AI extraction service.
66

rust-cli/Cargo.toml

Lines changed: 1 addition & 0 deletions
@@ -21,3 +21,4 @@ indicatif = "0.17"
2121
console = "0.15"
2222
textwrap = "0.16"
2323
syntect = "5.2"
24+
tempfile = "3.13"
