Commit 75b6711

add more
1 parent ee25cec commit 75b6711

6 files changed

+635
-56
lines changed

README.md

Lines changed: 34 additions & 19 deletions
@@ -11,6 +11,7 @@ Documentation: [docs.vectorize.io](https://docs.vectorize.io/build-deploy/extrac
 
 Traditional OCR tools struggle with complex layouts, poor scans, and structured data. **Iris uses advanced AI** to understand document structure and context, delivering:
 
+- 📄 **Universal format support** - Works with all unstructured document types (PDFs, images, scans, and more)
 - **High accuracy** - Handles poor quality scans and complex layouts
 - 📊 **Structure preservation** - Maintains tables, lists, and formatting
 - 🎯 **Smart chunking** - Semantic splitting for RAG pipelines
@@ -87,12 +88,17 @@ Split documents into semantic chunks perfect for RAG pipelines:
 - Preserves context across chunks
 
 ### Metadata Extraction
-Extract structured data using natural language:
+Extract structured data using JSON schemas (OpenAPI spec format recommended):
 ```python
 result = extract_text_from_file('invoice.pdf', options=ExtractionOptions(
     metadata_schemas=[{
         'id': 'invoice-data',
-        'schema': 'Extract: invoice_number, date, total_amount, vendor_name'
+        'schema': {
+            'invoice_number': 'string',
+            'date': 'string',
+            'total_amount': 'number',
+            'vendor_name': 'string'
+        }
     }]
 ))
 # Returns structured JSON metadata
@@ -143,23 +149,7 @@ Tables, lists, and other elements are properly extracted.
 Download and extract files directly from HTTP/HTTPS URLs:
 
 ```bash
-vectorize-iris https://example.com/document.pdf
-```
-
-**Output:**
-```
-🚀 Downloading file from URL
-──────────────────────────────────────────────────
-
-✓ Downloaded 2.1 MB to temporary file
-
-✨ Vectorize Iris Extraction
-──────────────────────────────────────────────────
-
-✓ Upload prepared
-✓ File uploaded successfully
-✓ Extraction started
-✓ Extraction completed in 8s
+vectorize-iris https://arxiv.org/pdf/2206.01062
 ```
 
 ### JSON Output (for piping)
@@ -272,6 +262,31 @@ Splits documents at semantic boundaries, perfect for RAG pipelines.
 vectorize-iris report.pdf --parsing-instructions "Extract only tables and numerical data, ignore narrative text"
 ```
 
+### Document Classification
+
+Pass multiple metadata schemas and Iris will automatically classify which schema matches best:
+
+```bash
+vectorize-iris invoice.pdf \
+  --metadata-schema 'invoice:{"invoice_number":"string","date":"string","total_amount":"number","vendor":"string"}' \
+  --metadata-schema 'receipt:{"store_name":"string","date":"string","items":"array","total":"number"}' \
+  --metadata-schema 'contract:{"parties":"array","effective_date":"string","terms":"string"}' \
+  --metadata-schema 'cv:{"name":"string","contact_info":"object","skills":"array","experience":"array"}' \
+  -o json
+```
+
+**Output:**
+```json
+{
+  "success": true,
+  "text": "...",
+  "metadata": "{\"invoice_number\":\"INV-2024-001\",\"date\":\"2024-01-15\",\"total_amount\":1250.00,\"vendor\":\"Acme Corp\"}",
+  "metadataSchema": "invoice"
+}
+```
+
+Iris automatically detected this was an invoice and extracted the relevant fields using the matching schema.
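
Editor's note: in the sample output above, `metadata` is a JSON-encoded string rather than a nested object, so a consumer would likely need to decode it a second time before reading individual fields. The following is a minimal Python sketch of that step, assuming the CLI output has exactly the shape shown in the sample. The `parse_classified` helper is hypothetical and not part of Iris.

```python
import json

# Sample CLI output copied from the example above (assumed shape).
raw_output = r'''
{
  "success": true,
  "text": "...",
  "metadata": "{\"invoice_number\":\"INV-2024-001\",\"date\":\"2024-01-15\",\"total_amount\":1250.00,\"vendor\":\"Acme Corp\"}",
  "metadataSchema": "invoice"
}
'''


def parse_classified(output: str) -> tuple[str, dict]:
    """Hypothetical helper: return (schema id, decoded metadata fields)."""
    result = json.loads(output)
    # `metadata` is itself a JSON string in the sample, so decode it again.
    fields = json.loads(result["metadata"])
    return result["metadataSchema"], fields


schema_id, fields = parse_classified(raw_output)
print(schema_id, fields["vendor"], fields["total_amount"])
```

Branching on `schema_id` after this step mirrors the `switch` on `metadataSchema` in the Node.js example added by the same commit.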
+
 ### Advanced Options
 
 ```bash

examples/classification.sh

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+#!/bin/bash
+
+# Document Classification Example
+# This example shows how to use multiple metadata schemas to automatically
+# classify documents and extract relevant fields.
+
+echo "==================================================================="
+echo "Document Classification Example"
+echo "==================================================================="
+echo ""
+echo "When you pass multiple metadata schemas, Iris will automatically"
+echo "determine which schema best matches your document and extract"
+echo "fields accordingly."
+echo ""
+
+# Example 1: Single document classification
+echo "Example 1: Classifying a single document"
+echo "-----------------------------------------------------------------"
+echo ""
+echo "Command:"
+echo " vectorize-iris document.pdf \\"
+echo " --metadata-schema 'invoice:{\"invoice_number\":\"string\",\"date\":\"string\",\"total_amount\":\"number\",\"vendor\":\"string\"}' \\"
+echo " --metadata-schema 'receipt:{\"store_name\":\"string\",\"date\":\"string\",\"items\":\"array\",\"total\":\"number\"}' \\"
+echo " --metadata-schema 'contract:{\"parties\":\"array\",\"effective_date\":\"string\",\"terms\":\"string\"}' \\"
+echo " -o json"
+echo ""
+echo "Expected output:"
+echo "{"
+echo " \"success\": true,"
+echo " \"text\": \"...\","
+echo " \"metadata\": \"{\\\"invoice_number\\\":\\\"INV-2024-001\\\",\\\"date\\\":\\\"2024-01-15\\\",\\\"total_amount\\\":1250.00,\\\"vendor\\\":\\\"Acme Corp\\\"}\","
+echo " \"metadataSchema\": \"invoice\""
+echo "}"
+echo ""
+echo "Note: The 'metadataSchema' field tells you which schema matched best."
+echo ""
+
+# Example 2: Batch classification of multiple documents
+echo "Example 2: Batch classification of multiple documents"
+echo "-----------------------------------------------------------------"
+echo ""
+echo "You can classify multiple documents at once:"
+echo ""
+echo "Command:"
+echo " vectorize-iris ./documents \\"
+echo " --metadata-schema 'invoice:{\"invoice_number\":\"string\",\"date\":\"string\",\"total_amount\":\"number\",\"vendor\":\"string\"}' \\"
+echo " --metadata-schema 'receipt:{\"store_name\":\"string\",\"date\":\"string\",\"items\":\"array\",\"total\":\"number\"}' \\"
+echo " --metadata-schema 'contract:{\"parties\":\"array\",\"effective_date\":\"string\",\"terms\":\"string\"}' \\"
+echo " -o json -f ./output"
+echo ""
+echo "This will process all documents in the ./documents directory,"
+echo "classify each one, and save the results to ./output with the"
+echo "appropriate schema detected for each document."
+echo ""
+
+# Example 3: Using jq to filter by document type
+echo "Example 3: Using jq to filter classified documents"
+echo "-----------------------------------------------------------------"
+echo ""
+echo "You can pipe the JSON output to jq to filter by document type:"
+echo ""
+echo "Command:"
+echo " vectorize-iris document.pdf \\"
+echo " --metadata-schema 'invoice:{\"invoice_number\":\"string\",\"date\":\"string\",\"total_amount\":\"number\",\"vendor\":\"string\"}' \\"
+echo " --metadata-schema 'receipt:{\"store_name\":\"string\",\"date\":\"string\",\"items\":\"array\",\"total\":\"number\"}' \\"
+echo " -o json | jq 'select(.metadataSchema == \"invoice\")'"
+echo ""
+echo "This extracts only documents that were classified as invoices."
+echo ""
+
+echo "==================================================================="
+echo "Try it yourself!"
+echo "==================================================================="

nodejs-api/README.md

Lines changed: 6 additions & 2 deletions
@@ -301,13 +301,17 @@ import type {
   MetadataExtractionStrategySchema
 } from '@vectorize-io/iris';
 
-// Type-safe options
+// Type-safe options with structured schema (OpenAPI spec format)
 const options: ExtractionOptions = {
   chunkSize: 512,
   parsingInstructions: 'Extract code blocks',
   metadataSchemas: [{
     id: 'doc-meta',
-    schema: 'Extract: title, author, date'
+    schema: {
+      title: 'string',
+      author: 'string',
+      date: 'string'
+    }
   }],
   pollInterval: 2000,
   timeout: 300000
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
+/**
+ * Document Classification Example
+ *
+ * This example demonstrates how to use multiple metadata schemas to automatically
+ * classify documents and extract relevant fields.
+ *
+ * When you provide multiple metadata schemas, Iris will:
+ * 1. Analyze the document
+ * 2. Determine which schema best matches the document type
+ * 3. Extract fields according to the matching schema
+ * 4. Return the schema ID in the response
+ */
+
+import { extractTextFromFile, ExtractionOptions } from '@vectorize-io/iris';
+import * as fs from 'fs/promises';
+import * as path from 'path';
+
+// Example 1: Single document classification
+console.log('='.repeat(70));
+console.log('Example 1: Classifying a single document');
+console.log('='.repeat(70));
+console.log();
+
+(async () => {
+  // Define multiple schemas for different document types (JSON objects)
+  const result = await extractTextFromFile('document.pdf', {
+    metadataSchemas: [
+      {
+        id: 'invoice',
+        schema: {
+          invoice_number: 'string',
+          date: 'string',
+          total_amount: 'number',
+          vendor_name: 'string'
+        }
+      },
+      {
+        id: 'receipt',
+        schema: {
+          store_name: 'string',
+          date: 'string',
+          items: 'array',
+          total: 'number'
+        }
+      },
+      {
+        id: 'contract',
+        schema: {
+          parties: 'array',
+          effective_date: 'string',
+          terms: 'string'
+        }
+      }
+    ]
+  });
+
+  // Check which schema matched
+  console.log(`Document classified as: ${result.metadataSchema}`);
+  console.log(`Extracted metadata: ${result.metadata}`);
+  console.log();
+
+  // Example 2: Processing multiple documents with classification
+  console.log('='.repeat(70));
+  console.log('Example 2: Batch classification of multiple documents');
+  console.log('='.repeat(70));
+  console.log();
+
+  const documentsDir = './documents';
+  try {
+    const files = await fs.readdir(documentsDir);
+
+    for (const file of files) {
+      const filePath = path.join(documentsDir, file);
+      const stat = await fs.stat(filePath);
+
+      if (stat.isFile()) {
+        const result = await extractTextFromFile(filePath, {
+          metadataSchemas: [
+            {
+              id: 'invoice',
+              schema: {
+                invoice_number: 'string',
+                date: 'string',
+                total_amount: 'number',
+                vendor_name: 'string'
+              }
+            },
+            {
+              id: 'receipt',
+              schema: {
+                store_name: 'string',
+                date: 'string',
+                items: 'array',
+                total: 'number'
+              }
+            },
+            {
+              id: 'contract',
+              schema: {
+                parties: 'array',
+                effective_date: 'string',
+                terms: 'string'
+              }
+            }
+          ]
+        });
+
+        console.log(`File: ${file}`);
+        console.log(` Type: ${result.metadataSchema}`);
+        console.log(` Metadata: ${result.metadata}`);
+        console.log();
+      }
+    }
+  } catch (error) {
+    console.log('Documents directory not found, skipping batch example');
+  }
+
+  // Example 3: Conditional processing based on classification
+  console.log('='.repeat(70));
+  console.log('Example 3: Conditional processing based on document type');
+  console.log('='.repeat(70));
+  console.log();
+
+  const classifiedResult = await extractTextFromFile('document.pdf', {
+    metadataSchemas: [
+      {
+        id: 'invoice',
+        schema: {
+          invoice_number: 'string',
+          date: 'string',
+          total_amount: 'number',
+          vendor_name: 'string'
+        }
+      },
+      {
+        id: 'receipt',
+        schema: {
+          store_name: 'string',
+          date: 'string',
+          items: 'array',
+          total: 'number'
+        }
+      }
+    ]
+  });
+
+  // Process differently based on document type
+  switch (classifiedResult.metadataSchema) {
+    case 'invoice':
+      console.log('Processing as invoice...');
+      // Invoice-specific logic here
+      console.log(`Invoice data: ${classifiedResult.metadata}`);
+      break;
+    case 'receipt':
+      console.log('Processing as receipt...');
+      // Receipt-specific logic here
+      console.log(`Receipt data: ${classifiedResult.metadata}`);
+      break;
+    default:
+      console.log('Unknown document type');
+      console.log(`Extracted text: ${classifiedResult.text.substring(0, 200)}...`);
+  }
+})();
