A TypeScript utility that converts PDF documents into structured JSON data while preserving text content, formatting, and hyperlinks. Perfect for resume parsing, document analysis, and content extraction workflows.
- Text Extraction: Extract text content with precise positioning and styling
- Hyperlink Detection: Capture clickable links with their coordinates and target URLs
- Font Preservation: Maintains font information for each text element
- Multi-page Support: Processes documents of any length
- Type Safety: Built with TypeScript for a better development experience
- Lightweight: Minimal dependencies
Make sure you have the following installed on your system:
- Node.js (v16 or higher)
- npm (v7 or higher) or yarn
Using npm:
npm install @shilendra-dev/pdf-to-json
Or using yarn:
yarn add @shilendra-dev/pdf-to-json
This package requires the following peer dependencies, which will be installed automatically:
- pdfjs-dist: ^3.4.120 (PDF.js library for PDF parsing)
- @types/node: ^18.0.0 (TypeScript types for Node.js)
import { pdfToJson } from '@shilendra-dev/pdf-to-json';
import fs from 'fs/promises';
async function convertPdfToJson() {
  try {
    // Read the PDF file into a Buffer
    const pdfBuffer = await fs.readFile('path/to/your/document.pdf');
    // Convert to JSON
    const result = await pdfToJson(pdfBuffer, {
      outputPath: 'output.json' // Optional: path to save the JSON output
    });
    console.log('Conversion complete!');
    console.log(`Processed ${result.numPages} pages`);
  } catch (error) {
    console.error('Error converting PDF:', error);
  }
}
convertPdfToJson();
pdfToJson(pdfSource, options) converts a PDF document to JSON.
Parameters:
- pdfSource: PDF file as a Buffer or file path
- options: (Optional) Configuration options
  - outputPath: (string) Path to save the JSON output file
  - includeTextContent: (boolean) Whether to include raw text content (default: true)
  - includeStyles: (boolean) Whether to include font and style information (default: true)
  - includeLinks: (boolean) Whether to include hyperlinks (default: true)
Returns: Promise that resolves to the parsed PDF data
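To make the documented defaults concrete, here is a minimal sketch of how the options could be typed and normalized before a call. The names `PdfToJsonOptions`, `ResolvedOptions`, and `withDefaults` are hypothetical and are not known exports of this package; only the option names and their defaults are taken from the parameter list above.

```typescript
// Hypothetical type mirroring the documented options.
// The package may export its own type under a different name.
interface PdfToJsonOptions {
  outputPath?: string;
  includeTextContent?: boolean;
  includeStyles?: boolean;
  includeLinks?: boolean;
}

// Every boolean flag resolved to a concrete value; outputPath stays optional.
type ResolvedOptions = Required<Omit<PdfToJsonOptions, 'outputPath'>> &
  Pick<PdfToJsonOptions, 'outputPath'>;

// Hypothetical helper: apply the documented defaults (all flags default to true).
function withDefaults(options: PdfToJsonOptions = {}): ResolvedOptions {
  return {
    includeTextContent: options.includeTextContent ?? true,
    includeStyles: options.includeStyles ?? true,
    includeLinks: options.includeLinks ?? true,
    // Only carry outputPath over when the caller actually set it.
    ...(options.outputPath !== undefined ? { outputPath: options.outputPath } : {}),
  };
}

// Example: skip style information but keep text content and links.
const resolved = withDefaults({ includeStyles: false });
console.log(resolved);
```

If the library applies the same defaults internally, passing `resolved` as the second argument to `pdfToJson` should behave the same as passing `{ includeStyles: false }` directly; the helper only makes the effective configuration explicit.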
The converter generates a JSON object with the following structure:
{
  numPages: number;
  pages: Array<{
    pageNumber: number;
    width: number;
    height: number;
    items: Array<{
      type: 'text' | 'link';
      content: string;
      x: number;
      y: number;
      width: number;
      height: number;
      fontFamily?: string;
      fontSize?: number;
      color?: string;
      url?: string; // For links
    }>;
  }>;
}
import { pdfToJson } from '@shilendra-dev/pdf-to-json';
// Convert a PDF fetched from a URL (the global fetch API requires Node.js 18 or later)
const response = await fetch('https://example.com/document.pdf');
const pdfBuffer = await response.arrayBuffer();
const result = await pdfToJson(Buffer.from(pdfBuffer));
// Process the extracted data
result.pages.forEach(page => {
  console.log(`Page ${page.pageNumber} (${page.width}x${page.height}):`);
  console.log(`- Contains ${page.items.length} text items`);
});
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Made with ❤️ by Shilendra Singh