
How to Extract Data from PDF with Node.js: A Complete Developer Guide (2026)

March 16, 2026

## The Problem With PDF Parsing in Node.js

PDFs are everywhere in production systems. Invoices, bank statements, contracts, insurance forms — businesses run on documents, and developers are constantly asked to extract structured data from them.

The Node.js ecosystem has several libraries for this. Most of them are frustrating to use for anything beyond simple text extraction. Here's what you're actually dealing with:

- **pdf-parse** — extracts raw text, but returns it as one flat string. You have to write regex to find your fields. Works okay for simple docs. Fails badly on scanned documents (which have no embedded text — just image layers).
- **pdfjs-dist** — the Mozilla PDF renderer, ported to Node. Can extract text with position data, which helps. But scanned PDFs still return nothing useful without a separate OCR step.
- **tesseract.js** — open-source OCR you can run locally. Good for simple scans. Accuracy drops fast on handwriting, rotated pages, low-DPI scans, or anything with a complex layout.
- **AWS Textract / Google Document AI** — production-grade OCR, but you're now managing AWS credentials, IAM roles, or GCP service accounts. Textract returns raw key-value pairs that still need significant post-processing. Not a 5-minute integration.

This guide walks through all three tiers: raw library parsing, self-hosted OCR, and a REST API (Dokyumi) that handles OCR and structured extraction in one call.

---

## Option 1: pdf-parse (Simple PDFs Only)

```bash
npm install pdf-parse
```

```typescript
import pdfParse from 'pdf-parse';
import fs from 'fs';

async function extractText(filePath: string): Promise<string> {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdfParse(dataBuffer);
  return data.text;
}

// Usage
const text = await extractText('./invoice.pdf');
console.log(text);
// Returns: flat string with newlines. All formatting collapsed.
// "Acme Corp\nInvoice #1042\nDate: 2026-01-15\nTotal: $1,847.50"
```

**Then you still need to parse the string:**

```typescript
function extractInvoiceTotal(rawText: string): string | null {
  const match = rawText.match(/Total[:\s]+\$?([\d,]+\.\d{2})/i);
  return match ? match[1] : null;
}

const total = extractInvoiceTotal(text); // "1,847.50" (maybe)
```

The regex approach works until it doesn't. The moment a vendor changes their template, or you get a scanned PDF instead of a digital one, it breaks. And you need a different regex for every document type.

---

## Option 2: pdfjs-dist (Text with Position Data)

For documents where field positions matter (like tables), pdfjs-dist gives you more control:

```bash
npm install pdfjs-dist
```

```typescript
import * as pdfjsLib from 'pdfjs-dist/legacy/build/pdf.js';
import fs from 'fs';

interface TextItem {
  str: string;
  transform: number[]; // [scaleX, skewX, skewY, scaleY, x, y]
}

async function getTextWithPositions(filePath: string) {
  const data = new Uint8Array(fs.readFileSync(filePath));
  const doc = await pdfjsLib.getDocument({ data }).promise;
  const page = await doc.getPage(1);
  const content = await page.getTextContent();
  return content.items
    .filter((item): item is TextItem => 'str' in item)
    .map(item => ({
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
    }));
}
```

This gives you positional text, which you can use to reconstruct tables or find fields by coordinate. It's a lot more code to maintain, and it still fails completely on scanned documents.

---

## Option 3: tesseract.js (OCR for Scanned Docs)

If you need to handle scanned PDFs, you need OCR.
tesseract.js runs Tesseract locally:

```bash
npm install tesseract.js pdf-to-img
```

```typescript
import Tesseract from 'tesseract.js';
import { pdf } from 'pdf-to-img';

async function ocrPdf(filePath: string): Promise<string> {
  const pages = await pdf(filePath, { scale: 2 });
  let fullText = '';

  for await (const page of pages) {
    const { data: { text } } = await Tesseract.recognize(page, 'eng', {
      logger: () => {}, // suppress verbose output
    });
    fullText += text + '\n';
  }

  return fullText;
}
```

This works. But Tesseract accuracy on real-world documents — low-DPI scans, skewed pages, forms with checkboxes, mixed fonts — ranges from okay to bad. You'll spend a lot of time tuning, and you still end up with raw text that you have to parse with regexes.

---

## Option 4: Dokyumi REST API (OCR + Structured Extraction in One Call)

Dokyumi handles the full pipeline: OCR the document, then extract exactly the fields you defined. You get clean JSON back with no regex, no text parsing, no positional coordinate math.

**Setup:**

1. Sign up at [dokyumi.com](https://dokyumi.com) (free tier: 100 extractions/month)
2. Create a schema — describe your document and the fields you want
3. Generate an API key
4. Call the API

**Basic extraction:**

```typescript
import FormData from 'form-data';
import fs from 'fs';
import fetch from 'node-fetch';

interface ExtractionResult {
  status: 'completed' | 'failed' | 'review';
  extraction_id: string;
  data: Record<string, unknown>;
  confidence_scores: Record<string, number>;
}

async function extractFromPdf(
  filePath: string,
  schemaSlug: string,
  apiKey: string
): Promise<ExtractionResult> {
  const form = new FormData();
  form.append('file', fs.createReadStream(filePath), {
    filename: 'document.pdf',
    contentType: 'application/pdf',
  });
  form.append('schema', schemaSlug);

  const response = await fetch('https://dokyumi.com/api/v1/extract', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      ...form.getHeaders(),
    },
    body: form,
  });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(`Extraction failed: ${JSON.stringify(error)}`);
  }

  return response.json() as Promise<ExtractionResult>;
}

// Usage
const result = await extractFromPdf(
  './invoice.pdf',
  'invoice-standard',
  'dk_live_your_key_here'
);

console.log(result.data);
// {
//   vendor_name: "Acme Corp",
//   invoice_number: "INV-1042",
//   invoice_date: "2026-01-15",
//   due_date: "2026-02-15",
//   subtotal: 1650.00,
//   tax_amount: 132.00,
//   total_amount: 1782.00,
//   currency: "USD"
// }

console.log(result.confidence_scores);
// { vendor_name: 0.98, invoice_number: 0.99, total_amount: 0.97, ... }
```

No regex. No text parsing. Works on both digital PDFs and scanned documents.
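Those confidence scores are worth acting on before extracted data flows downstream. Here's a minimal helper for flagging shaky fields — the `0.9` threshold is an arbitrary illustration, not a Dokyumi recommendation:

```typescript
// Return the names of fields whose confidence falls below a threshold,
// so callers can route those documents to human review.
function lowConfidenceFields(
  scores: Record<string, number>,
  threshold = 0.9
): string[] {
  return Object.entries(scores)
    .filter(([, score]) => score < threshold)
    .map(([field]) => field);
}

// Usage with the confidence_scores object returned above
const flagged = lowConfidenceFields(
  { vendor_name: 0.98, invoice_number: 0.99, total_amount: 0.82 }
);
console.log(flagged); // flags "total_amount"
```

Pair this with the `review` status check shown later: even a `completed` extraction can have one weak field you'd rather verify by hand.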
---

## Handling Multiple Document Types

If you're processing different document types — invoices from one folder, bank statements from another — create a separate schema for each type:

```typescript
const SCHEMAS = {
  invoice: 'invoice-standard',
  bankStatement: 'bank-statement-v2',
  contract: 'contract-nda',
} as const;

type DocumentType = keyof typeof SCHEMAS;

async function processDocument(
  filePath: string,
  docType: DocumentType,
  apiKey: string
) {
  const result = await extractFromPdf(filePath, SCHEMAS[docType], apiKey);

  if (result.status === 'failed') {
    console.error(`Extraction failed for ${filePath}`);
    return null;
  }

  if (result.status === 'review') {
    // Low confidence — flag for human review
    console.warn(`Low confidence on ${filePath} — queued for review`);
    return { ...result, needsReview: true };
  }

  return result.data;
}
```

---

## Batch Processing Multiple PDFs

```typescript
import path from 'path';
import { glob } from 'glob';

interface BatchResult {
  file: string;
  success: boolean;
  data?: Record<string, unknown>;
  error?: string;
}

async function batchExtract(
  directory: string,
  schemaSlug: string,
  apiKey: string,
  concurrency = 3
): Promise<BatchResult[]> {
  const files = await glob('**/*.pdf', { cwd: directory, absolute: true });
  const results: BatchResult[] = [];

  // Process in chunks to avoid rate limits
  for (let i = 0; i < files.length; i += concurrency) {
    const chunk = files.slice(i, i + concurrency);
    const chunkResults = await Promise.allSettled(
      chunk.map(async (file) => {
        const result = await extractFromPdf(file, schemaSlug, apiKey);
        return { file: path.basename(file), success: true, data: result.data };
      })
    );

    chunkResults.forEach((r, idx) => {
      if (r.status === 'fulfilled') {
        results.push(r.value);
      } else {
        results.push({
          file: path.basename(chunk[idx]),
          success: false,
          error: r.reason?.message,
        });
      }
    });

    // Brief pause between chunks
    if (i + concurrency < files.length) {
      await new Promise(resolve => setTimeout(resolve, 200));
    }
  }

  return results;
}

// Process all invoices in a directory
const results = await batchExtract(
  './invoices',
  'invoice-standard',
  'dk_live_your_key_here'
);

console.log(`Processed: ${results.filter(r => r.success).length}/${results.length}`);
```

---

## With Express: Building a Document Processing Endpoint

If you're building a backend service that accepts document uploads:

```typescript
import express from 'express';
import multer from 'multer';
import FormData from 'form-data';
import fetch from 'node-fetch';

const app = express();
const upload = multer({ storage: multer.memoryStorage() });

app.post('/process-document', upload.single('file'), async (req, res) => {
  if (!req.file) {
    return res.status(400).json({ error: 'No file uploaded' });
  }

  const { schema } = req.body;
  if (!schema) {
    return res.status(400).json({ error: 'Schema slug required' });
  }

  try {
    const form = new FormData();
    form.append('file', req.file.buffer, {
      filename: req.file.originalname,
      contentType: req.file.mimetype,
    });
    form.append('schema', schema);

    const response = await fetch('https://dokyumi.com/api/v1/extract', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.DOKYUMI_API_KEY}`,
        ...form.getHeaders(),
      },
      body: form,
    });

    if (!response.ok) {
      const err = await response.json();
      return res.status(response.status).json(err);
    }

    const result = await response.json();
    res.json(result);
  } catch (err) {
    console.error('Document processing error:', err);
    res.status(500).json({ error: 'Processing failed' });
  }
});

app.listen(3000, () => console.log('Document processor running on :3000'));
```

---

## Comparison: When to Use What

| Approach | Best for | Handles scanned docs? | Setup time | Maintenance |
|---|---|---|---|---|
| pdf-parse | Digital PDFs, simple extraction | ❌ No | 5 min | High (regex upkeep) |
| pdfjs-dist | Position-aware parsing | ❌ No | 30 min | High (coordinate logic) |
| tesseract.js | Simple scans, offline/local only | ✅ (limited accuracy) | 1-2 hrs | High (accuracy tuning) |
| Dokyumi API | Any document type, production use | ✅ Full OCR pipeline | 5 min | None |

---

## Performance Notes

- **Latency:** Dokyumi API calls typically complete in 2-6 seconds for a single-page document, 5-15 seconds for multi-page PDFs. Plan accordingly for synchronous API routes.
- **Async patterns:** For user-facing uploads, return a 202 Accepted immediately and use a job queue (Bull, BullMQ) to process documents in the background. Webhook delivery is available on the Growth plan and above.
- **Error handling:** Always check `result.status`. A `review` status means the extraction succeeded but confidence was low on some fields — you'll want to flag these for human verification rather than silently passing them downstream.

---

## Getting Started

1. **Sign up at [dokyumi.com](https://dokyumi.com)** — free tier includes 100 extractions/month
2. **Create a schema** — describe your document in plain English. Dokyumi's AI will suggest field names and types. Takes under 2 minutes.
3. **Generate an API key** from the dashboard
4. **Test with curl first:**

```bash
curl -X POST https://dokyumi.com/api/v1/extract \
  -H "Authorization: Bearer dk_live_YOUR_KEY" \
  -F "file=@invoice.pdf" \
  -F "schema=YOUR_SCHEMA_SLUG"
```

5. **Drop in the TypeScript wrapper above** and you're processing PDFs.

If you're coming from a Python stack and need the equivalent guide, see [How to Extract Data from PDF with Python](/blog/extract-data-from-pdf-python-complete-guide).
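One closing hardening note for the batch-processing loop above: the fixed 200 ms pause between chunks won't save you if the API starts returning 429s under load. A generic retry-with-exponential-backoff wrapper (a sketch, not part of any Dokyumi SDK) degrades more gracefully:

```typescript
// Retry an async operation with exponential backoff.
// Retries only on errors the caller marks as retryable (e.g. HTTP 429 / 5xx).
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts = 4,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // 500 ms, 1 s, 2 s, ... before the next attempt
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Inside the chunk `map`, wrap the call as `withRetry(() => extractFromPdf(file, schemaSlug, apiKey), err => String(err).includes('429'))`; the error-matching predicate is an assumption about how your wrapper surfaces HTTP status codes, so adjust it to whatever your `extractFromPdf` actually throws.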

Start extracting in under 2 minutes

100 free extractions every month. No credit card required.
