# How to Extract Data from PDF with Node.js: A Complete Developer Guide (2026)
March 16, 2026
## The Problem With PDF Parsing in Node.js
PDFs are everywhere in production systems. Invoices, bank statements, contracts, insurance forms — businesses run on documents, and developers are constantly asked to extract structured data from them.
The Node.js ecosystem has several libraries for this. Most of them are frustrating to use for anything beyond simple text extraction.
Here's what you're actually dealing with:
- **pdf-parse** — extracts raw text, but returns it as one flat string. You have to write regex to find your fields. Works okay for simple docs. Fails badly on scanned documents (which have no embedded text — just image layers).
- **pdfjs-dist** — the Mozilla PDF renderer, ported to Node. Can extract text with position data, which helps. But scanned PDFs still return nothing useful without a separate OCR step.
- **tesseract.js** — open-source OCR you can run locally. Good for simple scans. Accuracy drops fast on handwriting, rotated pages, low-DPI scans, or anything with a complex layout.
- **AWS Textract / Google Document AI** — production-grade OCR, but you're now managing AWS credentials, IAM roles, or GCP service accounts. Textract returns raw key-value pairs that still need significant post-processing. Not a 5-minute integration.
This guide walks through three tiers: raw library parsing (pdf-parse, pdfjs-dist), self-hosted OCR (tesseract.js), and a REST API (Dokyumi) that handles OCR and structured extraction in one call.
---
## Option 1: pdf-parse (Simple PDFs Only)
```bash
npm install pdf-parse
```
```typescript
import pdfParse from 'pdf-parse';
import fs from 'fs';

async function extractText(filePath: string): Promise<string> {
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdfParse(dataBuffer);
  return data.text;
}

// Usage
const text = await extractText('./invoice.pdf');
console.log(text);
// Returns: flat string with newlines. All formatting collapsed.
// "Acme Corp\nInvoice #1042\nDate: 2026-01-15\nTotal: $1,847.50"
```
**Then you still need to parse the string:**
```typescript
function extractInvoiceTotal(rawText: string): string | null {
  const match = rawText.match(/Total[:\s]+\$?([\d,]+\.\d{2})/i);
  return match ? match[1] : null;
}
const total = extractInvoiceTotal(text); // "1,847.50" (maybe)
```
The regex approach works until it doesn't. The moment a vendor changes their template, or you get a scanned PDF instead of a digital one, it breaks. And you need different regexes for every document type.
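To see where this heads, here's a sketch of the per-document-type pattern table you end up maintaining. The field names and patterns below are illustrative, not from any real template:

```typescript
// Sketch: one regex table per document type. Every new vendor template
// means another entry, and every layout tweak silently breaks a pattern.
const FIELD_PATTERNS: Record<string, Record<string, RegExp>> = {
  invoice: {
    total: /Total[:\s]+\$?([\d,]+\.\d{2})/i,
    number: /Invoice\s*#?\s*(\S+)/i,
  },
  receipt: {
    total: /Amount\s+Due[:\s]+\$?([\d,]+\.\d{2})/i,
  },
};

function extractFields(
  docType: string,
  text: string
): Record<string, string | null> {
  const patterns = FIELD_PATTERNS[docType] ?? {};
  // Run each field's regex; null means the pattern didn't match
  return Object.fromEntries(
    Object.entries(patterns).map(([field, re]) => {
      const m = text.match(re);
      return [field, m ? m[1] : null];
    })
  );
}
```

Every `null` in the result is either a genuinely missing field or a template drift you won't notice until someone audits the data.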
---
## Option 2: pdfjs-dist (Text with Position Data)
For documents where field positions matter (like tables), pdfjs-dist gives you more control:
```bash
npm install pdfjs-dist
```
```typescript
import fs from 'fs';
import * as pdfjsLib from 'pdfjs-dist/legacy/build/pdf.js';

interface TextItem {
  str: string;
  transform: number[]; // [scaleX, skewX, skewY, scaleY, x, y]
}

async function getTextWithPositions(filePath: string) {
  const data = new Uint8Array(fs.readFileSync(filePath));
  const doc = await pdfjsLib.getDocument({ data }).promise;
  const page = await doc.getPage(1);
  const content = await page.getTextContent();
  return content.items
    .filter((item): item is TextItem => 'str' in item)
    .map(item => ({
      text: item.str,
      x: item.transform[4],
      y: item.transform[5],
    }));
}
```
This gives you positional text, which you can use to reconstruct tables or find fields by coordinate. It's a lot more code to maintain and still fails completely on scanned documents.
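As a sketch of what that coordinate work looks like, you can cluster the `{ text, x, y }` items into visual rows by grouping on similar y values. The tolerance here is an assumption you would tune per document (PDF y coordinates grow upward from the bottom-left origin):

```typescript
interface PositionedText {
  text: string;
  x: number;
  y: number;
}

// Sketch: cluster text items into rows by y, then read each row left-to-right.
function groupIntoRows(items: PositionedText[], tolerance = 2): string[][] {
  // Sort top-to-bottom: larger y is higher on the page
  const byY = [...items].sort((a, b) => b.y - a.y);
  const rows: PositionedText[][] = [];
  let lastY: number | null = null;
  for (const item of byY) {
    // Start a new row when the vertical gap exceeds the tolerance
    if (lastY === null || Math.abs(item.y - lastY) > tolerance) {
      rows.push([]);
    }
    rows[rows.length - 1].push(item);
    lastY = item.y;
  }
  // Within each row, order cells by x
  return rows.map(row => row.sort((a, b) => a.x - b.x).map(i => i.text));
}
```

This is enough to rebuild simple tables, but multi-column layouts, wrapped cells, and rotated text all need their own special cases, which is where the maintenance cost comes from.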
---
## Option 3: tesseract.js (OCR for Scanned Docs)
If you need to handle scanned PDFs, you need OCR. tesseract.js runs Tesseract locally:
```bash
npm install tesseract.js pdf-to-img
```
```typescript
import Tesseract from 'tesseract.js';
import { pdf } from 'pdf-to-img';

async function ocrPdf(filePath: string): Promise<string> {
  // Render each PDF page to an image, then OCR it
  const pages = await pdf(filePath, { scale: 2 });
  let fullText = '';
  for await (const page of pages) {
    const { data: { text } } = await Tesseract.recognize(page, 'eng', {
      logger: () => {}, // suppress verbose output
    });
    fullText += text + '\n';
  }
  return fullText;
}
```
This works. But Tesseract accuracy on real-world documents — low DPI scans, skewed pages, forms with checkboxes, mixed fonts — ranges from okay to bad. You'll spend a lot of time tuning and you still end up with raw text that you have to parse with regexes.
---
## Option 4: Dokyumi REST API (OCR + Structured Extraction in One Call)
Dokyumi handles the full pipeline: OCR the document, then extract exactly the fields you defined. You get clean JSON back with no regex, no text parsing, no positional coordinate math.
**Setup:**
1. Sign up at [dokyumi.com](https://dokyumi.com) (free tier: 100 extractions/month)
2. Create a schema — describe your document and the fields you want
3. Generate an API key
4. Call the API
**Basic extraction:**
```typescript
import FormData from 'form-data';
import fs from 'fs';
import fetch from 'node-fetch';

interface ExtractionResult {
  status: 'completed' | 'failed' | 'review';
  extraction_id: string;
  data: Record<string, unknown>;
  confidence_scores: Record<string, number>;
}

async function extractFromPdf(
  filePath: string,
  schemaSlug: string,
  apiKey: string
): Promise<ExtractionResult> {
  const form = new FormData();
  form.append('file', fs.createReadStream(filePath), {
    filename: 'document.pdf',
    contentType: 'application/pdf',
  });
  form.append('schema', schemaSlug);

  const response = await fetch('https://dokyumi.com/api/v1/extract', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      ...form.getHeaders(),
    },
    body: form,
  });

  if (!response.ok) {
    const error = await response.json();
    throw new Error(`Extraction failed: ${JSON.stringify(error)}`);
  }
  return response.json() as Promise<ExtractionResult>;
}
// Usage
const result = await extractFromPdf(
  './invoice.pdf',
  'invoice-standard',
  'dk_live_your_key_here'
);
console.log(result.data);
// {
// vendor_name: "Acme Corp",
// invoice_number: "INV-1042",
// invoice_date: "2026-01-15",
// due_date: "2026-02-15",
// subtotal: 1650.00,
// tax_amount: 132.00,
// total_amount: 1782.00,
// currency: "USD"
// }
console.log(result.confidence_scores);
// { vendor_name: 0.98, invoice_number: 0.99, total_amount: 0.97, ... }
```
No regex. No text parsing. Works on both digital PDFs and scanned documents.
---
## Handling Multiple Document Types
If you're processing different document types — invoices from one folder, bank statements from another — create a separate schema for each type:
```typescript
const SCHEMAS = {
  invoice: 'invoice-standard',
  bankStatement: 'bank-statement-v2',
  contract: 'contract-nda',
} as const;

type DocumentType = keyof typeof SCHEMAS;

async function processDocument(
  filePath: string,
  docType: DocumentType,
  apiKey: string
) {
  const result = await extractFromPdf(filePath, SCHEMAS[docType], apiKey);

  if (result.status === 'failed') {
    console.error(`Extraction failed for ${filePath}`);
    return null;
  }
  if (result.status === 'review') {
    // Low confidence — flag for human review
    console.warn(`Low confidence on ${filePath} — queued for review`);
    return { ...result, needsReview: true };
  }
  return result.data;
}
```
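If the document type isn't known up front, a simple routing heuristic can pick the schema before calling the API. The filename keywords and the type union below are purely hypothetical; in practice you'd route on upload source, folder, or sender instead:

```typescript
// Sketch: infer a document type from filename keywords.
// Keywords and type names are illustrative assumptions, not a Dokyumi feature.
type DocType = 'invoice' | 'bankStatement' | 'contract';

function inferDocType(filename: string): DocType | null {
  const lower = filename.toLowerCase();
  if (lower.includes('invoice')) return 'invoice';
  if (lower.includes('statement')) return 'bankStatement';
  if (lower.includes('nda') || lower.includes('contract')) return 'contract';
  return null; // unknown: fall back to manual routing
}
```

Returning `null` for unrecognized files keeps them out of the pipeline instead of running them against the wrong schema.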
---
## Batch Processing Multiple PDFs
```typescript
import path from 'path';
import { glob } from 'glob';
interface BatchResult {
  file: string;
  success: boolean;
  data?: Record<string, unknown>;
  error?: string;
}

async function batchExtract(
  directory: string,
  schemaSlug: string,
  apiKey: string,
  concurrency = 3
): Promise<BatchResult[]> {
  const files = await glob('**/*.pdf', { cwd: directory, absolute: true });
  const results: BatchResult[] = [];

  // Process in chunks to avoid rate limits
  for (let i = 0; i < files.length; i += concurrency) {
    const chunk = files.slice(i, i + concurrency);
    const chunkResults = await Promise.allSettled(
      chunk.map(async (file) => {
        const result = await extractFromPdf(file, schemaSlug, apiKey);
        return { file: path.basename(file), success: true, data: result.data };
      })
    );
    chunkResults.forEach((r, idx) => {
      if (r.status === 'fulfilled') {
        results.push(r.value);
      } else {
        results.push({
          file: path.basename(chunk[idx]),
          success: false,
          error: r.reason?.message,
        });
      }
    });
    // Brief pause between chunks
    if (i + concurrency < files.length) {
      await new Promise(resolve => setTimeout(resolve, 200));
    }
  }
  return results;
}
// Process all invoices in a directory
const results = await batchExtract(
  './invoices',
  'invoice-standard',
  'dk_live_your_key_here'
);
console.log(`Processed: ${results.filter(r => r.success).length}/${results.length}`);
```
---
## With Express: Building a Document Processing Endpoint
If you're building a backend service that accepts document uploads:
```typescript
import express from 'express';
import multer from 'multer';
import FormData from 'form-data';
import fetch from 'node-fetch';
const app = express();
const upload = multer({ storage: multer.memoryStorage() });
app.post('/process-document', upload.single('file'), async (req, res) => {
  if (!req.file) {
    return res.status(400).json({ error: 'No file uploaded' });
  }
  const { schema } = req.body;
  if (!schema) {
    return res.status(400).json({ error: 'Schema slug required' });
  }

  try {
    const form = new FormData();
    form.append('file', req.file.buffer, {
      filename: req.file.originalname,
      contentType: req.file.mimetype,
    });
    form.append('schema', schema);

    const response = await fetch('https://dokyumi.com/api/v1/extract', {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.DOKYUMI_API_KEY}`,
        ...form.getHeaders(),
      },
      body: form,
    });

    if (!response.ok) {
      const err = await response.json();
      return res.status(response.status).json(err);
    }
    const result = await response.json();
    res.json(result);
  } catch (err) {
    console.error('Document processing error:', err);
    res.status(500).json({ error: 'Processing failed' });
  }
});
app.listen(3000, () => console.log('Document processor running on :3000'));
```
---
## Comparison: When to Use What
| Approach | Best for | Handles scanned docs? | Setup time | Maintenance |
|---|---|---|---|---|
| pdf-parse | Digital PDFs, simple extraction | ❌ No | 5 min | High (regex upkeep) |
| pdfjs-dist | Position-aware parsing | ❌ No | 30 min | High (coordinate logic) |
| tesseract.js | Simple scans, offline/local only | ✅ (limited accuracy) | 1-2 hrs | High (accuracy tuning) |
| Dokyumi API | Any document type, production use | ✅ Full OCR pipeline | 5 min | None |
---
## Performance Notes
- **Latency:** Dokyumi API calls typically complete in 2-6 seconds for a single-page document, 5-15 seconds for multi-page PDFs. Plan accordingly for synchronous API routes.
- **Async patterns:** For user-facing uploads, return a 202 Accepted immediately and use a job queue (Bull, BullMQ) to process documents in the background. Webhook delivery is available on Growth plan and above.
- **Error handling:** Always check `result.status`. A `review` status means the extraction succeeded but confidence was low on some fields — you'll want to flag these for human verification rather than silently passing them downstream.
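Transient failures (429 rate limits, network blips) are also worth retrying with backoff before flagging a document as failed. A minimal generic helper, not tied to any Dokyumi SDK, might look like:

```typescript
// Sketch: retry an async operation with exponential backoff.
// Wrap extractFromPdf (or any fetch call) in this before giving up.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (i < attempts - 1) {
        // Backoff doubles each attempt: 500ms, 1000ms, 2000ms, ...
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** i));
      }
    }
  }
  throw lastError;
}

// Usage: const result = await withRetry(() => extractFromPdf(file, slug, key));
```

A production version would retry only on retryable status codes rather than every error, but the shape is the same.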
---
## Getting Started
1. **Sign up at [dokyumi.com](https://dokyumi.com)** — free tier includes 100 extractions/month
2. **Create a schema** — describe your document in plain English. Dokyumi's AI will suggest field names and types. Takes under 2 minutes.
3. **Generate an API key** from the dashboard
4. **Test with curl first:**
```bash
curl -X POST https://dokyumi.com/api/v1/extract \
-H "Authorization: Bearer dk_live_YOUR_KEY" \
-F "file=@invoice.pdf" \
-F "schema=YOUR_SCHEMA_SLUG"
```
5. **Drop in the TypeScript wrapper above** and you're processing PDFs.
If you're coming from a Python stack and need the equivalent guide, see [How to Extract Data from PDF with Python](/blog/extract-data-from-pdf-python-complete-guide).