How to Extract Data from PDF with Python: A Complete Developer Guide (2026)
March 16, 2026
The Problem with Raw PDF Extraction in Python
Every Python developer eventually hits the PDF wall. You've got invoices, bank statements, or contracts to process. You reach for PyPDF2 or pdfplumber. You get text. A wall of text — no structure, no field boundaries, no way to reliably pull out the vendor name or invoice total without writing brittle regex that breaks on the next slightly-different PDF.
This guide walks through four approaches to PDF data extraction in Python, from the most basic to the most production-ready, with real code for each.
Approach 1: PyPDF2 — Basic Text Extraction
PyPDF2 is the simplest entry point (the project now lives on as pypdf, with a near-identical API). It extracts text page by page but has no awareness of layout, tables, or field structure.
```python
import PyPDF2

def extract_text_pypdf2(pdf_path: str) -> str:
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        text = ''
        for page in reader.pages:
            text += page.extract_text() + '\n'
        return text

# What you get back:
# "INVOICE\nInvoice #: INV-2026-0042\nDate: March 15, 2026\nBill To: Acme Corp..."
```
Problems: No structure. You're left parsing a string. Multi-column PDFs produce garbled output. Scanned PDFs return empty strings (no OCR).
Approach 2: pdfplumber — Layout-Aware Extraction
pdfplumber is layout-aware: it tracks character positions on the page, so it can extract tables and preserve reading order far better than PyPDF2.
```python
import pdfplumber

def extract_with_pdfplumber(pdf_path: str):
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract tables if present
            tables = page.extract_tables()
            for table in tables:
                for row in table:
                    print(row)  # Each row is a list of cell values
            # Extract full text
            text = page.extract_text()
            print(text)
```
Better, but still: You have to write table parsing logic per document type. Scanned PDFs still return nothing. Different invoice templates from different vendors break your parser.
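For instance, turning a raw pdfplumber table into usable records means writing glue code per template. A sketch, assuming the first row is a header (often true, never guaranteed):

```python
def table_to_records(table: list[list[str]]) -> list[dict]:
    # Assumes row 0 is a header row; cells can be None in pdfplumber output
    header = [(h or "").strip() for h in table[0]]
    return [dict(zip(header, row)) for row in table[1:]]

# rows shaped like pdfplumber's extract_tables() output
rows = [["Description", "Qty", "Amount"],
        ["Server hardware", "2", "3600.00"]]
```

The next vendor's invoice puts the header on row 2, or merges cells, and this helper needs a new variant.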
Approach 3: AWS Textract via boto3
AWS Textract handles scanned documents and can extract forms and tables. But the setup is significant.
```python
import boto3

textract = boto3.client('textract', region_name='us-east-1')

def extract_with_textract(pdf_path: str):
    # Note: the synchronous AnalyzeDocument API accepts single-page documents;
    # multi-page PDFs require the async StartDocumentAnalysis flow via S3.
    with open(pdf_path, 'rb') as f:
        response = textract.analyze_document(
            Document={'Bytes': f.read()},
            FeatureTypes=['FORMS', 'TABLES']
        )

    # Extract key-value pairs from forms
    key_map = {}
    value_map = {}
    block_map = {}
    for block in response['Blocks']:
        block_map[block['Id']] = block
        if block['BlockType'] == 'KEY_VALUE_SET':
            if 'KEY' in block.get('EntityTypes', []):
                key_map[block['Id']] = block
            else:
                value_map[block['Id']] = block

    # ... 40 more lines to reconstruct key-value pairs
    # ... then you still need to map Textract's keys to your field names
    return key_map, value_map
```
The real cost: AWS account required, IAM credentials to configure, $1.50 per 1,000 pages for form/table extraction, and you still have to write post-processing to map Textract's generic KEY_VALUE_SET blocks to your actual fields (vendor_name, invoice_total, due_date, etc.).
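For a sense of that post-processing, here is a condensed sketch of the standard traversal. The CHILD and VALUE relationship types and WORD blocks are real Textract concepts; the helper names are mine:

```python
def get_text(block: dict, block_map: dict) -> str:
    # Join the WORD children of a KEY or VALUE block into one string
    words = []
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'CHILD':
            for child_id in rel['Ids']:
                child = block_map[child_id]
                if child['BlockType'] == 'WORD':
                    words.append(child['Text'])
    return ' '.join(words)

def build_kv_pairs(key_map: dict, value_map: dict, block_map: dict) -> dict:
    # Follow each KEY block's VALUE relationship to its VALUE block
    pairs = {}
    for key_block in key_map.values():
        value_text = ''
        for rel in key_block.get('Relationships', []):
            if rel['Type'] == 'VALUE':
                for value_id in rel['Ids']:
                    value_text = get_text(value_map[value_id], block_map)
        pairs[get_text(key_block, block_map)] = value_text
    return pairs
```

Even after this, keys come back as whatever text appears on the page ("Total Due:", "TOTAL", "Amount payable"), so mapping them to stable field names is still on you.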
Approach 4: Schema-First Extraction with Dokyumi
Dokyumi takes a different approach: you define exactly which fields you want, and the API returns validated JSON with those fields — nothing more, nothing less. Handles scanned PDFs, digital PDFs, and images. No AWS account. Free tier: 100 extractions/month.
Step 1: Define your schema (one time, in the dashboard)
Go to dokyumi.com/dashboard, create a schema for "Invoice" with fields: vendor_name, invoice_number, invoice_date, due_date, subtotal, tax_amount, total_amount, line_items (array).
Or let AI infer the schema for you by describing the document type in plain English.
Your schema gets a slug (e.g., invoice-extractor) and a dedicated API endpoint.
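If you want the field list in version control alongside your code, a plain data structure works as documentation. This is a hypothetical representation for reference only; the authoritative schema format is whatever the dashboard and docs specify:

```python
# Hypothetical mirror of the "Invoice" schema defined in the dashboard;
# field names match the extraction response, types are assumptions.
INVOICE_SCHEMA = {
    "slug": "invoice-extractor",
    "fields": [
        {"name": "vendor_name", "type": "string"},
        {"name": "invoice_number", "type": "string"},
        {"name": "invoice_date", "type": "date"},
        {"name": "due_date", "type": "date"},
        {"name": "subtotal", "type": "number"},
        {"name": "tax_amount", "type": "number"},
        {"name": "total_amount", "type": "number"},
        {"name": "line_items", "type": "array"},
    ],
}
```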
Step 2: Extract a PDF with Python
```python
import requests
from pathlib import Path

DOKYUMI_API_KEY = "dk_live_your_key_here"
SCHEMA_SLUG = "invoice-extractor"

def extract_invoice(pdf_path: str) -> dict:
    with open(pdf_path, 'rb') as f:
        response = requests.post(
            "https://dokyumi.com/api/v1/extract",
            headers={"Authorization": f"Bearer {DOKYUMI_API_KEY}"},
            files={"file": (Path(pdf_path).name, f, "application/pdf")},
            data={"schema": SCHEMA_SLUG},
        )
    response.raise_for_status()
    return response.json()

# What you get back:
result = extract_invoice("vendor-invoice.pdf")
# {
#   "status": "completed",
#   "data": {
#     "vendor_name": "Acme Supplies Inc.",
#     "invoice_number": "INV-2026-0042",
#     "invoice_date": "2026-03-10",
#     "due_date": "2026-04-10",
#     "subtotal": 4250.00,
#     "tax_amount": 382.50,
#     "total_amount": 4632.50,
#     "line_items": [
#       {"description": "Server hardware", "qty": 2, "unit_price": 1800.00, "amount": 3600.00},
#       {"description": "Installation", "qty": 1, "unit_price": 650.00, "amount": 650.00}
#     ]
#   },
#   "confidence": 0.94,
#   "extraction_id": "ext_01abc123",
#   "processing_ms": 1843
# }
```
Step 3: Handle confidence and routing
```python
def process_invoice(pdf_path: str) -> dict:
    result = extract_invoice(pdf_path)

    if result["status"] == "completed":
        if result["confidence"] >= 0.85:
            # High confidence — route directly to your system
            return {"action": "auto_approve", "data": result["data"]}
        else:
            # Lower confidence — flag for human review
            return {"action": "review", "data": result["data"],
                    "confidence": result["confidence"]}
    elif result["status"] == "review":
        # Dokyumi flagged this for review (below its threshold)
        return {"action": "manual_entry", "extraction_id": result.get("extraction_id")}
    else:
        raise Exception(f"Extraction failed: {result.get('error')}")
```
Step 4: Batch processing multiple PDFs
```python
import concurrent.futures
from pathlib import Path

def batch_extract(pdf_directory: str, max_workers: int = 5) -> list:
    pdf_files = list(Path(pdf_directory).glob("*.pdf"))
    results = []

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(extract_invoice, str(pdf)): pdf
            for pdf in pdf_files
        }
        for future in concurrent.futures.as_completed(futures):
            pdf_file = futures[future]
            try:
                result = future.result()
                results.append({"file": pdf_file.name, "result": result})
            except Exception as e:
                results.append({"file": pdf_file.name, "error": str(e)})

    return results
```
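At batch scale, transient network failures and rate limits are routine, so it is worth wrapping the extraction call in a retry. A generic sketch built on the requests exception hierarchy (with_retry is a hypothetical helper name, not part of any SDK):

```python
import time
import requests

def with_retry(fn, *args, max_attempts: int = 3, base_delay: float = 1.0):
    # Retry transient network errors with exponential backoff: 1s, 2s, 4s, ...
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except requests.RequestException:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))

# Drop it into batch_extract:
#   executor.submit(with_retry, extract_invoice, str(pdf))
```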
Step 5: Webhook for async processing (production pattern)
For large batches or slow uploads, use webhooks instead of polling:
```python
from flask import Flask, request, jsonify
import hmac, hashlib

app = Flask(__name__)
WEBHOOK_SECRET = "your_webhook_secret"

@app.route("/webhooks/dokyumi", methods=["POST"])
def handle_extraction_complete():
    # Verify signature
    sig = request.headers.get("X-Dokyumi-Signature", "")
    payload = request.get_data()
    expected = hmac.new(WEBHOOK_SECRET.encode(), payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(f"sha256={expected}", sig):
        return jsonify({"error": "Invalid signature"}), 401

    event = request.json
    if event["event"] == "extraction.completed":
        extraction_data = event["data"]
        # Process the completed extraction (save_to_database is your code)
        save_to_database(extraction_data)

    return jsonify({"received": True})
```
Handling Edge Cases
Scanned PDFs and low-quality images
Dokyumi's Mistral OCR layer handles scanned documents automatically. No code change needed on your end. For best results: 150+ DPI, avoid extreme skew, ensure good contrast. The confidence score will reflect OCR quality — route low-confidence results to manual review.
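One cheap pre-flight check: a scanned PDF usually has no text layer, so you can flag ahead of time which files will lean entirely on OCR. A sketch; the pypdf import is deferred so the pure helper stays dependency-free, and the 20-character threshold is an arbitrary assumption:

```python
def looks_scanned(page_texts: list, min_chars: int = 20) -> bool:
    # Almost no extractable text across all pages -> probably a scan
    return sum(len(t or "") for t in page_texts) < min_chars

def pdf_looks_scanned(pdf_path: str) -> bool:
    from pypdf import PdfReader  # pip install pypdf
    reader = PdfReader(pdf_path)
    return looks_scanned([page.extract_text() for page in reader.pages])
```

Files that look scanned are the ones to watch in your confidence-based routing.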
Multi-page documents
Multi-page PDFs are handled as a single document. Fields are extracted across all pages. Line items from a 10-page invoice are consolidated into a single line_items array.
International invoices
Include date format and currency in your schema description: "Extract dates in ISO format (YYYY-MM-DD), amounts as numeric values without currency symbols." The AI follows these instructions.
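Even with those instructions in place, cheap defensive normalization on your side catches drift. A sketch assuming the field names from the example response above:

```python
from datetime import date

def normalize_invoice(data: dict) -> dict:
    # Coerce extracted values into canonical Python types
    out = dict(data)
    for field in ("invoice_date", "due_date"):
        if isinstance(out.get(field), str):
            out[field] = date.fromisoformat(out[field])  # raises if not ISO format
    for field in ("subtotal", "tax_amount", "total_amount"):
        if out.get(field) is not None:
            out[field] = float(out[field])
    return out
```

A ValueError here is a useful signal: the extraction ignored your format instruction, so route that document to review.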
Comparison: Python PDF Extraction Methods
| Method | Setup time | Output format | Custom fields | Scanned PDFs | Cost |
|---|---|---|---|---|---|
| PyPDF2 | 5 min | Raw text string | ✗ (you write regex) | ✗ | Free |
| pdfplumber | 10 min | Text + tables | ✗ (you write parser) | ✗ | Free |
| AWS Textract | 1-2 weeks | KEY_VALUE_SET blocks | Partial (post-processing) | ✓ | $1.50/1K pages |
| Dokyumi | <2 min | Validated JSON (your fields) | ✓ (schema-defined) | ✓ | Free up to 100/mo |
When to Use Each Approach
- PyPDF2 / pdfplumber: One-off scripts, simple text search, documents where you only need raw text for embedding or search. Not suitable for structured extraction.
- AWS Textract: If you're already deep in AWS, need very high volume (millions of pages), or require built-in audit trails. Prepare for significant integration work.
- Dokyumi: Any time you need specific named fields from documents at production scale. Especially if your document types vary (invoices from 50 different vendors), you handle scanned documents, or you want to go from zero to working API in under an hour.
Get Started
Sign up for free at dokyumi.com — 100 extractions/month, no credit card required. Create your first schema in 2 minutes. The full API reference (including error codes, TypeScript types, and webhook payload docs) is at dokyumi.com/docs.