
How to Extract Data from PDF with Python: A Complete Developer Guide (2026)

March 16, 2026

The Problem with Raw PDF Extraction in Python

Every Python developer eventually hits the PDF wall. You've got invoices, bank statements, or contracts to process. You reach for PyPDF2 or pdfplumber. You get text. A wall of text — no structure, no field boundaries, no way to reliably pull out the vendor name or invoice total without writing brittle regex that breaks on the next slightly-different PDF.

This guide walks through four approaches to PDF data extraction in Python, from the most basic to the most production-ready, with real code for each.

Approach 1: PyPDF2 — Basic Text Extraction

PyPDF2 is the simplest entry point. It extracts text page by page, but gives you no awareness of layout, tables, or field structure. (Note: PyPDF2 is deprecated in favor of its successor, pypdf, which has a nearly identical API — but PyPDF2 is still what most tutorials reach for.)

import PyPDF2

def extract_text_pypdf2(pdf_path: str) -> str:
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        text = ''
        for page in reader.pages:
            # extract_text() can come back empty (or None in older versions)
            # for image-only pages, so guard the concatenation
            text += (page.extract_text() or '') + '\n'
    return text

# What you get back:
# "INVOICE\nInvoice #: INV-2026-0042\nDate: March 15, 2026\nBill To: Acme Corp..."

Problems: No structure. You're left parsing a string. Multi-column PDFs produce garbled output. Scanned PDFs return empty strings (no OCR).
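To see what "brittle regex" means in practice, here is what parsing the sample output above typically looks like. The field patterns are hypothetical — each one is tied to this vendor's exact labels, and a layout that says "Invoice No." instead of "Invoice #" silently returns None:

```python
import re

SAMPLE = "INVOICE\nInvoice #: INV-2026-0042\nDate: March 15, 2026\nBill To: Acme Corp"

def parse_invoice_text(text: str) -> dict:
    """Pull fields out of raw PDF text with regex.

    Works for this one layout only; every label or spacing change
    in the next vendor's template breaks a pattern.
    """
    patterns = {
        "invoice_number": r"Invoice\s*#:\s*(\S+)",
        "date": r"Date:\s*([^\n]+)",
        "bill_to": r"Bill To:\s*([^\n]+)",
    }
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        fields[name] = match.group(1).strip() if match else None
    return fields
```

Multiply this by every document template you receive and the maintenance cost becomes obvious.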

Approach 2: pdfplumber — Layout-Aware Extraction

pdfplumber understands layout better than PyPDF2. It can extract tables and has better positional awareness.

import pdfplumber

def extract_with_pdfplumber(pdf_path: str):
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Extract tables if present
            tables = page.extract_tables()
            for table in tables:
                for row in table:
                    print(row)  # Each row is a list of cell values
            
            # Extract full text
            text = page.extract_text()
            print(text)

Better, but still: You have to write table parsing logic per document type. Scanned PDFs still return nothing. Different invoice templates from different vendors break your parser.
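The per-document parsing logic usually starts with turning pdfplumber's row lists into records keyed by the header row. A minimal sketch, assuming the first row holds the column names — a different table layout needs its own variant:

```python
def rows_to_records(table: list) -> list:
    """Convert a pdfplumber table (list of row lists) into dicts.

    Assumes row 0 is the header row; cells can be None, so they
    are normalized to stripped strings first.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records
```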

Approach 3: AWS Textract via boto3

AWS Textract handles scanned documents and can extract forms and tables. But the setup is significant.

import boto3

textract = boto3.client('textract', region_name='us-east-1')

def extract_with_textract(pdf_path: str):
    with open(pdf_path, 'rb') as f:
        response = textract.analyze_document(
            Document={'Bytes': f.read()},
            FeatureTypes=['FORMS', 'TABLES']
        )
    
    # Extract key-value pairs from forms
    key_map = {}
    value_map = {}
    block_map = {}
    
    for block in response['Blocks']:
        block_map[block['Id']] = block
        if block['BlockType'] == 'KEY_VALUE_SET':
            if 'KEY' in block.get('EntityTypes', []):
                key_map[block['Id']] = block
            else:
                value_map[block['Id']] = block
    
    # ... 40 more lines to reconstruct key-value pairs
    # ... then you still need to map Textract's keys to your field names
    return key_map, value_map

The real cost: AWS account required, IAM credentials to configure, $1.50 per 1,000 pages for form/table extraction, and the synchronous analyze_document call shown above only accepts single-page PDFs and images — multi-page PDFs require the asynchronous API with documents staged in S3. On top of that, you still have to write post-processing to map Textract's generic KEY_VALUE_SET blocks to your actual fields (vendor_name, invoice_total, due_date, etc.).
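For reference, the key-value reconstruction elided above boils down to following each KEY block's Relationships to its VALUE block and their child WORD blocks. A condensed sketch of the standard pattern from AWS's own examples:

```python
def get_text(block: dict, block_map: dict) -> str:
    """Concatenate the child WORD blocks of a KEY or VALUE block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = block_map[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)

def kv_pairs(blocks: list) -> dict:
    """Reconstruct form fields from Textract KEY_VALUE_SET blocks."""
    block_map = {b["Id"]: b for b in blocks}
    pairs = {}
    for block in blocks:
        if block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", []):
            key_text = get_text(block, block_map)
            for rel in block.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for value_id in rel["Ids"]:
                        pairs[key_text] = get_text(block_map[value_id], block_map)
    return pairs
```

Even after this, the keys are whatever text appeared on the page ("Total:", "Amount Due", "TOTAL"), so mapping them to your own field names is still your problem.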

Approach 4: Schema-First Extraction with Dokyumi

Dokyumi takes a different approach: you define exactly which fields you want, and the API returns validated JSON with those fields — nothing more, nothing less. Handles scanned PDFs, digital PDFs, and images. No AWS account. Free tier: 100 extractions/month.

Step 1: Define your schema (one time, in the dashboard)

Go to dokyumi.com/dashboard, create a schema for "Invoice" with fields: vendor_name, invoice_number, invoice_date, due_date, subtotal, tax_amount, total_amount, line_items (array).

Or let AI infer the schema for you by describing the document type in plain English.

Your schema gets a slug (e.g., invoice-extractor) and a dedicated API endpoint.

Step 2: Extract a PDF with Python

import requests
import json
from pathlib import Path

DOKYUMI_API_KEY = "dk_live_your_key_here"
SCHEMA_SLUG = "invoice-extractor"

def extract_invoice(pdf_path: str) -> dict:
    with open(pdf_path, 'rb') as f:
        response = requests.post(
            "https://dokyumi.com/api/v1/extract",
            headers={"Authorization": f"Bearer {DOKYUMI_API_KEY}"},
            files={"file": (Path(pdf_path).name, f, "application/pdf")},
            data={"schema": SCHEMA_SLUG},
        )
    response.raise_for_status()
    return response.json()

# What you get back:
result = extract_invoice("vendor-invoice.pdf")
# {
#   "status": "completed",
#   "data": {
#     "vendor_name": "Acme Supplies Inc.",
#     "invoice_number": "INV-2026-0042",
#     "invoice_date": "2026-03-10",
#     "due_date": "2026-04-10",
#     "subtotal": 4250.00,
#     "tax_amount": 382.50,
#     "total_amount": 4632.50,
#     "line_items": [
#       {"description": "Server hardware", "qty": 2, "unit_price": 1800.00, "amount": 3600.00},
#       {"description": "Installation", "qty": 1, "unit_price": 650.00, "amount": 650.00}
#     ]
#   },
#   "confidence": 0.94,
#   "extraction_id": "ext_01abc123",
#   "processing_ms": 1843
# }

Step 3: Handle confidence and routing

def process_invoice(pdf_path: str) -> dict:
    result = extract_invoice(pdf_path)
    
    if result["status"] == "completed":
        if result["confidence"] >= 0.85:
            # High confidence — route directly to your system
            return {"action": "auto_approve", "data": result["data"]}
        else:
            # Lower confidence — flag for human review
            return {"action": "review", "data": result["data"], 
                    "confidence": result["confidence"]}
    
    elif result["status"] == "review":
        # Dokyumi flagged this for review (below its threshold)
        return {"action": "manual_entry", "extraction_id": result.get("extraction_id")}
    
    else:
        raise Exception(f"Extraction failed: {result.get('error')}")

Step 4: Batch processing multiple PDFs

import concurrent.futures
from pathlib import Path

def batch_extract(pdf_directory: str, max_workers: int = 5) -> list:
    pdf_files = list(Path(pdf_directory).glob("*.pdf"))
    results = []
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(extract_invoice, str(pdf)): pdf 
            for pdf in pdf_files
        }
        for future in concurrent.futures.as_completed(futures):
            pdf_file = futures[future]
            try:
                result = future.result()
                results.append({"file": pdf_file.name, "result": result})
            except Exception as e:
                results.append({"file": pdf_file.name, "error": str(e)})
    
    return results
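In batch runs you will eventually hit transient failures — network blips, rate limits — and you don't want one flaky upload to poison the whole batch. A generic retry helper with exponential backoff; the retry counts and delays are illustrative, and it assumes transient errors surface as exceptions from the wrapped call:

```python
import time

def with_retries(fn, *args, retries: int = 3, base_delay: float = 1.0, **kwargs):
    """Call fn, retrying on any exception with exponential backoff.

    Delays are base_delay, 2*base_delay, 4*base_delay, ...; the last
    failure is re-raised so the caller still sees the error.
    """
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Usage inside batch_extract, wrapping the submitted call:
#   executor.submit(with_retries, extract_invoice, str(pdf))
```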

Step 5: Webhook for async processing (production pattern)

For large batches or slow uploads, use webhooks instead of polling:

from flask import Flask, request, jsonify
import hmac, hashlib

app = Flask(__name__)
WEBHOOK_SECRET = "your_webhook_secret"

@app.route("/webhooks/dokyumi", methods=["POST"])
def handle_extraction_complete():
    # Verify signature
    sig = request.headers.get("X-Dokyumi-Signature", "")
    payload = request.get_data()
    expected = hmac.new(WEBHOOK_SECRET.encode(), payload, hashlib.sha256).hexdigest()
    
    if not hmac.compare_digest(f"sha256={expected}", sig):
        return jsonify({"error": "Invalid signature"}), 401
    
    event = request.json
    if event["event"] == "extraction.completed":
        extraction_data = event["data"]
        # Persist the completed extraction — save_to_database is your
        # own code (database insert, queue message, etc.)
        save_to_database(extraction_data)
    
    return jsonify({"received": True})
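To test the handler locally, you can generate the same signature the service would send — assuming the `sha256=<hex digest>` scheme used in the verification code above:

```python
import hmac
import hashlib
import json

def sign_payload(secret: str, payload: bytes) -> str:
    """Compute an X-Dokyumi-Signature value for a raw request body."""
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return f"sha256={digest}"

# Sign a fake event body, then POST it to your local endpoint with
# this value in the X-Dokyumi-Signature header
body = json.dumps({"event": "extraction.completed", "data": {}}).encode()
signature = sign_payload("your_webhook_secret", body)
```

Note that the handler verifies against request.get_data(), the raw bytes — re-serializing the JSON on your side can change whitespace and break the signature, so sign the exact bytes you send.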

Handling Edge Cases

Scanned PDFs and low-quality images

Dokyumi's Mistral OCR layer handles scanned documents automatically. No code change needed on your end. For best results: 150+ DPI, avoid extreme skew, ensure good contrast. The confidence score will reflect OCR quality — route low-confidence results to manual review.

Multi-page documents

Multi-page PDFs are handled as a single document. Fields are extracted across all pages. Line items from a 10-page invoice are consolidated into a single line_items array.

International invoices

Include date format and currency in your schema description: "Extract dates in ISO format (YYYY-MM-DD), amounts as numeric values without currency symbols." The AI follows these instructions.
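Whichever instructions you put in the schema, it is still worth validating the normalized output in your own code before it reaches downstream systems. A sketch using the field names from the example schema above — it checks ISO date formatting and that the amounts add up:

```python
from datetime import date
import math

def validate_invoice(data: dict) -> list:
    """Return a list of validation problems (empty list = clean)."""
    problems = []
    for field in ("invoice_date", "due_date"):
        try:
            date.fromisoformat(data[field])
        except (KeyError, TypeError, ValueError):
            problems.append(f"{field} is not an ISO date")
    # Floating-point comparison with a one-cent tolerance
    expected_total = data.get("subtotal", 0) + data.get("tax_amount", 0)
    if not math.isclose(expected_total, data.get("total_amount", 0), abs_tol=0.01):
        problems.append("subtotal + tax_amount != total_amount")
    return problems
```

Anything this flags can be routed into the same review queue as low-confidence extractions.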

Comparison: Python PDF Extraction Methods

| Method | Setup time | Output format | Custom fields | Scanned PDFs | Cost |
| --- | --- | --- | --- | --- | --- |
| PyPDF2 | 5 min | Raw text string | ✗ (you write regex) | ✗ | Free |
| pdfplumber | 10 min | Text + tables | ✗ (you write parser) | ✗ | Free |
| AWS Textract | 1-2 weeks | KEY_VALUE_SET blocks | Partial (post-processing) | ✓ | $1.50/1K pages |
| Dokyumi | <2 min | Validated JSON (your fields) | ✓ (schema-defined) | ✓ | Free up to 100/mo |

When to Use Each Approach

  • PyPDF2 / pdfplumber: One-off scripts, simple text search, documents where you only need raw text for embedding or search. Not suitable for structured extraction.
  • AWS Textract: If you're already deep in AWS, need very high volume (millions of pages), or require built-in audit trails. Prepare for significant integration work.
  • Dokyumi: Any time you need specific named fields from documents at production scale. Especially if your document types vary (invoices from 50 different vendors), you handle scanned documents, or you want to go from zero to working API in under an hour.

Get Started

Sign up for free at dokyumi.com — 100 extractions/month, no credit card required. Create your first schema in 2 minutes. The full API reference (including error codes, TypeScript types, and webhook payload docs) is at dokyumi.com/docs.
