Webhook-Driven Document Processing: Build Automated Pipelines with Dokyumi
March 16, 2026
Why Polling Your Document Parser Is Killing Your Pipeline
Most developers build document processing pipelines the same way: upload a file, wait, poll an endpoint every few seconds, check if extraction is done, process the result. It works. It also wastes compute, adds latency, and falls apart the moment you're processing thousands of documents per day.
Webhooks solve this. Instead of asking "is it done?" every 2 seconds, your server is told the moment extraction finishes, with the extracted data already in the payload. No polling loop and no wasted API calls.
This guide walks through building a proper webhook-driven document processing pipeline with Dokyumi. By the end, you'll have a system that ingests documents, extracts structured JSON, and triggers downstream workflows automatically — zero polling required.
How Dokyumi Webhooks Work
When you configure a webhook on a Dokyumi schema, the flow looks like this:
- A document is uploaded (via API, white-label portal, or direct upload)
- Dokyumi extracts structured data against your schema
- As soon as extraction completes, Dokyumi POSTs the result to your webhook URL
- Your server receives the payload and runs whatever logic you need
The webhook payload includes the full extraction result, the original document metadata, and a signature header so you can verify the request actually came from Dokyumi.
Setting Up Your Webhook Endpoint
First, you need a publicly reachable HTTPS endpoint. For development, ngrok or Cloudflare Tunnel works fine. For production, this is just a route on your existing API server.
Here's a minimal Node.js / Express receiver:
const express = require('express');
const crypto = require('crypto');

const app = express();

// Keep the raw body: the signature is computed over the exact bytes sent
app.use('/webhooks/dokyumi', express.raw({ type: 'application/json' }));

const WEBHOOK_SECRET = process.env.DOKYUMI_WEBHOOK_SECRET;

app.post('/webhooks/dokyumi', (req, res) => {
  // 1. Verify signature with a constant-time comparison
  const signature = Buffer.from(req.headers['x-dokyumi-signature'] || '');
  const expectedSig = crypto
    .createHmac('sha256', WEBHOOK_SECRET)
    .update(req.body)
    .digest('hex');
  const expected = Buffer.from(`sha256=${expectedSig}`);
  if (signature.length !== expected.length || !crypto.timingSafeEqual(signature, expected)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }

  // 2. Parse payload
  const event = JSON.parse(req.body.toString());

  // 3. Handle the event without blocking the response
  if (event.event === 'extraction.completed') {
    processExtraction(event.data).catch((err) => console.error('Processing failed:', err));
  }

  // Always return 200 immediately - process async
  res.status(200).json({ received: true });
});

// saveToDatabase and triggerDownstreamJob are your own application functions
async function processExtraction(data) {
  const { document_id, schema_id, result } = data;
  console.log('Extracted data:', JSON.stringify(result, null, 2));
  await saveToDatabase(document_id, result);
  await triggerDownstreamJob(schema_id, result);
}

app.listen(3000);
The same pattern in Python (FastAPI):
from fastapi import FastAPI, Request, HTTPException, BackgroundTasks
import hmac
import hashlib
import json
import os

app = FastAPI()
WEBHOOK_SECRET = os.environ['DOKYUMI_WEBHOOK_SECRET']


@app.post("/webhooks/dokyumi")
async def handle_dokyumi_webhook(request: Request, background_tasks: BackgroundTasks):
    body = await request.body()
    signature = request.headers.get('x-dokyumi-signature', '')

    expected = 'sha256=' + hmac.new(
        WEBHOOK_SECRET.encode(),
        body,
        hashlib.sha256
    ).hexdigest()

    if not hmac.compare_digest(signature, expected):
        raise HTTPException(status_code=401, detail="Invalid signature")

    event = json.loads(body)
    if event['event'] == 'extraction.completed':
        background_tasks.add_task(process_extraction, event['data'])

    return {"received": True}


async def process_extraction(data: dict):
    document_id = data['document_id']
    result = data['result']
    await save_to_database(document_id, result)
Registering the Webhook in Dokyumi
Webhooks are configured per schema in your Dokyumi dashboard (Growth plan and above). In the schema settings, add your endpoint URL and save the generated secret to your environment variables.
You can also manage webhooks via the API:
# Register a webhook
curl -X POST https://dokyumi.com/api/v1/schemas/{schema_id}/webhooks \
  -H "Authorization: Bearer $DOKYUMI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://your-api.com/webhooks/dokyumi",
    "events": ["extraction.completed", "extraction.failed"],
    "active": true
  }'
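If you would rather register the webhook from application code, the same request can be sketched in Node.js. The endpoint path and body fields are taken from the curl command above; the function names are illustrative, and the shape of the response is not documented here, so the sketch simply returns whatever JSON comes back.

```javascript
// Builds the same request as the curl command above.
// buildWebhookRegistration and registerWebhook are illustrative names, not part of a Dokyumi SDK.
function buildWebhookRegistration(schemaId, webhookUrl, apiKey) {
  return {
    url: `https://dokyumi.com/api/v1/schemas/${encodeURIComponent(schemaId)}/webhooks`,
    options: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${apiKey}`,
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({
        url: webhookUrl,
        events: ['extraction.completed', 'extraction.failed'],
        active: true,
      }),
    },
  };
}

// Sends the request using the global fetch available in Node 18+.
// The response body is returned as-is because its fields are an unknown here.
async function registerWebhook(schemaId, webhookUrl, apiKey) {
  const { url, options } = buildWebhookRegistration(schemaId, webhookUrl, apiKey);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Webhook registration failed: HTTP ${res.status}`);
  return res.json();
}
```

Splitting the request builder from the network call keeps the part you can unit test free of I/O.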
The Webhook Payload
Every extraction.completed event delivers a consistent payload structure:
{
  "event": "extraction.completed",
  "id": "evt_01HX8K2NVQMB3TPZC4Y7R9WJF",
  "created_at": "2026-03-16T12:15:00Z",
  "data": {
    "document_id": "doc_01HX8K1NVQMB3TPZC4Y7R9WJE",
    "schema_id": "sch_01HX7YNVQMB3TPZC4Y6R8WIE",
    "schema_name": "Invoice Parser",
    "status": "completed",
    "result": {
      "invoice_number": "INV-2026-0342",
      "vendor_name": "Acme Corp",
      "total_amount": 4250.00,
      "currency": "USD",
      "line_items": [
        {
          "description": "Professional Services",
          "quantity": 17,
          "unit_price": 250.00,
          "total": 4250.00
        }
      ],
      "due_date": "2026-04-15"
    },
    "metadata": {
      "filename": "invoice-acme-march.pdf",
      "pages": 2,
      "processing_time_ms": 1842,
      "confidence": 0.97
    }
  }
}
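Before acting on a payload like this one, it can be worth sanity-checking it. The sketch below is one possible approach for the invoice schema shown above: the 0.9 confidence threshold, the rounding tolerance, and the function name are all assumptions chosen for illustration, not Dokyumi recommendations.

```javascript
// Assumed threshold for routing an extraction to manual review (illustrative value).
const MIN_CONFIDENCE = 0.9;

// Returns a list of reasons the extraction should be reviewed by a person.
// An empty list means the payload passed every check.
function reviewReasons(data) {
  const reasons = [];
  const { result, metadata } = data;

  if (metadata.confidence < MIN_CONFIDENCE) {
    reasons.push('low confidence');
  }

  // Line item totals should add up to the invoice total, within half a cent.
  const lineTotal = result.line_items.reduce((sum, item) => sum + item.total, 0);
  if (Math.abs(lineTotal - result.total_amount) > 0.005) {
    reasons.push('line items do not sum to total');
  }

  if (Number.isNaN(Date.parse(result.due_date))) {
    reasons.push('unparseable due date');
  }

  return reasons;
}
```

Run against the example payload, every check passes, so the list comes back empty and the invoice can flow straight to downstream steps.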
For failed extractions, you get extraction.failed with an error field instead of result. Handle both event types so your pipeline degrades gracefully.
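One way to keep both event types handled in a single place is a small pure function that turns an event into a record update. This is a sketch: `planUpdate` and the returned field names are hypothetical, while `event`, `data.document_id`, `data.result`, and `data.error` follow the payload described above.

```javascript
// Maps a webhook event to the update your pipeline should apply.
// Unknown event types are ignored rather than treated as errors,
// so new event types do not break the handler.
function planUpdate(event) {
  const data = event.data || {};
  switch (event.event) {
    case 'extraction.completed':
      return { status: 'extracted', documentId: data.document_id, payload: data.result };
    case 'extraction.failed':
      return { status: 'failed', documentId: data.document_id, payload: data.error };
    default:
      return { status: 'ignored', documentId: null, payload: null };
  }
}
```

Because the function has no side effects, you can test the routing logic without a database or a running server.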
Building a Real Pipeline: Invoice Processing Example
Here's what a production invoice processing pipeline looks like with Dokyumi webhooks:
// 1. Upload step
async function uploadInvoice(fileBuffer, filename) {
  const formData = new FormData();
  formData.append('file', new Blob([fileBuffer]), filename);
  formData.append('schema_id', process.env.INVOICE_SCHEMA_ID);

  const res = await fetch('https://dokyumi.com/api/v1/extract', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${process.env.DOKYUMI_API_KEY}` },
    body: formData,
  });
  if (!res.ok) throw new Error(`Upload failed: HTTP ${res.status}`);
  const { document_id } = await res.json();

  // Store pending - webhook will update it
  await db.invoices.create({
    document_id,
    filename,
    status: 'processing',
    created_at: new Date(),
  });
  return document_id;
}

// 2. Webhook handler - fires when extraction completes
// verifySignature is assumed to check the signature and set req.body to the parsed JSON
app.post('/webhooks/dokyumi', verifySignature, async (req, res) => {
  res.status(200).json({ received: true }); // Ack immediately
  const { event, data } = req.body;

  try {
    if (event === 'extraction.completed') {
      const invoice = data.result;
      await db.invoices.update({
        where: { document_id: data.document_id },
        data: {
          status: 'extracted',
          vendor: invoice.vendor_name,
          amount: invoice.total_amount,
          due_date: new Date(invoice.due_date),
          line_items: invoice.line_items,
        }
      });

      // Trigger downstream - auto-create bill, notify approver
      await quickbooks.createBill(invoice);
      await slack.notifyApprover({
        vendor: invoice.vendor_name,
        amount: invoice.total_amount,
        dueDate: invoice.due_date,
      });
    } else if (event === 'extraction.failed') {
      await db.invoices.update({
        where: { document_id: data.document_id },
        data: { status: 'failed', error: data.error }
      });
      await slack.alertOps(`Invoice extraction failed: ${data.error}`);
    }
  } catch (err) {
    // The response is already sent, so log instead of leaving the rejection unhandled
    console.error('Webhook processing failed:', err);
  }
});
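One ordering detail in a pipeline like this: the upload step writes its pending record only after the upload call returns, so in principle a webhook could arrive before that record exists. A hedged way to make the two writes order-independent is an upsert that never downgrades a final status. The sketch below uses an in-memory Map as a stand-in for the invoices table; `upsertInvoice` and `STATUS_RANK` are hypothetical names.

```javascript
// In-memory stand-in for the invoices table, keyed by document_id.
const invoices = new Map();

// A final status (extracted or failed) outranks the pending 'processing' status.
const STATUS_RANK = { processing: 0, extracted: 1, failed: 1 };

// Creates the record if it is missing, otherwise merges the new fields into it.
// A lower-ranked status never overwrites a higher-ranked one.
function upsertInvoice(documentId, fields) {
  const current = invoices.get(documentId) || {};
  const merged = { ...current, ...fields };
  if (current.status && STATUS_RANK[fields.status] < STATUS_RANK[current.status]) {
    merged.status = current.status;
  }
  invoices.set(documentId, merged);
  return merged;
}
```

With a real database, the same idea maps to an upsert keyed on `document_id` plus a guard on the status column.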
Handling Failures and Retries
Dokyumi retries failed webhook deliveries automatically: 3 attempts with exponential backoff (5s, 30s, 5min). But you should also handle failures on your end.
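To see how long that schedule gives your endpoint to recover, you can add up the delays. The sketch below assumes each retry waits the listed delay after the previous failed attempt and ignores the time spent on the requests themselves; that reading of the schedule is an assumption.

```javascript
// Retry delays from the schedule above, in seconds: 5s, 30s, 5min.
const RETRY_DELAYS_SECONDS = [5, 30, 300];

// Returns how many seconds after the first failure each retry is sent.
function retryOffsets(delays) {
  let elapsed = 0;
  return delays.map((delay) => {
    elapsed += delay;
    return elapsed;
  });
}

console.log(retryOffsets(RETRY_DELAYS_SECONDS)); // [ 5, 35, 335 ]
```

Under that assumption, an endpoint that stays down for more than about five and a half minutes misses every retry, which is why the delivery logs matter for recovering events.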
The key pattern: acknowledge immediately with a 200 OK, then process asynchronously. If your webhook handler takes more than a few seconds to respond, Dokyumi treats it as failed and retries. Offload heavy processing to a job queue:
app.post('/webhooks/dokyumi', verifySignature, (req, res) => {
  res.status(200).json({ received: true }); // Always ack immediately
  queue.add('process-dokyumi-event', req.body, {
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
  });
});
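If you are not using a queue library, the retry behavior configured above can be approximated by hand. This is a minimal sketch, assuming "exponential" means the delay doubles on each retry starting from the base delay; your queue library may define it differently, and `backoffDelays` and `runWithRetries` are hypothetical helpers.

```javascript
// Delay before each retry, doubling from the base delay.
// With 3 attempts there are only 2 retries, so 2 delays.
function backoffDelays(attempts, baseDelayMs) {
  const delays = [];
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Runs `job` up to `attempts` times, sleeping between failed attempts.
// `sleep` is injectable so tests do not have to wait for real timers.
async function runWithRetries(job, { attempts, baseDelayMs, sleep = defaultSleep }) {
  const delays = backoffDelays(attempts, baseDelayMs);
  for (let attempt = 1; ; attempt++) {
    try {
      return await job(attempt);
    } catch (err) {
      if (attempt >= attempts) throw err; // out of attempts: surface the error
      await sleep(delays[attempt - 1]);
    }
  }
}
```

An in-process loop like this loses pending work if the server restarts, so a persistent job queue remains the safer choice in production.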
Idempotency
Because webhooks can be delivered more than once (network failures, retries), your handler needs to be idempotent. Use the event id field to deduplicate:
async function processExtraction(event) {
  // A unique index on event_id closes the gap between this check and the insert below
  const existing = await db.processedEvents.findOne({ event_id: event.id });
  if (existing) return; // Already handled
  await db.processedEvents.create({ event_id: event.id });
  await doYourThing(event.data);
}
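The same deduplication idea can be sketched without a database. Here a Set stands in for a table with a unique index on the event id, and the claim is released if processing fails so that a later retry can still run. `claimEvent` and `handleOnce` are hypothetical helpers, and releasing the claim on failure is a design choice rather than something the pattern requires.

```javascript
// Stand-in for a processed-events table with a unique index on event_id.
const processedEventIds = new Set();

// Returns true only for the first caller to claim a given event id.
function claimEvent(eventId) {
  if (processedEventIds.has(eventId)) return false;
  processedEventIds.add(eventId);
  return true;
}

// Runs `handler` at most once per event id.
async function handleOnce(event, handler) {
  if (!claimEvent(event.id)) return 'duplicate';
  try {
    await handler(event.data);
    return 'processed';
  } catch (err) {
    processedEventIds.delete(event.id); // release the claim so a retry can run
    throw err;
  }
}
```

An in-memory Set only works for a single process; with several workers, the claim has to live in shared storage.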
Testing Your Webhook Locally
Use the Dokyumi dashboard to replay webhook events against a local endpoint. Expose your local server with ngrok first:
ngrok http 3000
# Paste the HTTPS URL into Dokyumi dashboard -> Schema -> Webhooks -> Test delivery
You can also test signature verification directly without going through Dokyumi:
const crypto = require('crypto');
const WEBHOOK_SECRET = process.env.DOKYUMI_WEBHOOK_SECRET;

const testPayload = JSON.stringify({
  event: 'extraction.completed',
  id: 'evt_test_123',
  data: { /* ... */ }
});
const sig = 'sha256=' + crypto
  .createHmac('sha256', WEBHOOK_SECRET)
  .update(testPayload)
  .digest('hex');
// POST testPayload to your local endpoint with X-Dokyumi-Signature: sig
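You can also check the signing and verification logic together, with no server involved. The sketch below assumes the `sha256=<hex digest>` header format used throughout this guide; `signPayload` and `verifyPayload` are illustrative helper names.

```javascript
const crypto = require('crypto');

// Produces the header value for a payload, in the form "sha256=<hex digest>".
function signPayload(payload, secret) {
  return 'sha256=' + crypto.createHmac('sha256', secret).update(payload).digest('hex');
}

// Compares the received header to the expected one in constant time.
// The length check comes first because timingSafeEqual throws on unequal lengths.
function verifyPayload(payload, header, secret) {
  const expected = Buffer.from(signPayload(payload, secret));
  const received = Buffer.from(header || '');
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}

const secret = 'test-secret'; // test value only
const payload = JSON.stringify({ event: 'extraction.completed', id: 'evt_test_123' });
const header = signPayload(payload, secret);

console.log(verifyPayload(payload, header, secret));       // true
console.log(verifyPayload(payload + ' ', header, secret)); // false: body was altered
console.log(verifyPayload(payload, undefined, secret));    // false: header missing
```

The altered-body case shows why the handler has to verify the raw request bytes rather than a re-serialized copy of the parsed JSON.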
When to Use Webhooks vs Polling
Webhooks are the right choice when you're processing more than ~50 documents per day, your pipeline has downstream steps (notifications, DB updates, third-party API calls), or you need real-time response under 5 seconds from upload to downstream action.
Polling is fine in an interactive UI where the user is waiting for a result, or at very low volume where simplicity matters more than efficiency. For server-to-server automation, webhooks are almost always the right answer.
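To put rough numbers on the trade-off, you can estimate what polling costs per document. This sketch assumes the first poll happens one interval after upload and that polls repeat at a fixed interval; the inputs in the examples are hypothetical apart from the 1842 ms processing time taken from the sample payload.

```javascript
// Estimates the number of status requests per document and the extra wait
// between extraction finishing and the poll that notices it.
function pollingCost(processingMs, pollIntervalMs) {
  const polls = Math.max(1, Math.ceil(processingMs / pollIntervalMs));
  return { polls, addedLatencyMs: polls * pollIntervalMs - processingMs };
}

console.log(pollingCost(1842, 2000)); // { polls: 1, addedLatencyMs: 158 }
console.log(pollingCost(5000, 2000)); // { polls: 3, addedLatencyMs: 1000 }
```

Multiply `polls` by your daily document count to get the request volume that a webhook replaces with a single delivery per document.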
Getting Started
Webhook support is available on Dokyumi's Growth plan ($499/mo) and above. Growth includes unlimited webhooks, automatic retry logic, and delivery logs so you can debug failed deliveries.
If you're building a production pipeline that processes documents at volume, webhooks alone justify the upgrade — you eliminate polling overhead entirely and get sub-second downstream triggers on every extraction.
Open your Dokyumi dashboard and head to Schema settings to configure your first webhook.