API Reference
Base URL: https://dokyumi.com/api/v1
All endpoints return JSON. Authentication required on all requests.
Quickstart
From zero to your first extraction in under 5 minutes.
Create an account
Sign up at dokyumi.com/dashboard. Free tier includes 100 extractions/month — no card required.
Define your schema
Go to Schemas → New Schema. Describe your document type and the fields you want to extract. AI will infer the full schema for you. Give it a slug like invoice-parser.
Generate an API key
Go to API Keys → New Key. Copy it now — it won't be shown again.
Send your first extraction
curl -X POST https://dokyumi.com/api/v1/extract \
  -H "Authorization: Bearer dk_live_your_api_key" \
  -F "file=@invoice.pdf" \
  -F "schema=invoice-parser"
Handle the response
Check status: completed means all fields were extracted with high confidence, and review means some fields need verification. Your structured data is in data.
Authentication
All requests require an API key passed as a Bearer token. Keys follow the format dk_live_*.
Authorization: Bearer dk_live_your_api_key_here
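A client can catch a malformed key before it costs a round trip and a 401. The following is a minimal sketch, assuming only the dk_live_ prefix rule stated above; the helper name auth_headers is hypothetical and not part of any Dokyumi SDK.

```python
API_KEY_PREFIX = "dk_live_"


def auth_headers(api_key: str) -> dict[str, str]:
    """Build the Authorization header, rejecting keys with the wrong prefix."""
    key = api_key.strip()  # stray whitespace is a common copy-paste problem
    if not key.startswith(API_KEY_PREFIX) or len(key) == len(API_KEY_PREFIX):
        raise ValueError("API key must start with 'dk_live_' followed by the key body")
    return {"Authorization": f"Bearer {key}"}
```

The returned dict can be passed as the headers argument of whichever HTTP client you use. The check only validates the format; a well-formed key can still be revoked or scoped.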
Scoped keys
Keys can be scoped to specific schemas. A request made with a scoped key returns a 403 if it targets a schema outside the key's scope. Create scoped keys for multi-tenant apps where each customer should only access their own schemas.
Unscoped keys (the default) can access all schemas in your organization.
/api/v1/extract
Extract structured data from a document using a predefined schema. Returns a fully validated JSON object with per-field confidence scores.
Request
Send a multipart/form-data request.
| Field | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | PDF, PNG, JPG, or TIFF. Max 20MB. |
| schema | String | No | Schema slug. Defaults to your first active schema if omitted. |
Code examples
cURL
curl -X POST https://dokyumi.com/api/v1/extract \
  -H "Authorization: Bearer dk_live_your_api_key" \
  -F "file=@invoice.pdf" \
  -F "schema=invoice-parser"
JavaScript / TypeScript
async function extractDocument(file: File, schema: string) {
  const formData = new FormData()
  formData.append('file', file)
  formData.append('schema', schema)

  const response = await fetch('https://dokyumi.com/api/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer dk_live_your_api_key'
    },
    body: formData
  })

  if (!response.ok) {
    const err = await response.json()
    throw new Error(err.error)
  }

  const result: ExtractionResult = await response.json()

  if (result.status === 'review') {
    // Some fields need manual verification
    console.log('Low confidence fields:', result.validation.low_confidence_fields)
  }

  return result.data
}
Python
import requests

def extract_document(file_path: str, schema: str, api_key: str) -> dict:
    with open(file_path, 'rb') as f:
        response = requests.post(
            'https://dokyumi.com/api/v1/extract',
            headers={'Authorization': f'Bearer {api_key}'},
            files={'file': f},
            data={'schema': schema}
        )
    response.raise_for_status()
    result = response.json()
    if result['status'] == 'review':
        print(f"Review needed: {result['validation']['low_confidence_fields']}")
    return result['data']

# Example
data = extract_document(
    'invoice.pdf',
    schema='invoice-parser',
    api_key='dk_live_your_api_key'
)
print(data['vendor_name'], data['total_amount'])
Node.js (server-side)
import { readFileSync } from 'fs'
import FormData from 'form-data'
import fetch from 'node-fetch'

async function extractDocument(filePath, schema) {
  const form = new FormData()
  form.append('file', readFileSync(filePath), 'document.pdf')
  form.append('schema', schema)

  const res = await fetch('https://dokyumi.com/api/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer dk_live_your_api_key',
      ...form.getHeaders()
    },
    body: form
  })

  if (!res.ok) throw new Error((await res.json()).error)
  return await res.json()
}
Response Format
Every successful extraction returns this structure. Check status first, then read your fields from data.
{
  "id": "ext_abc123",
  "status": "completed", // "completed" | "review" | "failed"
  "schema": "invoice-parser",
  // Your extracted fields: structure matches your schema definition
  "data": {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2026-001",
    "invoice_date": "2026-02-12",
    "total_amount": 1250.00,
    "line_items": [
      {
        "description": "Consulting Services",
        "quantity": 10,
        "unit_price": 125.00,
        "amount": 1250.00
      }
    ]
  },
  // Per-field confidence: 0.0 – 1.0
  "confidence": {
    "vendor_name": 0.98,
    "invoice_number": 0.95,
    "invoice_date": 0.99,
    "total_amount": 0.97,
    "line_items": 0.88
  },
  // Validation result (Zod schema)
  "validation": {
    "valid": true,
    "errors": [],
    "low_confidence_fields": [] // Fields below your schema's confidence threshold
  },
  "meta": {
    "processing_time_ms": 3420,
    "page_count": 1,
    "ocr_cached": false, // true = OCR skipped, saves time and cost
    "model": "claude-sonnet-4-20250514"
  }
}
| Field | Type | Description |
|---|---|---|
| id | string | Unique extraction ID. Prefix ext_. |
| status | string | completed · review · failed |
| data | object | Your extracted fields. Structure matches your schema definition. |
| confidence | object | Per-field confidence scores (0.0–1.0). Scores below your schema's threshold trigger review status. |
| validation.valid | boolean | Whether the extracted data passed schema type validation. |
| validation.errors | array | Zod validation issues if valid is false. |
| meta.ocr_cached | boolean | If true, OCR was served from cache (faster, lower cost). |
Handling review Status
An extraction returns status: "review" when:
- One or more fields have confidence scores below your schema's threshold (default: 0.7)
- The extracted data fails Zod type validation (e.g., a date field returned a non-date string)
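The two conditions above can also be checked on the client, for example to apply a stricter threshold than the schema's without changing the schema. The following is a minimal sketch, assuming the response shape documented in Response Format; the function name review_reasons and its return shape are hypothetical.

```python
from typing import Any

DEFAULT_THRESHOLD = 0.7  # the default schema threshold stated above


def review_reasons(result: dict[str, Any], threshold: float = DEFAULT_THRESHOLD) -> dict[str, list]:
    """Return the low-confidence fields and validation errors found in a result."""
    confidence = result.get("confidence", {})
    # A field is flagged only when its score is strictly below the threshold.
    low = sorted(name for name, score in confidence.items() if score < threshold)
    validation = result.get("validation", {})
    # Validation errors only count when the API marked the data as invalid.
    errors = [] if validation.get("valid", True) else list(validation.get("errors", []))
    return {"low_confidence_fields": low, "validation_errors": errors}
```

Calling review_reasons(result, threshold=0.9) would flag every field scoring under 0.9, regardless of the status the API returned. Whether the server treats a score exactly equal to the threshold as passing is an assumption here.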
The data object is still populated — the extraction ran. But you should treat flagged fields with extra scrutiny. A common pattern:
const result = await extract(file, schema)

if (result.status === 'completed') {
  // All fields extracted with high confidence
  await saveToDatabase(result.data)
} else if (result.status === 'review') {
  // Flag specific fields for human verification
  const lowConfidenceFields = result.validation.low_confidence_fields
  // e.g. ["invoice_date", "total_amount"]
  await saveToDatabase({
    ...result.data,
    _needs_review: true,
    _review_fields: lowConfidenceFields
  })
  await notifyReviewer(result.id, lowConfidenceFields)
} else {
  // status === 'failed'
  throw new Error('Extraction failed: document may be unreadable')
}
The confidence threshold is set per schema (default: 0.7). Setting it lower reduces review flags for documents where some ambiguity is acceptable. Setting it higher (e.g., 0.9) catches more edge cases.
TypeScript Types
Copy these into your project for full type safety.
// Paste into dokyumi.d.ts or your types file
export interface ExtractionResult<T = Record<string, unknown>> {
  id: string
  status: 'completed' | 'review' | 'failed'
  schema: string
  data: T
  confidence: Record<keyof T & string, number>
  validation: {
    valid: boolean
    errors: ValidationError[]
    low_confidence_fields: string[]
  }
  meta: {
    processing_time_ms: number
    page_count: number
    ocr_cached: boolean
    model: string
  }
}

export interface ValidationError {
  code: string
  message: string
  path: (string | number)[]
}

export interface ApiError {
  error: string
}

// Example: typed extraction for invoices
interface InvoiceData {
  vendor_name: string
  invoice_number: string
  invoice_date: string // ISO 8601
  total_amount: number
  line_items: {
    description: string
    quantity: number
    unit_price: number
    amount: number
  }[]
}

type InvoiceExtraction = ExtractionResult<InvoiceData>
Error Codes
All errors return JSON with an error field explaining what went wrong.
{ "error": "Monthly extraction limit reached. Upgrade to continue." }
| Status | Error message | Cause & fix |
|---|---|---|
| 400 | Missing file | No file was attached. Add a file field to your multipart request. |
| 400 | No schema found | The schema slug doesn't exist or is inactive. Check the slug in your dashboard. |
| 401 | Missing or invalid API key | The Authorization header is missing or doesn't start with Bearer dk_live_. |
| 401 | Invalid API key | Key not found or revoked. Generate a new key in your dashboard. |
| 402 | Monthly extraction limit reached | You've hit your plan's monthly limit. Upgrade or wait for the next billing cycle. |
| 403 | API key not authorized for this schema | You're using a scoped key that doesn't include this schema. Use an unscoped key or update the key's scopes. |
| 429 | Rate limit exceeded | Too many requests per minute. Implement exponential backoff and retry after the Retry-After header value. |
| 500 | Extraction failed | Unexpected error during processing. Retry once — if it persists, the document may be corrupted or unreadable. |
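The table implies a simple retry policy: 429 and 500 are worth retrying, and the other listed errors need a change on the caller's side first. The following is a minimal sketch of that policy; retry_delay and its parameters are hypothetical names, and it assumes Retry-After carries a whole number of seconds (an HTTP date falls back to plain backoff).

```python
from typing import Optional

RETRYABLE_STATUSES = {429, 500}


def retry_delay(status: int, attempt: int, retry_after: Optional[str] = None,
                base_seconds: float = 1.0, max_seconds: float = 60.0) -> Optional[float]:
    """Seconds to wait before retrying, or None if the request should not be retried."""
    if status not in RETRYABLE_STATUSES:
        return None  # 400, 401, 402, 403: fix the request, key, or plan instead
    # Exponential backoff: base, 2x base, 4x base, ... capped at max_seconds.
    backoff = min(base_seconds * (2 ** attempt), max_seconds)
    if status == 429 and retry_after is not None:
        try:
            # Never wait less than the server asked for.
            return float(max(int(retry_after), backoff))
        except ValueError:
            pass  # not a number of seconds; use the computed backoff
    return float(backoff)
```

The caller still decides how many attempts to make. For a 500, the table suggests a single retry before treating the document as unreadable.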
Retry logic
async function extractWithRetry(file, schema, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch('https://dokyumi.com/api/v1/extract', {
      method: 'POST',
      headers: { 'Authorization': 'Bearer dk_live_...' },
      body: buildFormData(file, schema)
    })

    // Don't retry client errors (4xx) other than 429: they won't resolve on their own
    if (res.status >= 400 && res.status < 500 && res.status !== 429) {
      const err = await res.json()
      throw new Error(err.error)
    }

    if (res.ok) return await res.json()

    // Retry 429 and 5xx with exponential backoff, waiting at least Retry-After seconds if sent
    if (attempt < maxRetries - 1) {
      const retryAfterMs = (Number(res.headers.get('Retry-After')) || 0) * 1000
      await new Promise(r => setTimeout(r, Math.max(retryAfterMs, 1000 * Math.pow(2, attempt))))
    }
  }
  throw new Error('Max retries exceeded')
}
Rate Limits
| Plan | Extractions/mo | Requests/min | Schemas | White-label sites |
|---|---|---|---|---|
| Free | 100 | 10 | 2 | — |
| Starter $79/mo | 1,000 | 30 | 10 | — |
| Growth $499/mo | 10,000 | 120 | 50 | 10 |
| Enterprise $1,299/mo | 50,000 | 300 | Unlimited | Unlimited |
Rate limits are enforced per API key, not per organization. Need higher limits? Contact us.
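Because the limit is counted per key, a client can pace itself and avoid most 429 responses. The following is a minimal sketch of a sliding-window throttle, with one instance per API key; the class name RequestThrottle is hypothetical, and how the server actually measures its one-minute window is not documented here, so treat the window handling as an assumption.

```python
import time
from collections import deque
from typing import Callable


class RequestThrottle:
    """Sliding-window limiter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit: int, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self._sent = deque()  # timestamps of requests still inside the window

    def wait_time(self) -> float:
        """Seconds to wait before the next request is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.limit:
            return 0.0
        # The oldest request must age out before another one fits.
        return self.window - (now - self._sent[0])

    def record(self) -> None:
        """Record that a request was just sent."""
        self._sent.append(self.clock())
```

On the Free plan this would be RequestThrottle(limit=10): sleep for wait_time() seconds, send the request, then call record(). A 429 can still occur if several processes share one key, so server-side retry handling remains necessary.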
File Requirements
Accepted formats
- ✓ PDF (.pdf)
- ✓ JPEG (.jpg, .jpeg)
- ✓ PNG (.png)
- ✓ TIFF (.tiff)
Size & quality
- Max file size: 20MB
- Recommended image resolution: 150+ DPI
- Multi-page PDFs: supported
- Scanned documents: supported
- Handwritten text: limited accuracy
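These limits can be checked locally before uploading. The following is a minimal sketch, assuming that "20MB" means 20,000,000 bytes (the conservative reading, since the docs do not say whether it is decimal or binary) and that only the listed extensions are accepted; preflight_errors is a hypothetical helper name.

```python
from pathlib import Path

MAX_BYTES = 20_000_000  # assumption: decimal megabytes, the stricter interpretation
ALLOWED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".tiff"}


def preflight_errors(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems that would likely cause the upload to be rejected."""
    problems = []
    ext = Path(filename).suffix.lower()  # extension check is case-insensitive
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems
```

An empty list means the file passes both checks. The check looks only at the file name and size, not the content, and it treats .tif as unsupported because only .tiff is listed above.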
OCR modes
Schemas can be configured in two OCR modes (set per-schema in your dashboard):
| Mode | Pipeline | Best for |
|---|---|---|
| standard | Mistral OCR → Claude | Most documents. Faster, cheaper due to OCR caching on repeat files. |
| vision | Claude Vision (direct) | Complex layouts, charts, mixed content where visual context matters. |
In standard mode, Dokyumi caches OCR results by file hash. If you extract the same document twice (same bytes), the OCR step is skipped on the second call, which gives a faster response and lower cost. The meta.ocr_cached field tells you when this happens.
/api/v1/schemas
List all active schemas available to your API key. Useful for dynamic schema selection or verifying which schemas a scoped key can access.
curl https://dokyumi.com/api/v1/schemas \
  -H "Authorization: Bearer dk_live_your_api_key"
// Response
{
  "schemas": [
    {
      "slug": "invoice-parser",
      "name": "Invoice Parser",
      "description": "Extracts vendor, amounts, and line items from invoices",
      "ocr_mode": "standard",
      "fields": [
        { "key": "vendor_name", "type": "string", "required": true },
        { "key": "invoice_number", "type": "string", "required": true },
        { "key": "total_amount", "type": "currency", "required": true },
        { "key": "line_items", "type": "array", "required": false }
      ],
      "extraction_count": 142
    }
  ]
}
Webhooks
Webhooks fire when a document is submitted through a white-label upload portal (Growth and Enterprise plans). Configure your webhook URL in Sites → Settings.
Dokyumi sends a POST request to your URL with an HMAC-SHA256 signature in the X-Dokyumi-Signature header.
Webhook payload
// Headers
X-Dokyumi-Signature: sha256_hmac_hex_string
X-Dokyumi-Event: extraction.completed
Content-Type: application/json

// Body
{
  "event": "extraction.completed",
  "extraction_id": "ext_abc123",
  "site_id": "site_xyz789",
  "timestamp": "2026-03-15T04:15:00.000Z",
  "data": {
    // Your extracted fields, same as the data object in the API response
    "vendor_name": "Acme Corp",
    "total_amount": 1250.00
  }
}
Verifying signatures
Node.js
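const crypto = require('crypto')
const express = require('express')

const app = express()

function verifyWebhook(rawBody, signature, secret) {
  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody, 'utf8')
    .digest('hex')
  const received = Buffer.from(signature || '', 'hex')
  const expectedBuf = Buffer.from(expected, 'hex')
  // timingSafeEqual throws on a length mismatch, so reject those up front
  if (received.length !== expectedBuf.length) return false
  return crypto.timingSafeEqual(received, expectedBuf)
}

// Express example: express.raw keeps the body as the exact bytes that were signed
app.post('/webhook', express.raw({ type: 'application/json' }), (req, res) => {
  const sig = req.headers['x-dokyumi-signature']
  if (!verifyWebhook(req.body, sig, process.env.WEBHOOK_SECRET)) {
    return res.status(400).send('Invalid signature')
  }
  const payload = JSON.parse(req.body)
  // Handle payload.data
  res.sendStatus(200)
})
Python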
import hashlib
import hmac
import json
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]

def verify_webhook(raw_body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        raw_body,
        hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched
    return hmac.compare_digest(expected, signature)

# FastAPI example
@app.post("/webhook")
async def webhook(request: Request):
    raw = await request.body()
    sig = request.headers.get("x-dokyumi-signature", "")
    if not verify_webhook(raw, sig, WEBHOOK_SECRET):
        raise HTTPException(status_code=400, detail="Invalid signature")
    payload = json.loads(raw)
    # Handle payload["data"]
    return {"ok": True}
Something missing from these docs?
Email support@dokyumi.com — we usually respond same day.