API Reference

Base URL: https://dokyumi.com/api/v1

All endpoints return JSON. Authentication is required on all requests.

Quickstart

From zero to your first extraction in under 5 minutes.

1. Create an account

Sign up at dokyumi.com/dashboard. Free tier includes 100 extractions/month, no card required.

2. Define your schema

Go to Schemas → New Schema. Describe your document type and the fields you want to extract. AI will infer the full schema for you. Give it a slug like invoice-parser.

3. Generate an API key

Go to API Keys → New Key. Copy it now; it won't be shown again.

4. Send your first extraction

curl -X POST https://dokyumi.com/api/v1/extract \
  -H "Authorization: Bearer dk_live_your_api_key" \
  -F "file=@invoice.pdf" \
  -F "schema=invoice-parser"

5. Handle the response

Check status: completed means all fields were extracted with high confidence; review means some fields need verification. Your structured data is in data.

Authentication

All requests require an API key passed as a Bearer token. Keys follow the format dk_live_*.

Authorization: Bearer dk_live_your_api_key_here

Never expose your API key. Always make extraction requests server-side. If a key is compromised, rotate it immediately from the API Keys page.
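One way to keep the key out of source code is to read it from an environment variable on your server. The sketch below assumes a variable named DOKYUMI_API_KEY (a hypothetical name chosen for this example, not something Dokyumi defines) and only checks the documented dk_live_ prefix; it does not confirm that the key is valid.

```python
import os

KEY_PREFIX = "dk_live_"  # documented key format: dk_live_*


def auth_headers(env_var: str = "DOKYUMI_API_KEY") -> dict:
    """Build the Authorization header from an environment variable.

    The variable name is an assumption made for this example.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key.startswith(KEY_PREFIX):
        raise ValueError(f"{env_var} is missing or does not start with {KEY_PREFIX}")
    return {"Authorization": f"Bearer {api_key}"}
```

Pass the returned dict as the headers argument of your HTTP client so the key never appears in code or logs.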

Scoped keys

Keys can be scoped to specific schemas. A scoped key will return a 403 if used to call an unauthorized schema. Create scoped keys for multi-tenant apps where each customer should only access their own schemas.

Unscoped keys (the default) can access all schemas in your organization.
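In a multi-tenant app it can help to keep your own record of which schemas each key was created for, so an obvious mismatch is caught before a request is sent. The sketch below is purely client-side bookkeeping; KeyScope and its fields are hypothetical names, and the API remains the source of truth (it still returns 403 for an unauthorized schema).

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class KeyScope:
    """Local record of the schemas a key was scoped to (hypothetical helper)."""

    # None models an unscoped key, which can access every schema.
    allowed_schemas: Optional[FrozenSet[str]] = None

    def allows(self, schema_slug: str) -> bool:
        """True if this key is expected to be authorized for the schema."""
        return self.allowed_schemas is None or schema_slug in self.allowed_schemas


# Example: one scoped key per tenant, plus an unscoped internal key.
tenant_scope = KeyScope(frozenset({"invoice-parser"}))
internal_scope = KeyScope()
```

Checking tenant_scope.allows(slug) before calling the API avoids spending a request on a call that would be rejected.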

POST /api/v1/extract

Extract structured data from a document using a predefined schema. Returns a fully validated JSON object with per-field confidence scores.

Request

Send a multipart/form-data request.

Field    Type    Required  Description
file     File    Yes       PDF, PNG, JPG, or TIFF. Max 20MB.
schema   String  No        Schema slug. Defaults to your first active schema if omitted.

Code examples

cURL

curl -X POST https://dokyumi.com/api/v1/extract \
  -H "Authorization: Bearer dk_live_your_api_key" \
  -F "file=@invoice.pdf" \
  -F "schema=invoice-parser"

JavaScript / TypeScript

async function extractDocument(file: File, schema: string) {
  const formData = new FormData()
  formData.append('file', file)
  formData.append('schema', schema)

  const response = await fetch('https://dokyumi.com/api/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer dk_live_your_api_key'
    },
    body: formData
  })

  if (!response.ok) {
    const err = await response.json()
    throw new Error(err.error)
  }

  const result: ExtractionResult = await response.json()
  
  if (result.status === 'review') {
    // Some fields need manual verification
    console.log('Low confidence fields:', result.validation.low_confidence_fields)
  }

  return result.data
}

Python

import requests

def extract_document(file_path: str, schema: str, api_key: str) -> dict:
    with open(file_path, 'rb') as f:
        response = requests.post(
            'https://dokyumi.com/api/v1/extract',
            headers={'Authorization': f'Bearer {api_key}'},
            files={'file': f},
            data={'schema': schema}
        )
    
    response.raise_for_status()
    result = response.json()
    
    if result['status'] == 'review':
        print(f"Review needed: {result['validation']['low_confidence_fields']}")
    
    return result['data']

# Example
data = extract_document(
    'invoice.pdf',
    schema='invoice-parser',
    api_key='dk_live_your_api_key'
)
print(data['vendor_name'], data['total_amount'])

Node.js (server-side)

import { readFileSync } from 'fs'
import FormData from 'form-data'
import fetch from 'node-fetch'

async function extractDocument(filePath, schema) {
  const form = new FormData()
  form.append('file', readFileSync(filePath), 'document.pdf')
  form.append('schema', schema)

  const res = await fetch('https://dokyumi.com/api/v1/extract', {
    method: 'POST',
    headers: {
      'Authorization': 'Bearer dk_live_your_api_key',
      ...form.getHeaders()
    },
    body: form
  })

  if (!res.ok) throw new Error((await res.json()).error)
  return await res.json()
}

Response Format

Every successful extraction returns this structure. Check status first, then read your fields from data.

{
  "id": "ext_abc123",
  "status": "completed",          // "completed" | "review" | "failed"
  "schema": "invoice-parser",

  // Your extracted fields — structure matches your schema definition
  "data": {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2026-001",
    "invoice_date": "2026-02-12",
    "total_amount": 1250.00,
    "line_items": [
      {
        "description": "Consulting Services",
        "quantity": 10,
        "unit_price": 125.00,
        "amount": 1250.00
      }
    ]
  },

  // Per-field confidence: 0.0 – 1.0
  "confidence": {
    "vendor_name": 0.98,
    "invoice_number": 0.95,
    "invoice_date": 0.99,
    "total_amount": 0.97,
    "line_items": 0.88
  },

  // Validation result (Zod schema)
  "validation": {
    "valid": true,
    "errors": [],
    "low_confidence_fields": []   // Fields below your schema's confidence threshold
  },

  "meta": {
    "processing_time_ms": 3420,
    "page_count": 1,
    "ocr_cached": false,          // true = OCR skipped, saves time and cost
    "model": "claude-sonnet-4-20250514"
  }
}

Field              Type     Description
id                 string   Unique extraction ID. Prefix ext_.
status             string   completed · review · failed
data               object   Your extracted fields. Structure matches your schema definition.
confidence         object   Per-field confidence scores (0.0–1.0). Scores below your schema's threshold trigger review status.
validation.valid   boolean  Whether the extracted data passed schema type validation.
validation.errors  array    Zod validation issues if valid is false.
meta.ocr_cached    boolean  If true, OCR was served from cache (faster, lower cost).

Handling review Status

An extraction returns status: "review" when:

  • One or more fields have confidence scores below your schema's threshold (default: 0.7)
  • The extracted data fails Zod type validation (e.g., a date field returned a non-date string)

The data object is still populated — the extraction ran. But you should treat flagged fields with extra scrutiny. A common pattern:

const result = await extract(file, schema)

if (result.status === 'completed') {
  // All fields extracted with high confidence
  await saveToDatabase(result.data)

} else if (result.status === 'review') {
  // Flag specific fields for human verification
  const lowConfidenceFields = result.validation.low_confidence_fields
  // e.g. ["invoice_date", "total_amount"]
  
  await saveToDatabase({
    ...result.data,
    _needs_review: true,
    _review_fields: lowConfidenceFields
  })
  await notifyReviewer(result.id, lowConfidenceFields)

} else {
  // status === 'failed'
  throw new Error('Extraction failed — document may be unreadable')
}

Tip: Adjust your schema's confidence threshold in the dashboard. Setting it lower (e.g., 0.5) reduces review flags for documents where some ambiguity is acceptable. Setting it higher (e.g., 0.9) catches more edge cases.
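If you want a stricter or looser cut-off for one consumer without changing the schema, you can recompute the flagged fields from the confidence object yourself. This is a minimal sketch, assuming the response shape documented above; triage and low_confidence_fields are illustrative names, and the function deliberately ignores the server's own low_confidence_fields list in favor of the local threshold.

```python
from typing import Dict, List, Tuple


def low_confidence_fields(confidence: Dict[str, float], threshold: float = 0.7) -> List[str]:
    """Return the names of fields whose score is strictly below the threshold."""
    return sorted(name for name, score in confidence.items() if score < threshold)


def triage(result: dict, threshold: float = 0.7) -> Tuple[str, List[str]]:
    """Map an extraction result to ("accept" | "review" | "reject", flagged fields)."""
    if result["status"] == "failed":
        return "reject", []
    flagged = low_confidence_fields(result.get("confidence", {}), threshold)
    valid = result.get("validation", {}).get("valid", True)
    if flagged or not valid:
        return "review", flagged
    return "accept", []


sample = {
    "status": "completed",
    "confidence": {"vendor_name": 0.98, "total_amount": 0.97, "line_items": 0.88},
    "validation": {"valid": True, "errors": [], "low_confidence_fields": []},
}
```

With the sample above, the default threshold accepts the result, while a local threshold of 0.9 sends line_items to review.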

TypeScript Types

Copy these into your project for full type safety.

// Paste into dokyumi.d.ts or your types file

export interface ExtractionResult<T = Record<string, unknown>> {
  id: string
  status: 'completed' | 'review' | 'failed'
  schema: string
  data: T
  confidence: Record<keyof T & string, number>
  validation: {
    valid: boolean
    errors: ValidationError[]
    low_confidence_fields: string[]
  }
  meta: {
    processing_time_ms: number
    page_count: number
    ocr_cached: boolean
    model: string
  }
}

export interface ValidationError {
  code: string
  message: string
  path: (string | number)[]
}

export interface ApiError {
  error: string
}

// Example: typed extraction for invoices
interface InvoiceData {
  vendor_name: string
  invoice_number: string
  invoice_date: string       // ISO 8601
  total_amount: number
  line_items: {
    description: string
    quantity: number
    unit_price: number
    amount: number
  }[]
}

type InvoiceExtraction = ExtractionResult<InvoiceData>

Error Codes

All errors return JSON with an error field explaining what went wrong.

{ "error": "Monthly extraction limit reached. Upgrade to continue." }

Status  Error message                           Cause & fix
400     Missing file                            No file was attached. Add a file field to your multipart request.
400     No schema found                         The schema slug doesn't exist or is inactive. Check the slug in your dashboard.
401     Missing or invalid API key              The Authorization header is missing or doesn't start with Bearer dk_live_.
401     Invalid API key                         Key not found or revoked. Generate a new key in your dashboard.
402     Monthly extraction limit reached        You've hit your plan's monthly limit. Upgrade or wait for the next billing cycle.
403     API key not authorized for this schema  You're using a scoped key that doesn't include this schema. Use an unscoped key or update the key's scopes.
429     Rate limit exceeded                     Too many requests per minute. Implement exponential backoff and retry after the Retry-After header value.
500     Extraction failed                       Unexpected error during processing. Retry once; if it persists, the document may be corrupted or unreadable.

Retry logic

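// buildFormData(file, schema) is your own helper that returns a FormData
// with the file and schema fields set.
async function extractWithRetry(file, schema, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch('https://dokyumi.com/api/v1/extract', {
      method: 'POST',
      headers: { 'Authorization': 'Bearer dk_live_...' },
      body: buildFormData(file, schema)
    })

    if (res.ok) return await res.json()

    // Don't retry client errors (4xx) other than 429: they won't resolve on their own
    if (res.status >= 400 && res.status < 500 && res.status !== 429) {
      const err = await res.json()
      throw new Error(err.error)
    }

    // Retry 429 and 5xx. Use Retry-After when present (assumed to be in seconds),
    // otherwise fall back to exponential backoff.
    if (attempt < maxRetries - 1) {
      const retryAfter = Number(res.headers.get('Retry-After'))
      const delayMs = retryAfter > 0 ? retryAfter * 1000 : 1000 * Math.pow(2, attempt)
      await new Promise(r => setTimeout(r, delayMs))
    }
  }
  throw new Error('Max retries exceeded')
}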
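The error table recommends backing off on 429 and waiting for the Retry-After value. A Python sketch of that policy follows. To keep it independent of any HTTP library, the request itself is passed in as a callable (send) that returns a status code, an optional Retry-After value in seconds, and the parsed JSON body; that callable, the tuple shape, and the assumption that Retry-After is expressed in seconds are all choices made for this example.

```python
import time
from typing import Callable, Optional, Tuple

# (status_code, retry_after_seconds or None, parsed JSON body)
Response = Tuple[int, Optional[float], dict]


def backoff_delay(attempt: int, retry_after: Optional[float] = None, base: float = 1.0) -> float:
    """Seconds to wait before the next attempt (attempt is 0-based)."""
    if retry_after is not None and retry_after > 0:
        return retry_after
    return base * (2 ** attempt)


def extract_with_retry(
    send: Callable[[], Response],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send() until it succeeds, retrying only 429 and 5xx responses."""
    for attempt in range(max_retries):
        status, retry_after, body = send()
        if 200 <= status < 300:
            return body
        retryable = status == 429 or status >= 500
        if not retryable:
            # Other 4xx errors will not resolve on their own.
            raise RuntimeError(body.get("error", f"HTTP {status}"))
        if attempt < max_retries - 1:
            sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError("Max retries exceeded")
```

Injecting sleep makes the policy easy to unit-test without real delays; in production the default time.sleep is used.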

Rate Limits

Plan                   Extractions/mo  Requests/min  Schemas    White-label sites
Free                   100             10            2          -
Starter $79/mo         1,000           30            10         -
Growth $499/mo         10,000          120           50         10
Enterprise $1,299/mo   50,000          300           Unlimited  Unlimited

Rate limits are enforced per API key, not per organization. Need higher limits? Contact us.
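Because the limit applies per key, a small client-side limiter per key can keep you from hitting 429 in the first place. The sketch below uses a 60-second sliding window; the docs do not say how the server windows its limit, so treat the window type as an assumption and leave some headroom. MinuteRateLimiter is an illustrative name.

```python
import time
from collections import deque
from typing import Callable, Deque


class MinuteRateLimiter:
    """Sliding-window limiter: at most `limit` calls in any 60-second window."""

    def __init__(self, limit: int, clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.clock = clock
        self.sent: Deque[float] = deque()  # timestamps of recent requests

    def wait_time(self) -> float:
        """Seconds until the next request may be sent (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have left the 60-second window.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return 60.0 - (now - self.sent[0])

    def record(self) -> None:
        """Call once for every request actually sent."""
        self.sent.append(self.clock())
```

Before each extraction, sleep for limiter.wait_time() seconds and then call limiter.record(). Use one limiter instance per API key, with limit set to your plan's Requests/min value.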

File Requirements

Accepted formats

  • PDF (.pdf)
  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • TIFF (.tiff)

Size & quality

  • Max file size: 20MB
  • Recommended image resolution: 150+ DPI
  • Multi-page PDFs: supported
  • Scanned documents: supported
  • Handwritten text: limited accuracy
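Checking a file locally before upload avoids a round trip for requests that would be rejected anyway. This is a minimal sketch based on the limits listed above; it assumes "20MB" means 20 × 1024 × 1024 bytes (the docs do not specify), checks only the file extension rather than the actual content, and check_upload is an illustrative name.

```python
import os

ACCEPTED_EXTENSIONS = {".pdf", ".jpg", ".jpeg", ".png", ".tiff"}
MAX_BYTES = 20 * 1024 * 1024  # assumption: "20MB" interpreted as 20 MiB


def check_upload(path: str) -> None:
    """Raise ValueError if the file would likely be rejected by the API."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("File is empty")
    if size > MAX_BYTES:
        raise ValueError(f"File is {size} bytes; the limit is {MAX_BYTES}")
```

Call check_upload(path) before opening the file for the multipart request and surface the ValueError message to the user.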

OCR modes

Schemas can be configured in two OCR modes (set per-schema in your dashboard):

Mode      Pipeline                 Best for
standard  Mistral OCR → Claude     Most documents. Faster, cheaper due to OCR caching on repeat files.
vision    Claude Vision (direct)   Complex layouts, charts, mixed content where visual context matters.

OCR caching: In standard mode, Dokyumi caches OCR results by file hash. If you extract the same document twice (same bytes), the OCR step is skipped on the second call, giving a faster response and lower cost. The meta.ocr_cached field tells you when this fires.
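If you want to know on your side whether a file is a byte-for-byte repeat (for example, to skip re-submitting it at all), you can hash it locally. The sketch below uses SHA-256, which is an assumption: the docs say only "file hash" and do not name the algorithm Dokyumi uses, so this predicts repeats in your own process rather than mirroring the server's cache. SeenFiles is an illustrative name.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


class SeenFiles:
    """Tracks digests of files already submitted in this process (in-memory)."""

    def __init__(self) -> None:
        self._digests = set()

    def is_repeat(self, path: str) -> bool:
        """Return True if identical bytes were seen before, then remember them."""
        digest = file_digest(path)
        repeat = digest in self._digests
        self._digests.add(digest)
        return repeat
```

Two files with different names but identical bytes produce the same digest, which matches the "same bytes" condition described above.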

GET /api/v1/schemas

List all active schemas available to your API key. Useful for dynamic schema selection or verifying which schemas a scoped key can access.

curl https://dokyumi.com/api/v1/schemas \
  -H "Authorization: Bearer dk_live_your_api_key"

// Response
{
  "schemas": [
    {
      "slug": "invoice-parser",
      "name": "Invoice Parser",
      "description": "Extracts vendor, amounts, and line items from invoices",
      "ocr_mode": "standard",
      "fields": [
        { "key": "vendor_name",    "type": "string",   "required": true },
        { "key": "invoice_number", "type": "string",   "required": true },
        { "key": "total_amount",   "type": "currency", "required": true },
        { "key": "line_items",     "type": "array",    "required": false }
      ],
      "extraction_count": 142
    }
  ]
}
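The fields list can be used to sanity-check extracted data on your side. This is a minimal sketch that works on the parsed JSON shown above; required_keys and missing_required are illustrative helper names, and treating a None value as missing is a choice made for this example.

```python
from typing import List


def required_keys(schemas_response: dict, slug: str) -> List[str]:
    """Return the keys marked required for the schema with the given slug."""
    for schema in schemas_response.get("schemas", []):
        if schema.get("slug") == slug:
            return [f["key"] for f in schema.get("fields", []) if f.get("required")]
    raise KeyError(f"Schema {slug!r} is not available to this API key")


def missing_required(data: dict, schemas_response: dict, slug: str) -> List[str]:
    """Return required keys that are absent (or None) in the extracted data."""
    return [k for k in required_keys(schemas_response, slug) if data.get(k) is None]


schemas_response = {
    "schemas": [{
        "slug": "invoice-parser",
        "fields": [
            {"key": "vendor_name", "type": "string", "required": True},
            {"key": "invoice_number", "type": "string", "required": True},
            {"key": "total_amount", "type": "currency", "required": True},
            {"key": "line_items", "type": "array", "required": False},
        ],
    }]
}
```

With a scoped key, the KeyError path doubles as a check that the key can see the schema you intend to call.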

Webhooks

Webhooks fire when a document is submitted through a white-label upload portal (Growth and Enterprise plans). Configure your webhook URL in Sites → Settings.

Dokyumi sends a POST request to your URL with an HMAC-SHA256 signature in the X-Dokyumi-Signature header.

Webhook payload

// Headers
X-Dokyumi-Signature: sha256_hmac_hex_string
X-Dokyumi-Event: extraction.completed
Content-Type: application/json

// Body
{
  "event": "extraction.completed",
  "extraction_id": "ext_abc123",
  "site_id": "site_xyz789",
  "timestamp": "2026-03-15T04:15:00.000Z",
  "data": {
    // Your extracted fields — same as the data object in the API response
    "vendor_name": "Acme Corp",
    "total_amount": 1250.00
  }
}
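Once the signature has been verified, the body can be parsed into a typed object. The sketch below assumes the payload fields shown above; WebhookEvent and parse_event are illustrative names. The timestamp's trailing "Z" is rewritten because datetime.fromisoformat rejects it before Python 3.11.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WebhookEvent:
    event: str
    extraction_id: str
    site_id: str
    timestamp: datetime
    data: dict


def parse_event(raw_body: bytes) -> WebhookEvent:
    """Parse a webhook body; raises KeyError if a documented field is missing."""
    payload = json.loads(raw_body)
    # Convert the trailing "Z" to an explicit UTC offset for older Pythons.
    ts = datetime.fromisoformat(payload["timestamp"].replace("Z", "+00:00"))
    return WebhookEvent(
        event=payload["event"],
        extraction_id=payload["extraction_id"],
        site_id=payload["site_id"],
        timestamp=ts,
        data=payload["data"],
    )
```

Always verify X-Dokyumi-Signature against the raw bytes before calling parse_event, since parsing first would mean acting on unauthenticated input.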

Verifying signatures

Node.js

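const crypto = require('crypto')
const express = require('express')

const app = express()

function verifyWebhook(rawBody, signature, secret) {
  // A missing header can never match
  if (typeof signature !== 'string') return false

  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest()
  const received = Buffer.from(signature, 'hex')

  // timingSafeEqual throws if the lengths differ, so check that first
  if (received.length !== expected.length) return false
  return crypto.timingSafeEqual(received, expected)
}

// Express example: express.raw keeps req.body as a Buffer,
// so the HMAC is computed over the exact bytes that were sent
app.post('/webhook', express.raw({ type: 'application/json' }), (req, res) => {
  const sig = req.headers['x-dokyumi-signature']
  if (!verifyWebhook(req.body, sig, process.env.WEBHOOK_SECRET)) {
    return res.status(400).send('Invalid signature')
  }
  const payload = JSON.parse(req.body.toString('utf8'))
  // Handle payload.data
  res.sendStatus(200)
})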

Python

import hashlib
import hmac
import json
import os

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]  # your site's webhook secret

def verify_webhook(raw_body: bytes, signature: str, secret: str) -> bool:
    expected = hmac.new(
        secret.encode(),
        raw_body,
        hashlib.sha256
    ).hexdigest()
    # Compare as bytes: compare_digest raises on non-ASCII str input
    return hmac.compare_digest(expected.encode(), signature.encode())

# FastAPI example
@app.post("/webhook")
async def webhook(request: Request):
    raw = await request.body()
    sig = request.headers.get("x-dokyumi-signature", "")
    if not verify_webhook(raw, sig, WEBHOOK_SECRET):
        raise HTTPException(status_code=400, detail="Invalid signature")
    payload = json.loads(raw)
    # Handle payload["data"]
    return {"ok": True}

Retries: If your endpoint returns a non-2xx status, Dokyumi retries delivery up to 3 times with exponential backoff (1m, 5m, 30m). After that, the webhook is marked failed and remains visible in your site's submission history. You can replay it manually from the dashboard.
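Automatic retries and manual replays mean the same event may reach your endpoint more than once, for example if your handler finished its work but responded with an error. The docs do not state a delivery guarantee, so the sketch below is a defensive pattern rather than a requirement: it runs a handler at most once per extraction_id. OnceOnly is an illustrative name, and the in-memory set would need to be replaced with a durable store (such as a database unique constraint) in a real deployment.

```python
from typing import Callable, Set


class OnceOnly:
    """Runs a handler at most once per extraction_id (in-memory only)."""

    def __init__(self, handler: Callable[[dict], None]):
        self._handler = handler
        self._seen: Set[str] = set()

    def handle(self, payload: dict) -> bool:
        """Return True if the handler ran, False if the delivery was a repeat."""
        key = payload["extraction_id"]
        if key in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a delivery whose
        # handler raised can still be processed when it is retried.
        self._seen.add(key)
        return True
```

Return a 2xx response for repeats as well, so Dokyumi stops retrying an event you have already processed.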

Something missing from these docs?

Email support@dokyumi.com — we usually respond same day.