How to Extract Data from Documents in Java: Dokyumi API Integration Guide
March 16, 2026
The Java Document Parsing Problem
Java developers working with document processing have it worse than Python or Node.js developers. The Java ecosystem for PDF and document parsing is old, fragmented, and painful: Apache PDFBox handles text extraction but not structured data, iText is AGPL-licensed unless you pay for a commercial license, Tika is a heavyweight dependency prone to classpath conflicts, and Tesseract via Tess4J requires native library management that often breaks in CI/CD pipelines.
The cleaner solution is calling a document parsing API over HTTP. Java speaks HTTP just fine. You get structured JSON back, you don't maintain any OCR or ML dependencies, and you can swap the underlying model without touching your code.
This guide covers integrating the Dokyumi document parsing API into a Java application — from basic extraction to production patterns with async processing and webhook delivery.
Prerequisites
- Java 11+ (uses java.net.http.HttpClient from the standard library)
- A Dokyumi API key — get one at dokyumi.com/dashboard
- A schema configured for the document type you want to extract
No additional dependencies required for the basic examples. The advanced async example uses a simple executor.
Basic Document Upload and Extraction
Dokyumi's API accepts multipart/form-data uploads. Here's a complete working example using Java's built-in HttpClient (no external HTTP library needed):
import java.io.*;
import java.net.*;
import java.net.http.*;
import java.nio.file.*;
import java.time.*;
import java.util.*;

public class DokyumiClient {
    private static final String API_BASE = "https://dokyumi.com/api/v1";

    private final String apiKey;
    private final HttpClient httpClient;

    public DokyumiClient(String apiKey) {
        this.apiKey = apiKey;
        this.httpClient = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(30))
            .build();
    }

    public String extractDocument(Path filePath, String schemaId) throws Exception {
        String boundary = "boundary-" + UUID.randomUUID().toString().replace("-", "");
        byte[] fileBytes = Files.readAllBytes(filePath);
        String fileName = filePath.getFileName().toString();

        // Build multipart body
        byte[] body = buildMultipartBody(boundary, fileBytes, fileName, schemaId);

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(API_BASE + "/extract"))
            .header("Authorization", "Bearer " + apiKey)
            .header("Content-Type", "multipart/form-data; boundary=" + boundary)
            .timeout(Duration.ofSeconds(120))
            .POST(HttpRequest.BodyPublishers.ofByteArray(body))
            .build();

        HttpResponse<String> response = httpClient.send(
            request,
            HttpResponse.BodyHandlers.ofString()
        );

        if (response.statusCode() != 200) {
            throw new RuntimeException("Extraction failed: " + response.statusCode() + " " + response.body());
        }
        return response.body(); // JSON string with extracted data
    }

    private byte[] buildMultipartBody(String boundary, byte[] fileBytes, String fileName, String schemaId) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, "UTF-8"), true);

        // File part (multipart lines are terminated with CRLF)
        writer.append("--").append(boundary).append("\r\n");
        writer.append("Content-Disposition: form-data; name=\"file\"; filename=\"").append(fileName).append("\"").append("\r\n");
        writer.append("Content-Type: application/octet-stream").append("\r\n");
        writer.append("\r\n");
        writer.flush(); // push the headers out before writing raw bytes to the same stream
        out.write(fileBytes);
        writer.append("\r\n");

        // Schema ID part
        writer.append("--").append(boundary).append("\r\n");
        writer.append("Content-Disposition: form-data; name=\"schema_id\"").append("\r\n");
        writer.append("\r\n");
        writer.append(schemaId).append("\r\n");

        // Closing boundary
        writer.append("--").append(boundary).append("--").append("\r\n");
        writer.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        DokyumiClient client = new DokyumiClient(System.getenv("DOKYUMI_API_KEY"));
        String result = client.extractDocument(
            Path.of("invoice.pdf"),
            "your-schema-id"
        );
        System.out.println("Extracted: " + result);
        // Parse with your preferred JSON library (Jackson, Gson, etc.)
    }
}
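The multipart builder above places the file name inside a quoted header value, so a name containing a double quote or a line break would corrupt the Content-Disposition header. The following is a minimal sketch of one way to guard against that. The class name MultipartFileName and the fallback name "upload" are hypothetical and not part of the Dokyumi API; the replacement policy is an assumption you may want to adjust.

```java
final class MultipartFileName {
    /**
     * Returns a name that is safe to place inside a quoted Content-Disposition parameter.
     * Control characters (including CR and LF) are dropped; quotes and backslashes
     * are replaced with underscores. "upload" is a hypothetical fallback name.
     */
    static String sanitize(String fileName) {
        if (fileName == null || fileName.isBlank()) {
            return "upload";
        }
        StringBuilder sb = new StringBuilder();
        for (char c : fileName.toCharArray()) {
            if (c < 0x20 || c == 0x7f) {
                continue; // drop control characters so the header stays on one line
            }
            sb.append(c == '"' || c == '\\' ? '_' : c);
        }
        return sb.length() == 0 ? "upload" : sb.toString();
    }
}
```

To use it, pass MultipartFileName.sanitize(fileName) to buildMultipartBody instead of the raw file name.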
Parsing the Response with Jackson
The extraction result is JSON. Here's how to parse it with Jackson, the most common Java JSON library:
// Add to pom.xml:
// <dependency>
//   <groupId>com.fasterxml.jackson.core</groupId>
//   <artifactId>jackson-databind</artifactId>
//   <version>2.17.0</version>
// </dependency>
import com.fasterxml.jackson.databind.*;

// Jackson matches JSON keys to field names exactly by default. If the response uses
// snake_case keys, annotate the fields with @JsonProperty or configure
// PropertyNamingStrategies.SNAKE_CASE on the mapper.
public class ExtractionResult {
    public String documentId;
    public String status;
    public JsonNode result; // Your schema's extracted fields
    public Metadata metadata;

    public static class Metadata {
        public String filename;
        public int pages;
        public long processingTimeMs;
        public double confidence;
    }
}

// Usage:
ObjectMapper mapper = new ObjectMapper();
ExtractionResult extraction = mapper.readValue(jsonResponse, ExtractionResult.class);

// Access typed fields from your schema
String invoiceNumber = extraction.result.get("invoice_number").asText();
double totalAmount = extraction.result.get("total_amount").asDouble();
String vendorName = extraction.result.get("vendor_name").asText();
System.out.printf("Invoice %s from %s: $%.2f%n",
    invoiceNumber, vendorName, totalAmount);
Async Processing with CompletableFuture
For processing multiple documents without blocking, use CompletableFuture with the async HttpClient methods:
import java.io.IOException;
import java.net.URI;
import java.net.http.*;
import java.nio.file.*;
import java.time.Duration;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.*;

public class AsyncDokyumiClient {
    private final HttpClient httpClient;
    private final String apiKey;
    private final String schemaId;

    public AsyncDokyumiClient(String apiKey, String schemaId) {
        this.apiKey = apiKey;
        this.schemaId = schemaId;
        this.httpClient = HttpClient.newBuilder()
            .executor(Executors.newFixedThreadPool(8))
            .connectTimeout(Duration.ofSeconds(30))
            .build();
    }

    public CompletableFuture<String> extractAsync(Path filePath) {
        try {
            byte[] fileBytes = Files.readAllBytes(filePath);
            String boundary = "boundary-" + UUID.randomUUID().toString();
            // buildMultipartBody is the helper from DokyumiClient; copy it into this class
            // or make it a shared static method.
            byte[] body = buildMultipartBody(boundary, fileBytes,
                filePath.getFileName().toString(), schemaId);

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://dokyumi.com/api/v1/extract"))
                .header("Authorization", "Bearer " + apiKey)
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .timeout(Duration.ofSeconds(120))
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();

            return httpClient.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(response -> {
                    if (response.statusCode() != 200) {
                        throw new RuntimeException("Failed: " + response.statusCode());
                    }
                    return response.body();
                });
        } catch (IOException e) {
            return CompletableFuture.failedFuture(e);
        }
    }

    public List<String> extractBatch(List<Path> files) throws Exception {
        List<CompletableFuture<String>> futures = files.stream()
            .map(this::extractAsync)
            .collect(Collectors.toList());

        // Wait for every request; join() rethrows the first failure as a CompletionException
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();

        return futures.stream()
            .map(CompletableFuture::join)
            .collect(Collectors.toList());
    }
}

// Process 20 invoices concurrently:
AsyncDokyumiClient client = new AsyncDokyumiClient(
    System.getenv("DOKYUMI_API_KEY"),
    "your-schema-id"
);
List<Path> invoiceFiles = List.of(
    Path.of("invoice1.pdf"),
    Path.of("invoice2.pdf")
    // ...
);
List<String> results = client.extractBatch(invoiceFiles);
results.forEach(System.out::println);
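The batch method above starts every request at once, and the executor passed to HttpClient does not by itself limit how many requests are in flight. If you need a cap, for example to stay under a rate limit, a semaphore is one option. The sketch below is a generic helper, not part of the Dokyumi API: the class name BoundedBatch and the maxInFlight parameter are hypothetical, and it blocks the calling thread while it waits for a free slot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

final class BoundedBatch {
    /**
     * Starts task for every input, keeping at most maxInFlight futures
     * outstanding at any time. Results come back in input order.
     */
    static <I, O> List<O> run(List<I> inputs,
                              Function<I, CompletableFuture<O>> task,
                              int maxInFlight) {
        Semaphore slots = new Semaphore(maxInFlight);
        List<CompletableFuture<O>> futures = new ArrayList<>();
        for (I input : inputs) {
            slots.acquireUninterruptibly(); // wait until a slot is free
            CompletableFuture<O> future;
            try {
                future = task.apply(input);
            } catch (RuntimeException e) {
                future = CompletableFuture.failedFuture(e);
            }
            // Free the slot when the request finishes, whether it succeeded or failed
            future.whenComplete((result, error) -> slots.release());
            futures.add(future);
        }
        List<O> results = new ArrayList<>();
        for (CompletableFuture<O> future : futures) {
            results.add(future.join()); // rethrows a failure as CompletionException
        }
        return results;
    }
}
```

With the client above, a call would look like BoundedBatch.run(invoiceFiles, client::extractAsync, 4), where 4 is an arbitrary limit chosen for illustration.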
Spring Boot Integration
For Spring Boot applications, wrap the client as a service bean:
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.stereotype.Service;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.multipart.MultipartFile;
import java.nio.file.*;

@Service
public class DocumentParsingService {
    private final String schemaId;
    private final DokyumiClient dokyumiClient;
    private final ObjectMapper mapper = new ObjectMapper(); // thread-safe once configured, so reuse it

    // Both settings are injected through the constructor
    public DocumentParsingService(@Value("${dokyumi.api-key}") String apiKey,
                                  @Value("${dokyumi.schema-id}") String schemaId) {
        this.dokyumiClient = new DokyumiClient(apiKey);
        this.schemaId = schemaId;
    }

    public ExtractionResult parseDocument(MultipartFile file) throws Exception {
        // The original filename is client-supplied: it may be null or contain path separators
        String original = file.getOriginalFilename();
        String safeName = original == null ? "upload" : original.replaceAll("[^A-Za-z0-9._-]", "_");

        // Write temp file
        Path tempFile = Files.createTempFile("dokyumi-", "-" + safeName);
        try {
            file.transferTo(tempFile);
            String json = dokyumiClient.extractDocument(tempFile, schemaId);
            return mapper.readValue(json, ExtractionResult.class);
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }
}

// application.properties:
// dokyumi.api-key=${DOKYUMI_API_KEY}
// dokyumi.schema-id=your-schema-id

// Controller:
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

@RestController
@RequestMapping("/api/documents")
public class DocumentController {
    @Autowired
    private DocumentParsingService parsingService;

    @PostMapping("/parse")
    public ResponseEntity<ExtractionResult> parseDocument(
            @RequestParam("file") MultipartFile file) throws Exception {
        ExtractionResult result = parsingService.parseDocument(file);
        return ResponseEntity.ok(result);
    }
}
Handling Webhooks in Spring Boot
If you're on the Growth plan, Dokyumi can push results to your Spring Boot app via webhook instead of polling:
import com.fasterxml.jackson.databind.*;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import javax.crypto.*;
import javax.crypto.spec.*;
import java.nio.charset.StandardCharsets;
import java.security.*;
import java.util.Map;

@RestController
@RequestMapping("/webhooks")
public class DokyumiWebhookController {
    @Value("${dokyumi.webhook-secret}")
    private String webhookSecret;

    @PostMapping(value = "/dokyumi", consumes = "application/json")
    public ResponseEntity<Map<String, String>> handleWebhook(
            @RequestHeader("X-Dokyumi-Signature") String signature,
            @RequestBody byte[] rawBody) {
        // Verify signature against the raw, unparsed body
        if (!verifySignature(rawBody, signature)) {
            return ResponseEntity.status(401).body(Map.of("error", "Invalid signature"));
        }
        // Ack immediately - process async
        processWebhookAsync(rawBody);
        return ResponseEntity.ok(Map.of("received", "true"));
    }

    private boolean verifySignature(byte[] body, String signature) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(webhookSecret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            byte[] hash = mac.doFinal(body);
            String expected = "sha256=" + bytesToHex(hash);
            // Constant-time comparison
            return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.UTF_8),
                signature.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            return false;
        }
    }

    private String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Spring's @Async works through a proxy, so it has no effect on a private method
    // called from within the same class. Hand the work to another thread explicitly instead.
    private void processWebhookAsync(byte[] rawBody) {
        java.util.concurrent.CompletableFuture.runAsync(() -> processWebhook(rawBody));
    }

    private void processWebhook(byte[] rawBody) {
        ObjectMapper mapper = new ObjectMapper();
        try {
            JsonNode event = mapper.readTree(rawBody);
            String eventType = event.get("event").asText();
            if ("extraction.completed".equals(eventType)) {
                JsonNode data = event.get("data");
                String documentId = data.get("document_id").asText();
                JsonNode result = data.get("result");
                // Process result...
            }
        } catch (Exception e) {
            // Log error, don't rethrow
        }
    }
}
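To exercise the webhook endpoint locally you need to send a request whose signature header matches what the controller computes. The sketch below produces and checks a header in the format the controller above expects ("sha256=" followed by the hex-encoded HMAC-SHA256 of the raw body). The class name WebhookSignature is hypothetical, and the header format is taken from the controller code here, so confirm it against the Dokyumi documentation before relying on it.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class WebhookSignature {
    /** Computes "sha256=<hex HMAC-SHA256>" over the raw body using the shared secret. */
    static String sign(byte[] body, String secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            StringBuilder hex = new StringBuilder("sha256=");
            for (byte b : mac.doFinal(body)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    /** Compares the received header with the expected value in constant time. */
    static boolean verify(byte[] body, String secret, String header) {
        if (header == null) {
            return false;
        }
        return MessageDigest.isEqual(
            sign(body, secret).getBytes(StandardCharsets.UTF_8),
            header.getBytes(StandardCharsets.UTF_8));
    }
}
```

In a test, sign the JSON payload with the same secret configured in dokyumi.webhook-secret and send the result in the X-Dokyumi-Signature header.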
Error Handling and Retries
Add retry logic for transient failures:
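public String extractWithRetry(Path filePath, String schemaId, int maxAttempts) throws Exception {
    long delayMs = 1000;
    for (int attempt = 1; ; attempt++) {
        try {
            return extractDocument(filePath, schemaId);
        } catch (RuntimeException e) {
            // extractDocument reports failures as "Extraction failed: <status> <body>".
            // Don't retry client errors (4xx). Matching any "4" in the message would also
            // catch a 504 or a digit in the response body, so match the status prefix.
            String message = String.valueOf(e.getMessage());
            if (message.startsWith("Extraction failed: 4")) throw e;
            if (attempt >= maxAttempts) throw e;
        } catch (IOException e) {
            // Network-level failures (timeouts, connection resets) are transient too
            if (attempt >= maxAttempts) throw e;
        }
        Thread.sleep(delayMs);
        delayMs = Math.min(delayMs * 2, 30000); // Cap at 30s
    }
}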
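Deciding whether to retry by inspecting the exception message is brittle. One alternative is to have the client throw an exception that carries the HTTP status code, so the retry loop can ask it directly. The sketch below is an assumption about how you might structure this, not part of the Dokyumi API: the class name ApiStatusException and the rule that only 5xx responses are retryable are both hypothetical choices.

```java
final class ApiStatusException extends RuntimeException {
    private final int statusCode;

    ApiStatusException(int statusCode, String body) {
        // Same message format the client uses for failed extractions
        super("Extraction failed: " + statusCode + " " + body);
        this.statusCode = statusCode;
    }

    int statusCode() {
        return statusCode;
    }

    /** Treats server errors (5xx) as transient; every other status is not retried. */
    boolean isRetryable() {
        return statusCode >= 500 && statusCode <= 599;
    }
}
```

The client would throw new ApiStatusException(response.statusCode(), response.body()) instead of a plain RuntimeException, and the retry loop would rethrow immediately when isRetryable() returns false.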
Common Java-Specific Gotchas
Classpath conflicts: If you're using Tika or iText elsewhere in your project, they won't conflict with the Dokyumi API client, because the client uses only the standard library, plus Jackson if you parse responses with it.
SSL/TLS: Java's default TrustStore handles dokyumi.com's certificate chain fine. No custom TrustManager needed.
Memory: For large documents (50MB+), avoid loading the entire file into memory with Files.readAllBytes(). Use streaming upload:
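// Stream large files instead of loading fully. The file stream still needs the multipart
// framing around it, so the part headers and the trailer are sent before and after it.
byte[] head = ("--" + boundary + "\r\n"
    + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
    + "Content-Type: application/octet-stream\r\n\r\n").getBytes(java.nio.charset.StandardCharsets.UTF_8);
byte[] tail = ("\r\n--" + boundary + "\r\n"
    + "Content-Disposition: form-data; name=\"schema_id\"\r\n\r\n"
    + schemaId + "\r\n--" + boundary + "--\r\n").getBytes(java.nio.charset.StandardCharsets.UTF_8);
HttpRequest request = HttpRequest.newBuilder()
    .uri(URI.create(API_BASE + "/extract"))
    .header("Authorization", "Bearer " + apiKey)
    .header("Content-Type", "multipart/form-data; boundary=" + boundary)
    .POST(HttpRequest.BodyPublishers.ofInputStream(() -> {
        try {
            return new SequenceInputStream(Collections.enumeration(List.of(
                new ByteArrayInputStream(head),
                Files.newInputStream(filePath),
                new ByteArrayInputStream(tail))));
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }))
    .build();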
Getting Started
Get your API key at dokyumi.com/dashboard and create a schema for your document type. The Starter plan ($79/mo) includes the REST API. Webhooks are available on Growth ($499/mo) and above.
Full API reference is at dokyumi.com/docs.