§01 — Pipeline
From raw files
to structured answers.
OCR, extraction, organization, and grounded Q&A — four stages, one self-hosted
binary. Swap the model behind each stage. Refine the taxonomy over time. Keep
everything on your own infrastructure.
01 · OCR & EXTRACTION
~3s / page
Read messy PDFs like a person would.
Switch between local OCR and an LLM per document. Paperwise handles scans,
multi-column layouts, handwritten margins, and 50-page statements without
babysitting.
Extracted · last batch
| Document | Type | Received |
| Sonic Invoice | Invoice | Mar 12, 2026 |
| Annual Financials | Statement | Sep 1, 2025 |
| Auto Rate Notice | Insurance | Apr 5, 2026 |
| Kindergarten Packet | Form | Mar 14, 2026 |
02 · ORGANIZATION
auto-tagged
Tagged by type, party, date.
Auto-classify each document into a taxonomy you control. Refine the rules, merge
categories, and the library reorganizes itself.
Active taxonomy · 13 tags
12 Contract
28 Medical
41 Invoice
7 Amendment
3 Tuition
19 Utility
6 Legal
22 Finance
14 Billing
11 Insurance
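As a rough illustration of what merging categories means for the counts above, here is a minimal sketch. It is not Paperwise's actual API: `merge_tags` and the `tag_counts` mapping are hypothetical names, and the numbers are copied from the sample taxonomy shown here.

```python
from collections import Counter

# Hypothetical in-memory view of the sample taxonomy above (tag -> document count).
tag_counts = Counter({
    "Contract": 12, "Medical": 28, "Invoice": 41, "Amendment": 7, "Tuition": 3,
    "Utility": 19, "Legal": 6, "Finance": 22, "Billing": 14, "Insurance": 11,
})


def merge_tags(counts: Counter, source: str, target: str) -> Counter:
    """Return a new Counter with every `source` document re-filed under `target`."""
    merged = Counter(counts)                  # copy, so the input stays untouched
    merged[target] += merged.pop(source, 0)   # move the count, drop the old tag
    return merged


# Example: fold "Billing" into "Invoice".
merged = merge_tags(tag_counts, "Billing", "Invoice")
```

The merge only moves documents between tags, so the total number of tagged documents is the same before and after.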
03 · GROUNDED Q&A
cited
Answers that cite their source.
Every answer comes back with the page-level citations it was built from. Click through,
verify, and trust.
"What is the liability cap in the Renfroe contract?"
Aggregate liability is capped at $1,000,000 in the executed
agreement.
renfroe-msa.pdf · p.4
amendment-2.pdf · p.1
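One way to picture an answer that carries its page-level citations is the sketch below. The `Citation` and `GroundedAnswer` types are hypothetical and only illustrate the shape of the data; the real response format may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    """A page-level pointer back to the source document (hypothetical type)."""
    filename: str
    page: int


@dataclass
class GroundedAnswer:
    """An answer plus the citations it was built from (hypothetical type)."""
    text: str
    citations: list[Citation] = field(default_factory=list)

    def is_grounded(self) -> bool:
        # An answer with no citations cannot be verified by clicking through.
        return len(self.citations) > 0

    def render(self) -> str:
        # Answer text first, then one "file · p.N" line per citation.
        sources = [f"{c.filename} · p.{c.page}" for c in self.citations]
        return "\n".join([self.text, *sources])


answer = GroundedAnswer(
    text="Aggregate liability is capped at $1,000,000 in the executed agreement.",
    citations=[Citation("renfroe-msa.pdf", 4), Citation("amendment-2.pdf", 1)],
)
rendered = answer.render()
```

Keeping citations as structured data rather than text inside the answer lets a caller reject or flag any answer where `is_grounded()` is false.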
04 · MODEL ROUTING
3 slots · BYO key
The right model for each job.
Configure separate models for OCR, extraction, and Q&A — cheap and fast where it
doesn't matter, slow and accurate where it does. Local, OpenAI-compatible, or hosted.
Active configuration
OCR · gemini-2.5-flash · fast
Extract · gpt-4.1-mini · balanced
Q&A · gpt-4.1 · accurate
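The three-slot configuration above could be modeled roughly as follows. This is a sketch under assumptions, not the product's real config schema: `ModelSlot`, `ROUTING`, `model_for`, and the stage keys `ocr`, `extract`, and `qa` are all hypothetical names; only the model identifiers and profiles come from the configuration shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSlot:
    """One routing slot: which model serves a stage, and its speed/accuracy profile."""
    model: str
    profile: str


# Hypothetical routing table mirroring the active configuration above.
ROUTING = {
    "ocr": ModelSlot("gemini-2.5-flash", "fast"),
    "extract": ModelSlot("gpt-4.1-mini", "balanced"),
    "qa": ModelSlot("gpt-4.1", "accurate"),
}


def model_for(stage: str) -> str:
    """Look up the model configured for a pipeline stage, failing loudly on typos."""
    try:
        return ROUTING[stage].model
    except KeyError:
        raise ValueError(
            f"unknown stage {stage!r}; expected one of {sorted(ROUTING)}"
        ) from None
```

For example, `model_for("qa")` returns `"gpt-4.1"`, while an unrecognized stage name raises a `ValueError` instead of silently falling back to a default model.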