Paperwise

Q&A

How does text extraction and OCR work in Paperwise?


Paperwise first tries to extract text directly from the document when possible. This works well for machine-readable PDFs and avoids unnecessary OCR.

When the file is a scan, an image-heavy PDF, or a low-quality document, Paperwise falls back to OCR based on the mode you choose in Settings > Model Config.
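The text-first, OCR-fallback decision can be sketched as a simple heuristic: direct extraction from a scan yields near-zero characters per page, while a machine-readable PDF yields far more. The function name and threshold below are illustrative, not Paperwise's actual code.

```python
def needs_ocr(extracted_text: str, page_count: int,
              min_chars_per_page: int = 50) -> bool:
    """Fall back to OCR when direct extraction yields too little text.

    The threshold is an illustrative guess, not Paperwise's value.
    """
    if page_count <= 0:
        return True
    chars_per_page = len(extracted_text.strip()) / page_count
    return chars_per_page < min_chars_per_page


# A blank extraction from a 3-page scan triggers OCR;
# a text-rich PDF does not.
print(needs_ocr("", 3))
print(needs_ocr("word " * 2000, 3))
```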

  • Local Tesseract keeps OCR on your machine using tesseract and pdftoppm.
  • LLM OCR using your main connection sends rendered page images to the same provider family you use elsewhere.
  • Separate OCR model routing lets you choose a dedicated OCR model independently from metadata extraction and grounded Q&A.
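A minimal sketch of the local path, assuming Paperwise shells out to pdftoppm to render pages and tesseract to OCR each rendered image. The flags shown are standard for both tools; the helper names are hypothetical.

```python
import shutil
import subprocess


def render_cmd(pdf_path: str, out_prefix: str, dpi: int = 300) -> list[str]:
    # `pdftoppm -r <dpi> -png` renders each PDF page to <prefix>-N.png
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def ocr_cmd(image_path: str) -> list[str]:
    # `tesseract <image> stdout` writes recognized text to stdout
    return ["tesseract", image_path, "stdout"]


def ocr_image(image_path: str) -> str:
    # Only attempt OCR when the binary is actually installed.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    result = subprocess.run(ocr_cmd(image_path), capture_output=True,
                            text=True, check=True)
    return result.stdout
```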

For OpenAI-based vision OCR, you can also tune image detail to auto, low, or high.
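With an OpenAI-style provider, the image detail setting maps onto the standard chat-completions image message format, where each image part carries a `detail` field. This is a sketch only; the prompt wording and helper name are illustrative.

```python
import base64


def build_ocr_message(png_bytes: bytes, detail: str = "auto") -> dict:
    # "auto", "low", or "high" controls how much resolution the
    # vision model sees; higher detail costs more tokens.
    assert detail in ("auto", "low", "high")
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe all text on this page verbatim."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}",
                           "detail": detail}},
        ],
    }
```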

Paperwise works well with both GPT and Gemini setups. Use task-specific models so OCR, extraction, and grounded Q&A are tuned separately.

| Task                | GPT example | Gemini example        | Notes                                                            |
|---------------------|-------------|-----------------------|------------------------------------------------------------------|
| OCR                 | gpt-5-mini  | gemini-2.5-flash      | Good fast multimodal starting points for scanned PDFs and forms  |
| Metadata extraction | gpt-5-mini  | gemini-2.5-flash      | Balanced choices for structured fields                           |
| Grounded Q&A        | gpt-5.1     | gemini-2.5-pro        | Better picks for more demanding cross-document questions         |
| Budget / bulk work  | gpt-5-nano  | gemini-2.5-flash-lite | Useful for lighter classification and triage                     |

If your documents are mostly clean text PDFs, start with the faster models and move up only when the output quality falls short.
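Task-specific routing amounts to a per-task model map with a fallback to the main connection. The table and names below are illustrative, not Paperwise's actual configuration schema.

```python
# Hypothetical routing table; keys mirror the tasks discussed above.
TASK_MODELS = {
    "ocr": "gemini-2.5-flash",
    "metadata": "gemini-2.5-flash",
    "qa": "gemini-2.5-pro",
}
MAIN_MODEL = "gemini-2.5-pro"


def model_for(task: str) -> str:
    # Fall back to the main connection's model when no
    # task-specific override is configured.
    return TASK_MODELS.get(task, MAIN_MODEL)
```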

Why does OCR sometimes time out on dense PDFs?


Large page images take longer to process. Paperwise logs per-page OCR progress, retries timed-out pages, and uses longer request timeouts for vision calls.
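The per-page retry behavior can be sketched as a bounded retry loop; the attempt count and log messages here are illustrative, not Paperwise's actual values.

```python
import logging


def ocr_with_retries(ocr_page, page_no: int, attempts: int = 2) -> str:
    # Retry a timed-out page a bounded number of times,
    # logging progress for each attempt.
    for attempt in range(1, attempts + 1):
        try:
            return ocr_page(page_no)
        except TimeoutError:
            logging.warning("OCR page %d timed out (attempt %d/%d)",
                            page_no, attempt, attempts)
    raise TimeoutError(f"page {page_no} failed after {attempts} attempts")
```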

Documents are deduplicated by SHA256 checksum to prevent repeated ingestion of the same file content.
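The SHA256 dedup check is straightforward with the standard library. This is a sketch; Paperwise's actual storage layer will differ, but the key point holds: hashing file content (not the filename) means renamed copies still deduplicate.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Hash the file content, not its name or path.
    return hashlib.sha256(data).hexdigest()


def ingest(data: bytes, seen: set[str]) -> bool:
    """Return True if the document is new; skip exact duplicates."""
    digest = checksum(data)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```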

Open an issue in the repository: