Re-OCR Your Digitised Collections for ~$0.002/Page – Daniel van Strien
“Many libraries and cultural heritage institutions digitised their collections years ago. The OCR from that era — often Tesseract or ABBYY — was state of the art at the time, but often struggles with historical typefaces, degraded scans, and complex layouts. In the last few years, a new generation of OCR models based on Vision Language Models (VLMs) has emerged. These models are primarily the result of 'running out of tokens' and the consequent desire from AI companies to find new sources of data to train on. This led to the development of OCR models using VLMs as backbones which usually aim to output 'reading order' text — i.e. text with minimal markup, usually targeting Markdown. These models can perform much better on the same scans that older tools struggled with, producing cleaner, more structured output.”