How well do VLM-based OCR models handle Victorian theatre playbills? 🎭

Last week I shared OCR Time Capsule, an app for comparing traditional and VLM-based OCR. I've now added examples from a challenging collection: the British Library's Theatrical playbills from Britain and Ireland.

These 150-year-old documents are brutal for OCR:
- Decorative fonts in every size imaginable
- Multi-column layouts with text at odd angles
- Faded ink and show-through from the reverse
- ALL CAPS DRAMATIC ANNOUNCEMENTS!!!

For this dataset I used the RolmOCR model from Reducto, run via HF Jobs (love how easy uv scripts make GPU inference!). The results? The improvement over traditional OCR is even more dramatic than it was with exam papers.

🔗 Explore the app: https://huggingface.co/spaces/davanstrien/ocr-time-capsule
📚 BL Theatre dataset: https://bl.iro.bl.uk/concern/datasets/a8534aff-c8e3-4fc8-adc1-da542080b1e3

I'll keep working through the suggestions I got last week, but feel free to suggest other hairy OCR challenges for comparing VLMs with existing OCR!

#DigitalHumanities #OCR #GLAM #BritishLibrary #TheatreHistory