|
|
<!DOCTYPE html> |
|
|
<html lang="en"> |
|
|
|
|
|
<head> |
|
|
|
|
|
<script async src="https://www.googletagmanager.com/gtag/js?id=G-KEDJFQ6MS9"></script> |
|
|
<script> |
|
|
window.dataLayer = window.dataLayer || []; |
|
|
function gtag(){dataLayer.push(arguments);} |
|
|
gtag('js', new Date()); |
|
|
|
|
|
gtag('config', 'G-KEDJFQ6MS9'); |
|
|
</script> |
|
|
<meta charset="UTF-8"> |
|
|
|
|
|
<title>DTLR: General Detection-based Text Line Recognition</title>
|
|
<style> |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
body { |
|
|
font-family: 'Roboto', sans-serif; |
|
|
font-size: 16px; |
|
|
color: #333; |
|
|
line-height: 1.6; |
|
|
background-color: #f9f9f9; |
|
|
margin: 10px 5px !important; |
|
|
padding: 10px; |
|
|
|
|
|
|
|
|
} |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
header { |
|
|
text-align: center; |
|
|
color: #333; |
|
|
} |
|
|
|
|
|
header h1 { |
|
|
font-size: 38px; |
|
|
} |
|
|
|
|
|
|
|
|
.authors { |
|
|
display: flex; |
|
|
justify-content: center; |
|
|
align-items: center; |
|
|
flex-direction: column; |
|
|
font-size: 18px; |
|
|
} |
|
|
.authors a { |
|
|
color: inherit; |
|
|
text-decoration: none; |
|
|
} |
|
|
.authors a:hover { |
|
|
text-decoration: underline; |
|
|
} |
|
|
|
|
|
.content { |
|
|
display: flex; |
|
|
|
|
|
justify-content: space-between; |
|
|
} |
|
|
|
|
|
|
|
|
.affiliations { |
|
|
text-align: center; |
|
|
margin-bottom: 20px; |
|
|
font-size: 16px; |
|
|
} |
|
|
|
|
|
.conference { |
|
|
text-align: center; |
|
|
} |
|
|
|
|
|
|
|
|
.icon-links { |
|
|
display: flex; |
|
|
justify-content: center; |
|
|
align-items: center; |
|
|
flex-direction: row; |
|
|
gap: 20px; |
|
|
} |
|
|
|
|
|
.icon-links a { |
|
|
text-decoration: none; |
|
|
background-color: #02A4D3; |
|
|
color: white; |
|
|
width: 100px; |
|
|
height: 40px; |
|
|
line-height: 40px; |
|
|
border-radius: 8px; |
|
|
font-weight: bold; |
|
|
text-align: center; |
|
|
transition: background-color 0.3s, transform 0.2s; |
|
|
} |
|
|
|
|
|
.icon-links a:hover { |
|
|
background-color: #0286ad;
|
|
transform: translateY(-3px); |
|
|
} |
|
|
|
|
|
.icon-links a:active { |
|
|
transform: translateY(1px); |
|
|
} |
|
|
|
|
|
|
|
|
.container { |
|
|
width: 100%; |
|
|
max-width: none; |
|
|
margin: 0 auto; |
|
|
padding: 20px; |
|
|
box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); |
|
|
background-color: #fff; |
|
|
} |
|
|
.title__ { |
|
|
color: #02A4D3; |
|
|
} |
|
|
|
|
|
code, pre { |
|
|
background-color: #f4f4f4; |
|
|
padding: 10px; |
|
|
border-radius: 5px; |
|
|
font-family: "Courier New", Courier, monospace; |
|
|
font-size: 14px; |
|
|
white-space: pre-wrap; |
|
|
overflow-x: auto; |
|
|
} |
|
|
|
|
|
pre { |
|
|
margin: 20px 0; |
|
|
border: 1px solid #ccc; |
|
|
} |
|
|
|
|
|
|
|
|
.container h1 { |
|
|
text-align: center; |
|
|
font-size: 38px; |
|
|
margin-right: 20px; |
|
|
margin-left: 40px; |
|
|
} |
|
|
|
|
|
|
|
|
.img-with-text { |
|
|
width: 100%; |
|
|
display: block; |
|
|
margin: 0 auto; |
|
|
} |
|
|
|
|
|
|
|
|
.icon { |
|
|
width: 80px; |
|
|
|
|
|
height: 80px; |
|
|
|
|
|
background-color: #ddd; |
|
|
border-radius: 50%; |
|
|
display: flex; |
|
|
justify-content: center; |
|
|
align-items: center; |
|
|
margin: 0 auto; |
|
|
} |
|
|
|
|
|
.right-aligned-image { |
|
|
margin-left: auto; |
|
|
|
|
|
align-self: flex-start; |
|
|
|
|
|
margin-top: 0; |
|
|
|
|
|
} |
|
|
|
|
|
.right-aligned-image img { |
|
|
max-width: 300px; |
|
|
|
|
|
height: auto; |
|
|
} |
|
|
|
|
|
|
|
|
.center-content { |
|
|
display: flex; |
|
|
flex-direction: column; |
|
|
align-items: center; |
|
|
justify-content: center; |
|
|
text-align: center; |
|
|
} |
|
|
|
|
|
|
|
|
|
|
|
h1 { |
|
|
font-size: 2rem; |
|
|
|
|
|
margin-bottom: 10px; |
|
|
} |
|
|
|
|
|
p { |
|
|
font-size: 1.0rem; |
|
|
color: #333; |
|
|
|
|
|
} |
|
|
|
|
|
.icon-label { |
|
|
margin-top: 5px; |
|
|
font-size: 0.9rem; |
|
|
text-align: center; |
|
|
} |
|
|
|
|
|
.References p ul li { |
|
|
font-size: 0.8rem; |
|
|
} |
|
|
|
|
|
.teaser-image img { |
|
|
display: block; |
|
|
margin: 0 auto; |
|
|
|
|
|
max-width: 100%; |
|
|
|
|
|
height: auto; |
|
|
} |
|
|
|
|
|
.centered-image { |
|
|
text-align: center; |
|
|
|
|
|
} |
|
|
|
|
|
|
|
|
|
|
.logo-image { |
|
|
text-align: center; |
|
|
} |
|
|
|
|
|
.logo-image img { |
|
|
margin: 0 auto; |
|
|
|
|
|
max-width: 90%; |
|
|
|
|
|
height: auto;
|
|
} |
|
|
|
|
|
.conference-image img { |
|
|
margin: 0 auto; |
|
|
|
|
|
width: 100px; |
|
|
|
|
|
height: auto; |
|
|
} |
|
|
.blue-line { |
|
|
border: none; |
|
|
border-top: 3px solid #02A4D3; |
|
|
width: 100%; |
|
|
margin: -10px 0; |
|
|
} |
|
|
|
|
|
.abstract { |
|
|
max-width: 1000px; |
|
|
|
|
|
margin: 0 auto; |
|
|
|
|
|
|
|
|
text-align: justify; |
|
|
|
|
|
} |
|
|
|
|
|
.abstract h2 { |
|
|
text-align: center; |
|
|
|
|
|
font-size: 1.5rem; |
|
|
|
|
|
margin-bottom: 20px; |
|
|
|
|
|
color: #02A4D3; |
|
|
|
|
|
} |
|
|
|
|
|
.method h2 { |
|
|
text-align: center; |
|
|
|
|
|
font-size: 1.5rem; |
|
|
|
|
|
margin-bottom: 20px; |
|
|
|
|
|
color: #02A4D3; |
|
|
|
|
|
} |
|
|
|
|
|
|
|
|
|
|
|
.para p { |
|
|
max-width: 90%; |
|
|
|
|
|
line-height: 1.6; |
|
|
|
|
|
|
|
|
margin: 0 auto; |
|
|
|
|
|
margin-top: 20px; |
|
|
} |
|
|
|
|
|
.row::after {
content: "";
display: table;
clear: both;
}
|
|
.imgcontainer { |
|
|
max-width: 950px; |
|
|
margin: 0 auto; |
|
|
text-align: justify; |
|
|
} |
|
|
.Teaser { |
|
|
max-width: 1000px; |
|
|
|
|
|
margin: 0 auto; |
|
|
} |
|
|
|
|
|
|
|
|
.grid-container { |
|
|
display: grid; |
|
|
grid-template-columns: repeat(1, 1fr); |
|
|
gap: 20px; |
|
|
} |
|
|
|
|
|
|
|
|
.grid-item { |
|
|
display: flex; |
|
|
justify-content: center; |
|
|
align-items: center; |
|
|
} |
|
|
|
|
|
.grid-item img { |
|
|
width: 100%; |
|
|
height: auto; |
|
|
display: block; |
|
|
} |
|
|
.image-pair { |
|
|
flex: 1; |
|
|
margin-bottom: 2px; |
|
|
|
|
|
} |
|
|
|
|
|
.image-pair img { |
|
|
max-width: 100%; |
|
|
|
|
|
height: auto; |
|
|
} |
|
|
figcaption { |
|
|
max-width: 800px; |
|
|
|
|
|
margin: 0 auto; |
|
|
|
|
|
padding: 20px; |
|
|
|
|
|
text-align: justify; |
|
|
|
|
|
} |
|
|
.centered-image img { |
|
|
max-width: 90%; |
|
|
height: auto; |
|
|
display: block; |
|
|
margin: 0 auto; |
|
|
} |
|
|
|
|
|
</style> |
|
|
</head> |
|
|
|
|
|
|
|
|
|
|
|
<body> |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<div class="container"> |
|
|
<div class="title__">
|
|
<h1> General Detection-based Text Line Recognition <br> |
|
|
<span style="color: black;font-size: 0.8em;">(NeurIPS 2024)</span> |
|
|
</h1> |
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
<div class="authors"> |
|
|
<b> <a href="https://raphael-baena.github.io/" target="_blank">Raphael Baena</a>, <a href="https://imagine-lab.enpc.fr/staff-members/syrine-kalleli/" target="_blank">Syrine Kalleli</a>, <a href="https://imagine.enpc.fr/~aubrym/" target="_blank">Mathieu Aubry</a> </b> |
|
|
</div> |
|
|
|
|
|
<div class="affiliations"> |
|
|
<i> LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France</i> |
|
|
|
|
|
|
|
|
</div> |
|
|
|
|
|
<div class="icon-links"> |
|
|
<a href="https://arxiv.org/pdf/2409.17095" |
|
|
target="_blank"> |
|
|
<div class="center-content"> |
|
|
<b>Paper </b> |
|
|
</div> |
|
|
|
|
|
</a> |
|
|
<a href="https://github.com/raphael-baena/DTLR" target="_blank"> |
|
|
<div class="center-content"> |
|
|
<b>Code</b> |
|
|
</div> |
|
|
</a> |
|
|
|
|
|
<a href="index.html" target="_blank"> |
|
|
<div class="center-content"> |
|
|
<b> Presentation </b> |
|
|
</div> |
|
|
</a> |
|
|
|
|
|
|
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
<div class="Teaser"> |
|
|
<div class="para"> |
|
|
<figure class="centered-image"> |
|
|
<img src="teaser.png" alt="Example recognition results on six text line datasets">
|
|
<figcaption> |
|
|
Our HTR model is general and can be used on diverse datasets, including challenging handwritten scripts, Chinese script, and ciphers. From left to right and top to bottom, we show results on the Google1000, IAM, READ, RIMES, CASIA, and Cipher datasets.
|
|
</figcaption> |
|
|
</figure> |
|
|
</div> |
|
|
</div> |
|
|
<div class="abstract"> |
|
|
<h2>Abstract</h2> |
|
|
<hr class="blue-line"> |
|
|
<div class="para"> |
|
|
<p> |
|
|
We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR), with Latin, Chinese, or ciphered characters. Detection-based approaches have until now been largely discarded for HTR because reading characters separately is often challenging, and character-level annotation is difficult and expensive. We overcome these challenges thanks to three main insights: |
|
|
(i) synthetic pre-training with sufficiently diverse data enables learning reasonable character localization for any script; (ii) modern transformer-based detectors can jointly detect a large number of instances, and, if trained with an adequate masking strategy, leverage consistency between the different detections; (iii) once a pre-trained detection model with approximate character localization is available, it is possible to fine-tune it with line-level annotation on real data, even with a different alphabet. Our approach builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding, predicting character values one by one, while we treat a complete line in parallel. Remarkably, we demonstrate good performance on a large range of scripts, usually tackled with specialized approaches. We surpass state-of-the-art results for Chinese script on the CASIA v2 dataset, and for ciphers such as Borg and Copiale, while also performing well with Latin scripts. |
|
|
</p> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<div class="abstract"> |
|
|
<h2>Method</h2> |
|
|
<hr class="blue-line"> |
|
|
<div class="para"> |
|
|
<p>
Given an input text-line image, our goal is to predict its transcription, i.e., a sequence of characters.
We tackle this problem as a character detection task and build on the DINO-DETR architecture,
shown in the figure below, to simultaneously detect all characters.
</p>
<figure class="centered-image">
<img src="architecture_figure.png" alt="DTLR architecture: backbone, Transformer encoder, and Transformer decoder">
</figure>
<p>
Given an input image, the backbone extracts multi-scale features, which are fed to the Transformer encoder along with a positional encoding. The primitive queries, composed of content (filled) and modified positional (empty) queries, go through the Transformer decoder, where they probe the enhanced encoder features through deformable cross-attention. The queries are refined layer by layer in the decoder to finally predict the characters and their associated bounding boxes.
</p>
|
|
</div> |
|
|
</div> |
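Because all characters are predicted in parallel rather than autoregressively, the final transcription can be read directly off the detections. The sketch below is a hypothetical illustration of this idea (not the released code): keep confident character detections and order them by the horizontal position of their predicted boxes. The threshold value and tuple layout are assumptions for illustration only.

```python
# Hypothetical post-processing sketch (illustration, not the authors' code):
# turning parallel character detections into a left-to-right transcription.
# Each detection is (character, confidence, x_center_of_box).

def detections_to_text(detections, conf_threshold=0.5):
    """Keep confident detections and read them left to right."""
    kept = [d for d in detections if d[1] >= conf_threshold]
    kept.sort(key=lambda d: d[2])  # order by horizontal box center
    return "".join(ch for ch, _, _ in kept)

detections = [
    ("e", 0.91, 34.0),
    ("H", 0.97, 10.5),
    ("o", 0.88, 80.2),
    ("l", 0.12, 55.0),   # low-confidence query, discarded
    ("l", 0.93, 47.8),
    ("l", 0.90, 60.1),
]
print(detections_to_text(detections))  # prints "Hello"
```

Real scripts need more care (e.g. overlap handling, or a column-wise reading order for vertical layouts), but the parallel-decoding principle is the same: every query proposes one character, and the line is assembled in a single pass.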
|
|
<div class="abstract" style="display: flex; justify-content: center; flex-direction: column; align-items: center;"> |
|
|
<h2>Qualitative Results</h2> |
|
|
<hr class="blue-line"> |
|
|
|
|
|
<h3>IAM</h3> |
|
|
<div class="imgcontainer"> |
|
|
<div class="grid-container"> |
|
|
<div class="grid-item"> |
|
|
<img src="IAM/40.png" alt="IAM Image 1"> |
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="IAM/105.png" alt="IAM Image 2"> |
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="IAM/111.png" alt="IAM Image 3"> |
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="IAM/125.png" alt="IAM Image 4"> |
|
|
</div> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
|
|
|
<h3>READ</h3> |
|
|
<div class="imgcontainer"> |
|
|
<div class="grid-container"> |
|
|
<div class="grid-item"> |
|
|
<img src="READ/223.png" alt="READ Image 1">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="READ/235.png" alt="READ Image 2">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="READ/273.png" alt="READ Image 3">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="READ/479.png" alt="READ Image 4">
|
|
</div> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
<h3>RIMES</h3> |
|
|
<div class="imgcontainer"> |
|
|
<div class="grid-container"> |
|
|
<div class="grid-item"> |
|
|
<img src="RIMES/21.png" alt="RIMES Image 1">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="RIMES/38.png" alt="RIMES Image 2">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="RIMES/47.png" alt="RIMES Image 3">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="RIMES/69.png" alt="RIMES Image 4">
|
|
</div> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
|
|
|
<h3>Copiale</h3> |
|
|
<div class="imgcontainer"> |
|
|
<div class="grid-container"> |
|
|
<div class="grid-item"> |
|
|
<img src="Copiale/226.png" alt="Copiale Image 1">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="Copiale/228.png" alt="Copiale Image 2">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="Copiale/229.png" alt="Copiale Image 3">
|
|
</div> |
|
|
<div class="grid-item"> |
|
|
<img src="Copiale/405.png" alt="Copiale Image 4">
|
|
</div> |
|
|
</div> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
|
|
|
<div class="abstract" |
|
|
style="display: flex; margin-top: 40px;justify-content: center; flex-direction: column; align-items: center;"> |
|
|
|
|
|
<h2>Acknowledgements</h2> |
|
|
<hr class="blue-line"> |
|
|
<div class="para"> |
|
|
<p> |
|
|
This work was funded by ANR project EIDA ANR-22-CE38-0014, ANR project VHS ANR-21-CE38-0008, ANR project sharp ANR-23-PEIA-0008, in the context of the PEPR IA, and ERC project DISCOVER funded by |
|
|
the European Union’s Horizon Europe Research and Innovation program under grant agreement No. 101076028. We thank Ségolène Albouy, Zeynep Sonat Baltacı, Ioannis Siglidis, Elliot Vincent and Malamatenia Vlachou for feedback and fruitful discussions. |
|
|
</p> |
|
|
</div> |
|
|
</div> |
|
|
|
|
|
<div class="abstract" |
|
|
style="display: flex; flex-direction: column;"> |
|
|
<h2>BibTeX</h2> |
|
|
<pre style="text-align: left; margin-top: -15px; margin-left: 20px; margin-right: 20px;">
@inproceedings{baena2024DTLR,
  title={General Detection-based Text Line Recognition},
  author={Raphael Baena and Syrine Kalleli and Mathieu Aubry},
  booktitle={NeurIPS},
  year={2024},
  url={https://arxiv.org/abs/2409.17095}}
</pre>
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
</body> |
|
|
|
|
|
</html> |
|
|
|