Update app.py
app.py
CHANGED
@@ -81,7 +81,9 @@ with gr.Blocks(title="OCR QA Demo") as demo:
 gr.HTML(
 """
 <a href="https://impresso-project.ch" target="_blank">
-<img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
+<img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
+alt="Impresso Project Logo"
+style="height: 84px; display: block; margin: 0 auto;">
 </a>
 """
 )
@@ -127,12 +129,12 @@ with gr.Blocks(title="OCR QA Demo") as demo:
 # Info modal/accordion for pipeline details
 with gr.Accordion("📝 About the OCR QA Method", open=False, visible=False) as info_accordion:
 gr.Markdown(
-
-### 📝 About the OCR QA Method
+"""
+### 📝 About the OCR QA Method
 
-This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
+This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
 
-#### How it works:
+#### How it works:
 - **Scoring**: The quality score ranges from **0.0** (poor) to **1.0** (excellent) and is based on the ratio of recognized to unrecognized unique word forms.
 - **Lexical resources**: Words are matched against precompiled lists derived from **Wikipedia** and **Wortschatz Leipzig**, using **Bloom filters** for fast, memory-efficient lookup.
 - **Multilingual support**: Available for several languages (e.g., German, French, English). If not specified, the language is detected automatically.
@@ -140,10 +142,10 @@ This pipeline estimates OCR quality by analyzing the proportion of **unique word
 - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
 - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
 
-#### ⚠️ Limitations:
+#### ⚠️ Limitations:
 - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
 - The method may fail to flag **short OCR artifacts** (e.g., 1–2 character noise) and **non-alphabetic symbols**.
-
+
 As such, the score should be understood as a **heuristic indicator**, best used for:
 - Comparative assessments between OCR outputs
 - Filtering low-quality text from large corpora
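The scoring idea described in the Markdown text of this diff can be sketched in a few lines of Python. This is a rough illustration, not the Space's actual implementation. It assumes the score is the share of unique word forms found in the lexicon, which is one reading of "proportion of unique words that match". The names `BloomFilter` and `ocr_quality_score`, the letter-only tokenization, and the filter parameters are all hypothetical.

```python
import hashlib
import re


class BloomFilter:
    """Tiny illustrative Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        # Sizes are arbitrary demo values, not taken from the original project.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word: str):
        # Derive several bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def ocr_quality_score(text: str, lexicon) -> float:
    """Return the share of unique word forms in `text` that `lexicon` contains.

    `lexicon` is anything supporting `in` (a set or the BloomFilter above).
    """
    # Assumed tokenization: lowercase runs of Unicode letters only.
    unique_words = set(re.findall(r"[^\W\d_]+", text.lower()))
    if not unique_words:
        return 0.0  # no words to judge; treated as lowest quality here
    known = sum(1 for word in unique_words if word in lexicon)
    return known / len(unique_words)


if __name__ == "__main__":
    lexicon = BloomFilter()
    for entry in ["the", "quick", "brown", "fox"]:
        lexicon.add(entry)
    print(ocr_quality_score("The quick brovvn fox", lexicon))
```

Because a Bloom filter can return false positives, a score computed this way can only overestimate the share of known words, never underestimate it. Passing a plain `set` as `lexicon` gives exact membership and is handy for testing.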