Update app.py
app.py
CHANGED
@@ -128,11 +128,11 @@ with gr.Blocks(title="OCR QA Demo") as demo:
 with gr.Accordion("📝 About the OCR QA Method", open=False, visible=False) as info_accordion:
 gr.Markdown(
 """
-
+### 📝 About the OCR QA Method
 
-
+This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
 
-
+#### How it works:
 - **Scoring**: The quality score ranges from **0.0** (poor) to **1.0** (excellent) and is based on the ratio of recognized to unrecognized unique word forms.
 - **Lexical resources**: Words are matched against precompiled lists derived from **Wikipedia** and **Wortschatz Leipzig**, using **Bloom filters** for fast, memory-efficient lookup.
 - **Multilingual support**: Available for several languages (e.g., German, French, English). If not specified, the language is detected automatically.
@@ -140,7 +140,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
 - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
 - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
 
-
+#### ⚠️ Limitations:
 - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
 - The method may fail to flag **short OCR artifacts** (e.g., 1–2 character noise) and **non-alphabetic symbols**.
 
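For reviewers unfamiliar with the method described in the Markdown text of this change, here is a minimal sketch of the idea: a Bloom filter holds the wordlist, and the score is the share of unique tokens found in it. This is not the Space's actual code. `BloomFilter`, `ocr_quality`, the filter parameters, the tokenizer regex, and the choice of "recognized divided by all unique tokens" as the score are all assumptions made for illustration.

```python
import hashlib
import re


class BloomFilter:
    """Toy Bloom filter (hypothetical): no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def ocr_quality(text, known_words):
    """Return (score, known, unknown) over the unique word forms in text."""
    # Assumption: tokens are lowercased runs of letters; digits and symbols are ignored.
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    known = {t for t in tokens if t in known_words}
    unknown = tokens - known
    # Assumption: score = recognized unique forms / all unique forms, 0.0 if empty.
    score = len(known) / len(tokens) if tokens else 0.0
    return score, known, unknown


if __name__ == "__main__":
    wordlist = BloomFilter()
    for word in ["the", "quick", "brown", "fox"]:
        wordlist.add(word)
    score, known, unknown = ocr_quality("The quick brovvn fox", wordlist)
    print(score, sorted(known), sorted(unknown))
```

In the usage at the bottom, the misread token `brovvn` should land in the unrecognized set, barring a Bloom filter false positive. Because the regex only collects letter runs, the sketch also illustrates the stated limitation that non-alphabetic symbols go unflagged.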