Update app.py
app.py
CHANGED
@@ -83,7 +83,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  <a href="https://impresso-project.ch" target="_blank">
  <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
  alt="Impresso Project Logo"
- style="height:
+ style="height: 42px; display: block; margin: 0 auto;">
  </a>
  """
  )
@@ -91,11 +91,12 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  """
  # 🔍 OCR Quality Assessment Demo

- This demo showcases the **OCR Quality Assessment (OCRQA)**
+ This demo showcases the **OCR Quality Assessment (OCRQA)** of the [Impresso Project](https://impresso-project.ch).
+ The pipeline evaluates the quality of text extracted via **Optical Character Recognition (OCR)** by estimating the proportion of recognizable words.

  It returns:
  - a **quality score** between **0.0 (poor)** and **1.0 (excellent)**, and
- - a list of **potential OCR errors** (unrecognized tokens).
+ - a list of **potential OCR errors** (unrecognized tokens) as well as the known tokens.

  You can try the example below (a German text containing typical OCR errors), or paste your own OCR-processed text to assess its quality.
  """
@@ -141,6 +142,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  - **Diagnostics output**:
  - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
  - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
+ - Note: Non-alphabetic characters will be removed. For efficiency reasons, all digits are replaced by the digit 0.

  #### ⚠️ Limitations:
  - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
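The help text in this diff describes the score as the proportion of recognizable words, with digits mapped to 0 and other non-alphabetic characters removed. A minimal sketch of that idea follows. It is not the code in app.py: the function names `normalize` and `assess`, the plain Python `set` used as a wordlist, the lowercasing step, counting every token occurrence rather than unique tokens, and returning 0.0 for empty input are all assumptions made for illustration.

```python
import re


def normalize(text: str) -> list[str]:
    """Split text into tokens after a simple normalization (illustrative only)."""
    # Assumption: case is ignored when looking up words.
    text = text.lower()
    # Map every digit to 0, as the note in the help text describes.
    text = re.sub(r"\d", "0", text)
    # Assumption: punctuation and underscores become spaces; letters and the
    # normalized digits are kept.
    text = re.sub(r"[^\w\s]|_", " ", text)
    return text.split()


def assess(text: str, wordlist: set[str]) -> dict:
    """Return a score in [0.0, 1.0] plus known and unrecognized tokens."""
    tokens = normalize(text)
    known = [t for t in tokens if t in wordlist]
    unrecognized = [t for t in tokens if t not in wordlist]
    # Assumption: an empty input scores 0.0 instead of raising an error.
    score = len(known) / len(tokens) if tokens else 0.0
    return {"score": score, "known": known, "unrecognized": unrecognized}


if __name__ == "__main__":
    # Hypothetical toy wordlist; a real one would be far larger.
    toy_wordlist = {"die", "zeitung", "erschien", "0000"}
    print(assess("Die Zeitnng erschien 1848.", toy_wordlist))
```

In this toy run, "Zeitnng" (a typical OCR confusion of "u" and "n") is the only token missing from the wordlist, so three of four tokens count as known. Because the year is normalized to "0000", a single wordlist entry covers every four-digit number.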