simon-clmtd committed on
Commit d2137e3 · verified · 1 Parent(s): 79a3f1d

Update app.py

Files changed (1)
  1. app.py +9 -7
app.py CHANGED
@@ -81,7 +81,9 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  gr.HTML(
  """
  <a href="https://impresso-project.ch" target="_blank">
- <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg" alt="Impresso Project Logo" style="height: 84px;">
+ <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
+ alt="Impresso Project Logo"
+ style="height: 84px; display: block; margin: 0 auto;">
  </a>
  """
  )
@@ -127,12 +129,12 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  # Info modal/accordion for pipeline details
  with gr.Accordion("📝 About the OCR QA Method", open=False, visible=False) as info_accordion:
  gr.Markdown(
- """
- ### 📝 About the OCR QA Method
+ """
+ ### 📝 About the OCR QA Method
 
- This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
+ This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
 
- #### How it works:
+ #### How it works:
  - **Scoring**: The quality score ranges from **0.0** (poor) to **1.0** (excellent) and is based on the ratio of recognized to unrecognized unique word forms.
  - **Lexical resources**: Words are matched against precompiled lists derived from **Wikipedia** and **Wortschatz Leipzig**, using **Bloom filters** for fast, memory-efficient lookup.
  - **Multilingual support**: Available for several languages (e.g., German, French, English). If not specified, the language is detected automatically.
@@ -140,10 +142,10 @@ This pipeline estimates OCR quality by analyzing the proportion of **unique word
  - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
  - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
 
- #### ⚠️ Limitations:
+ #### ⚠️ Limitations:
  - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
  - The method may fail to flag **short OCR artifacts** (e.g., 1–2 character noise) and **non-alphabetic symbols**.
-
+
  As such, the score should be understood as a **heuristic indicator**, best used for:
  - Comparative assessments between OCR outputs
  - Filtering low-quality text from large corpora
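
The scoring described in the accordion text can be illustrated with a minimal sketch. This is not the Impresso pipeline's code: the `BloomFilter` class, the `ocr_quality_score` function, the `\w+` tokenization, the filter size and hash count, and the choice to return 0.0 for empty input are all assumptions made for illustration. The score is taken here as the share of unique word forms found in the lexicon, which is one reading of "proportion of unique words that match" and keeps the result between 0.0 and 1.0.

```python
import hashlib
import re


class BloomFilter:
    """Tiny illustrative Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, word: str):
        # Derive num_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def ocr_quality_score(text: str, lexicon: BloomFilter) -> float:
    """Share of unique word forms in `text` that the lexicon recognizes."""
    # Hypothetical tokenization: lowercase runs of word characters.
    unique_words = set(re.findall(r"\w+", text.lower()))
    if not unique_words:
        return 0.0  # assumption: empty input gets the lowest score
    known = sum(1 for word in unique_words if word in lexicon)
    return known / len(unique_words)


# Toy lexicon standing in for the Wikipedia / Wortschatz Leipzig wordlists.
lexicon = BloomFilter()
for word in ["the", "quick", "brown", "fox"]:
    lexicon.add(word)

# Unique forms: the, qu1ck, brown, fox, f0x; three of them are in the lexicon.
score = ocr_quality_score("The qu1ck brown fox, the f0x", lexicon)
print(f"{score:.2f}")
```

Because a Bloom filter can return false positives, a garbled token may occasionally be counted as known, which is one more reason to treat the score as a heuristic rather than an exact measure.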