simon-clmtd committed on
Commit 79a3f1d · verified · 1 Parent(s): 31e0f17

Update app.py

Files changed (1)
  1. app.py +4 -4
app.py CHANGED
@@ -128,11 +128,11 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  with gr.Accordion("📝 About the OCR QA Method", open=False, visible=False) as info_accordion:
  gr.Markdown(
  """
- ### 📝 About the OCR QA Method
+ ### 📝 About the OCR QA Method
 
- This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
+ This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
 
- #### How it works:
+ #### How it works:
  - **Scoring**: The quality score ranges from **0.0** (poor) to **1.0** (excellent) and is based on the ratio of recognized to unrecognized unique word forms.
  - **Lexical resources**: Words are matched against precompiled lists derived from **Wikipedia** and **Wortschatz Leipzig**, using **Bloom filters** for fast, memory-efficient lookup.
  - **Multilingual support**: Available for several languages (e.g., German, French, English). If not specified, the language is detected automatically.
@@ -140,7 +140,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
  - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
  - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
 
- #### ⚠️ Limitations:
+ #### ⚠️ Limitations:
  - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
  - The method may fail to flag **short OCR artifacts** (e.g., 1–2 character noise) and **non-alphabetic symbols**.
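
The scoring idea described in the Markdown text above can be illustrated with a minimal sketch. This is not the code from `app.py` or from the underlying pipeline: the names `BloomFilter` and `ocr_quality_score` are hypothetical, the tokenization (lowercased runs of letters) is an assumption, and the score is assumed to be the share of unique word forms found in the wordlist, which is one reading of "ratio of recognized to unrecognized" that stays within 0.0 to 1.0.

```python
import hashlib
import re


class BloomFilter:
    """Tiny illustrative Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting a SHA-256 hash of the word.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word))


def ocr_quality_score(text, wordlist):
    """Return (score, known, unknown) over the unique lowercased word forms in text.

    `wordlist` is anything supporting `in`: a BloomFilter or a plain set.
    Assumption: score = known unique words / all unique words.
    """
    unique_words = set(re.findall(r"[^\W\d_]+", text.lower()))
    known = {w for w in unique_words if w in wordlist}
    unknown = unique_words - known
    score = len(known) / len(unique_words) if unique_words else 0.0
    return score, known, unknown


if __name__ == "__main__":
    # Toy wordlist; the real lists are derived from Wikipedia and Wortschatz Leipzig.
    lexicon = BloomFilter()
    for word in ["the", "quick", "brown", "fox"]:
        lexicon.add(word)
    score, known, unknown = ocr_quality_score("The quick brovvn fox the", lexicon)
    print(score, sorted(known), sorted(unknown))
```

Because only unique word forms are counted, a repeated OCR error lowers the score once rather than once per occurrence. A Bloom filter can occasionally report an unseen word as known, so in a sketch like this the score can be slightly optimistic but never pessimistic relative to an exact set lookup.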