impresso-project/ocrqa-demo

simon-clmtd committed on Sep 12

Commit 79a3f1d (verified)

1 parent: 31e0f17

Update app.py

Files changed (1)

app.py +4 -4

@@ -128,11 +128,11 @@ with gr.Blocks(title="OCR QA Demo") as demo:
     with gr.Accordion("📝 About the OCR QA Method", open=False, visible=False) as info_accordion:
         gr.Markdown(
             """
-        ### 📝 About the OCR QA Method
-    This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
-    #### How it works:
+### 📝 About the OCR QA Method
+This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
+#### How it works:
     - **Scoring**: The quality score ranges from **0.0** (poor) to **1.0** (excellent) and is based on the ratio of recognized to unrecognized unique word forms.
     - **Lexical resources**: Words are matched against precompiled lists derived from **Wikipedia** and **Wortschatz Leipzig**, using **Bloom filters** for fast, memory-efficient lookup.
     - **Multilingual support**: Available for several languages (e.g., German, French, English). If not specified, the language is detected automatically.
@@ -140,7 +140,7 @@ with gr.Blocks(title="OCR QA Demo") as demo:
         - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
         - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
-    #### ⚠️ Limitations:
+#### ⚠️ Limitations:
     - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
     - The method may fail to flag **short OCR artifacts** (e.g., 1–2 character noise) and **non-alphabetic symbols**.
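
The scoring idea described in the Markdown text above can be sketched in a few lines of Python. This is only an illustration, not the code behind the demo: `TinyBloomFilter`, `ocr_quality_score`, the regex tokenizer, the filter size, and the five-word list are all hypothetical stand-ins. It also assumes the score is the share of unique word forms found in the wordlist, which is one way to read "ratio of recognized to unrecognized" that stays within 0.0 to 1.0.

```python
import hashlib
import re


class TinyBloomFilter:
    """Minimal Bloom filter (hypothetical; not the demo's implementation)."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # Can report a false positive, but never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


def ocr_quality_score(text, known_words):
    """Return (score, known, unknown) over unique lowercased word forms."""
    # Assumed tokenizer: runs of letters only, so digits and symbols are dropped.
    unique_words = set(re.findall(r"[^\W\d_]+", text.lower()))
    if not unique_words:
        # Assumption: an empty text gets the lowest score.
        return 0.0, set(), set()
    known = {word for word in unique_words if word in known_words}
    unknown = unique_words - known
    return len(known) / len(unique_words), known, unknown


# Toy wordlist standing in for the Wikipedia / Wortschatz Leipzig lists.
wordlist = TinyBloomFilter()
for word in ["the", "quick", "brown", "fox", "jumps"]:
    wordlist.add(word)

score, known, unknown = ocr_quality_score("Tbe quick brovvn fox jumps", wordlist)
print(f"score={score:.2f} known={sorted(known)} unknown={sorted(unknown)}")
```

Because a Bloom filter can return false positives, a garbled token may occasionally be counted as known, so a score computed this way is an upper-leaning estimate, not an exact match rate.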