Spaces: impresso-project/ocrqa-demo


simon-clmtd committed on Sep 12

Commit d2137e3 (verified) · 1 Parent(s): 79a3f1d

Update app.py

Files changed (1)

app.py +9 -7

@@ -81,7 +81,9 @@ with gr.Blocks(title="OCR QA Demo") as demo:
     gr.HTML(
     """
     <a href="https://impresso-project.ch" target="_blank">
-        <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg" alt="Impresso Project Logo" style="height: 84px;">
+        <img src="https://huggingface.co/spaces/impresso-project/ocrqa-demo/resolve/main/logo.jpeg"
+             alt="Impresso Project Logo"
+             style="height: 84px; display: block; margin: 0 auto;">
     </a>
     """
 )
@@ -127,12 +129,12 @@ with gr.Blocks(title="OCR QA Demo") as demo:
     # Info modal/accordion for pipeline details
     with gr.Accordion("📝 About the OCR QA Method", open=False, visible=False) as info_accordion:
         gr.Markdown(
-            """
-### 📝 About the OCR QA Method
-This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
-#### How it works:
+    """
+    ### 📝 About the OCR QA Method
+    This pipeline estimates OCR quality by analyzing the proportion of **unique words** in a text that match curated wordlists for a given language.
+    #### How it works:
     - **Scoring**: The quality score ranges from **0.0** (poor) to **1.0** (excellent) and is based on the ratio of recognized to unrecognized unique word forms.
     - **Lexical resources**: Words are matched against precompiled lists derived from **Wikipedia** and **Wortschatz Leipzig**, using **Bloom filters** for fast, memory-efficient lookup.
     - **Multilingual support**: Available for several languages (e.g., German, French, English). If not specified, the language is detected automatically.
@@ -140,10 +142,10 @@ This pipeline estimates OCR quality by analyzing the proportion of **unique word
         - ✅ **Known tokens**: Words found in the reference wordlist, presumed correctly OCR’d.
         - ❌ **Unrecognized tokens**: Words not found in the list—often OCR errors, rare forms, or out-of-vocabulary items (e.g., names, historical terms).
-#### ⚠️ Limitations:
+    #### ⚠️ Limitations:
     - The wordlists are **not exhaustive**, particularly for **historical vocabulary**, **dialects**, or **named entities**.
     - The method may fail to flag **short OCR artifacts** (e.g., 1–2 character noise) and **non-alphabetic symbols**.
     As such, the score should be understood as a **heuristic indicator**, best used for:
     - Comparative assessments between OCR outputs
     - Filtering low-quality text from large corpora
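
The accordion text in this diff describes the scoring idea only in prose. The following is a minimal Python sketch of that idea, not the Space's actual implementation: `TinyBloomFilter`, `ocr_quality_score`, the hash scheme, the filter size, and the tokenization regex are all hypothetical choices made for illustration. It also assumes the score is the share of unique word forms found in the lexicon (known / total), which is one reading of the description that keeps the result in the stated 0.0 to 1.0 range.

```python
import hashlib
import re


class TinyBloomFilter:
    """Illustrative Bloom filter (hypothetical; not the one used by the demo)."""

    def __init__(self, size_bits: int = 8192, num_hashes: int = 4) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, word: str):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        # True means "probably present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def ocr_quality_score(text, lexicon):
    """Return (score, known, unknown) over the unique word forms in text.

    `lexicon` is anything supporting `in`, e.g. a set or a TinyBloomFilter.
    The tokenization (lowercased runs of letters) is an assumption.
    """
    tokens = set(re.findall(r"[^\W\d_]+", text.lower()))
    if not tokens:
        return 0.0, set(), set()
    known = {t for t in tokens if t in lexicon}
    unknown = tokens - known
    return len(known) / len(tokens), known, unknown


if __name__ == "__main__":
    lexicon = TinyBloomFilter()
    for word in ["the", "quick", "brown", "fox"]:
        lexicon.add(word)
    score, known, unknown = ocr_quality_score("The quick brovvn fox", lexicon)
    print(f"score={score:.2f} known={sorted(known)} unknown={sorted(unknown)}")
```

Because a Bloom filter can return false positives but never false negatives, a score computed this way can only overestimate the share of known words, which fits the description of the score as a heuristic indicator.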