Spaces: impresso-project/solr-normalization-demo

simon-clmtd committed on Sep 13

Commit e054c82 (verified) · 1 Parent(s): affc40d

Update app.py

Files changed (1)

app.py +40 -21

@@ -40,16 +40,28 @@ with gr.Blocks(title="Solr Normalization Demo") as demo:
     gr.Image("logo.jpeg", label=None, show_label=False, container=False, height=100)
     gr.Markdown(
-        """
-        # 🧹 Solr Normalization Pipeline Demo
-        **Solr normalization** is meant to demonstrate how text is normalized in the **Impresso** project.
-        This pipeline replicates Solr's text processing functionality, showing how text goes through various
-        analyzers including tokenization, stopword removal, and language-specific transformations.
-        Try the example below or enter your own text to see how it gets processed!
-        """
-    )
     with gr.Row():
         with gr.Column():
@@ -78,17 +90,24 @@ with gr.Blocks(title="Solr Normalization Demo") as demo:
     # Info modal/accordion for pipeline details
     with gr.Accordion("📝 About the Pipeline", open=False, visible=False) as info_accordion:
-        gr.Markdown(
-            """
-            - **Tokenization**: Splits text into individual tokens
-            - **Tokenfilter**: Applies various transformations like:
-                - elision: removes leading apostrophes and articles in languages like French and Italian
-                - lowercase: converts to lowercase
-                - asciifolding: converts accented characters to ASCII
-                - stop: removes common stopwords
-                - stemmer: reduces words to a common base or stem, improving recall in search
-                - normalization: applies language-specific normalization
-            """
         )
     submit_btn.click(

     gr.Image("logo.jpeg", label=None, show_label=False, container=False, height=100)
     gr.Markdown(
+    """
+    # 🧹 Solr Normalization Pipeline Demo
+
+    This demo showcases the **Solr Normalization Pipeline**, which replicates the text preprocessing steps applied by Solr during indexing to help you understand how raw input is transformed before becoming searchable.
+
+    The pipeline applies:
+    - **Tokenization** (splitting text into searchable units)
+    - **Stopword removal** (filtering out common, uninformative words)
+    - **Lowercasing and normalization**
+    - **Language-specific filters** (e.g., stemming, elision)
+
+    These steps are crucial for improving **search recall** and maintaining **linguistic consistency** across large, multilingual corpora.
+
+    🧠 **Why is this useful?**
+
+    - It explains why search results might not exactly match the words you entered.
+    - It shows how different word forms are **collapsed** into searchable stems.
+    - It helps interpret unexpected matches (or mismatches) when querying historical text collections.
+
+    You can try the example below, or enter your own text to explore how it is normalized behind the scenes.
+    """
+    )
     with gr.Row():
         with gr.Column():
     # Info modal/accordion for pipeline details
     with gr.Accordion("📝 About the Pipeline", open=False, visible=False) as info_accordion:
+        gr.Markdown("""
+    This pipeline mirrors the standard **Solr analyzer sequence** used in the Impresso project’s indexing infrastructure. It helps interpret how raw text is processed before being indexed.
+
+    #### Key Components:
+    - **Tokenization**: Splits input text into individual word units (tokens).
+    - **Token Filters**: Applies a series of language-aware transformations, including:
+        - `elision`: Removes elided articles and their apostrophes (e.g., *l’homme* → *homme*).
+        - `lowercase`: Converts tokens to lowercase.
+        - `asciifolding`: Converts accented characters to their basic ASCII equivalents (e.g., *é* → *e*).
+        - `stop`: Removes common stopwords (e.g., *the*, *and*, *le*).
+        - `stemmer`: Reduces words to a common stem (e.g., *running* → *run*).
+        - `normalization`: Applies custom language-specific rules.
+
+    #### Use Cases:
+    - Understand how language-specific rules impact search.
+    - Evaluate the effect of stopwords, stemming, and normalization.
+    - Debug or fine-tune analyzer behavior for multilingual corpora.
+    """
         )
     submit_btn.click(
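
The filter sequence described in the accordion text above can be sketched in plain Python. This is a minimal illustration, not the Space's actual implementation and not Solr's own code: the function names (`tokenize`, `strip_elision`, `ascii_fold`, `naive_stem`, `normalize`), the tiny article and stopword sets, and the toy suffix stripper are all assumptions made for the example. The final language-specific `normalization` step is left out because the diff does not say what it does.

```python
import re
import unicodedata

# Illustrative data only: small hand-picked sets, not Solr's real resources.
ELIDED_ARTICLES = {"l", "d", "j", "qu", "n", "s", "c", "m", "t"}
STOPWORDS = {"the", "and", "le", "la", "et", "de"}


def tokenize(text):
    """Split text into word tokens, keeping apostrophes inside a token."""
    return re.findall(r"\w+(?:['\u2019]\w+)*", text)


def strip_elision(token):
    """Drop a leading elided article such as l' or d' (l'homme -> homme)."""
    head, sep, tail = token.replace("\u2019", "'").partition("'")
    if sep and tail and head.lower() in ELIDED_ARTICLES:
        return tail
    return token


def ascii_fold(token):
    """Remove combining accents after NFKD decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy suffix stripper standing in for a real language-specific stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Apply the filters in the order the accordion text lists them."""
    tokens = tokenize(text)
    tokens = [strip_elision(t) for t in tokens]
    tokens = [t.lower() for t in tokens]
    tokens = [ascii_fold(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]
```

With these assumed word lists, `normalize("L’homme et la Révolution")` returns `["homme", "revolution"]`: the elided article is dropped, case and accents are folded, and the stopwords *et* and *la* are filtered out. A real stemmer would handle far more cases than the three suffixes tried here.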