import os

# Redirect the cache to a writable path inside the container
os.environ["XDG_CACHE_HOME"] = "/tmp/.cache"

import gradio as gr
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline

pipeline = SolrNormalizationPipeline()
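
# Illustrative sketch of a single call (the result keys follow their usage in
# normalize() below; exact values depend on the impresso_pipelines version):
#
#   result = pipeline("Le chat dort.", lang="fr", diagnostics=True)
#   result["language"]            -> detected or supplied language code
#   result["tokens"]              -> normalized tokens as they would be indexed
#   result["stopwords_detected"]  -> stopwords filtered out of the input
#   result["analyzer_pipeline"]   -> list of {"type": ..., "name": ...} steps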

LANGUAGES = ["de", "fr", "es", "it", "pt", "nl", "en", "general"]

# Example text and default language
EXAMPLE_TEXT = "The quick brown fox jumps over the lazy dog. This is a sample text for demonstration purposes."
DEFAULT_LANGUAGE = "en"

def normalize(text, lang_choice):
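    """Normalize `text` with the Solr pipeline and format a readable report.

    diagnostics=True asks the pipeline for analyzer details (tokens, detected
    stopwords, and the list of analyzer steps) alongside the normalized output.
    """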
    try:
        lang = None if lang_choice == "Auto-detect" else lang_choice
        result = pipeline(text, lang=lang, diagnostics=True)
        
        # Format analyzer pipeline for better readability
        analyzer_steps = []
        if 'analyzer_pipeline' in result and result['analyzer_pipeline']:
            for i, step in enumerate(result['analyzer_pipeline'], 1):
                step_type = step.get('type', 'unknown')
                step_name = step.get('name', 'unnamed')
                analyzer_steps.append(f"  {i}. {step_type}: {step_name}")
        
        analyzer_display = "\n".join(analyzer_steps) if analyzer_steps else "  No analyzer steps found"
        
        return f"🌍 Language: {result['language']}\n\n🔤 Tokens:\n{result['tokens']}\n\n🚫 Detected stopwords:\n{result['stopwords_detected']}\n\n⚙️ Analyzer pipeline:\n{analyzer_display}"
    except Exception as e:
        print("❌ Pipeline error:", e)
        return f"Error: {e}"

# Create the interface with logo and improved description
with gr.Blocks(title="Solr Normalization Demo") as demo:
    # Add logo at the top
    gr.Image("logo.jpeg", label=None, show_label=False, container=False, height=100)
    
    gr.Markdown(
    """
    # 🧹 Solr Normalization Pipeline Demo

    This demo showcases the **Solr Normalization Pipeline**, which replicates the text preprocessing steps applied by Solr <span title="Solr is the platform that provides search capabilities in Impresso. Several preprocessing steps must be undertaken to prepare data to be searchable in Solr. These steps are common in Natural Language Processing pipelines, as they help normalize textual data by, for example, lowercasing the whole text. This makes searches case-insensitive: whether you write 'Dog' or 'dog', you get the same results.">ℹ️</span> during indexing to help you understand how raw input is transformed before becoming searchable.

    The pipeline applies:
    - **Tokenization** (splitting text into searchable units)
    - **Stopword removal** (filtering out common, uninformative words)
    - **Lowercasing and normalization**
    - **Language-specific filters** (e.g., stemming, elision)

    These steps are crucial for improving **search recall** and maintaining **linguistic consistency** across large, multilingual corpora.

    🧠 **Why is this useful?**
    
    - It explains why search results might not exactly match the words you entered.
    - It shows how different word forms are **collapsed** into searchable stems.
    - It helps interpret unexpected matches (or mismatches) when querying historical text collections.

    You can try the example below, or enter your own text to explore how it is normalized behind the scenes.
    """
)
    
    # Add Solr explanation accordion
    with gr.Accordion("❓ What is Solr?", open=False):
        gr.Markdown("""
        **Solr is the search engine platform used to power fast and flexible information retrieval.**  
        It indexes large collections of text and allows users to query them efficiently, returning the most relevant results.  

        Before data can be used in Solr, it must go through several **preprocessing and indexing steps**.  
        These include tokenization (splitting text into words), lowercasing, stopword removal (e.g., ignoring common words like "the" or "and"), and stemming or lemmatization (reducing words to their root forms).  

        Such steps are common in **Natural Language Processing (NLP)** pipelines, as they help standardize text and make search more robust.  
        For example, thanks to normalization, a search for "running" can also match documents containing "run."  
        Similarly, lowercasing ensures that "History" and "history" are treated as the same word, making searches case-insensitive.
        """)
    
    with gr.Row():
        with gr.Column():
            text_input = gr.Textbox(
                label="Enter Text", 
                value=EXAMPLE_TEXT,
                lines=3,
                placeholder="Enter your text here..."
            )
            lang_dropdown = gr.Dropdown(
                choices=["Auto-detect"] + LANGUAGES, 
                value=DEFAULT_LANGUAGE, 
                label="Language"
            )
            submit_btn = gr.Button("🚀 Normalize Text", variant="primary")
            info_btn = gr.Button("Help", size="sm", scale=1)
        
        with gr.Column():
            with gr.Row():
                output = gr.Textbox(
                    label="Normalized Output", 
                    lines=15,
                    placeholder="Results will appear here...",
                    scale=10
                )
                
    
    # Info modal/accordion for pipeline details
    with gr.Accordion("ℹ️ Help", open=False, visible=False) as info_accordion:
        gr.Markdown("""
    This pipeline mirrors the standard **Solr analyzer sequence** used in the Impresso project’s indexing infrastructure. It helps interpret how raw text is processed before being indexed.

    #### Key Components:
    - **Tokenization**: Splits input text into individual word units (tokens).
    - **Token Filters**: Applies a series of language-aware transformations, including:
        - `elision`: Removes leading apostrophes/articles (e.g., *l’homme* → *homme*).
        - `lowercase`: Converts tokens to lowercase.
        - `asciifolding`: Converts accented characters to basic ASCII (e.g., *é* → *e*).
        - `stop`: Removes common stopwords (e.g., *the*, *and*, *le*).
        - `stemmer`: Reduces words to their root form (e.g., *running* → *run*).
        - `normalization`: Applies custom language-specific rules.

    #### Use Cases:
    - Understand how language-specific rules impact search.
    - Evaluate the effect of stopwords, stemming, and normalization.
    - Debug or fine-tune analyzer behavior for multilingual corpora.
    """
        )
    
    submit_btn.click(
        fn=normalize,
        inputs=[text_input, lang_dropdown],
        outputs=output
    )
    
    # Toggle info visibility when info button is clicked
    info_btn.click(
        fn=lambda: gr.Accordion(visible=True, open=True),
        outputs=info_accordion
    )

# Bind to all interfaces so the demo is reachable from outside the container.
demo.launch(server_name="0.0.0.0", server_port=7860)