Spaces:

nhop
/

L3Score

Running

App Files Files Community

nhop commited on Apr 15

Commit

26a4d51

1 Parent(s): 8f7a170

Update gradio app

Browse files

Files changed (1) hide show

app.py +132 -3

app.py CHANGED Viewed

@@ -1,6 +1,135 @@
 import evaluate
-from evaluate.utils import launch_gradio_widget
-module = evaluate.load("nhop/L3Score")
-launch_gradio_widget(module)

+import gradio as gr
 import evaluate
+l3score = evaluate.load("your-username/L3Score")
+def compute_l3score(api_key, provider, model, questions, predictions, references):
+    try:
+        result = l3score.compute(
+            questions=[q.strip() for q in questions.split("\n") if q.strip()],
+            predictions=[p.strip() for p in predictions.split("\n") if p.strip()],
+            references=[r.strip() for r in references.split("\n") if r.strip()],
+            api_key=api_key,
+            provider=provider,
+            model=model
+        )
+        return result
+    except Exception as e:
+        return {"error": str(e)}
+with gr.Blocks() as demo:
+    gr.Markdown(r"""
+    # 🦢 L3Score Evaluation Demo
+    ## 📌 Description
+    **L3Score** evaluates how semantically close a model-generated answer is to a reference answer for a given question. It prompts a **language model as a judge** using:
+    ```text
+    You are given a question, ground-truth answer, and a candidate answer.
+    Question: {{question}}
+    Ground-truth answer: {{gt}}
+    Candidate answer: {{answer}}
+    Is the semantic meaning of the ground-truth and candidate answers similar?
+    Answer in one word - Yes or No.
+    ```
+    The model's **log-probabilities** for "Yes" and "No" tokens are used to compute the score.
+    ---
+    ## 🧮 Scoring Logic
+    Let $l_{\text{yes}}$ and $l_{\text{no}}$ be the log-probabilities of "Yes" and "No", respectively.
+    - If neither token is in the top-5:
+    $$
+    \text{L3Score} = 0
+    $$
+    - If both are present:
+    $$
+    \text{L3Score} = \frac{\exp(l_{\text{yes}})}{\exp(l_{\text{yes}}) + \exp(l_{\text{no}})}
+    $$
+    - If only one is present, the missing token’s probability is estimated using remaining mass or the least likely top-5 token.
+    ---
+    ## 🚀 How to Use
+    ```python
+    import evaluate
+    l3score = evaluate.load("your-username/L3Score")
+    score = l3score.compute(
+        questions=["What is the capital of France?"],
+        predictions=["Paris"],
+        references=["Paris"],
+        api_key="your-openai-api-key",
+        provider="openai",
+        model="gpt-4o-mini"
+    )
+    print(score)
+    # {'L3Score': 0.99...}
+    ```
+    ---
+    ## 🔠 Inputs
+    | Name         | Type         | Description                                                                 |
+    |--------------|--------------|-----------------------------------------------------------------------------|
+    | `questions`  | `list[str]`  | The list of input questions.                                                |
+    | `predictions`| `list[str]`  | Generated answers by the model being evaluated.                            |
+    | `references` | `list[str]`  | Ground-truth or reference answers.                                         |
+    | `api_key`    | `str`        | API key for the selected LLM provider.                                     |
+    | `provider`   | `str`        | Must support top-n token log-probabilities.                                |
+    | `model`      | `str`        | Name of the evaluation LLM.                                                |
+    ## 📄 Output
+    ```python
+    {"L3Score": float}
+    ```
+    The value is the **average score** over all (question, prediction, reference) triplets.
+    ---
+    ## ⚠️ Limitations and Bias
+    - Requires models that expose **top-n token log-probabilities** (e.g., OpenAI, DeepSeek, Groq).
+    - Scores are **only comparable when using the same judge model**.
+    ## 📖 Citation
+    ```bibtex
+    @article{pramanick2024spiqa,
+      title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
+      author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
+      journal={arXiv preprint arXiv:2407.09413},
+      year={2024}
+    }
+    ```
+    """)
+    with gr.Row():
+        api_key = gr.Textbox(label="API Key", type="password")
+        provider = gr.Dropdown(label="Provider", choices=["openai", "deepseek", "xai"], value="openai")
+        model = gr.Textbox(label="Model", value="gpt-4o-mini")
+    with gr.Row():
+        questions = gr.Textbox(label="Questions (one per line)", lines=4, placeholder="What is the capital of France?")
+        predictions = gr.Textbox(label="Predictions (one per line)", lines=4, placeholder="Paris")
+        references = gr.Textbox(label="References (one per line)", lines=4, placeholder="Paris")
+    compute_button = gr.Button("Compute L3Score")
+    output = gr.JSON(label="L3Score Result")
+    compute_button.click(
+        fn=compute_l3score,
+        inputs=[api_key, provider, model, questions, predictions, references],
+        outputs=output
+    )
+demo.launch()