Update evaluate.py
evaluate.py  +46 -12  CHANGED
@@ -52,17 +52,49 @@ except ImportError:


 # --- LLM-as-a-Judge Prompt for Answer Correctness ---
-
-
-
-
-
-
-
-
-
- 0
- 0.
+# Aware of QUERY TYPE
+ANSWER_CORRECTNESS_JUDGE_PROMPT = """You are an expert evaluator. Your task is to assess a GENERATED_ANSWER against a GROUND_TRUTH_ANSWER based on the provided QUERY_TYPE and the scoring rubric below.
+
+QUERY_TYPE: {query_type}
+
+--- General Rules (Apply to ALL evaluations) ---
+- Ignore minor differences in phrasing, tone, or structure. Your evaluation should be based on the substance of the answer, not its style.
+
+--- Scoring Rubric ---
+- 1.0 (Fully Correct): The generated answer contains all the key factual points and advice from the ground truth.
+- 0.8 (Mostly Correct): The generated answer captures the main point and is factually correct, but it misses a secondary detail or a specific actionable step.
+- 0.5 (Partially Correct): The generated answer is factually correct in what it states but is too generic or vague. It misses the primary advice or the most critical information.
+- 0.0 (Incorrect): The generated answer is factually incorrect, contains hallucinations, or contradicts the core advice of the ground truth.
+
+--- Specific Judging Criteria by QUERY_TYPE ---
+- If QUERY_TYPE is 'caregiving_scenario' AND the user is the patient:
+  - Apply the rubric with a focus on **emotional support and validation**. The answer does NOT need to be factually exhaustive to get a high score. A 1.0 score means it provided excellent emotional comfort that aligns with the ground truth's intent.
+- If QUERY_TYPE is 'factual_question':
+  - Apply the rubric with a focus on **strict factual accuracy**. The answer must be factually aligned with the ground truth to get a high score.
+- For all other QUERY_TYPEs:
+  - Default to applying the rubric with a focus on factual accuracy.
+
+--- Examples ---
+# Example for a 1.0 Score (Different Tone, Same Facts)
+GROUND_TRUTH: For a withdrawn person, a powerful approach is personalized music therapy. Creating a playlist of music from their youth can help them reconnect.
+GENERATED_ANSWER: It's hard when he's so withdrawn. You could try making a playlist of his favorite songs from when he was younger. Music is a wonderful way to connect.
+Score: 1.0
+
+# Example for a 0.8 Score (Mostly Correct but Incomplete)
+GROUND_TRUTH: A calm and reassuring approach is best. Instead of arguing, validate their feelings and suggest looking for the item together.
+GENERATED_ANSWER: It's important to stay calm and reassure them. You should tell them you understand they are upset.
+Score: 0.8
+
+# Example for a 0.5 Score (Partially Correct but Vague)
+GROUND_TRUTH: Repetitive questioning happens because the brain isn't retaining new info. Answer calmly, and consider writing the answer on a visible whiteboard.
+GENERATED_ANSWER: It's important to be patient when they ask the same question over and over.
+Score: 0.5
+
+# Example for a 0.0 Score (Contradicts Core Advice)
+GROUND_TRUTH: A calm and reassuring approach is best. Try not to argue about the facts.
+GENERATED_ANSWER: You need to firmly correct him and explain that the carer did not steal his watch. It is important to confront these delusions directly with facts.
+Score: 0.0
+---

 --- DATA TO EVALUATE ---
 GROUND_TRUTH_ANSWER:
@@ -72,12 +104,14 @@ GENERATED_ANSWER:
 {generated_answer}
 ---

-Return a single JSON object with your score:
+Return a single JSON object with your score based on the rubric and examples:
 {{
 "correctness_score": <float>
 }}
 """

+
+
 test_fixtures = []

 def load_test_fixtures():
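For context, here is a minimal sketch of how the updated template might be consumed downstream in evaluate.py. None of this appears in the diff: score_answer, the judge_fn callable, and the ground_truth_answer placeholder name are illustrative assumptions (only the {query_type} and {generated_answer} placeholders are visible above), so treat it as a usage sketch rather than the actual implementation.

# Usage sketch, assuming ANSWER_CORRECTNESS_JUDGE_PROMPT from the diff is in scope.
import json
import re
from typing import Callable

def score_answer(
    judge_fn: Callable[[str], str],  # caller supplies the actual LLM call used by evaluate.py
    query_type: str,
    ground_truth: str,
    generated: str,
) -> float:
    """Fill the judge prompt, query the judge model, and return correctness_score."""
    prompt = ANSWER_CORRECTNESS_JUDGE_PROMPT.format(
        query_type=query_type,
        ground_truth_answer=ground_truth,  # assumed placeholder name, not shown in the diff
        generated_answer=generated,
    )
    raw_reply = judge_fn(prompt)

    # The prompt asks for a single JSON object, but judge models sometimes wrap it
    # in prose or code fences, so extract the first {...} span defensively.
    match = re.search(r"\{.*\}", raw_reply, re.DOTALL)
    if not match:
        raise ValueError(f"Judge reply contained no JSON object: {raw_reply!r}")
    return float(json.loads(match.group(0))["correctness_score"])

Note that the doubled braces ({{ ... }}) in the template's JSON skeleton survive str.format() as literal braces, which is why the single-brace placeholders can be substituted without breaking the example output format.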