Update evaluate.py
evaluate.py  CHANGED  (+74 -3)
@@ -52,8 +52,78 @@ except ImportError:
 
 
 # --- LLM-as-a-Judge Prompt for Answer Correctness ---
-# Aware of QUERY TYPE
-ANSWER_CORRECTNESS_JUDGE_PROMPT = """You are an expert evaluator. Your task is to assess a GENERATED_ANSWER against a GROUND_TRUTH_ANSWER based on the provided QUERY_TYPE and the scoring rubric below.
+# Aware of QUERY TYPE and ROLE
+# In prompts.py or evaluate.py
+ANSWER_CORRECTNESS_JUDGE_PROMPT = """You are an expert evaluator. Your task is to assess a GENERATED_ANSWER against a GROUND_TRUTH_ANSWER based on the provided context (QUERY_TYPE and USER_ROLE) and the scoring rubric below.
+
+--- CONTEXT FOR EVALUATION ---
+QUERY_TYPE: {query_type}
+USER_ROLE: {role}
+
+--- General Rules (Apply to ALL evaluations) ---
+- Ignore minor differences in phrasing, tone, or structure. Your evaluation should be based on the substance of the answer, not its style.
+
+--- Scoring Rubric ---
+- 1.0 (Fully Correct): The generated answer contains all the key factual points and advice from the ground truth.
+- 0.8 (Mostly Correct): The generated answer captures the main point and is factually correct, but it misses a secondary detail or a specific actionable step.
+- 0.5 (Partially Correct): The generated answer is factually correct in what it states but is too generic or vague. It misses the primary advice or the most critical information.
+- 0.0 (Incorrect): The generated answer is factually incorrect, contains hallucinations, or contradicts the core advice of the ground truth.
+
+--- Specific Judging Criteria by Context ---
+- If QUERY_TYPE is 'caregiving_scenario' AND USER_ROLE is 'patient':
+  - Apply the rubric with a focus on **emotional support and validation**. The answer does NOT need to be factually exhaustive to get a high score.
+- If QUERY_TYPE is 'caregiving_scenario' AND USER_ROLE is 'caregiver':
+  - Apply the rubric with a focus on a **blend of empathy and practical, actionable advice**. The answer should be factually aligned with the ground truth.
+- If QUERY_TYPE is 'factual_question':
+  - Your evaluation should be based on **factual accuracy**. Any empathetic or conversational language should be ignored.
+- For all other QUERY_TYPEs:
+  - Default to applying the rubric with a focus on factual accuracy.
+
+--- Examples ---
+# Example for a 1.0 Score (Patient Role - Emotional Support)
+GROUND_TRUTH: It's frustrating when something important goes missing. I understand why you're upset. Why don't we look for it together?
+GENERATED_ANSWER: I hear how frustrating this is for you. You're not alone, let's try and find it together.
+Score: 1.0
+
+# --- NEW CAREGIVER EXAMPLE ---
+# Example for a 1.0 Score (Caregiver Role - Empathy + Action)
+GROUND_TRUTH: This can be very trying. Repetitive questioning happens because the brain isn't retaining new information. Try to answer in a calm, reassuring tone each time.
+GENERATED_ANSWER: It can be very frustrating to answer the same question repeatedly. Remember that this is due to memory changes. The best approach is to stay patient and answer calmly.
+Score: 1.0
+# --- END NEW EXAMPLE ---
+
+# Example for a 0.8 Score (Mostly Correct but Incomplete)
+GROUND_TRUTH: A calm and reassuring approach is best. Instead of arguing, validate their feelings and suggest looking for the item together.
+GENERATED_ANSWER: It's important to stay calm and reassure them. You should tell them you understand they are upset.
+Score: 0.8
+
+# Example for a 0.5 Score (Partially Correct but Vague)
+GROUND_TRUTH: Repetitive questioning happens because the brain isn't retaining new info. Answer calmly, and consider writing the answer on a visible whiteboard.
+GENERATED_ANSWER: It's important to be patient when they ask the same question over and over.
+Score: 0.5
+
+# Example for a 0.0 Score (Contradicts Core Advice)
+GROUND_TRUTH: A calm and reassuring approach is best. Try not to argue about the facts.
+GENERATED_ANSWER: You need to firmly correct him and explain that the carer did not steal his watch. It is important to confront these delusions directly with facts.
+Score: 0.0
+---
+
+--- DATA TO EVALUATE ---
+GROUND_TRUTH_ANSWER:
+{ground_truth_answer}
+
+GENERATED_ANSWER:
+{generated_answer}
+---
+
+Return a single JSON object with your score based on the rubric and examples:
+{{
+"correctness_score": <float>
+}}
+"""
+
+
+ORIG_ANSWER_CORRECTNESS_JUDGE_PROMPT = """You are an expert evaluator. Your task is to assess a GENERATED_ANSWER against a GROUND_TRUTH_ANSWER based on the provided QUERY_TYPE and the scoring rubric below.
 
 QUERY_TYPE: {query_type}
 
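Note on the new template (not part of the commit): the JSON skeleton at the end of the prompt doubles its braces, {{ and }}, so that str.format() emits them as literal braces; only the four named placeholders take keyword arguments, and omitting any of them raises KeyError. A minimal sketch with illustrative values:

# Sketch only: fills the template above with sample values.
judge_msg = ANSWER_CORRECTNESS_JUDGE_PROMPT.format(
    ground_truth_answer="Answer calmly and avoid arguing about the facts.",
    generated_answer="Stay patient and respond in a reassuring tone.",
    query_type="caregiving_scenario",  # a route named in the rubric
    role="caregiver",                  # 'patient' or 'caregiver'
)
print(judge_msg)  # the filled prompt sent to the judge model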
@@ -343,7 +413,8 @@ def run_comprehensive_evaluation(
         judge_msg = ANSWER_CORRECTNESS_JUDGE_PROMPT.format(
             ground_truth_answer=ground_truth_answer,
             generated_answer=answer_text,
-            query_type=expected_route  # <-- Add this line
+            query_type=expected_route,  # <-- Add this line
+            role=current_test_role  # <-- ADD THIS LINE
         )
         # judge_msg = ANSWER_CORRECTNESS_JUDGE_PROMPT.format(ground_truth_answer=ground_truth_answer, generated_answer=answer_text)
         # print(f"  - Judge Prompt Sent:\n{judge_msg}")