KeenWoo committed on commit a09a9f3 · verified · 1 Parent(s): 5681fa9

Update evaluate.py

Files changed (1)
  1. evaluate.py +46 -12
evaluate.py CHANGED
@@ -52,17 +52,49 @@ except ImportError:
 
 
 # --- LLM-as-a-Judge Prompt for Answer Correctness ---
-ANSWER_CORRECTNESS_JUDGE_PROMPT = """You are an expert evaluator. Your task is to assess the factual correctness of a generated answer against a ground truth answer.
-
-- GROUND_TRUTH_ANSWER: This is the gold-standard, correct answer.
-- GENERATED_ANSWER: This is the answer produced by the AI model.
-
-Evaluate if the GENERATED_ANSWER is factually aligned with the GROUND_TRUTH_ANSWER. Ignore minor differences in phrasing, tone, or structure. The key is factual accuracy.
-
-Respond with a single JSON object containing a float score from 0.0 to 1.0.
-- 1.0: The generated answer is factually correct and aligns perfectly with the ground truth.
-- 0.5: The generated answer is partially correct but misses key information or contains minor inaccuracies.
-- 0.0: The generated answer is factually incorrect or contradicts the ground truth.
+# Aware of QUERY TYPE
+ANSWER_CORRECTNESS_JUDGE_PROMPT = """You are an expert evaluator. Your task is to assess a GENERATED_ANSWER against a GROUND_TRUTH_ANSWER based on the provided QUERY_TYPE and the scoring rubric below.
+
+QUERY_TYPE: {query_type}
+
+--- General Rules (Apply to ALL evaluations) ---
+- Ignore minor differences in phrasing, tone, or structure. Your evaluation should be based on the substance of the answer, not its style.
+
+--- Scoring Rubric ---
+- 1.0 (Fully Correct): The generated answer contains all the key factual points and advice from the ground truth.
+- 0.8 (Mostly Correct): The generated answer captures the main point and is factually correct, but it misses a secondary detail or a specific actionable step.
+- 0.5 (Partially Correct): The generated answer is factually correct in what it states but is too generic or vague. It misses the primary advice or the most critical information.
+- 0.0 (Incorrect): The generated answer is factually incorrect, contains hallucinations, or contradicts the core advice of the ground truth.
+
+--- Specific Judging Criteria by QUERY_TYPE ---
+- If QUERY_TYPE is 'caregiving_scenario' AND the user is the patient:
+  - Apply the rubric with a focus on **emotional support and validation**. The answer does NOT need to be factually exhaustive to get a high score. A 1.0 score means it provided excellent emotional comfort that aligns with the ground truth's intent.
+- If QUERY_TYPE is 'factual_question':
+  - Apply the rubric with a focus on **strict factual accuracy**. The answer must be factually aligned with the ground truth to get a high score.
+- For all other QUERY_TYPEs:
+  - Default to applying the rubric with a focus on factual accuracy.
+
+--- Examples ---
+# Example for a 1.0 Score (Different Tone, Same Facts)
+GROUND_TRUTH: For a withdrawn person, a powerful approach is personalized music therapy. Creating a playlist of music from their youth can help them reconnect.
+GENERATED_ANSWER: It's hard when he's so withdrawn. You could try making a playlist of his favorite songs from when he was younger. Music is a wonderful way to connect.
+Score: 1.0
+
+# Example for a 0.8 Score (Mostly Correct but Incomplete)
+GROUND_TRUTH: A calm and reassuring approach is best. Instead of arguing, validate their feelings and suggest looking for the item together.
+GENERATED_ANSWER: It's important to stay calm and reassure them. You should tell them you understand they are upset.
+Score: 0.8
+
+# Example for a 0.5 Score (Partially Correct but Vague)
+GROUND_TRUTH: Repetitive questioning happens because the brain isn't retaining new info. Answer calmly, and consider writing the answer on a visible whiteboard.
+GENERATED_ANSWER: It's important to be patient when they ask the same question over and over.
+Score: 0.5
+
+# Example for a 0.0 Score (Contradicts Core Advice)
+GROUND_TRUTH: A calm and reassuring approach is best. Try not to argue about the facts.
+GENERATED_ANSWER: You need to firmly correct him and explain that the carer did not steal his watch. It is important to confront these delusions directly with facts.
+Score: 0.0
+---
 
 --- DATA TO EVALUATE ---
 GROUND_TRUTH_ANSWER:
@@ -72,12 +104,14 @@ GENERATED_ANSWER:
 {generated_answer}
 ---
 
-Return a single JSON object with your score:
+Return a single JSON object with your score based on the rubric and examples:
 {{
    "correctness_score": <float>
 }}
 """
 
+
+
 test_fixtures = []
 
 def load_test_fixtures():
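
The updated template is rendered with `str.format` (placeholders like `{query_type}` and `{generated_answer}` are visible in the hunks; the doubled `{{ }}` escape the literal braces of the JSON example), and the judge's reply is expected to be a JSON object with a `correctness_score` float. A minimal sketch of that round trip — note that `JUDGE_TEMPLATE`, `build_judge_prompt`, `parse_correctness_score`, and the `ground_truth_answer` placeholder name are illustrative assumptions, not code from this commit:

```python
import json

# Abbreviated stand-in for ANSWER_CORRECTNESS_JUDGE_PROMPT (hypothetical;
# only {query_type} and {generated_answer} appear in the hunks shown).
JUDGE_TEMPLATE = (
    "QUERY_TYPE: {query_type}\n"
    "GROUND_TRUTH_ANSWER:\n{ground_truth_answer}\n"
    "GENERATED_ANSWER:\n{generated_answer}\n"
    'Return a single JSON object: {{ "correctness_score": <float> }}\n'
)

def build_judge_prompt(query_type: str, ground_truth: str, generated: str) -> str:
    # str.format collapses the doubled braces {{ }} into literal single
    # braces, so the JSON example survives formatting intact.
    return JUDGE_TEMPLATE.format(
        query_type=query_type,
        ground_truth_answer=ground_truth,
        generated_answer=generated,
    )

def parse_correctness_score(reply: str) -> float:
    # Judges sometimes wrap the JSON in prose; take the outermost {...} span.
    start, end = reply.index("{"), reply.rindex("}") + 1
    score = float(json.loads(reply[start:end])["correctness_score"])
    return max(0.0, min(1.0, score))  # clamp to the rubric's [0.0, 1.0] range
```

Clamping guards against a judge returning an out-of-rubric value, and the brace-span extraction tolerates replies that are not pure JSON.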