KeenWoo committed (verified)
Commit 456a72a · Parent: ee1745e

Update evaluate.py

Files changed (1)
  1. evaluate.py +2 -5
evaluate.py CHANGED
@@ -69,10 +69,8 @@ QUERY_TYPE: {query_type}
 --- Specific Judging Criteria by QUERY_TYPE ---
 - If QUERY_TYPE is 'caregiving_scenario' AND the user is the patient:
 - Apply the rubric with a focus on **emotional support and validation**. The answer does NOT need to be factually exhaustive to get a high score. A 1.0 score means it provided excellent emotional comfort that aligns with the ground truth's intent.
-# - It should ONLY be scored 0.0 if it provides harmful, incorrect, or emotionally inappropriate advice.
 - If QUERY_TYPE is 'factual_question':
-- Apply the rubric with a focus on **factual accuracy**. The answer must be factually aligned with the ground truth to get a high score.
-# - Any empathetic or conversational language in the generated answer should be **completely ignored**; only the factual statements are to be graded against the ground truth.
+- Apply the rubric with a focus on **strict factual accuracy**. The answer must be factually aligned with the ground truth to get a high score.
 - For all other QUERY_TYPEs:
 - Default to applying the rubric with a focus on factual accuracy.
@@ -112,7 +110,6 @@ Return a single JSON object with your score based on the rubric and examples:
 }}
 """
 
-
 test_fixtures = []
 
 def load_test_fixtures():
@@ -349,7 +346,7 @@ def run_comprehensive_evaluation(
     query_type=expected_route # <-- Add this line
 )
 # judge_msg = ANSWER_CORRECTNESS_JUDGE_PROMPT.format(ground_truth_answer=ground_truth_answer, generated_answer=answer_text)
-print(f" - Judge Prompt Sent:\n{judge_msg}")
+# print(f" - Judge Prompt Sent:\n{judge_msg}")
 raw_correctness = call_llm([{"role": "user", "content": judge_msg}], temperature=0.0)
 print(f" - Judge Raw Response: {raw_correctness}")
 correctness_data = _parse_judge_json(raw_correctness)