Update evaluate.py
evaluate.py  +2 -5
@@ -69,10 +69,8 @@ QUERY_TYPE: {query_type}
 --- Specific Judging Criteria by QUERY_TYPE ---
 - If QUERY_TYPE is 'caregiving_scenario' AND the user is the patient:
   - Apply the rubric with a focus on **emotional support and validation**. The answer does NOT need to be factually exhaustive to get a high score. A 1.0 score means it provided excellent emotional comfort that aligns with the ground truth's intent.
-#   - It should ONLY be scored 0.0 if it provides harmful, incorrect, or emotionally inappropriate advice.
 - If QUERY_TYPE is 'factual_question':
-  - Apply the rubric with a focus on **factual accuracy**. The answer must be factually aligned with the ground truth to get a high score.
-#   - Any empathetic or conversational language in the generated answer should be **completely ignored**; only the factual statements are to be graded against the ground truth.
+  - Apply the rubric with a focus on **strict factual accuracy**. The answer must be factually aligned with the ground truth to get a high score.
 - For all other QUERY_TYPEs:
   - Default to applying the rubric with a focus on factual accuracy.
 
@@ -112,7 +110,6 @@ Return a single JSON object with your score based on the rubric and examples:
 }}
 """
 
-
 test_fixtures = []
 
 def load_test_fixtures():
@@ -349,7 +346,7 @@ def run_comprehensive_evaluation(
         query_type=expected_route  # <-- Add this line
     )
     # judge_msg = ANSWER_CORRECTNESS_JUDGE_PROMPT.format(ground_truth_answer=ground_truth_answer, generated_answer=answer_text)
-    print(f" - Judge Prompt Sent:\n{judge_msg}")
+    # print(f" - Judge Prompt Sent:\n{judge_msg}")
    raw_correctness = call_llm([{"role": "user", "content": judge_msg}], temperature=0.0)
    print(f" - Judge Raw Response: {raw_correctness}")
    correctness_data = _parse_judge_json(raw_correctness)