Update common.py
common.py CHANGED
@@ -163,9 +163,21 @@ We’d love to hear your feedback! For general feature requests or to submit / s
 \nPlease file any issues on our [Github](https://github.com/atla-ai/judge-arena)."""


+# Default values for compatible mode
+DEFAULT_EVAL_CRITERIA = """Evaluate the helpfulness of the chatbot response given the user's instructions. Focus on relevance, accuracy, and completeness while being objective. Do not consider response length in your evaluation."""
+
+DEFAULT_SCORE_1 = "The response is unhelpful, providing irrelevant or incorrect content that does not address the request."
+
+DEFAULT_SCORE_2 = "The response is partially helpful, missing key elements or including minor inaccuracies, and lacks depth in addressing the request."
+
+DEFAULT_SCORE_3 = "The response is adequately helpful, correctly addressing the main request with relevant information and some depth."
+
+DEFAULT_SCORE_4 = "The response is very helpful, addressing the request thoroughly with accurate and detailed content, but may lack a minor aspect of helpfulness."
+
+DEFAULT_SCORE_5 = "The response is exceptionally helpful, providing precise, comprehensive content that fully resolves the request with insight and clarity."

 #**What are the Evaluator Prompt Templates based on?**

 #As a quick start, we've set up templates that cover the most popular evaluation metrics out there on LLM evaluation / monitoring tools, often known as 'base metrics'. The data samples used in these were randomly picked from popular datasets from academia - [ARC](https://huggingface.co/datasets/allenai/ai2_arc), [Preference Collection](https://huggingface.co/datasets/prometheus-eval/Preference-Collection), [RewardBench](https://huggingface.co/datasets/allenai/reward-bench), [RAGTruth](https://arxiv.org/abs/2401.00396).

-#These templates are designed as a starting point to showcase how to interact with the Judge Arena, especially for those less familiar with using LLM judges.
+#These templates are designed as a starting point to showcase how to interact with the Judge Arena, especially for those less familiar with using LLM judges.
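
The constants added in this commit define the default rubric for compatible mode: one evaluation criterion plus five score descriptions. Below is a minimal sketch of how these defaults could be stitched into a single rubric block for a judge prompt; the build_default_rubric helper, the "Score rubric:" layout, and the "from common import ..." usage are illustrative assumptions and are not part of this commit.

# Hypothetical usage sketch (not part of this commit): combine the new default
# criteria and score descriptions into one rubric string. The helper name and
# the import path are assumptions made for illustration.
from common import (
    DEFAULT_EVAL_CRITERIA,
    DEFAULT_SCORE_1,
    DEFAULT_SCORE_2,
    DEFAULT_SCORE_3,
    DEFAULT_SCORE_4,
    DEFAULT_SCORE_5,
)


def build_default_rubric() -> str:
    """Return the default evaluation criteria followed by the 1-5 score descriptions."""
    score_descriptions = [
        DEFAULT_SCORE_1,
        DEFAULT_SCORE_2,
        DEFAULT_SCORE_3,
        DEFAULT_SCORE_4,
        DEFAULT_SCORE_5,
    ]
    lines = [DEFAULT_EVAL_CRITERIA, "", "Score rubric:"]
    lines += [f"Score {i}: {desc}" for i, desc in enumerate(score_descriptions, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_default_rubric())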