Romain Fayoux committed · Commit f9cf36d · Parent(s): 3ac0a19

Added ground truth evaluation and Phoenix logging

Files changed:
- GAIA_COMPARISON.md +142 -0
- app.py +39 -1
- comparison.py +160 -0
- data/metadata.jsonl +0 -0
- debug_phoenix.py +285 -0
- phoenix_evaluator.py +214 -0
- requirements.txt +1 -0
- test_comparison.py +144 -0
- test_phoenix_logging.py +261 -0
- test_phoenix_simple.py +132 -0
GAIA_COMPARISON.md
ADDED
@@ -0,0 +1,142 @@
# GAIA Ground Truth Comparison

This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.

## Features

- **Ground Truth Comparison**: Automatically compares agent answers to the correct answers from `data/metadata.jsonl`
- **Multiple Evaluation Metrics**: Exact match, similarity score, and contains-answer detection
- **Phoenix Integration**: Logs evaluations to Phoenix for persistent tracking and analysis
- **Enhanced Results Display**: Shows ground truth and comparison results in the Gradio interface

## How It Works

### 1. Ground Truth Loading
- Loads correct answers from `data/metadata.jsonl` (see the loading sketch below)
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset
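
A minimal sketch of the loading step, mirroring what `_load_ground_truth()` in `comparison.py` does (shown here as a standalone function; in the real code it is a method of `AnswerComparator`):

```python
import json

def load_ground_truth(metadata_path: str = "data/metadata.jsonl") -> dict:
    """Map each GAIA task_id to its "Final answer" string."""
    ground_truth = {}
    with open(metadata_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # one JSON object per line
            task_id = record.get("task_id")
            final_answer = record.get("Final answer")
            if task_id and final_answer is not None:
                ground_truth[task_id] = str(final_answer)
    return ground_truth
```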

### 2. Answer Comparison
For each agent answer, the system calculates three metrics (see the sketch after this list):
- **Exact Match**: Boolean indicating whether the answers match exactly (after normalization)
- **Similarity Score**: A 0-1 score computed with `difflib.SequenceMatcher`
- **Contains Answer**: Boolean indicating whether the correct answer is contained in the agent's response
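
A minimal sketch of the three metrics, mirroring the `exact_match`, `similarity_score`, and `contains_answer` methods of `AnswerComparator` (written here as free functions that take the normalizer from the next section as an argument):

```python
from difflib import SequenceMatcher

def exact_match(predicted: str, actual: str, normalize) -> bool:
    return normalize(predicted) == normalize(actual)

def similarity_score(predicted: str, actual: str, normalize) -> float:
    pred, act = normalize(predicted), normalize(actual)
    if not pred and not act:
        return 1.0   # both empty: treat as identical
    if not pred or not act:
        return 0.0   # only one side empty: no similarity
    return SequenceMatcher(None, pred, act).ratio()

def contains_answer(predicted: str, actual: str, normalize) -> bool:
    return normalize(actual) in normalize(predicted)
```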

### 3. Answer Normalization
Before comparison, answers are normalized by (see the sketch after this list):
- Converting to lowercase
- Removing punctuation (.,;:!?"')
- Normalizing whitespace
- Trimming leading/trailing spaces
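
These rules correspond to `normalize_answer()` in `comparison.py`; a minimal sketch:

```python
import re

def normalize_answer(answer) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    if answer is None:
        return ""
    answer = str(answer).strip().lower()
    answer = re.sub(r'[.,;:!?"\']', '', answer)  # drop common punctuation
    answer = re.sub(r'\s+', ' ', answer)         # collapse runs of whitespace
    return answer

# normalize_answer('  The answer is: 3! ')  ->  'the answer is 3'
```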

### 4. Phoenix Integration
- Evaluations are automatically logged to Phoenix (see the sketch below)
- Each evaluation includes a score, label, explanation, and detailed metrics
- Viewable in the Phoenix UI for historical tracking and analysis
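
Logging goes through `phoenix.trace.SpanEvaluations`: each evaluation row is attached to an existing span and published under the name `gaia_ground_truth`. The sketch below condenses `log_evaluations_to_phoenix()` from `phoenix_evaluator.py`; the exact client call depends on your Phoenix version, which is why the real function falls back from `px.log_evaluations` to `Client.log_evaluations`.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

def publish_evaluations(eval_records: list) -> None:
    # Each record needs a span_id plus score / label / explanation columns.
    eval_df = pd.DataFrame(eval_records)
    span_evals = SpanEvaluations(eval_name="gaia_ground_truth", dataframe=eval_df)
    try:
        px.log_evaluations(span_evals)           # newer Phoenix API
    except AttributeError:
        px.Client().log_evaluations(span_evals)  # fallback for older versions
```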

## Usage

### In Your Agent App
The comparison happens automatically when you run your agent (a condensed sketch of the wiring follows the steps below):

1. **Run your agent** - Process questions as usual
2. **Automatic comparison** - The system compares answers to ground truth
3. **Enhanced results** - The results table includes comparison columns
4. **Phoenix logging** - Evaluations are logged for persistent tracking
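
Concretely, the block added to `app.py` in this commit boils down to the following calls (error handling omitted; `answers_payload` and `results_log` come from the agent loop):

```python
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix

comparator = AnswerComparator()                               # loads data/metadata.jsonl
evaluations_df = comparator.evaluate_batch(answers_payload)   # one row per answered task
summary_stats = comparator.get_summary_stats(evaluations_df)  # exact-match rate, avg similarity, ...
results_log = comparator.enhance_results_log(results_log)     # adds the comparison columns
log_evaluations_to_phoenix(evaluations_df)                    # best effort; comparison works even if this fails
```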

### Results Display
Your results table now includes these additional columns:
- **Ground Truth**: The correct answer from the GAIA dataset
- **Exact Match**: True/False for exact matches
- **Similarity**: Similarity score (0-1)
- **Contains Answer**: True/False if the correct answer is contained

### Status Message
The status message now includes:
```
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
```

## Testing

Run the test suite to verify functionality:

```bash
python test_comparison.py
```

This will test:
- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading

## Files Added

- `comparison.py`: Main comparison logic and the `AnswerComparator` class
- `phoenix_evaluator.py`: Phoenix integration for logging evaluations
- `test_comparison.py`: Test suite for verification
- `GAIA_COMPARISON.md`: This documentation

## Dependencies Added

- `arize-phoenix`: For observability and evaluation logging
- `pandas`: For data manipulation (if not already present)

## Example Evaluation Result

```python
{
    "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    "predicted_answer": "3",
    "actual_answer": "3",
    "exact_match": True,
    "similarity_score": 1.0,
    "contains_answer": True,
    "error": None
}
```

## Phoenix UI

In the Phoenix interface, you can:
- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis

## Troubleshooting

### No Ground Truth Available
If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.

### Phoenix Connection Issues
If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.

### Low Similarity Scores
Low similarity scores might indicate that:
- The agent is providing verbose answers when short ones are expected
- The answer format doesn't match the expected format
- The agent is partially correct but not exact

## Customization

You can adjust the comparison logic in `comparison.py` (a hypothetical example follows this list):
- Modify `normalize_answer()` for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify the Phoenix logging format
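
For example, a hypothetical subclass that additionally ignores English articles before comparing (the class name and the extra rule are illustrative, not part of this commit):

```python
import re
from comparison import AnswerComparator

class LenientComparator(AnswerComparator):
    """Illustrative variant: also drops 'a', 'an', and 'the' when comparing."""
    def normalize_answer(self, answer: str) -> str:
        normalized = super().normalize_answer(answer)
        normalized = re.sub(r'\b(?:a|an|the)\b', '', normalized)  # drop articles
        return re.sub(r'\s+', ' ', normalized).strip()            # re-collapse whitespace
```

Because the other metrics call `normalize_answer()` internally, overriding it changes exact match, similarity, and containment all at once.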

## Performance

The comparison adds minimal overhead:
- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10 ms
- Phoenix logging: ~10-50 ms per evaluation

Total additional time: usually < 5 seconds for 50 questions.
app.py
CHANGED
@@ -7,6 +7,9 @@ from phoenix.otel import register
 from openinference.instrumentation.smolagents import SmolagentsInstrumentor
 from llm_only_agent import LLMOnlyAgent
 from multi_agent import MultiAgent
+from comparison import AnswerComparator
+from phoenix_evaluator import log_evaluations_to_phoenix
+import phoenix as px


 # (Keep Constants as is)
@@ -88,7 +91,7 @@ def run_and_submit_all( profile: gr.OAuthProfile | None, limit: int | None):
     results_log = []
     answers_payload = []
     # Limit for test purposes
-    limit =
+    limit = 2
     if limit is not None:
         questions_data = questions_data[:limit]
     print(f"Running agent on {len(questions_data)} questions...")
@@ -115,9 +118,44 @@ def run_and_submit_all( profile: gr.OAuthProfile | None, limit: int | None):
         print("Agent did not produce any answers to submit.")
         return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)

+    # 3.5. Compare with Ground Truth and Log to Phoenix
+    print("Comparing answers with ground truth...")
+    try:
+        # Initialize comparator
+        comparator = AnswerComparator()
+
+        # Evaluate answers
+        evaluations_df = comparator.evaluate_batch(answers_payload)
+
+        # Get summary statistics
+        summary_stats = comparator.get_summary_stats(evaluations_df)
+
+        # Enhance results log with comparison data
+        results_log = comparator.enhance_results_log(results_log)
+
+        # Log evaluations to Phoenix
+        log_evaluations_to_phoenix(evaluations_df)
+
+        print(f"Ground truth comparison completed: {summary_stats['exact_matches']}/{summary_stats['total_questions']} exact matches")
+
+    except Exception as e:
+        print(f"Error during ground truth comparison: {e}")
+        summary_stats = {"error": str(e)}
+
     # 4. Prepare Submission
     submission_data = {"username": username.strip(), "agent_code": agent_code, "answers": answers_payload}
     status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
+
+    # Add ground truth comparison to status
+    if "error" not in summary_stats:
+        status_update += f"\n\nGround Truth Comparison:\n"
+        status_update += f"Exact matches: {summary_stats['exact_matches']}/{summary_stats['total_questions']} ({summary_stats['exact_match_rate']:.1%})\n"
+        status_update += f"Average similarity: {summary_stats['average_similarity']:.3f}\n"
+        status_update += f"Contains correct answer: {summary_stats['contains_matches']}/{summary_stats['total_questions']} ({summary_stats['contains_match_rate']:.1%})\n"
+        status_update += f"Evaluations logged to Phoenix ✅"
+    else:
+        status_update += f"\n\nGround Truth Comparison Error: {summary_stats['error']}"
+
     print(status_update)

     # 5. Submit
comparison.py
ADDED
@@ -0,0 +1,160 @@
import json
import pandas as pd
from typing import Dict, List, Any
from difflib import SequenceMatcher
import re


class AnswerComparator:
    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        """Initialize the comparator with ground truth data."""
        self.ground_truth = self._load_ground_truth(metadata_path)
        print(f"Loaded ground truth for {len(self.ground_truth)} questions")

    def _load_ground_truth(self, metadata_path: str) -> Dict[str, str]:
        """Load ground truth answers from metadata.jsonl file."""
        ground_truth = {}
        try:
            with open(metadata_path, 'r', encoding='utf-8') as f:
                for line in f:
                    if line.strip():
                        data = json.loads(line)
                        task_id = data.get("task_id")
                        final_answer = data.get("Final answer")
                        if task_id and final_answer is not None:
                            ground_truth[task_id] = str(final_answer)
        except FileNotFoundError:
            print(f"Warning: Ground truth file {metadata_path} not found")
        except Exception as e:
            print(f"Error loading ground truth: {e}")

        return ground_truth

    def normalize_answer(self, answer: str) -> str:
        """Normalize answer for comparison."""
        if answer is None:
            return ""

        # Convert to string and strip whitespace
        answer = str(answer).strip()

        # Convert to lowercase for case-insensitive comparison
        answer = answer.lower()

        # Remove common punctuation that might not affect correctness
        answer = re.sub(r'[.,;:!?"\']', '', answer)

        # Normalize whitespace
        answer = re.sub(r'\s+', ' ', answer)

        return answer

    def exact_match(self, predicted: str, actual: str) -> bool:
        """Check if answers match exactly after normalization."""
        return self.normalize_answer(predicted) == self.normalize_answer(actual)

    def similarity_score(self, predicted: str, actual: str) -> float:
        """Calculate similarity score between predicted and actual answers."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        if not normalized_pred and not normalized_actual:
            return 1.0
        if not normalized_pred or not normalized_actual:
            return 0.0

        return SequenceMatcher(None, normalized_pred, normalized_actual).ratio()

    def contains_answer(self, predicted: str, actual: str) -> bool:
        """Check if the actual answer is contained in the predicted answer."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        return normalized_actual in normalized_pred

    def evaluate_answer(self, task_id: str, predicted_answer: str) -> Dict[str, Any]:
        """Evaluate a single answer against ground truth."""
        actual_answer = self.ground_truth.get(task_id)

        if actual_answer is None:
            return {
                "task_id": task_id,
                "predicted_answer": predicted_answer,
                "actual_answer": None,
                "exact_match": False,
                "similarity_score": 0.0,
                "contains_answer": False,
                "error": "No ground truth available"
            }

        return {
            "task_id": task_id,
            "predicted_answer": predicted_answer,
            "actual_answer": actual_answer,
            "exact_match": self.exact_match(predicted_answer, actual_answer),
            "similarity_score": self.similarity_score(predicted_answer, actual_answer),
            "contains_answer": self.contains_answer(predicted_answer, actual_answer),
            "error": None
        }

    def evaluate_batch(self, results: List[Dict[str, Any]]) -> pd.DataFrame:
        """Evaluate a batch of results."""
        evaluations = []

        for result in results:
            task_id = result.get("task_id") or result.get("Task ID")
            predicted_answer = result.get("submitted_answer") or result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)
                evaluations.append(evaluation)

        return pd.DataFrame(evaluations)

    def get_summary_stats(self, evaluations_df: pd.DataFrame) -> Dict[str, Any]:
        """Get summary statistics from evaluations."""
        if evaluations_df.empty:
            return {"error": "No evaluations available"}

        # Filter out entries without ground truth
        valid_evaluations = evaluations_df[evaluations_df['error'].isna()]

        if valid_evaluations.empty:
            return {"error": "No valid ground truth available"}

        total_questions = len(valid_evaluations)
        exact_matches = valid_evaluations['exact_match'].sum()
        avg_similarity = valid_evaluations['similarity_score'].mean()
        contains_matches = valid_evaluations['contains_answer'].sum()

        return {
            "total_questions": total_questions,
            "exact_matches": exact_matches,
            "exact_match_rate": exact_matches / total_questions,
            "average_similarity": avg_similarity,
            "contains_matches": contains_matches,
            "contains_match_rate": contains_matches / total_questions,
            "questions_with_ground_truth": total_questions
        }

    def enhance_results_log(self, results_log: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Add comparison columns to results log."""
        enhanced_results = []

        for result in results_log:
            task_id = result.get("Task ID")
            predicted_answer = result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)

                # Add comparison info to result
                enhanced_result = result.copy()
                enhanced_result["Ground Truth"] = evaluation["actual_answer"] or "N/A"
                enhanced_result["Exact Match"] = evaluation["exact_match"]
                enhanced_result["Similarity"] = f"{evaluation['similarity_score']:.3f}"
                enhanced_result["Contains Answer"] = evaluation["contains_answer"]

                enhanced_results.append(enhanced_result)

        return enhanced_results
data/metadata.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
debug_phoenix.py
ADDED
@@ -0,0 +1,285 @@
#!/usr/bin/env python3
"""
Enhanced debug script to check Phoenix status and evaluations.
"""

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import time
from datetime import datetime


def check_phoenix_connection():
    """Check if Phoenix is running and accessible."""
    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Try to get basic info
        try:
            spans_df = client.get_spans_dataframe()
            print(f"✅ Phoenix API working - can retrieve spans")
            return client
        except Exception as e:
            print(f"⚠️ Phoenix connected but API might have issues: {e}")
            return client

    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running. You should see a message like:")
        print("To view the Phoenix app in your browser, visit http://localhost:6006")
        return None


def check_spans(client):
    """Check spans in Phoenix."""
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} spans in Phoenix")

        if len(spans_df) > 0:
            print("Recent spans:")
            for i, (_, span) in enumerate(spans_df.head(5).iterrows()):
                span_id = span.get('context.span_id', 'no-id')
                span_name = span.get('name', 'unnamed')
                start_time = span.get('start_time', 'unknown')
                print(f" {i+1}. {span_name} ({span_id[:8]}...) - {start_time}")

            # Show input/output samples
            print("\nSpan content samples:")
            for i, (_, span) in enumerate(spans_df.head(3).iterrows()):
                input_val = str(span.get('input.value', ''))[:100]
                output_val = str(span.get('output.value', ''))[:100]
                print(f" Span {i+1}:")
                print(f"   Input: {input_val}...")
                print(f"   Output: {output_val}...")

        else:
            print("⚠️ No spans found. Run your agent first to generate traces.")

        return spans_df

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return pd.DataFrame()


def check_evaluations(client):
    """Check evaluations in Phoenix."""
    try:
        # Try different methods to get evaluations
        print("Checking evaluations...")

        # Method 1: Direct evaluation dataframe
        try:
            evals_df = client.get_evaluations_dataframe()
            print(f"Found {len(evals_df)} evaluations in Phoenix")

            if len(evals_df) > 0:
                print("Evaluation breakdown:")
                eval_names = evals_df['name'].value_counts()
                for name, count in eval_names.items():
                    print(f" - {name}: {count} evaluations")

                # Check for GAIA evaluations specifically
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
                if len(gaia_evals) > 0:
                    print(f"✅ Found {len(gaia_evals)} GAIA ground truth evaluations")

                    # Show sample evaluation
                    sample = gaia_evals.iloc[0]
                    print("Sample GAIA evaluation:")
                    print(f" - Score: {sample.get('score', 'N/A')}")
                    print(f" - Label: {sample.get('label', 'N/A')}")
                    print(f" - Explanation: {sample.get('explanation', 'N/A')[:100]}...")

                    # Show metadata if available
                    metadata = sample.get('metadata', {})
                    if metadata:
                        print(f" - Metadata keys: {list(metadata.keys())}")

                else:
                    print("❌ No GAIA ground truth evaluations found")
                    print("Available evaluation types:", list(eval_names.keys()))

            else:
                print("⚠️ No evaluations found in Phoenix")

            return evals_df

        except AttributeError as e:
            print(f"⚠️ get_evaluations_dataframe not available: {e}")
            print("This might be a Phoenix version issue")
            return pd.DataFrame()

    except Exception as e:
        print(f"❌ Error getting evaluations: {e}")
        return pd.DataFrame()


def test_evaluation_creation_and_logging():
    """Test creating and logging evaluations."""
    print("\nTesting evaluation creation and logging...")

    # Create sample evaluations
    sample_data = [
        {
            "task_id": "debug-test-1",
            "predicted_answer": "test answer 1",
            "actual_answer": "correct answer 1",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "debug-test-2",
            "predicted_answer": "exact match",
            "actual_answer": "exact match",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(sample_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    # Try to log to Phoenix
    try:
        print("Attempting to log evaluations to Phoenix...")
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Test evaluation logging successful")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Test evaluation logging failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Test evaluation logging error: {e}")
        import traceback
        traceback.print_exc()
        return False


def check_gaia_data():
    """Check GAIA ground truth data availability."""
    print("\nChecking GAIA ground truth data...")

    try:
        comparator = AnswerComparator()

        print(f"✅ Loaded {len(comparator.ground_truth)} GAIA ground truth answers")

        if len(comparator.ground_truth) > 0:
            # Show sample
            sample_task_id = list(comparator.ground_truth.keys())[0]
            sample_answer = comparator.ground_truth[sample_task_id]
            print(f"Sample: {sample_task_id} -> '{sample_answer}'")

            # Test evaluation
            test_eval = comparator.evaluate_answer(sample_task_id, "test answer")
            print(f"Test evaluation result: {test_eval}")

            return True
        else:
            print("❌ No GAIA ground truth data found")
            return False

    except Exception as e:
        print(f"❌ Error checking GAIA data: {e}")
        return False


def show_phoenix_ui_info():
    """Show information about Phoenix UI."""
    print("\nPhoenix UI Information:")
    print("-" * 30)
    print("Phoenix UI should be available at: http://localhost:6006")
    print("")
    print("In the Phoenix UI, look for:")
    print(" • 'Evaluations' tab or section")
    print(" • 'Evals' section")
    print(" • 'Annotations' tab")
    print(" • In 'Spans' view, look for evaluation badges on spans")
    print("")
    print("If you see evaluations, they should be named 'gaia_ground_truth'")
    print("Each evaluation should show:")
    print(" - Score (similarity score 0-1)")
    print(" - Label (correct/incorrect)")
    print(" - Explanation (predicted vs ground truth)")
    print(" - Metadata (task_id, exact_match, etc.)")


def main():
    """Main debug function."""
    print("Enhanced Phoenix Debug Script")
    print("=" * 50)

    # Check Phoenix connection
    client = check_phoenix_connection()
    if not client:
        print("\n❌ Cannot proceed without Phoenix connection")
        print("Make sure your agent app is running (it starts Phoenix)")
        return

    print("\nChecking Phoenix Data:")
    print("-" * 30)

    # Check spans
    spans_df = check_spans(client)

    # Check evaluations
    evals_df = check_evaluations(client)

    # Test evaluation creation
    test_success = test_evaluation_creation_and_logging()

    # Wait a moment and recheck evaluations
    if test_success:
        print("\nWaiting for evaluations to be processed...")
        time.sleep(3)

        print("Rechecking evaluations after test logging...")
        evals_df_after = check_evaluations(client)

        if len(evals_df_after) > len(evals_df):
            print("✅ New evaluations detected after test!")
        else:
            print("⚠️ No new evaluations detected")

    # Check GAIA data
    gaia_available = check_gaia_data()

    # Show Phoenix UI info
    show_phoenix_ui_info()

    # Final summary
    print("\n" + "=" * 50)
    print("Summary:")
    print(f" • Phoenix connected: {'✅' if client else '❌'}")
    print(f" • Spans available: {len(spans_df)} spans")
    print(f" • Evaluations found: {len(evals_df)} evaluations")
    print(f" • GAIA data available: {'✅' if gaia_available else '❌'}")
    print(f" • Test logging worked: {'✅' if test_success else '❌'}")

    print("\nNext Steps:")
    if len(spans_df) == 0:
        print(" • Run your agent to generate traces first")
    if len(evals_df) == 0:
        print(" • Check if evaluations are being logged correctly")
        print(" • Verify Phoenix version compatibility")
    if not gaia_available:
        print(" • Check that data/metadata.jsonl exists and is readable")

    print(f"\nPhoenix UI: http://localhost:6006")


if __name__ == "__main__":
    main()
phoenix_evaluator.py
ADDED
@@ -0,0 +1,214 @@
import pandas as pd
from typing import Dict, Any, List, Optional
from comparison import AnswerComparator
import phoenix as px
from phoenix.trace import SpanEvaluations


class GAIAPhoenixEvaluator:
    """Phoenix evaluator for GAIA dataset ground truth comparison."""

    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        self.comparator = AnswerComparator(metadata_path)
        self.eval_name = "gaia_ground_truth"

    def evaluate_spans(self, spans_df: pd.DataFrame) -> List[SpanEvaluations]:
        """Evaluate spans and return Phoenix SpanEvaluations."""
        evaluations = []

        for _, span in spans_df.iterrows():
            # Extract task_id and answer from span
            task_id = self._extract_task_id(span)
            predicted_answer = self._extract_predicted_answer(span)
            span_id = span.get("context.span_id")

            if task_id and predicted_answer is not None and span_id:
                evaluation = self.comparator.evaluate_answer(task_id, predicted_answer)

                # Create evaluation record for Phoenix
                eval_record = {
                    "span_id": span_id,
                    "score": 1.0 if evaluation["exact_match"] else evaluation["similarity_score"],
                    "label": "correct" if evaluation["exact_match"] else "incorrect",
                    "explanation": self._create_explanation(evaluation),
                    "task_id": task_id,
                    "predicted_answer": evaluation["predicted_answer"],
                    "ground_truth": evaluation["actual_answer"],
                    "exact_match": evaluation["exact_match"],
                    "similarity_score": evaluation["similarity_score"],
                    "contains_answer": evaluation["contains_answer"]
                }

                evaluations.append(eval_record)

        if evaluations:
            # Create SpanEvaluations object
            eval_df = pd.DataFrame(evaluations)
            return [SpanEvaluations(eval_name=self.eval_name, dataframe=eval_df)]

        return []

    def _extract_task_id(self, span) -> Optional[str]:
        """Extract task_id from span data."""
        # Try span attributes first
        attributes = span.get("attributes", {})
        if isinstance(attributes, dict):
            if "task_id" in attributes:
                return attributes["task_id"]

        # Try input data
        input_data = span.get("input", {})
        if isinstance(input_data, dict):
            if "task_id" in input_data:
                return input_data["task_id"]

        # Try to extract from input value if it's a string
        input_value = span.get("input.value", "")
        if isinstance(input_value, str):
            # Look for UUID pattern in input
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, input_value)
            if match:
                return match.group(0)

        # Try span name
        span_name = span.get("name", "")
        if isinstance(span_name, str):
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, span_name)
            if match:
                return match.group(0)

        return None

    def _extract_predicted_answer(self, span) -> Optional[str]:
        """Extract predicted answer from span output."""
        # Try different output fields
        output_fields = ["output.value", "output", "response", "result"]

        for field in output_fields:
            value = span.get(field)
            if value is not None:
                return str(value)

        return None

    def _create_explanation(self, evaluation: Dict[str, Any]) -> str:
        """Create human-readable explanation of the evaluation."""
        predicted = evaluation["predicted_answer"]
        actual = evaluation["actual_answer"]
        exact_match = evaluation["exact_match"]
        similarity = evaluation["similarity_score"]
        contains = evaluation["contains_answer"]

        if actual is None:
            return "❌ No ground truth available for comparison"

        explanation = f"Predicted: '{predicted}' | Ground Truth: '{actual}' | "

        if exact_match:
            explanation += "✅ Exact match"
        elif contains:
            explanation += f"⚠️ Contains correct answer (similarity: {similarity:.3f})"
        else:
            explanation += f"❌ Incorrect (similarity: {similarity:.3f})"

        return explanation


def add_gaia_evaluations_to_phoenix(spans_df: pd.DataFrame, metadata_path: str = "data/metadata.jsonl") -> List[SpanEvaluations]:
    """Add GAIA evaluation results to Phoenix spans."""
    evaluator = GAIAPhoenixEvaluator(metadata_path)
    return evaluator.evaluate_spans(spans_df)


def log_evaluations_to_phoenix(evaluations_df: pd.DataFrame, session_id: Optional[str] = None) -> Optional[pd.DataFrame]:
    """Log evaluation results directly to Phoenix."""
    try:
        client = px.Client()

        # Get current spans to match evaluations to span_ids
        spans_df = client.get_spans_dataframe()

        if spans_df is None or spans_df.empty:
            print("No spans found to attach evaluations to")
            return None

        # Create evaluation records for Phoenix
        evaluation_records = []
        spans_with_evals = []

        for _, eval_row in evaluations_df.iterrows():
            task_id = eval_row["task_id"]

            # Try to find matching span by searching for task_id in span input
            matching_spans = spans_df[
                spans_df['input.value'].astype(str).str.contains(task_id, na=False, case=False)
            ]

            if len(matching_spans) == 0:
                # Try alternative search in span attributes or name
                matching_spans = spans_df[
                    spans_df['name'].astype(str).str.contains(task_id, na=False, case=False)
                ]

            if len(matching_spans) > 0:
                span_id = matching_spans.iloc[0]['context.span_id']

                # Create evaluation record in Phoenix format
                evaluation_record = {
                    "span_id": span_id,
                    "name": "gaia_ground_truth",
                    "score": eval_row["similarity_score"],
                    "label": "correct" if bool(eval_row["exact_match"]) else "incorrect",
                    "explanation": f"Predicted: '{eval_row['predicted_answer']}' | Ground Truth: '{eval_row['actual_answer']}' | Similarity: {eval_row['similarity_score']:.3f} | Exact Match: {eval_row['exact_match']}",
                    "annotator_kind": "HUMAN",
                    "metadata": {
                        "task_id": task_id,
                        "exact_match": eval_row["exact_match"],
                        "similarity_score": eval_row["similarity_score"],
                        "contains_answer": eval_row["contains_answer"],
                        "predicted_answer": eval_row["predicted_answer"],
                        "ground_truth": eval_row["actual_answer"]
                    }
                }

                evaluation_records.append(evaluation_record)
                spans_with_evals.append(span_id)

        if evaluation_records:
            # Convert to DataFrame for Phoenix
            eval_df = pd.DataFrame(evaluation_records)

            # Create SpanEvaluations object
            span_evaluations = SpanEvaluations(
                eval_name="gaia_ground_truth",
                dataframe=eval_df
            )

            # Log evaluations to Phoenix
            try:
                # Try the newer Phoenix API
                px.log_evaluations(span_evaluations)
                print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix")
            except AttributeError:
                # Fallback for older Phoenix versions
                client.log_evaluations(span_evaluations)
                print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix (fallback)")

            return eval_df
        else:
            print("⚠️ No matching spans found for evaluations")
            if spans_df is not None:
                print(f"Available spans: {len(spans_df)}")
                if len(spans_df) > 0:
                    print("Sample span names:", spans_df['name'].head(3).tolist())
            return None

    except Exception as e:
        print(f"❌ Could not log evaluations to Phoenix: {e}")
        import traceback
        traceback.print_exc()
        return None
requirements.txt
CHANGED
@@ -8,3 +8,4 @@ markdownify
 requests
 smolagents[telemetry,toolkit]
 chess
+pandas
test_comparison.py
ADDED
@@ -0,0 +1,144 @@
#!/usr/bin/env python3
"""
Test script for GAIA comparison functionality.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import pandas as pd


def test_basic_comparison():
    """Test basic comparison functionality."""
    print("Testing basic comparison...")

    # Initialize comparator
    comparator = AnswerComparator()

    # Test with some sample data
    sample_results = [
        {"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"},
        {"task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6", "submitted_answer": "3"},
        {"task_id": "nonexistent-task", "submitted_answer": "test"}
    ]

    # Evaluate batch
    evaluations_df = comparator.evaluate_batch(sample_results)
    print(f"Evaluated {len(evaluations_df)} answers")

    # Get summary stats
    summary_stats = comparator.get_summary_stats(evaluations_df)
    print("Summary statistics:")
    for key, value in summary_stats.items():
        print(f"  {key}: {value}")

    # Test single evaluation
    print("\nTesting single evaluation...")
    single_eval = comparator.evaluate_answer("8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "3")
    print(f"Single evaluation result: {single_eval}")

    return evaluations_df


def test_results_enhancement():
    """Test results log enhancement."""
    print("\nTesting results log enhancement...")

    comparator = AnswerComparator()

    # Sample results log (like what comes from your agent)
    sample_results_log = [
        {
            "Task ID": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "Question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009?",
            "Submitted Answer": "3"
        },
        {
            "Task ID": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "Question": "Test question",
            "Submitted Answer": "wrong answer"
        }
    ]

    # Enhance results
    enhanced_results = comparator.enhance_results_log(sample_results_log)

    print("Enhanced results:")
    for result in enhanced_results:
        print(f"  Task: {result['Task ID']}")
        print(f"  Answer: {result['Submitted Answer']}")
        print(f"  Ground Truth: {result['Ground Truth']}")
        print(f"  Exact Match: {result['Exact Match']}")
        print(f"  Similarity: {result['Similarity']}")
        print()


def test_phoenix_integration():
    """Test Phoenix integration (basic)."""
    print("\nTesting Phoenix integration...")

    # Create sample evaluations
    sample_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "wrong",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(sample_evaluations)
        if result is not None:
            print("✅ Phoenix integration successful")
        else:
            print("⚠️ Phoenix integration failed (likely Phoenix not running)")
    except Exception as e:
        print(f"⚠️ Phoenix integration error: {e}")


def main():
    """Run all tests."""
    print("="*50)
    print("GAIA Comparison Test Suite")
    print("="*50)

    try:
        # Test basic comparison
        evaluations_df = test_basic_comparison()

        # Test results enhancement
        test_results_enhancement()

        # Test Phoenix integration
        test_phoenix_integration()

        print("\n" + "="*50)
        print("All tests completed!")
        print("="*50)

    except Exception as e:
        print(f"❌ Test failed with error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()
test_phoenix_logging.py
ADDED
@@ -0,0 +1,261 @@
#!/usr/bin/env python3
"""
Test script to verify Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
from datetime import datetime
import time


def test_phoenix_connection():
    """Test Phoenix connection and basic functionality."""
    print("Testing Phoenix Connection...")

    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Check if Phoenix is actually running
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans in Phoenix")

        return client, spans_df
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running and accessible at http://localhost:6006")
        return None, None


def create_test_evaluations():
    """Create test evaluations for logging."""
    print("\nCreating test evaluations...")

    test_data = [
        {
            "task_id": "test-exact-match",
            "predicted_answer": "Paris",
            "actual_answer": "Paris",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-partial-match",
            "predicted_answer": "The capital of France is Paris",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-no-match",
            "predicted_answer": "London",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(test_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    return evaluations_df


def create_mock_spans(client):
    """Create mock spans for testing (if no real spans exist)."""
    print("\nCreating mock spans for testing...")

    # Note: This is a simplified mock - in real usage, spans are created by agent runs
    mock_spans = [
        {
            "context.span_id": "mock-span-1",
            "name": "test_agent_run",
            "input.value": "Question about test-exact-match",
            "output.value": "Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-2",
            "name": "test_agent_run",
            "input.value": "Question about test-partial-match",
            "output.value": "The capital of France is Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-3",
            "name": "test_agent_run",
            "input.value": "Question about test-no-match",
            "output.value": "London",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        }
    ]

    print(f"Created {len(mock_spans)} mock spans")
    return pd.DataFrame(mock_spans)


def test_evaluation_logging():
    """Test the actual evaluation logging to Phoenix."""
    print("\nTesting evaluation logging...")

    # Create test evaluations
    evaluations_df = create_test_evaluations()

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Evaluation logging test successful!")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Evaluation logging test failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Evaluation logging test failed with error: {e}")
        import traceback
        traceback.print_exc()
        return False


def verify_logged_evaluations(client):
    """Verify that evaluations were actually logged to Phoenix."""
    print("\nVerifying logged evaluations...")

    try:
        # Give Phoenix a moment to process
        time.sleep(2)

        # Try to retrieve evaluations
        evals_df = client.get_evaluations_dataframe()
        print(f"Found {len(evals_df)} total evaluations in Phoenix")

        # Look for our specific evaluations
        if len(evals_df) > 0:
            gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
            print(f"Found {len(gaia_evals)} GAIA ground truth evaluations")

            if len(gaia_evals) > 0:
                print("✅ Successfully verified evaluations in Phoenix!")

                # Show sample evaluation
                sample_eval = gaia_evals.iloc[0]
                print(f"Sample evaluation:")
                print(f"  - Score: {sample_eval.get('score', 'N/A')}")
                print(f"  - Label: {sample_eval.get('label', 'N/A')}")
                print(f"  - Explanation: {sample_eval.get('explanation', 'N/A')}")

                return True
            else:
                print("❌ No GAIA evaluations found after logging")
                return False
        else:
            print("❌ No evaluations found in Phoenix")
            return False

    except Exception as e:
        print(f"❌ Error verifying evaluations: {e}")
        return False


def test_with_real_gaia_data():
    """Test with actual GAIA data if available."""
    print("\nTesting with real GAIA data...")

    try:
        # Initialize comparator
        comparator = AnswerComparator()

        if len(comparator.ground_truth) == 0:
            print("⚠️ No GAIA ground truth data available")
            return False

        # Create a real evaluation with GAIA data
        real_task_id = list(comparator.ground_truth.keys())[0]
        real_ground_truth = comparator.ground_truth[real_task_id]

        real_evaluation = comparator.evaluate_answer(real_task_id, "test answer")

        real_eval_df = pd.DataFrame([real_evaluation])

        # Log to Phoenix
        result = log_evaluations_to_phoenix(real_eval_df)

        if result is not None:
            print("✅ Real GAIA data logging successful!")
            print(f"Task ID: {real_task_id}")
            print(f"Ground Truth: {real_ground_truth}")
            print(f"Similarity Score: {real_evaluation['similarity_score']:.3f}")
            return True
        else:
            print("❌ Real GAIA data logging failed")
            return False

    except Exception as e:
        print(f"❌ Error testing with real GAIA data: {e}")
        return False


def main():
    """Main test function."""
    print("Phoenix Evaluations Logging Test")
    print("=" * 50)

    # Test Phoenix connection
    client, spans_df = test_phoenix_connection()
    if not client:
        print("❌ Cannot proceed without Phoenix connection")
        return

    # Run tests
    tests_passed = 0
    total_tests = 3

    print(f"\nRunning {total_tests} tests...")

    # Test 1: Basic evaluation logging
    if test_evaluation_logging():
        tests_passed += 1

    # Test 2: Verify evaluations were logged
    if verify_logged_evaluations(client):
        tests_passed += 1

    # Test 3: Test with real GAIA data
    if test_with_real_gaia_data():
        tests_passed += 1

    # Summary
    print("\n" + "=" * 50)
    print(f"Test Results: {tests_passed}/{total_tests} tests passed")

    if tests_passed == total_tests:
        print("All tests passed! Phoenix evaluations logging is working correctly.")
        print("You should now see 'gaia_ground_truth' evaluations in the Phoenix UI.")
    else:
        print("⚠️ Some tests failed. Check the output above for details.")

    print(f"\nPhoenix UI: http://localhost:6006")
    print("Look for 'Evaluations' or 'Evals' tab to see the logged evaluations.")


if __name__ == "__main__":
    main()
test_phoenix_simple.py
ADDED
@@ -0,0 +1,132 @@
#!/usr/bin/env python3
"""
Simple test for Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix


def test_phoenix_logging():
    """Test Phoenix evaluations logging with simple data."""
    print("🧪 Testing Phoenix Evaluations Logging")
    print("=" * 50)

    # Step 1: Check Phoenix connection
    print("1. Checking Phoenix connection...")
    try:
        client = px.Client()
        print("✅ Phoenix connected successfully")
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        return False

    # Step 2: Create test evaluations
    print("\n2. Creating test evaluations...")
    test_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "5",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])
    print(f"✅ Created {len(test_evaluations)} test evaluations")

    # Step 3: Check existing spans
    print("\n3. Checking existing spans...")
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans")

        if len(spans_df) == 0:
            print("⚠️ No spans found - you need to run your agent first to create spans")
            return False

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return False

    # Step 4: Test logging
    print("\n4. Testing evaluation logging...")
    try:
        result = log_evaluations_to_phoenix(test_evaluations)

        if result is not None:
            print(f"✅ Successfully logged {len(result)} evaluations to Phoenix")
            print("Sample evaluation:")
            print(f" - Score: {result.iloc[0]['score']}")
            print(f" - Label: {result.iloc[0]['label']}")
            print(f" - Explanation: {result.iloc[0]['explanation'][:100]}...")

            # Step 5: Verify evaluations were logged
            print("\n5. Verifying evaluations in Phoenix...")
            try:
                import time
                time.sleep(2)  # Give Phoenix time to process

                evals_df = client.get_evaluations_dataframe()
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']

                print(f"Found {len(gaia_evals)} GAIA evaluations in Phoenix")

                if len(gaia_evals) > 0:
                    print("✅ Evaluations successfully verified in Phoenix!")
                    return True
                else:
                    print("⚠️ No GAIA evaluations found in Phoenix")
                    return False

            except Exception as e:
                print(f"⚠️ Could not verify evaluations: {e}")
                print("✅ Logging appeared successful though")
                return True

        else:
            print("❌ Evaluation logging failed")
            return False

    except Exception as e:
        print(f"❌ Error during logging: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    """Main test function."""
    success = test_phoenix_logging()

    print("\n" + "=" * 50)
    if success:
        print("Phoenix evaluations logging test PASSED!")
        print("You should now see 'gaia_ground_truth' evaluations in Phoenix UI")
        print("Visit: http://localhost:6006")
    else:
        print("❌ Phoenix evaluations logging test FAILED!")
        print("Make sure:")
        print(" 1. Your agent app is running (it starts Phoenix)")
        print(" 2. You've run your agent at least once to create spans")
        print(" 3. Phoenix is accessible at http://localhost:6006")


if __name__ == "__main__":
    main()