Romain Fayoux committed · Commit f9cf36d · Parent(s): 3ac0a19

Added ground truth evaluation and Phoenix logging

Files changed:
- GAIA_COMPARISON.md +142 -0
- app.py +39 -1
- comparison.py +160 -0
- data/metadata.jsonl +0 -0
- debug_phoenix.py +285 -0
- phoenix_evaluator.py +214 -0
- requirements.txt +1 -0
- test_comparison.py +144 -0
- test_phoenix_logging.py +261 -0
- test_phoenix_simple.py +132 -0
GAIA_COMPARISON.md
ADDED
@@ -0,0 +1,142 @@
# GAIA Ground Truth Comparison

This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.

## Features

- **Ground Truth Comparison**: Automatically compares agent answers to the correct answers from `data/metadata.jsonl`
- **Multiple Evaluation Metrics**: Exact match, similarity score, and contains-answer detection
- **Phoenix Integration**: Logs evaluations to Phoenix for persistent tracking and analysis
- **Enhanced Results Display**: Shows ground truth and comparison results in the Gradio interface

## How It Works

### 1. Ground Truth Loading
- Loads correct answers from `data/metadata.jsonl` (see the loading sketch below)
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset
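
A minimal sketch of the loading step, mirroring what `_load_ground_truth()` in `comparison.py` does (shown here as a standalone function; in the real code it is a method of `AnswerComparator`):

```python
import json

def load_ground_truth(metadata_path: str = "data/metadata.jsonl") -> dict:
    """Map each GAIA task_id to its "Final answer" string."""
    ground_truth = {}
    with open(metadata_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # one JSON object per line
            task_id = record.get("task_id")
            final_answer = record.get("Final answer")
            if task_id and final_answer is not None:
                ground_truth[task_id] = str(final_answer)
    return ground_truth
```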

### 2. Answer Comparison
For each agent answer, the system calculates three metrics (see the sketch after this list):
- **Exact Match**: Boolean indicating whether the answers match exactly (after normalization)
- **Similarity Score**: A 0-1 score computed with `difflib.SequenceMatcher`
- **Contains Answer**: Boolean indicating whether the correct answer is contained in the agent's response
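
A minimal sketch of the three metrics, mirroring the `exact_match`, `similarity_score`, and `contains_answer` methods of `AnswerComparator` (written here as free functions that take the normalizer from the next section as an argument):

```python
from difflib import SequenceMatcher

def exact_match(predicted: str, actual: str, normalize) -> bool:
    return normalize(predicted) == normalize(actual)

def similarity_score(predicted: str, actual: str, normalize) -> float:
    pred, act = normalize(predicted), normalize(actual)
    if not pred and not act:
        return 1.0   # both empty: treat as identical
    if not pred or not act:
        return 0.0   # only one side empty: no similarity
    return SequenceMatcher(None, pred, act).ratio()

def contains_answer(predicted: str, actual: str, normalize) -> bool:
    return normalize(actual) in normalize(predicted)
```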

### 3. Answer Normalization
Before comparison, answers are normalized by (see the sketch after this list):
- Converting to lowercase
- Removing punctuation (.,;:!?"')
- Normalizing whitespace
- Trimming leading/trailing spaces
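
These rules correspond to `normalize_answer()` in `comparison.py`; a minimal sketch:

```python
import re

def normalize_answer(answer) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparing."""
    if answer is None:
        return ""
    answer = str(answer).strip().lower()
    answer = re.sub(r'[.,;:!?"\']', '', answer)  # drop common punctuation
    answer = re.sub(r'\s+', ' ', answer)         # collapse runs of whitespace
    return answer

# normalize_answer('  The answer is: 3! ')  ->  'the answer is 3'
```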

### 4. Phoenix Integration
- Evaluations are automatically logged to Phoenix (see the sketch below)
- Each evaluation includes a score, label, explanation, and detailed metrics
- Viewable in the Phoenix UI for historical tracking and analysis
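
Logging goes through `phoenix.trace.SpanEvaluations`: each evaluation row is attached to an existing span and published under the name `gaia_ground_truth`. The sketch below condenses `log_evaluations_to_phoenix()` from `phoenix_evaluator.py`; the exact client call depends on your Phoenix version, which is why the real function falls back from `px.log_evaluations` to `Client.log_evaluations`.

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

def publish_evaluations(eval_records: list) -> None:
    # Each record needs a span_id plus score / label / explanation columns.
    eval_df = pd.DataFrame(eval_records)
    span_evals = SpanEvaluations(eval_name="gaia_ground_truth", dataframe=eval_df)
    try:
        px.log_evaluations(span_evals)           # newer Phoenix API
    except AttributeError:
        px.Client().log_evaluations(span_evals)  # fallback for older versions
```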

## Usage

### In Your Agent App
The comparison happens automatically when you run your agent (a condensed sketch of the wiring follows the steps below):

1. **Run your agent** - Process questions as usual
2. **Automatic comparison** - The system compares answers to ground truth
3. **Enhanced results** - The results table includes comparison columns
4. **Phoenix logging** - Evaluations are logged for persistent tracking
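
Concretely, the block added to `app.py` in this commit boils down to the following calls (error handling omitted; `answers_payload` and `results_log` come from the agent loop):

```python
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix

comparator = AnswerComparator()                               # loads data/metadata.jsonl
evaluations_df = comparator.evaluate_batch(answers_payload)   # one row per answered task
summary_stats = comparator.get_summary_stats(evaluations_df)  # exact-match rate, avg similarity, ...
results_log = comparator.enhance_results_log(results_log)     # adds the comparison columns
log_evaluations_to_phoenix(evaluations_df)                    # best effort; comparison works even if this fails
```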

### Results Display
Your results table now includes these additional columns:
- **Ground Truth**: The correct answer from the GAIA dataset
- **Exact Match**: True/False for exact matches
- **Similarity**: Similarity score (0-1)
- **Contains Answer**: True/False if the correct answer is contained

### Status Message
The status message now includes:
```
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
```

## Testing

Run the test suite to verify functionality:

```bash
python test_comparison.py
```

This will test:
- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading

## Files Added

- `comparison.py`: Main comparison logic and the `AnswerComparator` class
- `phoenix_evaluator.py`: Phoenix integration for logging evaluations
- `test_comparison.py`: Test suite for verification
- `GAIA_COMPARISON.md`: This documentation

## Dependencies Added

- `arize-phoenix`: For observability and evaluation logging
- `pandas`: For data manipulation (if not already present)

## Example Evaluation Result

```python
{
    "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    "predicted_answer": "3",
    "actual_answer": "3",
    "exact_match": True,
    "similarity_score": 1.0,
    "contains_answer": True,
    "error": None
}
```

## Phoenix UI

In the Phoenix interface, you can:
- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis

## Troubleshooting

### No Ground Truth Available
If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.

### Phoenix Connection Issues
If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.

### Low Similarity Scores
Low similarity scores might indicate that:
- The agent is providing verbose answers when short ones are expected
- The answer format doesn't match the expected format
- The agent is partially correct but not exact

## Customization

You can adjust the comparison logic in `comparison.py` (a hypothetical example follows this list):
- Modify `normalize_answer()` for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify the Phoenix logging format
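
For example, a hypothetical subclass that additionally ignores English articles before comparing (the class name and the extra rule are illustrative, not part of this commit):

```python
import re
from comparison import AnswerComparator

class LenientComparator(AnswerComparator):
    """Illustrative variant: also drops 'a', 'an', and 'the' when comparing."""
    def normalize_answer(self, answer: str) -> str:
        normalized = super().normalize_answer(answer)
        normalized = re.sub(r'\b(?:a|an|the)\b', '', normalized)  # drop articles
        return re.sub(r'\s+', ' ', normalized).strip()            # re-collapse whitespace
```

Because the other metrics call `normalize_answer()` internally, overriding it changes exact match, similarity, and containment all at once.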

## Performance

The comparison adds minimal overhead:
- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10 ms
- Phoenix logging: ~10-50 ms per evaluation

Total additional time: usually < 5 seconds for 50 questions.
app.py
CHANGED
@@ -7,6 +7,9 @@ from phoenix.otel import register
 from openinference.instrumentation.smolagents import SmolagentsInstrumentor
 from llm_only_agent import LLMOnlyAgent
 from multi_agent import MultiAgent
+from comparison import AnswerComparator
+from phoenix_evaluator import log_evaluations_to_phoenix
+import phoenix as px


 # (Keep Constants as is)
@@ -88,7 +91,7 @@ def run_and_submit_all( profile: gr.OAuthProfile | None, limit: int | None):
     results_log = []
     answers_payload = []
     # Limit for test purposes
-    limit =
+    limit = 2
     if limit is not None:
         questions_data = questions_data[:limit]
     print(f"Running agent on {len(questions_data)} questions...")
@@ -115,9 +118,44 @@ def run_and_submit_all( profile: gr.OAuthProfile | None, limit: int | None):
         print("Agent did not produce any answers to submit.")
         return "Agent did not produce any answers to submit.", pd.DataFrame(results_log)

+    # 3.5. Compare with Ground Truth and Log to Phoenix
+    print("Comparing answers with ground truth...")
+    try:
+        # Initialize comparator
+        comparator = AnswerComparator()
+
+        # Evaluate answers
+        evaluations_df = comparator.evaluate_batch(answers_payload)
+
+        # Get summary statistics
+        summary_stats = comparator.get_summary_stats(evaluations_df)
+
+        # Enhance results log with comparison data
+        results_log = comparator.enhance_results_log(results_log)
+
+        # Log evaluations to Phoenix
+        log_evaluations_to_phoenix(evaluations_df)
+
+        print(f"Ground truth comparison completed: {summary_stats['exact_matches']}/{summary_stats['total_questions']} exact matches")
+
+    except Exception as e:
+        print(f"Error during ground truth comparison: {e}")
+        summary_stats = {"error": str(e)}
+
     # 4. Prepare Submission
     submission_data = {"username": username.strip(), "agent_code": agent_code, "answers": answers_payload}
     status_update = f"Agent finished. Submitting {len(answers_payload)} answers for user '{username}'..."
+
+    # Add ground truth comparison to status
+    if "error" not in summary_stats:
+        status_update += f"\n\nGround Truth Comparison:\n"
+        status_update += f"Exact matches: {summary_stats['exact_matches']}/{summary_stats['total_questions']} ({summary_stats['exact_match_rate']:.1%})\n"
+        status_update += f"Average similarity: {summary_stats['average_similarity']:.3f}\n"
+        status_update += f"Contains correct answer: {summary_stats['contains_matches']}/{summary_stats['total_questions']} ({summary_stats['contains_match_rate']:.1%})\n"
+        status_update += f"Evaluations logged to Phoenix ✅"
+    else:
+        status_update += f"\n\nGround Truth Comparison Error: {summary_stats['error']}"
+
     print(status_update)

     # 5. Submit
comparison.py
ADDED
@@ -0,0 +1,160 @@
import json
import pandas as pd
from typing import Dict, List, Any
from difflib import SequenceMatcher
import re


class AnswerComparator:
    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        """Initialize the comparator with ground truth data."""
        self.ground_truth = self._load_ground_truth(metadata_path)
        print(f"Loaded ground truth for {len(self.ground_truth)} questions")

    def _load_ground_truth(self, metadata_path: str) -> Dict[str, str]:
        """Load ground truth answers from metadata.jsonl file."""
        ground_truth = {}
        try:
            with open(metadata_path, 'r', encoding='utf-8') as f:
                for line in f:
                    if line.strip():
                        data = json.loads(line)
                        task_id = data.get("task_id")
                        final_answer = data.get("Final answer")
                        if task_id and final_answer is not None:
                            ground_truth[task_id] = str(final_answer)
        except FileNotFoundError:
            print(f"Warning: Ground truth file {metadata_path} not found")
        except Exception as e:
            print(f"Error loading ground truth: {e}")

        return ground_truth

    def normalize_answer(self, answer: str) -> str:
        """Normalize answer for comparison."""
        if answer is None:
            return ""

        # Convert to string and strip whitespace
        answer = str(answer).strip()

        # Convert to lowercase for case-insensitive comparison
        answer = answer.lower()

        # Remove common punctuation that might not affect correctness
        answer = re.sub(r'[.,;:!?"\']', '', answer)

        # Normalize whitespace
        answer = re.sub(r'\s+', ' ', answer)

        return answer

    def exact_match(self, predicted: str, actual: str) -> bool:
        """Check if answers match exactly after normalization."""
        return self.normalize_answer(predicted) == self.normalize_answer(actual)

    def similarity_score(self, predicted: str, actual: str) -> float:
        """Calculate similarity score between predicted and actual answers."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        if not normalized_pred and not normalized_actual:
            return 1.0
        if not normalized_pred or not normalized_actual:
            return 0.0

        return SequenceMatcher(None, normalized_pred, normalized_actual).ratio()

    def contains_answer(self, predicted: str, actual: str) -> bool:
        """Check if the actual answer is contained in the predicted answer."""
        normalized_pred = self.normalize_answer(predicted)
        normalized_actual = self.normalize_answer(actual)

        return normalized_actual in normalized_pred

    def evaluate_answer(self, task_id: str, predicted_answer: str) -> Dict[str, Any]:
        """Evaluate a single answer against ground truth."""
        actual_answer = self.ground_truth.get(task_id)

        if actual_answer is None:
            return {
                "task_id": task_id,
                "predicted_answer": predicted_answer,
                "actual_answer": None,
                "exact_match": False,
                "similarity_score": 0.0,
                "contains_answer": False,
                "error": "No ground truth available"
            }

        return {
            "task_id": task_id,
            "predicted_answer": predicted_answer,
            "actual_answer": actual_answer,
            "exact_match": self.exact_match(predicted_answer, actual_answer),
            "similarity_score": self.similarity_score(predicted_answer, actual_answer),
            "contains_answer": self.contains_answer(predicted_answer, actual_answer),
            "error": None
        }

    def evaluate_batch(self, results: List[Dict[str, Any]]) -> pd.DataFrame:
        """Evaluate a batch of results."""
        evaluations = []

        for result in results:
            task_id = result.get("task_id") or result.get("Task ID")
            predicted_answer = result.get("submitted_answer") or result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)
                evaluations.append(evaluation)

        return pd.DataFrame(evaluations)

    def get_summary_stats(self, evaluations_df: pd.DataFrame) -> Dict[str, Any]:
        """Get summary statistics from evaluations."""
        if evaluations_df.empty:
            return {"error": "No evaluations available"}

        # Filter out entries without ground truth
        valid_evaluations = evaluations_df[evaluations_df['error'].isna()]

        if valid_evaluations.empty:
            return {"error": "No valid ground truth available"}

        total_questions = len(valid_evaluations)
        exact_matches = valid_evaluations['exact_match'].sum()
        avg_similarity = valid_evaluations['similarity_score'].mean()
        contains_matches = valid_evaluations['contains_answer'].sum()

        return {
            "total_questions": total_questions,
            "exact_matches": exact_matches,
            "exact_match_rate": exact_matches / total_questions,
            "average_similarity": avg_similarity,
            "contains_matches": contains_matches,
            "contains_match_rate": contains_matches / total_questions,
            "questions_with_ground_truth": total_questions
        }

    def enhance_results_log(self, results_log: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Add comparison columns to results log."""
        enhanced_results = []

        for result in results_log:
            task_id = result.get("Task ID")
            predicted_answer = result.get("Submitted Answer", "")

            if task_id is not None:
                evaluation = self.evaluate_answer(task_id, predicted_answer)

                # Add comparison info to result
                enhanced_result = result.copy()
                enhanced_result["Ground Truth"] = evaluation["actual_answer"] or "N/A"
                enhanced_result["Exact Match"] = evaluation["exact_match"]
                enhanced_result["Similarity"] = f"{evaluation['similarity_score']:.3f}"
                enhanced_result["Contains Answer"] = evaluation["contains_answer"]

                enhanced_results.append(enhanced_result)

        return enhanced_results
data/metadata.jsonl
ADDED
The diff for this file is too large to render.
See raw diff
debug_phoenix.py
ADDED
@@ -0,0 +1,285 @@
#!/usr/bin/env python3
"""
Enhanced debug script to check Phoenix status and evaluations.
"""

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import time
from datetime import datetime


def check_phoenix_connection():
    """Check if Phoenix is running and accessible."""
    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Try to get basic info
        try:
            spans_df = client.get_spans_dataframe()
            print(f"✅ Phoenix API working - can retrieve spans")
            return client
        except Exception as e:
            print(f"⚠️ Phoenix connected but API might have issues: {e}")
            return client

    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running. You should see a message like:")
        print("To view the Phoenix app in your browser, visit http://localhost:6006")
        return None


def check_spans(client):
    """Check spans in Phoenix."""
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} spans in Phoenix")

        if len(spans_df) > 0:
            print("Recent spans:")
            for i, (_, span) in enumerate(spans_df.head(5).iterrows()):
                span_id = span.get('context.span_id', 'no-id')
                span_name = span.get('name', 'unnamed')
                start_time = span.get('start_time', 'unknown')
                print(f" {i+1}. {span_name} ({span_id[:8]}...) - {start_time}")

            # Show input/output samples
            print("\nSpan content samples:")
            for i, (_, span) in enumerate(spans_df.head(3).iterrows()):
                input_val = str(span.get('input.value', ''))[:100]
                output_val = str(span.get('output.value', ''))[:100]
                print(f" Span {i+1}:")
                print(f"   Input: {input_val}...")
                print(f"   Output: {output_val}...")

        else:
            print("⚠️ No spans found. Run your agent first to generate traces.")

        return spans_df

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return pd.DataFrame()


def check_evaluations(client):
    """Check evaluations in Phoenix."""
    try:
        # Try different methods to get evaluations
        print("Checking evaluations...")

        # Method 1: Direct evaluation dataframe
        try:
            evals_df = client.get_evaluations_dataframe()
            print(f"Found {len(evals_df)} evaluations in Phoenix")

            if len(evals_df) > 0:
                print("Evaluation breakdown:")
                eval_names = evals_df['name'].value_counts()
                for name, count in eval_names.items():
                    print(f" - {name}: {count} evaluations")

                # Check for GAIA evaluations specifically
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
                if len(gaia_evals) > 0:
                    print(f"✅ Found {len(gaia_evals)} GAIA ground truth evaluations")

                    # Show sample evaluation
                    sample = gaia_evals.iloc[0]
                    print("Sample GAIA evaluation:")
                    print(f" - Score: {sample.get('score', 'N/A')}")
                    print(f" - Label: {sample.get('label', 'N/A')}")
                    print(f" - Explanation: {sample.get('explanation', 'N/A')[:100]}...")

                    # Show metadata if available
                    metadata = sample.get('metadata', {})
                    if metadata:
                        print(f" - Metadata keys: {list(metadata.keys())}")

                else:
                    print("❌ No GAIA ground truth evaluations found")
                    print("Available evaluation types:", list(eval_names.keys()))

            else:
                print("⚠️ No evaluations found in Phoenix")

            return evals_df

        except AttributeError as e:
            print(f"⚠️ get_evaluations_dataframe not available: {e}")
            print("This might be a Phoenix version issue")
            return pd.DataFrame()

    except Exception as e:
        print(f"❌ Error getting evaluations: {e}")
        return pd.DataFrame()


def test_evaluation_creation_and_logging():
    """Test creating and logging evaluations."""
    print("\nTesting evaluation creation and logging...")

    # Create sample evaluations
    sample_data = [
        {
            "task_id": "debug-test-1",
            "predicted_answer": "test answer 1",
            "actual_answer": "correct answer 1",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "debug-test-2",
            "predicted_answer": "exact match",
            "actual_answer": "exact match",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(sample_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    # Try to log to Phoenix
    try:
        print("Attempting to log evaluations to Phoenix...")
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Test evaluation logging successful")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Test evaluation logging failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Test evaluation logging error: {e}")
        import traceback
        traceback.print_exc()
        return False


def check_gaia_data():
    """Check GAIA ground truth data availability."""
    print("\nChecking GAIA ground truth data...")

    try:
        comparator = AnswerComparator()

        print(f"✅ Loaded {len(comparator.ground_truth)} GAIA ground truth answers")

        if len(comparator.ground_truth) > 0:
            # Show sample
            sample_task_id = list(comparator.ground_truth.keys())[0]
            sample_answer = comparator.ground_truth[sample_task_id]
            print(f"Sample: {sample_task_id} -> '{sample_answer}'")

            # Test evaluation
            test_eval = comparator.evaluate_answer(sample_task_id, "test answer")
            print(f"Test evaluation result: {test_eval}")

            return True
        else:
            print("❌ No GAIA ground truth data found")
            return False

    except Exception as e:
        print(f"❌ Error checking GAIA data: {e}")
        return False


def show_phoenix_ui_info():
    """Show information about Phoenix UI."""
    print("\nPhoenix UI Information:")
    print("-" * 30)
    print("Phoenix UI should be available at: http://localhost:6006")
    print("")
    print("In the Phoenix UI, look for:")
    print(" • 'Evaluations' tab or section")
    print(" • 'Evals' section")
    print(" • 'Annotations' tab")
    print(" • In 'Spans' view, look for evaluation badges on spans")
    print("")
    print("If you see evaluations, they should be named 'gaia_ground_truth'")
    print("Each evaluation should show:")
    print(" - Score (similarity score 0-1)")
    print(" - Label (correct/incorrect)")
    print(" - Explanation (predicted vs ground truth)")
    print(" - Metadata (task_id, exact_match, etc.)")


def main():
    """Main debug function."""
    print("Enhanced Phoenix Debug Script")
    print("=" * 50)

    # Check Phoenix connection
    client = check_phoenix_connection()
    if not client:
        print("\n❌ Cannot proceed without Phoenix connection")
        print("Make sure your agent app is running (it starts Phoenix)")
        return

    print("\nChecking Phoenix Data:")
    print("-" * 30)

    # Check spans
    spans_df = check_spans(client)

    # Check evaluations
    evals_df = check_evaluations(client)

    # Test evaluation creation
    test_success = test_evaluation_creation_and_logging()

    # Wait a moment and recheck evaluations
    if test_success:
        print("\nWaiting for evaluations to be processed...")
        time.sleep(3)

        print("Rechecking evaluations after test logging...")
        evals_df_after = check_evaluations(client)

        if len(evals_df_after) > len(evals_df):
            print("✅ New evaluations detected after test!")
        else:
            print("⚠️ No new evaluations detected")

    # Check GAIA data
    gaia_available = check_gaia_data()

    # Show Phoenix UI info
    show_phoenix_ui_info()

    # Final summary
    print("\n" + "=" * 50)
    print("Summary:")
    print(f" • Phoenix connected: {'✅' if client else '❌'}")
    print(f" • Spans available: {len(spans_df)} spans")
    print(f" • Evaluations found: {len(evals_df)} evaluations")
    print(f" • GAIA data available: {'✅' if gaia_available else '❌'}")
    print(f" • Test logging worked: {'✅' if test_success else '❌'}")

    print("\nNext Steps:")
    if len(spans_df) == 0:
        print(" • Run your agent to generate traces first")
    if len(evals_df) == 0:
        print(" • Check if evaluations are being logged correctly")
        print(" • Verify Phoenix version compatibility")
    if not gaia_available:
        print(" • Check that data/metadata.jsonl exists and is readable")

    print(f"\nPhoenix UI: http://localhost:6006")


if __name__ == "__main__":
    main()
phoenix_evaluator.py
ADDED
@@ -0,0 +1,214 @@
import pandas as pd
from typing import Dict, Any, List, Optional
from comparison import AnswerComparator
import phoenix as px
from phoenix.trace import SpanEvaluations


class GAIAPhoenixEvaluator:
    """Phoenix evaluator for GAIA dataset ground truth comparison."""

    def __init__(self, metadata_path: str = "data/metadata.jsonl"):
        self.comparator = AnswerComparator(metadata_path)
        self.eval_name = "gaia_ground_truth"

    def evaluate_spans(self, spans_df: pd.DataFrame) -> List[SpanEvaluations]:
        """Evaluate spans and return Phoenix SpanEvaluations."""
        evaluations = []

        for _, span in spans_df.iterrows():
            # Extract task_id and answer from span
            task_id = self._extract_task_id(span)
            predicted_answer = self._extract_predicted_answer(span)
            span_id = span.get("context.span_id")

            if task_id and predicted_answer is not None and span_id:
                evaluation = self.comparator.evaluate_answer(task_id, predicted_answer)

                # Create evaluation record for Phoenix
                eval_record = {
                    "span_id": span_id,
                    "score": 1.0 if evaluation["exact_match"] else evaluation["similarity_score"],
                    "label": "correct" if evaluation["exact_match"] else "incorrect",
                    "explanation": self._create_explanation(evaluation),
                    "task_id": task_id,
                    "predicted_answer": evaluation["predicted_answer"],
                    "ground_truth": evaluation["actual_answer"],
                    "exact_match": evaluation["exact_match"],
                    "similarity_score": evaluation["similarity_score"],
                    "contains_answer": evaluation["contains_answer"]
                }

                evaluations.append(eval_record)

        if evaluations:
            # Create SpanEvaluations object
            eval_df = pd.DataFrame(evaluations)
            return [SpanEvaluations(eval_name=self.eval_name, dataframe=eval_df)]

        return []

    def _extract_task_id(self, span) -> Optional[str]:
        """Extract task_id from span data."""
        # Try span attributes first
        attributes = span.get("attributes", {})
        if isinstance(attributes, dict):
            if "task_id" in attributes:
                return attributes["task_id"]

        # Try input data
        input_data = span.get("input", {})
        if isinstance(input_data, dict):
            if "task_id" in input_data:
                return input_data["task_id"]

        # Try to extract from input value if it's a string
        input_value = span.get("input.value", "")
        if isinstance(input_value, str):
            # Look for UUID pattern in input
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, input_value)
            if match:
                return match.group(0)

        # Try span name
        span_name = span.get("name", "")
        if isinstance(span_name, str):
            import re
            uuid_pattern = r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
            match = re.search(uuid_pattern, span_name)
            if match:
                return match.group(0)

        return None

    def _extract_predicted_answer(self, span) -> Optional[str]:
        """Extract predicted answer from span output."""
        # Try different output fields
        output_fields = ["output.value", "output", "response", "result"]

        for field in output_fields:
            value = span.get(field)
            if value is not None:
                return str(value)

        return None

    def _create_explanation(self, evaluation: Dict[str, Any]) -> str:
        """Create human-readable explanation of the evaluation."""
        predicted = evaluation["predicted_answer"]
        actual = evaluation["actual_answer"]
        exact_match = evaluation["exact_match"]
        similarity = evaluation["similarity_score"]
        contains = evaluation["contains_answer"]

        if actual is None:
            return "❌ No ground truth available for comparison"

        explanation = f"Predicted: '{predicted}' | Ground Truth: '{actual}' | "

        if exact_match:
            explanation += "✅ Exact match"
        elif contains:
            explanation += f"⚠️ Contains correct answer (similarity: {similarity:.3f})"
        else:
            explanation += f"❌ Incorrect (similarity: {similarity:.3f})"

        return explanation


def add_gaia_evaluations_to_phoenix(spans_df: pd.DataFrame, metadata_path: str = "data/metadata.jsonl") -> List[SpanEvaluations]:
    """Add GAIA evaluation results to Phoenix spans."""
    evaluator = GAIAPhoenixEvaluator(metadata_path)
    return evaluator.evaluate_spans(spans_df)


def log_evaluations_to_phoenix(evaluations_df: pd.DataFrame, session_id: Optional[str] = None) -> Optional[pd.DataFrame]:
    """Log evaluation results directly to Phoenix."""
    try:
        client = px.Client()

        # Get current spans to match evaluations to span_ids
        spans_df = client.get_spans_dataframe()

        if spans_df is None or spans_df.empty:
            print("No spans found to attach evaluations to")
            return None

        # Create evaluation records for Phoenix
        evaluation_records = []
        spans_with_evals = []

        for _, eval_row in evaluations_df.iterrows():
            task_id = eval_row["task_id"]

            # Try to find matching span by searching for task_id in span input
            matching_spans = spans_df[
                spans_df['input.value'].astype(str).str.contains(task_id, na=False, case=False)
            ]

            if len(matching_spans) == 0:
                # Try alternative search in span attributes or name
                matching_spans = spans_df[
                    spans_df['name'].astype(str).str.contains(task_id, na=False, case=False)
                ]

            if len(matching_spans) > 0:
                span_id = matching_spans.iloc[0]['context.span_id']

                # Create evaluation record in Phoenix format
                evaluation_record = {
                    "span_id": span_id,
                    "name": "gaia_ground_truth",
                    "score": eval_row["similarity_score"],
                    "label": "correct" if bool(eval_row["exact_match"]) else "incorrect",
                    "explanation": f"Predicted: '{eval_row['predicted_answer']}' | Ground Truth: '{eval_row['actual_answer']}' | Similarity: {eval_row['similarity_score']:.3f} | Exact Match: {eval_row['exact_match']}",
                    "annotator_kind": "HUMAN",
                    "metadata": {
                        "task_id": task_id,
                        "exact_match": eval_row["exact_match"],
                        "similarity_score": eval_row["similarity_score"],
                        "contains_answer": eval_row["contains_answer"],
                        "predicted_answer": eval_row["predicted_answer"],
                        "ground_truth": eval_row["actual_answer"]
                    }
                }

                evaluation_records.append(evaluation_record)
                spans_with_evals.append(span_id)

        if evaluation_records:
            # Convert to DataFrame for Phoenix
            eval_df = pd.DataFrame(evaluation_records)

            # Create SpanEvaluations object
            span_evaluations = SpanEvaluations(
                eval_name="gaia_ground_truth",
                dataframe=eval_df
            )

            # Log evaluations to Phoenix
            try:
                # Try the newer Phoenix API
                px.log_evaluations(span_evaluations)
                print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix")
            except AttributeError:
                # Fallback for older Phoenix versions
                client.log_evaluations(span_evaluations)
                print(f"✅ Successfully logged {len(evaluation_records)} evaluations to Phoenix (fallback)")

            return eval_df
        else:
            print("⚠️ No matching spans found for evaluations")
            if spans_df is not None:
                print(f"Available spans: {len(spans_df)}")
                if len(spans_df) > 0:
                    print("Sample span names:", spans_df['name'].head(3).tolist())
            return None

    except Exception as e:
        print(f"❌ Could not log evaluations to Phoenix: {e}")
        import traceback
        traceback.print_exc()
        return None
requirements.txt
CHANGED
@@ -8,3 +8,4 @@ markdownify
 requests
 smolagents[telemetry,toolkit]
 chess
+pandas
test_comparison.py
ADDED
@@ -0,0 +1,144 @@
#!/usr/bin/env python3
"""
Test script for GAIA comparison functionality.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
import pandas as pd


def test_basic_comparison():
    """Test basic comparison functionality."""
    print("Testing basic comparison...")

    # Initialize comparator
    comparator = AnswerComparator()

    # Test with some sample data
    sample_results = [
        {"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "submitted_answer": "3"},
        {"task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6", "submitted_answer": "3"},
        {"task_id": "nonexistent-task", "submitted_answer": "test"}
    ]

    # Evaluate batch
    evaluations_df = comparator.evaluate_batch(sample_results)
    print(f"Evaluated {len(evaluations_df)} answers")

    # Get summary stats
    summary_stats = comparator.get_summary_stats(evaluations_df)
    print("Summary statistics:")
    for key, value in summary_stats.items():
        print(f"  {key}: {value}")

    # Test single evaluation
    print("\nTesting single evaluation...")
    single_eval = comparator.evaluate_answer("8e867cd7-cff9-4e6c-867a-ff5ddc2550be", "3")
    print(f"Single evaluation result: {single_eval}")

    return evaluations_df


def test_results_enhancement():
    """Test results log enhancement."""
    print("\nTesting results log enhancement...")

    comparator = AnswerComparator()

    # Sample results log (like what comes from your agent)
    sample_results_log = [
        {
            "Task ID": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "Question": "How many studio albums were published by Mercedes Sosa between 2000 and 2009?",
            "Submitted Answer": "3"
        },
        {
            "Task ID": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "Question": "Test question",
            "Submitted Answer": "wrong answer"
        }
    ]

    # Enhance results
    enhanced_results = comparator.enhance_results_log(sample_results_log)

    print("Enhanced results:")
    for result in enhanced_results:
        print(f"  Task: {result['Task ID']}")
        print(f"  Answer: {result['Submitted Answer']}")
        print(f"  Ground Truth: {result['Ground Truth']}")
        print(f"  Exact Match: {result['Exact Match']}")
        print(f"  Similarity: {result['Similarity']}")
        print()


def test_phoenix_integration():
    """Test Phoenix integration (basic)."""
    print("\nTesting Phoenix integration...")

    # Create sample evaluations
    sample_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "wrong",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(sample_evaluations)
        if result is not None:
            print("✅ Phoenix integration successful")
        else:
            print("⚠️ Phoenix integration failed (likely Phoenix not running)")
    except Exception as e:
        print(f"⚠️ Phoenix integration error: {e}")


def main():
    """Run all tests."""
    print("="*50)
    print("GAIA Comparison Test Suite")
    print("="*50)

    try:
        # Test basic comparison
        evaluations_df = test_basic_comparison()

        # Test results enhancement
        test_results_enhancement()

        # Test Phoenix integration
        test_phoenix_integration()

        print("\n" + "="*50)
        print("All tests completed!")
        print("="*50)

    except Exception as e:
        print(f"❌ Test failed with error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()
test_phoenix_logging.py
ADDED
@@ -0,0 +1,261 @@
#!/usr/bin/env python3
"""
Test script to verify Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix
from datetime import datetime
import time


def test_phoenix_connection():
    """Test Phoenix connection and basic functionality."""
    print("Testing Phoenix Connection...")

    try:
        client = px.Client()
        print("✅ Phoenix client connected successfully")

        # Check if Phoenix is actually running
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans in Phoenix")

        return client, spans_df
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        print("Make sure Phoenix is running and accessible at http://localhost:6006")
        return None, None


def create_test_evaluations():
    """Create test evaluations for logging."""
    print("\nCreating test evaluations...")

    test_data = [
        {
            "task_id": "test-exact-match",
            "predicted_answer": "Paris",
            "actual_answer": "Paris",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-partial-match",
            "predicted_answer": "The capital of France is Paris",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.75,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "test-no-match",
            "predicted_answer": "London",
            "actual_answer": "Paris",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ]

    evaluations_df = pd.DataFrame(test_data)
    print(f"Created {len(evaluations_df)} test evaluations")

    return evaluations_df


def create_mock_spans(client):
    """Create mock spans for testing (if no real spans exist)."""
    print("\nCreating mock spans for testing...")

    # Note: This is a simplified mock - in real usage, spans are created by agent runs
    mock_spans = [
        {
            "context.span_id": "mock-span-1",
            "name": "test_agent_run",
            "input.value": "Question about test-exact-match",
            "output.value": "Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-2",
            "name": "test_agent_run",
            "input.value": "Question about test-partial-match",
            "output.value": "The capital of France is Paris",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        },
        {
            "context.span_id": "mock-span-3",
            "name": "test_agent_run",
            "input.value": "Question about test-no-match",
            "output.value": "London",
            "start_time": datetime.now(),
            "end_time": datetime.now()
        }
    ]

    print(f"Created {len(mock_spans)} mock spans")
    return pd.DataFrame(mock_spans)


def test_evaluation_logging():
    """Test the actual evaluation logging to Phoenix."""
    print("\nTesting evaluation logging...")

    # Create test evaluations
    evaluations_df = create_test_evaluations()

    # Try to log to Phoenix
    try:
        result = log_evaluations_to_phoenix(evaluations_df)

        if result is not None:
            print("✅ Evaluation logging test successful!")
            print(f"Logged {len(result)} evaluations")
            return True
        else:
            print("❌ Evaluation logging test failed - no result returned")
            return False

    except Exception as e:
        print(f"❌ Evaluation logging test failed with error: {e}")
        import traceback
        traceback.print_exc()
        return False


def verify_logged_evaluations(client):
    """Verify that evaluations were actually logged to Phoenix."""
    print("\nVerifying logged evaluations...")

    try:
        # Give Phoenix a moment to process
        time.sleep(2)

        # Try to retrieve evaluations
        evals_df = client.get_evaluations_dataframe()
        print(f"Found {len(evals_df)} total evaluations in Phoenix")

        # Look for our specific evaluations
        if len(evals_df) > 0:
            gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']
            print(f"Found {len(gaia_evals)} GAIA ground truth evaluations")

            if len(gaia_evals) > 0:
                print("✅ Successfully verified evaluations in Phoenix!")

                # Show sample evaluation
                sample_eval = gaia_evals.iloc[0]
                print(f"Sample evaluation:")
                print(f"  - Score: {sample_eval.get('score', 'N/A')}")
                print(f"  - Label: {sample_eval.get('label', 'N/A')}")
                print(f"  - Explanation: {sample_eval.get('explanation', 'N/A')}")

                return True
            else:
                print("❌ No GAIA evaluations found after logging")
                return False
        else:
            print("❌ No evaluations found in Phoenix")
            return False

    except Exception as e:
        print(f"❌ Error verifying evaluations: {e}")
        return False


def test_with_real_gaia_data():
    """Test with actual GAIA data if available."""
    print("\nTesting with real GAIA data...")

    try:
        # Initialize comparator
        comparator = AnswerComparator()

        if len(comparator.ground_truth) == 0:
            print("⚠️ No GAIA ground truth data available")
            return False

        # Create a real evaluation with GAIA data
        real_task_id = list(comparator.ground_truth.keys())[0]
        real_ground_truth = comparator.ground_truth[real_task_id]

        real_evaluation = comparator.evaluate_answer(real_task_id, "test answer")

        real_eval_df = pd.DataFrame([real_evaluation])

        # Log to Phoenix
        result = log_evaluations_to_phoenix(real_eval_df)

        if result is not None:
            print("✅ Real GAIA data logging successful!")
            print(f"Task ID: {real_task_id}")
            print(f"Ground Truth: {real_ground_truth}")
            print(f"Similarity Score: {real_evaluation['similarity_score']:.3f}")
            return True
        else:
            print("❌ Real GAIA data logging failed")
            return False

    except Exception as e:
        print(f"❌ Error testing with real GAIA data: {e}")
        return False


def main():
    """Main test function."""
    print("Phoenix Evaluations Logging Test")
    print("=" * 50)

    # Test Phoenix connection
    client, spans_df = test_phoenix_connection()
    if not client:
        print("❌ Cannot proceed without Phoenix connection")
        return

    # Run tests
    tests_passed = 0
    total_tests = 3

    print(f"\nRunning {total_tests} tests...")

    # Test 1: Basic evaluation logging
    if test_evaluation_logging():
        tests_passed += 1

    # Test 2: Verify evaluations were logged
    if verify_logged_evaluations(client):
        tests_passed += 1

    # Test 3: Test with real GAIA data
    if test_with_real_gaia_data():
        tests_passed += 1

    # Summary
    print("\n" + "=" * 50)
    print(f"Test Results: {tests_passed}/{total_tests} tests passed")

    if tests_passed == total_tests:
        print("All tests passed! Phoenix evaluations logging is working correctly.")
        print("You should now see 'gaia_ground_truth' evaluations in the Phoenix UI.")
    else:
        print("⚠️ Some tests failed. Check the output above for details.")

    print(f"\nPhoenix UI: http://localhost:6006")
    print("Look for 'Evaluations' or 'Evals' tab to see the logged evaluations.")


if __name__ == "__main__":
    main()
test_phoenix_simple.py
ADDED
@@ -0,0 +1,132 @@
#!/usr/bin/env python3
"""
Simple test for Phoenix evaluations logging.
"""

import sys
import os
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

import phoenix as px
import pandas as pd
from comparison import AnswerComparator
from phoenix_evaluator import log_evaluations_to_phoenix


def test_phoenix_logging():
    """Test Phoenix evaluations logging with simple data."""
    print("🧪 Testing Phoenix Evaluations Logging")
    print("=" * 50)

    # Step 1: Check Phoenix connection
    print("1. Checking Phoenix connection...")
    try:
        client = px.Client()
        print("✅ Phoenix connected successfully")
    except Exception as e:
        print(f"❌ Phoenix connection failed: {e}")
        return False

    # Step 2: Create test evaluations
    print("\n2. Creating test evaluations...")
    test_evaluations = pd.DataFrame([
        {
            "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
            "predicted_answer": "3",
            "actual_answer": "3",
            "exact_match": True,
            "similarity_score": 1.0,
            "contains_answer": True,
            "error": None
        },
        {
            "task_id": "a1e91b78-d3d8-4675-bb8d-62741b4b68a6",
            "predicted_answer": "5",
            "actual_answer": "3",
            "exact_match": False,
            "similarity_score": 0.2,
            "contains_answer": False,
            "error": None
        }
    ])
    print(f"✅ Created {len(test_evaluations)} test evaluations")

    # Step 3: Check existing spans
    print("\n3. Checking existing spans...")
    try:
        spans_df = client.get_spans_dataframe()
        print(f"Found {len(spans_df)} existing spans")

        if len(spans_df) == 0:
            print("⚠️ No spans found - you need to run your agent first to create spans")
            return False

    except Exception as e:
        print(f"❌ Error getting spans: {e}")
        return False

    # Step 4: Test logging
    print("\n4. Testing evaluation logging...")
    try:
        result = log_evaluations_to_phoenix(test_evaluations)

        if result is not None:
            print(f"✅ Successfully logged {len(result)} evaluations to Phoenix")
            print("Sample evaluation:")
            print(f" - Score: {result.iloc[0]['score']}")
            print(f" - Label: {result.iloc[0]['label']}")
            print(f" - Explanation: {result.iloc[0]['explanation'][:100]}...")

            # Step 5: Verify evaluations were logged
            print("\n5. Verifying evaluations in Phoenix...")
            try:
                import time
                time.sleep(2)  # Give Phoenix time to process

                evals_df = client.get_evaluations_dataframe()
                gaia_evals = evals_df[evals_df['name'] == 'gaia_ground_truth']

                print(f"Found {len(gaia_evals)} GAIA evaluations in Phoenix")

                if len(gaia_evals) > 0:
                    print("✅ Evaluations successfully verified in Phoenix!")
                    return True
                else:
                    print("⚠️ No GAIA evaluations found in Phoenix")
                    return False

            except Exception as e:
                print(f"⚠️ Could not verify evaluations: {e}")
                print("✅ Logging appeared successful though")
                return True

        else:
            print("❌ Evaluation logging failed")
            return False

    except Exception as e:
        print(f"❌ Error during logging: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    """Main test function."""
    success = test_phoenix_logging()

    print("\n" + "=" * 50)
    if success:
        print("Phoenix evaluations logging test PASSED!")
        print("You should now see 'gaia_ground_truth' evaluations in Phoenix UI")
        print("Visit: http://localhost:6006")
    else:
        print("❌ Phoenix evaluations logging test FAILED!")
        print("Make sure:")
        print(" 1. Your agent app is running (it starts Phoenix)")
        print(" 2. You've run your agent at least once to create spans")
        print(" 3. Phoenix is accessible at http://localhost:6006")


if __name__ == "__main__":
    main()