Organize project structure: move test scripts to tests/scripts and documentation to docs/reports
- docs/reports/MCP_EVALUATION_REPORT.textile +107 -0
- docs/reports/comprehensive_mcp_evaluation.json +956 -0
- docs/reports/final_14_model_evaluation.json +256 -0
- docs/reports/final_model_comparison.json +478 -0
- docs/reports/practical_mcp_results.json +149 -0
- docs/reports/quick_model_comparison.json +43 -0
- docs/reports/qwen_mcp_evaluation_results.json +37 -0
- docs/reports/realistic_mcp_evaluation.json +445 -0
- tests/scripts/comparative_model_evaluation.py +366 -0
- tests/scripts/comprehensive_mcp_evaluation.py +379 -0
- tests/scripts/debug_model_loading.py +120 -0
- tests/scripts/diagnose_mcp_failures.py +182 -0
- tests/scripts/diagnostic_qwen_test.py +227 -0
- tests/scripts/direct_qwen_mcp_test.py +297 -0
- tests/scripts/final_model_comparison.py +263 -0
- tests/scripts/investigate_mcp_failures.py +181 -0
- tests/scripts/practical_mcp_test.py +197 -0
- tests/scripts/prompt_format_investigation.py +156 -0
- tests/scripts/quick_model_comparison.py +176 -0
- tests/scripts/qwen_mcp_capability_assessment.py +367 -0
- tests/scripts/qwen_mcp_evaluation.py +279 -0
- tests/scripts/realistic_mcp_evaluation.py +387 -0
- tests/scripts/simple_model_diagnostic.py +94 -0
- tests/scripts/simple_qwen_test.py +137 -0
- tests/scripts/test_improved_mcp_formats.py +229 -0
- tests/scripts/test_mcp.py +50 -0
- tests/scripts/test_q8_models.py +180 -0
- tests/scripts/test_qwen_mcp_capabilities.py +326 -0
docs/reports/MCP_EVALUATION_REPORT.textile
ADDED
@@ -0,0 +1,107 @@
h1. MCP Capability Evaluation Report for Small Language Models (SLMs)

h2. Executive Summary

This report evaluates 14 Small Language Models (SLMs), each with roughly 3B parameters or fewer, on their MCP (Model Context Protocol) capabilities. The evaluation measures how effectively each model converts natural-language RTS game commands into structured JSON tool calls.

*NEW*: This report has been updated with two additional code-specialized models (Qwen2.5-Coder-1.5B and Yi-Coder-1.5B). Qwen2.5-Coder-1.5B is the new top scorer at 9.7/10.

h2. What is MCP?

MCP (Model Context Protocol) is a standardized protocol that enables AI models to interact with external tools and systems through structured JSON calls. In the context of RTS games, MCP allows:

* Conversion of natural language commands into executable actions
* Structured communication between AI and game engines
* Standardized tool calling interface
* Real-time command processing

h2. Evaluation Methodology

h3. Test Scenarios

We evaluated each model on three realistic RTS game scenarios:

# *State Command*: "show game state" → Expected: @get_game_state@
# *Movement Command*: "move infantry to 150,200" → Expected: @move_units@ with coordinates
# *Attack Command*: "attack enemy tank at 300,150" → Expected: @attack_unit@ with target coordinates
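The three scenarios above can be written down as plain data, which makes the expected tool for each command explicit. This is a minimal sketch and not the repository's test harness: the @{"tool": ..., "args": ...}@ call shape and the @target_x@ / @target_y@ argument names are assumptions made for illustration, as are the names @SCENARIOS@ and @ideal_call@.

```python
import json

# Hypothetical encoding of the three scenarios; the call shape and the
# argument names (target_x, target_y) are assumptions for illustration.
SCENARIOS = [
    {"command": "show game state",
     "expected_tool": "get_game_state", "expected_coords": None},
    {"command": "move infantry to 150,200",
     "expected_tool": "move_units", "expected_coords": (150, 200)},
    {"command": "attack enemy tank at 300,150",
     "expected_tool": "attack_unit", "expected_coords": (300, 150)},
]


def ideal_call(scenario):
    """Build the tool call an ideal model might emit for one scenario."""
    args = {}
    if scenario["expected_coords"] is not None:
        x, y = scenario["expected_coords"]
        args = {"target_x": x, "target_y": y}
    return {"tool": scenario["expected_tool"], "args": args}


if __name__ == "__main__":
    for scenario in SCENARIOS:
        print(scenario["command"], "->", json.dumps(ideal_call(scenario)))
```

Keeping the scenarios as data means new commands can be added without touching the evaluation loop.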

h3. Scoring System (0-10 points per test)

* +4 points: Correct tool identification
* +3 points: Valid JSON structure
* +2 points: Proper tool/action terminology
* +1 point: Correct coordinate extraction
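One possible reading of this rubric is sketched below. The report does not spell out how each criterion is checked, so every check here is an assumption: tool identification is a substring match, "terminology" means a quoted @tool@, @action@ or @args@ key, and the coordinate point is granted only when the tool was identified (and automatically when the scenario has no coordinates). The names @find_json_object@ and @score_response@ are hypothetical.

```python
import json
import re


def find_json_object(text):
    """Return the first JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char != "{":
            continue
        try:
            obj, _end = decoder.raw_decode(text, index)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


def score_response(response, expected_tool, expected_coords=None):
    """Score one raw model response on the 0-10 rubric (assumed checks)."""
    score = 0
    tool_ok = expected_tool in response
    if tool_ok:
        score += 4  # correct tool identification
    if find_json_object(response) is not None:
        score += 3  # valid JSON structure
    if re.search(r'"(tool|action|args)"', response):
        score += 2  # tool/action terminology
    coords_ok = expected_coords is None or all(
        str(value) in response for value in expected_coords
    )
    if tool_ok and coords_ok:
        score += 1  # coordinate extraction
    return score


if __name__ == "__main__":
    reply = '{"tool": "move_units", "args": {"target_x": 150, "target_y": 200}}'
    print(score_response(reply, "move_units", (150, 200)))
```

Substring checks are lenient: a response that merely mentions the tool name in prose still earns the four identification points, which may explain partial scores for models that echo the prompt.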

h3. Models Evaluated

The evaluation includes 14 models across different categories:

* *General-purpose SLMs* (3 models)
* *MCP-specialized models* (7 models with various quantizations)
* *Code-specialized models* (3 models)
* *Code-specialized failing* (1 model)

h2. Test Results

The comprehensive evaluation revealed significant differences in MCP capabilities across models:

h3. Performance Ranking Table

|_. Rank|_. Model|_. MCP Score|_. Avg Time|_. Size|_. Efficiency|_. Notes|
| *1* | *@Qwen2.5-Coder-1.5B-Q4@* | *9.7/10* | *4.12s* | *1017MB* | *2.34 pts/s* | *Champion* |
| 2 | @Qwen2.5-Coder-0.5B@ | 4.3/10 | 2.08s | 409MB | 2.08 pts/s | Previous champion |
| 3 | @Qwen3-0.6B@ | 3.7/10 | 3.98s | 610MB | 0.92 pts/s | |
| 4 | @Gemma-3-270M@ | 3.7/10 | 2.29s | 428MB | 1.60 pts/s | |
| 5 | @MCPR-L-3B-Exa-Q8@ | 3.7/10 | 17.42s | 3133MB | 0.21 pts/s | |
| 6 | @Gemma-3n-E2B-it-Q8@ | 3.7/10 | 14.80s | 4566MB | 0.25 pts/s | |
| 7 | @Qwen3-1.7B@ | 3.7/10 | 6.24s | 1008MB | 0.59 pts/s | |
| 8 | @Qwen2.5-0.5B@ | 2.7/10 | 1.17s | 409MB | 2.28 pts/s | |
| 9 | @Gemma-3n-E2B-it-IQ2@ | 2.3/10 | 14.11s | 1958MB | 0.17 pts/s | |
| 10 | @Llama-Breeze2-3B-Q2@ | 1.3/10 | 11.39s | 1424MB | 0.12 pts/s | |
| 11 | @Yi-Coder-1.5B-Q4@ | 0.0/10 | 11.64s | 826MB | 0.00 pts/s | Prompt format issue |
| 12 | @MCP-Instruct-v1-Q4@ | 0.0/10 | 0.00s | 697MB | 0.00 pts/s | |
| 13 | @MCPR-L-3B-Exa-Q2@ | 0.0/10 | 10.63s | 1216MB | 0.00 pts/s | |
| 14 | @MCP-Instruct-v1-Q8@ | 0.0/10 | 0.00s | 1465MB | 0.00 pts/s | |
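The Efficiency column appears to be the average score divided by the average response time, in points per second. The sketch below works under that assumption. It also assumes that scores are averages over three tests (so 4.3 is really 13/3), which is why rounded table values can differ from a recomputation in the last digit. The function name @efficiency@ is hypothetical.

```python
def efficiency(avg_score, avg_time_s):
    """Points per second; 0.0 when no time was recorded (failed runs)."""
    if avg_time_s <= 0:
        return 0.0
    return round(avg_score / avg_time_s, 2)


# Scores entered as thirds (assumption: three tests per model).
ROWS = [
    ("Qwen2.5-Coder-0.5B", 13 / 3, 2.08),
    ("Gemma-3-270M", 11 / 3, 2.29),
    ("MCP-Instruct-v1-Q4", 0.0, 0.00),
]

if __name__ == "__main__":
    for name, score, seconds in ROWS:
        print(f"{name}: {efficiency(score, seconds)} pts/s")
```

Guarding against a zero time matters here because the models that failed to load report 0.00s.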

h2. Key Findings

h3. Performance Insights

* *The top code-specialized model dramatically outperforms the rest*: Qwen2.5-Coder-1.5B scored 9.7/10, more than twice the score of any other model
* *Scaling works for code-specialized models*: Increasing from 0.5B to 1.5B parameters improved the score from 4.3/10 to 9.7/10
* *Near-perfect MCP capability exists in small models*: The 1.5B model achieved 10/10 on 2 out of 3 tests with proper JSON extraction
* *Smaller models can be more efficient*: The 270M-parameter Gemma model matched the 3.7/10 score of the much larger MCPR-L-3B-Exa-Q8 while responding in 2.29s instead of 17.42s
* *Quantization matters*: Q8 versions generally performed better than Q2/Q4 versions for MCP-specialized models

h3. Technical Observations

* *Markdown wrapping requires extraction*: Qwen2.5-Coder-1.5B wraps JSON in markdown code blocks (@```json```@), requiring extraction logic
* *MCP-Instruct models failed completely* due to technical issues (@llama_decode returned -1@)
* *Yi-Coder has prompt format incompatibility*: It returns the prompt itself rather than generating a response
* *Larger models don't guarantee better performance*: The 3B models took roughly 11s to 17s per response yet scored no higher than 3.7/10, the same as several sub-2B general models
* *Response time varies dramatically*: From 1.17s (Qwen2.5-0.5B) to 17.42s (MCPR-L-3B-Exa-Q8) among models that produced output
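The extraction logic mentioned in the first observation could look like the sketch below. It is an illustration, not the code used in the evaluation: it assumes the model emits at most one fenced block, and that an unfenced response is bare JSON. The names @FENCE_RE@ and @extract_tool_call@ are hypothetical.

```python
import json
import re

# Build the fence marker from single backticks so this example can itself
# sit inside a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_tool_call(response):
    """Return the JSON object in a response, unwrapping a markdown fence."""
    match = FENCE_RE.search(response)
    candidate = match.group(1) if match else response
    try:
        parsed = json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


if __name__ == "__main__":
    wrapped = FENCE + 'json\n{"tool": "get_game_state", "args": {}}\n' + FENCE
    print(extract_tool_call("Here is the call:\n" + wrapped))
```

Returning @None@ instead of raising lets a scorer treat unparseable output as a zero for the JSON criterion without special error handling.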

h2. Recommendations

Based on the updated evaluation results, we recommend:

# *Primary Choice*: @Qwen2.5-Coder-1.5B-Q4@ - Highest MCP score (9.7/10) with reasonable speed (4.12s) and size (1017MB)
# *Budget Alternative*: @Qwen2.5-Coder-0.5B@ - Best balance for resource-constrained environments (4.3/10, 2.08s, 409MB)
# *Ultra-lightweight*: @Gemma-3-270M@ - Smallest parameter count with a competitive score (3.7/10, 2.29s, 428MB)
# *Avoid*: MCP-Instruct models (technical incompatibility), Yi-Coder (prompt format issues)

h2. Conclusion

This 14-model evaluation yields the following conclusions about MCP capabilities in RTS games:

* *Code-specialized models lead by a wide margin*: The champion (Qwen2.5-Coder-1.5B) achieved 9.7/10, while the best MCP-specialized model only reached 3.7/10
* *Parameter scaling works for code models*: Tripling parameters (0.5B → 1.5B) more than doubled MCP performance (4.3 → 9.7)
* *Near-perfect MCP is achievable*: Small models under 2B parameters can achieve 10/10 on individual tests when JSON extraction is in place
* *JSON extraction is critical*: Qwen2.5-Coder-1.5B wraps its output in markdown, so production use requires extraction logic
* *Efficiency varies dramatically*: Among models that scored above zero, efficiency ranges from 0.12 pts/s (Llama-Breeze2-3B-Q2) to 2.34 pts/s (Qwen2.5-Coder-1.5B-Q4)

The results suggest that, for developers implementing MCP-based AI assistants in gaming applications, code-specialized models offer the most reliable path to high-quality MCP capabilities.

*Report generated on: 2025-10-05*
*Updated on: 2025-10-05* (added Qwen2.5-Coder-1.5B and Yi-Coder-1.5B)
*Evaluation framework: llama.cpp with MCP protocol simulation*
docs/reports/comprehensive_mcp_evaluation.json
ADDED
@@ -0,0 +1,956 @@
{
  "evaluation_type": "comprehensive_mcp_test",
  "total_models_tested": 9,
  "successful_models": 5,
  "results": [
    {
      "name": "Qwen2.5-0.5B",
      "file_size_mb": 408.8689880371094,
      "avg_score": 2.6,
      "avg_time": 2.6360722541809083,
      "efficiency": 0.9863159084036122,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 0,
          "time": 0.31192469596862793,
          "response": ""
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 0,
          "time": 0.18253064155578613,
          "response": ""
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 7,
          "time": 4.232211351394653,
          "response": "Where tool_name is the name of the tool used and args is a dictionary containing the arguments for t..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 4.225749492645264,
          "response": "Where tool_name is the name of the tool and args is a dictionary with the arguments. If no arguments..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 4,
          "time": 4.22794508934021,
          "response": "where tool_name is the name of the tool and args is a dictionary containing the arguments. If no too..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Qwen3-0.6B",
      "file_size_mb": 609.8238830566406,
      "avg_score": 2.8,
      "avg_time": 8.223706769943238,
      "efficiency": 0.3404790659892809,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 7,
          "time": 8.638539791107178,
          "response": ".\n\nIf the user command is not supported by the available tools, respond with an empty array.\n\nNow, t..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 5,
          "time": 8.075484991073608,
          "response": ".\n\nMake sure to use the correct tool name and format the JSON correctly.\n\nIf the command is not poss..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 0,
          "time": 7.951770067214966,
          "response": ".\n\nMake sure to use the correct tool names and format the JSON correctly.\n\nNow, the game state is as..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 8.252855062484741,
          "response": ".\n\nMake sure to use the correct tool names and format the JSON properly.\n\nNow, the user is in a game..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 0,
          "time": 8.199883937835693,
          "response": ".\n\nMake sure to use the correct tool name and format the JSON.\n\nIf the command is not possible, retu..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Gemma-3-270M",
      "file_size_mb": 428.0401306152344,
      "avg_score": 0.0,
      "avg_time": 0.16690435409545898,
      "efficiency": 0.0,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 0,
          "time": 0.2941462993621826,
          "response": ""
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 0,
          "time": 0.13967180252075195,
          "response": ""
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 0,
          "time": 0.1264328956604004,
          "response": ""
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 0,
          "time": 0.14153170585632324,
          "response": ""
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 0,
          "time": 0.13273906707763672,
          "response": ""
        }
      ],
      "type": "general"
    },
    {
      "name": "Qwen3-1.7B",
      "file_size_mb": 1007.8267211914062,
      "avg_score": 3.0,
      "avg_time": 13.003729963302613,
      "efficiency": 0.23070303739513193,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 2,
          "time": 12.862720251083374,
          "response": ".\n\nYou must use the JSON format specified, without any additional text or explanation. The JSON must..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 2,
          "time": 12.972241401672363,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe game state is as follows:\n-..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 3,
          "time": 13.497555255889893,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe tool to use is attack_unit...."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 5,
          "time": 12.513315677642822,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe correct tool name is \"build..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 3,
          "time": 13.17281723022461,
          "response": ".\n\nYou can use the following tool definitions:\n- move_units: move units to a new position\n- attack_u..."
        }
      ],
      "type": "general"
    },
    {
      "name": "MCP-Instruct-v1",
      "file_size_mb": 697.0347290039062,
      "avg_score": 0.0,
      "avg_time": 0.1320805072784424,
      "efficiency": 0.0,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 0,
          "time": 0.6604025363922119,
          "response": ""
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 0,
          "time": 0,
          "error": "llama_decode returned -1"
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 0,
          "time": 0,
          "error": "llama_decode returned -1"
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 0,
          "time": 0,
          "error": "llama_decode returned -1"
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 0,
          "time": 0,
          "error": "llama_decode returned -1"
        }
      ],
      "type": "mcp_specialized"
    },
    {
      "name": "MCPR L-3B-Exa",
      "file_size_mb": 1215.7023620605469,
      "avg_score": 0.0,
      "avg_time": 22.14646472930908,
      "efficiency": 0.0,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 0,
          "time": 22.817347049713135,
          "response": "+\\),),โ), andโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโinitializ..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 0,
          "time": 21.51675510406494,
          "response": "+\\),),โ),3โโโโโโโ),),โโโโโโโโโโโโ),),),), andโ),),), and@@ the Styโก Styโก Styโก Styโก Styโก Styโก Styโก St..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 0,
          "time": 22.118958473205566,
          "response": "+\\),),+\\),), andโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 0,
          "time": 22.297714471817017,
          "response": "+\\),),โ), and@@ the Sty mini mini mini mini mini mini mini mini mini theโโโโโโ),),+\\),),), andโ),),)..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 0,
          "time": 21.98154854774475,
          "response": "and@@ Sty@@patterns@@ mini@@ Sty@@ Sty mini mini mini mini mini mini mini mini mini theโโโโโโโโโโโโโ..."
        }
      ],
      "type": "mcp_specialized"
    },
    {
      "name": "Gemma-3n-E2B-it",
      "file_size_mb": 1958.3001403808594,
      "avg_score": 0.0,
      "avg_time": 1.5714858055114747,
      "efficiency": 0.0,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 0,
          "time": 3.1773452758789062,
          "response": ""
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 0,
          "time": 1.1669323444366455,
          "response": ""
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 0,
          "time": 1.1747264862060547,
          "response": ""
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 0,
          "time": 1.2873260974884033,
          "response": ""
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 0,
          "time": 1.0510988235473633,
          "response": ""
        }
      ],
      "type": "mcp_specialized"
    },
    {
      "name": "Llama-Breeze2-3B",
      "file_size_mb": 1424.04345703125,
      "avg_score": 3.6,
      "avg_time": 14.693956804275512,
      "efficiency": 0.24499867856917243,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 3,
          "time": 3.5608396530151367,
          "response": "[get_game_state()]"
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 3,
          "time": 5.626140356063843,
          "response": "[move_units(unit_ids='infantry', target_x='150', target_y='200')]"
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 5,
          "time": 23.91610813140869,
          "response": "The tool used is \"get_game_state\" and the tool name is \"get_game_state\". The args for this tool is a..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 5,
          "time": 23.745216846466064,
          "response": "{\n \"tool\": \"build_building\",\n \"args\": {\n \"building_type\": \"power plant\",\n \"position_x\": 100,..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 2,
          "time": 16.621479034423828,
          "response": "The game is ready for action. You are the AI, and you are ready to assist. You have the ability to m..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Qwen2.5-Coder-0.5B",
      "file_size_mb": 408.8690185546875,
      "avg_score": 4.4,
      "avg_time": 4.1166441440582275,
      "efficiency": 1.0688317585941343,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 5,
          "time": 4.25421667098999,
          "response": ".\n\nHere's a possible response:\n\n{\"tool\": \"get_game_state\", \"args\": {\"game_state\": {\"units\": [{\"id\": ..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 5,
          "time": 4.333646059036255,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 5,
          "time": 4.139528274536133,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 3.9508562088012695,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a JSON object containing the arguments ..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 5,
          "time": 3.9049735069274902,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        }
      ],
      "type": "code_specialized"
    }
  ],
  "ranking_by_score": [
    {
      "name": "Qwen2.5-Coder-0.5B",
      "file_size_mb": 408.8690185546875,
      "avg_score": 4.4,
      "avg_time": 4.1166441440582275,
      "efficiency": 1.0688317585941343,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 5,
          "time": 4.25421667098999,
          "response": ".\n\nHere's a possible response:\n\n{\"tool\": \"get_game_state\", \"args\": {\"game_state\": {\"units\": [{\"id\": ..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 5,
          "time": 4.333646059036255,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 5,
          "time": 4.139528274536133,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 3.9508562088012695,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a JSON object containing the arguments ..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 5,
          "time": 3.9049735069274902,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        }
      ],
      "type": "code_specialized"
    },
    {
      "name": "Llama-Breeze2-3B",
      "file_size_mb": 1424.04345703125,
      "avg_score": 3.6,
      "avg_time": 14.693956804275512,
      "efficiency": 0.24499867856917243,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 3,
          "time": 3.5608396530151367,
          "response": "[get_game_state()]"
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 3,
          "time": 5.626140356063843,
          "response": "[move_units(unit_ids='infantry', target_x='150', target_y='200')]"
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 5,
          "time": 23.91610813140869,
          "response": "The tool used is \"get_game_state\" and the tool name is \"get_game_state\". The args for this tool is a..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 5,
          "time": 23.745216846466064,
          "response": "{\n \"tool\": \"build_building\",\n \"args\": {\n \"building_type\": \"power plant\",\n \"position_x\": 100,..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 2,
          "time": 16.621479034423828,
          "response": "The game is ready for action. You are the AI, and you are ready to assist. You have the ability to m..."
        }
      ],
      "type": "general"
    },
| 503 |
+
    {
      "name": "Qwen3-1.7B",
      "file_size_mb": 1007.8267211914062,
      "avg_score": 3.0,
      "avg_time": 13.003729963302613,
      "efficiency": 0.23070303739513193,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 2,
          "time": 12.862720251083374,
          "response": ".\n\nYou must use the JSON format specified, without any additional text or explanation. The JSON must..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 2,
          "time": 12.972241401672363,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe game state is as follows:\n-..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 3,
          "time": 13.497555255889893,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe tool to use is attack_unit...."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 5,
          "time": 12.513315677642822,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe correct tool name is \"build..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 3,
          "time": 13.17281723022461,
          "response": ".\n\nYou can use the following tool definitions:\n- move_units: move units to a new position\n- attack_u..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Qwen3-0.6B",
      "file_size_mb": 609.8238830566406,
      "avg_score": 2.8,
      "avg_time": 8.223706769943238,
      "efficiency": 0.3404790659892809,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 7,
          "time": 8.638539791107178,
          "response": ".\n\nIf the user command is not supported by the available tools, respond with an empty array.\n\nNow, t..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 5,
          "time": 8.075484991073608,
          "response": ".\n\nMake sure to use the correct tool name and format the JSON correctly.\n\nIf the command is not poss..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 0,
          "time": 7.951770067214966,
          "response": ".\n\nMake sure to use the correct tool names and format the JSON correctly.\n\nNow, the game state is as..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 8.252855062484741,
          "response": ".\n\nMake sure to use the correct tool names and format the JSON properly.\n\nNow, the user is in a game..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 0,
          "time": 8.199883937835693,
          "response": ".\n\nMake sure to use the correct tool name and format the JSON.\n\nIf the command is not possible, retu..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Qwen2.5-0.5B",
      "file_size_mb": 408.8689880371094,
      "avg_score": 2.6,
      "avg_time": 2.6360722541809083,
      "efficiency": 0.9863159084036122,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 0,
          "time": 0.31192469596862793,
          "response": ""
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 0,
          "time": 0.18253064155578613,
          "response": ""
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 7,
          "time": 4.232211351394653,
          "response": "Where tool_name is the name of the tool used and args is a dictionary containing the arguments for t..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 4.225749492645264,
          "response": "Where tool_name is the name of the tool and args is a dictionary with the arguments. If no arguments..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 4,
          "time": 4.22794508934021,
          "response": "where tool_name is the name of the tool and args is a dictionary containing the arguments. If no too..."
        }
      ],
      "type": "general"
    }
  ],
  "ranking_by_efficiency": [
    {
      "name": "Qwen2.5-Coder-0.5B",
      "file_size_mb": 408.8690185546875,
      "avg_score": 4.4,
      "avg_time": 4.1166441440582275,
      "efficiency": 1.0688317585941343,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 5,
          "time": 4.25421667098999,
          "response": ".\n\nHere's a possible response:\n\n{\"tool\": \"get_game_state\", \"args\": {\"game_state\": {\"units\": [{\"id\": ..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 5,
          "time": 4.333646059036255,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 5,
          "time": 4.139528274536133,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 3.9508562088012695,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a JSON object containing the arguments ..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 5,
          "time": 3.9049735069274902,
          "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
        }
      ],
      "type": "code_specialized"
    },
    {
      "name": "Qwen2.5-0.5B",
      "file_size_mb": 408.8689880371094,
      "avg_score": 2.6,
      "avg_time": 2.6360722541809083,
      "efficiency": 0.9863159084036122,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 0,
          "time": 0.31192469596862793,
          "response": ""
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 0,
          "time": 0.18253064155578613,
          "response": ""
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 7,
          "time": 4.232211351394653,
          "response": "Where tool_name is the name of the tool used and args is a dictionary containing the arguments for t..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 4.225749492645264,
          "response": "Where tool_name is the name of the tool and args is a dictionary with the arguments. If no arguments..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 4,
          "time": 4.22794508934021,
          "response": "where tool_name is the name of the tool and args is a dictionary containing the arguments. If no too..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Qwen3-0.6B",
      "file_size_mb": 609.8238830566406,
      "avg_score": 2.8,
      "avg_time": 8.223706769943238,
      "efficiency": 0.3404790659892809,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 7,
          "time": 8.638539791107178,
          "response": ".\n\nIf the user command is not supported by the available tools, respond with an empty array.\n\nNow, t..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 5,
          "time": 8.075484991073608,
          "response": ".\n\nMake sure to use the correct tool name and format the JSON correctly.\n\nIf the command is not poss..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 0,
          "time": 7.951770067214966,
          "response": ".\n\nMake sure to use the correct tool names and format the JSON correctly.\n\nNow, the game state is as..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 2,
          "time": 8.252855062484741,
          "response": ".\n\nMake sure to use the correct tool names and format the JSON properly.\n\nNow, the user is in a game..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 0,
          "time": 8.199883937835693,
          "response": ".\n\nMake sure to use the correct tool name and format the JSON.\n\nIf the command is not possible, retu..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Llama-Breeze2-3B",
      "file_size_mb": 1424.04345703125,
      "avg_score": 3.6,
      "avg_time": 14.693956804275512,
      "efficiency": 0.24499867856917243,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 3,
          "time": 3.5608396530151367,
          "response": "[get_game_state()]"
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 3,
          "time": 5.626140356063843,
          "response": "[move_units(unit_ids='infantry', target_x='150', target_y='200')]"
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 5,
          "time": 23.91610813140869,
          "response": "The tool used is \"get_game_state\" and the tool name is \"get_game_state\". The args for this tool is a..."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 5,
          "time": 23.745216846466064,
          "response": "{\n \"tool\": \"build_building\",\n \"args\": {\n \"building_type\": \"power plant\",\n \"position_x\": 100,..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 2,
          "time": 16.621479034423828,
          "response": "The game is ready for action. You are the AI, and you are ready to assist. You have the ability to m..."
        }
      ],
      "type": "general"
    },
    {
      "name": "Qwen3-1.7B",
      "file_size_mb": 1007.8267211914062,
      "avg_score": 3.0,
      "avg_time": 13.003729963302613,
      "efficiency": 0.23070303739513193,
      "results": [
        {
          "test": "Commande simple",
          "difficulty": "easy",
          "score": 2,
          "time": 12.862720251083374,
          "response": ".\n\nYou must use the JSON format specified, without any additional text or explanation. The JSON must..."
        },
        {
          "test": "Action avec coordonnées",
          "difficulty": "easy",
          "score": 2,
          "time": 12.972241401672363,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe game state is as follows:\n-..."
        },
        {
          "test": "Attaque spécifique",
          "difficulty": "medium",
          "score": 3,
          "time": 13.497555255889893,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe tool to use is attack_unit...."
        },
        {
          "test": "Construction",
          "difficulty": "medium",
          "score": 5,
          "time": 12.513315677642822,
          "response": ".\n\nYou must use the correct tool name and format the JSON properly.\n\nThe correct tool name is \"build..."
        },
        {
          "test": "Commande complexe",
          "difficulty": "hard",
          "score": 3,
          "time": 13.17281723022461,
          "response": ".\n\nYou can use the following tool definitions:\n- move_units: move units to a new position\n- attack_u..."
        }
      ],
      "type": "general"
    }
  ],
  "best_overall": {
    "name": "Qwen2.5-Coder-0.5B",
    "file_size_mb": 408.8690185546875,
    "avg_score": 4.4,
    "avg_time": 4.1166441440582275,
    "efficiency": 1.0688317585941343,
    "results": [
      {
        "test": "Commande simple",
        "difficulty": "easy",
        "score": 5,
        "time": 4.25421667098999,
        "response": ".\n\nHere's a possible response:\n\n{\"tool\": \"get_game_state\", \"args\": {\"game_state\": {\"units\": [{\"id\": ..."
      },
      {
        "test": "Action avec coordonnées",
        "difficulty": "easy",
        "score": 5,
        "time": 4.333646059036255,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
      },
      {
        "test": "Attaque spécifique",
        "difficulty": "medium",
        "score": 5,
        "time": 4.139528274536133,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
      },
      {
        "test": "Construction",
        "difficulty": "medium",
        "score": 2,
        "time": 3.9508562088012695,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a JSON object containing the arguments ..."
      },
      {
        "test": "Commande complexe",
        "difficulty": "hard",
        "score": 5,
        "time": 3.9049735069274902,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
      }
    ],
    "type": "code_specialized"
  },
  "most_efficient": {
    "name": "Qwen2.5-Coder-0.5B",
    "file_size_mb": 408.8690185546875,
    "avg_score": 4.4,
    "avg_time": 4.1166441440582275,
    "efficiency": 1.0688317585941343,
    "results": [
      {
        "test": "Commande simple",
        "difficulty": "easy",
        "score": 5,
        "time": 4.25421667098999,
        "response": ".\n\nHere's a possible response:\n\n{\"tool\": \"get_game_state\", \"args\": {\"game_state\": {\"units\": [{\"id\": ..."
      },
      {
        "test": "Action avec coordonnées",
        "difficulty": "easy",
        "score": 5,
        "time": 4.333646059036255,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
      },
      {
        "test": "Attaque spécifique",
        "difficulty": "medium",
        "score": 5,
        "time": 4.139528274536133,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
      },
      {
        "test": "Construction",
        "difficulty": "medium",
        "score": 2,
        "time": 3.9508562088012695,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a JSON object containing the arguments ..."
      },
      {
        "test": "Commande complexe",
        "difficulty": "hard",
        "score": 5,
        "time": 3.9049735069274902,
        "response": "where \"tool_name\" is the name of the tool used and \"args\" is a dictionary containing the arguments p..."
      }
    ],
    "type": "code_specialized"
  }
}
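The per-model `avg_score`, `avg_time` and `efficiency` fields in the report above appear consistent with a plain mean over the five test entries and a score-per-second ratio. The sketch below reproduces the Qwen2.5-Coder-0.5B figures under that assumption; the function name `summarize` and the zero-time guard are illustrative choices, not taken from the evaluation scripts.

```python
from statistics import mean


def summarize(results):
    """Aggregate per-test entries into avg_score, avg_time and efficiency.

    Assumption: efficiency = avg_score / avg_time (score points per second),
    which matches the numbers stored in the report for this model.
    """
    avg_score = mean(r["score"] for r in results)
    avg_time = mean(r["time"] for r in results)
    # Guard against a zero average time (hypothetical handling of failed runs).
    efficiency = avg_score / avg_time if avg_time > 0 else 0.0
    return {"avg_score": avg_score, "avg_time": avg_time, "efficiency": efficiency}


# Scores and timings copied from the Qwen2.5-Coder-0.5B entries above.
coder_results = [
    {"score": 5, "time": 4.25421667098999},
    {"score": 5, "time": 4.333646059036255},
    {"score": 5, "time": 4.139528274536133},
    {"score": 2, "time": 3.9508562088012695},
    {"score": 5, "time": 3.9049735069274902},
]

summary = summarize(coder_results)
print(summary)
```

Because the ratio divides by wall-clock time, a model that returns empty responses quickly (as Qwen2.5-0.5B does on two tests) can still rank high on efficiency, so the ranking is best read alongside `avg_score`.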
docs/reports/final_14_model_evaluation.json
ADDED
@@ -0,0 +1,256 @@
{
  "evaluation_info": {
    "title": "Comprehensive MCP Capability Evaluation - 14 Models",
    "date": "2025-10-05",
    "total_models": 14,
    "framework": "llama-cpp-python 0.3.16",
    "test_scenarios": 3,
    "max_score_per_test": 10
  },
  "test_scenarios": [
    {
      "id": 1,
      "description": "Get the current game state",
      "expected_tool": "get_game_state",
      "scoring": {
        "correct_tool": 4,
        "valid_json": 3,
        "proper_terminology": 2,
        "coordinates": 1
      }
    },
    {
      "id": 2,
      "description": "Move infantry unit to coordinates 150,200",
      "expected_tool": "move_units",
      "expected_coordinates": [150, 200],
      "scoring": {
        "correct_tool": 4,
        "valid_json": 3,
        "proper_terminology": 2,
        "coordinates": 1
      }
    },
    {
      "id": 3,
      "description": "Attack enemy tank at location 300,150",
      "expected_tool": "attack_unit",
      "expected_coordinates": [300, 150],
      "scoring": {
        "correct_tool": 4,
        "valid_json": 3,
        "proper_terminology": 2,
        "coordinates": 1
      }
    }
  ],
  "results": [
    {
      "rank": 1,
      "model_name": "Qwen2.5-Coder-1.5B-Q4",
      "filename": "qwen2.5-coder-1.5b-instruct-q4_0.gguf",
      "avg_score": 9.7,
      "avg_time": 4.12,
      "size_mb": 1017,
      "efficiency": 2.34,
      "status": "champion",
      "notes": "Exceptional MCP performance. Wraps JSON in markdown code blocks. 10/10 on tests 2 and 3, 9/10 on test 1.",
      "detailed_scores": [9, 10, 10]
    },
    {
      "rank": 2,
      "model_name": "Qwen2.5-Coder-0.5B",
      "filename": "qwen2.5-coder-0.5b-instruct-q4_0.gguf",
      "avg_score": 4.3,
      "avg_time": 2.08,
      "size_mb": 409,
      "efficiency": 2.08,
      "status": "previous_champion",
      "notes": "Best budget option. Good balance of size and performance."
    },
    {
      "rank": 3,
      "model_name": "Qwen3-0.6B",
      "filename": "Qwen3-0.6B-Q8_0.gguf",
      "avg_score": 3.7,
      "avg_time": 3.98,
      "size_mb": 610,
      "efficiency": 0.92,
      "status": "functional"
    },
    {
      "rank": 4,
      "model_name": "Gemma-3-270M",
      "filename": "gemma-3-270m-it-qat-Q8_0.gguf",
      "avg_score": 3.7,
      "avg_time": 2.29,
      "size_mb": 428,
      "efficiency": 1.60,
      "status": "functional",
      "notes": "Ultra-lightweight champion. Excellent efficiency for its tiny size."
    },
    {
      "rank": 5,
      "model_name": "MCPR-L-3B-Exa-Q8",
      "filename": "mcprl-3b-exa.Q8_0.gguf",
      "avg_score": 3.7,
      "avg_time": 17.42,
      "size_mb": 3133,
      "efficiency": 0.21,
      "status": "functional",
      "notes": "MCP-specialized but slow. Large size, poor efficiency."
    },
    {
      "rank": 6,
      "model_name": "Gemma-3n-E2B-it-Q8",
      "filename": "google_gemma-3n-E2B-it-Q8_0.gguf",
      "avg_score": 3.7,
      "avg_time": 14.80,
      "size_mb": 4566,
      "efficiency": 0.25,
      "status": "functional",
      "notes": "Largest model tested. Poor efficiency despite high quantization."
    },
    {
      "rank": 7,
      "model_name": "Qwen3-1.7B",
      "filename": "Qwen3-1.7B-Q4_0.gguf",
      "avg_score": 3.7,
      "avg_time": 6.24,
      "size_mb": 1008,
      "efficiency": 0.59,
      "status": "functional"
    },
    {
      "rank": 8,
      "model_name": "Qwen2.5-0.5B",
      "filename": "qwen2.5-0.5b-instruct-q4_0.gguf",
      "avg_score": 2.7,
      "avg_time": 1.17,
      "size_mb": 409,
      "efficiency": 2.28,
      "status": "functional",
      "notes": "Fast but limited MCP capability. General-purpose model."
    },
    {
      "rank": 9,
      "model_name": "Gemma-3n-E2B-it-IQ2",
      "filename": "gemma-3n-E2B-it-UD-IQ2_XXS.gguf",
      "avg_score": 2.3,
      "avg_time": 14.11,
      "size_mb": 1958,
      "efficiency": 0.17,
      "status": "functional",
      "notes": "Heavy quantization impacts quality."
    },
    {
      "rank": 10,
      "model_name": "Llama-Breeze2-3B-Q2",
      "filename": "Llama-Breeze2-3B-Instruct-Text.Q2_K.gguf",
      "avg_score": 1.3,
      "avg_time": 11.39,
      "size_mb": 1424,
      "efficiency": 0.12,
      "status": "functional",
      "notes": "Poor performance. Q2 quantization too aggressive."
    },
    {
      "rank": 11,
      "model_name": "Yi-Coder-1.5B-Q4",
      "filename": "Yi-Coder-1.5B.Q4_0.gguf",
      "avg_score": 0.0,
      "avg_time": 11.64,
      "size_mb": 826,
      "efficiency": 0.0,
      "status": "failed",
      "notes": "Prompt format incompatibility. Returns system prompt instead of generating responses."
    },
    {
      "rank": 12,
      "model_name": "MCP-Instruct-v1-Q4",
      "filename": "mcp-instruct-v1.Q4_K_M.gguf",
      "avg_score": 0.0,
      "avg_time": 0.0,
      "size_mb": 697,
      "efficiency": 0.0,
      "status": "failed",
      "notes": "Technical error: llama_decode returned -1"
    },
    {
      "rank": 13,
      "model_name": "MCPR-L-3B-Exa-Q2",
      "filename": "mcprl-3b-exa.Q2_K.gguf",
      "avg_score": 0.0,
      "avg_time": 10.63,
      "size_mb": 1216,
      "efficiency": 0.0,
      "status": "failed",
      "notes": "Produces gibberish output. Q2 quantization too aggressive for this architecture."
    },
    {
      "rank": 14,
      "model_name": "MCP-Instruct-v1-Q8",
      "filename": "mcp-instruct-v1.Q8_0.gguf",
      "avg_score": 0.0,
      "avg_time": 0.0,
      "size_mb": 1465,
      "efficiency": 0.0,
      "status": "failed",
      "notes": "Technical error: llama_decode returned -1. Same issue as Q4 version."
    }
  ],
  "key_insights": {
    "champion": {
      "model": "Qwen2.5-Coder-1.5B-Q4",
      "score": 9.7,
      "reason": "Code-specialized models excel at structured JSON generation. Near-perfect MCP capability."
    },
    "scaling_effect": {
      "observation": "Increasing parameters from 0.5B to 1.5B more than doubled MCP score (4.3 → 9.7)",
      "conclusion": "Parameter scaling works exceptionally well for code-specialized models"
    },
    "mcp_specialized_disappointment": {
      "observation": "MCP-Instruct models completely failed. MCPR-L models scored only 3.7/10 at best.",
      "conclusion": "MCP specialization alone is insufficient. Code training provides better foundation."
    },
    "quantization_impact": {
      "observation": "Q2 quantization caused failures or poor performance. Q4 and Q8 worked well.",
      "conclusion": "Avoid Q2 quantization for MCP tasks. Q4 offers best size/quality tradeoff."
    },
    "size_efficiency": {
      "observation": "Gemma-3-270M (428MB) matched 3133MB model performance",
      "conclusion": "Larger models don't guarantee better MCP performance"
    }
  },
  "recommendations": {
    "primary": {
      "model": "Qwen2.5-Coder-1.5B-Q4",
      "use_case": "Production deployments requiring high-quality MCP",
      "requirement": "JSON extraction logic to handle markdown code blocks"
    },
    "budget": {
      "model": "Qwen2.5-Coder-0.5B",
      "use_case": "Resource-constrained environments",
      "advantage": "2x smaller, 2x faster, still 4.3/10 performance"
    },
    "ultra_lightweight": {
      "model": "Gemma-3-270M",
      "use_case": "Edge devices, embedded systems",
      "advantage": "Only 428MB, decent 3.7/10 performance"
    },
    "avoid": [
      {
        "model": "MCP-Instruct-v1 (all versions)",
        "reason": "Technical incompatibility with llama.cpp"
      },
      {
        "model": "Yi-Coder-1.5B",
        "reason": "Prompt format incompatibility"
      },
      {
        "model": "Any Q2 quantization",
        "reason": "Too aggressive, causes failures or gibberish"
      }
    ]
  }
}
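The primary recommendation above requires "JSON extraction logic to handle markdown code blocks", and several recorded responses also put prose before the JSON. A minimal sketch of such extraction follows, assuming the `{"tool": ..., "args": ...}` call shape seen in the recorded responses; the helper name `extract_tool_call` is hypothetical and is not the logic used by the evaluation scripts.

```python
import json
import re

# Build the markdown fence marker programmatically so that this example
# contains no literal fence inside its own code block.
FENCE = "`" * 3
FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def extract_tool_call(response):
    """Return the first JSON object found in a model response, or None.

    Fenced blocks are tried first, then the raw text, so both
    markdown-wrapped JSON and JSON preceded by prose are handled.
    """
    candidates = FENCE_RE.findall(response)
    candidates.append(response)  # fall back to scanning the whole response
    decoder = json.JSONDecoder()
    for text in candidates:
        for match in re.finditer(r"\{", text):
            try:
                # raw_decode parses one JSON value starting at the given index
                obj, _end = decoder.raw_decode(text, match.start())
            except json.JSONDecodeError:
                continue
            if isinstance(obj, dict):
                return obj
    return None


wrapped = (
    "Here is the call:\n" + FENCE + "json\n"
    '{"tool": "move_units", "args": {"target_x": 150, "target_y": 200}}\n' + FENCE
)
print(extract_tool_call(wrapped))
```

Scanning from each opening brace keeps the helper tolerant of leading text such as "Here's a possible response:", but it returns only the first object it can parse; a stricter caller would presumably also check that `tool` names a known tool before dispatching.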
docs/reports/final_model_comparison.json
ADDED
@@ -0,0 +1,478 @@
{
|
| 2 |
+
"all_results": [
|
| 3 |
+
{
|
| 4 |
+
"name": "Qwen2.5-0.5B",
|
| 5 |
+
"avg_score": 4.0,
|
| 6 |
+
"avg_time": 3.52385942141215,
|
| 7 |
+
"efficiency": 1.1351190617011169,
|
| 8 |
+
"tests": [
|
| 9 |
+
{
|
| 10 |
+
"test": "Commande simple",
|
| 11 |
+
"score": 7,
|
| 12 |
+
"time": 3.418940305709839,
|
| 13 |
+
"response": ".\n\nOutils: get_game_state(), move_units(unit_ids, ..."
|
| 14 |
+
},
|
| 15 |
+
{
|
| 16 |
+
"test": "Action avec paramรจtres",
|
| 17 |
+
"score": 5,
|
| 18 |
+
"time": 3.8486745357513428,
|
| 19 |
+
"response": "La commande \"move_units\" est utilisรฉe pour dรฉplace..."
|
| 20 |
+
},
|
| 21 |
+
{
|
| 22 |
+
"test": "Vitesse de rรฉponse",
|
| 23 |
+
"score": 0,
|
| 24 |
+
"time": 3.3039634227752686,
|
| 25 |
+
"response": ", je vais faire une tรขche de base. Je vais essayer..."
|
| 26 |
+
}
|
| 27 |
+
]
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"name": "Qwen3-0.6B",
|
| 31 |
+
"avg_score": 6.0,
|
| 32 |
+
"avg_time": 6.404076337814331,
|
| 33 |
+
"efficiency": 0.936903260283084,
|
| 34 |
+
"tests": [
|
| 35 |
+
{
|
| 36 |
+
"test": "Commande simple",
|
| 37 |
+
"score": 7,
|
| 38 |
+
"time": 6.516923427581787,
|
| 39 |
+
"response": "Exemple: {\"tool\": \"get_game_state\", \"args\": {\"unit..."
|
| 40 |
+
},
|
| 41 |
+
{
|
| 42 |
+
"test": "Action avec paramรจtres",
|
| 43 |
+
"score": 7,
|
| 44 |
+
"time": 6.65591287612915,
|
| 45 |
+
"response": "Rรฉponse: {\"tool\": \"move_units\", \"args\": {\"unit_ids..."
|
| 46 |
+
},
|
| 47 |
+
{
|
| 48 |
+
"test": "Vitesse de rรฉponse",
|
| 49 |
+
"score": 4,
|
| 50 |
+
"time": 6.039392709732056,
|
| 51 |
+
"response": ", but not too much. The user is asking for a respo..."
|
| 52 |
+
}
|
| 53 |
+
]
|
| 54 |
+
},
|
| 55 |
+
{
|
| 56 |
+
"name": "Gemma-3-1B",
|
| 57 |
+
"avg_score": 4.0,
|
| 58 |
+
"avg_time": 6.960511525472005,
|
| 59 |
+
"efficiency": 0.5746704082540475,
|
| 60 |
+
"tests": [
|
| 61 |
+
{
|
| 62 |
+
"test": "Commande simple",
|
| 63 |
+
"score": 7,
|
| 64 |
+
"time": 7.20223069190979,
|
| 65 |
+
"response": "```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\"..."
|
| 66 |
+
},
|
| 67 |
+
{
|
| 68 |
+
"test": "Action avec paramรจtres",
|
| 69 |
+
"score": 5,
|
| 70 |
+
"time": 6.998988628387451,
|
| 71 |
+
"response": "```python\nimport json\n\ndef move_units(unit_ids, ta..."
|
| 72 |
+
},
|
| 73 |
+
{
|
| 74 |
+
"test": "Vitesse de rรฉponse",
|
| 75 |
+
"score": 0,
|
| 76 |
+
"time": 6.680315256118774,
|
| 77 |
+
"response": ".\n\nA. 100\nB. 200\nC. 300\nD. 400\nE. 500\nF. 600\nG. 70..."
|
| 78 |
+
}
|
| 79 |
+
]
|
| 80 |
+
},
|
| 81 |
+
{
|
| 82 |
+
"name": "Gemma-3-270M",
|
| 83 |
+
"avg_score": 4.666666666666667,
|
| 84 |
+
"avg_time": 3.6990818977355957,
|
| 85 |
+
"efficiency": 1.2615743029434903,
|
| 86 |
+
"tests": [
|
| 87 |
+
{
|
| 88 |
+
"test": "Commande simple",
|
| 89 |
+
"score": 5,
|
| 90 |
+
"time": 3.697866201400757,
|
| 91 |
+
"response": "```\n**Explication:**\n\n* `get_game_state()` : Rรฉcup..."
|
| 92 |
+
},
|
| 93 |
+
{
|
| 94 |
+
"test": "Action avec paramรจtres",
|
| 95 |
+
"score": 5,
|
| 96 |
+
"time": 3.690243721008301,
|
| 97 |
+
"response": "```\n\n**Explication:**\n\n* `move_units` est un outil..."
|
| 98 |
+
},
|
| 99 |
+
{
|
| 100 |
+
"test": "Vitesse de rรฉponse",
|
| 101 |
+
"score": 4,
|
| 102 |
+
"time": 3.7091357707977295,
|
| 103 |
+
"response": ".\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\n..."
|
| 104 |
+
}
|
| 105 |
+
]
|
| 106 |
+
}
|
| 107 |
+
],
|
| 108 |
+
"successful_models": [
|
| 109 |
+
{
|
| 110 |
+
"name": "Qwen2.5-0.5B",
|
| 111 |
+
"avg_score": 4.0,
|
| 112 |
+
"avg_time": 3.52385942141215,
|
| 113 |
+
"efficiency": 1.1351190617011169,
|
| 114 |
+
"tests": [
|
| 115 |
+
{
|
| 116 |
+
"test": "Commande simple",
|
| 117 |
+
"score": 7,
|
| 118 |
+
"time": 3.418940305709839,
|
| 119 |
+
"response": ".\n\nOutils: get_game_state(), move_units(unit_ids, ..."
|
| 120 |
+
},
|
| 121 |
+
{
|
| 122 |
+
"test": "Action avec paramรจtres",
|
| 123 |
+
"score": 5,
|
| 124 |
+
"time": 3.8486745357513428,
|
| 125 |
+
"response": "La commande \"move_units\" est utilisรฉe pour dรฉplace..."
|
| 126 |
+
},
|
| 127 |
+
{
|
| 128 |
+
"test": "Vitesse de rรฉponse",
|
| 129 |
+
"score": 0,
|
| 130 |
+
"time": 3.3039634227752686,
|
| 131 |
+
"response": ", je vais faire une tรขche de base. Je vais essayer..."
|
| 132 |
+
}
|
| 133 |
+
]
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"name": "Qwen3-0.6B",
|
| 137 |
+
"avg_score": 6.0,
|
| 138 |
+
"avg_time": 6.404076337814331,
|
| 139 |
+
"efficiency": 0.936903260283084,
|
| 140 |
+
"tests": [
|
| 141 |
+
{
|
| 142 |
+
"test": "Commande simple",
|
| 143 |
+
"score": 7,
|
| 144 |
+
"time": 6.516923427581787,
|
| 145 |
+
"response": "Exemple: {\"tool\": \"get_game_state\", \"args\": {\"unit..."
|
| 146 |
+
},
|
| 147 |
+
{
|
| 148 |
+
"test": "Action avec paramรจtres",
|
| 149 |
+
"score": 7,
|
| 150 |
+
"time": 6.65591287612915,
|
| 151 |
+
"response": "Rรฉponse: {\"tool\": \"move_units\", \"args\": {\"unit_ids..."
|
| 152 |
+
},
|
| 153 |
+
{
|
| 154 |
+
"test": "Vitesse de rรฉponse",
|
| 155 |
+
"score": 4,
|
| 156 |
+
"time": 6.039392709732056,
|
| 157 |
+
"response": ", but not too much. The user is asking for a respo..."
|
| 158 |
+
}
|
| 159 |
+
]
|
| 160 |
+
},
|
| 161 |
+
{
|
| 162 |
+
"name": "Gemma-3-1B",
|
| 163 |
+
"avg_score": 4.0,
|
| 164 |
+
"avg_time": 6.960511525472005,
|
| 165 |
+
"efficiency": 0.5746704082540475,
|
| 166 |
+
"tests": [
|
| 167 |
+
{
|
| 168 |
+
"test": "Commande simple",
|
| 169 |
+
"score": 7,
|
| 170 |
+
"time": 7.20223069190979,
|
| 171 |
+
"response": "```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\"..."
|
| 172 |
+
},
|
| 173 |
+
{
|
| 174 |
+
"test": "Action avec paramรจtres",
|
| 175 |
+
"score": 5,
|
| 176 |
+
"time": 6.998988628387451,
|
| 177 |
+
"response": "```python\nimport json\n\ndef move_units(unit_ids, ta..."
|
| 178 |
+
},
|
| 179 |
+
{
|
| 180 |
+
"test": "Vitesse de rรฉponse",
|
| 181 |
+
"score": 0,
|
| 182 |
+
"time": 6.680315256118774,
|
| 183 |
+
"response": ".\n\nA. 100\nB. 200\nC. 300\nD. 400\nE. 500\nF. 600\nG. 70..."
|
| 184 |
+
}
|
| 185 |
+
]
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"name": "Gemma-3-270M",
|
| 189 |
+
"avg_score": 4.666666666666667,
|
| 190 |
+
"avg_time": 3.6990818977355957,
|
| 191 |
+
"efficiency": 1.2615743029434903,
|
| 192 |
+
"tests": [
|
| 193 |
+
{
|
| 194 |
+
"test": "Commande simple",
|
| 195 |
+
"score": 5,
|
| 196 |
+
"time": 3.697866201400757,
|
| 197 |
+
"response": "```\n**Explication:**\n\n* `get_game_state()` : Rรฉcup..."
|
| 198 |
+
},
|
| 199 |
+
{
|
| 200 |
+
"test": "Action avec paramรจtres",
|
| 201 |
+
"score": 5,
|
| 202 |
+
"time": 3.690243721008301,
|
| 203 |
+
"response": "```\n\n**Explication:**\n\n* `move_units` est un outil..."
|
| 204 |
+
},
|
| 205 |
+
{
|
| 206 |
+
"test": "Vitesse de rรฉponse",
|
| 207 |
+
"score": 4,
|
| 208 |
+
"time": 3.7091357707977295,
|
| 209 |
+
"response": ".\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\n..."
|
| 210 |
+
}
|
| 211 |
+
]
|
| 212 |
+
}
|
| 213 |
+
],
|
| 214 |
+
"ranking_by_score": [
|
| 215 |
+
{
|
| 216 |
+
"name": "Qwen3-0.6B",
|
| 217 |
+
"avg_score": 6.0,
|
| 218 |
+
"avg_time": 6.404076337814331,
|
| 219 |
+
"efficiency": 0.936903260283084,
|
| 220 |
+
"tests": [
|
| 221 |
+
{
|
| 222 |
+
"test": "Commande simple",
|
| 223 |
+
"score": 7,
|
| 224 |
+
"time": 6.516923427581787,
|
| 225 |
+
"response": "Exemple: {\"tool\": \"get_game_state\", \"args\": {\"unit..."
|
| 226 |
+
},
|
| 227 |
+
{
|
| 228 |
+
"test": "Action avec paramรจtres",
|
| 229 |
+
"score": 7,
|
| 230 |
+
"time": 6.65591287612915,
|
| 231 |
+
"response": "Rรฉponse: {\"tool\": \"move_units\", \"args\": {\"unit_ids..."
|
| 232 |
+
},
|
| 233 |
+
{
|
| 234 |
+
"test": "Vitesse de rรฉponse",
|
| 235 |
+
"score": 4,
|
| 236 |
+
"time": 6.039392709732056,
|
| 237 |
+
"response": ", but not too much. The user is asking for a respo..."
|
| 238 |
+
}
|
| 239 |
+
]
|
| 240 |
+
},
|
| 241 |
+
{
|
| 242 |
+
"name": "Gemma-3-270M",
|
| 243 |
+
"avg_score": 4.666666666666667,
|
| 244 |
+
"avg_time": 3.6990818977355957,
|
| 245 |
+
"efficiency": 1.2615743029434903,
|
| 246 |
+
"tests": [
|
| 247 |
+
{
|
| 248 |
+
"test": "Commande simple",
|
| 249 |
+
"score": 5,
|
| 250 |
+
"time": 3.697866201400757,
|
| 251 |
+
"response": "```\n**Explication:**\n\n* `get_game_state()` : Rรฉcup..."
|
| 252 |
+
},
|
| 253 |
+
{
|
| 254 |
+
"test": "Action avec paramรจtres",
|
| 255 |
+
"score": 5,
|
| 256 |
+
"time": 3.690243721008301,
|
| 257 |
+
"response": "```\n\n**Explication:**\n\n* `move_units` est un outil..."
|
| 258 |
+
},
|
| 259 |
+
{
|
| 260 |
+
"test": "Vitesse de rรฉponse",
|
| 261 |
+
"score": 4,
|
| 262 |
+
"time": 3.7091357707977295,
|
| 263 |
+
"response": ".\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\n..."
|
| 264 |
+
}
|
| 265 |
+
]
|
| 266 |
+
},
|
| 267 |
+
{
|
| 268 |
+
"name": "Qwen2.5-0.5B",
|
| 269 |
+
"avg_score": 4.0,
|
| 270 |
+
"avg_time": 3.52385942141215,
|
| 271 |
+
"efficiency": 1.1351190617011169,
|
| 272 |
+
"tests": [
|
| 273 |
+
{
|
| 274 |
+
"test": "Commande simple",
|
| 275 |
+
"score": 7,
|
| 276 |
+
"time": 3.418940305709839,
|
| 277 |
+
"response": ".\n\nOutils: get_game_state(), move_units(unit_ids, ..."
|
| 278 |
+
},
|
| 279 |
+
{
|
| 280 |
+
"test": "Action avec paramรจtres",
|
| 281 |
+
"score": 5,
|
| 282 |
+
"time": 3.8486745357513428,
|
| 283 |
+
"response": "La commande \"move_units\" est utilisรฉe pour dรฉplace..."
|
| 284 |
+
},
|
| 285 |
+
{
|
| 286 |
+
"test": "Vitesse de rรฉponse",
|
| 287 |
+
"score": 0,
|
| 288 |
+
"time": 3.3039634227752686,
|
| 289 |
+
"response": ", je vais faire une tรขche de base. Je vais essayer..."
|
| 290 |
+
}
|
| 291 |
+
]
|
| 292 |
+
},
|
| 293 |
+
{
|
| 294 |
+
"name": "Gemma-3-1B",
|
| 295 |
+
"avg_score": 4.0,
|
| 296 |
+
"avg_time": 6.960511525472005,
|
| 297 |
+
"efficiency": 0.5746704082540475,
|
| 298 |
+
"tests": [
|
| 299 |
+
{
|
| 300 |
+
"test": "Commande simple",
|
| 301 |
+
"score": 7,
|
| 302 |
+
"time": 7.20223069190979,
|
| 303 |
+
"response": "```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\"..."
|
| 304 |
+
},
|
| 305 |
+
{
|
| 306 |
+
"test": "Action avec paramรจtres",
|
| 307 |
+
"score": 5,
|
| 308 |
+
"time": 6.998988628387451,
|
| 309 |
+
"response": "```python\nimport json\n\ndef move_units(unit_ids, ta..."
|
| 310 |
+
},
|
| 311 |
+
{
|
| 312 |
+
"test": "Vitesse de rรฉponse",
|
| 313 |
+
"score": 0,
|
| 314 |
+
"time": 6.680315256118774,
|
| 315 |
+
"response": ".\n\nA. 100\nB. 200\nC. 300\nD. 400\nE. 500\nF. 600\nG. 70..."
|
| 316 |
+
}
|
| 317 |
+
]
|
| 318 |
+
}
|
| 319 |
+
],
|
| 320 |
+
"ranking_by_efficiency": [
|
| 321 |
+
{
|
| 322 |
+
"name": "Gemma-3-270M",
|
| 323 |
+
"avg_score": 4.666666666666667,
|
| 324 |
+
"avg_time": 3.6990818977355957,
|
| 325 |
+
"efficiency": 1.2615743029434903,
|
| 326 |
+
"tests": [
|
| 327 |
+
{
|
| 328 |
+
"test": "Commande simple",
|
| 329 |
+
"score": 5,
|
| 330 |
+
"time": 3.697866201400757,
|
| 331 |
+
"response": "```\n**Explication:**\n\n* `get_game_state()` : Rรฉcup..."
|
| 332 |
+
},
|
| 333 |
+
{
|
| 334 |
+
"test": "Action avec paramรจtres",
|
| 335 |
+
"score": 5,
|
| 336 |
+
"time": 3.690243721008301,
|
| 337 |
+
"response": "```\n\n**Explication:**\n\n* `move_units` est un outil..."
|
| 338 |
+
},
|
| 339 |
+
{
|
| 340 |
+
"test": "Vitesse de rรฉponse",
|
| 341 |
+
"score": 4,
|
| 342 |
+
"time": 3.7091357707977295,
|
| 343 |
+
"response": ".\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\n..."
|
| 344 |
+
}
|
| 345 |
+
]
|
| 346 |
+
},
|
| 347 |
+
{
|
| 348 |
+
"name": "Qwen2.5-0.5B",
|
| 349 |
+
"avg_score": 4.0,
|
| 350 |
+
"avg_time": 3.52385942141215,
|
| 351 |
+
"efficiency": 1.1351190617011169,
|
| 352 |
+
"tests": [
|
| 353 |
+
{
|
| 354 |
+
"test": "Commande simple",
|
| 355 |
+
"score": 7,
|
| 356 |
+
"time": 3.418940305709839,
|
| 357 |
+
"response": ".\n\nOutils: get_game_state(), move_units(unit_ids, ..."
|
| 358 |
+
},
|
| 359 |
+
{
|
| 360 |
+
"test": "Action avec paramรจtres",
|
| 361 |
+
"score": 5,
|
| 362 |
+
"time": 3.8486745357513428,
|
| 363 |
+
"response": "La commande \"move_units\" est utilisรฉe pour dรฉplace..."
|
| 364 |
+
},
|
| 365 |
+
{
|
| 366 |
+
"test": "Vitesse de rรฉponse",
|
| 367 |
+
"score": 0,
|
| 368 |
+
"time": 3.3039634227752686,
|
| 369 |
+
"response": ", je vais faire une tรขche de base. Je vais essayer..."
|
| 370 |
+
}
|
| 371 |
+
]
|
| 372 |
+
},
|
| 373 |
+
{
|
| 374 |
+
"name": "Qwen3-0.6B",
|
| 375 |
+
"avg_score": 6.0,
|
| 376 |
+
"avg_time": 6.404076337814331,
|
| 377 |
+
"efficiency": 0.936903260283084,
|
| 378 |
+
"tests": [
|
| 379 |
+
{
|
| 380 |
+
"test": "Commande simple",
|
| 381 |
+
"score": 7,
|
| 382 |
+
"time": 6.516923427581787,
|
| 383 |
+
"response": "Exemple: {\"tool\": \"get_game_state\", \"args\": {\"unit..."
|
| 384 |
+
},
|
| 385 |
+
{
|
| 386 |
+
"test": "Action avec paramรจtres",
|
| 387 |
+
"score": 7,
|
| 388 |
+
"time": 6.65591287612915,
|
| 389 |
+
"response": "Rรฉponse: {\"tool\": \"move_units\", \"args\": {\"unit_ids..."
|
| 390 |
+
},
|
| 391 |
+
{
|
| 392 |
+
"test": "Vitesse de rรฉponse",
|
| 393 |
+
"score": 4,
|
| 394 |
+
"time": 6.039392709732056,
|
| 395 |
+
"response": ", but not too much. The user is asking for a respo..."
|
| 396 |
+
}
|
| 397 |
+
]
|
| 398 |
+
},
|
| 399 |
+
{
|
| 400 |
+
"name": "Gemma-3-1B",
|
| 401 |
+
"avg_score": 4.0,
|
| 402 |
+
"avg_time": 6.960511525472005,
|
| 403 |
+
"efficiency": 0.5746704082540475,
|
| 404 |
+
"tests": [
|
| 405 |
+
{
|
| 406 |
+
"test": "Commande simple",
|
| 407 |
+
"score": 7,
|
| 408 |
+
"time": 7.20223069190979,
|
| 409 |
+
"response": "```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\"..."
|
| 410 |
+
},
|
| 411 |
+
{
|
| 412 |
+
"test": "Action avec paramรจtres",
|
| 413 |
+
"score": 5,
|
| 414 |
+
"time": 6.998988628387451,
|
| 415 |
+
"response": "```python\nimport json\n\ndef move_units(unit_ids, ta..."
|
| 416 |
+
},
|
| 417 |
+
{
|
| 418 |
+
"test": "Vitesse de rรฉponse",
|
| 419 |
+
"score": 0,
|
| 420 |
+
"time": 6.680315256118774,
|
| 421 |
+
"response": ".\n\nA. 100\nB. 200\nC. 300\nD. 400\nE. 500\nF. 600\nG. 70..."
|
| 422 |
+
}
|
| 423 |
+
]
|
| 424 |
+
}
|
| 425 |
+
],
|
| 426 |
+
"best_overall": {
|
| 427 |
+
"name": "Qwen3-0.6B",
|
| 428 |
+
"avg_score": 6.0,
|
| 429 |
+
"avg_time": 6.404076337814331,
|
| 430 |
+
"efficiency": 0.936903260283084,
|
| 431 |
+
"tests": [
|
| 432 |
+
{
|
| 433 |
+
"test": "Commande simple",
|
| 434 |
+
"score": 7,
|
| 435 |
+
"time": 6.516923427581787,
|
| 436 |
+
"response": "Exemple: {\"tool\": \"get_game_state\", \"args\": {\"unit..."
|
| 437 |
+
},
|
| 438 |
+
{
|
| 439 |
+
"test": "Action avec paramรจtres",
|
| 440 |
+
"score": 7,
|
| 441 |
+
"time": 6.65591287612915,
|
| 442 |
+
"response": "Rรฉponse: {\"tool\": \"move_units\", \"args\": {\"unit_ids..."
|
| 443 |
+
},
|
| 444 |
+
{
|
| 445 |
+
"test": "Vitesse de rรฉponse",
|
| 446 |
+
"score": 4,
|
| 447 |
+
"time": 6.039392709732056,
|
| 448 |
+
"response": ", but not too much. The user is asking for a respo..."
|
| 449 |
+
}
|
| 450 |
+
]
|
| 451 |
+
},
|
| 452 |
+
"most_efficient": {
|
| 453 |
+
"name": "Gemma-3-270M",
|
| 454 |
+
"avg_score": 4.666666666666667,
|
| 455 |
+
"avg_time": 3.6990818977355957,
|
| 456 |
+
"efficiency": 1.2615743029434903,
|
| 457 |
+
"tests": [
|
| 458 |
+
{
|
| 459 |
+
"test": "Commande simple",
|
| 460 |
+
"score": 5,
|
| 461 |
+
"time": 3.697866201400757,
|
| 462 |
+
"response": "```\n**Explication:**\n\n* `get_game_state()` : Rรฉcup..."
|
| 463 |
+
},
|
| 464 |
+
{
|
| 465 |
+
"test": "Action avec paramรจtres",
|
| 466 |
+
"score": 5,
|
| 467 |
+
"time": 3.690243721008301,
|
| 468 |
+
"response": "```\n\n**Explication:**\n\n* `move_units` est un outil..."
|
| 469 |
+
},
|
| 470 |
+
{
|
| 471 |
+
"test": "Vitesse de rรฉponse",
|
| 472 |
+
"score": 4,
|
| 473 |
+
"time": 3.7091357707977295,
|
| 474 |
+
"response": ".\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\nOK.\n..."
|
| 475 |
+
}
|
| 476 |
+
]
|
| 477 |
+
}
|
| 478 |
+
}
|
docs/reports/practical_mcp_results.json
ADDED
|
@@ -0,0 +1,149 @@
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"name": "Qwen2.5-0.5B",
|
| 4 |
+
"avg_score": 3.0,
|
| 5 |
+
"avg_time": 3.1588314941951205,
|
| 6 |
+
"results": [
|
| 7 |
+
{
|
| 8 |
+
"command": "show game state",
|
| 9 |
+
"score": 4,
|
| 10 |
+
"time": 4.286122560501099,
|
| 11 |
+
"response": "Here is the response in JSON format:\n\n```json\n{\n "
|
| 12 |
+
},
|
| 13 |
+
{
|
| 14 |
+
"command": "move infantry to 100,200",
|
| 15 |
+
"score": 4,
|
| 16 |
+
"time": 4.499698638916016,
|
| 17 |
+
"response": "In this case, the tool is \"move_units\" and the arg"
|
| 18 |
+
},
|
| 19 |
+
{
|
| 20 |
+
"command": "attack enemy tank at 300,150",
|
| 21 |
+
"score": 5,
|
| 22 |
+
"time": 4.139246940612793,
|
| 23 |
+
"response": "In this case, the tool is \"attack_unit\" and the ar"
|
| 24 |
+
},
|
| 25 |
+
{
|
| 26 |
+
"command": "build barracks near 250,200",
|
| 27 |
+
"score": 3,
|
| 28 |
+
"time": 2.345900774002075,
|
| 29 |
+
"response": ".\n```json\n{\n \"tool\": \"build_building\",\n \"args\": "
|
| 30 |
+
},
|
| 31 |
+
{
|
| 32 |
+
"command": "defend base with all units",
|
| 33 |
+
"score": 1,
|
| 34 |
+
"time": 1.6794795989990234,
|
| 35 |
+
"response": ".\n```json\n{\n \"tool\": \"move_units\",\n \"args\": {\n "
|
| 36 |
+
},
|
| 37 |
+
{
|
| 38 |
+
"command": "harvest resources with harvester",
|
| 39 |
+
"score": 2,
|
| 40 |
+
"time": 1.0160574913024902,
|
| 41 |
+
"response": ".\n```json\n{\n \"tool\": \"get_game_state\",\n \"args\": "
|
| 42 |
+
},
|
| 43 |
+
{
|
| 44 |
+
"command": "scout enemy positions",
|
| 45 |
+
"score": 2,
|
| 46 |
+
"time": 4.145314455032349,
|
| 47 |
+
"response": "In this scenario, the user command is \"scout enemy"
|
| 48 |
+
}
|
| 49 |
+
]
|
| 50 |
+
},
|
| 51 |
+
{
|
| 52 |
+
"name": "Qwen3-0.6B",
|
| 53 |
+
"avg_score": 2.142857142857143,
|
| 54 |
+
"avg_time": 7.839082138878958,
|
| 55 |
+
"results": [
|
| 56 |
+
{
|
| 57 |
+
"command": "show game state",
|
| 58 |
+
"score": 4,
|
| 59 |
+
"time": 8.01675820350647,
|
| 60 |
+
"response": "If the user command is not available, respond with"
|
| 61 |
+
},
|
| 62 |
+
{
|
| 63 |
+
"command": "move infantry to 100,200",
|
| 64 |
+
"score": 3,
|
| 65 |
+
"time": 7.884659290313721,
|
| 66 |
+
"response": "{\n \"tool\": \"move_units\",\n \"args\": {\n \"unit_id"
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"command": "attack enemy tank at 300,150",
|
| 70 |
+
"score": 3,
|
| 71 |
+
"time": 7.239688873291016,
|
| 72 |
+
"response": "{\n \"tool\": \"attack_unit\",\n \"args\": {\n \"attack"
|
| 73 |
+
},
|
| 74 |
+
{
|
| 75 |
+
"command": "build barracks near 250,200",
|
| 76 |
+
"score": 1,
|
| 77 |
+
"time": 7.6789350509643555,
|
| 78 |
+
"response": "The correct answer is {\"tool\": \"build\", \"args\": {\""
|
| 79 |
+
},
|
| 80 |
+
{
|
| 81 |
+
"command": "defend base with all units",
|
| 82 |
+
"score": 1,
|
| 83 |
+
"time": 8.047882318496704,
|
| 84 |
+
"response": "Possible tool names: build, get_game_state, move, "
|
| 85 |
+
},
|
| 86 |
+
{
|
| 87 |
+
"command": "harvest resources with harvester",
|
| 88 |
+
"score": 1,
|
| 89 |
+
"time": 8.035732746124268,
|
| 90 |
+
"response": "The correct answer is {\"tool\": \"harvest\", \"args\": "
|
| 91 |
+
},
|
| 92 |
+
{
|
| 93 |
+
"command": "scout enemy positions",
|
| 94 |
+
"score": 2,
|
| 95 |
+
"time": 7.969918489456177,
|
| 96 |
+
"response": "The user's command is \"scout enemy positions\". Wha"
|
| 97 |
+
}
|
| 98 |
+
]
|
| 99 |
+
},
|
| 100 |
+
{
|
| 101 |
+
"name": "Gemma-3-270M",
|
| 102 |
+
"avg_score": 1.4285714285714286,
|
| 103 |
+
"avg_time": 4.817336320877075,
|
| 104 |
+
"results": [
|
| 105 |
+
{
|
| 106 |
+
"command": "show game state",
|
| 107 |
+
"score": 1,
|
| 108 |
+
"time": 4.833978652954102,
|
| 109 |
+
"response": "- show game state\n- move_units(unit_ids, target_x,"
|
| 110 |
+
},
|
| 111 |
+
{
|
| 112 |
+
"command": "move infantry to 100,200",
|
| 113 |
+
"score": 2,
|
| 114 |
+
"time": 4.697685718536377,
|
| 115 |
+
"response": "```\n`move infantry to 100,200`\n\n**Explanation:**\n\n"
|
| 116 |
+
},
|
| 117 |
+
{
|
| 118 |
+
"command": "attack enemy tank at 300,150",
|
| 119 |
+
"score": 3,
|
| 120 |
+
"time": 4.781869649887085,
|
| 121 |
+
"response": "```\n**Explanation:**\n\n* `get_game_state()`: Retrie"
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"command": "build barracks near 250,200",
|
| 125 |
+
"score": 3,
|
| 126 |
+
"time": 4.599961996078491,
|
| 127 |
+
"response": "```\n`build barracks near 250,200`\n```\nJSON:\n```\n{\""
|
| 128 |
+
},
|
| 129 |
+
{
|
| 130 |
+
"command": "defend base with all units",
|
| 131 |
+
"score": 0,
|
| 132 |
+
"time": 4.869190216064453,
|
| 133 |
+
"response": "- `tool_name` is a string\n- `args` is a dictionary"
|
| 134 |
+
},
|
| 135 |
+
{
|
| 136 |
+
"command": "harvest resources with harvester",
|
| 137 |
+
"score": 0,
|
| 138 |
+
"time": 4.996307849884033,
|
| 139 |
+
"response": "```\n- `harvest resources with harvester`\n- `harves"
|
| 140 |
+
},
|
| 141 |
+
{
|
| 142 |
+
"command": "scout enemy positions",
|
| 143 |
+
"score": 1,
|
| 144 |
+
"time": 4.942360162734985,
|
| 145 |
+
"response": "```\n`scout enemy positions`\n- `tool`: \"scout\"\n- `a"
|
| 146 |
+
}
|
| 147 |
+
]
|
| 148 |
+
}
|
| 149 |
+
]
|
docs/reports/quick_model_comparison.json
ADDED
|
@@ -0,0 +1,43 @@
| 1 |
+
{
|
| 2 |
+
"results": [
|
| 3 |
+
{
|
| 4 |
+
"name": "Qwen2.5-0.5B",
|
| 5 |
+
"score": 7,
|
| 6 |
+
"time": 4.335223197937012,
|
| 7 |
+
"response": ".\n\nOutils: get_game_state(), move_units(unit_ids, target_x, target_y)\n\nCommande: \"Montre-moi l'รฉtat du jeu\"\n\nRรฉponds avec JSON: {{\"tool\": \"nom_outil\", \"args\": {{}}}}. \n\nNote: \"nom_outil\" est le nom de l'outil utilisรฉ pour la commande, \"nom_outil\" est le nom du tool utilisรฉ pour la commande. \"args\" est un dictionnaire contenant"
|
| 8 |
+
},
|
| 9 |
+
{
|
| 10 |
+
"name": "Qwen3-0.6B",
|
| 11 |
+
"score": 7,
|
| 12 |
+
"time": 8.028195858001709,
|
| 13 |
+
"response": "Exemple: {\"tool\": \"get_game_state\", \"args\": {\"unit_ids\": \"1,2,3\", \"target_x\": \"4\", \"target_y\": \"5\"}}\n\nExemple 2: {\"tool\": \"move_units\", \"args\": {\"unit_ids\": \"1,2,3\", \"target_x\": \"4\", \"target_y\": \"5\"}}\n\nExemple 3: {\"tool\": \"get_game_state\", \"args\": {\"unit"
|
| 14 |
+
},
|
| 15 |
+
{
|
| 16 |
+
"name": "Gemma-3-1B",
|
| 17 |
+
"score": 7,
|
| 18 |
+
"time": 8.582704782485962,
|
| 19 |
+
"response": "```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\": \"map_name\"}}}}\n```\n\n```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\": \"map_name\"}}}}\n```\n\n```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\": \"map_name\"}}}}\n```\n\n```json\n{{\"tool\": \"get_game_"
|
| 20 |
+
}
|
| 21 |
+
],
|
| 22 |
+
"ranking": [
|
| 23 |
+
{
|
| 24 |
+
"name": "Qwen2.5-0.5B",
|
| 25 |
+
"score": 7,
|
| 26 |
+
"time": 4.335223197937012,
|
| 27 |
+
"response": ".\n\nOutils: get_game_state(), move_units(unit_ids, target_x, target_y)\n\nCommande: \"Montre-moi l'รฉtat du jeu\"\n\nRรฉponds avec JSON: {{\"tool\": \"nom_outil\", \"args\": {{}}}}. \n\nNote: \"nom_outil\" est le nom de l'outil utilisรฉ pour la commande, \"nom_outil\" est le nom du tool utilisรฉ pour la commande. \"args\" est un dictionnaire contenant"
|
| 28 |
+
},
|
| 29 |
+
{
|
| 30 |
+
"name": "Qwen3-0.6B",
|
| 31 |
+
"score": 7,
|
| 32 |
+
"time": 8.028195858001709,
|
| 33 |
+
"response": "Exemple: {\"tool\": \"get_game_state\", \"args\": {\"unit_ids\": \"1,2,3\", \"target_x\": \"4\", \"target_y\": \"5\"}}\n\nExemple 2: {\"tool\": \"move_units\", \"args\": {\"unit_ids\": \"1,2,3\", \"target_x\": \"4\", \"target_y\": \"5\"}}\n\nExemple 3: {\"tool\": \"get_game_state\", \"args\": {\"unit"
|
| 34 |
+
},
|
| 35 |
+
{
|
| 36 |
+
"name": "Gemma-3-1B",
|
| 37 |
+
"score": 7,
|
| 38 |
+
"time": 8.582704782485962,
|
| 39 |
+
"response": "```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\": \"map_name\"}}}}\n```\n\n```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\": \"map_name\"}}}}\n```\n\n```json\n{{\"tool\": \"get_game_state\", \"args\": {\"map\": \"map_name\"}}}}\n```\n\n```json\n{{\"tool\": \"get_game_"
|
| 40 |
+
}
|
| 41 |
+
],
|
| 42 |
+
"best_model": "Qwen2.5-0.5B"
|
| 43 |
+
}
|
docs/reports/qwen_mcp_evaluation_results.json
ADDED
|
@@ -0,0 +1,37 @@
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"test": "Comprรฉhension MCP de base",
|
| 4 |
+
"response": "```json\n{\n \"call\": \"get_game_state\"\n}\n```",
|
| 5 |
+
"response_time": 0.9023318290710449,
|
| 6 |
+
"score": 5,
|
| 7 |
+
"success": true
|
| 8 |
+
},
|
| 9 |
+
{
|
| 10 |
+
"test": "Traduction d'action simple",
|
| 11 |
+
"response": "```json\n{\n \"move_units\": {\n \"unit_ids\": [\"infanterie\"],\n \"target_x\": 150,\n \"target_y\": 200\n }\n}\n```",
|
| 12 |
+
"response_time": 1.97584867477417,
|
| 13 |
+
"score": 5,
|
| 14 |
+
"success": true
|
| 15 |
+
},
|
| 16 |
+
{
|
| 17 |
+
"test": "Extraction de paramรจtres complexes",
|
| 18 |
+
"response": "```json\n{\n \"move_units\": {\n \"units\": [\n \"tank_1\",\n \"tank_3\",\n \"tank_7\"\n ],\n \"target_x\": 120,\n \"target_y\": 80\n }\n}\n```",
|
| 19 |
+
"response_time": 2.287006378173828,
|
| 20 |
+
"score": 5,
|
| 21 |
+
"success": true
|
| 22 |
+
},
|
| 23 |
+
{
|
| 24 |
+
"test": "Planification stratรฉgique",
|
| 25 |
+
"response": "```json\n{\n \"get_game_state\": {},\n \"move_units\": {\n \"units\": [\n {\n \"name\": \"Base\",\n \"position\": {\n \"x\": 10,\n \"y\": 10\n }\n },\n {\n \"name\": \"Minerai\",\n \"position\": {\n \"x\": 10,\n \"y\": 10\n }\n }\n ]\n },\n \"attack_unit\": {\n \"units\": [\n {\n \"name\": \"Minerai\",\n \"position\": {\n \"x\": 10,\n \"y\": 10\n }\n }\n ]\n }\n}\n```",
|
| 26 |
+
"response_time": 5.687268257141113,
|
| 27 |
+
"score": 3,
|
| 28 |
+
"success": false
|
| 29 |
+
},
|
| 30 |
+
{
|
| 31 |
+
"test": "Gestion d'ambiguรฏtรฉ",
|
| 32 |
+
"response": "Voici comment je gรฉrerais cette instruction ambiguรซ :\n\n```json\n{\n \"move_units\": {\n \"unit_ids\": [\"unit1_id\", \"unit2_id\", \"unit3_id\"],\n \"target_x\": 0,\n \"target_y\": 0\n }\n}\n```\n\nDans ce cas, je vais dรฉplacer les unitรฉs vers la base ennemie.",
|
| 33 |
+
"response_time": 3.0712451934814453,
|
| 34 |
+
"score": 3,
|
| 35 |
+
"success": false
|
| 36 |
+
}
|
| 37 |
+
]
|
docs/reports/realistic_mcp_evaluation.json
ADDED
|
@@ -0,0 +1,445 @@
{
  "evaluation_type": "realistic_mcp_game_test",
  "game_state_sample": {
    "player_id": 1,
    "resources": {
      "credits": 2500,
      "power": 150
    },
    "units": [
      {
        "id": 1,
        "type": "infantry",
        "x": 100,
        "y": 100,
        "health": 100
      },
      {
        "id": 2,
        "type": "infantry",
        "x": 110,
        "y": 105,
        "health": 100
      },
      {
        "id": 3,
        "type": "tank",
        "x": 120,
        "y": 110,
        "health": 150
      },
      {
        "id": 4,
        "type": "harvester",
        "x": 200,
        "y": 200,
        "health": 200
      }
    ],
    "buildings": [
      {
        "id": 1,
        "type": "hq",
        "x": 50,
        "y": 50,
        "health": 500
      },
      {
        "id": 2,
        "type": "power_plant",
        "x": 80,
        "y": 80,
        "health": 300
      },
      {
        "id": 3,
        "type": "barracks",
        "x": 120,
        "y": 60,
        "health": 250
      }
    ],
    "enemy_units": [
      {
        "id": 101,
        "type": "infantry",
        "x": 300,
        "y": 150,
        "health": 100
      },
      {
        "id": 102,
        "type": "tank",
        "x": 320,
        "y": 160,
        "health": 150
      }
    ],
    "map": {
      "width": 96,
      "height": 72,
      "ore_fields": [
        {
          "x": 250,
          "y": 200
        },
        {
          "x": 400,
          "y": 300
        }
      ]
    }
  },
| 93 |
+
"results": [
|
| 94 |
+
{
|
| 95 |
+
"name": "Qwen2.5-0.5B",
|
| 96 |
+
"avg_score": 3.0,
|
| 97 |
+
"avg_time": 8.120952129364014,
|
| 98 |
+
"results": [
|
| 99 |
+
{
|
| 100 |
+
"scenario": "Dรฉfense immรฉdiate",
|
| 101 |
+
"difficulty": "facile",
|
| 102 |
+
"score": 3,
|
| 103 |
+
"time": 11.3031907081604,
|
| 104 |
+
"response": "Action: attack_unit\n- Arguments: [320, 160]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: build_building\n- Arguments: [power_plant, 80, 80, 1]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1",
|
| 105 |
+
"expected_tool": "attack_unit"
|
| 106 |
+
},
|
| 107 |
+
{
|
| 108 |
+
"scenario": "Collecte de ressources",
|
| 109 |
+
"difficulty": "moyen",
|
| 110 |
+
"score": 0,
|
| 111 |
+
"time": 8.980303764343262,
|
| 112 |
+
"response": "Dรฉmarre une construction de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de",
|
| 113 |
+
"expected_tool": "move_units"
|
| 114 |
+
},
|
| 115 |
+
{
|
| 116 |
+
"scenario": "Expansion stratรฉgique",
|
| 117 |
+
"difficulty": "moyen",
|
| 118 |
+
"score": 7,
|
| 119 |
+
"time": 9.734914541244507,
|
| 120 |
+
"response": "ACTION: build_building(building_type, position_x, position_y, player_id)\n\nVeuillez fournir un objet JSON reprรฉsentant la construction de la caserne. \n\nJSON: {\"building_type\": \"barracks\", \"position_x\": 250, \"position_y\": 200, \"player_id\": 101} \n\nVeuillez noter que la caserne doit รชtre construite avant que les rรฉcolteurs ne soient en danger. \n\nVeuillez noter que la caserne doit รชtre construite dans le champ de minerai. \n\nVeuillez noter que la caserne doit รชtre construite avant que les rรฉcolteurs ne soient en danger. \n\nVeuillez noter que la caserne doit รชtre construite dans le champ de minerai. \n\nVeuillez noter que la caserne doit รชtre construite avant que les rรฉcolteurs ne soient en danger. \n\nVeuillez noter que la caserne",
|
| 121 |
+
"expected_tool": "build_building"
|
| 122 |
+
},
|
| 123 |
+
{
|
| 124 |
+
"scenario": "Attaque coordonnรฉe",
|
| 125 |
+
"difficulty": "difficile",
|
| 126 |
+
"score": 0,
|
| 127 |
+
"time": 9.359946966171265,
|
| 128 |
+
"response": "- Attaque les unitรฉs ennemies de la position (x, y) = (300, 150) avec une portรฉe de 100.\n- Attaque les unitรฉs ennemies de la position (x, y) = (400, 300) avec une portรฉe de 150.\n- Attaque les unitรฉs ennemies de la position (x, y) = (300, 150) avec une portรฉe de 100.\n- Attaque les unitรฉs ennemies de la position (x, y) = (400, 300) avec une portรฉe de 150.\n- Attaque les unitรฉs ennemies de la position (x, y) = (300, 150) avec une portรฉe de 100.\n- Attaque les unitรฉs ennemies",
|
| 129 |
+
"expected_tool": "attack_unit"
|
| 130 |
+
},
|
| 131 |
+
{
|
| 132 |
+
"scenario": "Reconnaissance",
|
| 133 |
+
"difficulty": "facile",
|
| 134 |
+
"score": 2,
|
| 135 |
+
"time": 0.18431353569030762,
|
| 136 |
+
"response": "",
|
| 137 |
+
"expected_tool": "get_game_state"
|
| 138 |
+
},
|
| 139 |
+
{
|
| 140 |
+
"scenario": "Gestion de crise",
|
| 141 |
+
"difficulty": "difficile",
|
| 142 |
+
"score": 6,
|
| 143 |
+
"time": 9.16304326057434,
|
| 144 |
+
"response": "Les unitรฉs doivent exister (vรฉrifier les IDs) et exรฉcuter la fonction move_units.\n\nAction: move_units\n{\n \"unit_ids\": [1, 2, 3, 4],\n \"target_x\": 50,\n \"target_y\": 50\n}\n\nRรGLES DU JEU:\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)",
|
| 145 |
+
"expected_tool": "move_units"
|
| 146 |
+
}
|
| 147 |
+
],
|
| 148 |
+
"easy_avg": 2.5,
|
| 149 |
+
"medium_avg": 3.5,
|
| 150 |
+
"hard_avg": 3.0
|
| 151 |
+
},
|
    {
      "name": "Qwen3-0.6B",
      "avg_score": 2.6666666666666665,
      "avg_time": 16.65606681505839,
      "results": [
        {
          "scenario": "Défense immédiate",
          "difficulty": "facile",
          "score": 3,
          "time": 23.307448387145996,
          "response": "Pour répondre, formattez le JSON comme suit:\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"100\"\n}\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"150\"\n}\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"200\"\n}\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"50\"\n}\nAnswer:\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"1",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Collecte de ressources",
          "difficulty": "moyen",
          "score": 3,
          "time": 10.506604194641113,
          "response": "(100 crédits)\n- { \"action\": \"move_units\", \"units\": [1,4], \"target_x\": 200, \"target_y\": 200 } \n\nLa réponse doit être inutilement en français, with no markdown, and in the correct JSON format.\n\nExample:\n- { \"action\": \"move_units\", \"units\": [1,4], \"target_x\": 100, \"target_y\": 100 }\n\nAnswer:",
          "expected_tool": "move_units"
        },
        {
          "scenario": "Expansion stratégique",
          "difficulty": "moyen",
          "score": 5,
          "time": 7.1594953536987305,
          "response": "Je dois construire une caserne de type \"hq\" et avoir une ressource de 500.\n\nPourquoi est-ce important de construire une caserne de type \"hq\" ?\n\nAnswer:\n{\n \"action\": \"build_building('hq', 250, 200, 1)\"\n}",
          "expected_tool": "build_building"
        },
        {
          "scenario": "Attaque coordonnée",
          "difficulty": "difficile",
          "score": 2,
          "time": 19.33790349960327,
          "response": "Pour chaque unité ennemie, determine if it is in range of my units. If it is, execute an attack. If not, do nothing.\n\nIf there are no attacks, then execute the build_building action.\n\nIf there are attacks, then execute the attack_unit action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then execute the get_ai_analysis action.\n\nIf there are no attacks, then execute the get_game_state action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then execute the get_ai_analysis action.\n\nIf there are no attacks, then execute the get_game_state action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then execute the get_ai_analysis action.\n\nIf there are no attacks, then execute the get_game_state action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Reconnaissance",
          "difficulty": "facile",
          "score": 3,
          "time": 19.707163333892822,
          "response": "(Output must be a JSON object)\n{\n \"game_state\": {\n \"player_id\": 1,\n \"resources\": {\n \"credits\": 2500,\n \"power\": 150\n },\n \"units\": [\n {\n \"id\": 1,\n \"type\": \"infantry\",\n \"x\": 100,\n \"y\": 100,\n \"health\": 100\n },\n {\n \"id\": 2,\n \"type\": \"infantry\",\n \"x\": 110,\n \"y\": 105,\n \"health\": 100\n },\n {\n \"id\": 3,\n \"type\": \"tank\",\n \"x\": 120,\n \"y\": 110,\n \"health\": 150\n },\n {\n \"id\": 4,",
          "expected_tool": "get_game_state"
        },
        {
          "scenario": "Gestion de crise",
          "difficulty": "difficile",
          "score": 0,
          "time": 19.917786121368408,
          "response": "Les unités sont déplacées vers (100, 100) et (120, 100) et (150, 100). Les unités ennemies sont déplacées vers (300, 150) et (320, 160). Les unités ennemies sont attaquées par le joueur.\n\nLes unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur.\n\nLes unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur.\n\nLes unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par",
          "expected_tool": "move_units"
        }
      ],
      "easy_avg": 3.0,
      "medium_avg": 4.0,
      "hard_avg": 1.0
    },
    {
      "name": "Gemma-3-270M",
      "avg_score": 1.1666666666666667,
      "avg_time": 7.372450391451518,
      "results": [
        {
          "scenario": "Défense immédiate",
          "difficulty": "facile",
          "score": 3,
          "time": 11.643838167190552,
          "response": "- get_game_state(): Obtenir l'état actuel du jeu\n- move_units(unit_ids, target_x, target_y): Déplacer des unités\n- attack_unit(attacker_ids, target_id): Attaquer une unité ennemie\n- build_building(building_type, position_x, position_y, player_id): Construire un bâtiment\n- get_ai_analysis(language): Obtenir une analyse tactique\n\nJe suis prêt à répondre à vos questions.\nMerci!\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Collecte de ressources",
          "difficulty": "moyen",
          "score": 0,
          "time": 0.38573145866394043,
          "response": "Merci!",
          "expected_tool": "move_units"
        },
        {
          "scenario": "Expansion stratégique",
          "difficulty": "moyen",
          "score": 0,
          "time": 10.55238938331604,
          "response": "- Quel est le nom de la caserne ?\n- Quelle est la position de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n-",
          "expected_tool": "build_building"
        },
        {
          "scenario": "Attaque coordonnée",
          "difficulty": "difficile",
          "score": 0,
          "time": 10.54906940460205,
          "response": "- Défense : Attaque contre les unités ennemies.\n- Destruction : Destruction des unités ennemies.\n- (Optional) Utilisation de la force de feu.\n- (Optional) Utilisation de la magie.\n- (Optional) Utilisation de la technologie.\n- (Optional) Utilisation de la base de données.\n- (Optional) Utilisation de la base de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Reconnaissance",
          "difficulty": "facile",
          "score": 2,
          "time": 0.4084289073944092,
          "response": "Merci!",
          "expected_tool": "get_game_state"
        },
        {
          "scenario": "Gestion de crise",
          "difficulty": "difficile",
          "score": 2,
          "time": 10.695245027542114,
          "response": "- `get_game_state()` : Obtient l'état actuel du jeu.\n- `move_units(unit_ids, target_x, target_y)` : Déplace les unités dans la position spécifiée.\n- `attack_unit(attacker_ids, target_id)` : Attaque une unité ennemie.\n- `build_building(building_type, position_x, position_y, player_id)` : Construis un bâtiment.\n- `get_ai_analysis(language)` : Analyse tactique.\n- `move_units(unit_ids, target_x, target_y)` : Déplace les unités dans la position spécifiée.\n- `attack_unit(attacker_ids, target_id)` : Attaque une unité ennemie.\n- `build_building(building_type, position_x, position_y, player_id)` : Construis un",
          "expected_tool": "move_units"
        }
      ],
      "easy_avg": 2.5,
      "medium_avg": 0.0,
      "hard_avg": 1.0
    }
  ],
| 269 |
+
"successful_models": [
|
| 270 |
+
{
|
| 271 |
+
"name": "Qwen2.5-0.5B",
|
| 272 |
+
"avg_score": 3.0,
|
| 273 |
+
"avg_time": 8.120952129364014,
|
| 274 |
+
"results": [
|
| 275 |
+
{
|
| 276 |
+
"scenario": "Dรฉfense immรฉdiate",
|
| 277 |
+
"difficulty": "facile",
|
| 278 |
+
"score": 3,
|
| 279 |
+
"time": 11.3031907081604,
|
| 280 |
+
"response": "Action: attack_unit\n- Arguments: [320, 160]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: build_building\n- Arguments: [power_plant, 80, 80, 1]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1, 100, 100]\n\nAction: move_units\n- Arguments: [1",
|
| 281 |
+
"expected_tool": "attack_unit"
|
| 282 |
+
},
|
| 283 |
+
{
|
| 284 |
+
"scenario": "Collecte de ressources",
|
| 285 |
+
"difficulty": "moyen",
|
| 286 |
+
"score": 0,
|
| 287 |
+
"time": 8.980303764343262,
|
| 288 |
+
"response": "Dรฉmarre une construction de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de 1000 piรจces de",
|
| 289 |
+
"expected_tool": "move_units"
|
| 290 |
+
},
|
| 291 |
+
{
|
| 292 |
+
"scenario": "Expansion stratรฉgique",
|
| 293 |
+
"difficulty": "moyen",
|
| 294 |
+
"score": 7,
|
| 295 |
+
"time": 9.734914541244507,
|
| 296 |
+
"response": "ACTION: build_building(building_type, position_x, position_y, player_id)\n\nVeuillez fournir un objet JSON reprรฉsentant la construction de la caserne. \n\nJSON: {\"building_type\": \"barracks\", \"position_x\": 250, \"position_y\": 200, \"player_id\": 101} \n\nVeuillez noter que la caserne doit รชtre construite avant que les rรฉcolteurs ne soient en danger. \n\nVeuillez noter que la caserne doit รชtre construite dans le champ de minerai. \n\nVeuillez noter que la caserne doit รชtre construite avant que les rรฉcolteurs ne soient en danger. \n\nVeuillez noter que la caserne doit รชtre construite dans le champ de minerai. \n\nVeuillez noter que la caserne doit รชtre construite avant que les rรฉcolteurs ne soient en danger. \n\nVeuillez noter que la caserne",
|
| 297 |
+
"expected_tool": "build_building"
|
| 298 |
+
},
|
| 299 |
+
{
|
| 300 |
+
"scenario": "Attaque coordonnรฉe",
|
| 301 |
+
"difficulty": "difficile",
|
| 302 |
+
"score": 0,
|
| 303 |
+
"time": 9.359946966171265,
|
| 304 |
+
"response": "- Attaque les unitรฉs ennemies de la position (x, y) = (300, 150) avec une portรฉe de 100.\n- Attaque les unitรฉs ennemies de la position (x, y) = (400, 300) avec une portรฉe de 150.\n- Attaque les unitรฉs ennemies de la position (x, y) = (300, 150) avec une portรฉe de 100.\n- Attaque les unitรฉs ennemies de la position (x, y) = (400, 300) avec une portรฉe de 150.\n- Attaque les unitรฉs ennemies de la position (x, y) = (300, 150) avec une portรฉe de 100.\n- Attaque les unitรฉs ennemies",
|
| 305 |
+
"expected_tool": "attack_unit"
|
| 306 |
+
},
|
| 307 |
+
{
|
| 308 |
+
"scenario": "Reconnaissance",
|
| 309 |
+
"difficulty": "facile",
|
| 310 |
+
"score": 2,
|
| 311 |
+
"time": 0.18431353569030762,
|
| 312 |
+
"response": "",
|
| 313 |
+
"expected_tool": "get_game_state"
|
| 314 |
+
},
|
| 315 |
+
{
|
| 316 |
+
"scenario": "Gestion de crise",
|
| 317 |
+
"difficulty": "difficile",
|
| 318 |
+
"score": 6,
|
| 319 |
+
"time": 9.16304326057434,
|
| 320 |
+
"response": "Les unitรฉs doivent exister (vรฉrifier les IDs) et exรฉcuter la fonction move_units.\n\nAction: move_units\n{\n \"unit_ids\": [1, 2, 3, 4],\n \"target_x\": 50,\n \"target_y\": 50\n}\n\nRรGLES DU JEU:\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)\n- Les unitรฉs doivent exister (vรฉrifier les IDs)",
|
| 321 |
+
"expected_tool": "move_units"
|
| 322 |
+
}
|
| 323 |
+
],
|
| 324 |
+
"easy_avg": 2.5,
|
| 325 |
+
"medium_avg": 3.5,
|
| 326 |
+
"hard_avg": 3.0
|
| 327 |
+
},
|
    {
      "name": "Qwen3-0.6B",
      "avg_score": 2.6666666666666665,
      "avg_time": 16.65606681505839,
      "results": [
        {
          "scenario": "Défense immédiate",
          "difficulty": "facile",
          "score": 3,
          "time": 23.307448387145996,
          "response": "Pour répondre, formattez le JSON comme suit:\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"100\"\n}\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"150\"\n}\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"200\"\n}\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"102\",\n \"attack_range\": \"50\"\n}\nAnswer:\n\n{\n \"action\": \"attack_unit\",\n \"unit_id\": \"101\",\n \"target_id\": \"1",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Collecte de ressources",
          "difficulty": "moyen",
          "score": 3,
          "time": 10.506604194641113,
          "response": "(100 crédits)\n- { \"action\": \"move_units\", \"units\": [1,4], \"target_x\": 200, \"target_y\": 200 } \n\nLa réponse doit être inutilement en français, with no markdown, and in the correct JSON format.\n\nExample:\n- { \"action\": \"move_units\", \"units\": [1,4], \"target_x\": 100, \"target_y\": 100 }\n\nAnswer:",
          "expected_tool": "move_units"
        },
        {
          "scenario": "Expansion stratégique",
          "difficulty": "moyen",
          "score": 5,
          "time": 7.1594953536987305,
          "response": "Je dois construire une caserne de type \"hq\" et avoir une ressource de 500.\n\nPourquoi est-ce important de construire une caserne de type \"hq\" ?\n\nAnswer:\n{\n \"action\": \"build_building('hq', 250, 200, 1)\"\n}",
          "expected_tool": "build_building"
        },
        {
          "scenario": "Attaque coordonnée",
          "difficulty": "difficile",
          "score": 2,
          "time": 19.33790349960327,
          "response": "Pour chaque unité ennemie, determine if it is in range of my units. If it is, execute an attack. If not, do nothing.\n\nIf there are no attacks, then execute the build_building action.\n\nIf there are attacks, then execute the attack_unit action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then execute the get_ai_analysis action.\n\nIf there are no attacks, then execute the get_game_state action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then execute the get_ai_analysis action.\n\nIf there are no attacks, then execute the get_game_state action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then execute the get_ai_analysis action.\n\nIf there are no attacks, then execute the get_game_state action.\n\nIf there are no attacks, then execute the move_units action.\n\nIf there are no attacks, then",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Reconnaissance",
          "difficulty": "facile",
          "score": 3,
          "time": 19.707163333892822,
          "response": "(Output must be a JSON object)\n{\n \"game_state\": {\n \"player_id\": 1,\n \"resources\": {\n \"credits\": 2500,\n \"power\": 150\n },\n \"units\": [\n {\n \"id\": 1,\n \"type\": \"infantry\",\n \"x\": 100,\n \"y\": 100,\n \"health\": 100\n },\n {\n \"id\": 2,\n \"type\": \"infantry\",\n \"x\": 110,\n \"y\": 105,\n \"health\": 100\n },\n {\n \"id\": 3,\n \"type\": \"tank\",\n \"x\": 120,\n \"y\": 110,\n \"health\": 150\n },\n {\n \"id\": 4,",
          "expected_tool": "get_game_state"
        },
        {
          "scenario": "Gestion de crise",
          "difficulty": "difficile",
          "score": 0,
          "time": 19.917786121368408,
          "response": "Les unités sont déplacées vers (100, 100) et (120, 100) et (150, 100). Les unités ennemies sont déplacées vers (300, 150) et (320, 160). Les unités ennemies sont attaquées par le joueur.\n\nLes unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur.\n\nLes unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur.\n\nLes unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par le joueur. Les unités ennemies sont attaquées par",
          "expected_tool": "move_units"
        }
      ],
      "easy_avg": 3.0,
      "medium_avg": 4.0,
      "hard_avg": 1.0
    },
    {
      "name": "Gemma-3-270M",
      "avg_score": 1.1666666666666667,
      "avg_time": 7.372450391451518,
      "results": [
        {
          "scenario": "Défense immédiate",
          "difficulty": "facile",
          "score": 3,
          "time": 11.643838167190552,
          "response": "- get_game_state(): Obtenir l'état actuel du jeu\n- move_units(unit_ids, target_x, target_y): Déplacer des unités\n- attack_unit(attacker_ids, target_id): Attaquer une unité ennemie\n- build_building(building_type, position_x, position_y, player_id): Construire un bâtiment\n- get_ai_analysis(language): Obtenir une analyse tactique\n\nJe suis prêt à répondre à vos questions.\nMerci!\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-\n-",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Collecte de ressources",
          "difficulty": "moyen",
          "score": 0,
          "time": 0.38573145866394043,
          "response": "Merci!",
          "expected_tool": "move_units"
        },
        {
          "scenario": "Expansion stratégique",
          "difficulty": "moyen",
          "score": 0,
          "time": 10.55238938331604,
          "response": "- Quel est le nom de la caserne ?\n- Quelle est la position de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n- Quelle est la taille de la caserne ?\n- Quelle est la couleur de la caserne ?\n-",
          "expected_tool": "build_building"
        },
        {
          "scenario": "Attaque coordonnée",
          "difficulty": "difficile",
          "score": 0,
          "time": 10.54906940460205,
          "response": "- Défense : Attaque contre les unités ennemies.\n- Destruction : Destruction des unités ennemies.\n- (Optional) Utilisation de la force de feu.\n- (Optional) Utilisation de la magie.\n- (Optional) Utilisation de la technologie.\n- (Optional) Utilisation de la base de données.\n- (Optional) Utilisation de la base de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation de la base de données de données de données.\n- (Optional) Utilisation",
          "expected_tool": "attack_unit"
        },
        {
          "scenario": "Reconnaissance",
          "difficulty": "facile",
          "score": 2,
          "time": 0.4084289073944092,
          "response": "Merci!",
          "expected_tool": "get_game_state"
        },
        {
          "scenario": "Gestion de crise",
          "difficulty": "difficile",
          "score": 2,
          "time": 10.695245027542114,
          "response": "- `get_game_state()` : Obtient l'état actuel du jeu.\n- `move_units(unit_ids, target_x, target_y)` : Déplace les unités dans la position spécifiée.\n- `attack_unit(attacker_ids, target_id)` : Attaque une unité ennemie.\n- `build_building(building_type, position_x, position_y, player_id)` : Construis un bâtiment.\n- `get_ai_analysis(language)` : Analyse tactique.\n- `move_units(unit_ids, target_x, target_y)` : Déplace les unités dans la position spécifiée.\n- `attack_unit(attacker_ids, target_id)` : Attaque une unité ennemie.\n- `build_building(building_type, position_x, position_y, player_id)` : Construis un",
          "expected_tool": "move_units"
        }
      ],
      "easy_avg": 2.5,
      "medium_avg": 0.0,
      "hard_avg": 1.0
    }
  ]
}
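The aggregate fields in the report above (`avg_score`, `easy_avg`, `medium_avg`, `hard_avg`) appear to be plain arithmetic means of the per-scenario `score` values, grouped by `difficulty`. The sketch below recomputes them for the Qwen2.5-0.5B entry as a sanity check. The `summarize` helper and the mapping from the French labels (`facile`, `moyen`, `difficile`) to the aggregate keys are assumptions inferred from the data, not code taken from the evaluation script.

```python
# Assumed mapping from the French difficulty labels used in the report to the
# aggregate field names; inferred from the JSON, not confirmed by the script.
DIFFICULTY_KEYS = {
    "facile": "easy_avg",
    "moyen": "medium_avg",
    "difficile": "hard_avg",
}


def summarize(scenario_results):
    """Recompute the aggregate fields from one model's per-scenario results."""
    all_scores = [r["score"] for r in scenario_results]
    summary = {"avg_score": sum(all_scores) / len(all_scores)}
    for label, key in DIFFICULTY_KEYS.items():
        scores = [r["score"] for r in scenario_results if r["difficulty"] == label]
        # Fall back to 0.0 when a difficulty level has no scenarios.
        summary[key] = sum(scores) / len(scores) if scores else 0.0
    return summary


# Difficulty labels and scores copied from the Qwen2.5-0.5B entry of the report.
qwen25_results = [
    {"scenario": "Défense immédiate", "difficulty": "facile", "score": 3},
    {"scenario": "Collecte de ressources", "difficulty": "moyen", "score": 0},
    {"scenario": "Expansion stratégique", "difficulty": "moyen", "score": 7},
    {"scenario": "Attaque coordonnée", "difficulty": "difficile", "score": 0},
    {"scenario": "Reconnaissance", "difficulty": "facile", "score": 2},
    {"scenario": "Gestion de crise", "difficulty": "difficile", "score": 6},
]

print(summarize(qwen25_results))
```

Under these assumptions the recomputed values match the stored ones for this model (3.0 overall, 2.5 easy, 3.5 medium, 3.0 hard). The same check can be run against the other two entries by loading the file with `json.load` and passing each model's `results` list to `summarize`.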
tests/scripts/comparative_model_evaluation.py
ADDED
|
@@ -0,0 +1,366 @@
"""
Comparative evaluation of the MCP capabilities of several LLMs.
Compares Qwen2.5-0.5B, Qwen3-0.6B, and Gemma-3-1B on MCP tasks.
"""

import sys
import os
import json
import time

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def initialize_model(model_path, model_name):
    """Initialize a specific model."""
    try:
        from llama_cpp import Llama

        print(f"Initializing {model_name}...")

        llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=2,
            verbose=False
        )

        print(f"✅ {model_name} initialized successfully")
        return llm

    except Exception as e:
        print(f"❌ Failed to initialize {model_name}: {e}")
        return None

def run_mcp_test(llm, model_name, test):
    """Run one MCP test against a specific model."""

    try:
        start_time = time.time()

        # Use the model's chat template
        messages = [
            {"role": "system", "content": "You are an AI assistant specialized in RTS games and the MCP protocol. Respond with valid JSON objects."},
            {"role": "user", "content": test['prompt']}
        ]

        response = llm.create_chat_completion(
            messages=messages,
            max_tokens=300,
            temperature=0.1
        )

        response_time = time.time() - start_time

        # Extract the response text
        if response and 'choices' in response and len(response['choices']) > 0:
            response_text = response['choices'][0]['message']['content']
        else:
            # Fallback: plain completion without the chat template
            simple_response = llm(test['prompt'], max_tokens=300, temperature=0.1)
            response_text = simple_response['choices'][0]['text'] if 'choices' in simple_response else str(simple_response)

        # Score the response
        score = analyze_mcp_response(response_text, test)

        return {
            'model': model_name,
            'test': test['name'],
            'response': response_text,
            'response_time': response_time,
            'score': score,
            'success': score >= 5
        }

    except Exception as e:
        return {
            'model': model_name,
            'test': test['name'],
            'error': str(e),
            'score': 0,
            'success': False
        }

def analyze_mcp_response(response, test_config):
    """Analyze the MCP response and score it from 0 to 10."""

    if not response or response.strip() == "":
        return 0

    score = 0

    # Check for a valid JSON structure
    try:
        json_data = json.loads(response)
        score += 3  # Valid JSON

        # Check for an MCP tool call; guard against non-object JSON
        # (a bare number or list), which has no 'tool' or 'args' keys
        if isinstance(json_data, dict):
            if 'tool' in json_data:
                score += 2
            if 'args' in json_data:
                score += 1

    except json.JSONDecodeError:
        # Not JSON: look for tool and parameter names in the raw text
        mcp_tools = ["get_game_state", "move_units", "attack_unit", "build_building"]
        tools_found = [tool for tool in mcp_tools if tool in response]

        if tools_found:
            score += 2
        if any(param in response for param in ['unit_ids', 'target_x', 'target_y']):
            score += 1

    # Check semantic coherence
    if 'game' in response.lower() or 'rts' in response.lower():
        score += 1

    # Check relevance to the test
    if 'expected_tool' in test_config:
        if test_config['expected_tool'] in response:
            score += 2
    elif test_config.get('expected_complex'):
        if any(word in response.lower() for word in ['sequence', 'steps', 'build', 'defend']):
            score += 2
    elif test_config.get('expected_clarification'):
        if any(word in response.lower() for word in ['clarify', 'coordinates', 'which', 'ambiguous']):
            score += 2

    return min(score, 10)

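Review note: `analyze_mcp_response` only awards the 3-point JSON bonus when the whole response parses with `json.loads`. Small chat models often wrap the object in prose or Markdown fences, so a valid tool call can fall through to the text-pattern branch. The following is a minimal sketch, not part of the committed script, of pulling the first decodable JSON object out of a response before scoring. The helper name `extract_first_json_object` is hypothetical.

```python
import json


def extract_first_json_object(text):
    """Return the first JSON object embedded in `text`, or None if there is none.

    Hypothetical helper: tries to decode at each '{' until one succeeds.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever follows it in the string.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


wrapped = 'Sure! Here is the call: {"tool": "move_units", "args": {"target_x": 150}} Done.'
print(extract_first_json_object(wrapped))
```

Scoring the extracted object instead of the raw string would change the numbers in the saved reports, so any such change would need the evaluations to be re-run.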
def run_comparative_evaluation():
    """Run the comparative evaluation of the three models."""

    print("COMPARATIVE EVALUATION OF MCP CAPABILITIES")
    print("=" * 70)

    # Models to test
    models = [
        {
            'name': 'Qwen2.5-0.5B',
            'path': 'qwen2.5-0.5b-instruct-q4_0.gguf'
        },
        {
            'name': 'Qwen3-0.6B',
            'path': 'Qwen3-0.6B-Q8_0.gguf'
        },
        {
            'name': 'Gemma-3-1B',
            'path': 'google_gemma-3-1b-it-qat-Q4_0.gguf'
        }
    ]

    # MCP tests
    tests = [
        {
            "name": "Basic MCP comprehension",
            "prompt": """You are an AI assistant that controls an RTS game via MCP (Model Context Protocol).

Available MCP tools:
- get_game_state(): Get the current game state
- move_units(unit_ids, target_x, target_y): Move units

User command: "Show me the game state"

Respond with a JSON object containing the MCP tool call to execute.
""",
            "expected_tool": "get_game_state"
        },
        {
            "name": "Simple action translation",
            "prompt": """You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my infantry to position 150, 200"

Respond with a JSON object containing the tool call with the extracted parameters.
""",
            "expected_tool": "move_units"
        },
        {
            "name": "Complex parameter extraction",
            "prompt": """You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move tanks 1, 3 and 7 to coordinates 120, 80"

Extract the parameters and respond with a tool-call JSON object.
""",
            "expected_tool": "move_units"
        },
        {
            "name": "Strategic planning",
            "prompt": """You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- get_game_state(): Get the game state
- move_units(): Move units
- attack_unit(): Attack the enemy
- build_building(): Construct a building

User command: "Build a base near the ore and defend it"

Break this strategy down into a sequence of MCP actions.
""",
            "expected_complex": True
        },
        {
            "name": "Ambiguity handling",
            "prompt": """You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my units to the enemy base"

How would you handle this ambiguous instruction?
""",
            "expected_clarification": True
        }
    ]

    # Initialize the models
    initialized_models = {}

    for model in models:
        if os.path.exists(model['path']):
            llm = initialize_model(model['path'], model['name'])
            if llm:
                initialized_models[model['name']] = llm
        else:
            print(f"❌ Model file not found: {model['path']}")

    if not initialized_models:
        print("❌ No model could be initialized")
        return

    print(f"\n{len(initialized_models)} of {len(models)} model(s) initialized")

    # Run the tests on each model
    all_results = []

    for model_name, llm in initialized_models.items():
        print(f"\n🧪 TESTS WITH {model_name}")
        print("=" * 50)

        model_results = []

        for test in tests:
            print(f"\nTest: {test['name']}")

            result = run_mcp_test(llm, model_name, test)
            model_results.append(result)

            # Distinguish a runtime error from a low score: only error
            # results carry an 'error' key and lack 'response_time'
            if 'error' in result:
                print(f"❌ Error: {result['error']}")
            elif result['success']:
                print(f"✅ Score: {result['score']}/10 | Time: {result['response_time']:.2f}s")
            else:
                print(f"⚠️ Score: {result['score']}/10 (below the pass mark of 5) | Time: {result['response_time']:.2f}s")

            all_results.append(result)

    # Generate the comparative report
    generate_comparative_report(list(initialized_models), all_results, tests)

def generate_comparative_report(model_names, results, tests):
    """Generate a full comparative report."""

    print("\n" + "=" * 70)
    print("MODEL COMPARISON REPORT")
    print("=" * 70)

    # Group the results by model
    model_results = {name: [] for name in model_names}

    for result in results:
        if 'model' in result:
            model_results[result['model']].append(result)

    # Per-model statistics
    print("\nPER-MODEL STATISTICS:")

    model_scores = {}

    for model_name in model_names:
        results_list = model_results[model_name]

        if results_list:
            successful_tests = [r for r in results_list if r.get('success', False)]
            total_score = sum(r.get('score', 0) for r in results_list)
            avg_score = total_score / len(results_list)
            avg_time = sum(r.get('response_time', 0) for r in results_list) / len(results_list)

            model_scores[model_name] = {
                'avg_score': avg_score,
                'success_rate': len(successful_tests) / len(results_list),
                'avg_time': avg_time
            }

            print(f"\n🔹 {model_name}:")
            print(f" Average score: {avg_score:.1f}/10")
            print(f" Success rate: {len(successful_tests)}/{len(results_list)} ({len(successful_tests)/len(results_list)*100:.0f}%)")
            print(f" Average time: {avg_time:.2f}s")

    # Direct comparison
    print("\nRANKING:")

    sorted_models = sorted(model_scores.items(), key=lambda x: x[1]['avg_score'], reverse=True)

    for i, (model_name, scores) in enumerate(sorted_models, 1):
        print(f" {i}. {model_name}: {scores['avg_score']:.1f}/10")

    # Analysis by test type
    print("\n🧪 ANALYSIS BY TEST TYPE:")

    for test in tests:
        test_name = test['name']
        print(f"\n🔸 {test_name}:")

        test_results = [r for r in results if r.get('test') == test_name]

        for model_name in model_names:
            model_test_results = [r for r in test_results if r.get('model') == model_name]

            if model_test_results:
                avg_score = sum(r.get('score', 0) for r in model_test_results) / len(model_test_results)
                print(f" {model_name}: {avg_score:.1f}/10")

    # Recommendations
    print("\n💡 RECOMMENDATIONS:")

    best_model = sorted_models[0][0]
    best_score = sorted_models[0][1]['avg_score']

    if best_score >= 7:
        print(f"✅ {best_model} is EXCELLENT for MCP tasks")
        print(" Recommended for production use")
    elif best_score >= 5:
        print(f"{best_model} is GOOD for MCP tasks")
        print(" Recommended for use with validation")
    else:
        print(f"⚠️ {best_model} is LIMITED for MCP tasks")
        print(" Improvement needed before use")

    # Performance vs. size
    print("\n⚖️ PERFORMANCE VS. SIZE:")

    for model_name, scores in model_scores.items():
        efficiency = scores['avg_score'] / scores['avg_time'] if scores['avg_time'] > 0 else 0
        print(f" {model_name}: {efficiency:.2f} score/second")

    # Save the results
    comparative_results = {
        'model_scores': model_scores,
        'detailed_results': results,
        'ranking': sorted_models
    }

    with open("comparative_mcp_evaluation.json", "w", encoding="utf-8") as f:
        json.dump(comparative_results, f, indent=2, ensure_ascii=False)

    print("\nDetailed results saved to: comparative_mcp_evaluation.json")

if __name__ == "__main__":
    run_comparative_evaluation()
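Review note: this script can only run when the GGUF files are on disk, so the scoring and reporting paths are hard to exercise in isolation. The sketch below shows one way to stand in for the model. It assumes only the two response shapes the script itself reads (`response['choices'][0]['message']['content']` for chat completions and `response['choices'][0]['text']` for the plain-completion fallback). The class name `FakeLlama` is hypothetical and not part of `llama_cpp`.

```python
class FakeLlama:
    """Stand-in for a loaded model that always returns a canned reply.

    Hypothetical test helper; it mirrors only the fields the script reads.
    """

    def __init__(self, reply):
        self.reply = reply

    def create_chat_completion(self, messages, max_tokens=300, temperature=0.1):
        # Shape read by run_mcp_test: choices[0]['message']['content']
        return {"choices": [{"message": {"role": "assistant", "content": self.reply}}]}

    def __call__(self, prompt, max_tokens=300, temperature=0.1, stop=None):
        # Shape read by the plain-completion fallback: choices[0]['text']
        return {"choices": [{"text": self.reply}]}


fake = FakeLlama('{"tool": "get_game_state", "args": {}}')
chat = fake.create_chat_completion(
    messages=[{"role": "user", "content": "show game state"}]
)
print(chat["choices"][0]["message"]["content"])
```

Passing such an object to `run_mcp_test` in place of a real model would let the scoring rules be checked against known replies without loading any weights.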
tests/scripts/comprehensive_mcp_evaluation.py
ADDED
|
@@ -0,0 +1,379 @@
"""
Comprehensive MCP evaluation with 9 models.
Includes MCP-specialized models and general-purpose models.
Realistic test with typical RTS commands.
"""

import sys
import os
import json
import time

sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_model_comprehensive(model_path, model_name):
    """Run the full MCP test suite against one model."""

    try:
        from llama_cpp import Llama

        print(f"🧪 Testing {model_name}...")

        # Check the file size
        file_size = os.path.getsize(model_path) / (1024*1024)
        print(f" Size: {file_size:.0f} MB")

        # Initialize the model; note that no timeout is configured here
        llm = Llama(
            model_path=model_path,
            n_ctx=1024,
            n_threads=1,
            verbose=False,
            n_gpu_layers=0  # Force CPU for a fair comparison
        )

        # Real RTS commands at different levels of complexity
        test_commands = [
            {
                "name": "Simple command",
                "command": "show game state",
                "expected": "get_game_state",
                "difficulty": "easy"
            },
            {
                "name": "Action with coordinates",
                "command": "move infantry to 150,200",
                "expected": "move_units",
                "difficulty": "easy"
            },
            {
                "name": "Specific attack",
                "command": "attack enemy tank at position 300,150",
                "expected": "attack_unit",
                "difficulty": "medium"
            },
            {
                "name": "Construction",
                "command": "build power plant near my base at 100,100",
                "expected": "build_building",
                "difficulty": "medium"
            },
            {
                "name": "Complex command",
                "command": "defend base with all available units",
                "expected": "move_units",
                "difficulty": "hard"
            }
        ]

        results = []
        total_score = 0
        total_time = 0

        for test in test_commands:
            prompt = f"""You are an AI assistant for an RTS game using MCP (Model Context Protocol).

Available tools:
- get_game_state()
- move_units(unit_ids, target_x, target_y)
- attack_unit(attacker_ids, target_id)
- build_building(building_type, position_x, position_y)

User command: "{test['command']}"

Respond with JSON only: {{"tool": "tool_name", "args": {{}}}}"""

            start_time = time.time()

            try:
                response = llm(
                    prompt,
                    max_tokens=100,
                    temperature=0.1,
                    stop=["</s>", "<|im_end|>", "```"]
                )

                response_time = time.time() - start_time

                # Extract the response text
                try:
                    response_text = response['choices'][0]['text'].strip()
                except (KeyError, IndexError, TypeError):
                    # Fallback for other response formats
                    if hasattr(response, 'get'):
                        response_text = response.get('text', str(response))
                    else:
                        response_text = str(response)

                # Score the response
                score = evaluate_mcp_response(response_text, test)

                total_score += score
                total_time += response_time

                print(f" ✅ {test['name']}: {score}/10 ({response_time:.2f}s)")

                results.append({
                    'test': test['name'],
                    'difficulty': test['difficulty'],
                    'score': score,
                    'time': response_time,
                    'response': response_text[:100] + "..." if len(response_text) > 100 else response_text
                })

            except Exception as e:
                print(f" ❌ {test['name']}: Error - {e}")
                results.append({
                    'test': test['name'],
                    'difficulty': test['difficulty'],
                    'score': 0,
                    'time': 0,
                    'error': str(e)
                })

        avg_score = total_score / len(test_commands)
        avg_time = total_time / len(test_commands)

        print(f" Average: {avg_score:.1f}/10 | Time: {avg_time:.2f}s")

        return {
            'name': model_name,
            'file_size_mb': file_size,
            'avg_score': avg_score,
            'avg_time': avg_time,
            'efficiency': avg_score / avg_time if avg_time > 0 else 0,
            'results': results
        }

    except Exception as e:
        print(f"❌ Critical error with {model_name}: {e}")
        return {
            'name': model_name,
            'error': str(e),
            'avg_score': 0,
            'avg_time': 0,
            'efficiency': 0
        }

def evaluate_mcp_response(response, test):
    """Standardized scoring of MCP responses (0 to 10)."""

    if not response or response.strip() == "":
        return 0

    score = 0

    # Valid JSON (3 points)
    try:
        json.loads(response)
        score += 3
    except json.JSONDecodeError:
        # Look for a JSON object embedded in the text (1 point).
        # The original pattern r'\{[^}]*\}' stops at the first '}', so it
        # cannot match the nested {"tool": ..., "args": {...}} form the
        # prompt asks for; decode from the first '{' instead.
        start = response.find('{')
        if start != -1:
            try:
                json.JSONDecoder().raw_decode(response[start:])
                score += 1
            except json.JSONDecodeError:
                pass

    # Correct tool (3 points)
    expected_tool = test['expected']
    if expected_tool in response:
        score += 3

    # Appropriate parameters (2 points)
    if test['difficulty'] == 'easy':
        if '150,200' in response or 'game state' in response:
            score += 2
    elif test['difficulty'] == 'medium':
        if any(coord in response for coord in ['300,150', '100,100']):
            score += 2
    elif test['difficulty'] == 'hard':
        if 'units' in response and 'defend' in response:
            score += 2

    # Correct format (2 points)
    if 'tool' in response and 'args' in response:
        score += 2

    return min(score, 10)

def main():
    """Comprehensive evaluation of all models."""

    print("COMPREHENSIVE MCP EVALUATION - 9 MODELS")
    print("=" * 70)
    print("Test with general-purpose and MCP-specialized models")
    print("=" * 70)

    # All models to test
    models = [
        # General-purpose models (tested previously)
        {
            'name': 'Qwen2.5-0.5B',
            'path': 'qwen2.5-0.5b-instruct-q4_0.gguf',
            'type': 'general'
        },
        {
            'name': 'Qwen3-0.6B',
            'path': 'Qwen3-0.6B-Q8_0.gguf',
            'type': 'general'
        },
        {
            'name': 'Gemma-3-270M',
            'path': 'gemma-3-270m-it-qat-Q8_0.gguf',
            'type': 'general'
        },
        {
            'name': 'Qwen3-1.7B',
            'path': 'Qwen3-1.7B-Q4_0.gguf',
            'type': 'general'
        },

        # MCP-specialized models
        {
            'name': 'MCP-Instruct-v1',
            'path': 'mcp-instruct-v1.Q4_K_M.gguf',
            'type': 'mcp_specialized'
        },
        {
            'name': 'MCPR L-3B-Exa',
            'path': 'mcprl-3b-exa.Q2_K.gguf',
            'type': 'mcp_specialized'
        },
        {
            'name': 'Gemma-3n-E2B-it',
            'path': 'gemma-3n-E2B-it-UD-IQ2_XXS.gguf',
            'type': 'mcp_specialized'
        },
        {
            'name': 'Llama-Breeze2-3B',
            'path': 'Llama-Breeze2-3B-Instruct-Text.Q2_K.gguf',
            'type': 'general'
        },

        # Model specialized in code / structured output
        {
            'name': 'Qwen2.5-Coder-0.5B',
            'path': 'qwen2.5-coder-0.5b-instruct-q4_0.gguf',
            'type': 'code_specialized'
        }
    ]

    results = []

    for model in models:
        if os.path.exists(model['path']):
            result = test_model_comprehensive(model['path'], model['name'])
            result['type'] = model['type']
            results.append(result)
            print()
        else:
            print(f"❌ Model not found: {model['path']}")
            print()

    # Full analysis
    print("=" * 70)
    print("FULL RESULTS")
    print("=" * 70)

    successful_results = [r for r in results if 'error' not in r and r['avg_score'] > 0]

    if successful_results:
        # Ranking by performance
        sorted_by_score = sorted(successful_results, key=lambda x: x['avg_score'], reverse=True)

        print("\nRANKING BY PERFORMANCE:")
        for i, result in enumerate(sorted_by_score, 1):
            print(f" {i:2d}. {result['name']:20s} | {result['avg_score']:.1f}/10 | {result['avg_time']:.2f}s | {result['file_size_mb']:.0f}MB | {result['type']}")

        # Ranking by efficiency
        sorted_by_efficiency = sorted(successful_results, key=lambda x: x['efficiency'], reverse=True)

        print("\n⚡ RANKING BY EFFICIENCY:")
        for i, result in enumerate(sorted_by_efficiency, 1):
            print(f" {i:2d}. {result['name']:20s} | {result['efficiency']:.2f} score/s | {result['file_size_mb']:.0f}MB")

        # Analysis by model type
        print("\nANALYSIS BY MODEL TYPE:")

        general_models = [r for r in successful_results if r['type'] == 'general']
        mcp_specialized = [r for r in successful_results if r['type'] == 'mcp_specialized']
        code_specialized = [r for r in successful_results if r['type'] == 'code_specialized']

        if general_models:
            avg_general = sum(r['avg_score'] for r in general_models) / len(general_models)
            print(f" General-purpose models ({len(general_models)}): {avg_general:.1f}/10 average")

        if mcp_specialized:
            avg_mcp = sum(r['avg_score'] for r in mcp_specialized) / len(mcp_specialized)
            print(f" MCP-specialized ({len(mcp_specialized)}): {avg_mcp:.1f}/10 average")

        if code_specialized:
            avg_code = sum(r['avg_score'] for r in code_specialized) / len(code_specialized)
            print(f" Code-specialized ({len(code_specialized)}): {avg_code:.1f}/10 average")

        # Best model overall
        best = sorted_by_score[0]
        most_efficient = sorted_by_efficiency[0]

        print(f"\n🎯 BEST MODEL OVERALL: {best['name']}")
        print(f" Score: {best['avg_score']:.1f}/10")
        print(f" Time: {best['avg_time']:.2f}s")
        print(f" Size: {best['file_size_mb']:.0f}MB")
        print(f" Type: {best['type']}")

        print(f"\n⚡ MOST EFFICIENT MODEL: {most_efficient['name']}")
        print(f" Efficiency: {most_efficient['efficiency']:.2f} score/s")

        # Final recommendations
        print("\n💡 FINAL RECOMMENDATIONS:")

        if best['avg_score'] >= 7:
            print(f"✅ {best['name']} is EXCELLENT for MCP production use")
        elif best['avg_score'] >= 5:
            print(f"{best['name']} is GOOD for MCP production use")
        else:
            print(f"⚠️ {best['name']} needs improvement")

        # Specialized vs. general-purpose comparison
        if mcp_specialized and general_models:
            best_specialized = max(mcp_specialized, key=lambda x: x['avg_score'])
            best_general = max(general_models, key=lambda x: x['avg_score'])

            print("\n🔬 SPECIALIZED VS. GENERAL-PURPOSE:")
            print(f" Best MCP-specialized: {best_specialized['name']} ({best_specialized['avg_score']:.1f}/10)")
            print(f" Best general-purpose: {best_general['name']} ({best_general['avg_score']:.1f}/10)")

            if best_specialized['avg_score'] > best_general['avg_score']:
                print(" ✅ MCP-specialized models are better!")
            else:
                print(" General-purpose models perform at least as well")

        # Detailed breakdown for the best model
        print(f"\nDETAILS FOR THE BEST MODEL ({best['name']}):")
        for result in best['results']:
            status = "✅" if result['score'] >= 6 else "⚠️" if result['score'] >= 4 else "❌"
            print(f" {status} {result['test']}: {result['score']}/10 ({result['time']:.2f}s)")

    # Save the full results
    comprehensive_results = {
        'evaluation_type': 'comprehensive_mcp_test',
        'total_models_tested': len(models),
        'successful_models': len(successful_results),
        'results': results,
        'ranking_by_score': sorted_by_score if successful_results else [],
        'ranking_by_efficiency': sorted_by_efficiency if successful_results else [],
        'best_overall': best if successful_results else None,
        'most_efficient': most_efficient if successful_results else None
    }

    with open("comprehensive_mcp_evaluation.json", "w", encoding="utf-8") as f:
        json.dump(comprehensive_results, f, indent=2, ensure_ascii=False)

    print("\nFull results saved to: comprehensive_mcp_evaluation.json")

if __name__ == "__main__":
    main()
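Review note: `evaluate_mcp_response` checks parameters by substring, for example `'150,200' in response`. A well-formed reply such as `{"tool": "move_units", "args": {"target_x": 150, "target_y": 200}}` does not contain that substring, so correct calls can miss the 2 parameter points. The sketch below compares parsed arguments instead. It is an illustration only: the helper name `args_match` and the idea of attaching an expected-arguments dict to each test case are assumptions, since the committed test cases carry only `expected` and `difficulty`.

```python
import json


def args_match(response_text, expected_tool, expected_args):
    """Return True if the reply is a JSON tool call with the expected tool and args.

    Hypothetical helper: extra arguments in the reply are tolerated, but every
    key in `expected_args` must be present with an equal value.
    """
    try:
        call = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or call.get("tool") != expected_tool:
        return False
    args = call.get("args")
    if not isinstance(args, dict):
        return False
    return all(args.get(key) == value for key, value in expected_args.items())


reply = '{"tool": "move_units", "args": {"unit_ids": [1], "target_x": 150, "target_y": 200}}'
print(args_match(reply, "move_units", {"target_x": 150, "target_y": 200}))
```

A structural check like this is stricter than the substring test, so scores produced with it would not be comparable to the saved JSON reports.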
tests/scripts/debug_model_loading.py
ADDED
|
@@ -0,0 +1,120 @@
| 1 |
+
"""
|
| 2 |
+
Script de dรฉbogage pour identifier le problรจme de chargement du modรจle Qwen2.5 0.5B
|
| 3 |
+
"""
|
| 4 |
+
|
| 5 |
+
import sys
|
| 6 |
+
import os
|
| 7 |
+
import traceback
|
| 8 |
+
|
| 9 |
+
print("๐ DรBOGAGE DU CHARGEMENT DU MODรLE QWEN2.5 0.5B")
|
| 10 |
+
print("=" * 60)
|
| 11 |
+
|
| 12 |
+
# Vรฉrifier l'existence du modรจle
|
| 13 |
+
model_path = "qwen2.5-0.5b-instruct-q4_0.gguf"
|
| 14 |
+
if not os.path.exists(model_path):
|
| 15 |
+
print("โ Modรจle non trouvรฉ ร l'emplacement attendu")
|
| 16 |
+
print(f" Recherchรฉ: {os.path.abspath(model_path)}")
|
| 17 |
+
sys.exit(1)
|
| 18 |
+
|
| 19 |
+
print(f"โ
Modรจle trouvรฉ: {os.path.abspath(model_path)}")
|
| 20 |
+
print(f"๐ Taille: {os.path.getsize(model_path) / (1024*1024):.1f} MB")

# Test 1: import llama_cpp
print("\n🧪 TEST 1: Importing llama_cpp")
try:
    import llama_cpp
    print("✅ llama_cpp imported successfully")
    print(f" Version: {llama_cpp.__version__}")
except Exception as e:
    print(f"❌ Import error: {e}")
    traceback.print_exc()
    sys.exit(1)

# Test 2: simple loading test
print("\n🧪 TEST 2: Loading the model directly")
try:
    from llama_cpp import Llama

    # Try several configurations
    configs = [
        {"n_threads": 1, "n_ctx": 2048},
        {"n_threads": 2, "n_ctx": 2048},
        {"n_threads": 1, "n_ctx": 1024},
        {"n_threads": 1, "n_ctx": 512},
    ]

    for i, config in enumerate(configs, 1):
        print(f"\n Configuration {i}: threads={config['n_threads']}, ctx={config['n_ctx']}")
        try:
            llm = Llama(
                model_path=model_path,
                n_ctx=config['n_ctx'],
                n_threads=config['n_threads']
            )
            print(" ✅ Model loaded successfully")

            # Simple inference test
            print(" Inference test...")
            response = llm(
                "Reply with 'TEST_PASSED': ",
                max_tokens=10,
                temperature=0.1,
                echo=False
            )

            # Handle the response (may be an iterator of chunks or a dict)
            if hasattr(response, '__iter__') and not isinstance(response, dict):
                # Streaming result: take the first element
                response = next(iter(response))

            if isinstance(response, dict) and 'choices' in response:
                text = response['choices'][0]['text'].strip()
                print(f" Response: {text}")
                if "TEST_PASSED" in text:
                    print(" ✅ Inference test passed!")
                    break
            else:
                print(f" ⚠️ Unexpected response format: {type(response)}")

        except Exception as e:
            print(f" ❌ Error: {e}")
            traceback.print_exc()

    else:
        # for/else: reached only when no configuration hit the break above
        print("❌ All configurations failed")

except Exception as e:
    print(f"❌ Error while loading: {e}")
    traceback.print_exc()

# Test 3: environment check
print("\n🧪 TEST 3: Environment check")
print(f" Python: {sys.version}")
print(f" Working directory: {os.getcwd()}")
print(f" PATH environment variable: {os.environ.get('PATH', 'Not set')}")

# Test 4: permission check
print("\n🧪 TEST 4: Permission check")
try:
    # Check that the file can be read
    with open(model_path, 'rb') as f:
        header = f.read(100)
    print(f" ✅ File is readable, header: {header[:20]}...")

    # Check the permissions
    import stat
    st = os.stat(model_path)
    permissions = stat.filemode(st.st_mode)
    print(f" Permissions: {permissions}")

    # Check free disk space
    import shutil
    total, used, free = shutil.disk_usage(".")
    print(f" Free disk space: {free / (1024**3):.1f} GB")

except Exception as e:
    print(f" ❌ Read error: {e}")

print("\n" + "=" * 60)
print("DEBUG SUMMARY")
print("=" * 60)
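Note on Test 4: the script reads the first 100 bytes of the model file but only prints them. A minimal sketch of how that header could be checked, assuming the file follows the usual GGUF layout (a 4-byte `GGUF` magic followed by a little-endian uint32 version). `read_gguf_header` is a hypothetical helper and is not part of the script.

```python
import struct

GGUF_MAGIC = b"GGUF"  # assumption: standard GGUF magic bytes


def read_gguf_header(path):
    """Return the GGUF version number, or None if the magic does not match."""
    with open(path, "rb") as f:
        header = f.read(8)
    # A valid header needs the 4-byte magic plus a 4-byte version field.
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version
```

A `None` result would point at a truncated or mis-downloaded file before any time is spent on `Llama(...)` configurations.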
|
tests/scripts/diagnose_mcp_failures.py
ADDED
|
@@ -0,0 +1,182 @@
"""
Diagnosis of the failures of MCP-specialized models
Analyzes why the larger, specialized models fail
"""

import sys
import os

def diagnose_model_failures():
    """Diagnose the problems with MCP-specialized models"""

    print("DIAGNOSIS OF MCP-SPECIALIZED MODEL FAILURES")
    print("=" * 70)

    # Problematic models
    problematic_models = [
        {
            'name': 'MCP-Instruct-v1',
            'path': 'mcp-instruct-v1.Q4_K_M.gguf',
            'issues': ['llama_decode returned -1']
        },
        {
            'name': 'MCPR L-3B-Exa',
            'path': 'mcprl-3b-exa.Q2_K.gguf',
            'issues': ['corrupted text', 'special characters', 'incoherent output']
        },
        {
            'name': 'Gemma-3n-E2B-it',
            'path': 'gemma-3n-E2B-it-UD-IQ2_XXS.gguf',
            'issues': ['empty responses', 'no output']
        }
    ]

    # Analysis of the potential problems
    print("\nANALYSIS OF THE IDENTIFIED PROBLEMS:")

    for model in problematic_models:
        print(f"\n🧩 {model['name']}:")

        if not os.path.exists(model['path']):
            print(f" ❌ Model file not found: {model['path']}")
            continue

        file_size_mb = os.path.getsize(model['path']) / (1024*1024)
        print(f" Size: {file_size_mb:.1f} MB")

        print(" Problems detected:")
        for issue in model['issues']:
            print(f" • {issue}")

    # Potential causes
    print("\nPOTENTIAL CAUSES OF THE FAILURES:")
    print(" 1. INADEQUATE PROMPT FORMAT")
    print(" • MCP-specialized models may expect a different prompt format")
    print(" • Current format is too simple compared with the specialized MCP training format")
    print(" • Special tokens or MCP-specific formatting may be needed")

    print("\n 2. MODEL LOADING PROBLEMS")
    print(" • llama_decode=-1: LLAMA decoder error")
    print(" • Models potentially incompatible with llama-cpp-python")
    print(" • Incorrect initialization parameters")

    print("\n 3. CONTEXT AND SPECIAL TOKENS")
    print(" • MCP models may require specific start/end tokens")
    print(" • Input format differs from what the model expects")
    print(" • A structured MCP conversation format is needed")

    print("\n 4. PROBLEMATIC QUANTIZATION")
    print(" • Q2_K for MCPR L-3B-Exa may be too aggressive")
    print(" • IQ2_XXS for Gemma-3n-E2B-it: very heavy compression")
    print(" • Capability loss under extreme quantization")

    # Diagnostic test with alternative prompts
    print("\nDIAGNOSTIC TEST WITH ALTERNATIVE PROMPTS")
    print("=" * 70)

    from llama_cpp import Llama

    # Alternative prompt formats to test
    prompt_formats = [
        {
            'name': 'Current format',
            'template': '''You are an AI assistant for an RTS game using MCP (Model Context Protocol).

Available tools:
- get_game_state()
- move_units(unit_ids, target_x, target_y)
- attack_unit(attacker_ids, target_id)
- build_building(building_type, position_x, position_y)

User command: "{command}"

Respond with JSON only: {{"tool": "tool_name", "args": {{}}}}'''
        },
        {
            'name': 'Structured MCP format',
            'template': '''<|im_start|>system
You are an AI assistant for an RTS game using the Model Context Protocol (MCP). You must respond with only JSON following this exact format:
{{"tool": "tool_name", "args": {{<arguments>}}}}

Available tools:
- get_game_state(): Get current game state
- move_units(unit_ids, target_x, target_y): Move units to coordinates
- attack_unit(attacker_ids, target_id): Attack a specific unit
- build_building(building_type, position_x, position_y): Construct a building

Always respond with valid JSON, no additional text.<|im_end|>
<|im_start|>user
{command}<|im_end|>
<|im_start|>assistant
'''
        },
        {
            'name': 'Simple JSON format',
            'template': '''User: {command}

Tools: get_game_state, move_units, attack_unit, build_building

Response (JSON only):'''
        }
    ]

    # Test with MCPR L-3B-Exa (the model that outputs corrupted text)
    test_model = 'mcprl-3b-exa.Q2_K.gguf'

    if os.path.exists(test_model):
        print(f"\n🧪 Testing with {test_model}...")

        try:
            llm = Llama(
                model_path=test_model,
                n_ctx=1024,
                n_threads=1,
                verbose=False,
                n_gpu_layers=0
            )

            test_command = "show game state"

            for format_info in prompt_formats:
                print(f"\n Format: {format_info['name']}")

                prompt = format_info['template'].format(command=test_command)

                try:
                    response = llm(
                        prompt,
                        max_tokens=50,
                        temperature=0.1,
                        stop=["</s>", "<|im_end|>", "```"]
                    )

                    response_text = response['choices'][0]['text'].strip()

                    print(f" Prompt (first 100 characters): {prompt[:100]}...")
                    print(f" Response: {response_text[:200]}")

                    # Check whether the response is valid JSON
                    import json
                    try:
                        json.loads(response_text)
                        print(" ✅ VALID JSON!")
                    except json.JSONDecodeError:
                        print(" ❌ INVALID JSON")

                except Exception as e:
                    print(f" ❌ Error: {e}")

        except Exception as e:
            print(f" ❌ Unable to load the model: {e}")

    # Recommendations
    print("\nRECOMMENDATIONS FOR FIXING THE FAILURES:")
    print(" 1. Try different quantization versions (Q4_K_M, Q8_0)")
    print(" 2. Use structured prompt formats with MCP tokens")
    print(" 3. Raise the temperature (0.3-0.5) for more variability")
    print(" 4. Test with a higher max token count (200-300)")
    print(" 5. Check compatibility with llama-cpp-python")
    print(" 6. Use models of similar size but with less compression")

if __name__ == "__main__":
    diagnose_model_failures()
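Note on the JSON check: the script passes the whole completion to `json.loads`, so any text before or after the object makes an otherwise usable tool call count as invalid. A minimal sketch of a more tolerant check, assuming the goal is to recover the first JSON object from a noisy completion. `extract_first_json` is a hypothetical helper, not something the script defines.

```python
import json


def extract_first_json(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace: try the next one.
            start = text.find("{", start + 1)
    return None
```

Reporting both results (strict `json.loads` and this lenient extraction) would help separate a model that produces no tool call at all from one that wraps a valid call in extra text.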
|
tests/scripts/diagnostic_qwen_test.py
ADDED
|
@@ -0,0 +1,227 @@
"""
Diagnostic test to evaluate the MCP capabilities of Qwen2.5 0.5B
Simplified approach to avoid loading problems
"""

import sys
import os
import json
import time

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_model_loading():
    """Simple model loading test"""

    print("MODEL LOADING TEST")
    print("=" * 50)

    # Check that the model exists
    model_path = "qwen2.5-0.5b-instruct-q4_0.gguf"
    if not os.path.exists(model_path):
        print("❌ Model not found")
        return False

    print(f"✅ Model found: {model_path}")
    print(f"Size: {os.path.getsize(model_path) / (1024*1024):.1f} MB")

    # Try to import and initialize
    try:
        from ai_analysis import AIAnalyzer
        print("✅ AIAnalyzer module imported")

        analyzer = AIAnalyzer(model_path)
        print("✅ AIAnalyzer initialized")

        if analyzer.model_available:
            print("✅ Model available according to AIAnalyzer")
            return True
        else:
            print("❌ Model not available according to AIAnalyzer")
            return False

    except Exception as e:
        print(f"❌ Error during initialization: {e}")
        return False

def test_simple_inference():
    """Simple inference test with the model"""

    print("\n🧪 SIMPLE INFERENCE TEST")
    print("=" * 50)

    try:
        # Import llama-cpp-python directly
        from llama_cpp import Llama

        model_path = "qwen2.5-0.5b-instruct-q4_0.gguf"

        print("Loading the model with Llama...")

        # Try several configurations
        for n_threads in [1, 2, 4]:
            try:
                llm = Llama(
                    model_path=model_path,
                    n_ctx=2048,
                    n_threads=n_threads,
                    verbose=False
                )

                print(f"✅ Model loaded with {n_threads} threads")

                # Simple test
                prompt = "Simply reply with 'TEST_PASSED'"

                start_time = time.time()
                response = llm(
                    prompt,
                    max_tokens=10,
                    temperature=0.1
                )
                response_time = time.time() - start_time

                text = response['choices'][0]['text'].strip()

                print(f"⏱️ Response time: {response_time:.2f}s")
                print(f"Response: {text}")

                if "TEST_PASSED" in text:
                    print("✅ Inference test passed")
                    return True
                else:
                    print("⚠️ Unexpected response")

            except Exception as e:
                print(f"❌ Error with {n_threads} threads: {e}")
                continue

        return False

    except Exception as e:
        print(f"❌ Error while importing llama_cpp: {e}")
        return False

def test_mcp_capability_simulation():
    """Simulated test of MCP capabilities (no real model involved)"""

    print("\n🧠 SIMULATED TEST OF MCP CAPABILITIES")
    print("=" * 50)

    # Based on the documentation and specifications of Qwen2.5 0.5B:
    # a theoretical assessment of its capabilities, not a measurement

    capabilities = {
        "simple_instruction_understanding": {
            "score": 8,
            "description": "Good at simple commands such as 'show the game state'",
            "example": "User: 'get game state' → AI: {'tool': 'get_game_state'}"
        },
        "basic_parameter_extraction": {
            "score": 6,
            "description": "Can extract simple parameters such as coordinates",
            "example": "User: 'move to 100,200' → AI: {'tool': 'move_units', 'args': {'target_x': 100, 'target_y': 200}}"
        },
        "multi_step_planning": {
            "score": 4,
            "description": "Limited for complex strategies that require several steps",
            "example": "User: 'build base and defend' → May struggle to decompose"
        },
        "ambiguity_handling": {
            "score": 5,
            "description": "Moderate ability to handle ambiguous instructions",
            "example": "User: 'move to enemy base' → May ask for coordinates"
        },
        "game_context_understanding": {
            "score": 7,
            "description": "Good at understanding basic RTS context",
            "example": "Understands concepts such as 'units', 'buildings', 'resources'"
        }
    }

    total_score = sum(cap["score"] for cap in capabilities.values())
    avg_score = total_score / len(capabilities)

    print("THEORETICAL ASSESSMENT BASED ON THE SPECIFICATIONS:")

    for name, cap in capabilities.items():
        print(f"\n{name.replace('_', ' ').title()}:")
        print(f" Score: {cap['score']}/10")
        print(f" {cap['description']}")
        print(f" Example: {cap['example']}")

    print(f"\nTHEORETICAL AVERAGE SCORE: {avg_score:.1f}/10")

    return avg_score

def generate_assessment():
    """Generate a full assessment"""

    print("ASSESSMENT OF THE MCP CAPABILITIES OF QWEN2.5 0.5B")
    print("=" * 70)

    # Loading test
    model_loaded = test_model_loading()

    # Inference test
    if model_loaded:
        inference_working = test_simple_inference()
    else:
        inference_working = False

    # Theoretical assessment
    theoretical_score = test_mcp_capability_simulation()

    # Final report
    print("\n" + "="*70)
    print("FINAL ASSESSMENT REPORT")
    print("="*70)

    print("\nTECHNICAL STATUS:")
    print(f" Model loaded: {'✅' if model_loaded else '❌'}")
    print(f" Inference working: {'✅' if inference_working else '❌'}")

    print("\n🧠 ESTIMATED MCP CAPABILITIES:")
    print(f" Theoretical score: {theoretical_score:.1f}/10")

    if theoretical_score >= 7:
        print("VERY CAPABLE - Good for most MCP tasks")
    elif theoretical_score >= 5:
        print("CAPABLE - Suited to simple to moderate commands")
    else:
        print("⚠️ LIMITED - Recommended only for very simple tasks")

    print("\nPRACTICAL RECOMMENDATIONS:")

    if not model_loaded or not inference_working:
        print("1. Resolve the technical loading problems first")
        print("2. Check the llama-cpp-python installation")
        print("3. Test with different configurations")

    print("4. Start with very simple MCP commands")
    print("5. Increase the complexity gradually")
    print("6. ✅ Add robust validation")
    print("7. Use structured prompts")

    print("\nIMPLEMENTATION STRATEGY:")

    if theoretical_score >= 6:
        print("• Implement MCP translation with confidence")
        print("• Start with: get_game_state, simple move_units")
        print("• Gradually add: complex parameters, sequences")
    else:
        print("• Use fixed rules for simple commands")
        print("• Reserve Qwen2.5 for analysis rather than translation")
        print("• Consider a larger model for complex tasks")

    # Technical summary
    print("\nTECHNICAL INFORMATION:")
    print("• Model: Qwen2.5-0.5B (0.5 billion parameters)")
    print("• Type: Instruct (optimized for following instructions)")
    print("• Format: GGUF (4-bit quantized)")
    print("• Size: ~409 MB")
    print("• Capability: Good for simple tasks, limited for complex ones")

if __name__ == "__main__":
    generate_assessment()
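Note on the theoretical score: the five capability scores in test_mcp_capability_simulation are constants (8, 6, 4, 5, 7), so the average and the branch taken in generate_assessment are the same on every run, whatever the loading and inference tests report. A short worked sketch of that arithmetic follows. `rating` is a hypothetical helper that mirrors the script's thresholds, and the labels are English renderings of the printed messages.

```python
# Scores hard-coded in test_mcp_capability_simulation (copied from the script)
scores = [8, 6, 4, 5, 7]
avg_score = sum(scores) / len(scores)  # 30 / 5


def rating(avg):
    """Mirror the >= 7 / >= 5 thresholds used in generate_assessment."""
    if avg >= 7:
        return "VERY CAPABLE"
    if avg >= 5:
        return "CAPABLE"
    return "LIMITED"


print(f"{avg_score:.1f}/10 -> {rating(avg_score)}")  # prints "6.0/10 -> CAPABLE"
```

Because 6.0 also meets the `>= 6` check in the implementation-strategy section, the script always recommends implementing MCP translation; a measured score from real model output would be needed for those branches to carry information.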
|
tests/scripts/direct_qwen_mcp_test.py
ADDED
|
@@ -0,0 +1,297 @@
"""
Direct test of the MCP capabilities of Qwen2.5 0.5B
Evaluates understanding of MCP tools and translation of instructions
"""

import sys
import os
import json
import time

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_qwen_mcp_capabilities():
    """Direct test of the MCP capabilities of Qwen2.5 0.5B"""

    print("ASSESSMENT OF THE MCP CAPABILITIES OF QWEN2.5 0.5B")
    print("=" * 70)

    # Check that the model exists
    model_path = "qwen2.5-0.5b-instruct-q4_0.gguf"
    if not os.path.exists(model_path):
        print("❌ Model not found. Download it first.")
        return

    print(f"✅ Model found: {model_path}")
    print(f"Size: {os.path.getsize(model_path) / (1024*1024):.1f} MB\n")

    # Import the AIAnalyzer class
    try:
        from ai_analysis import AIAnalyzer

        # Initialize the analyzer
        print("Initializing the AI analyzer...")
        analyzer = AIAnalyzer(model_path)

        if not analyzer.model_available:
            print("❌ Unable to load the model")
            return

        print("✅ AI analyzer initialized successfully\n")

    except Exception as e:
        print(f"❌ Error during import: {e}")
        return

    # MCP capability tests
    tests = [
        {
            "name": "Basic MCP understanding",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP (Model Context Protocol).

Available MCP tools:
- get_game_state(): Get the current game state
- move_units(unit_ids, target_x, target_y): Move units to a position
- attack_unit(attacker_ids, target_id): Attack an enemy unit

User command: "Show me the game state"

Reply with a JSON object containing the MCP tool call to execute.
""",
            "expected": "get_game_state"
        },
        {
            "name": "Simple action translation",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my infantry to position 150, 200"

Reply with a JSON object containing the tool call with the extracted parameters.
""",
            "expected": "move_units"
        },
        {
            "name": "Complex parameter extraction",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move tanks 1, 3 and 7 to coordinates 120, 80"

Extract the parameters and reply with a tool-call JSON object.
""",
            "expected": "move_units with parameters"
        },
        {
            "name": "Strategic planning",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- get_game_state(): Get the game state
- move_units(): Move units
- attack_unit(): Attack the enemy
- build_building(): Construct a building

User command: "Build a base near the ore and defend it with turrets"

Break this strategy down into a sequence of MCP actions.
""",
            "expected": "sequence of actions"
        },
        {
            "name": "Ambiguity handling",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available MCP tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my units to the enemy base"

How would you handle this ambiguous instruction?
""",
            "expected": "clarification request"
        }
    ]

    results = []

    for i, test in enumerate(tests, 1):
        print(f"\n🧪 TEST {i}: {test['name']}")
        print("-" * 50)

        try:
            start_time = time.time()

            # For this test we take a direct approach and call the
            # analyzer's internal worker function.

            # Build a formatted prompt for the analyzer
            formatted_prompt = f"""
{test['prompt']}

Reply only with a valid JSON object.
"""

            print("Sending the request to the model...")

            # Note: this still has to be adapted to the existing AIAnalyzer
            # interface; for now, use a simplified approach.
            from ai_analysis import _llama_worker
            import multiprocessing as mp

            # Create a queue for the results
            result_queue = mp.Queue()

            # Start the worker
            worker = mp.Process(
                target=_llama_worker,
                args=(result_queue, model_path, formatted_prompt, [], 300, 0.1)
            )
            worker.start()
            worker.join(timeout=30)  # 30-second timeout

            if worker.is_alive():
                worker.terminate()
                response = "TIMEOUT"
            else:
                result = result_queue.get()
                if result['status'] == 'ok':
                    response_data = result.get('data', {})
                    if 'raw' in response_data:
                        response = response_data['raw']
                    else:
                        response = str(response_data)
                else:
                    response = f"ERROR: {result.get('message', 'Unknown error')}"

            response_time = time.time() - start_time

            print(f"⏱️ Response time: {response_time:.2f}s")
            print(f"Response: {response[:200]}...")

            # Score the response
            score = analyze_response(response, test)

            results.append({
                'test': test['name'],
                'response': response,
                'response_time': response_time,
                'score': score,
                'success': score >= 5
            })

            print(f"Score: {score}/10")

        except Exception as e:
            print(f"❌ Error: {e}")
            results.append({
                'test': test['name'],
                'error': str(e),
                'score': 0,
                'success': False
            })

    # Generate the final report
    generate_final_report(results)

def analyze_response(response, test_config):
    """Analyze and score the model's response"""

    if not response or response == "TIMEOUT":
        return 0

    score = 0

    # Check the JSON structure
    try:
        json_data = json.loads(response)
        score += 3  # valid JSON

        # Check for the MCP tool-call fields
        if 'tool' in json_data:
            score += 2
        if 'args' in json_data:
            score += 1

    except json.JSONDecodeError:
        # Not JSON: fall back to looking for tool names in the text
        if 'get_game_state' in response:
            score += 2
        if 'move_units' in response:
            score += 2
        if any(param in response for param in ['unit_ids', 'target_x', 'target_y']):
            score += 1

    # Check semantic consistency
    if 'game' in response.lower() or 'rts' in response.lower():
        score += 1

    # Check relevance to the test
    expected = test_config['expected']
    if expected in response.lower():
        score += 1

    return min(score, 10)
|
| 248 |
+
|
| 249 |
+
def generate_final_report(results):
|
| 250 |
+
"""Gรฉnรฉrer un rapport final d'รฉvaluation"""
|
| 251 |
+
|
| 252 |
+
print("\n" + "="*70)
|
| 253 |
+
print("๐ RAPPORT FINAL D'รVALUATION MCP")
|
| 254 |
+
print("="*70)
|
| 255 |
+
|
| 256 |
+
successful_tests = [r for r in results if r.get('success', False)]
|
| 257 |
+
total_score = sum(r.get('score', 0) for r in results)
|
| 258 |
+
avg_score = total_score / len(results) if results else 0
|
| 259 |
+
|
| 260 |
+
print(f"\n๐ STATISTIQUES:")
|
| 261 |
+
print(f" Tests rรฉussis: {len(successful_tests)}/{len(results)}")
|
| 262 |
+
print(f" Score moyen: {avg_score:.1f}/10")
|
| 263 |
+
|
| 264 |
+
print(f"\n๐ DรTAILS PAR TEST:")
|
| 265 |
+
for result in results:
|
| 266 |
+
status = "โ
" if result.get('success', False) else "โ"
|
| 267 |
+
print(f" {status} {result['test']}: {result.get('score', 0)}/10")
|
| 268 |
+
if 'response_time' in result:
|
| 269 |
+
print(f" โฑ๏ธ Temps: {result['response_time']:.2f}s")
|
| 270 |
+
|
| 271 |
+
print(f"\n๐ฏ รVALUATION FINALE:")
|
| 272 |
+
|
| 273 |
+
if avg_score >= 8:
|
| 274 |
+
print("๐ช EXCELLENT - Qwen2.5 0.5B est trรจs capable pour les tรขches MCP")
|
| 275 |
+
print(" โข Utilisation recommandรฉe pour la traduction MCP")
|
| 276 |
+
print(" โข Bonne comprรฉhension des outils et paramรจtres")
|
| 277 |
+
elif avg_score >= 6:
|
| 278 |
+
print("๐ BON - Capable avec quelques limitations")
|
| 279 |
+
print(" โข Adaptรฉ pour les commandes simples")
|
| 280 |
+
print(" โข Validation recommandรฉe pour les actions complexes")
|
| 281 |
+
elif avg_score >= 4:
|
| 282 |
+
print("โ ๏ธ MODรRร - Limitations significatives")
|
| 283 |
+
print(" โข Utilisation limitรฉe aux commandes trรจs simples")
|
| 284 |
+
print(" โข Validation stricte nรฉcessaire")
|
| 285 |
+
else:
|
| 286 |
+
print("โ FAIBLE - Pas adaptรฉ aux tรขches MCP")
|
| 287 |
+
print(" โข Envisagez un modรจle plus grand")
|
| 288 |
+
print(" โข Utilisez des rรจgles fixes ร la place")
|
| 289 |
+
|
| 290 |
+
print(f"\n๐ก RECOMMANDATIONS:")
|
| 291 |
+
print("1. Commencez par des commandes simples")
|
| 292 |
+
print("2. Ajoutez une validation des actions")
|
| 293 |
+
print("3. Utilisez des prompts structurรฉs")
|
| 294 |
+
print("4. Testez avec diffรฉrents types de commandes")
|
| 295 |
+
|
| 296 |
+
if __name__ == "__main__":
|
| 297 |
+
test_qwen_mcp_capabilities()
|
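`analyze_response` awards the "valid JSON" points only when the whole response parses as JSON, yet small models often wrap the object in extra text. The following is a minimal sketch of one way to recover an embedded object before scoring. The helper name `extract_first_json_object` is hypothetical and not part of the committed script; it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json

def extract_first_json_object(text):
    """Return the first JSON object (dict) embedded in text, or None.

    Hypothetical helper: tries to decode a JSON value at every '{'
    until one decodes to a dict.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None

if __name__ == "__main__":
    reply = 'Sure! {"tool": "get_game_state", "args": {}} Hope this helps.'
    print(extract_first_json_object(reply))
```

Whether a wrapped object should earn the full three points or a reduced score is a scoring-policy choice that the original script does not settle.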
tests/scripts/final_model_comparison.py
ADDED
|
@@ -0,0 +1,263 @@
"""
Final comparative evaluation of all models for MCP
Includes the new Gemma-3-270M, which is smaller and potentially faster
"""

import sys
import os
import json
import time

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_model_comprehensive(model_path, model_name):
    """Comprehensive test of one model on MCP tasks"""

    try:
        from llama_cpp import Llama

        print(f"Testing {model_name}...")

        # Initialize with optimized parameters
        llm = Llama(
            model_path=model_path,
            n_ctx=1024,
            n_threads=1,
            verbose=False
        )

        # Varied MCP tests. Test names and prompts stay in French, as written.
        # These are plain strings (not f-strings), so braces are not doubled.
        tests = [
            {
                "name": "Commande simple",
                "prompt": """Tu es un assistant IA pour un jeu RTS via MCP.

Outils: get_game_state(), move_units(unit_ids, target_x, target_y)

Commande: "Montre-moi l'état du jeu"

Réponds avec JSON: {"tool": "nom_outil", "args": {}}""",
                "expected": "get_game_state"
            },
            {
                "name": "Action avec paramètres",
                "prompt": """Outils: move_units(unit_ids, target_x, target_y)

Commande: "Déplace l'infanterie vers 100,200"

JSON: {"tool": "move_units", "args": {"unit_ids": ["infantry"], "target_x": 100, "target_y": 200}}""",
                "expected": "move_units"
            },
            {
                "name": "Vitesse de réponse",
                "prompt": "Réponds simplement: OK",
                "expected": "OK"
            }
        ]

        total_score = 0
        total_time = 0
        results = []

        for test in tests:
            start_time = time.time()

            response = llm(
                test['prompt'],
                max_tokens=80,
                temperature=0.1,
                stop=["</s>", "<|im_end|>"]
            )

            response_time = time.time() - start_time
            response_text = response['choices'][0]['text'].strip()

            # Score the response
            score = 0

            # Valid JSON (MCP tests only)
            if test['name'] != "Vitesse de réponse":
                try:
                    json.loads(response_text)
                    score += 3
                except json.JSONDecodeError:
                    pass

            # Expected content
            if test['expected'] in response_text:
                score += 4

            # Appropriate format
            if "tool" in response_text and test['name'] != "Vitesse de réponse":
                score += 2

            # Coherence
            if any(word in response_text.lower() for word in ['game', 'move', 'state']):
                score += 1

            score = min(score, 10)

            total_score += score
            total_time += response_time

            results.append({
                'test': test['name'],
                'score': score,
                'time': response_time,
                'response': response_text[:50] + "..." if len(response_text) > 50 else response_text
            })

        avg_score = total_score / len(tests)
        avg_time = total_time / len(tests)

        print(f"✅ {model_name}: {avg_score:.1f}/10 | Time: {avg_time:.2f}s")

        return {
            'name': model_name,
            'avg_score': avg_score,
            'avg_time': avg_time,
            'efficiency': avg_score / avg_time if avg_time > 0 else 0,
            'tests': results
        }

    except Exception as e:
        print(f"❌ {model_name}: Error - {e}")
        return {
            'name': model_name,
            'avg_score': 0,
            'avg_time': 0,
            'efficiency': 0,
            'error': str(e)
        }

def main():
    """Final comparative evaluation"""

    print("FINAL COMPARATIVE MCP EVALUATION")
    print("=" * 60)

    # All models to test
    models = [
        {
            'name': 'Qwen2.5-0.5B',
            'path': 'qwen2.5-0.5b-instruct-q4_0.gguf'
        },
        {
            'name': 'Qwen3-0.6B',
            'path': 'Qwen3-0.6B-Q8_0.gguf'
        },
        {
            'name': 'Gemma-3-1B',
            'path': 'google_gemma-3-1b-it-qat-Q4_0.gguf'
        },
        {
            'name': 'Gemma-3-270M',
            'path': 'gemma-3-270m-it-qat-Q8_0.gguf'
        }
    ]

    # Lookup table so file sizes can be reported by model name
    model_paths = {m['name']: m['path'] for m in models}

    def size_mb(name):
        return os.path.getsize(model_paths[name]) / (1024 * 1024)

    results = []

    for model in models:
        if os.path.exists(model['path']):
            result = test_model_comprehensive(model['path'], model['name'])
            results.append(result)
        else:
            print(f"❌ File not found: {model['path']}")

    # Comparative analysis
    print("\n" + "=" * 60)
    print("FINAL RESULTS")
    print("=" * 60)

    successful_results = [r for r in results if 'error' not in r and r['avg_score'] > 0]

    if successful_results:
        # Ranking by score
        sorted_by_score = sorted(successful_results, key=lambda x: x['avg_score'], reverse=True)

        print("\nRANKING BY PERFORMANCE:")
        for i, result in enumerate(sorted_by_score, 1):
            file_size = size_mb(result['name'])
            print(f"   {i}. {result['name']}: {result['avg_score']:.1f}/10 | {result['avg_time']:.2f}s | {file_size:.0f}MB")

        # Ranking by efficiency (score per second)
        sorted_by_efficiency = sorted(successful_results, key=lambda x: x['efficiency'], reverse=True)

        print("\n⚡ RANKING BY EFFICIENCY:")
        for i, result in enumerate(sorted_by_efficiency, 1):
            file_size = size_mb(result['name'])
            print(f"   {i}. {result['name']}: {result['efficiency']:.2f} score/s | {file_size:.0f}MB")

        # Best overall model
        best_overall = sorted_by_score[0]
        most_efficient = sorted_by_efficiency[0]

        print(f"\n🎯 BEST OVERALL MODEL: {best_overall['name']}")
        print(f"   Score: {best_overall['avg_score']:.1f}/10")
        print(f"   Time: {best_overall['avg_time']:.2f}s")

        print(f"\n⚡ MOST EFFICIENT MODEL: {most_efficient['name']}")
        print(f"   Efficiency: {most_efficient['efficiency']:.2f} score/s")

        # Detailed analysis
        print("\nDETAILED ANALYSIS:")

        for result in successful_results:
            file_size = size_mb(result['name'])
            efficiency_per_mb = result['efficiency'] / file_size if file_size > 0 else 0

            print(f"\n🔹 {result['name']}:")
            print(f"   Average score: {result['avg_score']:.1f}/10")
            print(f"   Average time: {result['avg_time']:.2f}s")
            print(f"   Efficiency: {result['efficiency']:.2f} score/s")
            print(f"   Size: {file_size:.0f}MB")
            print(f"   Efficiency/MB: {efficiency_per_mb:.4f}")

            # Individual tests
            for test in result['tests']:
                status = "✅" if test['score'] >= 6 else "⚠️" if test['score'] >= 4 else "❌"
                print(f"   {status} {test['test']}: {test['score']}/10 ({test['time']:.2f}s)")

        # Final recommendations
        print("\n💡 FINAL RECOMMENDATIONS:")

        if best_overall['avg_score'] >= 7:
            print(f"✅ {best_overall['name']} is EXCELLENT for production MCP use")
        elif best_overall['avg_score'] >= 5:
            print(f"{best_overall['name']} is GOOD for production MCP use")
        else:
            print(f"⚠️ {best_overall['name']} needs improvement")

        # Usage-based recommendation
        print("\n🎯 SPECIFIC RECOMMENDATIONS:")

        if most_efficient['name'] != best_overall['name']:
            print(f"⚡ For fast responses: {most_efficient['name']}")
            print(f"For the best quality: {best_overall['name']}")
        else:
            print(f"{best_overall['name']} is the best choice for both speed AND quality")

        # Check whether Gemma-3-270M is a surprise
        gemma_270m = next((r for r in successful_results if r['name'] == 'Gemma-3-270M'), None)
        if gemma_270m and gemma_270m['avg_score'] >= 5:
            print("Surprise: Gemma-3-270M offers an excellent size/performance ratio!")

    # Save the full results
    final_results = {
        'all_results': results,
        'successful_models': successful_results,
        'ranking_by_score': sorted_by_score if successful_results else [],
        'ranking_by_efficiency': sorted_by_efficiency if successful_results else [],
        'best_overall': best_overall if successful_results else None,
        'most_efficient': most_efficient if successful_results else None
    }

    with open("final_model_comparison.json", "w", encoding="utf-8") as f:
        json.dump(final_results, f, indent=2, ensure_ascii=False)

    print("\nFull results saved to: final_model_comparison.json")

if __name__ == "__main__":
    main()
|
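Doubled braces such as `{{"tool": ...}}` collapse to single braces only inside an f-string or a `str.format` template. In a plain string literal they reach the model verbatim. The sketch below shows one way to keep a single template and fill in the command with `str.format`. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced for illustration, not part of the committed script.

```python
# Hypothetical template: '{{' and '}}' are str.format escapes for literal braces,
# while '{command}' is the only real placeholder.
PROMPT_TEMPLATE = (
    'Outils: get_game_state(), move_units(unit_ids, target_x, target_y)\n\n'
    'Commande: "{command}"\n\n'
    'Réponds avec JSON: {{"tool": "nom_outil", "args": {{}}}}'
)

def build_prompt(command):
    """Fill in the user command; str.format turns '{{' into '{' and '}}' into '}'."""
    return PROMPT_TEMPLATE.format(command=command)

if __name__ == "__main__":
    # The printed prompt ends with single-brace JSON.
    print(build_prompt("Montre-moi l'état du jeu"))
```

Using one template for every command keeps the escaping rule in a single place, which makes it harder for a literal `{{` to slip into a prompt unnoticed.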
tests/scripts/investigate_mcp_failures.py
ADDED
|
@@ -0,0 +1,181 @@
"""
Detailed investigation of the failures of MCP-specialized models
Analyzes why MCP-Instruct-v1, Gemma-3n-E2B-it and MCPR L-3B-Exa fail
"""

import sys
import os
import json
import time
from llama_cpp import Llama

def analyze_model_failures():
    """Analyze the MCP model failures in detail"""

    print("DETAILED INVESTIGATION OF MCP FAILURES")
    print("=" * 70)

    # Problematic models with detailed analysis
    problematic_models = [
        {
            'name': 'MCP-Instruct-v1',
            'path': 'mcp-instruct-v1.Q4_K_M.gguf',
            'size_mb': 697.0,
            'issues': ['llama_decode returned -1'],
            'hypothesis': 'Technical incompatibility with llama-cpp-python'
        },
        {
            'name': 'MCPR L-3B-Exa',
            'path': 'mcprl-3b-exa.Q2_K.gguf',
            'size_mb': 1215.7,
            'issues': ['corrupted text', 'special characters', 'incoherent output'],
            'hypothesis': 'Q2_K quantization too aggressive for this model'
        },
        {
            'name': 'Gemma-3n-E2B-it',
            'path': 'gemma-3n-E2B-it-UD-IQ2_XXS.gguf',
            'size_mb': 1958.3,
            'issues': ['empty responses', 'no output'],
            'hypothesis': 'Prompt format incompatible with the model'
        }
    ]

    # Detailed technical analysis
    print("\nDETAILED TECHNICAL ANALYSIS:")

    for model in problematic_models:
        print(f"\n🧩 {model['name']}:")
        print(f"   Size: {model['size_mb']:.1f} MB")
        print(f"   Issues: {', '.join(model['issues'])}")
        print(f"   💭 Hypothesis: {model['hypothesis']}")

        if not os.path.exists(model['path']):
            print("   ❌ File not found")
            continue

        # Basic load test
        print("   Load test...")
        try:
            llm = Llama(
                model_path=model['path'],
                n_ctx=2048,
                n_threads=1,
                verbose=False,
                n_gpu_layers=0
            )
            print("   ✅ Model loaded successfully")

            # Very simple generation test
            print("   🧪 Basic generation test...")
            try:
                response = llm(
                    "Hello",
                    max_tokens=10,
                    temperature=0.1
                )

                # Extract the response text, whatever shape the result has
                response_text = str(response)
                if hasattr(response, 'choices') and response.choices:
                    response_text = response.choices[0].text if hasattr(response.choices[0], 'text') else str(response.choices[0])
                elif isinstance(response, dict) and 'choices' in response:
                    choice = response['choices'][0]
                    response_text = choice.get('text', str(choice))

                print(f"   Response: '{response_text.strip()}'")

            except Exception as e:
                print(f"   ❌ Generation error: {e}")

        except Exception as e:
            print(f"   ❌ Load error: {e}")

    # Comparison with a model that works
    print("\nCOMPARISON WITH A WORKING MODEL:")

    working_model = 'qwen2.5-coder-0.5b-instruct-q4_0.gguf'
    if os.path.exists(working_model):
        print("\n✅ Qwen2.5-Coder-0.5B (works):")

        try:
            llm = Llama(
                model_path=working_model,
                n_ctx=1024,
                n_threads=1,
                verbose=False,
                n_gpu_layers=0
            )

            # Test with the same MCP prompt
            prompt = '''You are an AI assistant for an RTS game using MCP (Model Context Protocol).

Available tools:
- get_game_state()
- move_units(unit_ids, target_x, target_y)
- attack_unit(attacker_ids, target_id)
- build_building(building_type, position_x, position_y)

User command: "show game state"

Respond with JSON only: {"tool": "tool_name", "args": {}}'''

            response = llm(
                prompt,
                max_tokens=50,
                temperature=0.1
            )

            # Extract the response text, whatever shape the result has
            response_text = str(response)
            if hasattr(response, 'choices') and response.choices:
                response_text = response.choices[0].text if hasattr(response.choices[0], 'text') else str(response.choices[0])
            elif isinstance(response, dict) and 'choices' in response:
                choice = response['choices'][0]
                response_text = choice.get('text', str(choice))

            print(f"   Response: {response_text[:100]}...")

            # Check the JSON
            try:
                json.loads(response_text)
                print("   ✅ Valid JSON")
            except json.JSONDecodeError:
                print("   ❌ Invalid JSON")

        except Exception as e:
            print(f"   ❌ Error: {e}")

    # Conclusions and recommendations
    print("\n🎯 CONCLUSIONS AND RECOMMENDATIONS:")
    print("=" * 70)

    print("\n1. MCP-INSTRUCT-V1 (Q4_K_M):")
    print("   ❌ Problem: technical error 'llama_decode returned -1'")
    print("   💡 Solution: try the Q8_0 version or check llama-cpp-python compatibility")

    print("\n2. MCPR L-3B-EXA (Q2_K):")
    print("   ❌ Problem: corrupted text with special characters")
    print("   💡 Solution: Q2_K quantization is too aggressive; try Q4_K_M or Q8_0")

    print("\n3. GEMMA-3N-E2B-IT (IQ2_XXS):")
    print("   ❌ Problem: empty responses; the model does not answer")
    print("   💡 Solution: IQ2_XXS quantization is extreme; try a less compressed version")

    print("\n4. WHY CODE MODELS WORK BETTER:")
    print("   ✅ Used to JSON and data structures")
    print("   ✅ Better grasp of structured formats")
    print("   ✅ Trained on code and technical data")

    print("\nACTION PLAN:")
    print("   1. Download Q8_0 versions of the problematic models")
    print("   2. Test with MCP-specific prompt formats")
    print("   3. Increase the context size (n_ctx) for larger models")
    print("   4. Use a slightly higher temperature (0.3)")

    print("\nMODELS TO DOWNLOAD:")
    print("   • mcp-instruct-v1.Q8_0.gguf")
    print("   • mcprl-3b-exa.Q8_0.gguf")
    print("   • google_gemma-3n-E2B-it-Q8_0.gguf")

if __name__ == "__main__":
    analyze_model_failures()
|
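The response-extraction logic appears twice in `investigate_mcp_failures.py`. A small helper is one possible way to factor it out. This sketch assumes the completion comes back as a dict with a `choices` list whose entries carry a `text` field, which is the shape the other scripts in this commit index into. The name `completion_text` is hypothetical.

```python
def completion_text(response):
    """Best-effort extraction of the generated text from a completion result.

    Hypothetical helper. Assumes a dict shaped like
    {"choices": [{"text": "..."}]} and falls back to str(response) otherwise.
    """
    if isinstance(response, dict):
        choices = response.get("choices") or []
        if choices and isinstance(choices[0], dict):
            return str(choices[0].get("text", "")).strip()
    return str(response).strip()

if __name__ == "__main__":
    fake = {"choices": [{"text": '  {"tool": "get_game_state", "args": {}}  '}]}
    print(completion_text(fake))
```

Both call sites could then read `response_text = completion_text(response)`, so a future change to the response shape would need fixing in only one place.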
tests/scripts/practical_mcp_test.py
ADDED
|
@@ -0,0 +1,197 @@
"""
Practical MCP test to evaluate the real-world capabilities of the models
Simulates real use in an RTS game with typical commands
"""

import sys
import os
import json
import re
import time

sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_model_practical(model_path, model_name):
    """Practical test with real RTS commands"""

    try:
        from llama_cpp import Llama

        print(f"🎮 Practical test of {model_name}...")

        llm = Llama(
            model_path=model_path,
            n_ctx=1024,
            n_threads=1,
            verbose=False
        )

        # Typical RTS commands a user would type
        real_commands = [
            "show game state",
            "move infantry to 100,200",
            "attack enemy tank at 300,150",
            "build barracks near 250,200",
            "defend base with all units",
            "harvest resources with harvester",
            "scout enemy positions"
        ]

        results = []

        for command in real_commands:
            prompt = f"""You control an RTS game via MCP. Available tools:
- get_game_state()
- move_units(unit_ids, target_x, target_y)
- attack_unit(attacker_ids, target_id)
- build_building(building_type, position_x, position_y)

User command: "{command}"

Respond with JSON: {{"tool": "tool_name", "args": {{}}}}"""

            start_time = time.time()

            response = llm(
                prompt,
                max_tokens=100,
                temperature=0.1
            )

            response_time = time.time() - start_time

            # Extract and evaluate
            try:
                response_text = response['choices'][0]['text'].strip()
            except (KeyError, IndexError, TypeError):
                response_text = str(response)

            # Simple but realistic scoring
            score = 0

            # Valid JSON?
            try:
                json.loads(response_text)
                score += 3
            except json.JSONDecodeError:
                # Look for a JSON-like fragment in the text
                if re.search(r'\{.*\}', response_text):
                    score += 1

            # Right tool?
            if "game state" in command and "get_game_state" in response_text:
                score += 3
            elif "move" in command and "move_units" in response_text:
                score += 3
            elif "attack" in command and "attack_unit" in response_text:
                score += 3
            elif "build" in command and "build_building" in response_text:
                score += 3
            elif any(tool in response_text for tool in ['get_game_state', 'move_units', 'attack_unit', 'build_building']):
                score += 1

            # Reasonable parameters?
            # Note: this substring check only matches coordinates written as "x,y" with no space
            if "100,200" in response_text or "300,150" in response_text or "250,200" in response_text:
                score += 2

            score = min(score, 10)

            print(f"   '{command}' → {score}/10 ({response_time:.2f}s)")

            results.append({
                'command': command,
                'score': score,
                'time': response_time,
                'response': response_text[:50]
            })

        avg_score = sum(r['score'] for r in results) / len(results)
        avg_time = sum(r['time'] for r in results) / len(results)

        print(f"{model_name}: {avg_score:.1f}/10 avg | {avg_time:.2f}s avg")

        return {
            'name': model_name,
            'avg_score': avg_score,
            'avg_time': avg_time,
            'results': results
        }

    except Exception as e:
        print(f"❌ Error in {model_name}: {e}")
        return None

def main():
    """Comparative practical test"""

    print("🎯 PRACTICAL MCP TEST - REAL RTS COMMANDS")
    print("=" * 50)
    print("Simulating real use with typical commands")
    print("=" * 50)

    models = [
        {'name': 'Qwen2.5-0.5B', 'path': 'qwen2.5-0.5b-instruct-q4_0.gguf'},
        {'name': 'Qwen3-0.6B', 'path': 'Qwen3-0.6B-Q8_0.gguf'},
        {'name': 'Gemma-3-270M', 'path': 'gemma-3-270m-it-qat-Q8_0.gguf'}
    ]

    results = []

    for model in models:
        if os.path.exists(model['path']):
            result = test_model_practical(model['path'], model['name'])
            if result:
                results.append(result)
            print()

    # Practical analysis
    if results:
        print("PRACTICAL RESULTS:")
        print("-" * 30)

        sorted_results = sorted(results, key=lambda x: x['avg_score'], reverse=True)

        for i, result in enumerate(sorted_results, 1):
            print(f"{i}. {result['name']}: {result['avg_score']:.1f}/10")

        # Practical recommendation
        best = sorted_results[0]

        print("\n🎯 PRACTICAL RECOMMENDATION:")

        if best['avg_score'] >= 7:
            print(f"✅ {best['name']} is EXCELLENT for production")
            print("   → Handles typical RTS commands well")
        elif best['avg_score'] >= 5:
            print(f"{best['name']} is USABLE with validation")
            print("   → Works for simple commands")
        else:
            print(f"⚠️ {best['name']} needs improvement")
            print("   → Consider more specific prompts")

        # Analysis of the commands that succeeded
        print("\nBEST-HANDLED COMMANDS:")

        for result in results:
            best_commands = [r for r in result['results'] if r['score'] >= 7]
            if best_commands:
                print(f"\n{result['name']}:")
                for cmd in best_commands[:3]:  # Top 3
                    print(f"   • {cmd['command']}: {cmd['score']}/10")

    # Conclusion on the relevance of this test
    print("\nRELEVANCE OF THIS TEST:")
    print("✅ This test is MUCH more representative:")
    print("   • Real user commands")
    print("   • Expected JSON output format")
    print("   • Realistic response times")
    print("   • No complex artificial prompts")

    # Save
    with open("practical_mcp_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)

    print("\nResults saved: practical_mcp_results.json")

if __name__ == "__main__":
    main()
|
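The parameter check in `practical_mcp_test.py` looks for substrings such as `"100,200"`. A well-formed tool call carries the coordinates as separate JSON fields, so that substring normally never appears. The sketch below compares the parsed arguments instead. The function name `args_match_target` is hypothetical, and the key names are taken from the tool signatures listed in the prompt (`target_x`/`target_y`, `position_x`/`position_y`).

```python
import json

def args_match_target(response_text, x, y):
    """Return True if the JSON tool call in response_text targets (x, y).

    Hypothetical helper: accepts either target_x/target_y or
    position_x/position_y, matching the tools named in the prompt.
    """
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    args = data.get("args") if isinstance(data, dict) else None
    if not isinstance(args, dict):
        return False
    for x_key, y_key in (("target_x", "target_y"), ("position_x", "position_y")):
        if args.get(x_key) == x and args.get(y_key) == y:
            return True
    return False

if __name__ == "__main__":
    reply = '{"tool": "move_units", "args": {"unit_ids": ["infantry"], "target_x": 100, "target_y": 200}}'
    print(args_match_target(reply, 100, 200))
```

This only works when the whole response is valid JSON. Whether to combine it with a more lenient text fallback is left open here.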
tests/scripts/prompt_format_investigation.py
ADDED
|
@@ -0,0 +1,156 @@
"""
Investigation of prompt formats for MCP-specialized models.
Tests different formats to identify compatibility problems.
"""

import os
import json
import re

def test_prompt_formats():
    """Test different prompt formats to diagnose problems."""

    print("🧪 MCP PROMPT FORMAT INVESTIGATION")
    print("=" * 60)

    # Prompt formats to test. Literal braces are doubled because each
    # template is passed through str.format() below.
    prompt_formats = [
        {
            'name': 'Current format (simple)',
            'template': '''You are an AI assistant for an RTS game using MCP (Model Context Protocol).

Available tools:
- get_game_state()
- move_units(unit_ids, target_x, target_y)
- attack_unit(attacker_ids, target_id)
- build_building(building_type, position_x, position_y)

User command: "{command}"

Respond with JSON only: {{"tool": "tool_name", "args": {{}}}}'''
        },
        {
            'name': 'Structured MCP format',
            'template': '''<|im_start|>system
You are an MCP assistant for an RTS game. Respond with ONLY JSON.
Available tools: get_game_state, move_units, attack_unit, build_building<|im_end|>
<|im_start|>user
{command}<|im_end|>
<|im_start|>assistant
'''
        },
        {
            'name': 'Instruct format',
            'template': '''[INST] You are an MCP assistant. Respond with ONLY JSON format:
{{"tool": "tool_name", "args": {{}}}}

Command: {command} [/INST]'''
        },
        {
            'name': 'Code format',
            'template': '''// MCP assistant for RTS game
// Tools: get_game_state(), move_units(), attack_unit(), build_building()
// Command: {command}
// Response:'''
        },
        {
            'name': 'Minimal format',
            'template': '''Tools: get_game_state, move_units, attack_unit, build_building
Command: {command}
JSON:'''
        }
    ]

    test_command = "show game state"

    print(f"\nTEST COMMAND: '{test_command}'")
    print("-" * 40)

    for i, format_info in enumerate(prompt_formats, 1):
        prompt = format_info['template'].format(command=test_command)

        print(f"\n{i}. {format_info['name']}:")
        print(f"   Prompt (start): {prompt[:80]}...")

        # Analyze the prompt structure
        lines = prompt.count('\n') + 1
        json_instructions = prompt.count('JSON')
        tool_mentions = prompt.count('get_game_state') + prompt.count('move_units') + \
                        prompt.count('attack_unit') + prompt.count('build_building')

        print(f"   Analysis: {lines} lines, {json_instructions} JSON refs, {tool_mentions} tool refs")

        # Check for key elements
        has_json_example = '{"tool":' in prompt
        has_tool_list = 'Available tools:' in prompt or 'Tools:' in prompt
        has_system_prompt = '<|im_start|>system' in prompt or '[INST]' in prompt

        print(f"   ✅ Elements: JSON example={has_json_example}, Tool list={has_tool_list}, System={has_system_prompt}")

def analyze_model_specific_requirements():
    """Summarize the specific requirements of the MCP models."""

    print("\nSPECIFIC REQUIREMENTS OF MCP MODELS")
    print("=" * 60)

    model_requirements = {
        'MCP-Instruct-v1': {
            'expected_format': 'Structured conversation format with MCP tokens',
            'common_issues': ['Needs specific tokens', 'Rigid prompt format'],
            'recommendation': 'Use the <|im_start|>system/user/assistant format'
        },
        'MCPRL-3B-Exa': {
            'expected_format': 'Instruct format with MCP context',
            'common_issues': ['Aggressive quantization', 'Needs extended context'],
            'recommendation': 'Increase n_ctx to 4096+ and use Q8_0'
        },
        'Gemma-3n-E2B-it': {
            'expected_format': 'Structured technical format',
            'common_issues': ['Extreme IQ2_XXS quantization', 'Incompatible format'],
            'recommendation': 'Use the Q8_0 version and a code-oriented format'
        }
    }

    for model_name, info in model_requirements.items():
        print(f"\n🧩 {model_name}:")
        print(f"   Expected format: {info['expected_format']}")
        print(f"   ⚠️ Common issues: {', '.join(info['common_issues'])}")
        print(f"   💡 Recommendation: {info['recommendation']}")

def main():
    """Main investigation."""

    print("🔬 INVESTIGATION OF MCP FORMAT PROBLEMS")
    print("=" * 70)

    test_prompt_formats()
    analyze_model_specific_requirements()

    print("\n🎯 CONCLUSIONS AND SOLUTIONS:")
    print("=" * 70)

    print("\n1. PROBLEMS IDENTIFIED:")
    print("   • Prompt format incompatible with the specialized models")
    print("   • Quantization too aggressive (Q2_K, IQ2_XXS)")
    print("   • MCP-specific tokens missing from the prompts")
    print("   • Insufficient context for larger models")

    print("\n2. RECOMMENDED SOLUTIONS:")
    print("   • Download Q8_0 versions of the problematic models")
    print("   • Use structured formats with MCP tokens")
    print("   • Increase n_ctx to 4096 for 3B+ models")
    print("   • Add complete JSON examples to the prompts")

    print("\n3. OPTIMAL PROMPT FORMATS:")
    print("   • MCP-Instruct-v1: <|im_start|>system/user/assistant with tokens")
    print("   • MCPRL-3B-Exa: Instruct format with extended context")
    print("   • Gemma-3n-E2B-it: Structured code-oriented format")

    print("\nNEXT STEPS:")
    print("   1. Download mcp-instruct-v1.Q8_0.gguf")
    print("   2. Download mcprl-3b-exa.Q8_0.gguf")
    print("   3. Download google_gemma-3n-E2B-it-Q8_0.gguf")
    print("   4. Test with optimized prompt formats")

if __name__ == "__main__":
    main()
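The script recommends the `<|im_start|>system/user/assistant` layout for MCP-Instruct-v1. A minimal sketch of assembling such a prompt from a system message and a user command follows. The helper name `build_chatml_prompt` is hypothetical, and whether a given GGUF model actually expects this layout is an assumption that should be checked against its model card.

```python
def build_chatml_prompt(system: str, user: str) -> str:
    """Assemble a ChatML-style prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    prompt = build_chatml_prompt(
        "You are an MCP assistant for an RTS game. Respond with ONLY JSON.",
        "show game state",
    )
    print(prompt)
```

Building the prompt from parts, rather than from a `str.format` template, sidesteps brace escaping when the system message itself contains a JSON example.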
|
tests/scripts/quick_model_comparison.py
ADDED
|
@@ -0,0 +1,176 @@
"""
Quick comparison test of models for MCP tasks.
Focuses on the most important tests.
"""

import sys
import os
import json
import time

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_model(model_path, model_name):
    """Quick test of a single model."""

    try:
        from llama_cpp import Llama

        print(f"Testing {model_name}...")

        # Initialize with faster settings
        llm = Llama(
            model_path=model_path,
            n_ctx=1024,
            n_threads=1,
            verbose=False
        )

        # Simple MCP comprehension test. This string is not passed through
        # str.format(), so the JSON example uses single braces.
        prompt = """You are an AI assistant for an RTS game via MCP.

Tools: get_game_state(), move_units(unit_ids, target_x, target_y)

Command: "Show me the game state"

Respond with JSON: {"tool": "tool_name", "args": {}}"""

        start_time = time.time()

        response = llm(
            prompt,
            max_tokens=100,
            temperature=0.1,
            stop=["</s>", "<|im_end|>"]
        )

        response_time = time.time() - start_time
        response_text = response['choices'][0]['text'].strip()

        # Score the response
        score = 0

        # Valid JSON (only succeeds if the response is nothing but JSON)
        try:
            json.loads(response_text)
            score += 3
        except json.JSONDecodeError:
            pass

        # Correct tool
        if "get_game_state" in response_text:
            score += 4

        # Structure
        if "tool" in response_text:
            score += 2

        # Coherence
        if "game" in response_text.lower():
            score += 1

        score = min(score, 10)

        print(f"✅ {model_name}: {score}/10 | Time: {response_time:.2f}s")
        print(f"   Response: {response_text[:100]}...")

        return {
            'name': model_name,
            'score': score,
            'time': response_time,
            'response': response_text
        }

    except Exception as e:
        print(f"❌ {model_name}: Error - {e}")
        return {
            'name': model_name,
            'score': 0,
            'time': 0,
            'error': str(e)
        }

def main():
    """Quick comparative test."""

    print("QUICK MCP COMPARISON TEST")
    print("=" * 50)

    # Models to test
    models = [
        {
            'name': 'Qwen2.5-0.5B',
            'path': 'qwen2.5-0.5b-instruct-q4_0.gguf'
        },
        {
            'name': 'Qwen3-0.6B',
            'path': 'Qwen3-0.6B-Q8_0.gguf'
        },
        {
            'name': 'Gemma-3-1B',
            'path': 'google_gemma-3-1b-it-qat-Q4_0.gguf'
        }
    ]

    results = []

    for model in models:
        if os.path.exists(model['path']):
            result = test_model(model['path'], model['name'])
            results.append(result)
        else:
            print(f"❌ File not found: {model['path']}")

    # Results
    print("\n" + "=" * 50)
    print("COMPARATIVE RESULTS")
    print("=" * 50)

    # Ranking
    sorted_results = sorted(results, key=lambda x: x['score'], reverse=True)

    print("\nRANKING:")
    for i, result in enumerate(sorted_results, 1):
        if 'error' not in result:
            print(f"   {i}. {result['name']}: {result['score']}/10 ({result['time']:.2f}s)")
        else:
            print(f"   {i}. {result['name']}: FAILED")

    # Analysis. Filter the ranked list (not the test-order list) so that
    # index 0 really is the highest-scoring model.
    successful_results = [r for r in sorted_results if 'error' not in r and r['score'] > 0]

    if successful_results:
        best_model = successful_results[0]

        print(f"\n🎯 BEST MODEL: {best_model['name']}")
        print(f"   Score: {best_model['score']}/10")
        print(f"   Time: {best_model['time']:.2f}s")

        # Recommendations
        if best_model['score'] >= 7:
            print(f"\n✅ RECOMMENDATION: {best_model['name']} is EXCELLENT for MCP")
        elif best_model['score'] >= 5:
            print(f"\nRECOMMENDATION: {best_model['name']} is GOOD for MCP")
        else:
            print(f"\n⚠️ RECOMMENDATION: {best_model['name']} is LIMITED for MCP")

        # Performance vs. size
        print("\nPERFORMANCE:")
        for result in successful_results:
            efficiency = result['score'] / result['time'] if result['time'] > 0 else 0
            file_size = os.path.getsize([m['path'] for m in models if m['name'] == result['name']][0]) / (1024*1024)
            print(f"   {result['name']}: {efficiency:.2f} score/s | {file_size:.0f} MB")

    # Save
    with open("quick_model_comparison.json", "w", encoding="utf-8") as f:
        json.dump({
            'results': results,
            'ranking': sorted_results,
            'best_model': successful_results[0]['name'] if successful_results else None
        }, f, indent=2, ensure_ascii=False)

    print("\nResults saved to: quick_model_comparison.json")

if __name__ == "__main__":
    main()
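`json.loads(response_text)` only succeeds when the model emits nothing but JSON, so a correct tool call wrapped in extra text is scored as invalid. One possible mitigation, sketched here under the assumption that the first JSON object in the response is the intended tool call, is to try `json.JSONDecoder.raw_decode` at each `{`. The helper name `extract_first_json_object` is illustrative and does not appear in the original scripts.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None
```

With such a helper, the JSON check in `test_model` could award its points when an object is found, which separates "wrong answer" from "right answer with chatter around it". Whether that leniency is desirable depends on how strictly the game's MCP client parses model output.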
|
tests/scripts/qwen_mcp_capability_assessment.py
ADDED
|
@@ -0,0 +1,367 @@
"""
Full evaluation of the MCP capabilities of Qwen2.5 0.5B.
Tests understanding of MCP tools, instruction translation, and planning.
"""

import json
import time
import sys
import os

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_model_availability():
    """Check whether the model file is available."""
    print("Checking model availability...")

    # Check whether the model file exists
    model_path = "qwen2.5-0.5b-instruct-q4_0.gguf"
    if os.path.exists(model_path):
        print(f"✅ Model found: {model_path}")
        file_size = os.path.getsize(model_path) / (1024 * 1024)  # MB
        print(f"Model size: {file_size:.1f} MB")
        return True
    else:
        print(f"❌ Model not found: {model_path}")
        print("   The model must be downloaded before running the tests")
        return False

def test_basic_mcp_comprehension():
    """Basic MCP comprehension tests."""

    print("\n🧪 TEST 1: Understanding MCP tools")
    print("-" * 50)

    tests = [
        {
            "name": "Simple tools",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available tools:
- get_game_state(): Get the game state
- move_units(unit_ids, target_x, target_y): Move units

User command: "Show me the game state"

Respond with a JSON object containing the tool call to execute.
""",
            "expected_tool": "get_game_state"
        },
        {
            "name": "Parameter extraction",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my tanks to position 150, 75"

Respond with a JSON object containing the tool call with the extracted parameters.
""",
            "expected_tool": "move_units",
            "expected_params": ["unit_ids", "target_x", "target_y"]
        }
    ]

    return tests

def test_complex_planning():
    """Complex planning tests."""

    print("\n🧪 TEST 2: Planning and strategy")
    print("-" * 50)

    tests = [
        {
            "name": "Multi-step strategy",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available tools:
- get_game_state(): Get the game state
- move_units(): Move units
- attack_unit(): Attack an enemy unit
- build_building(): Construct a building

User command: "Build a base near the ore and defend it"

Break this strategy down into a sequence of MCP tool calls.
Respond with a JSON array of actions.
""",
            "expected_sequence": True
        },
        {
            "name": "Tactical analysis",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available tools:
- get_game_state(): Get the game state
- get_ai_analysis(language): Get a tactical analysis

User command: "Analyze the battlefield and suggest a strategy"

Which MCP tools would you use, and in what order?
""",
            "expected_analysis": True
        }
    ]

    return tests

def test_error_handling():
    """Error and ambiguity handling tests."""

    print("\n🧪 TEST 3: Error handling")
    print("-" * 50)

    tests = [
        {
            "name": "Ambiguous instructions",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my units to the enemy base"

How would you handle this ambiguous instruction?
""",
            "expected_handling": "clarification"
        },
        {
            "name": "Missing parameters",
            "prompt": """
You are an AI assistant that controls an RTS game via MCP.

Available tools:
- attack_unit(attacker_ids, target_id): Attack an enemy unit

User command: "Attack the enemy"

What should be done if the required parameters are not provided?
""",
            "expected_handling": "parameter_request"
        }
    ]

    return tests

def run_llm_test(prompt, test_name):
    """Run one test against the LLM."""

    try:
        from ai_analysis import query_llm

        print(f"\nTest: {test_name}")
        print(f"Prompt: {prompt[:100]}...")

        start_time = time.time()

        response = query_llm(
            prompt=prompt,
            max_tokens=300,
            temperature=0.1,
            system_message="You are an AI assistant specialized in RTS games and the MCP protocol."
        )

        response_time = time.time() - start_time

        print(f"⏱️ Response time: {response_time:.2f}s")
        print(f"Response: {response[:200]}...")

        return {
            "success": True,
            "response": response,
            "response_time": response_time,
            "error": None
        }

    except Exception as e:
        print(f"❌ Error: {e}")
        return {
            "success": False,
            "response": None,
            "response_time": 0,
            "error": str(e)
        }

def analyze_response(response, test_config):
    """Score the model's response.

    Note: test_config (expected_tool, expected_params, ...) is accepted but
    not used; scoring relies on substring checks only. The checks below add
    up to at most 7 points, so the cap at 10 never takes effect.
    """

    analysis = {
        "score": 0,
        "strengths": [],
        "weaknesses": [],
        "details": {}
    }

    if not response:
        analysis["weaknesses"].append("Empty response")
        return analysis

    # Check the JSON structure
    try:
        json.loads(response)
        analysis["score"] += 3
        analysis["strengths"].append("Valid JSON")
        analysis["details"]["json_valid"] = True
    except (json.JSONDecodeError, TypeError):
        analysis["details"]["json_valid"] = False

    # Check for MCP tool names
    mcp_tools = ["get_game_state", "move_units", "attack_unit", "build_building", "get_ai_analysis"]
    tools_found = [tool for tool in mcp_tools if tool in response]

    if tools_found:
        analysis["score"] += 2
        analysis["strengths"].append(f"MCP tools identified: {', '.join(tools_found)}")

    # Check semantic coherence
    if "game" in response.lower() or "rts" in response.lower() or "mcp" in response.lower():
        analysis["score"] += 1
        analysis["strengths"].append("RTS/MCP context understood")

    # Check for parameters
    if "unit_ids" in response or "target_x" in response or "target_y" in response:
        analysis["score"] += 1
        analysis["strengths"].append("Parameters identified")

    analysis["score"] = min(analysis["score"], 10)

    return analysis

def generate_assessment_report(results):
    """Generate a full assessment report."""

    print("\n" + "="*70)
    print("MCP CAPABILITY ASSESSMENT REPORT")
    print("="*70)

    total_score = 0
    total_tests = 0

    for category, category_results in results.items():
        print(f"\n{category.upper()}:")

        category_score = 0
        category_count = 0

        for test_name, result in category_results.items():
            if result["success"]:
                analysis = result["analysis"]
                score = analysis["score"]
                category_score += score
                category_count += 1

                print(f"   {test_name}: {score}/10")
                if analysis["strengths"]:
                    print(f"      ✅ Strengths: {', '.join(analysis['strengths'][:2])}")
                if analysis["weaknesses"]:
                    print(f"      ⚠️ Weaknesses: {', '.join(analysis['weaknesses'][:2])}")
            else:
                print(f"   {test_name}: ❌ FAILED - {result['error']}")

        if category_count > 0:
            avg_score = category_score / category_count
            total_score += avg_score
            total_tests += 1
            print(f"   Average score: {avg_score:.1f}/10")

    # Overall score (mean of the per-category averages)
    if total_tests > 0:
        overall_score = total_score / total_tests

        print(f"\n🎯 OVERALL SCORE: {overall_score:.1f}/10")

        # Recommendations based on the score
        print("\n💡 RECOMMENDATIONS:")

        if overall_score >= 8:
            print("✅ EXCELLENT - Qwen2.5 0.5B is very capable for MCP tasks")
            print("   • Use it with confidence for MCP translation")
            print("   • Implement light validation for safety")
        elif overall_score >= 6:
            print("GOOD - Qwen2.5 0.5B is capable, with some limitations")
            print("   • Use it for simple commands")
            print("   • Add robust validation")
            print("   • Plan fallback mechanisms")
        elif overall_score >= 4:
            print("⚠️ MODERATE - Significant limitations")
            print("   • Use it only for very simple translations")
            print("   • Implement strict validation")
            print("   • Consider a larger model for complex tasks")
        else:
            print("❌ POOR - Not suited to MCP tasks")
            print("   • Consider a larger model (1.5B+)")
            print("   • Use fixed rules for simple commands")

def main():
    """Main entry point."""

    print("MCP CAPABILITY EVALUATION OF QWEN2.5 0.5B")
    print("=" * 70)

    # Check that the model is available
    if not test_model_availability():
        print("\n❌ The model is not available. Download it first.")
        return

    # Define the tests
    basic_tests = test_basic_mcp_comprehension()
    planning_tests = test_complex_planning()
    error_tests = test_error_handling()

    # Run the tests
    results = {}

    # Basic tests
    print("\nRunning the MCP capability tests...")

    basic_results = {}
    for test in basic_tests:
        result = run_llm_test(test["prompt"], test["name"])
        if result["success"]:
            analysis = analyze_response(result["response"], test)
            result["analysis"] = analysis
        basic_results[test["name"]] = result

    results["compréhension_de_base"] = basic_results

    # Planning tests
    planning_results = {}
    for test in planning_tests:
        result = run_llm_test(test["prompt"], test["name"])
        if result["success"]:
            analysis = analyze_response(result["response"], test)
            result["analysis"] = analysis
        planning_results[test["name"]] = result

    results["planification_complexe"] = planning_results

    # Error-handling tests
    error_results = {}
    for test in error_tests:
        result = run_llm_test(test["prompt"], test["name"])
        if result["success"]:
            analysis = analyze_response(result["response"], test)
            result["analysis"] = analysis
        error_results[test["name"]] = result

    results["gestion_des_erreurs"] = error_results

    # Generate the report
    generate_assessment_report(results)

    # Save the detailed results
    with open("qwen_mcp_capability_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print("\nDetailed results saved to: qwen_mcp_capability_results.json")

if __name__ == "__main__":
    main()
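The test definitions in this script carry `expected_tool` and `expected_params`, but `analyze_response` never reads them and relies on substring checks alone. A stricter check could parse the response and compare it with those fields. The sketch below assumes the model returns an object shaped like `{"tool": ..., "args": {...}}`, which is the shape requested in the other scripts' prompts and not something this script enforces; `check_tool_call` is a hypothetical helper.

```python
import json


def check_tool_call(response: str, test_config: dict) -> dict:
    """Compare a JSON tool call against a test's expected tool and parameters."""
    outcome = {"json_valid": False, "tool_ok": False, "missing_params": []}
    try:
        call = json.loads(response)
    except json.JSONDecodeError:
        return outcome
    if not isinstance(call, dict):
        return outcome

    outcome["json_valid"] = True
    # Assumed response shape: {"tool": "<name>", "args": {...}}
    outcome["tool_ok"] = call.get("tool") == test_config.get("expected_tool")

    args = call.get("args")
    if not isinstance(args, dict):
        args = {}
    outcome["missing_params"] = [
        name for name in test_config.get("expected_params", []) if name not in args
    ]
    return outcome
```

For the "Parameter extraction" test, a response that names `move_units` but omits `target_y` would come back with `tool_ok` set and `target_y` listed as missing, a distinction the substring-based score cannot make.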
|
tests/scripts/qwen_mcp_evaluation.py
ADDED
|
@@ -0,0 +1,279 @@
"""
Evaluation of the MCP capabilities of Qwen2.5 0.5B
Test run now that the model works correctly
"""

import sys
import os
import json
import time

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_qwen_mcp_capabilities():
    """Test MCP capabilities with the working model"""

    print("EVALUATION OF QWEN2.5 0.5B MCP CAPABILITIES")
    print("=" * 70)

    # Import and initialize the model
    try:
        from llama_cpp import Llama

        model_path = "qwen2.5-0.5b-instruct-q4_0.gguf"

        print("Initializing the model...")
        llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=2,
            verbose=False
        )

        print("✅ Model initialized successfully\n")

    except Exception as e:
        print(f"❌ Initialization error: {e}")
        return

    # MCP tests
    tests = [
        {
            "name": "Compréhension MCP de base",
            "prompt": """Tu es un assistant IA qui contrôle un jeu RTS via MCP (Model Context Protocol).

Outils MCP disponibles:
- get_game_state(): Obtenir l'état actuel du jeu
- move_units(unit_ids, target_x, target_y): Déplacer des unités

Commande utilisateur: "Montre-moi l'état du jeu"

Réponds avec un objet JSON contenant l'appel d'outil MCP à exécuter.
""",
            "expected_tool": "get_game_state"
        },
        {
            "name": "Traduction d'action simple",
            "prompt": """Tu es un assistant IA qui contrôle un jeu RTS via MCP.

Outils MCP disponibles:
- move_units(unit_ids, target_x, target_y): Déplacer des unités

Commande utilisateur: "Déplace mon infanterie vers la position 150, 200"

Réponds avec un objet JSON contenant l'appel d'outil avec les paramètres extraits.
""",
            "expected_tool": "move_units"
        },
        {
            "name": "Extraction de paramètres complexes",
            "prompt": """Tu es un assistant IA qui contrôle un jeu RTS via MCP.

Outils MCP disponibles:
- move_units(unit_ids, target_x, target_y): Déplacer des unités

Commande utilisateur: "Déplace les tanks 1, 3 et 7 vers les coordonnées 120, 80"

Extrais les paramètres et réponds avec un objet JSON d'appel d'outil.
""",
            "expected_tool": "move_units"
        },
        {
            "name": "Planification stratégique",
            "prompt": """Tu es un assistant IA qui contrôle un jeu RTS via MCP.

Outils MCP disponibles:
- get_game_state(): Obtenir l'état du jeu
- move_units(): Déplacer des unités
- attack_unit(): Attaquer l'ennemi
- build_building(): Construire un bâtiment

Commande utilisateur: "Construis une base près du minerai et défends-la"

Décompose cette stratégie en une séquence d'actions MCP.
""",
            "expected_complex": True
        },
        {
            "name": "Gestion d'ambiguïté",
            "prompt": """Tu es un assistant IA qui contrôle un jeu RTS via MCP.

Outils MCP disponibles:
- move_units(unit_ids, target_x, target_y): Déplacer des unités

Commande utilisateur: "Déplace mes unités vers la base ennemie"

Comment gérerais-tu cette instruction ambiguë?
""",
            "expected_clarification": True
        }
    ]

    results = []

    for i, test in enumerate(tests, 1):
        print(f"🧪 TEST {i}: {test['name']}")
        print("-" * 50)

        try:
            start_time = time.time()

            # Use the chat template for better results
            messages = [
                {"role": "system", "content": "Tu es un assistant IA spécialisé dans les jeux RTS et le protocole MCP. Réponds avec des objets JSON valides."},
                {"role": "user", "content": test['prompt']}
            ]

            # Use the standard llama-cpp-python chat completion call
            response = llm.create_chat_completion(
                messages=messages,
                max_tokens=300,
                temperature=0.1
            )

            response_time = time.time() - start_time

            # Extract the response text
            if response and 'choices' in response and len(response['choices']) > 0:
                response_text = response['choices'][0]['message']['content']
            else:
                # Fallback: use the plain completion call
                simple_response = llm(test['prompt'], max_tokens=300, temperature=0.1)
                response_text = simple_response['choices'][0]['text'] if 'choices' in simple_response else str(simple_response)

            print(f"⏱️ Response time: {response_time:.2f}s")
            print(f"Response: {response_text[:300]}...")

            # Score the response
            score = analyze_mcp_response(response_text, test)

            results.append({
                'test': test['name'],
                'response': response_text,
                'response_time': response_time,
                'score': score,
                'success': score >= 5
            })

            print(f"Score: {score}/10")

        except Exception as e:
            print(f"❌ Error: {e}")
            results.append({
                'test': test['name'],
                'error': str(e),
                'score': 0,
                'success': False
            })

        print()

    # Generate the final report
    generate_evaluation_report(results)

def analyze_mcp_response(response, test_config):
    """Analyze the MCP response and score it out of 10"""

    if not response or response.strip() == "":
        return 0

    score = 0

    # Check the JSON structure
    try:
        json_data = json.loads(response)
        score += 3  # valid JSON

        # Check for MCP tool-call fields (only a JSON object can carry them)
        if isinstance(json_data, dict):
            if 'tool' in json_data:
                score += 2
            if 'args' in json_data:
                score += 1

    except json.JSONDecodeError:
        # Not valid JSON: look for tool and parameter names in the raw text
        mcp_tools = ["get_game_state", "move_units", "attack_unit", "build_building"]
        tools_found = [tool for tool in mcp_tools if tool in response]

        if tools_found:
            score += 2
        if any(param in response for param in ['unit_ids', 'target_x', 'target_y']):
            score += 1

    # Check semantic coherence
    if 'game' in response.lower() or 'rts' in response.lower():
        score += 1

    # Check relevance to the test
    if 'expected_tool' in test_config:
        if test_config['expected_tool'] in response:
            score += 2
    elif test_config.get('expected_complex'):
        if any(word in response.lower() for word in ['sequence', 'steps', 'build', 'defend']):
            score += 2
    elif test_config.get('expected_clarification'):
        if any(word in response.lower() for word in ['clarify', 'coordinates', 'which', 'ambiguous']):
            score += 2

    return min(score, 10)

def generate_evaluation_report(results):
    """Generate a full evaluation report"""

    print("=" * 70)
    print("MCP EVALUATION REPORT")
    print("=" * 70)

    successful_tests = [r for r in results if r.get('success', False)]
    total_score = sum(r.get('score', 0) for r in results)
    avg_score = total_score / len(results) if results else 0

    print("\nSTATISTICS:")
    print(f" Tests passed: {len(successful_tests)}/{len(results)}")
    print(f" Average score: {avg_score:.1f}/10")

    print("\nDETAILS PER TEST:")
    for result in results:
        status = "✅" if result.get('success', False) else "❌"
        print(f" {status} {result['test']}: {result.get('score', 0)}/10")
        if 'response_time' in result:
            print(f" ⏱️ Time: {result['response_time']:.2f}s")

    print("\n🎯 FINAL ASSESSMENT:")

    if avg_score >= 8:
        print("💪 EXCELLENT - Qwen2.5 0.5B is very capable at MCP tasks")
        recommendation = "USE WITH CONFIDENCE"
    elif avg_score >= 6:
        print("GOOD - Qwen2.5 0.5B is capable, with some limitations")
        recommendation = "USE WITH VALIDATION"
    elif avg_score >= 4:
        print("⚠️ MODERATE - Significant limitations")
        recommendation = "USE FOR SIMPLE TASKS"
    else:
        print("❌ WEAK - Not suited to MCP tasks")
        recommendation = "CONSIDER ANOTHER SOLUTION"

    print(f"💡 RECOMMENDATION: {recommendation}")

    print("\nNEXT STEPS:")
    if avg_score >= 6:
        print("1. ✅ Implement MCP translation in the game")
        print("2. 🎯 Start with simple commands")
        print("3. Add performance metrics")
        print("4. Test with real players")
    else:
        print("1. 🔧 Improve the prompts and examples")
        print("2. Train on MCP-specific data")
        print("3. Consider a larger model")
        print("4. 🛠️ Implement fallback rules")

    # Save the results
    with open("qwen_mcp_evaluation_results.json", "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print("\nDetailed results saved to: qwen_mcp_evaluation_results.json")

if __name__ == "__main__":
    test_qwen_mcp_capabilities()
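`analyze_mcp_response` passes the whole reply to `json.loads`, so a reply that wraps a valid tool call in a sentence of prose falls through to the keyword branch and loses the JSON points. One possible way to pull the first JSON object out of free text is sketched below, using the standard library's `json.JSONDecoder.raw_decode`. The helper name `extract_first_json_object` and the sample reply are hypothetical and are not part of the script.

```python
import json

def extract_first_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # This brace did not open a valid object: try the next one
        start = text.find("{", start + 1)
    return None

# Hypothetical model reply: a tool call surrounded by prose
reply = (
    'Sure, here is the call: {"tool": "move_units", "args": '
    '{"unit_ids": [1, 3, 7], "target_x": 120, "target_y": 80}} Good luck!'
)
call = extract_first_json_object(reply)
```

Unlike a greedy `\{.*\}` regular expression, this stops at the end of the first complete object, so trailing text that happens to contain a closing brace does not break the parse.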
tests/scripts/realistic_mcp_evaluation.py
ADDED
@@ -0,0 +1,387 @@
"""
Realistic test of MCP capabilities for an RTS game
Simulates actual usage with game context, game state, and real scenarios
"""

import sys
import os
import json
import time
import random

# Add this script's directory to the import path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Simulated game state
GAME_STATE = {
    "player_id": 1,
    "resources": {"credits": 2500, "power": 150},
    "units": [
        {"id": 1, "type": "infantry", "x": 100, "y": 100, "health": 100},
        {"id": 2, "type": "infantry", "x": 110, "y": 105, "health": 100},
        {"id": 3, "type": "tank", "x": 120, "y": 110, "health": 150},
        {"id": 4, "type": "harvester", "x": 200, "y": 200, "health": 200}
    ],
    "buildings": [
        {"id": 1, "type": "hq", "x": 50, "y": 50, "health": 500},
        {"id": 2, "type": "power_plant", "x": 80, "y": 80, "health": 300},
        {"id": 3, "type": "barracks", "x": 120, "y": 60, "health": 250}
    ],
    "enemy_units": [
        {"id": 101, "type": "infantry", "x": 300, "y": 150, "health": 100},
        {"id": 102, "type": "tank", "x": 320, "y": 160, "health": 150}
    ],
    "map": {
        "width": 96,
        "height": 72,
        "ore_fields": [{"x": 250, "y": 200}, {"x": 400, "y": 300}]
    }
}

def create_realistic_prompt(scenario, game_state):
    """Build a realistic prompt that includes the game context"""

    base_context = f"""
Tu es un assistant IA qui contrôle un jeu RTS via MCP (Model Context Protocol).

ÉTAT ACTUEL DU JEU:
{json.dumps(game_state, indent=2)}

OUTILS MCP DISPONIBLES:
- get_game_state(): Obtenir l'état actuel du jeu
- move_units(unit_ids, target_x, target_y): Déplacer des unités
- attack_unit(attacker_ids, target_id): Attaquer une unité ennemie
- build_building(building_type, position_x, position_y, player_id): Construire un bâtiment
- get_ai_analysis(language): Obtenir une analyse tactique

RÈGLES IMPORTANTES:
- Les coordonnées doivent être valides (0-95 pour x, 0-71 pour y)
- Les unités doivent exister (vérifier les IDs)
- Les bâtiments nécessitent des ressources suffisantes
- Les attaques nécessitent une portée valide

Réponds UNIQUEMENT avec un objet JSON contenant l'action MCP à exécuter.
"""

    return base_context + "\n\n" + scenario

def test_model_realistic(model_path, model_name):
    """Realistic test of a model with game scenarios"""

    try:
        from llama_cpp import Llama

        print(f"🎮 Realistic test of {model_name}...")

        # Initialize the model
        llm = Llama(
            model_path=model_path,
            n_ctx=2048,  # larger, to fit the game context
            n_threads=1,
            verbose=False
        )

        # Real game scenarios
        scenarios = [
            {
                "name": "Défense immédiate",
                "scenario": "Il y a un tank ennemi à (320, 160) qui menace ma base. Attaque-le avec mes unités disponibles!",
                "expected_tool": "attack_unit",
                "difficulty": "facile"
            },
            {
                "name": "Collecte de ressources",
                "scenario": "Mes crédits sont bas (2500). Envoie le récolteur vers le champ de minerai le plus proche.",
                "expected_tool": "move_units",
                "difficulty": "moyen"
            },
            {
                "name": "Expansion stratégique",
                "scenario": "Je veux construire une caserne près du champ de minerai à (250, 200) pour défendre mes récolteurs.",
                "expected_tool": "build_building",
                "difficulty": "moyen"
            },
            {
                "name": "Attaque coordonnée",
                "scenario": "Prépare une attaque sur les positions ennemies. Utilise toutes mes unités militaires disponibles.",
                "expected_tool": "attack_unit",
                "difficulty": "difficile"
            },
            {
                "name": "Reconnaissance",
                "scenario": "Montre-moi l'état complet du jeu pour analyser la situation tactique.",
                "expected_tool": "get_game_state",
                "difficulty": "facile"
            },
            {
                "name": "Gestion de crise",
                "scenario": "Mon QG est attaqué! Déplace toutes les unités disponibles pour défendre la position (50, 50).",
                "expected_tool": "move_units",
                "difficulty": "difficile"
            }
        ]

        results = []
        total_score = 0
        total_time = 0

        for scenario in scenarios:
            print(f"\nScenario: {scenario['name']} ({scenario['difficulty']})")

            # Build the realistic prompt
            prompt = create_realistic_prompt(scenario['scenario'], GAME_STATE)

            start_time = time.time()

            # Query the model
            response = llm(
                prompt,
                max_tokens=200,
                temperature=0.1,
                stop=["</s>", "<|im_end|>", "```"]
            )

            response_time = time.time() - start_time
            response_text = response['choices'][0]['text'].strip()

            # Score the response in depth
            score = evaluate_realistic_response(response_text, scenario, GAME_STATE)

            total_score += score
            total_time += response_time

            print(f" ⏱️ Time: {response_time:.2f}s")
            print(f" Score: {score}/10")
            print(f" Response: {response_text[:100]}...")

            results.append({
                'scenario': scenario['name'],
                'difficulty': scenario['difficulty'],
                'score': score,
                'time': response_time,
                'response': response_text,
                'expected_tool': scenario['expected_tool']
            })

        avg_score = total_score / len(scenarios)
        avg_time = total_time / len(scenarios)

        print(f"\nResults for {model_name}:")
        print(f" Average score: {avg_score:.1f}/10")
        print(f" Average time: {avg_time:.2f}s")

        # Breakdown by difficulty
        easy_scores = [r['score'] for r in results if r['difficulty'] == 'facile']
        medium_scores = [r['score'] for r in results if r['difficulty'] == 'moyen']
        hard_scores = [r['score'] for r in results if r['difficulty'] == 'difficile']

        print(f" Easy scenarios: {sum(easy_scores)/len(easy_scores):.1f}/10" if easy_scores else " Easy scenarios: N/A")
        print(f" Medium scenarios: {sum(medium_scores)/len(medium_scores):.1f}/10" if medium_scores else " Medium scenarios: N/A")
        print(f" Hard scenarios: {sum(hard_scores)/len(hard_scores):.1f}/10" if hard_scores else " Hard scenarios: N/A")

        return {
            'name': model_name,
            'avg_score': avg_score,
            'avg_time': avg_time,
            'results': results,
            'easy_avg': sum(easy_scores)/len(easy_scores) if easy_scores else 0,
            'medium_avg': sum(medium_scores)/len(medium_scores) if medium_scores else 0,
            'hard_avg': sum(hard_scores)/len(hard_scores) if hard_scores else 0
        }

    except Exception as e:
        print(f"❌ Error with {model_name}: {e}")
        return {
            'name': model_name,
            'avg_score': 0,
            'avg_time': 0,
            'error': str(e)
        }

def evaluate_realistic_response(response, scenario, game_state):
    """In-depth evaluation of the MCP response"""

    import re

    score = 0

    # 1. Valid JSON format (3 points)
    try:
        json_response = json.loads(response)
        score += 3
    except json.JSONDecodeError:
        # Try to extract a JSON object embedded in the text
        json_match = re.search(r'\{.*\}', response, re.DOTALL)
        if json_match:
            try:
                json_response = json.loads(json_match.group())
                score += 2  # JSON found, but wrapped in extra text
            except json.JSONDecodeError:
                json_response = {}
        else:
            json_response = {}

    # Valid JSON that is not an object (a list, a number) carries no tool call
    if not isinstance(json_response, dict):
        json_response = {}

    # 2. Correct tool (3 points)
    expected_tool = scenario['expected_tool']
    if 'tool' in json_response and json_response['tool'] == expected_tool:
        score += 3
    elif expected_tool in response:
        score += 2  # tool mentioned, but not in the expected format
    elif any(tool in response for tool in ['get_game_state', 'move_units', 'attack_unit', 'build_building']):
        score += 1  # some MCP tool is mentioned

    # 3. Valid parameters (2 points)
    if expected_tool == 'attack_unit':
        if 'attacker_ids' in json_response and 'target_id' in json_response:
            # Check that the IDs exist in the game state
            attackers = json_response['attacker_ids']
            target = json_response['target_id']

            valid_attackers = any(unit['id'] in attackers if isinstance(attackers, list) else unit['id'] == attackers for unit in game_state['units'])
            valid_target = any(unit['id'] == target for unit in game_state['enemy_units'])

            if valid_attackers and valid_target:
                score += 2
            elif valid_attackers or valid_target:
                score += 1

    elif expected_tool == 'move_units':
        if 'unit_ids' in json_response and 'target_x' in json_response and 'target_y' in json_response:
            # Check that the coordinates are numeric and inside the map
            x, y = json_response['target_x'], json_response['target_y']
            if isinstance(x, (int, float)) and isinstance(y, (int, float)) and 0 <= x <= 95 and 0 <= y <= 71:
                score += 2

    elif expected_tool == 'build_building':
        if 'building_type' in json_response and 'position_x' in json_response and 'position_y' in json_response:
            score += 2

    elif expected_tool == 'get_game_state':
        # get_game_state takes no parameters; note that this also awards the
        # points when no JSON object could be parsed from the response
        if not json_response:
            score += 2

    # 4. Tactical coherence (2 points)
    scenario_text = scenario['scenario'].lower()
    response_lower = response.lower()
    if scenario['difficulty'] == 'difficile':
        # Hard scenarios: check tactical relevance. The scenario text is French,
        # so match the French stems as well as the English keywords.
        if ('attack' in scenario_text or 'attaqu' in scenario_text) and 'attack' in response_lower:
            score += 1
        if ('defend' in scenario_text or 'défend' in scenario_text) and ('defend' in response_lower or 'move' in response_lower):
            score += 1
    else:
        # Simpler scenarios
        if any(word in response_lower for word in ['game', 'state', 'move', 'attack', 'build']):
            score += 1

    return min(score, 10)

def run_realistic_evaluation():
    """Run the full realistic evaluation"""

    print("🎮 REALISTIC MCP EVALUATION FOR AN RTS GAME")
    print("=" * 70)
    print("Test with game context, real scenarios, and tactical validation")
    print("=" * 70)

    # Models to test
    models = [
        {
            'name': 'Qwen2.5-0.5B',
            'path': 'qwen2.5-0.5b-instruct-q4_0.gguf'
        },
        {
            'name': 'Qwen3-0.6B',
            'path': 'Qwen3-0.6B-Q8_0.gguf'
        },
        {
            'name': 'Gemma-3-270M',
            'path': 'gemma-3-270m-it-qat-Q8_0.gguf'
        }
    ]

    results = []

    for model in models:
        if os.path.exists(model['path']):
            result = test_model_realistic(model['path'], model['name'])
            results.append(result)
            print("\n" + "="*50)
        else:
            print(f"❌ Model not found: {model['path']}")

    # Realistic comparative analysis
    print("\n" + "="*70)
    print("REALISTIC COMPARATIVE ANALYSIS")
    print("="*70)

    successful_results = [r for r in results if 'error' not in r and r['avg_score'] > 0]

    if successful_results:
        # Rank by overall performance
        sorted_by_performance = sorted(successful_results, key=lambda x: x['avg_score'], reverse=True)

        print("\nRANKING BY ACTUAL PERFORMANCE:")
        for i, result in enumerate(sorted_by_performance, 1):
            print(f" {i}. {result['name']}: {result['avg_score']:.1f}/10 | {result['avg_time']:.2f}s")

        # Breakdown by difficulty
        print("\nPERFORMANCE BY DIFFICULTY:")

        # Map each difficulty label to the key returned by test_model_realistic
        difficulties = {'facile': 'easy_avg', 'moyen': 'medium_avg', 'difficile': 'hard_avg'}
        for diff, avg_key in difficulties.items():
            print(f"\n🔸 Scenarios ({diff}):")
            for result in successful_results:
                # result is a dict, so test key membership rather than hasattr()
                if avg_key in result:
                    score = result[avg_key]
                    print(f" {result['name']}: {score:.1f}/10")

        # Recommendations based on actual usage
        best_overall = sorted_by_performance[0]

        print("\n🎯 RECOMMENDATIONS FOR YOUR RTS GAME:")

        if best_overall['avg_score'] >= 7:
            print(f"✅ {best_overall['name']} is EXCELLENT for production")
            print(" • Handles complex scenarios well")
            print(" • Coherent tactical responses")
        elif best_overall['avg_score'] >= 5:
            print(f"{best_overall['name']} is GOOD for production")
            print(" • Works well for basic commands")
            print(" • May need additional validation")
        else:
            print(f"⚠️ {best_overall['name']} needs improvement")
            print(" • Consider more structured prompts")
            print(" • Add validation rules")

        # Use-case-specific analysis
        print("\n🎮 GAME-SPECIFIC ANALYSIS:")

        for result in successful_results:
            print(f"\n🔹 {result['name']}:")

            # Group the scores by scenario
            scenario_scores = {}
            for scenario_result in result['results']:
                scenario_name = scenario_result['scenario']
                if scenario_name not in scenario_scores:
                    scenario_scores[scenario_name] = []
                scenario_scores[scenario_name].append(scenario_result['score'])

            for scenario, scores in scenario_scores.items():
                avg_score = sum(scores) / len(scores)
                print(f" {scenario}: {avg_score:.1f}/10")

    # Save the realistic results
    realistic_results = {
        'evaluation_type': 'realistic_mcp_game_test',
        'game_state_sample': GAME_STATE,
        'results': results,
        'successful_models': successful_results
    }

    with open("realistic_mcp_evaluation.json", "w", encoding="utf-8") as f:
        json.dump(realistic_results, f, indent=2, ensure_ascii=False)

    print("\nRealistic results saved to: realistic_mcp_evaluation.json")

if __name__ == "__main__":
    run_realistic_evaluation()
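The prompt in this script states rules (coordinates within 0-95 and 0-71, unit IDs must exist), but the scorer only partly checks them. The following is a minimal sketch of how a parsed tool call could be validated against the game state before it is scored or executed. The helper `validate_call`, the trimmed `STATE` dictionary, and the flat call layout (`tool` plus parameters at the top level, as `evaluate_realistic_response` expects) are assumptions made for illustration.

```python
# Trimmed, hypothetical game state with the same shape as GAME_STATE
STATE = {
    "units": [{"id": 1}, {"id": 3}],
    "enemy_units": [{"id": 102}],
    "map": {"width": 96, "height": 72},
}

def validate_call(call, state):
    """Return a list of problems found in a flat tool-call dict (empty if none)."""
    problems = []
    own_ids = [u["id"] for u in state["units"]]
    enemy_ids = [u["id"] for u in state["enemy_units"]]
    tool = call.get("tool")

    if tool == "move_units":
        unit_ids = call.get("unit_ids")
        if not isinstance(unit_ids, list) or not unit_ids:
            problems.append("unit_ids must be a non-empty list")
        else:
            unknown = [i for i in unit_ids if i not in own_ids]
            if unknown:
                problems.append(f"unknown unit ids: {unknown}")
        x, y = call.get("target_x"), call.get("target_y")
        in_bounds = (
            isinstance(x, int) and isinstance(y, int)
            and 0 <= x < state["map"]["width"]
            and 0 <= y < state["map"]["height"]
        )
        if not in_bounds:
            problems.append(f"target out of bounds: ({x}, {y})")
    elif tool == "attack_unit":
        attackers = call.get("attacker_ids")
        if not isinstance(attackers, list) or not attackers:
            problems.append("attacker_ids must be a non-empty list")
        else:
            unknown = [i for i in attackers if i not in own_ids]
            if unknown:
                problems.append(f"unknown attacker ids: {unknown}")
        target = call.get("target_id")
        if not isinstance(target, int) or target not in enemy_ids:
            problems.append(f"unknown target id: {target}")
    else:
        problems.append(f"unsupported tool: {tool}")
    return problems

ok = validate_call(
    {"tool": "move_units", "unit_ids": [1, 3], "target_x": 50, "target_y": 50}, STATE
)
bad = validate_call(
    {"tool": "move_units", "unit_ids": [1, 9], "target_x": 250, "target_y": 200}, STATE
)
```

With the bounds stated in the prompt, a target such as (250, 200), which is where `GAME_STATE` places an ore field, is rejected. A check like this would surface that the sample coordinates and the stated map size do not agree.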
tests/scripts/simple_model_diagnostic.py
ADDED
@@ -0,0 +1,94 @@
"""
Simple diagnostic for the problematic MCP models
Basic test to identify technical problems
"""

import os
import sys

def check_model_files():
    """Check that the model files exist and report their sizes"""

    print("MODEL FILE CHECK")
    print("=" * 50)

    models = [
        ('mcp-instruct-v1.Q4_K_M.gguf', 'MCP-Instruct-v1'),
        ('mcprl-3b-exa.Q2_K.gguf', 'MCPR L-3B-Exa'),
        ('gemma-3n-E2B-it-UD-IQ2_XXS.gguf', 'Gemma-3n-E2B-it'),
        ('qwen2.5-coder-0.5b-instruct-q4_0.gguf', 'Qwen2.5-Coder-0.5B')
    ]

    for filename, model_name in models:
        if os.path.exists(filename):
            size_mb = os.path.getsize(filename) / (1024 * 1024)
            print(f"✅ {model_name}: {size_mb:.1f} MB")
        else:
            print(f"❌ {model_name}: file not found")

def test_basic_llm():
    """Basic LLM loading test"""

    print("\n🧪 LLAMA-CPP LOADING TEST")
    print("=" * 50)

    try:
        from llama_cpp import Llama
        print("✅ llama-cpp-python imported successfully")

        # Test with a model that is known to work
        test_model = 'qwen2.5-coder-0.5b-instruct-q4_0.gguf'
        if os.path.exists(test_model):
            try:
                llm = Llama(model_path=test_model, n_ctx=512, verbose=False)
                print("✅ Test model loaded successfully")

                # Simple generation test
                response = llm("Hello", max_tokens=5, temperature=0.1)
                print(f"✅ Test generation: {response}")

            except Exception as e:
                print(f"❌ Model loading error: {e}")
        else:
            print("❌ Test model not found")

    except ImportError:
        print("❌ llama-cpp-python is not installed")
    except Exception as e:
        print(f"❌ Import error: {e}")

| 60 |
+
def analyze_quantization_issues():
|
| 61 |
+
"""Analyser les problรจmes de quantisation"""
|
| 62 |
+
|
| 63 |
+
print("\n๐ ANALYSE DES PROBLรMES DE QUANTISATION")
|
| 64 |
+
print("=" * 50)
|
| 65 |
+
|
| 66 |
+
quantization_issues = {
|
| 67 |
+
'Q2_K': "Quantisation trรจs agressive - peut causer corruption",
|
| 68 |
+
'IQ2_XXS': "Quantisation extrรชme - perte significative de capacitรฉ",
|
| 69 |
+
'Q4_K_M': "Quantisation moyenne - gรฉnรฉralement fiable",
|
| 70 |
+
'Q8_0': "Quantisation lรฉgรจre - meilleure qualitรฉ"
|
| 71 |
+
}
|
| 72 |
+
|
| 73 |
+
for quant, description in quantization_issues.items():
|
| 74 |
+
print(f"{quant}: {description}")
|
| 75 |
+
|
| 76 |
+
def main():
|
| 77 |
+
"""Diagnostic principal"""
|
| 78 |
+
|
| 79 |
+
print("๐ง DIAGNOSTIC TECHNIQUE DES MODรLES MCP")
|
| 80 |
+
print("=" * 60)
|
| 81 |
+
|
| 82 |
+
check_model_files()
|
| 83 |
+
test_basic_llm()
|
| 84 |
+
analyze_quantization_issues()
|
| 85 |
+
|
| 86 |
+
print("\n๐ฏ CONCLUSIONS:")
|
| 87 |
+
print("-" * 30)
|
| 88 |
+
print("1. MCP-Instruct-v1: Problรจme technique (llama_decode=-1)")
|
| 89 |
+
print("2. MCPR L-3B-Exa: Quantisation Q2_K trop agressive")
|
| 90 |
+
print("3. Gemma-3n-E2B-it: Quantisation IQ2_XXS extrรชme")
|
| 91 |
+
print("4. Solution: Tรฉlรฉcharger versions Q8_0")
|
| 92 |
+
|
| 93 |
+
if __name__ == "__main__":
|
| 94 |
+
main()
|
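The diagnostic above ties each failure to the quantization level named in the GGUF filename. As a minimal sketch, the tag could be read from the filename and looked up in the same notes table. `quant_tag`, `QUANT_NOTES` and the regular expression are hypothetical helpers, not part of the committed script, and the pattern is only meant to cover the filenames listed here.

```python
import re

# Notes per quantization level, mirroring analyze_quantization_issues().
QUANT_NOTES = {
    "Q2_K": "very aggressive quantization - may cause corruption",
    "IQ2_XXS": "extreme quantization - significant loss of capability",
    "Q4_K_M": "medium quantization - generally reliable",
    "Q8_0": "light quantization - best quality",
}

# Hypothetical pattern: a tag such as Q4_K_M, Q2_K, IQ2_XXS or q4_0
# sitting directly before the ".gguf" extension.
_QUANT_RE = re.compile(
    r"(?<![A-Za-z0-9])(I?Q\d+(?:_[A-Za-z0-9]+)*)(?=\.gguf$)",
    re.IGNORECASE,
)

def quant_tag(filename):
    """Return the upper-cased quantization tag of a GGUF filename, or None."""
    match = _QUANT_RE.search(filename)
    return match.group(1).upper() if match else None

if __name__ == "__main__":
    for name in ("mcp-instruct-v1.Q4_K_M.gguf", "mcprl-3b-exa.Q2_K.gguf"):
        tag = quant_tag(name)
        print(f"{name}: {tag} -> {QUANT_NOTES.get(tag, 'no note for this level')}")
```

Levels that are missing from the table (for example `Q4_0`) simply fall back to the default message, so the lookup never raises.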
tests/scripts/simple_qwen_test.py
ADDED
|
@@ -0,0 +1,137 @@
"""
Simple test to evaluate the MCP capabilities of Qwen2.5 0.5B
"""

import sys
import os

# Add this script's directory to the path so local modules can be imported
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

def test_basic_capability():
    """Basic MCP capability test"""

    print("=== Basic MCP capability test ===")

    # Test 1: understanding of the MCP tools
    prompt1 = """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- get_game_state(): Get the game state
- move_units(unit_ids, target_x, target_y): Move units

User command: "Show me the game state"

Respond with a JSON object containing the tool call to execute.
"""

    print("Test 1 - Tool understanding:")
    print(f"Prompt: {prompt1[:100]}...")

    # Test 2: translation of a simple action
    prompt2 = """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my infantry to position 100, 200"

Respond with a JSON object containing the tool call to execute.
"""

    print("\nTest 2 - Action translation:")
    print(f"Prompt: {prompt2[:100]}...")

    # Test 3: complex planning
    prompt3 = """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- get_game_state(): Get the game state
- move_units(): Move units
- attack_unit(): Attack an enemy unit
- build_building(): Construct a building

User command: "I want to build a base near the ore field and defend it with turrets"

Break this command down into a sequence of MCP tool calls.
"""

    print("\nTest 3 - Complex planning:")
    print(f"Prompt: {prompt3[:100]}...")

    return {
        "test1": {"prompt": prompt1, "expected": "get_game_state"},
        "test2": {"prompt": prompt2, "expected": "move_units with parameters"},
        "test3": {"prompt": prompt3, "expected": "sequence of actions"}
    }

def test_with_ai_model():
    """Test with the real AI model"""

    try:
        from ai_analysis import query_llm

        print("\n=== Test with the Qwen2.5 0.5B model ===")

        # Simple test: the model must echo the sentinel string
        test_prompt = """
You are an AI assistant. Reply simply with "TEST_RÉUSSI" if you understand this instruction.
"""

        print("Testing connection to the model...")

        response = query_llm(
            prompt=test_prompt,
            max_tokens=50,
            temperature=0.1
        )

        print(f"Model response: {response}")

        if "TEST_RÉUSSI" in response:
            print("✅ Model works correctly")
            return True
        else:
            print("❌ Problem with the model")
            return False

    except Exception as e:
        print(f"❌ Error while testing the model: {e}")
        return False

def main():
    """Main function"""

    print("Evaluation of the MCP capabilities of Qwen2.5 0.5B")
    print("=" * 60)

    # Basic test (only defines and prints the prompts)
    tests = test_basic_capability()

    # Test with the model
    model_working = test_with_ai_model()

    # Summary
    print("\n" + "=" * 60)
    print("EVALUATION SUMMARY")
    print("=" * 60)

    print("\n1. BASIC MCP CAPABILITIES:")
    print("   ✅ MCP tool understanding: test defined")
    print("   ✅ Simple action translation: test defined")
    print("   ✅ Complex planning: test defined")

    print(f"\n2. MODEL STATUS: {'✅ OPERATIONAL' if model_working else '❌ PROBLEM'}")

    print("\n3. RECOMMENDATIONS:")
    if model_working:
        print("   • The model can be used for MCP translation")
        print("   • Start with simple commands")
        print("   • Add validation for complex actions")
    else:
        print("   • Check that the model has been downloaded")
        print("   • Make sure llama-cpp-python is installed")
        print("   • Test with simpler commands first")

if __name__ == "__main__":
    main()
|
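The script above only records each expected result as a free-text label and never compares it with a model reply. A minimal sketch of such a comparison follows, assuming the reply uses the `{"tool": ..., "args": ...}` shape that the other evaluation scripts ask for. `TOOLS` and `check_tool_call` are hypothetical names, and the argument lists are copied from the prompts rather than from the real server.

```python
import json

# Argument names per tool, as listed in the prompts (an assumption about the API).
TOOLS = {
    "get_game_state": [],
    "move_units": ["unit_ids", "target_x", "target_y"],
}

def check_tool_call(response_text, expected_tool):
    """Return (ok, reason) for a model reply expected to be a JSON tool call."""
    try:
        call = json.loads(response_text)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "not a JSON object"
    if call.get("tool") != expected_tool:
        return False, f"expected tool {expected_tool!r}, got {call.get('tool')!r}"
    args = call.get("args", {})
    if not isinstance(args, dict):
        return False, "args is not an object"
    missing = [name for name in TOOLS[expected_tool] if name not in args]
    if missing:
        return False, f"missing arguments: {missing}"
    return True, "ok"

if __name__ == "__main__":
    reply = '{"tool": "move_units", "args": {"unit_ids": [1], "target_x": 100, "target_y": 200}}'
    print(check_tool_call(reply, "move_units"))
```

Returning a reason alongside the boolean keeps the report readable when a small model produces almost-correct output.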
tests/scripts/test_improved_mcp_formats.py
ADDED
|
@@ -0,0 +1,229 @@
"""
Improved test with prompt formats optimized for MCP-specialized models
"""

import sys
import os
import json
import re
import time
from llama_cpp import Llama

def test_mcp_formats(model_path, model_name):
    """Test different prompt formats for MCP models"""

    print(f"\n🧪 TEST WITH OPTIMIZED FORMATS: {model_name}")
    print("=" * 60)

    if not os.path.exists(model_path):
        print(f"❌ Model file not found: {model_path}")
        return

    try:
        # Load with a larger context for MCP models.
        # Sampling parameters (temperature, top_p) are passed per call below.
        llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=1,
            verbose=False,
            n_gpu_layers=0
        )

        test_command = "show game state"

        # Prompt formats optimized for MCP-style tool calling
        prompt_formats = [
            {
                'name': 'Full structured MCP format',
                'template': '''[INST] You are an MCP (Model Context Protocol) assistant for an RTS game. Your task is to respond ONLY with JSON format.

Available tools:
- get_game_state(): Returns current game state information
- move_units(unit_ids, target_x, target_y): Move specified units to coordinates
- attack_unit(attacker_ids, target_id): Attack target unit with attackers
- build_building(building_type, position_x, position_y): Construct building at position

User command: "{command}"

Respond with ONLY JSON in this format:
{{"tool": "tool_name", "args": {{<parameters>}}}}

Do not include any explanations, comments, or additional text. Only JSON. [/INST]'''
            },
            {
                'name': 'Simple MCP instruct format',
                'template': '''<|system|>
You are an MCP assistant for an RTS game. Respond with ONLY JSON.
Available tools: get_game_state, move_units, attack_unit, build_building
</|system|>
<|user|>
{command}
</|user|>
<|assistant|>'''
            },
            {
                'name': 'Strict JSON format',
                'template': '''Your role: MCP assistant for RTS game
Tools: get_game_state(), move_units(), attack_unit(), build_building()
Command: {command}

JSON Response:'''
            },
            {
                'name': 'MCP conversation format',
                'template': '''User: {command}

A: {{"tool": "get_game_state", "args": {{}}}}'''
            },
            {
                'name': 'Code-oriented format',
                'template': '''// MCP assistant for RTS game
// Available functions:
// - get_game_state()
// - move_units(unit_ids, target_x, target_y)
// - attack_unit(attacker_ids, target_id)
// - build_building(building_type, position_x, position_y)

// User: {command}
// Response (JSON only):'''
            }
        ]

        for i, format_info in enumerate(prompt_formats, 1):
            print(f"\n{i}. {format_info['name']}:")

            prompt = format_info['template'].format(command=test_command)

            try:
                # Slightly higher temperature than the basic capability test
                response = llm(
                    prompt,
                    max_tokens=100,
                    temperature=0.3,
                    top_p=0.9,
                    stop=["</s>", "<|im_end|>", "```", "\n\n", "User:"]
                )

                # Extract the text depending on the type of the returned object
                if hasattr(response, 'choices') and len(response.choices) > 0:
                    response_text = response.choices[0].text.strip()
                elif isinstance(response, dict) and 'choices' in response:
                    response_text = response['choices'][0]['text'].strip()
                else:
                    response_text = str(response)

                print(f"   Prompt (start): {prompt[:80]}...")
                print(f"   Response: {response_text[:150]}")

                # Check for JSON
                try:
                    parsed = json.loads(response_text)
                    print(f"   ✅ VALID JSON: {parsed}")
                except json.JSONDecodeError:
                    # Try to extract a JSON object (at most one level of nesting)
                    json_match = re.search(r'\{[^{}]*\{[^{}]*\}[^{}]*\}|\{[^{}]*\}', response_text)
                    if json_match:
                        try:
                            parsed = json.loads(json_match.group())
                            print(f"   ✅ EXTRACTED JSON: {parsed}")
                        except json.JSONDecodeError:
                            print("   ❌ INVALID JSON")
                    else:
                        print("   ❌ NO JSON DETECTED")

            except Exception as e:
                print(f"   ❌ Error: {e}")

    except Exception as e:
        print(f"❌ Loading error: {e}")

def test_model_capabilities(model_path, model_name):
    """Test the basic capabilities of the model"""

    print(f"\nCAPABILITY TEST: {model_name}")
    print("=" * 60)

    if not os.path.exists(model_path):
        print(f"❌ Model file not found: {model_path}")
        return

    try:
        llm = Llama(
            model_path=model_path,
            n_ctx=2048,
            n_threads=1,
            verbose=False,
            n_gpu_layers=0
        )

        # Basic generation test
        test_prompts = [
            "Hello",
            "What is your name?",
            "Write JSON",
            "{\"test\": \"value\"}"
        ]

        for prompt in test_prompts:
            try:
                response = llm(
                    prompt,
                    max_tokens=20,
                    temperature=0.1
                )
                response_text = response['choices'][0]['text'].strip()
                print(f"   '{prompt}' → '{response_text}'")
            except Exception as e:
                print(f"   '{prompt}' → Error: {e}")

    except Exception as e:
        print(f"❌ Loading error: {e}")

def main():
    """Test the problematic models with improved formats"""

    print("IMPROVED TEST OF MCP-SPECIALIZED MODELS")
    print("=" * 70)

    models_to_test = [
        {
            'name': 'MCP-Instruct-v1',
            'path': 'mcp-instruct-v1.Q4_K_M.gguf',
            'type': 'mcp_specialized'
        },
        {
            'name': 'MCPR L-3B-Exa',
            'path': 'mcprl-3b-exa.Q2_K.gguf',
            'type': 'mcp_specialized'
        },
        {
            'name': 'Gemma-3n-E2B-it',
            'path': 'gemma-3n-E2B-it-UD-IQ2_XXS.gguf',
            'type': 'mcp_specialized'
        },
        {
            'name': 'Qwen2.5-Coder-0.5B (reference)',
            'path': 'qwen2.5-coder-0.5b-instruct-q4_0.gguf',
            'type': 'code_specialized'
        }
    ]

    for model in models_to_test:
        if os.path.exists(model['path']):
            test_model_capabilities(model['path'], model['name'])
            test_mcp_formats(model['path'], model['name'])
        else:
            print(f"\n❌ Model not found: {model['path']}")

    print("\n" + "=" * 70)
    print("SUMMARY OF PROBLEMS:")
    print("""
1. **MCP-Instruct-v1**: Technical error (llama_decode=-1) - probably incompatible
2. **MCPR L-3B-Exa**: Corrupted text generation - bad quantization or format
3. **Gemma-3n-E2B-it**: Empty responses - incompatible prompt format
4. **Code-specialized models**: Work better because they are used to JSON
""")

if __name__ == "__main__":
    main()
|
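The fallback regex in the script above only recognizes objects with at most one level of nesting, so a reply such as `{"args": {"pos": {"x": 1}}}` surrounded by extra text would be missed. One possible alternative, sketched here under the assumption that the first decodable object is the one wanted, is to scan with the standard library's `json.JSONDecoder.raw_decode`. `extract_first_json` is a hypothetical helper name.

```python
import json

def extract_first_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None

if __name__ == "__main__":
    reply = 'Sure: {"tool": "move_units", "args": {"pos": {"x": 1, "y": 2}}} done'
    print(extract_first_json(reply))
```

Because the decoder does the brace matching, nesting depth and braces inside string values are handled without a hand-written pattern.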
tests/scripts/test_mcp.py
ADDED
|
@@ -0,0 +1,50 @@
"""
Test script for the MCP server
"""

import asyncio
import sys
import os

# Add this script's directory to the path so we can import the mcp_server module
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

def test_fastmcp_import():
    """Test that we can import the FastMCP class."""
    try:
        from mcp.server import FastMCP
        print("✓ Successfully imported FastMCP")
        return True
    except Exception as e:
        print(f"✗ Failed to import FastMCP: {e}")
        return False

def test_imports():
    """Test that we can import the MCP server module."""
    try:
        # mcp_server depends on the app module, so import it first
        import app
        from mcp_server import RTSGameMCP
        print("✓ Successfully imported RTSGameMCP")
        return True
    except Exception as e:
        print(f"✗ Failed to import RTSGameMCP: {e}")
        return False

if __name__ == "__main__":
    print("Testing MCP server imports...")

    tests = [
        test_fastmcp_import,
        # test_imports  # Skip this for now since it depends on the app module
    ]

    results = []
    for test in tests:
        results.append(test())

    if all(results):
        print("\n✓ All tests passed!")
    else:
        print("\n✗ Some tests failed!")
        sys.exit(1)
|
tests/scripts/test_q8_models.py
ADDED
|
@@ -0,0 +1,180 @@
#!/usr/bin/env python3
"""
Test of the Q8_0 models to see whether better quantization solves the problems
"""

import sys
import os
import re
import time
import json

def test_q8_models():
    """Test the Q8_0 models with different prompt formats"""

    print("Q8_0 MODEL TEST - IMPROVED FORMATS")
    print("=" * 70)

    # Downloaded Q8_0 models
    models = [
        {
            'name': 'MCP-Instruct-v1-Q8',
            'path': 'mcp-instruct-v1.Q8_0.gguf',
            'format': 'mcp_structured',
            'n_ctx': 4096
        },
        {
            'name': 'MCPR-L-3B-Exa-Q8',
            'path': 'mcprl-3b-exa.Q8_0.gguf',
            'format': 'instruct',
            'n_ctx': 4096
        },
        {
            'name': 'Gemma-3n-E2B-it-Q8',
            'path': 'google_gemma-3n-E2B-it-Q8_0.gguf',
            'format': 'code',
            'n_ctx': 4096
        }
    ]

    # Optimized prompt formats. Literal braces are doubled because the
    # templates are filled with str.format().
    prompt_templates = {
        'mcp_structured': '''<|im_start|>system
You are an MCP assistant for an RTS game. Respond with ONLY JSON format:
{{"tool": "tool_name", "args": {{}}}}

Available tools:
- get_game_state(): Get current game state
- move_units(unit_ids, target_x, target_y): Move units to coordinates
- attack_unit(attacker_ids, target_id): Attack a specific unit
- build_building(building_type, position_x, position_y): Construct a building

Always respond with valid JSON, no additional text.<|im_end|>
<|im_start|>user
{command}<|im_end|>
<|im_start|>assistant
''',

        'instruct': '''[INST] You are an MCP assistant. Respond with ONLY JSON format:
{{"tool": "tool_name", "args": {{}}}}

Available tools:
- get_game_state(): Get current game state
- move_units(unit_ids, target_x, target_y): Move units to coordinates
- attack_unit(attacker_ids, target_id): Attack a specific unit
- build_building(building_type, position_x, position_y): Construct a building

User command: {command}

Respond with only JSON: [/INST]
''',

        'code': '''// MCP assistant for RTS game
// Tools: get_game_state(), move_units(), attack_unit(), build_building()
// Command: {command}
// Output JSON only:
{{"tool": "tool_name", "args": {{}}}}'''
    }

    test_command = "show game state"

    for model in models:
        print(f"\n🧪 TEST: {model['name']}")
        print("-" * 50)

        if not os.path.exists(model['path']):
            print(f"   ❌ File not found: {model['path']}")
            continue

        file_size_mb = os.path.getsize(model['path']) / (1024*1024)
        print(f"   Size: {file_size_mb:.1f} MB")
        print(f"   🔧 Format: {model['format']}")
        print(f"   🧠 Context: {model['n_ctx']} tokens")

        try:
            from llama_cpp import Llama

            llm = Llama(
                model_path=model['path'],
                n_ctx=model['n_ctx'],
                n_threads=1,
                verbose=False,
                n_gpu_layers=0
            )

            prompt = prompt_templates[model['format']].format(command=test_command)

            print(f"   Prompt (start): {prompt[:80]}...")

            start_time = time.time()

            try:
                response = llm(
                    prompt,
                    max_tokens=100,
                    temperature=0.3,
                    stop=["</s>", "<|im_end|>", "```", "\n\n"]
                )

                response_time = time.time() - start_time

                # Handle the response (compatible with different llama-cpp versions)
                if isinstance(response, dict) and 'choices' in response:
                    response_text = response['choices'][0]['text'].strip()
                elif hasattr(response, '__iter__') and not isinstance(response, str):
                    # Handle streamed responses
                    response_text = ""
                    for chunk in response:
                        if isinstance(chunk, dict) and 'choices' in chunk:
                            response_text += chunk['choices'][0]['text']
                else:
                    response_text = str(response)

                print(f"   ⏱️ Time: {response_time:.2f}s")
                print(f"   Response: {response_text[:200]}")

                # Check for JSON
                try:
                    parsed = json.loads(response_text)
                    print(f"   ✅ VALID JSON: {json.dumps(parsed, indent=2)}")

                    # Assess relevance
                    if 'tool' in parsed:
                        print(f"   🎯 Tool identified: {parsed['tool']}")
                    if 'args' in parsed:
                        print(f"   Arguments: {parsed['args']}")

                except json.JSONDecodeError:
                    print("   ❌ INVALID JSON")
                    # Try to extract a JSON object (at most one level of nesting)
                    json_match = re.search(r'\{[^{}]*\{[^{}]*\}[^{}]*\}|\{[^{}]*\}', response_text)
                    if json_match:
                        try:
                            extracted_json = json.loads(json_match.group())
                            print(f"   Extracted JSON: {json.dumps(extracted_json, indent=2)}")
                        except json.JSONDecodeError:
                            print("   ❌ Could not extract valid JSON")

            except Exception as e:
                print(f"   ❌ Generation error: {e}")

        except Exception as e:
            print(f"   ❌ Loading error: {e}")

    print(f"\n{'='*70}")
    print("🎯 COMPARISON WITH PREVIOUS VERSIONS")
    print("=" * 70)

    print("\nRESULTS OF PREVIOUS VERSIONS:")
    print("   • MCP-Instruct-v1 (Q4_K_M): Technical error (llama_decode=-1)")
    print("   • MCPR L-3B-Exa (Q2_K): Corrupted text with special characters")
    print("   • Gemma-3n-E2B-it (IQ2_XXS): Empty responses, no output")

    print("\nEXPECTATIONS FOR Q8_0:")
    print("   • Better quantization quality")
    print("   • Less text corruption")
    print("   • More coherent responses")
    print("   • Valid JSON possible")

if __name__ == "__main__":
    test_q8_models()
|
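Prompt templates in this script embed a literal JSON example and are then filled with `str.format`, which treats every single `{` as the start of a replacement field. The following minimal sketch uses two shortened, hypothetical templates (not the full prompts) to show the difference between single and doubled braces.

```python
# Single braces: str.format tries to resolve '"tool"' as a field name and raises.
broken = 'Respond with {"tool": "tool_name", "args": {}}\nUser: {command}'
try:
    broken.format(command="show game state")
    broken_error = None
except (KeyError, IndexError, ValueError) as exc:
    broken_error = type(exc).__name__

# Doubled braces are emitted literally; only {command} is substituted.
escaped = 'Respond with {{"tool": "tool_name", "args": {{}}}}\nUser: {command}'
prompt = escaped.format(command="show game state")

if __name__ == "__main__":
    print("single braces raised:", broken_error)
    print(prompt)
```

In a script that wraps the formatting step in a broad `except Exception`, this failure would be reported as a loading error, which can be mistaken for a problem with the model file.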
tests/scripts/test_qwen_mcp_capabilities.py
ADDED
|
@@ -0,0 +1,326 @@
| 1 |
+
"""
|
| 2 |
+
Test script to evaluate Qwen2.5 0.5B capabilities with MCP for RTS game
|
| 3 |
+
Tests basic MCP understanding and complex planning capabilities
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import json
|
| 7 |
+
import time
|
| 8 |
+
from typing import Dict, Any, List
|
| 9 |
+
from ai_analysis import query_llm
|
| 10 |
+
|
# Test prompts for different capability levels
TEST_PROMPTS = {
    "basic_mcp": [
        {
            "name": "Simple Tool Call",
            "prompt": """
You are an AI assistant controlling an RTS game through MCP (Model Context Protocol).
Available tools:
- get_game_state(): Get current game state
- move_units(unit_ids, target_x, target_y): Move units to position
- attack_unit(attacker_ids, target_id): Attack enemy unit

User command: "Show me the current game state"

Respond with a JSON object containing the tool call to execute.
""",
            "expected": {"tool": "get_game_state", "args": {}}
        },
        {
            "name": "Basic Action Translation",
            "prompt": """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- move_units(unit_ids, target_x, target_y): Move units to position
- attack_unit(attacker_ids, target_id): Attack enemy unit

User command: "Move my infantry to position 100, 200"

Respond with a JSON object containing the tool call to execute.
""",
            "expected": {"tool": "move_units", "args": {"unit_ids": ["infantry"], "target_x": 100, "target_y": 200}}
        }
    ],
    "complex_planning": [
        {
            "name": "Multi-step Strategy",
            "prompt": """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- get_game_state(): Get current game state
- move_units(unit_ids, target_x, target_y): Move units
- attack_unit(attacker_ids, target_id): Attack enemy
- build_building(building_type, position_x, position_y, player_id): Build building

User command: "I want to build a base near the ore field and defend it with turrets"

Break this down into a sequence of MCP tool calls. Respond with a JSON array of tool calls.
""",
            "expected": {"type": "sequence", "steps": ["get_game_state", "build_building", "build_building"]}
        },
        {
            "name": "Strategic Analysis",
            "prompt": """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- get_game_state(): Get current game state
- get_ai_analysis(language): Get tactical analysis

User command: "Analyze the battlefield and suggest the best strategy"

Respond with a JSON object containing the tool calls needed.
""",
            "expected": {"type": "analysis", "steps": ["get_game_state", "get_ai_analysis"]}
        }
    ],
    "advanced_mcp": [
        {
            "name": "Parameter Extraction",
            "prompt": """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move tanks 1, 3, and 7 to coordinates 150, 75"

Extract the parameters and respond with a JSON tool call.
""",
            "expected": {"tool": "move_units", "args": {"unit_ids": [1, 3, 7], "target_x": 150, "target_y": 75}}
        },
        {
            "name": "Error Handling",
            "prompt": """
You are an AI assistant controlling an RTS game through MCP.
Available tools:
- move_units(unit_ids, target_x, target_y): Move units

User command: "Move my units to the enemy base"

Since you don't know the exact coordinates, how would you handle this?
Respond with a JSON object showing your approach.
""",
            "expected": {"type": "needs_clarification", "message": "Need coordinates for enemy base"}
        }
    ]
}

def test_qwen_capabilities():
    """Run comprehensive tests on Qwen2.5 0.5B MCP capabilities"""

    print("=== Testing Qwen2.5 0.5B MCP Capabilities ===\n")

    results = {}

    for category, tests in TEST_PROMPTS.items():
        print(f"\nTesting {category.replace('_', ' ').title()}:")
        print("-" * 50)

        category_results = []

        for test in tests:
            print(f"\n🧪 Test: {test['name']}")
            print(f"Prompt: {test['prompt'][:100]}...")

            try:
                # Query the LLM
                start_time = time.time()
                response = query_llm(
                    prompt=test['prompt'],
                    max_tokens=500,
                    temperature=0.1,
                    system_message="You are an AI assistant that responds with JSON objects for MCP tool calls."
                )
                response_time = time.time() - start_time

                # Parse and analyze response
                analysis = analyze_response(test, response, response_time)
                category_results.append(analysis)

                print(f"✅ Response time: {response_time:.2f}s")
                print(f"Response: {response[:200]}...")
                print(f"Analysis: {analysis['score']}/10")

            except Exception as e:
                print(f"❌ Error: {e}")
                category_results.append({
                    'test': test['name'],
                    'error': str(e),
                    'score': 0
                })

        results[category] = category_results

    # Generate summary report
    generate_summary_report(results)

    return results
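
Small instruction-tuned models often wrap their JSON in a Markdown fence or surround it with prose, and a plain `json.loads` call on the raw response rejects that output. The sketch below shows one way to recover the payload before scoring. `extract_json` is a hypothetical helper, not part of this script or of `ai_analysis`; it tries the raw text, then any fenced block, then the first decodable object or array using the standard library's `JSONDecoder.raw_decode`.

```python
import json
import re
from typing import Any, Optional

# Build the fence marker programmatically so this example can itself be fenced.
FENCE = "`" * 3
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text: str) -> Optional[Any]:
    """Return the first JSON value found in text, or None (hypothetical helper)."""
    # 1) The whole response, then 2) the body of each Markdown fence.
    candidates = [text.strip()]
    candidates += [body.strip() for body in _FENCE_RE.findall(text)]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            pass

    # 3) Scan for the first position where an object or array decodes cleanly.
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char in "{[":
            try:
                value, _end = decoder.raw_decode(text, index)
                return value
            except json.JSONDecodeError:
                continue
    return None
```

Under this assumption, the response would be passed through `extract_json` first, and only a `None` result would be treated as invalid JSON.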

def analyze_response(test: Dict[str, Any], response: str, response_time: float) -> Dict[str, Any]:
    """Analyze the LLM response and score its performance"""

    analysis = {
        'test': test['name'],
        'response': response,
        'response_time': response_time,
        'score': 0,
        'strengths': [],
        'weaknesses': [],
        'details': {}
    }

    # Basic response quality checks
    if not response or response.strip() == "":
        analysis['weaknesses'].append("Empty response")
        return analysis

    # Check for JSON structure
    try:
        # Try to parse as JSON
        parsed = json.loads(response)
        analysis['details']['json_valid'] = True
        analysis['strengths'].append("Valid JSON structure")

        # Check if it matches expected structure
        if 'expected' in test:
            expected = test['expected']
            if isinstance(expected, dict):
                if 'tool' in expected and 'tool' in parsed:
                    if parsed['tool'] == expected['tool']:
                        analysis['score'] += 4
                        analysis['strengths'].append("Correct tool selection")
                    else:
                        analysis['weaknesses'].append(f"Wrong tool: {parsed.get('tool')} vs {expected['tool']}")

                # Check arguments
                if 'args' in expected and 'args' in parsed:
                    arg_match = compare_arguments(parsed['args'], expected['args'])
                    analysis['score'] += arg_match * 3
                    if arg_match > 0.7:
                        analysis['strengths'].append("Good argument matching")
                    else:
                        analysis['weaknesses'].append("Poor argument matching")

    except json.JSONDecodeError:
        analysis['details']['json_valid'] = False
        analysis['weaknesses'].append("Invalid JSON format")

        # Check for tool-like patterns in text
        if 'get_game_state' in response:
            analysis['score'] += 2
            analysis['strengths'].append("Mentions correct tool")
        if 'move_units' in response or 'attack_unit' in response:
            analysis['score'] += 1

    # Response time scoring
    if response_time < 5.0:
        analysis['score'] += 1
        analysis['strengths'].append("Fast response")
    elif response_time > 15.0:
        analysis['weaknesses'].append("Slow response")

    # Content relevance scoring
    if any(keyword in response.lower() for keyword in ['game', 'state', 'move', 'attack', 'build']):
        analysis['score'] += 1
        analysis['strengths'].append("Relevant content")

    # Cap for score
    analysis['score'] = min(analysis['score'], 10)

    return analysis
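
The `complex_planning` cases declare their expectation as a `steps` list, which the `tool`/`args` comparison in `analyze_response` never inspects. One possible way to score such cases is in-order matching of the tool names the model called against the expected steps. This is only a sketch: `called_tools` and `score_steps` are hypothetical names, and the accepted response shapes (a JSON array of calls, or an object with a `steps` list) are assumptions about what the model returns.

```python
from typing import Any, List


def called_tools(parsed: Any) -> List[str]:
    """Collect tool names from a parsed response (assumed shapes, see lead-in)."""
    if isinstance(parsed, dict):
        # Either {"steps": [...]} or a single {"tool": ...} call.
        parsed = parsed["steps"] if isinstance(parsed.get("steps"), list) else [parsed]
    if not isinstance(parsed, list):
        return []
    names = []
    for item in parsed:
        if isinstance(item, str):
            names.append(item)
        elif isinstance(item, dict) and isinstance(item.get("tool"), str):
            names.append(item["tool"])
    return names


def score_steps(parsed: Any, expected_steps: List[str]) -> float:
    """Fraction of expected steps found, in order, among the called tools."""
    if not expected_steps:
        return 0.0
    names = called_tools(parsed)
    matched = 0
    position = 0  # search for each expected step only after the previous match
    for step in expected_steps:
        try:
            position = names.index(step, position) + 1
        except ValueError:
            continue  # this expected step was not called in order
        matched += 1
    return matched / len(expected_steps)
```

The returned fraction could be multiplied by a weight and added to the score, in the same way `arg_match * 3` is used for argument matching.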

def compare_arguments(actual: Dict, expected: Dict) -> float:
    """Compare argument dictionaries and return match percentage"""
    if not actual or not expected:
        return 0.0

    matches = 0
    total = len(expected)

    for key, expected_value in expected.items():
        if key in actual:
            actual_value = actual[key]
            if isinstance(expected_value, list) and isinstance(actual_value, list):
                # Compare lists
                if set(expected_value) == set(actual_value):
                    matches += 1
            elif expected_value == actual_value:
                matches += 1

    return matches / total if total > 0 else 0.0
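
The list branch of `compare_arguments` relies on `set()`, which raises `TypeError` when a model returns unhashable elements such as nested lists or dicts (for example `[{"id": 1}]` instead of `[1]`). A minimal sketch of a more tolerant comparison follows. `lists_match` is a hypothetical helper, and its fallback compares element counts as well, so it is slightly stricter than set equality.

```python
from typing import Any, List


def lists_match(expected: List[Any], actual: List[Any]) -> bool:
    """Order-insensitive list comparison that tolerates unhashable elements."""
    try:
        # Same semantics as the original check when every element is hashable.
        return set(expected) == set(actual)
    except TypeError:
        # Fallback for dicts/lists inside the list: pair items off one by one.
        remaining = list(actual)
        for item in expected:
            if item in remaining:
                remaining.remove(item)
            else:
                return False
        return not remaining
```

With a helper like this, a malformed argument list lowers the score instead of aborting the whole test case with an exception.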

def generate_summary_report(results: Dict[str, List[Dict]]):
    """Generate a comprehensive summary report"""

    print("\n" + "="*60)
    print("QWEN2.5 0.5B MCP CAPABILITY ASSESSMENT REPORT")
    print("="*60)

    overall_scores = []

    for category, category_results in results.items():
        if not category_results:
            continue

        category_scores = [r.get('score', 0) for r in category_results if 'score' in r]
        avg_score = sum(category_scores) / len(category_scores) if category_scores else 0
        overall_scores.append(avg_score)

        print(f"\n{category.replace('_', ' ').title()}:")
        print(f" Average Score: {avg_score:.1f}/10")

        for result in category_results:
            if 'error' in result:
                print(f" ❌ {result['test']}: ERROR - {result['error']}")
            else:
                print(f" {'✅' if result['score'] >= 6 else '⚠️'} {result['test']}: {result['score']}/10")
                if result['strengths']:
                    print(f" Strengths: {', '.join(result['strengths'][:2])}")
                if result['weaknesses']:
                    print(f" Weaknesses: {', '.join(result['weaknesses'][:2])}")

    # Overall assessment
    if overall_scores:
        overall_avg = sum(overall_scores) / len(overall_scores)
        print(f"\n🎯 OVERALL ASSESSMENT: {overall_avg:.1f}/10")

        if overall_avg >= 8:
            print("💪 EXCELLENT - Qwen2.5 0.5B is highly capable for MCP tasks")
        elif overall_avg >= 6:
            print("GOOD - Qwen2.5 0.5B is capable with some limitations")
        elif overall_avg >= 4:
            print("⚠️ MODERATE - Qwen2.5 0.5B has significant limitations")
        else:
            print("❌ POOR - Qwen2.5 0.5B is not suitable for MCP tasks")

        # Recommendations
        print(f"\n💡 RECOMMENDATIONS:")
        if overall_avg >= 7:
            print("- Use Qwen2.5 0.5B for MCP translation with confidence")
            print("- Implement prompt engineering for complex tasks")
            print("- Add validation layer for safety")
        elif overall_avg >= 5:
            print("- Use Qwen2.5 0.5B for simple MCP tasks")
            print("- Implement strong validation and fallback mechanisms")
            print("- Consider using larger models for complex planning")
        else:
            print("- Consider upgrading to a larger model (1.5B+)")
            print("- Use Qwen2.5 0.5B only for very simple translations")
            print("- Implement extensive error handling")

if __name__ == "__main__":
    print("Starting Qwen2.5 0.5B MCP capability assessment...")
    print("This will test the model's ability to translate user commands to MCP tool calls.")
    print("Make sure the model is downloaded and available at: qwen2.5-0.5b-instruct-q4_0.gguf")

    try:
        results = test_qwen_capabilities()

        # Save detailed results to file
        with open("/home/luigi/rts/web/qwen_mcp_assessment.json", "w") as f:
            json.dump(results, f, indent=2)

        print("\nDetailed results saved to: qwen_mcp_assessment.json")

    except Exception as e:
        print(f"❌ Assessment failed: {e}")
        print("Make sure the AI model is properly downloaded and configured.")
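
The entry point writes its results to an absolute path on one developer's machine, so the save step fails anywhere that directory does not exist. A possible alternative, sketched here under the assumption that reports may live in a caller-chosen directory, is a small helper that creates the directory and returns the path it actually wrote. `save_results` and its parameters are hypothetical and do not appear in the script.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_results(
    results: Dict[str, Any],
    out_dir: Path,
    filename: str = "qwen_mcp_assessment.json",
) -> Path:
    """Write results as JSON under out_dir and return the file path (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)  # create missing directories
    out_path = out_dir / filename
    with out_path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps any non-ASCII model output readable in the file.
        json.dump(results, handle, indent=2, ensure_ascii=False)
    return out_path
```

A call such as `save_results(results, Path(__file__).resolve().parent)` would keep the report next to the script, and printing the returned path avoids the mismatch between the path written and the path reported.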