NiWaRe committed

Commit 40e1a91 · 1 Parent(s): 0783971

refactor for stateless: turn stateless on for FastMCP to work with OpenAI client etc

ARCHITECTURE.md ADDED
@@ -0,0 +1,263 @@
1
+ # W&B MCP Server - Architecture & Scalability Guide
2
+
3
+ ## Table of Contents
4
+ 1. [Architecture Decision](#architecture-decision)
5
+ 2. [Stateless HTTP Design](#stateless-http-design)
6
+ 3. [Performance & Scalability](#performance--scalability)
7
+ 4. [Load Test Results](#load-test-results)
8
+ 5. [Deployment Recommendations](#deployment-recommendations)
9
+
10
+ ---
11
+
12
+ ## Architecture Decision
13
+
14
+ ### Decision: Pure Stateless HTTP Mode
15
+
16
+ **The W&B MCP Server uses pure stateless HTTP mode (`stateless_http=True`).**
17
+
18
+ This fundamental architecture decision enables:
19
+ - ✅ **Universal client compatibility** (OpenAI, Cursor, LeChat, Claude)
20
+ - ✅ **Horizontal scaling** capabilities
21
+ - ✅ **Simpler operations** and maintenance
22
+ - ✅ **Cloud-native** deployment patterns
23
+
24
+ ### Why Stateless?
25
+
26
+ The Model Context Protocol traditionally used stateful sessions, but this created issues:
27
+
28
+ | Client | Behavior | Problem with Stateful |
29
+ |--------|----------|----------------------|
30
+ | **OpenAI** | Deletes session after listing tools, then reuses ID | Session not found errors |
31
+ | **Cursor** | Sends Bearer token with every request | Expects stateless behavior |
32
+ | **Claude** | Can work with either model | No issues |
33
+
34
+ ### The Solution
35
+
36
+ ```python
37
+ # Pure stateless operation - no session persistence
38
+ mcp = FastMCP("wandb-mcp-server", stateless_http=True)
39
+ ```
40
+
41
+ With this approach:
42
+ - **Session IDs are correlation IDs only** - they match requests to responses
43
+ - **No state persists between requests** - each request is independent
44
+ - **Authentication required per request** - Bearer token must be included
45
+ - **Any worker can handle any request** - enables horizontal scaling
46
+
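+ For the HTTP deployment, this stateless server is mounted into FastAPI via its streamable HTTP app. A minimal sketch of the wiring (import path per the MCP Python SDK; the mounting follows the repo's HF Spaces setup, where `streamable_http_app()` exposes its own `/mcp` route and is therefore mounted at root):
+
+ ```python
+ from fastapi import FastAPI
+ from mcp.server.fastmcp import FastMCP
+
+ mcp = FastMCP("wandb-mcp-server", stateless_http=True)
+
+ app = FastAPI()
+ # streamable_http_app() already routes /mcp internally, so mount at root
+ app.mount("/", mcp.streamable_http_app())
+ ```
+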
47
+ ---
48
+
49
+ ## Stateless HTTP Design
50
+
51
+ ### Architecture Overview
52
+
53
+ ```
54
+ ┌─────────────────────────────────────┐
55
+ │ MCP Clients (OpenAI/Cursor/etc) │
56
+ │ Bearer Token with Each Request │
57
+ └─────────────┬───────────────────────┘
58
+ │ HTTPS
59
+ ┌─────────────▼───────────────────────┐
60
+ │ Load Balancer (Optional) │
61
+ │ Round-Robin Distribution │
62
+ └──┬──────────┬──────────┬────────────┘
63
+ │ │ │
64
+ ┌──▼───┐ ┌──▼───┐ ┌──▼───┐
65
+ │ W1 │ │ W2 │ │ W3 │ (Multiple Workers Possible)
66
+ │ │ │ │ │ │
67
+ │ ASGI │ │ ASGI │ │ ASGI │ Uvicorn/Gunicorn
68
+ └──┬───┘ └──┬───┘ └──┬───┘
69
+ │ │ │
70
+ ┌──▼──────────▼──────────▼────────────┐
71
+ │ FastAPI Application │
72
+ │ ┌────────────────────────────┐ │
73
+ │ │ Stateless Auth Middleware │ │
74
+ │ │ (Bearer Token Validation) │ │
75
+ │ └────────────────────────────┘ │
76
+ │ ┌────────────────────────────┐ │
77
+ │ │ MCP Stateless Handler │ │
78
+ │ │ (No Session Storage) │ │
79
+ │ └────────────────────────────┘ │
80
+ └─────────────┬───────────────────────┘
81
+
82
+ ┌─────────────▼───────────────────────┐
83
+ │ W&B API Integration │
84
+ └─────────────────────────────────────┘
85
+ ```
86
+
87
+ ### Request Flow
88
+
89
+ 1. **Client sends request** with Bearer token and session ID
90
+ 2. **Middleware validates** Bearer token
91
+ 3. **MCP processes** request (session ID used for correlation only)
92
+ 4. **Response sent** with matching session ID
93
+ 5. **No state persisted** - request complete
94
+
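+ This flow can be exercised end-to-end with any plain HTTP client. A minimal sketch using `httpx` (the initialize payload mirrors the one used in the load tests; the API key is a placeholder):
+
+ ```python
+ import httpx
+
+ # Each request is self-contained: Bearer token plus both accepted content types
+ response = httpx.post(
+     "https://mcp.withwandb.com/mcp",
+     headers={
+         "Authorization": "Bearer YOUR_WANDB_API_KEY",
+         "Content-Type": "application/json",
+         "Accept": "application/json, text/event-stream",
+     },
+     json={
+         "jsonrpc": "2.0",
+         "method": "initialize",
+         "params": {
+             "protocolVersion": "2025-06-18",
+             "capabilities": {},
+             "clientInfo": {"name": "example", "version": "1.0"},
+         },
+         "id": 1,
+     },
+     timeout=30,
+ )
+ # The returned session ID is a correlation ID only - nothing is stored server-side
+ print(response.headers.get("Mcp-Session-Id"), response.status_code)
+ ```
+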
95
+ ### Key Implementation Details
96
+
97
+ ```python
98
+ async def thread_safe_auth_middleware(request: Request, call_next):
99
+ """Stateless authentication middleware."""
100
+
101
+ # Session IDs are correlation IDs only
102
+ session_id = request.headers.get("Mcp-Session-Id")
103
+ if session_id:
104
+ logger.debug(f"Correlation ID: {session_id[:8]}...")
105
+
106
+ # Every request must have Bearer token
107
+ authorization = request.headers.get("Authorization", "")
108
+ if authorization.startswith("Bearer "):
109
+ api_key = authorization[7:].strip()
110
+ # Use the API key for this request only -
111
+ # no session storage or retrieval
+
+ # Forward to the MCP handler; nothing is persisted afterwards
+ return await call_next(request)
112
+ ```
113
+
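+ Downstream tools need the validated key without any global state. The server uses Python's `ContextVar` for this per-request isolation; a sketch of the pattern (names follow the repository's earlier guides):
+
+ ```python
+ from contextvars import ContextVar
+
+ # Each concurrent request sees only its own value
+ api_key_context: ContextVar[str] = ContextVar("wandb_api_key")
+
+ # Inside the middleware, once the Bearer token is validated:
+ #   token = api_key_context.set(api_key)
+ #   try:
+ #       return await call_next(request)
+ #   finally:
+ #       api_key_context.reset(token)  # nothing leaks across requests
+ ```
+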
114
+ ---
115
+
116
+ ## Performance & Scalability
117
+
118
+ ### Single Worker Performance
119
+
120
+ Based on testing with stateless mode:
121
+
122
+ | Metric | Local Server | Remote (HF Spaces) |
123
+ |--------|--------------|-------------------|
124
+ | **Max Concurrent** | 1000 clients | 500+ clients |
125
+ | **Throughput** | ~50-60 req/s | ~35 req/s |
126
+ | **Latency (p50)** | <500ms | <2s |
127
+ | **Memory Usage** | 200-500MB | 300-600MB |
128
+
129
+ ### Horizontal Scaling Potential
130
+
131
+ With stateless mode, the server supports true horizontal scaling:
132
+
133
+ | Workers | Max Concurrent | Total Throughput | Notes |
134
+ |---------|----------------|------------------|-------|
135
+ | 1 | 1000 | ~50 req/s | Current deployment |
136
+ | 2 | 2000 | ~100 req/s | Linear scaling |
137
+ | 4 | 4000 | ~200 req/s | Near-linear |
138
+ | 8 | 8000 | ~400 req/s | Some overhead |
139
+
140
+ **Key Advantage**: No session affinity required - any worker can handle any request!
141
+
142
+ ---
143
+
144
+ ## Load Test Results
145
+
146
+ ### Latest Test Results (2025-09-25)
147
+
148
+ #### Local Server (macOS, Single Worker)
149
+
150
+ | Concurrent Clients | Success Rate | Throughput | Mean Response |
151
+ |--------------------|-------------|------------|---------------|
152
+ | 10 | 100% | 47 req/s | 89ms |
153
+ | 100 | 100% | 47 req/s | 1.2s |
154
+ | 500 | 100% | 56 req/s | 4.4s |
155
+ | **1000** | **100%** | **48 req/s** | **9.3s** |
156
+ | 1500 | 80% | 51 req/s | 15.4s |
157
+ | 2000 | 70% | 53 req/s | 20.8s |
158
+
159
+ **Breaking Point**: ~1500 concurrent connections
160
+
161
+ #### Remote Server (mcp.withwandb.com)
162
+
163
+ | Concurrent Clients | Success Rate | Throughput | Mean Response |
164
+ |--------------------|-------------|------------|---------------|
165
+ | 10 | 100% | 10 req/s | 0.8s |
166
+ | 50 | 100% | 29 req/s | 1.2s |
167
+ | 100 | 100% | 33 req/s | 1.9s |
168
+ | 200 | 100% | 34 req/s | 3.3s |
169
+ | **500** | **100%** | **35 req/s** | **7.5s** |
170
+
171
+ **Key Finding**: Remote server handles 500+ concurrent connections reliably!
172
+
173
+ ### Performance Sweet Spots
174
+
175
+ 1. **Low Latency** (<1s response): Use ≤50 concurrent connections
176
+ 2. **Balanced** (good throughput & latency): Use 100-200 concurrent connections
177
+ 3. **Maximum Throughput**: Use 200-300 concurrent connections
178
+ 4. **Maximum Capacity**: Up to 500 concurrent (remote) or 1000 (local)
179
+
180
+ ---
181
+
182
+ ## Deployment Recommendations
183
+
184
+ ### Current Deployment (Hugging Face Spaces)
185
+
186
+ ```yaml
187
+ Configuration:
188
+ - Single worker (can be increased)
189
+ - Stateless HTTP mode
190
+ - 2 vCPU, 16GB RAM
191
+ - Port 7860
192
+
193
+ Performance:
194
+ - 500+ concurrent connections
195
+ - ~35 req/s throughput
196
+ - 100% reliability up to 500 concurrent
197
+ ```
198
+
199
+ ### Scaling Options
200
+
201
+ #### Option 1: Vertical Scaling
202
+ - Increase CPU/RAM on Hugging Face Spaces
203
+ - Can improve single-worker throughput
204
+
205
+ #### Option 2: Horizontal Scaling (Recommended)
206
+ ```python
207
+ # app.py - Enable multiple workers
208
+ uvicorn.run("app:app", host="0.0.0.0", port=PORT, workers=4)  # workers > 1 requires an import string, not an app object
209
+ ```
210
+
211
+ #### Option 3: Multi-Region Deployment
212
+ - Deploy to multiple regions
213
+ - Use global load balancer
214
+ - Reduce latency for users worldwide
215
+
216
+ ### Production Checklist
217
+
218
+ ✅ **Stateless mode enabled** (`stateless_http=True`)
219
+ ✅ **Bearer authentication** on every request
220
+ ✅ **Health check endpoint** (`/health`)
221
+ ✅ **Monitoring** for response times and errors
222
+ ✅ **Rate limiting** (recommended: 100 req/s per client)
223
+ ✅ **Connection limits** (recommended: 500 concurrent)
224
+
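+ The rate-limit and connection-limit recommendations can be enforced in the app itself. A minimal single-process sketch of the per-client limit (the 100 req/s figure is the checklist recommendation; keying on the Bearer token and the sliding window are illustrative choices):
+
+ ```python
+ import time
+ from collections import defaultdict, deque
+
+ from fastapi import FastAPI, Request
+ from fastapi.responses import JSONResponse
+
+ app = FastAPI()
+ MAX_REQ_PER_SEC = 100  # recommended per-client limit
+ _windows: dict[str, deque] = defaultdict(deque)
+
+ @app.middleware("http")
+ async def rate_limit(request: Request, call_next):
+     # Key on the Bearer token so the limit applies per client, not per IP
+     key = request.headers.get("Authorization", "anonymous")
+     window = _windows[key]
+     now = time.time()
+     while window and now - window[0] > 1.0:  # 1-second sliding window
+         window.popleft()
+     if len(window) >= MAX_REQ_PER_SEC:
+         return JSONResponse({"error": "rate limit exceeded"}, status_code=429)
+     window.append(now)
+     return await call_next(request)
+ ```
+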
225
+ ### Configuration Example
226
+
227
+ ```python
228
+ # Production configuration
229
+ mcp = FastMCP("wandb-mcp-server", stateless_http=True)
230
+
231
+ # Uvicorn with multiple workers (if needed)
232
+ if __name__ == "__main__":
233
+ uvicorn.run(
234
+ app,
235
+ host="0.0.0.0",
236
+ port=7860,
237
+ workers=1,  # increase for horizontal scaling (pass the "app:app" import string when > 1)
238
+ limit_concurrency=1000, # Connection limit
239
+ timeout_keep_alive=30, # Keepalive timeout
240
+ )
241
+ ```
242
+
243
+ ### Security Considerations
244
+
245
+ 1. **API Key Validation**: Every request validates Bearer token
246
+ 2. **No Session Storage**: No risk of session hijacking
247
+ 3. **Rate Limiting**: Protect against abuse
248
+ 4. **HTTPS Only**: Always use TLS in production
249
+ 5. **Token Rotation**: Encourage regular API key rotation
250
+
251
+ ---
252
+
253
+ ## Summary
254
+
255
+ The W&B MCP Server's stateless architecture provides:
256
+
257
+ - **Universal Compatibility**: Works with all MCP clients
258
+ - **Excellent Performance**: 500+ concurrent connections, ~35 req/s
259
+ - **Horizontal Scalability**: Add workers to increase capacity
260
+ - **Simple Operations**: No session management complexity
261
+ - **Production Ready**: Deployed and tested at scale
262
+
263
+ The stateless design is not a compromise - it's the optimal architecture for MCP servers in production environments.
ARCHITECTURE_DECISION.md DELETED
@@ -1,75 +0,0 @@
1
- # Architecture Decision: Single-Worker Async
2
-
3
- ## Decision
4
-
5
- Use **single-worker async architecture** with Uvicorn and uvloop for the W&B MCP Server deployment.
6
-
7
- ## Context
8
-
9
- MCP (Model Context Protocol) requires stateful session management where:
10
- - Server creates session IDs on initialization
11
- - Clients must include session ID in subsequent requests
12
- - Session state must be maintained across the conversation
13
-
14
- ## Considered Options
15
-
16
- ### 1. Multi-Worker with Gunicorn (Rejected)
17
- - ❌ Session state not shared across workers
18
- - ❌ Requires Redis/Memcached (not available on HF Spaces)
19
- - ❌ Breaks MCP protocol compliance
20
-
21
- ### 2. Multi-Worker with Sticky Sessions (Rejected)
22
- - ❌ No load balancer control on HF Spaces
23
- - ❌ Complex configuration
24
- - ❌ Still doesn't guarantee session persistence
25
-
26
- ### 3. Single-Worker Async (Chosen) ✅
27
- - ✅ Full MCP protocol compliance
28
- - ✅ Handles 100-1000+ concurrent requests
29
- - ✅ Simple, reliable architecture
30
- - ✅ Used by GitHub MCP Server and other references
31
-
32
- ## Implementation
33
-
34
- ```dockerfile
35
- CMD ["uvicorn", "app:app", \
36
-      "--workers", "1", \
37
-      "--loop", "uvloop", \
38
-      "--limit-concurrency", "1000"]
39
- ```
40
-
41
- ## Performance
42
-
43
- Despite single-worker limitation:
44
- - **Concurrent Handling**: Async event loop processes I/O concurrently
45
- - **Non-blocking**: Database queries, API calls don't block other requests
46
- - **Throughput**: 500-2000 requests/second
47
- - **Memory Efficient**: ~200-500MB for hundreds of concurrent sessions
48
-
49
- ## Comparison with Industry Standards
50
-
51
- | Server | Architecture | Reasoning |
52
- |--------|------------|-----------|
53
- | GitHub MCP Server | Single process (Go) | Stateful sessions |
54
- | WebSocket servers | Single worker + async | Connection state |
55
- | GraphQL subscriptions | Single worker + async | Subscription state |
56
- | **W&B MCP Server** | **Single worker + async** | **MCP session state** |
57
-
58
- ## Future Scaling Path
59
-
60
- If we outgrow single-worker capacity:
61
-
62
- 1. **Vertical Scaling**: Increase CPU/memory (immediate)
63
- 2. **Edge Deployment**: Multiple regions with geo-routing
64
- 3. **Kubernetes StatefulSets**: When platform supports it
65
- 4. **Durable Objects**: For edge computing platforms
66
-
67
- ## Conclusion
68
-
69
- Single-worker async is the **correct architectural choice** for MCP servers, not a limitation. It provides:
70
- - Protocol compliance
71
- - High concurrency
72
- - Simple deployment
73
- - Reliable session management
74
-
75
- This mirrors how other stateful protocols (WebSockets, SSE, GraphQL subscriptions) are typically deployed.
HUGGINGFACE_DEPLOYMENT.md DELETED
@@ -1,205 +0,0 @@
1
- # Hugging Face Spaces Deployment Guide
2
-
3
- This repository is configured for deployment on Hugging Face Spaces as a Model Context Protocol (MCP) server for Weights & Biases.
4
-
5
- ## Architecture
6
-
7
- The application runs as a FastAPI server on port 7860 (HF Spaces default) with:
8
- - **Main landing page**: `/` - Serves the index.html with setup instructions
9
- - **Health check**: `/health` - Returns server status and W&B configuration
10
- - **MCP endpoint**: `/mcp` - Streamable HTTP transport endpoint for MCP
11
- - Server can intelligently decide to return plain JSON or an SSE stream (the client always requests in the same way, see below)
12
- - Requires `Accept: application/json, text/event-stream` header
13
- - Supports initialize, tools/list, tools/call methods
14
-
15
- More information on the details of [streamable HTTP](https://modelcontextprotocol.io/specification/draft/basic/transports#streamable-http) is available in the official docs and [this PR](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/206).
16
-
17
- ## Key Changes for HF Spaces
18
-
19
- ### 1. app.py
20
- - Creates a FastAPI application that serves the landing page
21
- - Mounts FastMCP server using `mcp.streamable_http_app()` pattern (following [example from Mistral here](https://huggingface.co/spaces/Jofthomas/Multiple_mcp_fastapi_template))
22
- - Uses lifespan context manager for session management
23
- - Configured to run on `0.0.0.0:7860` (HF Spaces requirement)
24
- - Sets W&B cache directories to `/tmp` to avoid permission issues
25
-
26
- ### 2. server.py
27
- - Exports necessary functions for HF Spaces initialization
28
- - Supports being imported as a module
29
- - Maintains backward compatibility with CLI usage
30
-
31
- ### 3. Dependencies
32
- - FastAPI and uvicorn as main dependencies
33
- - All dependencies listed in requirements.txt for HF Spaces
34
-
35
- ### 4. Lazy Loading Fix
36
- - Changed `TraceService` initialization in `query_weave.py` to use lazy loading
37
- - This allows the server to start even without a W&B API key (e.g., when first adding the server in LeChat before connecting)
38
- - The service is only initialized when first needed
39
-
40
- ## Environment Variables
41
-
42
- No environment variables are required! The server works without any configuration.
43
-
44
- **Note**: Users provide their own W&B API keys as Bearer tokens. No server configuration needed (see AUTH_README.md).
45
-
46
- ## Deployment Steps
47
-
48
- 1. **Create a new Space on Hugging Face**
49
- - Choose "Docker" as the SDK
50
- - Set visibility as needed
51
-
52
- 2. **Configure Secrets**
53
- - Go to Settings → Variables and secrets
54
- - Add `MCP_SERVER_URL` as a variable so that the server URL is set correctly
55
-
56
- 3. **Push the Code**
57
- ```bash
58
- git add .
59
- git commit -m "Configure for HF Spaces deployment"
60
- git push
61
- ```
62
-
63
- 4. **Connect to the MCP Server**
64
- - Use the endpoint: `https://[your-username]-[space-name].hf.space/mcp`
65
- - Configure your MCP client with this URL and "streamable-http" transport
66
-
67
- ## File Structure
68
-
69
- ```
70
- .
71
- ├── app.py # HF Spaces entry point
72
- ├── index.html # Landing page
73
- ├── Dockerfile # Container configuration
74
- ├── requirements.txt # Python dependencies
75
- ├── pyproject.toml # Package configuration
76
- └── src/
77
- └── wandb_mcp_server/
78
- ├── server.py # MCP server implementation
79
- └── ... # Tool implementations
80
- ```
81
-
82
- ## Testing Locally
83
-
84
- To test the HF Spaces configuration locally:
85
-
86
- ```bash
87
- # Install dependencies
88
- pip install -r requirements.txt
89
-
90
- # Set environment variables
91
- export WANDB_API_KEY=your_key_here
92
-
93
- # Run the server
94
- python app.py
95
- ```
96
-
97
- The server will start on http://localhost:7860
98
-
99
- ## MCP Architecture & Key Learnings
100
-
101
- ### Understanding MCP and FastMCP
102
-
103
- The Model Context Protocol (MCP) is a protocol for communication between AI assistants and external tools/services. Through our experimentation, we discovered several important aspects:
104
-
105
- #### 1. FastMCP Framework
106
- - **FastMCP** is a Python framework that simplifies MCP server implementation
107
- - It provides decorators (`@mcp.tool()`) for easy tool registration
108
- - Internally uses Starlette for HTTP handling
109
- - Supports multiple transports: stdio, SSE, and streamable HTTP
110
-
111
- #### 2. Streamable HTTP Transport
112
- The streamable HTTP transport (introduced in [MCP PR #206](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/206)) is the modern approach for remote MCP:
113
-
114
- - **Single endpoint** (`/mcp`) handles all communication
115
- - **Dual mode operation**:
116
- - Regular POST requests for stateless operations
117
- - SSE (Server-Sent Events) upgrade for streaming responses
118
- - **Key advantages**:
119
- - Stateless servers possible (no persistent connections required)
120
- - Better infrastructure compatibility ("just HTTP")
121
- - Supports both request-response and streaming patterns
122
-
123
- #### 3. Implementation Patterns
124
-
125
- ##### The HuggingFace Pattern
126
- Based on the [reference implementation](https://huggingface.co/spaces/Jofthomas/Multiple_mcp_fastapi_template), the correct pattern is:
127
-
128
- ```python
129
- # Create MCP server
130
- mcp = FastMCP("server-name")
131
-
132
- # Register tools
133
- @mcp.tool()
134
- def my_tool(): ...
135
-
136
- # Get streamable HTTP app (returns Starlette app)
137
- mcp_app = mcp.streamable_http_app()
138
-
139
- # Mount in FastAPI
140
- app.mount("/", mcp_app) # Note: mount at root, not at /mcp
141
- ```
142
-
143
- ##### Why Mount at Root?
144
- - `streamable_http_app()` creates internal routes at `/mcp`
145
- - Mounting at `/mcp` would create `/mcp/mcp` (double path)
146
- - Mounting at root gives us the clean `/mcp` endpoint
147
-
148
- #### 4. Session Management
149
- - FastMCP includes a `session_manager` for handling stateful operations
150
- - Use lifespan context manager to properly initialize/cleanup:
151
- ```python
152
- async with mcp.session_manager.run():
153
- yield
154
- ```
155
-
156
- #### 5. Response Format
157
- - MCP uses **Server-Sent Events (SSE)** for responses
158
- - Responses are prefixed with `event: message` and `data: `
159
- - JSON-RPC format for the actual message content
160
- - Example response:
161
- ```
162
- event: message
163
- data: {"jsonrpc":"2.0","id":1,"result":{...}}
164
- ```
165
-
166
- ### Critical Implementation Details
167
-
168
- #### 1. Required Headers
169
- Clients MUST send:
170
- - `Content-Type: application/json`
171
- - `Accept: application/json, text/event-stream`
172
-
173
- Without the correct Accept header, the server returns a "Not Acceptable" error.
174
-
175
- #### 2. Lazy Loading Pattern
176
- To avoid initialization issues (e.g., API keys required at import time):
177
- ```python
178
- # Instead of this:
179
- _service = Service() # Fails if no API key
180
-
181
- # Use lazy loading:
182
- _service = None
183
- def get_service():
184
- global _service
185
- if _service is None:
186
- _service = Service()
187
- return _service
188
- ```
189
-
190
- #### 3. Environment Setup for HF Spaces
191
- Critical for avoiding permission errors:
192
- ```python
193
- os.environ["WANDB_CACHE_DIR"] = "/tmp/.wandb_cache"
194
- os.environ["HOME"] = "/tmp"
195
- ```
196
-
197
- ### Common Pitfalls & Solutions
198
-
199
- | Issue | Symptom | Solution |
200
- |-------|---------|----------|
201
- | Double path (`/mcp/mcp`) | 404 errors on `/mcp` | Mount streamable_http_app() at root (`/`) |
202
- | Missing Accept header | "Not Acceptable" error | Include `Accept: application/json, text/event-stream` |
203
- | Import-time API key errors | Server fails to start | Use lazy loading pattern |
204
- | Permission errors in HF Spaces | `mkdir /.cache: permission denied` | Set cache dirs to `/tmp` |
205
- | Can't access MCP methods | Methods not exposed | Use FastMCP's built-in decorators and methods |
README.md CHANGED
@@ -131,7 +131,7 @@ The integrated [wandbot](https://github.com/wandb/wandbot) support agent provide
131
 
132
  This MCP server can be deployed in three ways. **We recommend starting with the hosted server** for the easiest setup experience.
133
 
134
- ### 🌐 Option 1: Hosted Server (Recommended - No Installation Required)
135
 
136
  Use our publicly hosted server on Hugging Face Spaces - **zero installation needed!**
137
 
@@ -139,7 +139,7 @@ Use our publicly hosted server on Hugging Face Spaces - **zero installation need
139
 
140
  > **ℹ️ Quick Setup:** Click the button for your client above, then use the configuration examples in the sections below. Just replace `YOUR_WANDB_API_KEY` with your actual API key from [wandb.ai/authorize](https://wandb.ai/authorize).
141
 
142
- ### 💻 Option 2: Local Development (STDIO)
143
 
144
  Run the server locally with direct stdio communication - best for development and testing.
145
 
@@ -239,7 +239,7 @@ Use the HTTPS URL in your OpenAI client:
239
 
240
  > **Note:** Free ngrok URLs change each time you restart. For persistent URLs, consider ngrok's paid plans or alternatives like Cloudflare Tunnel.
241
 
242
- ### 🔌 Option 3: Self-Hosted HTTP Server
243
 
244
  Deploy your own HTTP server with API key authentication - great for team deployments or custom infrastructure.
245
 
@@ -842,7 +842,7 @@ Deploy your own instance of the W&B MCP Server on Hugging Face Spaces:
842
  https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space/mcp
843
  ```
844
 
845
- See [HUGGINGFACE_DEPLOYMENT.md](HUGGINGFACE_DEPLOYMENT.md) for detailed deployment instructions.
846
 
847
  ### Run Local HTTP Server
848
 
@@ -872,7 +872,7 @@ wandb-mcp-server/
872
  ├── requirements.txt # Python dependencies for HTTP deployment
873
  ├── index.html # Landing page for web interface
874
  ├── AUTH_README.md # Authentication documentation
875
- ├── HUGGINGFACE_DEPLOYMENT.md # HF Spaces deployment guide
876
  ├── src/
877
  │ └── wandb_mcp_server/
878
  │ ├── server.py # Core MCP server (STDIO & HTTP)
@@ -1056,11 +1056,11 @@ The W&B MCP Server is built with a modern, scalable architecture designed for bo
1056
 
1057
  ### Key Design Principles
1058
 
1059
- 1. **Stateless Architecture**: Each request is independent, enabling horizontal scaling
1060
- 2. **Per-Request Authentication**: API keys are isolated per request using Python's ContextVar
1061
- 3. **No Global State**: Eliminated `wandb.login()` in favor of `wandb.Api(api_key=...)`
1062
- 4. **Transport Agnostic**: Supports both STDIO (local) and HTTP (remote) transports
1063
- 5. **Cloud Native**: Designed for containerization and deployment on platforms like Hugging Face Spaces
1064
 
1065
  ### Deployment Architecture
1066
 
@@ -1072,17 +1072,17 @@ The server can be deployed in multiple configurations:
1072
  - **Containerized**: Docker with configurable worker counts
1073
  - **Cloud Platforms**: Hugging Face Spaces, AWS, GCP, etc.
1074
 
1075
- For detailed scalability information and advanced deployment options, see the [Scalability Guide](SCALABILITY_GUIDE.md).
1076
 
1077
  ### Performance & Scalability
1078
 
1079
- The server has been thoroughly tested and can handle significant production workloads:
1080
 
1081
- **Measured Performance (HF Spaces, 2 vCPU)**:
1082
- - **Maximum Capacity**: 600 concurrent connections
1083
- - **Peak Throughput**: 150 req/s
1084
- - **Breaking Point**: 650-700 concurrent connections
1085
- - **100% Success Rate**: Up to 600 clients
1086
 
1087
  Run your own load tests:
1088
 
@@ -1097,7 +1097,44 @@ python load_test.py --url https://mcp.withwandb.com --mode stress
1097
  python load_test.py --url https://mcp.withwandb.com --clients 100 --requests 20
1098
  ```
1099
 
1100
- See the comprehensive [Scalability Guide](SCALABILITY_GUIDE.md) for detailed performance analysis, testing instructions, and optimization strategies.
1101
 
1102
  ## Support
1103
 
 
131
 
132
  This MCP server can be deployed in three ways. **We recommend starting with the hosted server** for the easiest setup experience.
133
 
134
+ ### Option 1: Hosted Server (Recommended - No Installation Required)
135
 
136
  Use our publicly hosted server on Hugging Face Spaces - **zero installation needed!**
137
 
 
139
 
140
  > **ℹ️ Quick Setup:** Click the button for your client above, then use the configuration examples in the sections below. Just replace `YOUR_WANDB_API_KEY` with your actual API key from [wandb.ai/authorize](https://wandb.ai/authorize).
141
 
142
+ ### Option 2: Local Development (STDIO)
143
 
144
  Run the server locally with direct stdio communication - best for development and testing.
145
 
 
239
 
240
  > **Note:** Free ngrok URLs change each time you restart. For persistent URLs, consider ngrok's paid plans or alternatives like Cloudflare Tunnel.
241
 
242
+ ### Option 3: Self-Hosted HTTP Server
243
 
244
  Deploy your own HTTP server with API key authentication - great for team deployments or custom infrastructure.
245
 
 
842
  https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space/mcp
843
  ```
844
 
845
+ The server is deployed on Hugging Face Spaces at `https://mcp.withwandb.com`.
846
 
847
  ### Run Local HTTP Server
848
 
 
872
  ├── requirements.txt # Python dependencies for HTTP deployment
873
  ├── index.html # Landing page for web interface
874
  ├── AUTH_README.md # Authentication documentation
875
+ ├── ARCHITECTURE.md # Architecture & scalability guide
876
  ├── src/
877
  │ └── wandb_mcp_server/
878
  │ ├── server.py # Core MCP server (STDIO & HTTP)
 
1056
 
1057
  ### Key Design Principles
1058
 
1059
+ 1. **Pure Stateless Mode**: Session IDs are correlation IDs only - no state persists
1060
+ 2. **Horizontal Scalability**: Any worker can handle any request
1061
+ 3. **Universal Compatibility**: Works with OpenAI, Cursor, LeChat, and all MCP clients
1062
+ 4. **Per-Request Authentication**: Bearer token required with every request
1063
+ 5. **Cloud Native**: Optimized for containerization and cloud deployment
1064
 
1065
  ### Deployment Architecture
1066
 
 
1072
  - **Containerized**: Docker with configurable worker counts
1073
  - **Cloud Platforms**: Hugging Face Spaces, AWS, GCP, etc.
1074
 
1075
+ For detailed architecture and scalability information, see the [Architecture Guide](ARCHITECTURE.md).
1076
 
1077
  ### Performance & Scalability
1078
 
1079
+ The stateless server architecture provides excellent performance:
1080
 
1081
+ **Measured Performance**:
1082
+ - **Remote Server (mcp.withwandb.com)**: 500+ concurrent connections @ ~35 req/s
1083
+ - **Local Server**: 1000 concurrent connections @ ~50 req/s
1084
+ - **100% Success Rate**: Up to 500 clients (remote) or 1000 (local)
1085
+ - **Horizontal Scaling**: Add workers to multiply capacity
1086
 
1087
  Run your own load tests:
1088
 
 
1097
  python load_test.py --url https://mcp.withwandb.com --clients 100 --requests 20
1098
  ```
1099
 
1100
+ See the [Architecture Guide](ARCHITECTURE.md) for detailed performance analysis, testing instructions, and deployment recommendations.
1101
+
1102
+ ## Example: Using with OpenAI
1103
+
1104
+ Here's a complete example using the W&B MCP Server with OpenAI's client:
1105
+
1106
+ ```python
1107
+ from openai import OpenAI
1108
+ from dotenv import load_dotenv
1109
+ import os
1110
+
1111
+ load_dotenv()
1112
+
1113
+ client = OpenAI()
1114
+
1115
+ resp = client.responses.create(
1116
+ model="gpt-4o", # Use gpt-4o for larger context window to handle all MCP tools
1117
+ tools=[
1118
+ {
1119
+ "type": "mcp",
1120
+ "server_label": "wandb",
1121
+ "server_description": "A tool to query and analyze Weights & Biases data.",
1122
+ "server_url": "https://mcp.withwandb.com/mcp", # Must use public URL for OpenAI
1123
+ "authorization": os.getenv('WANDB_API_KEY'), # Use authorization field directly
1124
+ "require_approval": "never",
1125
+ },
1126
+ ],
1127
+ input="How many traces are in wandb-smle/hiring-agent-demo-public?",
1128
+ )
1129
+
1130
+ print(resp.output_text)
1131
+ ```
1132
+
1133
+ **Key Points:**
1134
+ - OpenAI's MCP implementation is server-side, so you must use a publicly accessible URL
1135
+ - The `authorization` field should contain your W&B API key directly (not in headers)
1136
+ - Use the `gpt-4o` model for a context window large enough to handle all W&B tools
1137
+ - The server operates in stateless mode - each request includes authentication
1138
 
1139
  ## Support
1140
 
SCALABILITY_GUIDE.md DELETED
@@ -1,754 +0,0 @@
1
- # W&B MCP Server - Scalability & Performance Guide
2
-
3
- ## Table of Contents
4
- 1. [Current Architecture](#current-architecture)
5
- - [Architecture Decision](#architecture-decision-why-single-worker-async)
6
- - [Implementation Details](#implementation-details)
7
- 2. [Performance Test Results](#performance-test-results)
8
- 3. [Load Testing Guide](#load-testing-guide)
9
- 4. [Hardware Scaling Analysis](#hardware-scaling-analysis)
10
- 5. [Optimization Strategies](#optimization-strategies)
11
- 6. [Deployment Recommendations](#deployment-recommendations)
12
- 7. [Future Scaling Options](#future-scaling-options)
13
- 8. [Common Questions About the Architecture](#common-questions-about-the-architecture)
14
- 9. [Summary](#summary)
15
-
16
- ---
17
-
18
- ## Current Architecture
19
-
20
- ### Architecture Decision: Why Single-Worker Async?
21
-
22
- The W&B MCP server uses a **single-worker async architecture** - a deliberate design choice optimized for the Model Context Protocol's stateful session requirements.
23
-
24
- #### The Decision Process
25
-
26
- MCP (Model Context Protocol) requires stateful session management where:
27
- - Server creates session IDs on initialization
28
- - Clients must include session ID in subsequent requests
29
- - Session state must be maintained across the conversation
30
-
31
- #### Options We Considered
32
-
33
- | Option | Verdict | Reasoning |
34
- |--------|---------|-----------|
35
- | **Multi-Worker with Gunicorn** | ❌ Rejected | Session state not shared across workers; Requires Redis/Memcached (not available on HF Spaces); Breaks MCP protocol compliance |
36
- | **Multi-Worker with Sticky Sessions** | ❌ Rejected | No load balancer control on HF Spaces; Complex configuration; Doesn't guarantee session persistence |
37
- | **Single-Worker Async** | ✅ **Chosen** | Full MCP protocol compliance; Handles 1000+ concurrent requests; Simple, reliable architecture; Industry standard for stateful protocols |
38
-
39
- #### Industry Comparison
40
-
41
- | Server | Architecture | Reasoning |
42
- |--------|-------------|-----------|
43
- | GitHub MCP Server | Single process (Go) | Stateful sessions |
44
- | WebSocket servers | Single worker + async | Connection state |
45
- | GraphQL subscriptions | Single worker + async | Subscription state |
46
- | **W&B MCP Server** | **Single worker + async** | **MCP session state** |
47
-
48
- #### Why This Isn't a Limitation
49
-
50
- Single-worker async is the **correct architectural choice** for MCP servers, not a compromise. Despite using a single worker, the architecture provides:
51
- - **Concurrent Handling**: Async event loop processes I/O concurrently
52
- - **Non-blocking Operations**: Database queries and API calls don't block other requests
53
- - **High Throughput**: 500-2000 requests/second capability
54
- - **Memory Efficiency**: Only ~200-500MB for hundreds of concurrent sessions
55
-
56
- ### Single-Worker Async Design
57
-
58
- ```
59
- ┌─────────────────────────────────────┐
60
- │ Hugging Face Spaces │
61
- │ (2 vCPU, 16GB RAM) │
62
- └─────────────┬───────────────────────┘
63
-
64
- ┌─────────────▼───────────────────────┐
65
- │ Uvicorn ASGI Server (Port 7860) │
66
- │ Single Worker Process │
67
- │ ┌──────────────────────┐ │
68
- │ │ Async Event Loop │ │
69
- │ │ (uvloop if available)│ │
70
- │ └──────────────────────┘ │
71
- └─────────────┬───────────────────────┘
72
-
73
- ┌─────────────▼───────────────────────┐
74
- │ FastAPI Application │
75
- │ ┌────────────────────────────┐ │
76
- │ │ Authentication Middleware │ │
77
- │ │ (ContextVar API Keys) │ │
78
- │ └────────────────────────────┘ │
79
- │ ┌────────────────────────────┐ │
80
- │ │ MCP Session Manager │ │
81
- │ │ (In-Memory Session Store) │ │
82
- │ └────────────────────────────┘ │
83
- └─────────────┬───────────────────────┘
84
-
85
- ┌─────────────▼───────────────────────┐
86
- │ W&B MCP Tools │
87
- │ • query_weave_traces_tool │
88
- │ • count_weave_traces_tool │
89
- │ • query_wandb_tool │
90
- │ • create_wandb_report_tool │
91
- │ • query_wandb_entity_projects │
92
- │ • query_wandb_support_bot │
93
- └─────────────────────────────────────┘
94
- ```
95
-
96
- ### Key Design Principles
97
-
98
- 1. **Stateful Session Management**: MCP requires persistent session state, making single-worker optimal
99
- 2. **Async Concurrency**: Event loop handles thousands of concurrent connections
100
- 3. **ContextVar Isolation**: Thread-safe API key storage for concurrent requests
101
- 4. **Connection Pooling**: Reuses HTTP connections to W&B APIs
102
- 5. **Non-blocking I/O**: All tools use async operations
103
-
104
- ### Implementation Details
105
-
106
- #### Dockerfile Configuration
107
- ```dockerfile
108
- # Single worker (for session state), 120s keep-alive, 1000+ concurrent connections
109
- CMD ["uvicorn", "app:app", \
110
-      "--host", "0.0.0.0", \
111
-      "--port", "7860", \
112
-      "--workers", "1", \
113
-      "--log-level", "info", \
114
-      "--timeout-keep-alive", "120", \
115
-      "--limit-concurrency", "1000"]
116
- ```
117
-
118
- #### Session Management
119
- ```python
120
- # In-memory session storage (app.py)
121
- session_api_keys = {} # Maps MCP session ID to W&B API key
122
-
123
- # Session lifecycle:
124
- # 1. Client sends Bearer token on initialization
125
- # 2. Server creates session ID and stores API key
126
- # 3. Client uses session ID for subsequent requests
127
- # 4. Server retrieves API key from session storage
128
- ```
129
-
130
- #### API Key Isolation (ContextVar)
131
- ```python
132
- # Thread-safe API key storage for concurrent requests
133
- from contextvars import ContextVar
134
-
135
- api_key_context: ContextVar[str] = ContextVar('wandb_api_key')
136
-
137
- # Per-request isolation:
138
- # 1. Middleware sets API key in context
139
- # 2. Tools retrieve from context (not environment)
140
- # 3. Each concurrent request has isolated context
141
- ```
142
-
143
- ---
144
-
145
- ## Performance Test Results
146
-
147
- ### Executive Summary
148
-
149
- The W&B MCP Server deployed on Hugging Face Spaces has been thoroughly stress-tested. **Key Finding**: The server can reliably handle **up to 600 concurrent connections** with 100% success rate, achieving **113-150 req/s throughput**.
150
-
151
- ### Optimal Performance Zone (100% Success Rate)
152
-
153
- | Concurrent Clients | Success Rate | Throughput | Mean Response Time | p99 Response Time |
154
- |--------------------|-------------|------------|-------------------|-------------------|
155
- | 1 | 100% | 2.6 req/s | 340ms | N/A |
156
- | 10 | 100% | 25 req/s | 290ms | 380ms |
157
- | 50 | 100% | 86 req/s | 390ms | 550ms |
158
- | 100 | 100% | 97 req/s | 690ms | 1.0s |
159
- | 200 | 100% | 150 req/s | 890ms | 1.2s |
160
- | 300 | 100% | 129 req/s | 1.51s | 1.91s |
161
- | 500 | 100% | 98 req/s | 4.52s | 6.02s |
162
- | **600** | **100%** | **113 req/s** | ~5s | ~7s |
163
-
164
- ### Performance Degradation Zone
165
-
166
- | Concurrent Clients | Success Rate | Notes |
167
- |--------------------|-------------|-------|
168
- | 650 | 94% | First signs of degradation |
169
- | 700 | 12.7% | Breaking point - server overwhelmed |
170
- | 750+ | <10% | Complete failure |
171
-
172
- ### Performance Sweet Spots
173
-
174
- 1. **For Low Latency** (< 1s response time):
175
- - Use ≤ 100 concurrent connections
176
- - Expect ~97 req/s throughput
177
- - p99 latency: 1 second
178
-
179
- 2. **For Maximum Throughput**:
180
- - Use 200-300 concurrent connections
181
- - Achieve 130-150 req/s
182
- - p99 latency: 1.2-1.9 seconds
183
-
184
- 3. **For Maximum Capacity**:
185
- - Use up to 600 concurrent connections
186
- - Achieve ~113 req/s
187
- - p99 latency: ~7 seconds
188
-
189
- ### Capacity Limits
190
-
191
- - **Absolute Maximum**: 600 concurrent connections
192
- - **Safe Operating Limit**: 500 concurrent connections (with buffer)
193
- - **Recommended Production Limit**: 400 concurrent connections
194
- - **Breaking Point**: 650-700 concurrent connections
195
-
196
- ### Comparison: Local vs Deployed
197
-
198
- | Metric | Local (2 vCPU) | HF Spaces (2 vCPU) | Notes |
199
- |--------|----------------|-------------------|-------|
200
- | Max Concurrent | 100 | 600 | HF handles 6x more! |
201
- | Throughput | 600 req/s | 113-150 req/s | Network overhead |
202
- | p50 Latency | 20ms | 500ms | Network + processing |
203
- | Breaking Point | 100 clients | 650 clients | Better infrastructure |
204
-
205
- ---
206
-
207
- ## Load Testing Guide
208
-
209
- ### Prerequisites
210
-
211
- ```bash
212
- # Install dependencies
213
- pip install httpx
214
-
215
- # Or using uv (recommended)
216
- uv pip install httpx
217
- ```
218
-
219
- ### Test Tools Overview
220
-
221
- We provide a comprehensive load testing tool (`load_test.py`) with three modes:
222
-
223
- 1. **Standard Mode**: Runs predefined test suite (light, medium, heavy load)
224
- 2. **Stress Mode**: Finds the breaking point progressively
225
- 3. **Custom Mode**: Run specific test configurations
226
-
227
- ### Testing Local Server
228
-
229
- #### 1. Start the Local Server
230
-
231
- ```bash
232
- # Terminal 1: Start the server
233
- cd /path/to/mcp-server
234
- source .venv/bin/activate # or use uv
235
- uvicorn app:app --host 0.0.0.0 --port 7860 --workers 1
236
- ```
237
-
238
- #### 2. Run Load Tests
239
-
240
- ```bash
241
- # Terminal 2: Run tests
242
-
243
- # Standard test suite (recommended first test)
244
- python load_test.py --mode standard
245
-
246
- # Custom test with specific parameters
247
- python load_test.py --mode custom --clients 50 --requests 20 --delay 0.05
248
-
249
- # Stress test to find breaking point
250
- python load_test.py --mode stress
251
-
252
- # Test with real API key
253
- python load_test.py --api-key YOUR_WANDB_API_KEY --mode custom --clients 10 --requests 5
254
- ```
255
-
256
- ### Testing Deployed Hugging Face Space
257
-
258
- #### 1. Basic Functionality Test
259
-
260
- ```bash
261
- # Test with small load first
262
- python load_test.py \
263
- --url https://mcp.withwandb.com \
264
- --mode custom \
265
- --clients 5 \
266
- --requests 3
267
- ```
268
-
269
- #### 2. Progressive Load Testing
270
-
271
- ```bash
272
- # Light load (10 clients)
273
- python load_test.py \
274
- --url https://mcp.withwandb.com \
275
- --mode custom \
276
- --clients 10 \
277
- --requests 10
278
-
279
- # Medium load (50 clients)
280
- python load_test.py \
281
- --url https://mcp.withwandb.com \
282
- --mode custom \
283
- --clients 50 \
284
- --requests 10 \
285
- --delay 0.05
286
-
287
- # Heavy load (100 clients) - be careful!
288
- python load_test.py \
289
- --url https://mcp.withwandb.com \
290
- --mode custom \
291
- --clients 100 \
292
- --requests 20 \
293
- --delay 0.01
294
- ```
295
-
296
- #### 3. Comprehensive Stress Test
297
-
298
- ```bash
299
- # Run full stress test (gradually increases load)
300
- python load_test.py \
301
- --url https://mcp.withwandb.com \
302
- --mode stress
303
- ```
304
-
305
- ### Creating Custom Stress Tests
306
-
307
- For finding exact breaking points, create a custom test script:
308
-
309
- ```python
310
- #!/usr/bin/env python3
311
- """Custom stress test for finding precise limits"""
312
-
313
- import asyncio
314
- import time
315
- import httpx
316
-
317
- async def test_concurrent_load(url, num_clients):
318
- """Test specific number of concurrent clients"""
319
-
320
- async def make_request(client):
321
- try:
322
- response = await client.post(
323
- f"{url}/mcp",
324
- headers={
325
- "Authorization": "Bearer test_key_12345678901234567890",
326
- "Content-Type": "application/json",
327
- "Accept": "application/json, text/event-stream",
328
- },
329
- json={
330
- "jsonrpc": "2.0",
331
- "method": "initialize",
332
- "params": {
333
- "protocolVersion": "2025-06-18",
334
- "capabilities": {},
335
- "clientInfo": {"name": "stress_test", "version": "1.0"}
336
- },
337
- "id": 1
338
- },
339
- timeout=60
340
- )
341
- return response.status_code == 200
342
- except Exception:
343
- return False
344
-
345
- print(f"Testing {num_clients} concurrent clients...")
346
- start = time.time()
347
-
348
- async with httpx.AsyncClient(limits=httpx.Limits(max_connections=1000)) as client:
349
- tasks = [make_request(client) for _ in range(num_clients)]
350
- results = await asyncio.gather(*tasks)
351
-
352
- elapsed = time.time() - start
353
- success_count = sum(results)
354
- success_rate = (success_count / num_clients) * 100
355
-
356
- print(f" ✅ Success: {success_count}/{num_clients} ({success_rate:.1f}%)")
357
- print(f" ⚡ Throughput: {num_clients/elapsed:.2f} req/s")
358
- print(f" ⏱️ Time: {elapsed:.2f}s")
359
-
360
- return success_rate
361
-
362
- async def main():
363
- # Test specific range to find breaking point
364
- for clients in [500, 550, 600, 650, 700]:
365
- success_rate = await test_concurrent_load(
366
- "https://mcp.withwandb.com",
367
- clients
368
- )
369
- if success_rate < 50:
370
- print(f"🔥 Breaking point at {clients} clients!")
371
- break
372
- await asyncio.sleep(3) # Let server recover
373
-
374
- if __name__ == "__main__":
375
- asyncio.run(main())
376
- ```
377
-
378
- ### Understanding Test Results
379
-
380
- #### Key Metrics to Monitor
381
-
382
- 1. **Success Rate**: Percentage of successful requests
383
- - 100%: Perfect performance
384
- - 90-99%: Acceptable with retries
385
- - <90%: Performance issues
386
- - <50%: Breaking point
387
-
388
- 2. **Throughput (req/s)**: Total requests per second
389
- - Local: Can achieve 600+ req/s
390
- - HF Spaces: Typically 100-150 req/s peak
391
-
392
- 3. **Response Time Percentiles**:
393
- - p50 (median): Typical response time
394
- - p95: 95% of requests faster than this
395
- - p99: 99% of requests faster than this
396
-
397
- 4. **Resource Usage**:
398
- - Monitor HF Space dashboard for CPU/Memory
399
- - Local: Use `htop` or system monitor
400
-
401
- ### Test Results Interpretation
402
-
403
- ```
404
- ============================================================
405
- Load Test Results
406
- ============================================================
407
-
408
- 📊 Overall Metrics:
409
- Total Time: 3.46s # How long the test took
410
- Total Requests: 2100 # Total requests made
411
- Successful: 2100 (100.0%) # Success rate - key metric!
412
- Failed: 0 # Should be 0 for good performance
413
- Requests/Second: 607.33 # Throughput
414
-
415
- 🔑 Session Creation:
416
- Mean: 1.348s # Average time to create session
417
- Median: 1.342s # Middle value (less affected by outliers)
418
- Std Dev: 0.157s # Consistency (lower is better)
419
-
420
- 🔧 Tool Calls:
421
- Mean: 0.024s # Average tool call time
422
- Median: 0.020s # Typical tool call time
423
- Min: 0.001s # Fastest response
424
- Max: 0.077s # Slowest response
425
-
426
- 📈 Latency Percentiles:
427
- p50: 0.020s # 50% of requests faster than this
428
- p95: 0.070s # 95% of requests faster than this
429
- p99: 0.076s # 99% of requests faster than this
430
-
431
- ⚡ Throughput:
432
- Concurrent Clients: 100 # Number of simultaneous clients
433
- Requests/Second/Client: 6.07 # Per-client throughput
434
- Total Throughput: 606.83 req/s # Overall server throughput
435
- ```
436
-
437
- ---
438
-
439
- ## Hardware Scaling Analysis
440
-
441
- ### Current Configuration (2 vCPU, 16GB RAM on HF Spaces)
442
-
443
- **Actual Measured Performance**:
444
- - ✅ 600 concurrent connections with 100% success
445
- - ✅ 113-150 req/s sustained throughput
446
- - ✅ 100% reliability up to 600 clients
447
- - ✅ Graceful degradation 600-700 clients
448
-
449
- **This significantly exceeds initial estimates!** The combination of:
450
- - Efficient async architecture
451
- - HF Spaces infrastructure
452
- - Optimized connection handling
453
-
454
- Results in 6x better performance than expected.
455
-
456
- ### Potential Upgrade (8 vCPU, 32GB RAM)
457
-
458
- **Estimated Performance** (linear scaling from current):
459
- - ~2,400 concurrent connections (4x current)
460
- - ~450-600 req/s throughput
461
- - Better response times under load
462
- - More consistent p99 latencies
463
-
464
- ### Scaling Factors
465
-
466
- | Resource | Impact on Performance |
467
- |----------|---------------------|
468
- | **CPU Cores** | More concurrent request processing, better I/O scheduling |
469
- | **RAM** | Larger connection pools, more session storage, better caching |
470
- | **Network** | HF Spaces has excellent network infrastructure |
471
- | **Event Loop** | Single async loop scales well with resources |
472
-
473
- ---
474
-
475
- ## Optimization Strategies
476
-
477
- ### 1. Connection Pooling
478
- ```python
479
- # Already implemented in httpx clients
480
- connector = httpx.AsyncHTTPTransport(
481
- limits=httpx.Limits(
482
- max_connections=100,
483
- max_keepalive_connections=50
484
- )
485
- )
486
- ```
487
-
488
- ### 2. Session Management
489
- ```python
490
- # Periodic cleanup of old sessions
491
- async def cleanup_old_sessions():
492
- """Remove sessions older than 1 hour."""
493
- # session_timestamps maps session_id -> creation time (kept alongside session_api_keys)
- cutoff = time.time() - 3600
494
- for session_id in list(session_api_keys.keys()):
495
- if session_timestamps.get(session_id, 0) < cutoff:
496
- del session_api_keys[session_id]
497
- ```
498
-
499
- ### 3. Rate Limiting
500
- ```python
501
- # Add per-client rate limiting
502
- from slowapi import Limiter
- from slowapi.util import get_remote_address
503
- limiter = Limiter(key_func=get_remote_address)
504
-
505
- @app.post("/mcp")
506
- @limiter.limit("100/minute")
507
- async def mcp_endpoint(request: Request):
508
- # Handle request
509
- ```
510
-
511
- ### 4. Response Caching
512
- - Cache frequently accessed data (entity/project lists)
513
- - Use TTL-based caching for tool responses
514
- - Implement ETag support for conditional requests
515
-
516
- ### 5. Monitoring & Metrics
517
- ```python
518
- # Add Prometheus metrics
519
- from prometheus_client import Counter, Histogram, Gauge
520
-
521
- request_count = Counter('mcp_requests_total', 'Total requests', ['method', 'status'])
522
- request_duration = Histogram('mcp_request_duration_seconds', 'Request duration', ['method'])
523
- active_sessions = Gauge('mcp_active_sessions', 'Number of active sessions')
524
- ```
525
-
526
- ---
527
-
528
- ## Deployment Recommendations
529
-
530
- ### By Team Size
531
-
532
- #### Development/Testing (1-10 users)
533
- - ✅ Current HF Space perfect
534
- - Sub-second response times
535
- - No changes needed
536
-
537
- #### Small Teams (10-50 users)
538
- - ✅ Current HF Space excellent
539
- - ~86 req/s throughput
540
- - Response times < 600ms
541
-
542
- #### Medium Organizations (50-200 users)
543
- - ✅ Current HF Space adequate
544
- - 150 req/s peak throughput
545
- - Recommendations:
546
- - Implement request queueing
547
- - Add client-side retries
548
- - Set up monitoring
549
-
550
- #### Large Deployments (200-500 users)
551
- - ⚠️ Current HF Space at limits
552
- - Recommendations:
553
- - Implement load balancer
554
- - Add monitoring/alerting (>400 connections)
555
- - Consider upgrading HF Space tier
556
- - Or deploy multiple instances
557
-
558
- #### Enterprise (500+ users)
559
- - ❌ Exceeds current capacity
560
- - Solutions:
561
- - Deploy on dedicated infrastructure
562
- - Use Kubernetes with HPA
563
- - Implement Redis for session storage
564
- - Multiple server instances with load balancing
565
-
566
- ### Production Checklist
567
-
568
- If deploying for production use:
569
-
570
- 1. **Monitoring Setup**:
571
- ```bash
572
- # Set up alerts for:
573
- - Concurrent connections > 400
574
- - p99 latency > 5s
575
- - Success rate < 95%
576
- - Memory usage > 80%
577
- ```
578
-
579
- 2. **Client Configuration**:
580
- ```python
581
- # Recommended client settings
582
- client = httpx.AsyncClient(
583
- timeout=httpx.Timeout(30.0), # 30 second timeout
584
- limits=httpx.Limits(
585
- max_connections=10, # Per-client connection limit
586
- max_keepalive_connections=5
587
- )
588
- )
589
-
590
- # Implement exponential backoff
591
- async def retry_with_backoff(func, max_retries=3):
592
- for i in range(max_retries):
593
- try:
594
- return await func()
595
- except Exception as e:
596
- if i == max_retries - 1:
597
- raise
598
- await asyncio.sleep(2 ** i) # Exponential backoff
599
- ```
600
-
601
- 3. **Rate Limiting**:
602
- - Limit per-client to 100 requests/minute
603
- - Implement request quotas per API key
604
- - Add circuit breakers for failing clients
605
-
606
- 4. **Documentation**:
607
- - Document the 500 client soft limit
608
- - Provide client configuration examples
609
- - Create runbooks for high load scenarios
610
-
611
- ---
612
-
613
- ## Future Scaling Options
614
-
615
- When the single-worker architecture reaches its limits (500+ concurrent users), here's the scaling progression:
616
-
617
- ### Immediate Options (No Code Changes)
618
-
619
- 1. **Vertical Scaling**:
620
- - Upgrade to 8 vCPU, 32GB RAM HF Space
621
- - Expected: 2,400 concurrent connections, 450-600 req/s
622
- - Cost: ~4x higher but 4-5x performance gain
623
-
624
- 2. **Edge Deployment**:
625
- - Deploy in multiple regions with geo-routing
626
- - Reduce latency for global users
627
- - Each region handles its own sessions
628
-
629
- ### Advanced Options (Code Changes Required)
630
-
631
- #### Option 1: Horizontal Scaling with External Session Store
632
-
633
- Replace in-memory session storage with Redis:
634
-
635
- ```python
636
- # Redis-based session management
637
- from typing import Optional
-
- import redis.asyncio as redis
638
-
639
- class RedisSessionStore:
640
- def __init__(self, redis_url: str):
641
- self.redis = redis.from_url(redis_url)
642
-
643
- async def set_session(self, session_id: str, api_key: str):
644
- await self.redis.setex(f"mcp:session:{session_id}", 3600, api_key)
645
-
646
- async def get_session(self, session_id: str) -> Optional[str]:
647
- return await self.redis.get(f"mcp:session:{session_id}")
648
- ```
649
-
650
- This enables multiple worker processes while maintaining session state.
651
-
652
- #### Option 2: Edge Caching with CDN
653
-
654
- For read-heavy workloads:
655
- - Cache tool responses at CDN edge
656
- - Use cache keys based on (tool, params, api_key_hash)
657
- - TTL based on data freshness requirements
658
-
659
- #### Option 3: Serverless Functions
660
-
661
- For specific tools that don't need session state:
662
- - Deploy stateless tools as AWS Lambda / Cloud Functions
663
- - Route via API Gateway
664
- - Scale to thousands of concurrent executions
665
-
666
- #### Option 4: WebSocket Upgrade
667
-
668
- For real-time applications:
669
- - Upgrade to WebSocket connections
670
- - Maintain persistent connections
671
- - Push updates to clients
672
- - Reduce connection overhead
673
-
674
- #### Option 5: Multi-Region Deployment
675
-
676
- For global distribution:
677
- - Deploy in multiple regions
678
- - Use GeoDNS for routing
679
- - Implement cross-region session sync
680
- - Reduce latency for global users
681
-
682
-
683
-
684
- #### Option 6: Platform-Specific Solutions
685
-
686
- When platforms evolve to better support stateful applications:
687
-
688
- 1. **Kubernetes StatefulSets**:
689
- - When HF Spaces supports Kubernetes
690
- - Maintains pod identity across restarts
691
- - Enables persistent volume claims
692
-
693
- 2. **Durable Objects** (Cloudflare Workers):
694
- - Edge computing with guaranteed session affinity
695
- - Automatic scaling with state persistence
696
- - Global distribution
697
-
698
- ---
699
-
700
- ## Common Questions About the Architecture
701
-
702
- ### Q: Why not use multiple workers like traditional web apps?
703
-
704
- **A**: MCP is a stateful protocol, similar to WebSockets or GraphQL subscriptions. Multiple workers would break session continuity unless you add complex state synchronization (Redis, sticky sessions), which adds latency and complexity without improving performance for our I/O-bound workload.
705
-
706
- ### Q: Is single-worker a bottleneck?
707
-
708
- **A**: No. Our tests show a single async worker handles **600+ concurrent connections** and **150 req/s** on just 2 vCPUs. The bottleneck is network I/O to W&B APIs, not CPU processing. Adding workers wouldn't improve this.
709
-
710
- ### Q: How does this compare to multi-threaded servers?
711
-
712
- **A**: Python's GIL (Global Interpreter Lock) makes true multi-threading inefficient for CPU-bound work. For I/O-bound work (like our API calls), async/await with a single thread is actually more efficient than multi-threading due to lower overhead and no context switching.
713
-
714
- ### Q: What about reliability and fault tolerance?
715
-
716
- **A**:
717
- - **Health checks**: HF Spaces automatically restarts unhealthy containers
718
- - **Graceful shutdown**: Server properly closes connections on restart
719
- - **Session recovery**: Clients can re-authenticate with Bearer token
720
- - **Error handling**: Each request is isolated; one failure doesn't affect others
721
-
722
- ### Q: When would you need to change this architecture?
723
-
724
- **A**: Only when:
725
- 1. CPU-bound processing becomes significant (unlikely for MCP proxy)
726
- 2. You need 1000+ concurrent users (then use Redis for sessions)
727
- 3. Global distribution is required (deploy regional instances)
728
-
729
- ---
730
-
731
- ## Summary
732
-
733
- The W&B MCP Server on Hugging Face Spaces **significantly exceeds expectations**, handling 6x more concurrent connections than initially estimated.
734
-
735
- **Architecture Highlights**:
736
- - 🏗️ **Single-worker async**: The correct choice for stateful protocols
737
- - 🚀 **600 concurrent connections**: Proven capacity with 100% success rate
738
- - ⚡ **150 req/s peak throughput**: Excellent for I/O-bound operations
739
- - 🎯 **Simple and reliable**: No complex state synchronization needed
740
-
741
- **Key Achievements**:
742
- - ✅ **Industry-standard architecture** for stateful protocols
743
- - ✅ **Production-ready** for teams up to 500 users
744
- - ✅ **Clear scaling path** for larger deployments
745
- - ✅ **Cost-effective** on basic HF Space tier
746
-
747
- **Bottom Line by Team Size**:
748
- - ✅ **Development** (1-10 users): Perfect
749
- - ✅ **Small Teams** (10-50 users): Excellent
750
- - ✅ **Medium Teams** (50-200 users): Good
751
- - ⚠️ **Large Teams** (200-500 users): Adequate with monitoring
752
- - ❌ **Enterprise** (500+ users): Needs infrastructure upgrade
753
-
754
- The single-worker async architecture is not a limitation but a **deliberate design choice** that aligns with MCP's requirements and industry best practices for stateful protocols. The deployment on Hugging Face Spaces provides excellent value and surprising performance for small to medium-scale deployments.
SCALABILITY_GUIDE_CONCISE.md DELETED
@@ -1,712 +0,0 @@
- # MCP Server Scalability Guide
-
- ## System Design & Architecture
-
- ### Core Components Overview
-
- The W&B MCP Server is built with a layered architecture optimized for scalability:
-
- #### 1. **FastAPI Application Layer**
- - **Purpose**: HTTP server handling incoming requests
- - **Technology**: FastAPI with Uvicorn/Gunicorn
- - **Key Features**:
-   - Async request handling for non-blocking I/O
-   - Automatic OpenAPI documentation
-   - Middleware pipeline for authentication and logging
-   - Static file serving for the web interface
-
- #### 2. **Authentication Middleware**
- - **Purpose**: Secure, thread-safe API key management
- - **Technology**: Custom middleware using Python's ContextVar
- - **Implementation**:
-   ```python
-   # Per-request API key isolation (no global state)
-   api_key_context: ContextVar[str] = ContextVar('wandb_api_key')
-
-   # Each request gets an isolated context
-   token = api_key_context.set(api_key)
-   ```
- - **Benefits**:
-   - No race conditions between concurrent requests
-   - Thread-safe by design
-   - Zero global state pollution
-
- #### 3. **MCP Protocol Layer**
- - **Purpose**: Model Context Protocol implementation
- - **Technology**: FastMCP framework with streamable HTTP transport
- - **Features**:
-   - Tool registration and dynamic dispatch
-   - Session management for stateful operations
-   - SSE (Server-Sent Events) for response streaming
-   - JSON-RPC 2.0 protocol compliance (example request below)
-
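- For illustration, here is what a single `tools/call` round-trip can look like over the streamable HTTP transport, using JSON-RPC 2.0. The payload shape follows the MCP spec; the URL and tool arguments are placeholders, and the argument names are illustrative:
-
- ```python
- import httpx
-
- # Hypothetical deployment URL - replace with your own MCP endpoint
- MCP_URL = "https://your-space.hf.space/mcp"
-
- payload = {
-     "jsonrpc": "2.0",
-     "id": 1,
-     "method": "tools/call",
-     "params": {
-         "name": "count_weave_traces",  # one of the tools listed below
-         "arguments": {"entity": "my-team", "project": "my-project"},
-     },
- }
-
- resp = httpx.post(
-     MCP_URL,
-     json=payload,
-     headers={
-         "Authorization": "Bearer <your-wandb-api-key>",
-         # Streamable HTTP responses may arrive as JSON or as an SSE stream
-         "Accept": "application/json, text/event-stream",
-     },
- )
- print(resp.status_code, resp.text)
- ```
-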
- #### 4. **Tool Implementation Layer**
- - **Purpose**: W&B/Weave functionality exposure
- - **Components**:
-   - `query_wandb_tool`: GraphQL queries for experiments
-   - `query_weave_traces`: LLM trace analysis
-   - `count_weave_traces`: Efficient analytics
-   - `create_wandb_report`: Report generation
-   - `query_wandb_support_bot`: RAG-powered help
-
- ### Request Flow Architecture
-
- ```
- ┌──────────────┐
- │  MCP Client  │
- └──────┬───────┘
-        │ HTTPS + Bearer Token
-        ▼
- ┌──────────────────────────────────┐
- │ 1. Nginx/Load Balancer (HF)      │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 2. Gunicorn Master Process       │
- │    - Worker management           │
- │    - Request distribution        │
- └──────┬───────────────────────────┘
-        │ Round-robin
-        ▼
- ┌──────────────────────────────────┐
- │ 3. Uvicorn Worker (1 of N)       │
- │    - Async request handling      │
- │    - WebSocket/SSE support       │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 4. FastAPI Application           │
- │    - Route matching              │
- │    - Request validation          │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 5. Authentication Middleware     │
- │    - Bearer token extraction     │
- │    - API key validation          │
- │    - Context variable setup      │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 6. MCP Server (FastMCP)          │
- │    - JSON-RPC parsing            │
- │    - Tool dispatch               │
- │    - Session management          │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 7. Tool Execution                │
- │    - Get API key from context    │
- │    - Create wandb.Api(api_key)   │
- │    - Execute W&B/Weave operations│
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 8. Response Generation           │
- │    - JSON-RPC formatting         │
- │   - SSE streaming (if applicable)│
- │    - Error handling              │
- └──────────────────────────────────┘
- ```
-
- ### Key Design Decisions
-
- #### 1. **No Global State**
- - **Problem**: `wandb.login()` sets global state, causing race conditions
- - **Solution**: Use `wandb.Api(api_key=...)` per request (sketch below)
- - **Benefit**: True request isolation, no cross-contamination
-
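- A minimal sketch of the per-request pattern (the helper name and query are illustrative; `wandb.Api(api_key=...)` is the per-request constructor the server relies on):
-
- ```python
- import wandb
-
- def run_tool(api_key: str, entity: str, project: str) -> list[str]:
-     # A fresh client per request: no wandb.login(), no shared global state
-     api = wandb.Api(api_key=api_key)
-     runs = api.runs(f"{entity}/{project}")
-     return [run.name for run in runs]
- ```
-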
- #### 2. **ContextVar for API Keys**
- - **Problem**: Thread-local storage doesn't work with async
- - **Solution**: Python's ContextVar for async-aware context
- - **Benefit**: Automatic propagation through async call chains
-
- #### 3. **Stateless Architecture**
- - **Problem**: Session state limits scalability
- - **Solution**: Stateless design with session correlation
- - **Benefit**: Horizontal scaling without sticky sessions
-
- #### 4. **Worker Recycling**
- - **Problem**: Long-running processes accumulate memory
- - **Solution**: Gunicorn's `--max-requests` with jitter
- - **Benefit**: Automatic memory leak prevention
-
- ## Current Production Architecture: Single-Worker Async
-
- ### Why Single-Worker?
-
- The MCP protocol requires stateful session management that is incompatible with multi-worker deployments:
- - Session IDs must be maintained across requests
- - Session state cannot be easily shared across worker processes
- - Like WebSocket connections, MCP sessions are inherently stateful
-
- Following the pattern of [GitHub's MCP Server](https://github.com/github/github-mcp-server) and other reference implementations, we use a **single-worker async architecture**.
-
- ### The Architecture: Async Event Loop Concurrency
-
- ```dockerfile
- # Single Uvicorn worker with async event loop
- CMD ["uvicorn", "app:app", \
-      "--workers", "1",              # Single worker for session state
-      "--loop", "uvloop",            # High-performance event loop
-      "--limit-concurrency", "1000"] # Handle 1000+ concurrent connections
- ```
-
- #### How It Handles Concurrent Requests
-
- ```
- ┌─────────────────────────────────────────────┐
- │           Single Uvicorn Process            │
- │                                             │
- │  ┌─────────────────────────────────────┐    │
- │  │      Async Event Loop (uvloop)      │    │
- │  │                                     │    │
- │  │  Request 1 ──┐                      │    │
- │  │  Request 2 ──├── Concurrent         │    │
- │  │  Request 3 ──├── Processing         │    │
- │  │  Request N ──┘   (Non-blocking I/O) │    │
- │  └─────────────────────────────────────┘    │
- │                                             │
- │  ┌─────────────────────────────────────┐    │
- │  │      In-Memory Session Storage      │    │
- │  │    { session_id: api_key, ... }     │    │
- │  └─────────────────────────────────────┘    │
- └─────────────────────────────────────────────┘
- ```
-
- ### Performance Characteristics
-
- Despite being single-worker, the async architecture provides excellent concurrency:
-
- | Metric | Capability | Explanation |
- |--------|-----------|-------------|
- | **Concurrent Requests** | 100-1000+ | Event loop handles I/O concurrently |
- | **Throughput** | 500-2000 req/s | Non-blocking async operations |
- | **Latency** | < 100 ms p50 | Efficient event loop scheduling |
- | **Memory** | ~200-500 MB | Single process, shared memory |
-
- ### The Problems We Solved
-
- - ✅ **Thread-Safe API Keys**: Using ContextVar for proper isolation
- - ✅ **MCP Session Compliance**: Proper session management in a single process
- - ✅ **High Concurrency**: Async event loop handles many concurrent requests
- - ✅ **No Race Conditions**: Request contexts properly isolated
-
- ## Future Scaling Architecture
-
- When single-worker async reaches its limits, here are proven scaling strategies:
-
- ### Option 1: Sticky Sessions with Load Balancer
-
- ```
- ┌──────────────────────────────────┐
- │  Load Balancer (Nginx/HAProxy)   │
- │      with Session Affinity       │
- └────────┬──────────┬──────────────┘
-          │          │
-     ┌────▼───┐  ┌───▼────┐
-     │Worker 1│  │Worker 2│   (Each maintains
-     │Sessions│  │Sessions│    own session state)
-     └────────┘  └────────┘
- ```
-
- **Implementation:**
- ```nginx
- upstream mcp_servers {
-     ip_hash;  # Session affinity based on client IP
-     server worker1:7860;
-     server worker2:7860;
- }
- ```
-
- ### Option 2: Shared Session Storage
-
- ```
- ┌────────────┐     ┌────────────┐
- │  Worker 1  │     │  Worker 2  │
- └─────┬──────┘     └─────┬──────┘
-       │                  │
-       ▼                  ▼
- ┌────────────────────────────┐
- │      Redis/Memcached       │
- │   (Shared Session Store)   │
- └────────────────────────────┘
- ```
-
- **Implementation:**
- ```python
- import redis
-
- redis_client = redis.Redis(host='redis-server')
-
- # Store session
- redis_client.setex(f"session:{session_id}", 3600, api_key)
-
- # Retrieve session
- api_key = redis_client.get(f"session:{session_id}")
- ```
-
- ### Option 3: Kubernetes with StatefulSets
-
- For cloud-native deployments:
- ```yaml
- apiVersion: apps/v1
- kind: StatefulSet
- metadata:
-   name: mcp-server
- spec:
-   serviceName: mcp-service
-   replicas: 3
-   podManagementPolicy: Parallel
-   # Each pod maintains persistent session state
- ```
-
- ### Option 4: Edge Computing with Durable Objects
-
- For global scale using Cloudflare Workers or similar:
- ```javascript
- // Durable Object for session state
- export class MCPSession {
-   constructor(state, env) {
-     this.state = state;
-     this.sessions = new Map();
-   }
-
-   async fetch(request) {
-     // Handle session-specific requests
-   }
- }
- ```
-
- ## Current Deployment Reality on Hugging Face Spaces
-
- Due to platform constraints:
- - ❌ No Redis/Memcached available
- - ❌ No sticky-session load balancer control
- - ❌ No Kubernetes StatefulSets
- - ✅ **Single-worker async is the optimal solution**
-
- This architecture successfully handles hundreds of concurrent users while maintaining MCP protocol compliance.
-
- ```python
- # Core Innovation: Context Variable Isolation
- from contextvars import ContextVar
- from fastapi import Request
-
- # Each request gets its own isolated API key context
- api_key_context: ContextVar[str] = ContextVar('wandb_api_key')
-
- # In middleware (per request)
- async def thread_safe_auth_middleware(request: Request, call_next):
-     api_key = extract_from_bearer_token(request)  # parses the Authorization header
-     token = api_key_context.set(api_key)          # Thread-safe storage
-     try:
-         response = await call_next(request)
-     finally:
-         api_key_context.reset(token)              # Cleanup
-     return response
- ```
-
- #### Multi-Worker Deployment Configuration
-
- ```dockerfile
- # Current production setup in Dockerfile
- CMD ["gunicorn", "app:app", \
-      "--bind", "0.0.0.0:7860", \
-      "--workers", "4", \
-      "--worker-class", "uvicorn.workers.UvicornWorker", \
-      "--timeout", "120", \
-      "--keep-alive", "5", \
-      "--max-requests", "1000", \
-      "--max-requests-jitter", "50"]
- ```
-
- **What each parameter does:**
- - `--workers 4`: 4 parallel processes (scales with CPU cores)
- - `--worker-class uvicorn.workers.UvicornWorker`: Full async/await support
- - `--max-requests 1000`: Auto-restart workers after 1000 requests (prevents memory leaks)
- - `--max-requests-jitter 50`: Randomize restarts to avoid all workers restarting simultaneously
- - `--timeout 120`: Allow long-running operations (e.g., large Weave queries)
-
- #### Request Flow Architecture
-
- ```
- Client Request
-       ↓
- [Gunicorn Master Process (PID 1)]
-       ↓ (Round-robin distribution)
- [Worker Process (1 of 4)]
-       ↓
- [FastAPI App Instance]
-       ↓
- [Thread-Safe Middleware]
-       ↓ (Sets ContextVar)
- [MCP Tool Execution]
-       ↓ (Uses isolated API key)
- [Response Stream]
- ```
-
- ## Comprehensive Testing Results
-
- ### Test Suite Executed
-
- #### 1. **Multi-Worker Distribution Test**
- ```python
- # Test: 50 concurrent health checks
- async def test_concurrent_health_checks(num_requests=50):
-     tasks = [send_health_request(session, i) for i in range(num_requests)]
-     results = await asyncio.gather(*tasks)
- ```
-
- **Results:**
- - ✅ **1,073 requests/second** throughput achieved
- - ✅ Even distribution across workers:
-   - Worker PID 7: 11 requests (22.0%)
-   - Worker PID 8: 13 requests (26.0%)
-   - Worker PID 9: 11 requests (22.0%)
-   - Worker PID 10: 15 requests (30.0%)
-
- #### 2. **API Key Isolation Test**
- ```python
- # Test: 100 concurrent requests from 20 different clients
- # Each client has a unique API key: test_api_key_client_001, etc.
- for client_id in range(20):
-     for request_num in range(5):
-         tasks.append(send_request_with_api_key(f"key_{client_id}"))
- random.shuffle(tasks)  # Simulate random arrival
- results = await asyncio.gather(*tasks)
- ```
-
- **Results:**
- - ✅ **Zero API key cross-contamination**
- - ✅ Each request maintained the correct API key throughout execution
- - ✅ **1,014 requests/second** with authentication enabled
-
- #### 3. **Stress Test**
- ```python
- # Test: Sustained load for 5 seconds at a 50 req/s target
- async def stress_test(duration_seconds=5, target_rps=50):
-     # Send requests continuously for the duration
-     while time.time() < end_time:
-         tasks.append(send_health_request())
-         await asyncio.sleep(1.0 / target_rps)
- ```
-
- **Results:**
- - ✅ **239 total requests processed**
- - ✅ **100% success rate** (0 errors)
- - ✅ Actual RPS: 46.9 (close to the 50 target)
- - ✅ All 4 workers utilized
-
- #### 4. **Authentication Enforcement Test**
- ```python
- # Test: Verify auth is properly enforced
- # 1. Request without token      → should get 401
- # 2. Request with invalid token → should get 401
- # 3. Request with valid token   → should succeed
- ```
-
- **Results:**
- - ✅ Correctly rejected unauthenticated requests (401)
- - ✅ Invalid API keys properly rejected
- - ✅ Valid tokens processed successfully
-
- ### Performance Comparison
-
- | Metric | Original | Current Production | Improvement |
- |--------|----------|-------------------|-------------|
- | **Concurrent Users** | 10-20 | 50-100 | **5x** |
- | **Peak Throughput** | ~50 req/s | 1,073 req/s | **21x** |
- | **Sustained Load** | ~20 req/s | 47 req/s | **2.3x** |
- | **API Key Safety** | ❌ Race condition | ✅ Thread-safe | **Fixed** |
- | **Worker Processes** | 1 | 4 | **4x** |
- | **Memory Management** | Unbounded | Auto-recycled | **Stable** |
-
- ## Quick Deployment (Already in Production)
-
- The concurrent version is already deployed. To update or redeploy:
-
- ```bash
- # The current app.py already includes all concurrent improvements
- git add .
- git commit -m "Update MCP server"
- git push  # Deploys to HF Spaces
-
- # To add more workers (if HF Spaces resources allow)
- echo "ENV WEB_CONCURRENCY=8" >> Dockerfile
- ```
-
- ---
-
- ## Large-Scale Deployment (100s-1000s of Agents)
-
- ### Architecture Overview
-
- ```
-                [Load Balancer]
-                       |
-         +-------------+-------------+
-         |             |             |
-     [Region 1]    [Region 2]    [Region 3]
-         |             |             |
-   +-----+-----+    +--+--+    +-----+-----+
-   |     |     |    |     |    |     |     |
- [Pod1][Pod2][Pod3][Pod4][Pod5][Pod6][Pod7]
-   |     |     |    |     |    |     |     |
-  [Redis Cache]   [Redis Cache]  [Redis Cache]
- ```
-
- ### Implementation Tiers
-
- #### Tier 1: Enhanced HF Spaces (50-200 agents)
- ```dockerfile
- # Just use more workers
- ENV WEB_CONCURRENCY=8
- ```
-
- #### Tier 2: Kubernetes Deployment (200-1000 agents)
-
- ```yaml
- # k8s-deployment.yaml
- apiVersion: apps/v1
- kind: Deployment
- metadata:
-   name: wandb-mcp-server
- spec:
-   replicas: 10
-   template:
-     spec:
-       containers:
-         - name: mcp-server
-           image: wandb-mcp:latest
-           resources:
-             requests:
-               cpu: "2"
-               memory: "4Gi"
-             limits:
-               cpu: "4"
-               memory: "8Gi"
-           env:
-             - name: WEB_CONCURRENCY
-               value: "8"
- ---
- apiVersion: v1
- kind: Service
- metadata:
-   name: wandb-mcp-service
- spec:
-   type: LoadBalancer
-   ports:
-     - port: 80
-       targetPort: 7860
- ```
-
- #### Tier 3: Cloud-Native Architecture (1000+ agents)
-
- **Components:**
- 1. **API Gateway** (AWS API Gateway / Kong)
-    - Rate limiting per client (sketch below)
-    - Request routing
-    - Authentication
-
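-    As a sketch of per-client rate limiting (a simple in-process token bucket; keying by API key and the numeric limits are illustrative, not the gateway's actual configuration):
-    ```python
-    import time
-    from collections import defaultdict
-
-    class TokenBucket:
-        def __init__(self, rate: float = 10.0, burst: int = 20):
-            self.rate, self.burst = rate, burst  # tokens/sec, max bucket size
-            self.tokens = defaultdict(lambda: float(burst))
-            self.last = defaultdict(time.monotonic)
-
-        def allow(self, client_key: str) -> bool:
-            now = time.monotonic()
-            elapsed = now - self.last[client_key]
-            self.last[client_key] = now
-            # Refill proportionally to elapsed time, capped at the burst size
-            self.tokens[client_key] = min(
-                self.burst, self.tokens[client_key] + elapsed * self.rate
-            )
-            if self.tokens[client_key] >= 1:
-                self.tokens[client_key] -= 1
-                return True
-            return False  # Caller should respond with HTTP 429
-    ```
-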
- 2. **Container Orchestration** (ECS/EKS/GKE)
-    ```bash
-    # AWS ECS example
-    aws ecs create-service \
-        --cluster mcp-cluster \
-        --service-name wandb-mcp \
-        --task-definition wandb-mcp:1 \
-        --desired-count 20 \
-        --launch-type FARGATE
-    ```
-
- 3. **Caching Layer** (Redis Cluster)
-    ```python
-    # In app_concurrent.py
-    import redis
-
-    redis_client = redis.RedisCluster(
-        startup_nodes=[{"host": "cache.aws.com", "port": "6379"}]
-    )
-
-    @lru_cache_redis(ttl=300)
-    async def cached_query(key, query_func, *args):
-        cached = redis_client.get(key)
-        if cached:
-            return json.loads(cached)
-        result = await query_func(*args)
-        redis_client.setex(key, 300, json.dumps(result))
-        return result
-    ```
-
- 4. **Queue System** (SQS/RabbitMQ for async processing)
-    ```python
-    # For heavy operations
-    from celery import Celery
-
-    celery_app = Celery('wandb_mcp', broker='redis://localhost:6379')
-
-    @celery_app.task
-    def process_large_report(params):
-        return create_report(**params)
-    ```
-
- 5. **Monitoring Stack**
-    - **Prometheus** + **Grafana**: Metrics
-    - **ELK Stack**: Logs
-    - **Jaeger**: Distributed tracing
-
- ### Quick Deployment Commands
-
- #### Docker Swarm (Medium Scale)
- ```bash
- docker swarm init
- docker service create \
-     --name wandb-mcp \
-     --replicas 10 \
-     --publish published=80,target=7860 \
-     wandb-mcp:concurrent
- ```
-
- #### Kubernetes with Helm (Large Scale)
- ```bash
- helm create wandb-mcp-chart
- helm install wandb-mcp ./wandb-mcp-chart \
-     --set replicaCount=20 \
-     --set image.repository=wandb-mcp \
-     --set image.tag=concurrent \
-     --set autoscaling.enabled=true \
-     --set autoscaling.minReplicas=10 \
-     --set autoscaling.maxReplicas=50
- ```
-
- #### AWS CDK (Enterprise)
- ```python
- # cdk_stack.py
- from aws_cdk import (
-     aws_ecs as ecs,
-     aws_ecs_patterns as patterns,
-     Stack
- )
-
- class WandBMCPStack(Stack):
-     def __init__(self, scope, id):
-         super().__init__(scope, id)
-
-         patterns.ApplicationLoadBalancedFargateService(
-             self, "WandBMCP",
-             task_image_options=patterns.ApplicationLoadBalancedTaskImageOptions(
-                 image=ecs.ContainerImage.from_registry("wandb-mcp:concurrent"),
-                 container_port=7860,
-                 environment={
-                     "WEB_CONCURRENCY": "8"
-                 }
-             ),
-             desired_count=20,
-             cpu=2048,
-             memory_limit_mib=4096
-         )
- ```
-
- ### Performance Optimization Checklist
-
- - [ ] **Connection Pooling**: Reuse W&B API connections
- - [ ] **Caching**: Redis for frequent queries
- - [ ] **CDN**: Static assets via Cloudflare
- - [ ] **Database**: Read replicas for analytics
- - [ ] **Async Everything**: No blocking operations
- - [ ] **Rate Limiting**: Per-user and global limits
- - [ ] **Circuit Breakers**: Prevent cascade failures (see the sketch below)
- - [ ] **Health Checks**: Automatic bad-instance removal
-
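- As one example from the checklist, a minimal circuit-breaker sketch (the thresholds and the wrapped call are assumptions, not the server's implementation):
-
- ```python
- import time
-
- class CircuitBreaker:
-     """Fail fast after repeated errors; retry after a cool-down period."""
-
-     def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
-         self.max_failures, self.reset_after = max_failures, reset_after
-         self.failures, self.opened_at = 0, None
-
-     async def call(self, coro_fn, *args, **kwargs):
-         # While open, reject immediately instead of hammering a failing backend
-         if self.opened_at and time.monotonic() - self.opened_at < self.reset_after:
-             raise RuntimeError("circuit open - failing fast")
-         try:
-             result = await coro_fn(*args, **kwargs)
-         except Exception:
-             self.failures += 1
-             if self.failures >= self.max_failures:
-                 self.opened_at = time.monotonic()
-             raise
-         self.failures, self.opened_at = 0, None  # Success closes the circuit
-         return result
- ```
-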
- ### Cost Optimization
-
- | Scale | Architecture | Est. Monthly Cost |
- |-------|-------------|------------------|
- | 50-100 agents | HF Spaces Pro | $9-49 |
- | 100-500 agents | 5x ECS Fargate | $200-500 |
- | 500-1000 agents | 20x EKS nodes | $800-1500 |
- | 1000+ agents | Multi-region K8s | $2000+ |
-
- ### Monitoring Metrics
-
- ```python
- # Key metrics to track
- METRICS = {
-     "request_rate": "promhttp_metric_handler_requests_total",
-     "response_time_p99": "http_request_duration_seconds{quantile='0.99'}",
-     "error_rate": "rate(http_requests_total{status=~'5..'}[5m])",
-     "api_key_cache_hit": "redis_cache_hits_total / redis_cache_requests_total",
-     "worker_saturation": "gunicorn_workers_busy / gunicorn_workers_total"
- }
- ```
-
- ### Emergency Scaling Playbook
-
- ```bash
- # Quick scale during a traffic spike
- kubectl scale deployment wandb-mcp --replicas=50
-
- # Add more nodes
- eksctl scale nodegroup --cluster=mcp-cluster --nodes=20
-
- # Enable autoscaling
- kubectl autoscale deployment wandb-mcp --min=10 --max=100 --cpu-percent=70
- ```
-
- ---
-
- ## Migration Path
-
- ### Step 1: Fix Current Issues (Day 1)
- Deploy `app_concurrent.py` to fix the API key race condition
-
- ### Step 2: Monitor & Optimize (Week 1)
- - Add metrics collection
- - Identify bottlenecks
- - Tune worker counts
-
- ### Step 3: Scale Horizontally (Month 1)
- - Deploy to Kubernetes
- - Add Redis caching
- - Implement rate limiting
-
- ### Step 4: Enterprise Features (Quarter 1)
- - Multi-region deployment
- - Advanced monitoring
- - SLA guarantees
-
- ---
-
- ## TL;DR for PR Description
-
- ````markdown
- ## Scalability Improvements
-
- This PR enables the MCP server to handle 100+ concurrent agents safely:
-
- ### Changes
- - ✅ Thread-safe API key handling using ContextVar
- - ✅ Multi-worker Gunicorn deployment (4x throughput)
- - ✅ Async execution for all tools
- - ✅ Worker recycling to prevent memory leaks
-
- ### Performance
- - Before: 10-20 concurrent users, 50 req/s
- - After: 50-100 concurrent users, 200 req/s
- - API keys now fully isolated (fixes security issue)
-
- ### Deployment
- ```bash
- # Simple upgrade - just use the new files
- cp app_concurrent.py app.py
- cp Dockerfile.concurrent Dockerfile
- ```
-
- ### Future Scale
- For 1000+ agents, see SCALABILITY_GUIDE_CONCISE.md for Kubernetes/cloud deployment options.
- ````
app.py CHANGED
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thread-safe HuggingFace Spaces entry point for the Weights & Biases MCP Server.
+Thread-safe entry point for the Weights & Biases MCP Server.
 """
 
 import os
@@ -103,28 +103,27 @@ if api_key:
 else:
     logger.info("No server W&B API key configured - clients will provide their own")
 
-# Create the MCP server
-# NOT using stateless mode - we'll handle session sharing across workers
-logger.info("Creating W&B MCP server...")
-mcp = FastMCP("wandb-mcp-server")
+# Create the MCP server in stateless mode
+# All clients (OpenAI, Cursor, etc.) must provide Bearer token with each request
+# Session IDs are used only as correlation IDs, no state is persisted
+logger.info("Creating W&B MCP server in stateless HTTP mode...")
+mcp = FastMCP("wandb-mcp-server", stateless_http=True)
 
 # Register all W&B tools
 # The tools will use WandBApiManager.get_api_key() to get the current request's API key
 register_tools(mcp)
 
-# Session storage for API keys (maps MCP session ID to W&B API key)
-# This works in single-worker mode where all sessions are in the same process
-session_api_keys = {}
-
 # Custom authentication middleware
 async def thread_safe_auth_middleware(request: Request, call_next):
     """
-    Thread-safe authentication middleware for MCP endpoints.
+    Stateless authentication middleware for MCP endpoints.
+
+    Pure stateless operation - every request must include authentication:
+    - Session IDs are only used as correlation IDs
+    - No session state is stored between requests
+    - Each request must include Bearer token authentication
 
-    Handles MCP session management with proper API key association:
-    1. Initial request with Bearer token → store API key with session ID
-    2. Subsequent requests with session ID → retrieve stored API key
-    3. All requests get proper W&B authentication via context
+    This works with all clients (OpenAI, Cursor, etc.) that support MCP.
    """
     # Only apply auth to MCP endpoints
     if not request.url.path.startswith("/mcp"):
@@ -146,40 +145,31 @@ async def thread_safe_auth_middleware(request: Request, call_next):
     try:
         api_key = None
 
-        # Check if request has MCP session ID (for established sessions)
+        # Check if request has MCP session ID (correlation ID only in stateless mode)
         session_id = request.headers.get("Mcp-Session-Id") or request.headers.get("mcp-session-id")
         if session_id:
-            logger.info(f"Request has MCP Session ID: {session_id[:8]}...")
-            if session_id in session_api_keys:
-                # Use stored API key for this session
-                api_key = session_api_keys[session_id]
-                logger.info(f"Session found in storage, using stored API key")
-            else:
-                logger.warning(f"Session ID {session_id[:8]}... NOT found in storage!")
-                logger.info(f"  Active sessions ({len(session_api_keys)}): {[sid[:8] for sid in session_api_keys.keys()]}")
-                # Don't fail here - the request might have its own Bearer token
+            logger.debug(f"Request has correlation ID: {session_id[:8]}...")
 
         # Check for Bearer token (for new sessions or explicit auth)
         authorization = request.headers.get("Authorization", "")
         if authorization.startswith("Bearer "):
-            # Override with Bearer token if provided
-            api_key = authorization[7:].strip()
+            bearer_token = authorization[7:].strip()
 
             # Basic validation
-            if len(api_key) < 20 or len(api_key) > 100:
+            if len(bearer_token) < 20 or len(bearer_token) > 100:
                 return JSONResponse(
                     status_code=401,
                     content={"error": f"Invalid W&B API key format. Get your key at: https://wandb.ai/authorize"},
                     headers={"WWW-Authenticate": 'Bearer realm="W&B MCP", error="invalid_token"'}
                 )
+
+            # Use Bearer token
+            api_key = bearer_token
+            logger.info(f"Using Bearer token for authentication")
 
-        # Handle session cleanup
+        # Handle session cleanup (stateless mode - just acknowledge and pass through)
         if request.method == "DELETE" and session_id:
-            if session_id in session_api_keys:
-                del session_api_keys[session_id]
-                logger.info(f"Session cleanup: Deleted session {session_id[:8]}... (Remaining sessions: {len(session_api_keys)})")
-            else:
-                logger.warning(f"Session cleanup: Attempted to delete non-existent session {session_id[:8]}...")
+            logger.debug(f"Session cleanup: DELETE for {session_id[:8]}... (stateless - no action needed)")
             return await call_next(request)
 
         if api_key:
@@ -193,32 +183,20 @@ async def thread_safe_auth_middleware(request: Request, call_next):
             # Process the request
             response = await call_next(request)
 
-            # If MCP returns a session ID, store our API key for future requests
+            # In stateless mode, we don't store any session state
             response_session_id = response.headers.get("Mcp-Session-Id") or response.headers.get("mcp-session-id")
             if response_session_id:
-                if api_key:
-                    # Check if this session already exists
-                    if response_session_id in session_api_keys:
-                        logger.debug(f"Session {response_session_id[:8]}... already exists, updating API key")
-                    else:
-                        logger.info(f"New MCP session created: {response_session_id[:8]}...")
-                    session_api_keys[response_session_id] = api_key
-                    logger.info(f"Session storage updated. Total sessions: {len(session_api_keys)}")
-                    logger.debug(f"  Active session IDs: {[sid[:8] for sid in session_api_keys.keys()]}")
-                else:
-                    logger.warning(f"Session created but no API key to store: {response_session_id[:8]}...")
+                logger.debug(f"Response includes correlation ID: {response_session_id[:8]}...")
 
             return response
         finally:
             # Reset context variable
             WandBApiManager.reset_context_api_key(token)
     else:
-        # No API key available, let request through for MCP to handle
-        logger.warning(f"No API key available for request to {request.url.path}")
-        logger.info(f"  Session ID present: {bool(session_id)} ({session_id[:8] if session_id else 'None'}...)")
-        logger.info(f"  Bearer token present: {bool(authorization.startswith('Bearer '))}")
-        logger.info(f"  Request method: {request.method}")
-        logger.info("  Allowing request through for MCP to handle (may result in 401/404)")
+        # No API key available - in stateless mode, this is expected to fail
+        logger.warning(f"No Bearer token provided for {request.url.path}")
+        logger.debug(f"  Request method: {request.method}")
+        logger.debug("  Passing to MCP (will likely return 401)")
         return await call_next(request)
 
     except Exception as e:
@@ -437,6 +415,7 @@ if __name__ == "__main__":
     logger.info("Health check: /health")
     logger.info("MCP endpoint: /mcp")
 
-    # Run with single async worker for MCP session compatibility
-    logger.info("Starting server with single async worker (MCP requires stateful sessions)")
-    uvicorn.run(app, host="0.0.0.0", port=PORT)
+    # In stateless mode, we can scale horizontally with multiple workers
+    # However, for HuggingFace Spaces we use single worker for simplicity
+    logger.info("Starting server (stateless mode - supports horizontal scaling)")
+    uvicorn.run(app, host="0.0.0.0", port=PORT, workers=1)  # Can increase workers if needed