NiWaRe committed

Commit 40e1a91 · 1 Parent(s): 0783971

refactor for stateless: turn stateless on for FastMCP to work with OpenAI client etc

ARCHITECTURE.md ADDED
@@ -0,0 +1,263 @@
1
+ # W&B MCP Server - Architecture & Scalability Guide
2
+
3
+ ## Table of Contents
4
+ 1. [Architecture Decision](#architecture-decision)
5
+ 2. [Stateless HTTP Design](#stateless-http-design)
6
+ 3. [Performance & Scalability](#performance--scalability)
7
+ 4. [Load Test Results](#load-test-results)
8
+ 5. [Deployment Recommendations](#deployment-recommendations)
9
+
10
+ ---
11
+
12
+ ## Architecture Decision
13
+
14
+ ### Decision: Pure Stateless HTTP Mode
15
+
16
+ **The W&B MCP Server uses pure stateless HTTP mode (`stateless_http=True`).**
17
+
18
+ This fundamental architecture decision enables:
19
+ - ✅ **Universal client compatibility** (OpenAI, Cursor, LeChat, Claude)
20
+ - ✅ **Horizontal scaling** capabilities
21
+ - ✅ **Simpler operations** and maintenance
22
+ - ✅ **Cloud-native** deployment patterns
23
+
24
+ ### Why Stateless?
25
+
26
+ The Model Context Protocol traditionally used stateful sessions, but this created issues:
27
+
28
+ | Client | Behavior | Problem with Stateful |
29
+ |--------|----------|----------------------|
30
+ | **OpenAI** | Deletes session after listing tools, then reuses ID | Session not found errors |
31
+ | **Cursor** | Sends Bearer token with every request | Expects stateless behavior |
32
+ | **Claude** | Can work with either model | No issues |
33
+
34
+ ### The Solution
35
+
36
+ ```python
37
+ # Pure stateless operation - no session persistence
38
+ mcp = FastMCP("wandb-mcp-server", stateless_http=True)
39
+ ```
40
+
41
+ With this approach:
42
+ - **Session IDs are correlation IDs only** - they match requests to responses
43
+ - **No state persists between requests** - each request is independent
44
+ - **Authentication required per request** - Bearer token must be included
45
+ - **Any worker can handle any request** - enables horizontal scaling
46
+
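+ For the HTTP deployment, this stateless server is mounted into FastAPI via its streamable HTTP app. A minimal sketch of the wiring (import path per the MCP Python SDK; the mounting follows the repo's HF Spaces setup, where `streamable_http_app()` exposes its own `/mcp` route and is therefore mounted at root):
+
+ ```python
+ from fastapi import FastAPI
+ from mcp.server.fastmcp import FastMCP
+
+ mcp = FastMCP("wandb-mcp-server", stateless_http=True)
+
+ app = FastAPI()
+ # streamable_http_app() already routes /mcp internally, so mount at root
+ app.mount("/", mcp.streamable_http_app())
+ ```
+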
47
+ ---
48
+
49
+ ## Stateless HTTP Design
50
+
51
+ ### Architecture Overview
52
+
53
+ ```
54
+ ┌─────────────────────────────────────┐
55
+ │ MCP Clients (OpenAI/Cursor/etc) │
56
+ │ Bearer Token with Each Request │
57
+ └─────────────┬───────────────────────┘
58
+ │ HTTPS
59
+ ┌─────────────▼───────────────────────┐
60
+ │ Load Balancer (Optional) │
61
+ │ Round-Robin Distribution │
62
+ └──┬──────────┬──────────┬────────────┘
63
+ │ │ │
64
+ ┌──▼───┐ ┌──▼───┐ ┌──▼───┐
65
+ │ W1 │ │ W2 │ │ W3 │ (Multiple Workers Possible)
66
+ │ │ │ │ │ │
67
+ │ ASGI │ │ ASGI │ │ ASGI │ Uvicorn/Gunicorn
68
+ └──┬───┘ └──┬───┘ └──┬───┘
69
+ │ │ │
70
+ ┌──▼──────────▼──────────▼────────────┐
71
+ │ FastAPI Application │
72
+ │ ┌────────────────────────────┐ │
73
+ │ │ Stateless Auth Middleware │ │
74
+ │ │ (Bearer Token Validation) │ │
75
+ │ └────────────────────────────┘ │
76
+ │ ┌────────────────────────────┐ │
77
+ │ │ MCP Stateless Handler │ │
78
+ │ │ (No Session Storage) │ │
79
+ │ └────────────────────────────┘ │
80
+ └─────────────┬───────────────────────┘
81
+
82
+ ┌─────────────▼───────────────────────┐
83
+ │ W&B API Integration │
84
+ └─────────────────────────────────────┘
85
+ ```
86
+
87
+ ### Request Flow
88
+
89
+ 1. **Client sends request** with Bearer token and session ID
90
+ 2. **Middleware validates** Bearer token
91
+ 3. **MCP processes** request (session ID used for correlation only)
92
+ 4. **Response sent** with matching session ID
93
+ 5. **No state persisted** - request complete
94
+
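+ This flow can be exercised end-to-end with any plain HTTP client. A minimal sketch using `httpx` (the initialize payload mirrors the one used in the load tests; the API key is a placeholder):
+
+ ```python
+ import httpx
+
+ # Each request is self-contained: Bearer token plus both accepted content types
+ response = httpx.post(
+     "https://mcp.withwandb.com/mcp",
+     headers={
+         "Authorization": "Bearer YOUR_WANDB_API_KEY",
+         "Content-Type": "application/json",
+         "Accept": "application/json, text/event-stream",
+     },
+     json={
+         "jsonrpc": "2.0",
+         "method": "initialize",
+         "params": {
+             "protocolVersion": "2025-06-18",
+             "capabilities": {},
+             "clientInfo": {"name": "example", "version": "1.0"},
+         },
+         "id": 1,
+     },
+     timeout=30,
+ )
+ # The returned session ID is a correlation ID only - nothing is stored server-side
+ print(response.headers.get("Mcp-Session-Id"), response.status_code)
+ ```
+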
95
+ ### Key Implementation Details
96
+
97
+ ```python
98
+ async def thread_safe_auth_middleware(request: Request, call_next):
99
+ """Stateless authentication middleware."""
100
+
101
+ # Session IDs are correlation IDs only
102
+ session_id = request.headers.get("Mcp-Session-Id")
103
+ if session_id:
104
+ logger.debug(f"Correlation ID: {session_id[:8]}...")
105
+
106
+ # Every request must have Bearer token
107
+ authorization = request.headers.get("Authorization", "")
108
+ if authorization.startswith("Bearer "):
109
+ api_key = authorization[7:].strip()
110
+ # Use the API key for this request only -
111
+ # no session storage or retrieval
+
+ # Forward to the MCP handler; nothing is persisted afterwards
+ return await call_next(request)
112
+ ```
113
+
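+ Downstream tools need the validated key without any global state. The server uses Python's `ContextVar` for this per-request isolation; a sketch of the pattern (names follow the repository's earlier guides):
+
+ ```python
+ from contextvars import ContextVar
+
+ # Each concurrent request sees only its own value
+ api_key_context: ContextVar[str] = ContextVar("wandb_api_key")
+
+ # Inside the middleware, once the Bearer token is validated:
+ #   token = api_key_context.set(api_key)
+ #   try:
+ #       return await call_next(request)
+ #   finally:
+ #       api_key_context.reset(token)  # nothing leaks across requests
+ ```
+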
114
+ ---
115
+
116
+ ## Performance & Scalability
117
+
118
+ ### Single Worker Performance
119
+
120
+ Based on testing with stateless mode:
121
+
122
+ | Metric | Local Server | Remote (HF Spaces) |
123
+ |--------|--------------|-------------------|
124
+ | **Max Concurrent** | 1000 clients | 500+ clients |
125
+ | **Throughput** | ~50-60 req/s | ~35 req/s |
126
+ | **Latency (p50)** | <500ms | <2s |
127
+ | **Memory Usage** | 200-500MB | 300-600MB |
128
+
129
+ ### Horizontal Scaling Potential
130
+
131
+ With stateless mode, the server supports true horizontal scaling:
132
+
133
+ | Workers | Max Concurrent | Total Throughput | Notes |
134
+ |---------|----------------|------------------|-------|
135
+ | 1 | 1000 | ~50 req/s | Current deployment |
136
+ | 2 | 2000 | ~100 req/s | Linear scaling |
137
+ | 4 | 4000 | ~200 req/s | Near-linear |
138
+ | 8 | 8000 | ~400 req/s | Some overhead |
139
+
140
+ **Key Advantage**: No session affinity required - any worker can handle any request!
141
+
142
+ ---
143
+
144
+ ## Load Test Results
145
+
146
+ ### Latest Test Results (2025-09-25)
147
+
148
+ #### Local Server (macOS, Single Worker)
149
+
150
+ | Concurrent Clients | Success Rate | Throughput | Mean Response |
151
+ |--------------------|-------------|------------|---------------|
152
+ | 10 | 100% | 47 req/s | 89ms |
153
+ | 100 | 100% | 47 req/s | 1.2s |
154
+ | 500 | 100% | 56 req/s | 4.4s |
155
+ | **1000** | **100%** | **48 req/s** | **9.3s** |
156
+ | 1500 | 80% | 51 req/s | 15.4s |
157
+ | 2000 | 70% | 53 req/s | 20.8s |
158
+
159
+ **Breaking Point**: ~1500 concurrent connections
160
+
161
+ #### Remote Server (mcp.withwandb.com)
162
+
163
+ | Concurrent Clients | Success Rate | Throughput | Mean Response |
164
+ |--------------------|-------------|------------|---------------|
165
+ | 10 | 100% | 10 req/s | 0.8s |
166
+ | 50 | 100% | 29 req/s | 1.2s |
167
+ | 100 | 100% | 33 req/s | 1.9s |
168
+ | 200 | 100% | 34 req/s | 3.3s |
169
+ | **500** | **100%** | **35 req/s** | **7.5s** |
170
+
171
+ **Key Finding**: Remote server handles 500+ concurrent connections reliably!
172
+
173
+ ### Performance Sweet Spots
174
+
175
+ 1. **Low Latency** (<1s response): Use ≤50 concurrent connections
176
+ 2. **Balanced** (good throughput & latency): Use 100-200 concurrent connections
177
+ 3. **Maximum Throughput**: Use 200-300 concurrent connections
178
+ 4. **Maximum Capacity**: Up to 500 concurrent (remote) or 1000 (local)
179
+
180
+ ---
181
+
182
+ ## Deployment Recommendations
183
+
184
+ ### Current Deployment (Hugging Face Spaces)
185
+
186
+ ```yaml
187
+ Configuration:
188
+ - Single worker (can be increased)
189
+ - Stateless HTTP mode
190
+ - 2 vCPU, 16GB RAM
191
+ - Port 7860
192
+
193
+ Performance:
194
+ - 500+ concurrent connections
195
+ - ~35 req/s throughput
196
+ - 100% reliability up to 500 concurrent
197
+ ```
198
+
199
+ ### Scaling Options
200
+
201
+ #### Option 1: Vertical Scaling
202
+ - Increase CPU/RAM on Hugging Face Spaces
203
+ - Can improve single-worker throughput
204
+
205
+ #### Option 2: Horizontal Scaling (Recommended)
206
+ ```python
207
+ # app.py - Enable multiple workers
208
+ uvicorn.run("app:app", host="0.0.0.0", port=PORT, workers=4)  # workers > 1 requires an import string, not an app object
209
+ ```
210
+
211
+ #### Option 3: Multi-Region Deployment
212
+ - Deploy to multiple regions
213
+ - Use global load balancer
214
+ - Reduce latency for users worldwide
215
+
216
+ ### Production Checklist
217
+
218
+ ✅ **Stateless mode enabled** (`stateless_http=True`)
219
+ ✅ **Bearer authentication** on every request
220
+ ✅ **Health check endpoint** (`/health`)
221
+ ✅ **Monitoring** for response times and errors
222
+ ✅ **Rate limiting** (recommended: 100 req/s per client)
223
+ ✅ **Connection limits** (recommended: 500 concurrent)
224
+
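+ The rate-limit and connection-limit recommendations can be enforced in the app itself. A minimal single-process sketch of the per-client limit (the 100 req/s figure is the checklist recommendation; keying on the Bearer token and the sliding window are illustrative choices):
+
+ ```python
+ import time
+ from collections import defaultdict, deque
+
+ from fastapi import FastAPI, Request
+ from fastapi.responses import JSONResponse
+
+ app = FastAPI()
+ MAX_REQ_PER_SEC = 100  # recommended per-client limit
+ _windows: dict[str, deque] = defaultdict(deque)
+
+ @app.middleware("http")
+ async def rate_limit(request: Request, call_next):
+     # Key on the Bearer token so the limit applies per client, not per IP
+     key = request.headers.get("Authorization", "anonymous")
+     window = _windows[key]
+     now = time.time()
+     while window and now - window[0] > 1.0:  # 1-second sliding window
+         window.popleft()
+     if len(window) >= MAX_REQ_PER_SEC:
+         return JSONResponse({"error": "rate limit exceeded"}, status_code=429)
+     window.append(now)
+     return await call_next(request)
+ ```
+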
225
+ ### Configuration Example
226
+
227
+ ```python
228
+ # Production configuration
229
+ mcp = FastMCP("wandb-mcp-server", stateless_http=True)
230
+
231
+ # Uvicorn with multiple workers (if needed)
232
+ if __name__ == "__main__":
233
+ uvicorn.run(
234
+ app,
235
+ host="0.0.0.0",
236
+ port=7860,
237
+ workers=1,  # increase for horizontal scaling (pass the "app:app" import string when > 1)
238
+ limit_concurrency=1000, # Connection limit
239
+ timeout_keep_alive=30, # Keepalive timeout
240
+ )
241
+ ```
242
+
243
+ ### Security Considerations
244
+
245
+ 1. **API Key Validation**: Every request validates Bearer token
246
+ 2. **No Session Storage**: No risk of session hijacking
247
+ 3. **Rate Limiting**: Protect against abuse
248
+ 4. **HTTPS Only**: Always use TLS in production
249
+ 5. **Token Rotation**: Encourage regular API key rotation
250
+
251
+ ---
252
+
253
+ ## Summary
254
+
255
+ The W&B MCP Server's stateless architecture provides:
256
+
257
+ - **Universal Compatibility**: Works with all MCP clients
258
+ - **Excellent Performance**: 500+ concurrent connections, ~35 req/s
259
+ - **Horizontal Scalability**: Add workers to increase capacity
260
+ - **Simple Operations**: No session management complexity
261
+ - **Production Ready**: Deployed and tested at scale
262
+
263
+ The stateless design is not a compromise - it's the optimal architecture for MCP servers in production environments.
ARCHITECTURE_DECISION.md DELETED
@@ -1,75 +0,0 @@
1
- # Architecture Decision: Single-Worker Async
2
-
3
- ## Decision
4
-
5
- Use **single-worker async architecture** with Uvicorn and uvloop for the W&B MCP Server deployment.
6
-
7
- ## Context
8
-
9
- MCP (Model Context Protocol) requires stateful session management where:
10
- - Server creates session IDs on initialization
11
- - Clients must include session ID in subsequent requests
12
- - Session state must be maintained across the conversation
13
-
14
- ## Considered Options
15
-
16
- ### 1. Multi-Worker with Gunicorn (Rejected)
17
- - ❌ Session state not shared across workers
18
- - ❌ Requires Redis/Memcached (not available on HF Spaces)
19
- - ❌ Breaks MCP protocol compliance
20
-
21
- ### 2. Multi-Worker with Sticky Sessions (Rejected)
22
- - ❌ No load balancer control on HF Spaces
23
- - ❌ Complex configuration
24
- - ❌ Still doesn't guarantee session persistence
25
-
26
- ### 3. Single-Worker Async (Chosen) ✅
27
- - ✅ Full MCP protocol compliance
28
- - ✅ Handles 100-1000+ concurrent requests
29
- - ✅ Simple, reliable architecture
30
- - ✅ Used by GitHub MCP Server and other references
31
-
32
- ## Implementation
33
-
34
- ```dockerfile
35
- CMD ["uvicorn", "app:app", \
36
-      "--workers", "1", \
37
-      "--loop", "uvloop", \
38
-      "--limit-concurrency", "1000"]
39
- ```
40
-
41
- ## Performance
42
-
43
- Despite single-worker limitation:
44
- - **Concurrent Handling**: Async event loop processes I/O concurrently
45
- - **Non-blocking**: Database queries, API calls don't block other requests
46
- - **Throughput**: 500-2000 requests/second
47
- - **Memory Efficient**: ~200-500MB for hundreds of concurrent sessions
48
-
49
- ## Comparison with Industry Standards
50
-
51
- | Server | Architecture | Reasoning |
52
- |--------|------------|-----------|
53
- | GitHub MCP Server | Single process (Go) | Stateful sessions |
54
- | WebSocket servers | Single worker + async | Connection state |
55
- | GraphQL subscriptions | Single worker + async | Subscription state |
56
- | **W&B MCP Server** | **Single worker + async** | **MCP session state** |
57
-
58
- ## Future Scaling Path
59
-
60
- If we outgrow single-worker capacity:
61
-
62
- 1. **Vertical Scaling**: Increase CPU/memory (immediate)
63
- 2. **Edge Deployment**: Multiple regions with geo-routing
64
- 3. **Kubernetes StatefulSets**: When platform supports it
65
- 4. **Durable Objects**: For edge computing platforms
66
-
67
- ## Conclusion
68
-
69
- Single-worker async is the **correct architectural choice** for MCP servers, not a limitation. It provides:
70
- - Protocol compliance
71
- - High concurrency
72
- - Simple deployment
73
- - Reliable session management
74
-
75
- This mirrors how other stateful protocols (WebSockets, SSE, GraphQL subscriptions) are typically deployed.
HUGGINGFACE_DEPLOYMENT.md DELETED
@@ -1,205 +0,0 @@
1
- # Hugging Face Spaces Deployment Guide
2
-
3
- This repository is configured for deployment on Hugging Face Spaces as a Model Context Protocol (MCP) server for Weights & Biases.
4
-
5
- ## Architecture
6
-
7
- The application runs as a FastAPI server on port 7860 (HF Spaces default) with:
8
- - **Main landing page**: `/` - Serves the index.html with setup instructions
9
- - **Health check**: `/health` - Returns server status and W&B configuration
10
- - **MCP endpoint**: `/mcp` - Streamable HTTP transport endpoint for MCP
11
- - Server can intelligently decide to return plain JSON or an SSE stream (the client always requests in the same way, see below)
12
- - Requires `Accept: application/json, text/event-stream` header
13
- - Supports initialize, tools/list, tools/call methods
14
-
15
- More information on the details of [streamable HTTP](https://modelcontextprotocol.io/specification/draft/basic/transports#streamable-http) is available in the official docs and [this PR](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/206).
16
-
17
- ## Key Changes for HF Spaces
18
-
19
- ### 1. app.py
20
- - Creates a FastAPI application that serves the landing page
21
- - Mounts FastMCP server using `mcp.streamable_http_app()` pattern (following [example from Mistral here](https://huggingface.co/spaces/Jofthomas/Multiple_mcp_fastapi_template))
22
- - Uses lifespan context manager for session management
23
- - Configured to run on `0.0.0.0:7860` (HF Spaces requirement)
24
- - Sets W&B cache directories to `/tmp` to avoid permission issues
25
-
26
- ### 2. server.py
27
- - Exports necessary functions for HF Spaces initialization
28
- - Supports being imported as a module
29
- - Maintains backward compatibility with CLI usage
30
-
31
- ### 3. Dependencies
32
- - FastAPI and uvicorn as main dependencies
33
- - All dependencies listed in requirements.txt for HF Spaces
34
-
35
- ### 4. Lazy Loading Fix
36
- - Changed `TraceService` initialization in `query_weave.py` to use lazy loading
37
- - This allows the server to start even without a W&B API key (e.g., when first adding the server in LeChat before connecting)
38
- - The service is only initialized when first needed
39
-
40
- ## Environment Variables
41
-
42
- No environment variables are required! The server works without any configuration.
43
-
44
- **Note**: Users provide their own W&B API keys as Bearer tokens. No server configuration needed (see AUTH_README.md).
45
-
46
- ## Deployment Steps
47
-
48
- 1. **Create a new Space on Hugging Face**
49
- - Choose "Docker" as the SDK
50
- - Set visibility as needed
51
-
52
- 2. **Configure Secrets**
53
- - Go to Settings → Variables and secrets
54
- - Add `MCP_SERVER_URL` as a variable so that the server URL is set correctly
55
-
56
- 3. **Push the Code**
57
- ```bash
58
- git add .
59
- git commit -m "Configure for HF Spaces deployment"
60
- git push
61
- ```
62
-
63
- 4. **Connect to the MCP Server**
64
- - Use the endpoint: `https://[your-username]-[space-name].hf.space/mcp`
65
- - Configure your MCP client with this URL and "streamable-http" transport
66
-
67
- ## File Structure
68
-
69
- ```
70
- .
71
- ├── app.py # HF Spaces entry point
72
- ├── index.html # Landing page
73
- ├── Dockerfile # Container configuration
74
- ├── requirements.txt # Python dependencies
75
- ├── pyproject.toml # Package configuration
76
- └── src/
77
- └── wandb_mcp_server/
78
- ├── server.py # MCP server implementation
79
- └── ... # Tool implementations
80
- ```
81
-
82
- ## Testing Locally
83
-
84
- To test the HF Spaces configuration locally:
85
-
86
- ```bash
87
- # Install dependencies
88
- pip install -r requirements.txt
89
-
90
- # Set environment variables
91
- export WANDB_API_KEY=your_key_here
92
-
93
- # Run the server
94
- python app.py
95
- ```
96
-
97
- The server will start on http://localhost:7860
98
-
99
- ## MCP Architecture & Key Learnings
100
-
101
- ### Understanding MCP and FastMCP
102
-
103
- The Model Context Protocol (MCP) is a protocol for communication between AI assistants and external tools/services. Through our experimentation, we discovered several important aspects:
104
-
105
- #### 1. FastMCP Framework
106
- - **FastMCP** is a Python framework that simplifies MCP server implementation
107
- - It provides decorators (`@mcp.tool()`) for easy tool registration
108
- - Internally uses Starlette for HTTP handling
109
- - Supports multiple transports: stdio, SSE, and streamable HTTP
110
-
111
- #### 2. Streamable HTTP Transport
112
- The streamable HTTP transport (introduced in [MCP PR #206](https://github.com/modelcontextprotocol/modelcontextprotocol/pull/206)) is the modern approach for remote MCP:
113
-
114
- - **Single endpoint** (`/mcp`) handles all communication
115
- - **Dual mode operation**:
116
- - Regular POST requests for stateless operations
117
- - SSE (Server-Sent Events) upgrade for streaming responses
118
- - **Key advantages**:
119
- - Stateless servers possible (no persistent connections required)
120
- - Better infrastructure compatibility ("just HTTP")
121
- - Supports both request-response and streaming patterns
122
-
123
- #### 3. Implementation Patterns
124
-
125
- ##### The HuggingFace Pattern
126
- Based on the [reference implementation](https://huggingface.co/spaces/Jofthomas/Multiple_mcp_fastapi_template), the correct pattern is:
127
-
128
- ```python
129
- # Create MCP server
130
- mcp = FastMCP("server-name")
131
-
132
- # Register tools
133
- @mcp.tool()
134
- def my_tool(): ...
135
-
136
- # Get streamable HTTP app (returns Starlette app)
137
- mcp_app = mcp.streamable_http_app()
138
-
139
- # Mount in FastAPI
140
- app.mount("/", mcp_app) # Note: mount at root, not at /mcp
141
- ```
142
-
143
- ##### Why Mount at Root?
144
- - `streamable_http_app()` creates internal routes at `/mcp`
145
- - Mounting at `/mcp` would create `/mcp/mcp` (double path)
146
- - Mounting at root gives us the clean `/mcp` endpoint
147
-
148
- #### 4. Session Management
149
- - FastMCP includes a `session_manager` for handling stateful operations
150
- - Use lifespan context manager to properly initialize/cleanup:
151
- ```python
152
- async with mcp.session_manager.run():
153
- yield
154
- ```
155
-
156
- #### 5. Response Format
157
- - MCP uses **Server-Sent Events (SSE)** for responses
158
- - Responses are prefixed with `event: message` and `data: `
159
- - JSON-RPC format for the actual message content
160
- - Example response:
161
- ```
162
- event: message
163
- data: {"jsonrpc":"2.0","id":1,"result":{...}}
164
- ```
165
-
166
- ### Critical Implementation Details
167
-
168
- #### 1. Required Headers
169
- Clients MUST send:
170
- - `Content-Type: application/json`
171
- - `Accept: application/json, text/event-stream`
172
-
173
- Without the correct Accept header, the server returns a "Not Acceptable" error.
174
-
175
- #### 2. Lazy Loading Pattern
176
- To avoid initialization issues (e.g., API keys required at import time):
177
- ```python
178
- # Instead of this:
179
- _service = Service() # Fails if no API key
180
-
181
- # Use lazy loading:
182
- _service = None
183
- def get_service():
184
- global _service
185
- if _service is None:
186
- _service = Service()
187
- return _service
188
- ```
189
-
190
- #### 3. Environment Setup for HF Spaces
191
- Critical for avoiding permission errors:
192
- ```python
193
- os.environ["WANDB_CACHE_DIR"] = "/tmp/.wandb_cache"
194
- os.environ["HOME"] = "/tmp"
195
- ```
196
-
197
- ### Common Pitfalls & Solutions
198
-
199
- | Issue | Symptom | Solution |
200
- |-------|---------|----------|
201
- | Double path (`/mcp/mcp`) | 404 errors on `/mcp` | Mount streamable_http_app() at root (`/`) |
202
- | Missing Accept header | "Not Acceptable" error | Include `Accept: application/json, text/event-stream` |
203
- | Import-time API key errors | Server fails to start | Use lazy loading pattern |
204
- | Permission errors in HF Spaces | `mkdir /.cache: permission denied` | Set cache dirs to `/tmp` |
205
- | Can't access MCP methods | Methods not exposed | Use FastMCP's built-in decorators and methods |
README.md CHANGED
@@ -131,7 +131,7 @@ The integrated [wandbot](https://github.com/wandb/wandbot) support agent provide
131
 
132
  This MCP server can be deployed in three ways. **We recommend starting with the hosted server** for the easiest setup experience.
133
 
134
- ### 🌐 Option 1: Hosted Server (Recommended - No Installation Required)
135
 
136
  Use our publicly hosted server on Hugging Face Spaces - **zero installation needed!**
137
 
@@ -139,7 +139,7 @@ Use our publicly hosted server on Hugging Face Spaces - **zero installation need
139
 
140
  > **ℹ️ Quick Setup:** Click the button for your client above, then use the configuration examples in the sections below. Just replace `YOUR_WANDB_API_KEY` with your actual API key from [wandb.ai/authorize](https://wandb.ai/authorize).
141
 
142
- ### 💻 Option 2: Local Development (STDIO)
143
 
144
  Run the server locally with direct stdio communication - best for development and testing.
145
 
@@ -239,7 +239,7 @@ Use the HTTPS URL in your OpenAI client:
239
 
240
  > **Note:** Free ngrok URLs change each time you restart. For persistent URLs, consider ngrok's paid plans or alternatives like Cloudflare Tunnel.
241
 
242
- ### 🔌 Option 3: Self-Hosted HTTP Server
243
 
244
  Deploy your own HTTP server with API key authentication - great for team deployments or custom infrastructure.
245
 
@@ -842,7 +842,7 @@ Deploy your own instance of the W&B MCP Server on Hugging Face Spaces:
842
  https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space/mcp
843
  ```
844
 
845
- See [HUGGINGFACE_DEPLOYMENT.md](HUGGINGFACE_DEPLOYMENT.md) for detailed deployment instructions.
846
 
847
  ### Run Local HTTP Server
848
 
@@ -872,7 +872,7 @@ wandb-mcp-server/
872
  ├── requirements.txt # Python dependencies for HTTP deployment
873
  ├── index.html # Landing page for web interface
874
  ├── AUTH_README.md # Authentication documentation
875
- ├── HUGGINGFACE_DEPLOYMENT.md # HF Spaces deployment guide
876
  ├── src/
877
  │ └── wandb_mcp_server/
878
  │ ├── server.py # Core MCP server (STDIO & HTTP)
@@ -1056,11 +1056,11 @@ The W&B MCP Server is built with a modern, scalable architecture designed for bo
1056
 
1057
  ### Key Design Principles
1058
 
1059
- 1. **Stateless Architecture**: Each request is independent, enabling horizontal scaling
1060
- 2. **Per-Request Authentication**: API keys are isolated per request using Python's ContextVar
1061
- 3. **No Global State**: Eliminated `wandb.login()` in favor of `wandb.Api(api_key=...)`
1062
- 4. **Transport Agnostic**: Supports both STDIO (local) and HTTP (remote) transports
1063
- 5. **Cloud Native**: Designed for containerization and deployment on platforms like Hugging Face Spaces
1064
 
1065
  ### Deployment Architecture
1066
 
@@ -1072,17 +1072,17 @@ The server can be deployed in multiple configurations:
1072
  - **Containerized**: Docker with configurable worker counts
1073
  - **Cloud Platforms**: Hugging Face Spaces, AWS, GCP, etc.
1074
 
1075
- For detailed scalability information and advanced deployment options, see the [Scalability Guide](SCALABILITY_GUIDE.md).
1076
 
1077
  ### Performance & Scalability
1078
 
1079
- The server has been thoroughly tested and can handle significant production workloads:
1080
 
1081
- **Measured Performance (HF Spaces, 2 vCPU)**:
1082
- - **Maximum Capacity**: 600 concurrent connections
1083
- - **Peak Throughput**: 150 req/s
1084
- - **Breaking Point**: 650-700 concurrent connections
1085
- - **100% Success Rate**: Up to 600 clients
1086
 
1087
  Run your own load tests:
1088
 
@@ -1097,7 +1097,44 @@ python load_test.py --url https://mcp.withwandb.com --mode stress
1097
  python load_test.py --url https://mcp.withwandb.com --clients 100 --requests 20
1098
  ```
1099
 
1100
- See the comprehensive [Scalability Guide](SCALABILITY_GUIDE.md) for detailed performance analysis, testing instructions, and optimization strategies.
1101
 
1102
  ## Support
1103
 
 
131
 
132
  This MCP server can be deployed in three ways. **We recommend starting with the hosted server** for the easiest setup experience.
133
 
134
+ ### Option 1: Hosted Server (Recommended - No Installation Required)
135
 
136
  Use our publicly hosted server on Hugging Face Spaces - **zero installation needed!**
137
 
 
139
 
140
  > **ℹ️ Quick Setup:** Click the button for your client above, then use the configuration examples in the sections below. Just replace `YOUR_WANDB_API_KEY` with your actual API key from [wandb.ai/authorize](https://wandb.ai/authorize).
141
 
142
+ ### Option 2: Local Development (STDIO)
143
 
144
  Run the server locally with direct stdio communication - best for development and testing.
145
 
 
239
 
240
  > **Note:** Free ngrok URLs change each time you restart. For persistent URLs, consider ngrok's paid plans or alternatives like Cloudflare Tunnel.
241
 
242
+ ### Option 3: Self-Hosted HTTP Server
243
 
244
  Deploy your own HTTP server with API key authentication - great for team deployments or custom infrastructure.
245
 
 
842
  https://YOUR_USERNAME-YOUR_SPACE_NAME.hf.space/mcp
843
  ```
844
 
845
+ The server is deployed on Hugging Face Spaces at `https://mcp.withwandb.com`.
846
 
847
  ### Run Local HTTP Server
848
 
 
872
  ├── requirements.txt # Python dependencies for HTTP deployment
873
  ├── index.html # Landing page for web interface
874
  ├── AUTH_README.md # Authentication documentation
875
+ ├── ARCHITECTURE.md # Architecture & scalability guide
876
  ├── src/
877
  │ └── wandb_mcp_server/
878
  │ ├── server.py # Core MCP server (STDIO & HTTP)
 
1056
 
1057
  ### Key Design Principles
1058
 
1059
+ 1. **Pure Stateless Mode**: Session IDs are correlation IDs only - no state persists
1060
+ 2. **Horizontal Scalability**: Any worker can handle any request
1061
+ 3. **Universal Compatibility**: Works with OpenAI, Cursor, LeChat, and all MCP clients
1062
+ 4. **Per-Request Authentication**: Bearer token required with every request
1063
+ 5. **Cloud Native**: Optimized for containerization and cloud deployment
1064
 
1065
  ### Deployment Architecture
1066
 
 
1072
  - **Containerized**: Docker with configurable worker counts
1073
  - **Cloud Platforms**: Hugging Face Spaces, AWS, GCP, etc.
1074
 
1075
+ For detailed architecture and scalability information, see the [Architecture Guide](ARCHITECTURE.md).
1076
 
1077
  ### Performance & Scalability
1078
 
1079
+ The stateless server architecture provides excellent performance:
1080
 
1081
+ **Measured Performance**:
1082
+ - **Remote Server (mcp.withwandb.com)**: 500+ concurrent connections @ ~35 req/s
1083
+ - **Local Server**: 1000 concurrent connections @ ~50 req/s
1084
+ - **100% Success Rate**: Up to 500 clients (remote) or 1000 (local)
1085
+ - **Horizontal Scaling**: Add workers to multiply capacity
1086
 
1087
  Run your own load tests:
1088
 
 
1097
  python load_test.py --url https://mcp.withwandb.com --clients 100 --requests 20
1098
  ```
1099
 
1100
+ See the [Architecture Guide](ARCHITECTURE.md) for detailed performance analysis, testing instructions, and deployment recommendations.
1101
+
1102
+ ## Example: Using with OpenAI
1103
+
1104
+ Here's a complete example using the W&B MCP Server with OpenAI's client:
1105
+
1106
+ ```python
1107
+ from openai import OpenAI
1108
+ from dotenv import load_dotenv
1109
+ import os
1110
+
1111
+ load_dotenv()
1112
+
1113
+ client = OpenAI()
1114
+
1115
+ resp = client.responses.create(
1116
+ model="gpt-4o", # Use gpt-4o for larger context window to handle all MCP tools
1117
+ tools=[
1118
+ {
1119
+ "type": "mcp",
1120
+ "server_label": "wandb",
1121
+ "server_description": "A tool to query and analyze Weights & Biases data.",
1122
+ "server_url": "https://mcp.withwandb.com/mcp", # Must use public URL for OpenAI
1123
+ "authorization": os.getenv('WANDB_API_KEY'), # Use authorization field directly
1124
+ "require_approval": "never",
1125
+ },
1126
+ ],
1127
+ input="How many traces are in wandb-smle/hiring-agent-demo-public?",
1128
+ )
1129
+
1130
+ print(resp.output_text)
1131
+ ```
1132
+
1133
+ **Key Points:**
1134
+ - OpenAI's MCP implementation is server-side, so you must use a publicly accessible URL
1135
+ - The `authorization` field should contain your W&B API key directly (not in headers)
1136
+ - Use the `gpt-4o` model for a context window large enough to handle all W&B tools
1137
+ - The server operates in stateless mode - each request includes authentication
1138
 
1139
  ## Support
1140
 
SCALABILITY_GUIDE.md DELETED
@@ -1,754 +0,0 @@
1
- # W&B MCP Server - Scalability & Performance Guide
2
-
3
- ## Table of Contents
4
- 1. [Current Architecture](#current-architecture)
5
- - [Architecture Decision](#architecture-decision-why-single-worker-async)
6
- - [Implementation Details](#implementation-details)
7
- 2. [Performance Test Results](#performance-test-results)
8
- 3. [Load Testing Guide](#load-testing-guide)
9
- 4. [Hardware Scaling Analysis](#hardware-scaling-analysis)
10
- 5. [Optimization Strategies](#optimization-strategies)
11
- 6. [Deployment Recommendations](#deployment-recommendations)
12
- 7. [Future Scaling Options](#future-scaling-options)
13
- 8. [Common Questions About the Architecture](#common-questions-about-the-architecture)
14
- 9. [Summary](#summary)
15
-
16
- ---
17
-
18
- ## Current Architecture
19
-
20
- ### Architecture Decision: Why Single-Worker Async?
21
-
22
- The W&B MCP server uses a **single-worker async architecture** - a deliberate design choice optimized for the Model Context Protocol's stateful session requirements.
23
-
24
- #### The Decision Process
25
-
26
- MCP (Model Context Protocol) requires stateful session management where:
27
- - Server creates session IDs on initialization
28
- - Clients must include session ID in subsequent requests
29
- - Session state must be maintained across the conversation
30
-
31
- #### Options We Considered
32
-
33
- | Option | Verdict | Reasoning |
34
- |--------|---------|-----------|
35
- | **Multi-Worker with Gunicorn** | ❌ Rejected | Session state not shared across workers; Requires Redis/Memcached (not available on HF Spaces); Breaks MCP protocol compliance |
36
- | **Multi-Worker with Sticky Sessions** | ❌ Rejected | No load balancer control on HF Spaces; Complex configuration; Doesn't guarantee session persistence |
37
- | **Single-Worker Async** | ✅ **Chosen** | Full MCP protocol compliance; Handles 1000+ concurrent requests; Simple, reliable architecture; Industry standard for stateful protocols |
38
-
39
- #### Industry Comparison
40
-
41
- | Server | Architecture | Reasoning |
42
- |--------|-------------|-----------|
43
- | GitHub MCP Server | Single process (Go) | Stateful sessions |
44
- | WebSocket servers | Single worker + async | Connection state |
45
- | GraphQL subscriptions | Single worker + async | Subscription state |
46
- | **W&B MCP Server** | **Single worker + async** | **MCP session state** |
47
-
48
- #### Why This Isn't a Limitation
49
-
50
- Single-worker async is the **correct architectural choice** for MCP servers, not a compromise. Despite using a single worker, the architecture provides:
51
- - **Concurrent Handling**: Async event loop processes I/O concurrently
52
- - **Non-blocking Operations**: Database queries and API calls don't block other requests
53
- - **High Throughput**: 500-2000 requests/second capability
54
- - **Memory Efficiency**: Only ~200-500MB for hundreds of concurrent sessions
55
-
56
- ### Single-Worker Async Design
57
-
58
- ```
59
- ┌─────────────────────────────────────┐
60
- │ Hugging Face Spaces │
61
- │ (2 vCPU, 16GB RAM) │
62
- └─────────────┬───────────────────────┘
63
-
64
- ┌─────────────▼───────────────────────┐
65
- │ Uvicorn ASGI Server (Port 7860) │
66
- │ Single Worker Process │
67
- │ ┌──────────────────────┐ │
68
- │ │ Async Event Loop │ │
69
- │ │ (uvloop if available)│ │
70
- │ └──────────────────────┘ │
71
- └─────────────┬───────────────────────┘
72
-
73
- ┌─────────────▼───────────────────────┐
74
- │ FastAPI Application │
75
- │ ┌────────────────────────────┐ │
76
- │ │ Authentication Middleware │ │
77
- │ │ (ContextVar API Keys) │ │
78
- │ └────────────────────────────┘ │
79
- │ ┌────────────────────────────┐ │
80
- │ │ MCP Session Manager │ │
81
- │ │ (In-Memory Session Store) │ │
82
- │ └────────────────────────────┘ │
83
- └─────────────┬───────────────────────┘
84
-
85
- ┌─────────────▼───────────────────────┐
86
- │ W&B MCP Tools │
87
- │ • query_weave_traces_tool │
88
- │ • count_weave_traces_tool │
89
- │ • query_wandb_tool │
90
- │ • create_wandb_report_tool │
91
- │ • query_wandb_entity_projects │
92
- │ • query_wandb_support_bot │
93
- └─────────────────────────────────────┘
94
- ```
95
-
96
- ### Key Design Principles
97
-
98
- 1. **Stateful Session Management**: MCP requires persistent session state, making single-worker optimal
99
- 2. **Async Concurrency**: Event loop handles thousands of concurrent connections
100
- 3. **ContextVar Isolation**: Thread-safe API key storage for concurrent requests
101
- 4. **Connection Pooling**: Reuses HTTP connections to W&B APIs
102
- 5. **Non-blocking I/O**: All tools use async operations
103
-
104
- ### Implementation Details
105
-
106
- #### Dockerfile Configuration
107
- ```dockerfile
108
- # Single worker (for session state), 120s keep-alive, 1000+ concurrent connections
109
- CMD ["uvicorn", "app:app", \
110
-      "--host", "0.0.0.0", \
111
-      "--port", "7860", \
112
-      "--workers", "1", \
113
-      "--log-level", "info", \
114
-      "--timeout-keep-alive", "120", \
115
-      "--limit-concurrency", "1000"]
116
- ```
117
-
118
- #### Session Management
119
- ```python
120
- # In-memory session storage (app.py)
121
- session_api_keys = {} # Maps MCP session ID to W&B API key
122
-
123
- # Session lifecycle:
124
- # 1. Client sends Bearer token on initialization
125
- # 2. Server creates session ID and stores API key
126
- # 3. Client uses session ID for subsequent requests
127
- # 4. Server retrieves API key from session storage
128
- ```
129
-
130
- #### API Key Isolation (ContextVar)
131
- ```python
132
- # Thread-safe API key storage for concurrent requests
133
- from contextvars import ContextVar
134
-
135
- api_key_context: ContextVar[str] = ContextVar('wandb_api_key')
136
-
137
- # Per-request isolation:
138
- # 1. Middleware sets API key in context
139
- # 2. Tools retrieve from context (not environment)
140
- # 3. Each concurrent request has isolated context
141
- ```
142
-
143
- ---
144
-
145
- ## Performance Test Results
146
-
147
- ### Executive Summary
148
-
149
- The W&B MCP Server deployed on Hugging Face Spaces has been thoroughly stress-tested. **Key Finding**: The server can reliably handle **up to 600 concurrent connections** with 100% success rate, achieving **113-150 req/s throughput**.
150
-
151
- ### Optimal Performance Zone (100% Success Rate)
152
-
153
- | Concurrent Clients | Success Rate | Throughput | Mean Response Time | p99 Response Time |
154
- |--------------------|-------------|------------|-------------------|-------------------|
155
- | 1 | 100% | 2.6 req/s | 340ms | N/A |
156
- | 10 | 100% | 25 req/s | 290ms | 380ms |
157
- | 50 | 100% | 86 req/s | 390ms | 550ms |
158
- | 100 | 100% | 97 req/s | 690ms | 1.0s |
159
- | 200 | 100% | 150 req/s | 890ms | 1.2s |
160
- | 300 | 100% | 129 req/s | 1.51s | 1.91s |
161
- | 500 | 100% | 98 req/s | 4.52s | 6.02s |
162
- | **600** | **100%** | **113 req/s** | ~5s | ~7s |
163
-
164
- ### Performance Degradation Zone
165
-
166
- | Concurrent Clients | Success Rate | Notes |
167
- |--------------------|-------------|-------|
168
- | 650 | 94% | First signs of degradation |
169
- | 700 | 12.7% | Breaking point - server overwhelmed |
170
- | 750+ | <10% | Complete failure |
171
-
172
- ### Performance Sweet Spots
173
-
174
- 1. **For Low Latency** (< 1s response time):
175
- - Use ≤ 100 concurrent connections
176
- - Expect ~97 req/s throughput
177
- - p99 latency: 1 second
178
-
179
- 2. **For Maximum Throughput**:
180
- - Use 200-300 concurrent connections
181
- - Achieve 130-150 req/s
182
- - p99 latency: 1.2-1.9 seconds
183
-
184
- 3. **For Maximum Capacity**:
185
- - Use up to 600 concurrent connections
186
- - Achieve ~113 req/s
187
- - p99 latency: ~7 seconds
188
-
189
- ### Capacity Limits
190
-
191
- - **Absolute Maximum**: 600 concurrent connections
192
- - **Safe Operating Limit**: 500 concurrent connections (with buffer)
193
- - **Recommended Production Limit**: 400 concurrent connections
194
- - **Breaking Point**: 650-700 concurrent connections
195
-
196
- ### Comparison: Local vs Deployed
197
-
198
- | Metric | Local (2 vCPU) | HF Spaces (2 vCPU) | Notes |
199
- |--------|----------------|-------------------|-------|
200
- | Max Concurrent | 100 | 600 | HF handles 6x more! |
201
- | Throughput | 600 req/s | 113-150 req/s | Network overhead |
202
- | p50 Latency | 20ms | 500ms | Network + processing |
203
- | Breaking Point | 100 clients | 650 clients | Better infrastructure |
204
-
205
- ---
206
-
207
- ## Load Testing Guide
208
-
209
- ### Prerequisites
210
-
211
- ```bash
212
- # Install dependencies
213
- pip install httpx
214
-
215
- # Or using uv (recommended)
216
- uv pip install httpx
217
- ```
218
-
219
- ### Test Tools Overview
220
-
221
- We provide a comprehensive load testing tool (`load_test.py`) with three modes:
222
-
223
- 1. **Standard Mode**: Runs predefined test suite (light, medium, heavy load)
224
- 2. **Stress Mode**: Finds the breaking point progressively
225
- 3. **Custom Mode**: Run specific test configurations
226
-
227
- ### Testing Local Server
228
-
229
- #### 1. Start the Local Server
230
-
231
- ```bash
232
- # Terminal 1: Start the server
233
- cd /path/to/mcp-server
234
- source .venv/bin/activate # or use uv
235
- uvicorn app:app --host 0.0.0.0 --port 7860 --workers 1
236
- ```
237
-
238
- #### 2. Run Load Tests
239
-
240
- ```bash
241
- # Terminal 2: Run tests
242
-
243
- # Standard test suite (recommended first test)
244
- python load_test.py --mode standard
245
-
246
- # Custom test with specific parameters
247
- python load_test.py --mode custom --clients 50 --requests 20 --delay 0.05
248
-
249
- # Stress test to find breaking point
250
- python load_test.py --mode stress
251
-
252
- # Test with real API key
253
- python load_test.py --api-key YOUR_WANDB_API_KEY --mode custom --clients 10 --requests 5
254
- ```
255
-
256
- ### Testing Deployed Hugging Face Space
257
-
258
- #### 1. Basic Functionality Test
259
-
260
- ```bash
261
- # Test with small load first
262
- python load_test.py \
263
- --url https://mcp.withwandb.com \
264
- --mode custom \
265
- --clients 5 \
266
- --requests 3
267
- ```
268
-
269
- #### 2. Progressive Load Testing
270
-
271
- ```bash
272
- # Light load (10 clients)
273
- python load_test.py \
274
- --url https://mcp.withwandb.com \
275
- --mode custom \
276
- --clients 10 \
277
- --requests 10
278
-
279
- # Medium load (50 clients)
280
- python load_test.py \
281
- --url https://mcp.withwandb.com \
282
- --mode custom \
283
- --clients 50 \
284
- --requests 10 \
285
- --delay 0.05
286
-
287
- # Heavy load (100 clients) - be careful!
288
- python load_test.py \
289
- --url https://mcp.withwandb.com \
290
- --mode custom \
291
- --clients 100 \
292
- --requests 20 \
293
- --delay 0.01
294
- ```
295
-
296
- #### 3. Comprehensive Stress Test
297
-
298
- ```bash
299
- # Run full stress test (gradually increases load)
300
- python load_test.py \
301
- --url https://mcp.withwandb.com \
302
- --mode stress
303
- ```
304
-
305
- ### Creating Custom Stress Tests
306
-
307
- For finding exact breaking points, create a custom test script:
308
-
309
- ```python
310
- #!/usr/bin/env python3
311
- """Custom stress test for finding precise limits"""
312
-
313
- import asyncio
314
- import time
315
- import httpx
316
-
317
- async def test_concurrent_load(url, num_clients):
318
- """Test specific number of concurrent clients"""
319
-
320
- async def make_request(client):
321
- try:
322
- response = await client.post(
323
- f"{url}/mcp",
324
- headers={
325
- "Authorization": "Bearer test_key_12345678901234567890",
326
- "Content-Type": "application/json",
327
- "Accept": "application/json, text/event-stream",
328
- },
329
- json={
330
- "jsonrpc": "2.0",
331
- "method": "initialize",
332
- "params": {
333
- "protocolVersion": "2025-06-18",
334
- "capabilities": {},
335
- "clientInfo": {"name": "stress_test", "version": "1.0"}
336
- },
337
- "id": 1
338
- },
339
- timeout=60
340
- )
341
- return response.status_code == 200
342
- except Exception:
343
- return False
344
-
345
- print(f"Testing {num_clients} concurrent clients...")
346
- start = time.time()
347
-
348
- async with httpx.AsyncClient(limits=httpx.Limits(max_connections=1000)) as client:
349
- tasks = [make_request(client) for _ in range(num_clients)]
350
- results = await asyncio.gather(*tasks)
351
-
352
- elapsed = time.time() - start
353
- success_count = sum(results)
354
- success_rate = (success_count / num_clients) * 100
355
-
356
- print(f" ✅ Success: {success_count}/{num_clients} ({success_rate:.1f}%)")
357
- print(f" ⚡ Throughput: {num_clients/elapsed:.2f} req/s")
358
- print(f" ⏱️ Time: {elapsed:.2f}s")
359
-
360
- return success_rate
361
-
362
- async def main():
363
- # Test specific range to find breaking point
364
- for clients in [500, 550, 600, 650, 700]:
365
- success_rate = await test_concurrent_load(
366
- "https://mcp.withwandb.com",
367
- clients
368
- )
369
- if success_rate < 50:
370
- print(f"🔥 Breaking point at {clients} clients!")
371
- break
372
- await asyncio.sleep(3) # Let server recover
373
-
374
- if __name__ == "__main__":
375
- asyncio.run(main())
376
- ```
377
-
378
- ### Understanding Test Results
379
-
380
- #### Key Metrics to Monitor
381
-
382
- 1. **Success Rate**: Percentage of successful requests
383
- - 100%: Perfect performance
384
- - 90-99%: Acceptable with retries
385
- - <90%: Performance issues
386
- - <50%: Breaking point
387
-
388
- 2. **Throughput (req/s)**: Total requests per second
389
- - Local: Can achieve 600+ req/s
390
- - HF Spaces: Typically 100-150 req/s peak
391
-
392
- 3. **Response Time Percentiles**:
393
- - p50 (median): Typical response time
394
- - p95: 95% of requests faster than this
395
- - p99: 99% of requests faster than this
396
-
397
- 4. **Resource Usage**:
398
- - Monitor HF Space dashboard for CPU/Memory
399
- - Local: Use `htop` or system monitor
400
-
401
- ### Test Results Interpretation
402
-
403
- ```
404
- ============================================================
405
- Load Test Results
406
- ============================================================
407
-
408
- 📊 Overall Metrics:
409
- Total Time: 3.46s # How long the test took
410
- Total Requests: 2100 # Total requests made
411
- Successful: 2100 (100.0%) # Success rate - key metric!
412
- Failed: 0 # Should be 0 for good performance
413
- Requests/Second: 607.33 # Throughput
414
-
415
- 🔑 Session Creation:
416
- Mean: 1.348s # Average time to create session
417
- Median: 1.342s # Middle value (less affected by outliers)
418
- Std Dev: 0.157s # Consistency (lower is better)
419
-
420
- 🔧 Tool Calls:
421
- Mean: 0.024s # Average tool call time
422
- Median: 0.020s # Typical tool call time
423
- Min: 0.001s # Fastest response
424
- Max: 0.077s # Slowest response
425
-
426
- 📈 Latency Percentiles:
427
- p50: 0.020s # 50% of requests faster than this
428
- p95: 0.070s # 95% of requests faster than this
429
- p99: 0.076s # 99% of requests faster than this
430
-
431
- ⚡ Throughput:
432
- Concurrent Clients: 100 # Number of simultaneous clients
433
- Requests/Second/Client: 6.07 # Per-client throughput
434
- Total Throughput: 606.83 req/s # Overall server throughput
435
- ```
436
-
437
- ---
438
-
439
- ## Hardware Scaling Analysis
440
-
441
- ### Current Configuration (2 vCPU, 16GB RAM on HF Spaces)
442
-
443
- **Actual Measured Performance**:
444
- - ✅ 600 concurrent connections with 100% success
445
- - ✅ 113-150 req/s sustained throughput
446
- - ✅ 100% reliability up to 600 clients
447
- - ✅ Graceful degradation 600-700 clients
448
-
449
- **This significantly exceeds initial estimates!** The combination of:
450
- - Efficient async architecture
451
- - HF Spaces infrastructure
452
- - Optimized connection handling
453
-
454
- Results in 6x better performance than expected.
455
-
456
- ### Potential Upgrade (8 vCPU, 32GB RAM)
457
-
458
- **Estimated Performance** (linear scaling from current):
459
- - ~2,400 concurrent connections (4x current)
460
- - ~450-600 req/s throughput
461
- - Better response times under load
462
- - More consistent p99 latencies
463
-
464
- ### Scaling Factors
465
-
466
- | Resource | Impact on Performance |
467
- |----------|---------------------|
468
- | **CPU Cores** | More concurrent request processing, better I/O scheduling |
469
- | **RAM** | Larger connection pools, more session storage, better caching |
470
- | **Network** | HF Spaces has excellent network infrastructure |
471
- | **Event Loop** | Single async loop scales well with resources |
472
-
473
- ---
474
-
475
- ## Optimization Strategies
476
-
477
- ### 1. Connection Pooling
478
- ```python
479
- # Already implemented in httpx clients
480
- connector = httpx.AsyncHTTPTransport(
481
- limits=httpx.Limits(
482
- max_connections=100,
483
- max_keepalive_connections=50
484
- )
485
- )
486
- ```
487
-
488
- ### 2. Session Management
489
- ```python
490
- # Periodic cleanup of old sessions
491
- async def cleanup_old_sessions():
492
- """Remove sessions older than 1 hour."""
493
- # session_timestamps maps session_id -> creation time (kept alongside session_api_keys)
- cutoff = time.time() - 3600
494
- for session_id in list(session_api_keys.keys()):
495
- if session_timestamps.get(session_id, 0) < cutoff:
496
- del session_api_keys[session_id]
497
- ```
498
-
499
- ### 3. Rate Limiting
500
- ```python
501
- # Add per-client rate limiting
502
- from slowapi import Limiter
- from slowapi.util import get_remote_address
503
- limiter = Limiter(key_func=get_remote_address)
504
-
505
- @app.post("/mcp")
506
- @limiter.limit("100/minute")
507
- async def mcp_endpoint(request: Request):
508
- # Handle request
509
- ```
510
-
511
- ### 4. Response Caching
512
- - Cache frequently accessed data (entity/project lists)
513
- - Use TTL-based caching for tool responses
514
- - Implement ETag support for conditional requests
515
-
516
- ### 5. Monitoring & Metrics
517
- ```python
518
- # Add Prometheus metrics
519
- from prometheus_client import Counter, Histogram, Gauge
520
-
521
- request_count = Counter('mcp_requests_total', 'Total requests', ['method', 'status'])
522
- request_duration = Histogram('mcp_request_duration_seconds', 'Request duration', ['method'])
523
- active_sessions = Gauge('mcp_active_sessions', 'Number of active sessions')
524
- ```
525
-
526
- ---
527
-
528
- ## Deployment Recommendations
529
-
530
- ### By Team Size
531
-
532
- #### Development/Testing (1-10 users)
533
- - ✅ Current HF Space perfect
534
- - Sub-second response times
535
- - No changes needed
536
-
537
- #### Small Teams (10-50 users)
538
- - ✅ Current HF Space excellent
539
- - ~86 req/s throughput
540
- - Response times < 600ms
541
-
542
- #### Medium Organizations (50-200 users)
543
- - ✅ Current HF Space adequate
544
- - 150 req/s peak throughput
545
- - Recommendations:
546
- - Implement request queueing
547
- - Add client-side retries
548
- - Set up monitoring
549
-
550
- #### Large Deployments (200-500 users)
551
- - ⚠️ Current HF Space at limits
552
- - Recommendations:
553
- - Implement load balancer
554
- - Add monitoring/alerting (>400 connections)
555
- - Consider upgrading HF Space tier
556
- - Or deploy multiple instances
557
-
558
- #### Enterprise (500+ users)
559
- - ❌ Exceeds current capacity
560
- - Solutions:
561
- - Deploy on dedicated infrastructure
562
- - Use Kubernetes with HPA
563
- - Implement Redis for session storage
564
- - Multiple server instances with load balancing
565
-
566
- ### Production Checklist
567
-
568
- If deploying for production use:
569
-
570
- 1. **Monitoring Setup**:
571
- ```bash
572
- # Set up alerts for:
573
- - Concurrent connections > 400
574
- - p99 latency > 5s
575
- - Success rate < 95%
576
- - Memory usage > 80%
577
- ```
578
-
579
- 2. **Client Configuration**:
580
- ```python
581
- # Recommended client settings
582
- client = httpx.AsyncClient(
583
- timeout=httpx.Timeout(30.0), # 30 second timeout
584
- limits=httpx.Limits(
585
- max_connections=10, # Per-client connection limit
586
- max_keepalive_connections=5
587
- )
588
- )
589
-
590
- # Implement exponential backoff
591
- async def retry_with_backoff(func, max_retries=3):
592
- for i in range(max_retries):
593
- try:
594
- return await func()
595
- except Exception as e:
596
- if i == max_retries - 1:
597
- raise
598
- await asyncio.sleep(2 ** i) # Exponential backoff
599
- ```
600
-
601
- 3. **Rate Limiting**:
602
- - Limit per-client to 100 requests/minute
603
- - Implement request quotas per API key
604
- - Add circuit breakers for failing clients
605
-
606
- 4. **Documentation**:
607
- - Document the 500 client soft limit
608
- - Provide client configuration examples
609
- - Create runbooks for high load scenarios
610
-
611
- ---
612
-
613
- ## Future Scaling Options
614
-
615
- When the single-worker architecture reaches its limits (500+ concurrent users), here's the scaling progression:
616
-
617
- ### Immediate Options (No Code Changes)
618
-
619
- 1. **Vertical Scaling**:
620
- - Upgrade to 8 vCPU, 32GB RAM HF Space
621
- - Expected: 2,400 concurrent connections, 450-600 req/s
622
- - Cost: ~4x higher but 4-5x performance gain
623
-
624
- 2. **Edge Deployment**:
625
- - Deploy in multiple regions with geo-routing
626
- - Reduce latency for global users
627
- - Each region handles its own sessions
628
-
629
- ### Advanced Options (Code Changes Required)
630
-
631
- #### Option 1: Horizontal Scaling with External Session Store
632
-
633
- Replace in-memory session storage with Redis:
634
-
635
- ```python
636
- # Redis-based session management
637
- from typing import Optional
-
- import redis.asyncio as redis
638
-
639
- class RedisSessionStore:
640
- def __init__(self, redis_url: str):
641
- self.redis = redis.from_url(redis_url)
642
-
643
- async def set_session(self, session_id: str, api_key: str):
644
- await self.redis.setex(f"mcp:session:{session_id}", 3600, api_key)
645
-
646
- async def get_session(self, session_id: str) -> Optional[str]:
647
- return await self.redis.get(f"mcp:session:{session_id}")
648
- ```
649
-
650
- This enables multiple worker processes while maintaining session state.
651
-
652
- #### Option 2: Edge Caching with CDN
653
-
654
- For read-heavy workloads:
655
- - Cache tool responses at CDN edge
656
- - Use cache keys based on (tool, params, api_key_hash)
657
- - TTL based on data freshness requirements
658
-
659
- #### Option 3: Serverless Functions
660
-
661
- For specific tools that don't need session state:
662
- - Deploy stateless tools as AWS Lambda / Cloud Functions
663
- - Route via API Gateway
664
- - Scale to thousands of concurrent executions
665
-
666
- #### Option 4: WebSocket Upgrade
667
-
668
- For real-time applications:
669
- - Upgrade to WebSocket connections
670
- - Maintain persistent connections
671
- - Push updates to clients
672
- - Reduce connection overhead
673
-
674
- #### Option 5: Multi-Region Deployment
675
-
676
- For global distribution:
677
- - Deploy in multiple regions
678
- - Use GeoDNS for routing
679
- - Implement cross-region session sync
680
- - Reduce latency for global users
681
-
682
-
683
-
684
- #### Option 6: Platform-Specific Solutions
685
-
686
- When platforms evolve to better support stateful applications:
687
-
688
- 1. **Kubernetes StatefulSets**:
689
- - When HF Spaces supports Kubernetes
690
- - Maintains pod identity across restarts
691
- - Enables persistent volume claims
692
-
693
- 2. **Durable Objects** (Cloudflare Workers):
694
- - Edge computing with guaranteed session affinity
695
- - Automatic scaling with state persistence
696
- - Global distribution
697
-
698
- ---
699
-
700
- ## Common Questions About the Architecture
701
-
702
- ### Q: Why not use multiple workers like traditional web apps?
703
-
704
- **A**: MCP is a stateful protocol, similar to WebSockets or GraphQL subscriptions. Multiple workers would break session continuity unless you add complex state synchronization (Redis, sticky sessions), which adds latency and complexity without improving performance for our I/O-bound workload.
705
-
706
- ### Q: Is single-worker a bottleneck?
707
-
708
- **A**: No. Our tests show a single async worker handles **600+ concurrent connections** and **150 req/s** on just 2 vCPUs. The bottleneck is network I/O to W&B APIs, not CPU processing. Adding workers wouldn't improve this.
709
-
710
- ### Q: How does this compare to multi-threaded servers?
711
-
712
- **A**: Python's GIL (Global Interpreter Lock) makes true multi-threading inefficient for CPU-bound work. For I/O-bound work (like our API calls), async/await with a single thread is actually more efficient than multi-threading due to lower overhead and no context switching.
713
-
714
- ### Q: What about reliability and fault tolerance?
715
-
716
- **A**:
717
- - **Health checks**: HF Spaces automatically restarts unhealthy containers
718
- - **Graceful shutdown**: Server properly closes connections on restart
719
- - **Session recovery**: Clients can re-authenticate with Bearer token
720
- - **Error handling**: Each request is isolated; one failure doesn't affect others
721
-
722
- ### Q: When would you need to change this architecture?
723
-
724
- **A**: Only when:
725
- 1. CPU-bound processing becomes significant (unlikely for MCP proxy)
726
- 2. You need 1000+ concurrent users (then use Redis for sessions)
727
- 3. Global distribution is required (deploy regional instances)
728
-
729
- ---
730
-
731
- ## Summary
732
-
733
- The W&B MCP Server on Hugging Face Spaces **significantly exceeds expectations**, handling 6x more concurrent connections than initially estimated.
734
-
735
- **Architecture Highlights**:
736
- - 🏗️ **Single-worker async**: The correct choice for stateful protocols
737
- - 🚀 **600 concurrent connections**: Proven capacity with 100% success rate
738
- - ⚡ **150 req/s peak throughput**: Excellent for I/O-bound operations
739
- - 🎯 **Simple and reliable**: No complex state synchronization needed
740
-
741
- **Key Achievements**:
742
- - ✅ **Industry-standard architecture** for stateful protocols
743
- - ✅ **Production-ready** for teams up to 500 users
744
- - ✅ **Clear scaling path** for larger deployments
745
- - ✅ **Cost-effective** on basic HF Space tier
746
-
747
- **Bottom Line by Team Size**:
748
- - ✅ **Development** (1-10 users): Perfect
749
- - ✅ **Small Teams** (10-50 users): Excellent
750
- - ✅ **Medium Teams** (50-200 users): Good
751
- - ⚠️ **Large Teams** (200-500 users): Adequate with monitoring
752
- - ❌ **Enterprise** (500+ users): Needs infrastructure upgrade
753
-
754
- The single-worker async architecture is not a limitation but a **deliberate design choice** that aligns with MCP's requirements and industry best practices for stateful protocols. The deployment on Hugging Face Spaces provides excellent value and surprising performance for small to medium-scale deployments.
SCALABILITY_GUIDE_CONCISE.md DELETED
@@ -1,712 +0,0 @@
- # MCP Server Scalability Guide
-
- ## System Design & Architecture
-
- ### Core Components Overview
-
- The W&B MCP Server is built with a layered architecture optimized for scalability:
-
- #### 1. **FastAPI Application Layer**
- - **Purpose**: HTTP server handling incoming requests
- - **Technology**: FastAPI with Uvicorn/Gunicorn
- - **Key Features**:
-   - Async request handling for non-blocking I/O
-   - Automatic OpenAPI documentation
-   - Middleware pipeline for authentication and logging
-   - Static file serving for the web interface
-
- #### 2. **Authentication Middleware**
- - **Purpose**: Secure, thread-safe API key management
- - **Technology**: Custom middleware using Python's ContextVar
- - **Implementation**:
-   ```python
-   # Per-request API key isolation (no global state)
-   api_key_context: ContextVar[str] = ContextVar('wandb_api_key')
-
-   # Each request gets an isolated context
-   token = api_key_context.set(api_key)
-   ```
- - **Benefits**:
-   - No race conditions between concurrent requests
-   - Thread-safe by design
-   - Zero global state pollution
-
- #### 3. **MCP Protocol Layer**
- - **Purpose**: Model Context Protocol implementation
- - **Technology**: FastMCP framework with streamable HTTP transport
- - **Features**:
-   - Tool registration and dynamic dispatch
-   - Session management for stateful operations
-   - SSE (Server-Sent Events) for response streaming
-   - JSON-RPC 2.0 protocol compliance (example request below)
-
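- For illustration, here is what a single `tools/call` round-trip can look like over the streamable HTTP transport, using JSON-RPC 2.0. The payload shape follows the MCP spec; the URL and tool arguments are placeholders, and the argument names are illustrative:
-
- ```python
- import httpx
-
- # Hypothetical deployment URL - replace with your own MCP endpoint
- MCP_URL = "https://your-space.hf.space/mcp"
-
- payload = {
-     "jsonrpc": "2.0",
-     "id": 1,
-     "method": "tools/call",
-     "params": {
-         "name": "count_weave_traces",  # one of the tools listed below
-         "arguments": {"entity": "my-team", "project": "my-project"},
-     },
- }
-
- resp = httpx.post(
-     MCP_URL,
-     json=payload,
-     headers={
-         "Authorization": "Bearer <your-wandb-api-key>",
-         # Streamable HTTP responses may arrive as JSON or as an SSE stream
-         "Accept": "application/json, text/event-stream",
-     },
- )
- print(resp.status_code, resp.text)
- ```
-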
- #### 4. **Tool Implementation Layer**
- - **Purpose**: W&B/Weave functionality exposure
- - **Components**:
-   - `query_wandb_tool`: GraphQL queries for experiments
-   - `query_weave_traces`: LLM trace analysis
-   - `count_weave_traces`: Efficient analytics
-   - `create_wandb_report`: Report generation
-   - `query_wandb_support_bot`: RAG-powered help
-
- ### Request Flow Architecture
-
- ```
- ┌──────────────┐
- │  MCP Client  │
- └──────┬───────┘
-        │ HTTPS + Bearer Token
-        ▼
- ┌──────────────────────────────────┐
- │ 1. Nginx/Load Balancer (HF)      │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 2. Gunicorn Master Process       │
- │    - Worker management           │
- │    - Request distribution        │
- └──────┬───────────────────────────┘
-        │ Round-robin
-        ▼
- ┌──────────────────────────────────┐
- │ 3. Uvicorn Worker (1 of N)       │
- │    - Async request handling      │
- │    - WebSocket/SSE support       │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 4. FastAPI Application           │
- │    - Route matching              │
- │    - Request validation          │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 5. Authentication Middleware     │
- │    - Bearer token extraction     │
- │    - API key validation          │
- │    - Context variable setup      │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 6. MCP Server (FastMCP)          │
- │    - JSON-RPC parsing            │
- │    - Tool dispatch               │
- │    - Session management          │
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 7. Tool Execution                │
- │    - Get API key from context    │
- │    - Create wandb.Api(api_key)   │
- │    - Execute W&B/Weave operations│
- └──────┬───────────────────────────┘
-        │
-        ▼
- ┌──────────────────────────────────┐
- │ 8. Response Generation           │
- │    - JSON-RPC formatting         │
- │   - SSE streaming (if applicable)│
- │    - Error handling              │
- └──────────────────────────────────┘
- ```
-
- ### Key Design Decisions
-
- #### 1. **No Global State**
- - **Problem**: `wandb.login()` sets global state, causing race conditions
- - **Solution**: Use `wandb.Api(api_key=...)` per request (sketch below)
- - **Benefit**: True request isolation, no cross-contamination
-
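- A minimal sketch of the per-request pattern (the helper name and query are illustrative; `wandb.Api(api_key=...)` is the per-request constructor the server relies on):
-
- ```python
- import wandb
-
- def run_tool(api_key: str, entity: str, project: str) -> list[str]:
-     # A fresh client per request: no wandb.login(), no shared global state
-     api = wandb.Api(api_key=api_key)
-     runs = api.runs(f"{entity}/{project}")
-     return [run.name for run in runs]
- ```
-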
- #### 2. **ContextVar for API Keys**
- - **Problem**: Thread-local storage doesn't work with async
- - **Solution**: Python's ContextVar for async-aware context
- - **Benefit**: Automatic propagation through async call chains
-
- #### 3. **Stateless Architecture**
- - **Problem**: Session state limits scalability
- - **Solution**: Stateless design with session correlation
- - **Benefit**: Horizontal scaling without sticky sessions
-
- #### 4. **Worker Recycling**
- - **Problem**: Long-running processes accumulate memory
- - **Solution**: Gunicorn's `--max-requests` with jitter
- - **Benefit**: Automatic memory leak prevention
-
- ## Current Production Architecture: Single-Worker Async
-
- ### Why Single-Worker?
-
- The MCP protocol requires stateful session management that is incompatible with multi-worker deployments:
- - Session IDs must be maintained across requests
- - Session state cannot be easily shared across worker processes
- - Like WebSocket connections, MCP sessions are inherently stateful
-
- Following the pattern of [GitHub's MCP Server](https://github.com/github/github-mcp-server) and other reference implementations, we use a **single-worker async architecture**.
-
- ### The Architecture: Async Event Loop Concurrency
-
- ```dockerfile
- # Single Uvicorn worker with async event loop
- CMD ["uvicorn", "app:app", \
-      "--workers", "1",              # Single worker for session state
-      "--loop", "uvloop",            # High-performance event loop
-      "--limit-concurrency", "1000"] # Handle 1000+ concurrent connections
- ```
-
- #### How It Handles Concurrent Requests
-
- ```
- ┌─────────────────────────────────────────────┐
- │           Single Uvicorn Process            │
- │                                             │
- │  ┌─────────────────────────────────────┐    │
- │  │      Async Event Loop (uvloop)      │    │
- │  │                                     │    │
- │  │  Request 1 ──┐                      │    │
- │  │  Request 2 ──├── Concurrent         │    │
- │  │  Request 3 ──├── Processing         │    │
- │  │  Request N ──┘   (Non-blocking I/O) │    │
- │  └─────────────────────────────────────┘    │
- │                                             │
- │  ┌─────────────────────────────────────┐    │
- │  │      In-Memory Session Storage      │    │
- │  │    { session_id: api_key, ... }     │    │
- │  └─────────────────────────────────────┘    │
- └─────────────────────────────────────────────┘
- ```
-
- ### Performance Characteristics
-
- Despite being single-worker, the async architecture provides excellent concurrency:
-
- | Metric | Capability | Explanation |
- |--------|-----------|-------------|
- | **Concurrent Requests** | 100-1000+ | Event loop handles I/O concurrently |
- | **Throughput** | 500-2000 req/s | Non-blocking async operations |
- | **Latency** | < 100 ms p50 | Efficient event loop scheduling |
- | **Memory** | ~200-500 MB | Single process, shared memory |
-
- ### The Problems We Solved
-
- - ✅ **Thread-Safe API Keys**: Using ContextVar for proper isolation
- - ✅ **MCP Session Compliance**: Proper session management in a single process
- - ✅ **High Concurrency**: Async event loop handles many concurrent requests
- - ✅ **No Race Conditions**: Request contexts properly isolated
-
- ## Future Scaling Architecture
-
- When single-worker async reaches its limits, here are proven scaling strategies:
-
- ### Option 1: Sticky Sessions with Load Balancer
-
- ```
- ┌──────────────────────────────────┐
- │  Load Balancer (Nginx/HAProxy)   │
- │      with Session Affinity       │
- └────────┬──────────┬──────────────┘
-          │          │
-     ┌────▼───┐  ┌───▼────┐
-     │Worker 1│  │Worker 2│   (Each maintains
-     │Sessions│  │Sessions│    own session state)
-     └────────┘  └────────┘
- ```
-
- **Implementation:**
- ```nginx
- upstream mcp_servers {
-     ip_hash;  # Session affinity based on client IP
-     server worker1:7860;
-     server worker2:7860;
- }
- ```
-
- ### Option 2: Shared Session Storage
-
- ```
- ┌────────────┐     ┌────────────┐
- │  Worker 1  │     │  Worker 2  │
- └─────┬──────┘     └─────┬──────┘
-       │                  │
-       ▼                  ▼
- ┌────────────────────────────┐
- │      Redis/Memcached       │
- │   (Shared Session Store)   │
- └────────────────────────────┘
- ```
-
- **Implementation:**
- ```python
- import redis
-
- redis_client = redis.Redis(host='redis-server')
-
- # Store session
- redis_client.setex(f"session:{session_id}", 3600, api_key)
-
- # Retrieve session
- api_key = redis_client.get(f"session:{session_id}")
- ```
-
- ### Option 3: Kubernetes with StatefulSets
-
- For cloud-native deployments:
- ```yaml
- apiVersion: apps/v1
- kind: StatefulSet
- metadata:
-   name: mcp-server
- spec:
-   serviceName: mcp-service
-   replicas: 3
-   podManagementPolicy: Parallel
-   # Each pod maintains persistent session state
- ```
-
- ### Option 4: Edge Computing with Durable Objects
-
- For global scale using Cloudflare Workers or similar:
- ```javascript
- // Durable Object for session state
- export class MCPSession {
-   constructor(state, env) {
-     this.state = state;
-     this.sessions = new Map();
-   }
-
-   async fetch(request) {
-     // Handle session-specific requests
-   }
- }
- ```
-
- ## Current Deployment Reality on Hugging Face Spaces
-
- Due to platform constraints:
- - ❌ No Redis/Memcached available
- - ❌ No sticky-session load balancer control
- - ❌ No Kubernetes StatefulSets
- - ✅ **Single-worker async is the optimal solution**
-
- This architecture successfully handles hundreds of concurrent users while maintaining MCP protocol compliance.
-
- ```python
- # Core Innovation: Context Variable Isolation
- from contextvars import ContextVar
- from fastapi import Request
-
- # Each request gets its own isolated API key context
- api_key_context: ContextVar[str] = ContextVar('wandb_api_key')
-
- # In middleware (per request)
- async def thread_safe_auth_middleware(request: Request, call_next):
-     api_key = extract_from_bearer_token(request)  # parses the Authorization header
-     token = api_key_context.set(api_key)          # Thread-safe storage
-     try:
-         response = await call_next(request)
-     finally:
-         api_key_context.reset(token)              # Cleanup
-     return response
- ```
-
- #### Multi-Worker Deployment Configuration
-
- ```dockerfile
- # Current production setup in Dockerfile
- CMD ["gunicorn", "app:app", \
-      "--bind", "0.0.0.0:7860", \
-      "--workers", "4", \
-      "--worker-class", "uvicorn.workers.UvicornWorker", \
-      "--timeout", "120", \
-      "--keep-alive", "5", \
-      "--max-requests", "1000", \
-      "--max-requests-jitter", "50"]
- ```
-
- **What each parameter does:**
- - `--workers 4`: 4 parallel processes (scales with CPU cores)
- - `--worker-class uvicorn.workers.UvicornWorker`: Full async/await support
- - `--max-requests 1000`: Auto-restart workers after 1000 requests (prevents memory leaks)
- - `--max-requests-jitter 50`: Randomize restarts to avoid all workers restarting simultaneously
- - `--timeout 120`: Allow long-running operations (e.g., large Weave queries)
-
- #### Request Flow Architecture
-
- ```
- Client Request
-       ↓
- [Gunicorn Master Process (PID 1)]
-       ↓ (Round-robin distribution)
- [Worker Process (1 of 4)]
-       ↓
- [FastAPI App Instance]
-       ↓
- [Thread-Safe Middleware]
-       ↓ (Sets ContextVar)
- [MCP Tool Execution]
-       ↓ (Uses isolated API key)
- [Response Stream]
- ```
-
- ## Comprehensive Testing Results
-
- ### Test Suite Executed
-
- #### 1. **Multi-Worker Distribution Test**
- ```python
- # Test: 50 concurrent health checks
- async def test_concurrent_health_checks(num_requests=50):
-     tasks = [send_health_request(session, i) for i in range(num_requests)]
-     results = await asyncio.gather(*tasks)
- ```
-
- **Results:**
- - ✅ **1,073 requests/second** throughput achieved
- - ✅ Even distribution across workers:
-   - Worker PID 7: 11 requests (22.0%)
-   - Worker PID 8: 13 requests (26.0%)
-   - Worker PID 9: 11 requests (22.0%)
-   - Worker PID 10: 15 requests (30.0%)
-
- #### 2. **API Key Isolation Test**
- ```python
- # Test: 100 concurrent requests from 20 different clients
- # Each client has a unique API key: test_api_key_client_001, etc.
- for client_id in range(20):
-     for request_num in range(5):
-         tasks.append(send_request_with_api_key(f"key_{client_id}"))
- random.shuffle(tasks)  # Simulate random arrival
- results = await asyncio.gather(*tasks)
- ```
-
- **Results:**
- - ✅ **Zero API key cross-contamination**
- - ✅ Each request maintained the correct API key throughout execution
- - ✅ **1,014 requests/second** with authentication enabled
-
- #### 3. **Stress Test**
- ```python
- # Test: Sustained load for 5 seconds at a 50 req/s target
- async def stress_test(duration_seconds=5, target_rps=50):
-     # Send requests continuously for the duration
-     while time.time() < end_time:
-         tasks.append(send_health_request())
-         await asyncio.sleep(1.0 / target_rps)
- ```
-
- **Results:**
- - ✅ **239 total requests processed**
- - ✅ **100% success rate** (0 errors)
- - ✅ Actual RPS: 46.9 (close to the 50 target)
- - ✅ All 4 workers utilized
-
- #### 4. **Authentication Enforcement Test**
- ```python
- # Test: Verify auth is properly enforced
- # 1. Request without token      → should get 401
- # 2. Request with invalid token → should get 401
- # 3. Request with valid token   → should succeed
- ```
-
- **Results:**
- - ✅ Correctly rejected unauthenticated requests (401)
- - ✅ Invalid API keys properly rejected
- - ✅ Valid tokens processed successfully
-
- ### Performance Comparison
-
- | Metric | Original | Current Production | Improvement |
- |--------|----------|-------------------|-------------|
- | **Concurrent Users** | 10-20 | 50-100 | **5x** |
- | **Peak Throughput** | ~50 req/s | 1,073 req/s | **21x** |
- | **Sustained Load** | ~20 req/s | 47 req/s | **2.3x** |
- | **API Key Safety** | ❌ Race condition | ✅ Thread-safe | **Fixed** |
- | **Worker Processes** | 1 | 4 | **4x** |
- | **Memory Management** | Unbounded | Auto-recycled | **Stable** |
-
- ## Quick Deployment (Already in Production)
-
- The concurrent version is already deployed. To update or redeploy:
-
- ```bash
- # The current app.py already includes all concurrent improvements
- git add .
- git commit -m "Update MCP server"
- git push  # Deploys to HF Spaces
-
- # To add more workers (if HF Spaces resources allow)
- echo "ENV WEB_CONCURRENCY=8" >> Dockerfile
- ```
-
- ---
-
- ## Large-Scale Deployment (100s-1000s of Agents)
-
- ### Architecture Overview
-
- ```
-                [Load Balancer]
-                       |
-         +-------------+-------------+
-         |             |             |
-     [Region 1]    [Region 2]    [Region 3]
-         |             |             |
-   +-----+-----+    +--+--+    +-----+-----+
-   |     |     |    |     |    |     |     |
- [Pod1][Pod2][Pod3][Pod4][Pod5][Pod6][Pod7]
-   |     |     |    |     |    |     |     |
-  [Redis Cache]   [Redis Cache]  [Redis Cache]
- ```
-
- ### Implementation Tiers
-
- #### Tier 1: Enhanced HF Spaces (50-200 agents)
- ```dockerfile
- # Just use more workers
- ENV WEB_CONCURRENCY=8
- ```
-
- #### Tier 2: Kubernetes Deployment (200-1000 agents)
-
- ```yaml
- # k8s-deployment.yaml
- apiVersion: apps/v1
- kind: Deployment
- metadata:
-   name: wandb-mcp-server
- spec:
-   replicas: 10
-   template:
-     spec:
-       containers:
-         - name: mcp-server
-           image: wandb-mcp:latest
-           resources:
-             requests:
-               cpu: "2"
-               memory: "4Gi"
-             limits:
-               cpu: "4"
-               memory: "8Gi"
-           env:
-             - name: WEB_CONCURRENCY
-               value: "8"
- ---
- apiVersion: v1
- kind: Service
- metadata:
-   name: wandb-mcp-service
- spec:
-   type: LoadBalancer
-   ports:
-     - port: 80
-       targetPort: 7860
- ```
-
- #### Tier 3: Cloud-Native Architecture (1000+ agents)
-
- **Components:**
- 1. **API Gateway** (AWS API Gateway / Kong)
-    - Rate limiting per client (sketch below)
-    - Request routing
-    - Authentication
-
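-    As a sketch of per-client rate limiting (a simple in-process token bucket; keying by API key and the numeric limits are illustrative, not the gateway's actual configuration):
-    ```python
-    import time
-    from collections import defaultdict
-
-    class TokenBucket:
-        def __init__(self, rate: float = 10.0, burst: int = 20):
-            self.rate, self.burst = rate, burst  # tokens/sec, max bucket size
-            self.tokens = defaultdict(lambda: float(burst))
-            self.last = defaultdict(time.monotonic)
-
-        def allow(self, client_key: str) -> bool:
-            now = time.monotonic()
-            elapsed = now - self.last[client_key]
-            self.last[client_key] = now
-            # Refill proportionally to elapsed time, capped at the burst size
-            self.tokens[client_key] = min(
-                self.burst, self.tokens[client_key] + elapsed * self.rate
-            )
-            if self.tokens[client_key] >= 1:
-                self.tokens[client_key] -= 1
-                return True
-            return False  # Caller should respond with HTTP 429
-    ```
-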
- 2. **Container Orchestration** (ECS/EKS/GKE)
-    ```bash
-    # AWS ECS example
-    aws ecs create-service \
-        --cluster mcp-cluster \
-        --service-name wandb-mcp \
-        --task-definition wandb-mcp:1 \
-        --desired-count 20 \
-        --launch-type FARGATE
-    ```
-
- 3. **Caching Layer** (Redis Cluster)
-    ```python
-    # In app_concurrent.py
-    import redis
-
-    redis_client = redis.RedisCluster(
-        startup_nodes=[{"host": "cache.aws.com", "port": "6379"}]
-    )
-
-    @lru_cache_redis(ttl=300)
-    async def cached_query(key, query_func, *args):
-        cached = redis_client.get(key)
-        if cached:
-            return json.loads(cached)
-        result = await query_func(*args)
-        redis_client.setex(key, 300, json.dumps(result))
-        return result
-    ```
-
- 4. **Queue System** (SQS/RabbitMQ for async processing)
-    ```python
-    # For heavy operations
-    from celery import Celery
-
-    celery_app = Celery('wandb_mcp', broker='redis://localhost:6379')
-
-    @celery_app.task
-    def process_large_report(params):
-        return create_report(**params)
-    ```
-
- 5. **Monitoring Stack**
-    - **Prometheus** + **Grafana**: Metrics
-    - **ELK Stack**: Logs
-    - **Jaeger**: Distributed tracing
-
- ### Quick Deployment Commands
-
- #### Docker Swarm (Medium Scale)
- ```bash
- docker swarm init
- docker service create \
-     --name wandb-mcp \
-     --replicas 10 \
-     --publish published=80,target=7860 \
-     wandb-mcp:concurrent
- ```
-
- #### Kubernetes with Helm (Large Scale)
- ```bash
- helm create wandb-mcp-chart
- helm install wandb-mcp ./wandb-mcp-chart \
-     --set replicaCount=20 \
-     --set image.repository=wandb-mcp \
-     --set image.tag=concurrent \
-     --set autoscaling.enabled=true \
-     --set autoscaling.minReplicas=10 \
-     --set autoscaling.maxReplicas=50
- ```
-
- #### AWS CDK (Enterprise)
- ```python
- # cdk_stack.py
- from aws_cdk import (
-     aws_ecs as ecs,
-     aws_ecs_patterns as patterns,
-     Stack
- )
-
- class WandBMCPStack(Stack):
-     def __init__(self, scope, id):
-         super().__init__(scope, id)
-
-         patterns.ApplicationLoadBalancedFargateService(
-             self, "WandBMCP",
-             task_image_options=patterns.ApplicationLoadBalancedTaskImageOptions(
-                 image=ecs.ContainerImage.from_registry("wandb-mcp:concurrent"),
-                 container_port=7860,
-                 environment={
-                     "WEB_CONCURRENCY": "8"
-                 }
-             ),
-             desired_count=20,
-             cpu=2048,
-             memory_limit_mib=4096
-         )
- ```
-
- ### Performance Optimization Checklist
-
- - [ ] **Connection Pooling**: Reuse W&B API connections
- - [ ] **Caching**: Redis for frequent queries
- - [ ] **CDN**: Static assets via Cloudflare
- - [ ] **Database**: Read replicas for analytics
- - [ ] **Async Everything**: No blocking operations
- - [ ] **Rate Limiting**: Per-user and global limits
- - [ ] **Circuit Breakers**: Prevent cascade failures (see the sketch below)
- - [ ] **Health Checks**: Automatic bad-instance removal
-
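- As one example from the checklist, a minimal circuit-breaker sketch (the thresholds and the wrapped call are assumptions, not the server's implementation):
-
- ```python
- import time
-
- class CircuitBreaker:
-     """Fail fast after repeated errors; retry after a cool-down period."""
-
-     def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
-         self.max_failures, self.reset_after = max_failures, reset_after
-         self.failures, self.opened_at = 0, None
-
-     async def call(self, coro_fn, *args, **kwargs):
-         # While open, reject immediately instead of hammering a failing backend
-         if self.opened_at and time.monotonic() - self.opened_at < self.reset_after:
-             raise RuntimeError("circuit open - failing fast")
-         try:
-             result = await coro_fn(*args, **kwargs)
-         except Exception:
-             self.failures += 1
-             if self.failures >= self.max_failures:
-                 self.opened_at = time.monotonic()
-             raise
-         self.failures, self.opened_at = 0, None  # Success closes the circuit
-         return result
- ```
-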
- ### Cost Optimization
-
- | Scale | Architecture | Est. Monthly Cost |
- |-------|-------------|------------------|
- | 50-100 agents | HF Spaces Pro | $9-49 |
- | 100-500 agents | 5x ECS Fargate | $200-500 |
- | 500-1000 agents | 20x EKS nodes | $800-1500 |
- | 1000+ agents | Multi-region K8s | $2000+ |
-
- ### Monitoring Metrics
-
- ```python
- # Key metrics to track
- METRICS = {
-     "request_rate": "promhttp_metric_handler_requests_total",
-     "response_time_p99": "http_request_duration_seconds{quantile='0.99'}",
-     "error_rate": "rate(http_requests_total{status=~'5..'}[5m])",
-     "api_key_cache_hit": "redis_cache_hits_total / redis_cache_requests_total",
-     "worker_saturation": "gunicorn_workers_busy / gunicorn_workers_total"
- }
- ```
-
- ### Emergency Scaling Playbook
-
- ```bash
- # Quick scale during a traffic spike
- kubectl scale deployment wandb-mcp --replicas=50
-
- # Add more nodes
- eksctl scale nodegroup --cluster=mcp-cluster --nodes=20
-
- # Enable autoscaling
- kubectl autoscale deployment wandb-mcp --min=10 --max=100 --cpu-percent=70
- ```
-
- ---
-
- ## Migration Path
-
- ### Step 1: Fix Current Issues (Day 1)
- Deploy `app_concurrent.py` to fix the API key race condition
-
- ### Step 2: Monitor & Optimize (Week 1)
- - Add metrics collection
- - Identify bottlenecks
- - Tune worker counts
-
- ### Step 3: Scale Horizontally (Month 1)
- - Deploy to Kubernetes
- - Add Redis caching
- - Implement rate limiting
-
- ### Step 4: Enterprise Features (Quarter 1)
- - Multi-region deployment
- - Advanced monitoring
- - SLA guarantees
-
- ---
-
- ## TL;DR for PR Description
-
- ````markdown
- ## Scalability Improvements
-
- This PR enables the MCP server to handle 100+ concurrent agents safely:
-
- ### Changes
- - ✅ Thread-safe API key handling using ContextVar
- - ✅ Multi-worker Gunicorn deployment (4x throughput)
- - ✅ Async execution for all tools
- - ✅ Worker recycling to prevent memory leaks
-
- ### Performance
- - Before: 10-20 concurrent users, 50 req/s
- - After: 50-100 concurrent users, 200 req/s
- - API keys now fully isolated (fixes security issue)
-
- ### Deployment
- ```bash
- # Simple upgrade - just use the new files
- cp app_concurrent.py app.py
- cp Dockerfile.concurrent Dockerfile
- ```
-
- ### Future Scale
- For 1000+ agents, see SCALABILITY_GUIDE_CONCISE.md for Kubernetes/cloud deployment options.
- ````
app.py CHANGED
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Thread-safe HuggingFace Spaces entry point for the Weights & Biases MCP Server.
+Thread-safe entry point for the Weights & Biases MCP Server.
 """
 
 import os
@@ -103,28 +103,27 @@ if api_key:
 else:
     logger.info("No server W&B API key configured - clients will provide their own")
 
-# Create the MCP server
-# NOT using stateless mode - we'll handle session sharing across workers
-logger.info("Creating W&B MCP server...")
-mcp = FastMCP("wandb-mcp-server")
+# Create the MCP server in stateless mode
+# All clients (OpenAI, Cursor, etc.) must provide Bearer token with each request
+# Session IDs are used only as correlation IDs, no state is persisted
+logger.info("Creating W&B MCP server in stateless HTTP mode...")
+mcp = FastMCP("wandb-mcp-server", stateless_http=True)
 
 # Register all W&B tools
 # The tools will use WandBApiManager.get_api_key() to get the current request's API key
 register_tools(mcp)
 
-# Session storage for API keys (maps MCP session ID to W&B API key)
-# This works in single-worker mode where all sessions are in the same process
-session_api_keys = {}
-
 # Custom authentication middleware
 async def thread_safe_auth_middleware(request: Request, call_next):
     """
-    Thread-safe authentication middleware for MCP endpoints.
+    Stateless authentication middleware for MCP endpoints.
+
+    Pure stateless operation - every request must include authentication:
+    - Session IDs are only used as correlation IDs
+    - No session state is stored between requests
+    - Each request must include Bearer token authentication
 
-    Handles MCP session management with proper API key association:
-    1. Initial request with Bearer token → store API key with session ID
-    2. Subsequent requests with session ID → retrieve stored API key
-    3. All requests get proper W&B authentication via context
+    This works with all clients (OpenAI, Cursor, etc.) that support MCP.
    """
     # Only apply auth to MCP endpoints
     if not request.url.path.startswith("/mcp"):
@@ -146,40 +145,31 @@ async def thread_safe_auth_middleware(request: Request, call_next):
     try:
         api_key = None
 
-        # Check if request has MCP session ID (for established sessions)
+        # Check if request has MCP session ID (correlation ID only in stateless mode)
         session_id = request.headers.get("Mcp-Session-Id") or request.headers.get("mcp-session-id")
         if session_id:
-            logger.info(f"Request has MCP Session ID: {session_id[:8]}...")
-            if session_id in session_api_keys:
-                # Use stored API key for this session
-                api_key = session_api_keys[session_id]
-                logger.info(f"Session found in storage, using stored API key")
-            else:
-                logger.warning(f"Session ID {session_id[:8]}... NOT found in storage!")
-                logger.info(f"  Active sessions ({len(session_api_keys)}): {[sid[:8] for sid in session_api_keys.keys()]}")
-                # Don't fail here - the request might have its own Bearer token
+            logger.debug(f"Request has correlation ID: {session_id[:8]}...")
 
         # Check for Bearer token (for new sessions or explicit auth)
         authorization = request.headers.get("Authorization", "")
         if authorization.startswith("Bearer "):
-            # Override with Bearer token if provided
-            api_key = authorization[7:].strip()
+            bearer_token = authorization[7:].strip()
 
             # Basic validation
-            if len(api_key) < 20 or len(api_key) > 100:
+            if len(bearer_token) < 20 or len(bearer_token) > 100:
                 return JSONResponse(
                     status_code=401,
                     content={"error": f"Invalid W&B API key format. Get your key at: https://wandb.ai/authorize"},
                     headers={"WWW-Authenticate": 'Bearer realm="W&B MCP", error="invalid_token"'}
                 )
+
+            # Use Bearer token
+            api_key = bearer_token
+            logger.info(f"Using Bearer token for authentication")
 
-        # Handle session cleanup
+        # Handle session cleanup (stateless mode - just acknowledge and pass through)
         if request.method == "DELETE" and session_id:
-            if session_id in session_api_keys:
-                del session_api_keys[session_id]
-                logger.info(f"Session cleanup: Deleted session {session_id[:8]}... (Remaining sessions: {len(session_api_keys)})")
-            else:
-                logger.warning(f"Session cleanup: Attempted to delete non-existent session {session_id[:8]}...")
+            logger.debug(f"Session cleanup: DELETE for {session_id[:8]}... (stateless - no action needed)")
             return await call_next(request)
 
         if api_key:
@@ -193,32 +183,20 @@ async def thread_safe_auth_middleware(request: Request, call_next):
             # Process the request
             response = await call_next(request)
 
-            # If MCP returns a session ID, store our API key for future requests
+            # In stateless mode, we don't store any session state
             response_session_id = response.headers.get("Mcp-Session-Id") or response.headers.get("mcp-session-id")
             if response_session_id:
-                if api_key:
-                    # Check if this session already exists
-                    if response_session_id in session_api_keys:
-                        logger.debug(f"Session {response_session_id[:8]}... already exists, updating API key")
-                    else:
-                        logger.info(f"New MCP session created: {response_session_id[:8]}...")
-                    session_api_keys[response_session_id] = api_key
-                    logger.info(f"Session storage updated. Total sessions: {len(session_api_keys)}")
-                    logger.debug(f"  Active session IDs: {[sid[:8] for sid in session_api_keys.keys()]}")
-                else:
-                    logger.warning(f"Session created but no API key to store: {response_session_id[:8]}...")
+                logger.debug(f"Response includes correlation ID: {response_session_id[:8]}...")
 
             return response
         finally:
             # Reset context variable
             WandBApiManager.reset_context_api_key(token)
     else:
-        # No API key available, let request through for MCP to handle
-        logger.warning(f"No API key available for request to {request.url.path}")
-        logger.info(f"  Session ID present: {bool(session_id)} ({session_id[:8] if session_id else 'None'}...)")
-        logger.info(f"  Bearer token present: {bool(authorization.startswith('Bearer '))}")
-        logger.info(f"  Request method: {request.method}")
-        logger.info("  Allowing request through for MCP to handle (may result in 401/404)")
+        # No API key available - in stateless mode, this is expected to fail
+        logger.warning(f"No Bearer token provided for {request.url.path}")
+        logger.debug(f"  Request method: {request.method}")
+        logger.debug("  Passing to MCP (will likely return 401)")
         return await call_next(request)
 
     except Exception as e:
@@ -437,6 +415,7 @@ if __name__ == "__main__":
     logger.info("Health check: /health")
     logger.info("MCP endpoint: /mcp")
 
-    # Run with single async worker for MCP session compatibility
-    logger.info("Starting server with single async worker (MCP requires stateful sessions)")
-    uvicorn.run(app, host="0.0.0.0", port=PORT)
+    # In stateless mode, we can scale horizontally with multiple workers
+    # However, for HuggingFace Spaces we use single worker for simplicity
+    logger.info("Starting server (stateless mode - supports horizontal scaling)")
+    uvicorn.run(app, host="0.0.0.0", port=PORT, workers=1)  # Can increase workers if needed