Avijit Ghosh committed on
Commit db1c946 · 1 Parent(s): 13685a1

Implement dynamic model switching for all 3 models in UI

- Add vllm-manager.py, a proxy that lists all 3 models in the UI model picker
- Automatically switches the vLLM backend when the user selects a different model (see the sketch below)
- Pre-downloads all models to /data/models (persistent storage)
- Models: Llama-3.1-8B, Qwen3-8B, gpt-oss-20b
- Switching takes ~2-3 minutes (after the initial download)
- Install the requests and huggingface-hub packages

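For context, a minimal sketch of how a client can exercise the switching behavior from inside the Space, assuming the manager proxy is reachable at http://localhost:8000 (the address entrypoint.sh exports as OPENAI_BASE_URL); the model IDs come from the MODELS table in vllm-manager.py:

```python
# Sketch: list the advertised models, then request a different one to trigger a switch.
# Assumes the vllm-manager proxy is listening on localhost:8000 (see entrypoint.sh).
import requests

BASE = "http://localhost:8000/v1"

# The proxy always advertises all three configured models.
models = requests.get(f"{BASE}/models", timeout=10).json()
print([m["id"] for m in models["data"]])

# Requesting a model other than the one currently loaded makes the manager
# stop vLLM and restart it with the requested model before answering.
resp = requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32,
    },
    timeout=600,  # a switch can take a few minutes
)
print(resp.json()["choices"][0]["message"]["content"])
```

Because the proxy handles requests serially, a request that triggers a switch simply blocks until the new backend reports healthy.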
Files changed (4)
  1. Dockerfile +3 -1
  2. VLLM_SETUP.md +13 -14
  3. entrypoint.sh +18 -26
  4. vllm-manager.py +236 -0
Dockerfile CHANGED
@@ -24,7 +24,7 @@ RUN apt-get update
 RUN apt-get install -y libgomp1 libcurl4 curl python3 python3-pip python3-venv
 
 # Install vLLM with AWS Neuron support for INF2
-RUN pip3 install --break-system-packages --no-cache-dir vllm awscli
+RUN pip3 install --break-system-packages --no-cache-dir vllm awscli requests huggingface-hub
 
 # ensure vllm cache dir exists before adjusting ownership
 RUN mkdir -p /home/user/.cache && chown -R 1000:1000 /home/user/.cache
@@ -36,10 +36,12 @@ USER user
 
 COPY --chown=1000 .env /app/.env
 COPY --chown=1000 entrypoint.sh /app/entrypoint.sh
+COPY --chown=1000 vllm-manager.py /app/vllm-manager.py
 COPY --chown=1000 package.json /app/package.json
 COPY --chown=1000 package-lock.json /app/package-lock.json
 
 RUN chmod +x /app/entrypoint.sh
+RUN chmod +x /app/vllm-manager.py
 
 FROM node:20 AS builder
 
VLLM_SETUP.md CHANGED
@@ -2,23 +2,13 @@
 
 This branch uses vLLM with AWS Neuron support for running models on Amazon INF2 instances.
 
-## Configuration
-
-### Environment Variables
-
-Set these in your HuggingFace Space secrets:
-
-```bash
-# Primary model to load (INF2 typically supports one model at a time)
-VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
-
-# Alternative models (change VLLM_MODEL to switch):
-# VLLM_MODEL=Qwen/Qwen3-8B
-# VLLM_MODEL=openai/gpt-oss-20b
-# VLLM_MODEL=microsoft/Phi-3-mini-4k-instruct
-```
-
-### Model Equivalents (Ollama → HuggingFace)
-
+**✨ Dynamic Model Switching**: All three models appear in the UI's model picker. When you select a different model, the system automatically restarts vLLM with the new model (takes ~2-3 minutes after first download).
+
+## Configuration
+
+### Available Models
+
+All three models are pre-configured and cached in persistent storage:
+
 | Ollama Model | HuggingFace Model | Notes |
 |--------------|-------------------|-------|
@@ -26,6 +16,15 @@ VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
 | `qwen3:8b` | `Qwen/Qwen3-8B` | Fast, multilingual |
 | `gpt-oss:20b` | `openai/gpt-oss-20b` | Larger, more capable |
 
+### Environment Variables
+
+```bash
+# Default model to load at startup
+VLLM_MODEL=meta-llama/Llama-3.1-8B-Instruct
+```
+
+You can change the default startup model, but all three models will be available in the UI regardless.
+
 ### Supported Models for INF2
 
 vLLM with Neuron supports:
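Because `huggingface-cli download --cache-dir /data/models` uses the standard Hub cache layout, you can sanity-check what has actually been pre-downloaded. A small sketch, assuming it runs inside the container where /data/models is the persistent volume:

```python
# Sketch: inspect the persistent model cache that vllm-manager.py pre-populates.
from huggingface_hub import scan_cache_dir

cache = scan_cache_dir("/data/models")
for repo in sorted(cache.repos, key=lambda r: r.repo_id):
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.1f} GB on disk")
print(f"Total: {cache.size_on_disk / 1e9:.1f} GB")
```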
entrypoint.sh CHANGED
@@ -14,45 +14,37 @@ if [ "$INCLUDE_DB" = "true" ] ; then
 nohup mongod &
 fi;
 
-# Start vLLM service with OpenAI-compatible API for HF space
-echo "Starting vLLM service with OpenAI-compatible API"
+# Start vLLM Model Manager (handles multiple models with dynamic switching)
+echo "Starting vLLM Model Manager"
 
 # Ensure dir for model cache
 mkdir -p /data/models
 
 # Default model for vLLM (can be overridden via VLLM_MODEL env var)
-# Note: INF2 typically supports one model at a time due to memory constraints
-# Available models:
-#   - meta-llama/Llama-3.1-8B-Instruct (equivalent to llama3.1:8b)
-#   - Qwen/Qwen3-8B (equivalent to qwen3:8b)
-#   - openai/gpt-oss-20b (equivalent to gpt-oss:20b)
-VLLM_MODEL=${VLLM_MODEL:-"meta-llama/Llama-3.1-8B-Instruct"}
+# Available models: meta-llama/Llama-3.1-8B-Instruct, Qwen/Qwen3-8B, openai/gpt-oss-20b
+export VLLM_MODEL=${VLLM_MODEL:-"meta-llama/Llama-3.1-8B-Instruct"}
 
-echo "Loading model: $VLLM_MODEL"
+# Make manager executable
+chmod +x /app/vllm-manager.py
 
-# Start vLLM OpenAI-compatible server
-# Using --served-model-name to make models accessible via simpler names
-nohup python3 -m vllm.entrypoints.openai.api_server \
-    --model "$VLLM_MODEL" \
-    --host 0.0.0.0 \
-    --port 8000 \
-    --device neuron \
-    --tensor-parallel-size 2 \
-    > /tmp/vllm.log 2>&1 &
-VLLM_PID=$!
+# Start the vLLM manager (it handles vLLM and provides model switching)
+nohup python3 /app/vllm-manager.py > /tmp/vllm-manager.log 2>&1 &
+MANAGER_PID=$!
 
-# Override OPENAI_BASE_URL to use local vLLM at runtime
+# Override OPENAI_BASE_URL to use local vLLM manager at runtime
 export OPENAI_BASE_URL=http://localhost:8000/v1
-echo "OPENAI_BASE_URL set to $OPENAI_BASE_URL for local vLLM"
+echo "OPENAI_BASE_URL set to $OPENAI_BASE_URL for local vLLM with model switching"
 
-# Wait for vLLM to be ready
-MAX_RETRIES=60
+# Wait for vLLM manager to be ready
+MAX_RETRIES=120
 RETRY_COUNT=0
-echo "Waiting for vLLM to be ready (this may take a few minutes for model loading)..."
+echo "Waiting for vLLM manager to be ready (this may take several minutes for model loading)..."
 until curl -s http://localhost:8000/health > /dev/null 2>&1; do
   RETRY_COUNT=$((RETRY_COUNT + 1))
   if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
-    echo "vLLM failed to start after $MAX_RETRIES attempts"
+    echo "vLLM manager failed to start after $MAX_RETRIES attempts"
+    echo "=== vLLM manager logs ==="
+    cat /tmp/vllm-manager.log
     echo "=== vLLM logs ==="
     cat /tmp/vllm.log
     exit 1
@@ -63,7 +55,7 @@ until curl -s http://localhost:8000/health > /dev/null 2>&1; do
   fi
 done
 
-echo "vLLM is ready!"
+echo "vLLM manager is ready! All 3 models available in UI."
 
 export PUBLIC_VERSION=$(node -p "require('./package.json').version")
 
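The readiness gate is simply a poll of the manager's /health endpoint, which in turn probes the backend vLLM on port 8001. A rough Python equivalent of the curl loop, for illustration only; the 5-second delay here is an assumption, not necessarily the script's actual sleep interval:

```python
# Sketch: wait until the vLLM manager reports healthy, mirroring the entrypoint's curl loop.
import time
import requests

def wait_for_manager(url="http://localhost:8000/health", max_retries=120, delay_s=5):
    for _ in range(max_retries):
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.RequestException:
            pass  # manager not up yet, or backend vLLM still loading the model
        time.sleep(delay_s)
    return False

if __name__ == "__main__":
    if not wait_for_manager():
        raise SystemExit("vLLM manager did not become healthy in time")
    print("vLLM manager is ready")
```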
vllm-manager.py ADDED
@@ -0,0 +1,236 @@
+#!/usr/bin/env python3
+"""
+vLLM Model Manager for INF2
+Handles dynamic model switching by restarting vLLM with the requested model.
+"""
+import os
+import subprocess
+import signal
+import time
+import json
+from http.server import HTTPServer, BaseHTTPRequestHandler
+from threading import Thread
+import requests
+
+# Model configurations
+MODELS = {
+    "meta-llama/Llama-3.1-8B-Instruct": {
+        "id": "meta-llama/Llama-3.1-8B-Instruct",
+        "displayName": "Llama 3.1 8B",
+        "description": "Meta's Llama 3.1 8B Instruct model"
+    },
+    "Qwen/Qwen3-8B": {
+        "id": "Qwen/Qwen3-8B",
+        "displayName": "Qwen 3 8B",
+        "description": "Alibaba's Qwen 3 8B model"
+    },
+    "openai/gpt-oss-20b": {
+        "id": "openai/gpt-oss-20b",
+        "displayName": "GPT OSS 20B",
+        "description": "OpenAI's GPT OSS 20B model"
+    }
+}
+
+# Current state
+current_model = os.environ.get("VLLM_MODEL", "meta-llama/Llama-3.1-8B-Instruct")
+vllm_process = None
+cache_dir = "/data/models"
+
+def start_vllm(model_id):
+    """Start vLLM server with the specified model"""
+    global vllm_process
+
+    print(f"Starting vLLM with model: {model_id}")
+
+    cmd = [
+        "python3", "-m", "vllm.entrypoints.openai.api_server",
+        "--model", model_id,
+        "--host", "0.0.0.0",
+        "--port", "8001",  # Use 8001 for actual vLLM
+        "--device", "neuron",
+        "--tensor-parallel-size", "2",
+        "--download-dir", cache_dir
+    ]
+
+    vllm_process = subprocess.Popen(
+        cmd,
+        stdout=open("/tmp/vllm.log", "a"),
+        stderr=subprocess.STDOUT
+    )
+
+    # Wait for vLLM to be ready
+    for i in range(120):  # 10 minutes timeout
+        try:
+            resp = requests.get("http://localhost:8001/health", timeout=1)
+            if resp.status_code == 200:
+                print(f"vLLM ready with model: {model_id}")
+                return True
+        except:
+            pass
+        time.sleep(5)
+        if i % 6 == 0:
+            print(f"Waiting for vLLM... ({i*5}s)")
+
+    print("ERROR: vLLM failed to start")
+    return False
+
+def stop_vllm():
+    """Stop the current vLLM process"""
+    global vllm_process
+    if vllm_process:
+        print("Stopping vLLM...")
+        vllm_process.send_signal(signal.SIGTERM)
+        vllm_process.wait(timeout=30)
+        vllm_process = None
+        time.sleep(2)
+
+def switch_model(new_model_id):
+    """Switch to a different model"""
+    global current_model
+    if new_model_id not in MODELS:
+        return False
+    if new_model_id == current_model:
+        return True
+
+    print(f"Switching from {current_model} to {new_model_id}")
+    stop_vllm()
+    current_model = new_model_id
+    return start_vllm(new_model_id)
+
+class ProxyHandler(BaseHTTPRequestHandler):
+    """Proxy requests to vLLM, with custom /models endpoint"""
+
+    def log_message(self, format, *args):
+        """Suppress default logging"""
+        pass
+
+    def do_GET(self):
+        if self.path == "/v1/models" or self.path == "/models":
+            # Return all available models
+            models_list = {
+                "object": "list",
+                "data": [
+                    {
+                        "id": model_id,
+                        "object": "model",
+                        "created": 1234567890,
+                        "owned_by": "system",
+                        "description": info["description"]
+                    }
+                    for model_id, info in MODELS.items()
+                ]
+            }
+            self.send_response(200)
+            self.send_header("Content-Type", "application/json")
+            self.end_headers()
+            self.wfile.write(json.dumps(models_list).encode())
+        elif self.path == "/health":
+            # Health check
+            try:
+                resp = requests.get("http://localhost:8001/health", timeout=1)
+                self.send_response(resp.status_code)
+                self.end_headers()
+                self.wfile.write(resp.content)
+            except:
+                self.send_response(503)
+                self.end_headers()
+        else:
+            # Proxy to vLLM
+            self.proxy_request()
+
+    def do_POST(self):
+        # Check if this is a chat completion request with model switch
+        if self.path.startswith("/v1/chat/completions"):
+            content_length = int(self.headers.get('Content-Length', 0))
+            body = self.rfile.read(content_length)
+            try:
+                data = json.loads(body)
+                requested_model = data.get("model")
+
+                # Switch model if needed
+                if requested_model and requested_model != current_model:
+                    if switch_model(requested_model):
+                        print(f"Switched to model: {requested_model}")
+                    else:
+                        self.send_response(500)
+                        self.send_header("Content-Type", "application/json")
+                        self.end_headers()
+                        self.wfile.write(json.dumps({
+                            "error": f"Failed to switch to model: {requested_model}"
+                        }).encode())
+                        return
+
+                # Update model in request to current model
+                data["model"] = current_model
+                body = json.dumps(data).encode()
+            except:
+                pass
+
+        # Proxy to vLLM
+        self.proxy_request(body)
+
+    def proxy_request(self, body=None):
+        """Forward request to vLLM"""
+        try:
+            url = f"http://localhost:8001{self.path}"
+            headers = dict(self.headers)
+            headers.pop("Host", None)
+
+            if body is None and self.command == "POST":
+                content_length = int(self.headers.get('Content-Length', 0))
+                body = self.rfile.read(content_length)
+
+            resp = requests.request(
+                method=self.command,
+                url=url,
+                headers=headers,
+                data=body,
+                stream=True
+            )
+
+            self.send_response(resp.status_code)
+            for key, value in resp.headers.items():
+                if key.lower() not in ['transfer-encoding', 'connection']:
+                    self.send_header(key, value)
+            self.end_headers()
+
+            for chunk in resp.iter_content(chunk_size=8192):
+                if chunk:
+                    self.wfile.write(chunk)
+        except Exception as e:
+            print(f"Proxy error: {e}")
+            self.send_response(500)
+            self.end_headers()
+
+def pre_download_models():
+    """Pre-download all models to cache"""
+    print("Pre-downloading models to cache...")
+    for model_id in MODELS.keys():
+        print(f"Downloading {model_id}...")
+        subprocess.run([
+            "huggingface-cli", "download", model_id,
+            "--cache-dir", cache_dir
+        ])
+    print("All models downloaded!")
+
+if __name__ == "__main__":
+    # Ensure cache directory exists
+    os.makedirs(cache_dir, exist_ok=True)
+
+    # Pre-download models in background
+    download_thread = Thread(target=pre_download_models, daemon=True)
+    download_thread.start()
+
+    # Start vLLM with default model
+    if not start_vllm(current_model):
+        print("Failed to start vLLM, exiting")
+        exit(1)
+
+    # Start proxy server
+    print("Starting proxy server on port 8000...")
+    server = HTTPServer(('0.0.0.0', 8000), ProxyHandler)
+    try:
+        server.serve_forever()
+    except KeyboardInterrupt:
+        print("Shutting down...")
+        stop_vllm()
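Worth noting about the proxy design: `/v1/models` on port 8000 always advertises all three configured entries, while the real vLLM server on port 8001 only reports whichever model is currently loaded. A quick sketch to compare the two from inside the container (port 8001 is internal to the manager):

```python
# Sketch: compare what the manager advertises vs. what the backend vLLM has loaded.
import requests

advertised = requests.get("http://localhost:8000/v1/models", timeout=10).json()
loaded = requests.get("http://localhost:8001/v1/models", timeout=10).json()

print("Advertised by manager:", [m["id"] for m in advertised["data"]])
print("Loaded in vLLM:       ", [m["id"] for m in loaded["data"]])
```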