Avijit Ghosh committed · Commit 5496ece · 1 Parent(s): 4075f84
Switch from Ollama to vLLM with AWS Neuron support for INF2
- Replace Ollama with vLLM in Dockerfile
- Update entrypoint.sh to start vLLM with Neuron device support
- Add tensor parallelism for better performance on INF2
- Update OpenAI API endpoint to vLLM server (port 8000)
- Add VLLM_SETUP.md with configuration guide
- Default model: Llama-3.1-8B-Instruct
- Dockerfile +5 -5
- VLLM_SETUP.md +62 -0
- entrypoint.sh +33 -29
Dockerfile
CHANGED
```diff
@@ -21,13 +21,13 @@ RUN touch /app/.env.local
 
 USER root
 RUN apt-get update
-RUN apt-get install -y libgomp1 libcurl4 curl
+RUN apt-get install -y libgomp1 libcurl4 curl python3 python3-pip python3-venv
 
-# Install
-RUN
+# Install vLLM with AWS Neuron support for INF2
+RUN pip3 install --no-cache-dir vllm awscli
 
-# ensure
-RUN mkdir -p /home/user/.
+# ensure vllm cache dir exists before adjusting ownership
+RUN mkdir -p /home/user/.cache && chown -R 1000:1000 /home/user/.cache
 
 # ensure npm cache dir exists before adjusting ownership
 RUN mkdir -p /home/user/.npm && chown -R 1000:1000 /home/user/.npm
```
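A quick sanity check of the new layer (a sketch, not part of the commit), run inside the built image, confirms that the vLLM and AWS CLI installs resolve:

```bash
# Sketch: verify the packages installed by the updated Dockerfile layer.
python3 -c "import vllm; print('vllm', vllm.__version__)"   # vLLM wheel imports
aws --version                                               # awscli entry point
```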
VLLM_SETUP.md
ADDED
@@ -0,0 +1,62 @@
# vLLM Setup for AWS INF2

This branch uses vLLM with AWS Neuron support for running models on Amazon INF2 instances.

## Configuration

### Environment Variables

Set these in your HuggingFace Space secrets:

```bash
# Specify which models to load (comma-separated for multiple models)
VLLM_MODELS=meta-llama/Llama-3.1-8B-Instruct

# Or use smaller models for faster loading:
# VLLM_MODELS=microsoft/Phi-3-mini-4k-instruct
```

### Supported Models for INF2

vLLM with Neuron supports:

- Llama 2 (7B, 13B)
- Llama 3 / 3.1 (8B)
- Mistral (7B)
- Mixtral (8x7B with careful configuration)

### Model Loading Time

First startup will take 5-15 minutes as vLLM:

1. Downloads the model from HuggingFace
2. Compiles it for Inferentia2 chips
3. Caches the compiled version

Subsequent starts will be faster (2-3 minutes) as the compiled model is cached.
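Caching only helps across restarts if it lands on persistent storage. A minimal sketch of pointing the Hugging Face download cache at the `/data/models` directory that entrypoint.sh already creates (exporting `HF_HOME` for this is an assumption of the sketch, not something the commit configures):

```bash
# Sketch: keep Hugging Face downloads under the persistent /data volume so
# step 1 (downloading the model) is skipped on subsequent starts.
# Assumption: export HF_HOME before launching vLLM in entrypoint.sh.
export HF_HOME=/data/models
mkdir -p "$HF_HOME"
```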
## Hardware Requirements

- **Minimum**: inf2.xlarge (1 Inferentia2 chip, 2 NeuronCores)
- **Recommended**: inf2.8xlarge (1 Inferentia2 chip, 32 vCPUs) for better performance
- **For larger models**: inf2.24xlarge or inf2.48xlarge (see the tensor-parallelism sketch after this list)
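entrypoint.sh launches vLLM with `--tensor-parallel-size 2`, which matches the two NeuronCores of a single Inferentia2 chip. On the multi-chip instances the size could be raised to use the extra cores; a sketch under that assumption (the `TP_SIZE` variable is hypothetical and is not read by the committed script):

```bash
# Sketch: match --tensor-parallel-size to the NeuronCores available
# (2 per Inferentia2 chip: inf2.xlarge/8xlarge = 2, inf2.24xlarge = 12,
# inf2.48xlarge = 24). TP_SIZE is a hypothetical override for illustration.
TP_SIZE=${TP_SIZE:-2}

python3 -m vllm.entrypoints.openai.api_server \
    --model "${VLLM_MODELS:-meta-llama/Llama-3.1-8B-Instruct}" \
    --host 0.0.0.0 \
    --port 8000 \
    --device neuron \
    --tensor-parallel-size "$TP_SIZE"
```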
## Architecture

- vLLM runs on port 8000 with an OpenAI-compatible API (request sketch below)
- Chat UI connects via `http://localhost:8000/v1`
- Tensor parallelism splits the model across Neuron cores
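A minimal request sketch against that endpoint, assuming the default model from this commit is the one being served (the prompt is illustrative only):

```bash
# Sketch: call the OpenAI-compatible chat completions endpoint exposed by vLLM.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```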
## Troubleshooting

If vLLM fails to start:

1. Check Space logs for compilation errors (see the log checks sketched below)
2. Ensure the model is compatible with Neuron
3. Try a smaller model first
4. Increase the timeout in entrypoint.sh if needed
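For step 1, entrypoint.sh redirects the server output to `/tmp/vllm.log`; a sketch of quick checks from a shell in the running Space (the grep pattern is a rough filter, not a known error string):

```bash
# Sketch: inspect the vLLM log written by entrypoint.sh.
tail -n 100 /tmp/vllm.log                      # most recent server output
grep -iE "error|neuron|compil" /tmp/vllm.log   # rough filter for Neuron/compile issues
```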
## Switching Back to Ollama

To revert to the Ollama setup:

```bash
git checkout ollama-setup
git push origin main --force
```
entrypoint.sh
CHANGED
```diff
@@ -14,46 +14,50 @@ if [ "$INCLUDE_DB" = "true" ] ; then
 nohup mongod &
 fi;
 
-# Start
-echo "Starting
+# Start vLLM service with OpenAI-compatible API for HF space
+echo "Starting vLLM service with OpenAI-compatible API"
 
-# Ensure dir for
+# Ensure dir for model cache
 mkdir -p /data/models
 
-
-
-
-#
-
-
-
-
-
+# Default models for vLLM (can be overridden via VLLM_MODELS env var)
+VLLM_MODELS=${VLLM_MODELS:-"meta-llama/Llama-3.1-8B-Instruct"}
+
+# Start vLLM OpenAI-compatible server
+# Using --served-model-name to make models accessible via simpler names
+nohup python3 -m vllm.entrypoints.openai.api_server \
+    --model "$VLLM_MODELS" \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --device neuron \
+    --tensor-parallel-size 2 \
+    > /tmp/vllm.log 2>&1 &
+VLLM_PID=$!
+
+# Override OPENAI_BASE_URL to use local vLLM at runtime
+export OPENAI_BASE_URL=http://localhost:8000/v1
+echo "OPENAI_BASE_URL set to $OPENAI_BASE_URL for local vLLM"
+
+# Wait for vLLM to be ready
+MAX_RETRIES=60
 RETRY_COUNT=0
-
+echo "Waiting for vLLM to be ready (this may take a few minutes for model loading)..."
+until curl -s http://localhost:8000/health > /dev/null 2>&1; do
   RETRY_COUNT=$((RETRY_COUNT + 1))
   if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
-    echo "
-
+    echo "vLLM failed to start after $MAX_RETRIES attempts"
+    echo "=== vLLM logs ==="
+    cat /tmp/vllm.log
     exit 1
   fi
-  sleep
-
-
-# Pull models
-OLLAMA_MODELS=${OLLAMA_MODELS:-llama3.1:8b}
-IFS=',' read -ra MODEL_ARRAY <<< "$OLLAMA_MODELS"
-for MODEL in "${MODEL_ARRAY[@]}"; do
-  MODEL=$(echo "$MODEL" | xargs)
-  if ! env OLLAMA_MODELS=/data/models ollama list | grep -q "$MODEL"; then
-    echo " Pulling model: $MODEL (this may take several minutes)..."
-    env OLLAMA_MODELS=/data/models ollama pull "$MODEL"
-    echo " $MODEL pulled successfully!"
-  else
-    echo " $MODEL already exists"
+  sleep 5
+  if [ $((RETRY_COUNT % 6)) -eq 0 ]; then
+    echo "Still waiting for vLLM... (${RETRY_COUNT}/${MAX_RETRIES})"
   fi
 done
 
+echo "vLLM is ready!"
+
 export PUBLIC_VERSION=$(node -p "require('./package.json').version")
 
 dotenv -e /app/.env -c -- node /app/build/index.js -- --host 0.0.0.0 --port 3000
```
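Once the readiness loop above succeeds, the model id the server expects in requests can be confirmed before the Chat UI starts sending traffic; a small check that is not part of the committed script:

```bash
# Sketch: after /health responds, list the models vLLM is serving; the id
# returned (the --model value unless --served-model-name is passed) is what
# requests through OPENAI_BASE_URL must use.
curl -s http://localhost:8000/v1/models
tail -n 20 /tmp/vllm.log   # recent startup output written by entrypoint.sh
```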