Avijit Ghosh committed
Commit 5496ece · Parent: 4075f84

Switch from Ollama to vLLM with AWS Neuron support for INF2

- Replace Ollama with vLLM in Dockerfile
- Update entrypoint.sh to start vLLM with Neuron device support
- Add tensor parallelism for better performance on INF2
- Update OpenAI API endpoint to vLLM server (port 8000)
- Add VLLM_SETUP.md with configuration guide
- Default model: Llama-3.1-8B-Instruct

Files changed (3)
  1. Dockerfile +5 -5
  2. VLLM_SETUP.md +62 -0
  3. entrypoint.sh +33 -29
Dockerfile CHANGED
@@ -21,13 +21,13 @@ RUN touch /app/.env.local
 
 USER root
 RUN apt-get update
-RUN apt-get install -y libgomp1 libcurl4 curl
+RUN apt-get install -y libgomp1 libcurl4 curl python3 python3-pip python3-venv
 
-# Install Ollama
-RUN curl -fsSL https://ollama.ai/install.sh | sh
+# Install vLLM with AWS Neuron support for INF2
+RUN pip3 install --no-cache-dir vllm awscli
 
-# ensure ollama cache dir exists before adjusting ownership
-RUN mkdir -p /home/user/.ollama && chown -R 1000:1000 /home/user/.ollama
+# ensure vllm cache dir exists before adjusting ownership
+RUN mkdir -p /home/user/.cache && chown -R 1000:1000 /home/user/.cache
 
 # ensure npm cache dir exists before adjusting ownership
 RUN mkdir -p /home/user/.npm && chown -R 1000:1000 /home/user/.npm
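Note: `pip3 install vllm` alone may not pull in the Neuron toolchain that the `--device neuron` flag in entrypoint.sh relies on. As a hedged sketch (not part of this commit), the Neuron compiler and framework packages are typically installed from AWS's public pip repository; exact package pins, and whether the vLLM release in use needs a source build for its Neuron backend, should be verified against the versions being deployed:

```bash
# Assumed addition, not in this commit: install the AWS Neuron compiler and
# framework packages from the public Neuron pip repository. Version pins
# depend on the Neuron SDK release and the vLLM version being used.
pip3 install --no-cache-dir \
    --extra-index-url https://pip.repos.neuron.amazonaws.com \
    neuronx-cc torch-neuronx transformers-neuronx
```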
VLLM_SETUP.md ADDED
@@ -0,0 +1,62 @@
+# vLLM Setup for AWS INF2
+
+This branch uses vLLM with AWS Neuron support for running models on Amazon INF2 instances.
+
+## Configuration
+
+### Environment Variables
+
+Set these in your HuggingFace Space secrets:
+
+```bash
+# Specify which models to load (comma-separated for multiple models)
+VLLM_MODELS=meta-llama/Llama-3.1-8B-Instruct
+
+# Or use smaller models for faster loading:
+# VLLM_MODELS=microsoft/Phi-3-mini-4k-instruct
+```
+
+### Supported Models for INF2
+
+vLLM with Neuron supports:
+- Llama 2 (7B, 13B)
+- Llama 3 / 3.1 (8B)
+- Mistral (7B)
+- Mixtral (8x7B, with careful configuration)
+
+### Model Loading Time
+
+The first startup will take 5-15 minutes as vLLM:
+1. Downloads the model from HuggingFace
+2. Compiles it for Inferentia2 chips
+3. Caches the compiled version
+
+Subsequent starts are faster (2-3 minutes) because the compiled model is cached.
+
+## Hardware Requirements
+
+- **Minimum**: inf2.xlarge (1 Inferentia2 chip, 2 NeuronCores)
+- **Recommended**: inf2.8xlarge (1 Inferentia2 chip, 32 vCPUs) for better performance
+- **For larger models**: inf2.24xlarge or inf2.48xlarge
+
+## Architecture
+
+- vLLM runs on port 8000 with an OpenAI-compatible API
+- Chat UI connects via `http://localhost:8000/v1`
+- Tensor parallelism splits the model across Neuron cores
+
+## Troubleshooting
+
+If vLLM fails to start:
+1. Check the Space logs for compilation errors
+2. Ensure the model is compatible with Neuron
+3. Try a smaller model first
+4. Increase the timeout in entrypoint.sh if needed
+
+## Switching Back to Ollama
+
+To revert to the Ollama setup:
+```bash
+git checkout ollama-setup
+git push origin main --force
+```
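Once the server is up, the OpenAI-compatible endpoints described in the guide can be smoke-tested directly. A minimal sketch, assuming the default model name from the configuration section (`/v1/models` and `/v1/chat/completions` are the standard routes exposed by vLLM's OpenAI-compatible server):

```bash
# List the model(s) the vLLM server is currently serving
curl -s http://localhost:8000/v1/models

# Send one chat completion request; the model name assumes the default
# VLLM_MODELS value from the guide above
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```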
entrypoint.sh CHANGED
@@ -14,46 +14,50 @@ if [ "$INCLUDE_DB" = "true" ] ; then
     nohup mongod &
 fi;
 
-# Start Ollama service for HF space (local gpu)
-echo "Starting local Ollama service"
+# Start vLLM service with OpenAI-compatible API for HF space
+echo "Starting vLLM service with OpenAI-compatible API"
 
-# Ensure dir for persistent model storage
+# Ensure dir for model cache
 mkdir -p /data/models
 
-nohup env OLLAMA_MODELS=/data/models ollama serve > /tmp/ollama.log 2>&1 &
-OLLAMA_PID=$!
-
-# Override OPENAI_BASE_URL to use local Ollama at runtime
-export OPENAI_BASE_URL=http://localhost:11434/v1
-echo "OPENAI_BASE_URL set to $OPENAI_BASE_URL for local Ollama"
-
-# Wait for Ollama to be ready
-MAX_RETRIES=30
+# Default models for vLLM (can be overridden via VLLM_MODELS env var)
+VLLM_MODELS=${VLLM_MODELS:-"meta-llama/Llama-3.1-8B-Instruct"}
+
+# Start vLLM OpenAI-compatible server
+# (--served-model-name could be added here to expose the model under a simpler alias)
+nohup python3 -m vllm.entrypoints.openai.api_server \
+    --model "$VLLM_MODELS" \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --device neuron \
+    --tensor-parallel-size 2 \
+    > /tmp/vllm.log 2>&1 &
+VLLM_PID=$!
+
+# Override OPENAI_BASE_URL to use local vLLM at runtime
+export OPENAI_BASE_URL=http://localhost:8000/v1
+echo "OPENAI_BASE_URL set to $OPENAI_BASE_URL for local vLLM"
+
+# Wait for vLLM to be ready
+MAX_RETRIES=60
 RETRY_COUNT=0
-until curl -s http://localhost:11434/api/tags > /dev/null 2>&1; do
+echo "Waiting for vLLM to be ready (this may take a few minutes for model loading)..."
+until curl -s http://localhost:8000/health > /dev/null 2>&1; do
     RETRY_COUNT=$((RETRY_COUNT + 1))
     if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
-        echo "Ollama failed to start after $MAX_RETRIES attempts"
-        cat /tmp/ollama.log
+        echo "vLLM failed to start after $MAX_RETRIES attempts"
+        echo "=== vLLM logs ==="
+        cat /tmp/vllm.log
         exit 1
     fi
-    sleep 2
-done
-
-# Pull models
-OLLAMA_MODELS=${OLLAMA_MODELS:-llama3.1:8b}
-IFS=',' read -ra MODEL_ARRAY <<< "$OLLAMA_MODELS"
-for MODEL in "${MODEL_ARRAY[@]}"; do
-    MODEL=$(echo "$MODEL" | xargs)
-    if ! env OLLAMA_MODELS=/data/models ollama list | grep -q "$MODEL"; then
-        echo " Pulling model: $MODEL (this may take several minutes)..."
-        env OLLAMA_MODELS=/data/models ollama pull "$MODEL"
-        echo " $MODEL pulled successfully!"
-    else
-        echo " $MODEL already exists"
+    sleep 5
+    if [ $((RETRY_COUNT % 6)) -eq 0 ]; then
+        echo "Still waiting for vLLM... (${RETRY_COUNT}/${MAX_RETRIES})"
     fi
 done
 
+echo "vLLM is ready!"
+
 export PUBLIC_VERSION=$(node -p "require('./package.json').version")
 
 dotenv -e /app/.env -c -- node /app/build/index.js -- --host 0.0.0.0 --port 3000
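The updated entrypoint polls `http://localhost:8000/health` before launching Chat UI. When a startup stalls, the same checks can be run by hand from a shell in the Space; a minimal sketch using only paths and endpoints referenced in the script above:

```bash
# Follow the vLLM server log while the model downloads and compiles
tail -f /tmp/vllm.log

# The health endpoint returns HTTP 200 once the engine is ready
curl -i http://localhost:8000/health

# Confirm Chat UI is pointed at the local vLLM server
echo "$OPENAI_BASE_URL"   # expected: http://localhost:8000/v1
```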