Avijit Ghosh committed · Commit 5496ece · 1 Parent(s): 4075f84
Switch from Ollama to vLLM with AWS Neuron support for INF2
- Replace Ollama with vLLM in Dockerfile
- Update entrypoint.sh to start vLLM with Neuron device support
- Add tensor parallelism for better performance on INF2
- Update OpenAI API endpoint to vLLM server (port 8000)
- Add VLLM_SETUP.md with configuration guide
- Default model: Llama-3.1-8B-Instruct
- Dockerfile +5 -5
- VLLM_SETUP.md +62 -0
- entrypoint.sh +33 -29
Dockerfile
CHANGED
```diff
@@ -21,13 +21,13 @@ RUN touch /app/.env.local
 
 USER root
 RUN apt-get update
-RUN apt-get install -y libgomp1 libcurl4 curl
+RUN apt-get install -y libgomp1 libcurl4 curl python3 python3-pip python3-venv
 
-# Install
-RUN
+# Install vLLM with AWS Neuron support for INF2
+RUN pip3 install --no-cache-dir vllm awscli
 
-# ensure
-RUN mkdir -p /home/user/.
+# ensure vllm cache dir exists before adjusting ownership
+RUN mkdir -p /home/user/.cache && chown -R 1000:1000 /home/user/.cache
 
 # ensure npm cache dir exists before adjusting ownership
 RUN mkdir -p /home/user/.npm && chown -R 1000:1000 /home/user/.npm
```
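A quick sanity check of the new layer (a sketch, not part of the commit), run inside the built image, confirms that the vLLM and AWS CLI installs resolve:

```bash
# Sketch: verify the packages installed by the updated Dockerfile layer.
python3 -c "import vllm; print('vllm', vllm.__version__)"   # vLLM wheel imports
aws --version                                               # awscli entry point
```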
VLLM_SETUP.md
ADDED
@@ -0,0 +1,62 @@
# vLLM Setup for AWS INF2

This branch uses vLLM with AWS Neuron support for running models on Amazon INF2 instances.

## Configuration

### Environment Variables

Set these in your HuggingFace Space secrets:

```bash
# Specify which models to load (comma-separated for multiple models)
VLLM_MODELS=meta-llama/Llama-3.1-8B-Instruct

# Or use smaller models for faster loading:
# VLLM_MODELS=microsoft/Phi-3-mini-4k-instruct
```

### Supported Models for INF2

vLLM with Neuron supports:

- Llama 2 (7B, 13B)
- Llama 3 / 3.1 (8B)
- Mistral (7B)
- Mixtral (8x7B with careful configuration)

### Model Loading Time

First startup will take 5-15 minutes as vLLM:

1. Downloads the model from HuggingFace
2. Compiles it for Inferentia2 chips
3. Caches the compiled version

Subsequent starts will be faster (2-3 minutes) as the compiled model is cached.
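Caching only helps across restarts if it lands on persistent storage. A minimal sketch of pointing the Hugging Face download cache at the `/data/models` directory that entrypoint.sh already creates (exporting `HF_HOME` for this is an assumption of the sketch, not something the commit configures):

```bash
# Sketch: keep Hugging Face downloads under the persistent /data volume so
# step 1 (downloading the model) is skipped on subsequent starts.
# Assumption: export HF_HOME before launching vLLM in entrypoint.sh.
export HF_HOME=/data/models
mkdir -p "$HF_HOME"
```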
## Hardware Requirements

- **Minimum**: inf2.xlarge (1 Inferentia2 chip, 2 NeuronCores)
- **Recommended**: inf2.8xlarge (1 Inferentia2 chip, 32 vCPUs) for better performance
- **For larger models**: inf2.24xlarge or inf2.48xlarge (see the tensor-parallelism sketch after this list)
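entrypoint.sh launches vLLM with `--tensor-parallel-size 2`, which matches the two NeuronCores of a single Inferentia2 chip. On the multi-chip instances the size could be raised to use the extra cores; a sketch under that assumption (the `TP_SIZE` variable is hypothetical and is not read by the committed script):

```bash
# Sketch: match --tensor-parallel-size to the NeuronCores available
# (2 per Inferentia2 chip: inf2.xlarge/8xlarge = 2, inf2.24xlarge = 12,
# inf2.48xlarge = 24). TP_SIZE is a hypothetical override for illustration.
TP_SIZE=${TP_SIZE:-2}

python3 -m vllm.entrypoints.openai.api_server \
    --model "${VLLM_MODELS:-meta-llama/Llama-3.1-8B-Instruct}" \
    --host 0.0.0.0 \
    --port 8000 \
    --device neuron \
    --tensor-parallel-size "$TP_SIZE"
```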
## Architecture

- vLLM runs on port 8000 with an OpenAI-compatible API (request sketch below)
- Chat UI connects via `http://localhost:8000/v1`
- Tensor parallelism splits the model across Neuron cores
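A minimal request sketch against that endpoint, assuming the default model from this commit is the one being served (the prompt is illustrative only):

```bash
# Sketch: call the OpenAI-compatible chat completions endpoint exposed by vLLM.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```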
## Troubleshooting

If vLLM fails to start:

1. Check Space logs for compilation errors (see the log checks sketched below)
2. Ensure the model is compatible with Neuron
3. Try a smaller model first
4. Increase the timeout in entrypoint.sh if needed
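For step 1, entrypoint.sh redirects the server output to `/tmp/vllm.log`; a sketch of quick checks from a shell in the running Space (the grep pattern is a rough filter, not a known error string):

```bash
# Sketch: inspect the vLLM log written by entrypoint.sh.
tail -n 100 /tmp/vllm.log                      # most recent server output
grep -iE "error|neuron|compil" /tmp/vllm.log   # rough filter for Neuron/compile issues
```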
## Switching Back to Ollama

To revert to the Ollama setup:

```bash
git checkout ollama-setup
git push origin main --force
```
entrypoint.sh
CHANGED
```diff
@@ -14,46 +14,50 @@ if [ "$INCLUDE_DB" = "true" ] ; then
 nohup mongod &
 fi;
 
-# Start
-echo "Starting
+# Start vLLM service with OpenAI-compatible API for HF space
+echo "Starting vLLM service with OpenAI-compatible API"
 
-# Ensure dir for
+# Ensure dir for model cache
 mkdir -p /data/models
 
-
-
-
-#
-
-
-
-
-
+# Default models for vLLM (can be overridden via VLLM_MODELS env var)
+VLLM_MODELS=${VLLM_MODELS:-"meta-llama/Llama-3.1-8B-Instruct"}
+
+# Start vLLM OpenAI-compatible server
+# Using --served-model-name to make models accessible via simpler names
+nohup python3 -m vllm.entrypoints.openai.api_server \
+    --model "$VLLM_MODELS" \
+    --host 0.0.0.0 \
+    --port 8000 \
+    --device neuron \
+    --tensor-parallel-size 2 \
+    > /tmp/vllm.log 2>&1 &
+VLLM_PID=$!
+
+# Override OPENAI_BASE_URL to use local vLLM at runtime
+export OPENAI_BASE_URL=http://localhost:8000/v1
+echo "OPENAI_BASE_URL set to $OPENAI_BASE_URL for local vLLM"
+
+# Wait for vLLM to be ready
+MAX_RETRIES=60
 RETRY_COUNT=0
-
+echo "Waiting for vLLM to be ready (this may take a few minutes for model loading)..."
+until curl -s http://localhost:8000/health > /dev/null 2>&1; do
   RETRY_COUNT=$((RETRY_COUNT + 1))
   if [ $RETRY_COUNT -ge $MAX_RETRIES ]; then
-    echo "
-
+    echo "vLLM failed to start after $MAX_RETRIES attempts"
+    echo "=== vLLM logs ==="
+    cat /tmp/vllm.log
     exit 1
   fi
-  sleep
-
-
-# Pull models
-OLLAMA_MODELS=${OLLAMA_MODELS:-llama3.1:8b}
-IFS=',' read -ra MODEL_ARRAY <<< "$OLLAMA_MODELS"
-for MODEL in "${MODEL_ARRAY[@]}"; do
-  MODEL=$(echo "$MODEL" | xargs)
-  if ! env OLLAMA_MODELS=/data/models ollama list | grep -q "$MODEL"; then
-    echo " Pulling model: $MODEL (this may take several minutes)..."
-    env OLLAMA_MODELS=/data/models ollama pull "$MODEL"
-    echo " $MODEL pulled successfully!"
-  else
-    echo " $MODEL already exists"
+  sleep 5
+  if [ $((RETRY_COUNT % 6)) -eq 0 ]; then
+    echo "Still waiting for vLLM... (${RETRY_COUNT}/${MAX_RETRIES})"
   fi
 done
 
+echo "vLLM is ready!"
+
 export PUBLIC_VERSION=$(node -p "require('./package.json').version")
 
 dotenv -e /app/.env -c -- node /app/build/index.js -- --host 0.0.0.0 --port 3000
```
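Once the readiness loop above succeeds, the model id the server expects in requests can be confirmed before the Chat UI starts sending traffic; a small check that is not part of the committed script:

```bash
# Sketch: after /health responds, list the models vLLM is serving; the id
# returned (the --model value unless --served-model-name is passed) is what
# requests through OPENAI_BASE_URL must use.
curl -s http://localhost:8000/v1/models
tail -n 20 /tmp/vllm.log   # recent startup output written by entrypoint.sh
```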