Linux
This page describes how to manually install and run h2oGPT on Linux. The following instructions are for Ubuntu x86_64; they can be adapted to other Linux distributions by substituting apt-get with the appropriate package manager, as sketched below.
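For example, on a Fedora- or RHEL-family system, the build prerequisites used later in this guide would roughly correspond to the following (a sketch only; exact package names, especially for the Python 3.10 development headers, vary by distribution and release):

```bash
# Rough Fedora/RHEL equivalent of the apt-get steps below (package names may differ)
sudo dnf install -y gcc gcc-c++ make python3-devel python3-virtualenv git
```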
Install
Set up a Python 3.10 environment. We recommend using Miniconda.
Download Miniconda for Linux. After downloading, run:
```bash
bash ./Miniconda3-py310_23.1.0-1-Linux-x86_64.sh  # follow license agreement and add to bash if required
```
Open a new shell and look for `(base)` in the prompt to confirm that Miniconda is properly installed, then create a new env:
```bash
conda create -n h2ogpt -y
conda activate h2ogpt
conda install python=3.10 -c conda-forge -y
```
You should see `(h2ogpt)` in the shell prompt.

Alternatively, on newer Ubuntu systems, you can set up a Python 3.10 environment by doing the following:
```bash
sudo apt-get update
sudo apt-get install -y build-essential gcc python3.10-dev
virtualenv -p python3 h2ogpt
source h2ogpt/bin/activate
```
Check your Python version with the following command:
```bash
python --version
```
The output should say 3.10.xx, and:
```bash
python -c "import os, sys ; print('hello world')"
```
should print `hello world`.

Then clone:
```bash
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
```
On some systems, `pip` still refers to the system installation; in that case, use `python -m pip` or `pip3` instead of `pip`, or try `python3` instead of `python` (a quick check is sketched below).
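To confirm that the environment's interpreter and pip are the ones in use (a minimal sanity check, not part of the original steps):

```bash
# Both should point inside the h2ogpt conda env or virtualenv, not /usr
which python
python -m pip --version
```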
For GPU: install the CUDA Toolkit so that nvcc can compile packages such as llama-cpp-python, AutoGPTQ, exllama, flash attention, and deepspeed (used by TTS), by going to the CUDA Toolkit page, e.g. the CUDA 11.8 Toolkit. To avoid removing the original CUDA toolkit/driver you already have, use the `runfile (local)` installer from NVIDIA's website, choose not to install the driver or overwrite the `/usr/local/cuda` link, install only the toolkit, and rely upon the `CUDA_HOME` env to point to the desired CUDA version. Then do:
```bash
export CUDA_HOME=/usr/local/cuda-11.8
```
Or, if you do not plan to use packages like deepspeed in coqui's TTS or build other packages (i.e. you only use binaries), you can just use the non-dev version from conda if preferred:
```bash
conda install cudatoolkit=11.8 -c conda-forge -y
export CUDA_HOME=$CONDA_PREFIX
```
Do not install `cudatoolkit-dev`, as it only goes up to CUDA 11.7, which is no longer supported.

Place the `CUDA_HOME` export into your `~/.bashrc` or set it before starting h2oGPT so that TTS's use of deepspeed works.
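A quick way to confirm that `CUDA_HOME` is set and the toolkit is visible (a sketch; the nvcc check only applies to the full runfile toolkit install, since the conda non-dev package does not ship the compiler):

```bash
echo $CUDA_HOME                 # should point at the desired CUDA install
$CUDA_HOME/bin/nvcc --version   # runfile/toolkit install only; reports the compiler version
```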
Prepare to install dependencies:
```bash
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu118"
```
Choose cu118+ for A100/H100+. Or, for CPU, set:
```bash
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu"
```
Run [`bash docs/linux_install.sh`](linux_install.sh) for a full normal document Q/A installation. To allow all packages (GPL too), run:
```bash
GPLOK=1 bash docs/linux_install.sh
```
You can pick and choose optional components by commenting them out in the shell script, or edit the script if any issues arise; see the script for notes about installation. The GPU path is recapped below.
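Putting the GPU path together, the steps above amount roughly to the following (a recap of commands already shown, assuming the CUDA 11.8 runfile toolkit):

```bash
export CUDA_HOME=/usr/local/cuda-11.8
export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu118"
bash docs/linux_install.sh    # prefix with GPLOK=1 to allow GPL packages
```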
Run
See the FAQ for many ways to run models. The following are some other examples.
Note that models and other artifacts are stored under `/home/$USER/.cache/` in directories for chroma, huggingface, selenium, torch, weaviate, etc.
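If the home partition is small, the Hugging Face portion of that cache can typically be redirected with an environment variable (a sketch; `HF_HOME` is a standard Hugging Face setting rather than an h2oGPT option, and the target path is just an example):

```bash
# Redirect the Hugging Face cache (models, tokenizers) to a larger disk
export HF_HOME=/data/hf-cache
```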
Check that Torch can see CUDA:
```python
import torch
print(torch.cuda.is_available())
```
should print `True`.
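The same check can be run from the shell, optionally also printing the detected device (a small convenience one-liner, not from the original docs):

```bash
python -c "import torch; ok = torch.cuda.is_available(); print(ok, torch.cuda.get_device_name(0) if ok else 'CPU only')"
```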
Place all documents in `user_path` or upload in UI (Help with UI).

UI using GPU with at least 24GB with streaming:
```bash
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-13b-chat --load_8bit=True --score_model=None --langchain_mode='UserData' --user_path=user_path
```
Same with a smaller model without quantization:
```bash
python generate.py --base_model=h2oai/h2ogpt-4096-llama2-7b-chat --score_model=None --langchain_mode='UserData' --user_path=user_path
```
UI using LLaMa.cpp LLaMa2 model:
```bash
python generate.py --base_model='llama' --prompt_type=llama2 --score_model=None --langchain_mode='UserData' --user_path=user_path --model_path_llama=https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF/resolve/main/llama-2-7b-chat.Q6_K.gguf?download=true --max_seq_len=4096
```
which works on CPU or GPU (assuming the llama-cpp-python package was compiled against CUDA or Metal).
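If llama-cpp-python was installed CPU-only, it can usually be rebuilt with GPU support roughly as follows (a sketch; the exact CMake flag depends on the llama-cpp-python version, with older releases using `-DLLAMA_CUBLAS=on` instead):

```bash
# Rebuild llama-cpp-python against CUDA (flag name varies by version)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```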
If using OpenAI for the LLM is acceptable, but you want documents to be parsed and embedded locally, then do:
```bash
OPENAI_API_KEY=<key> python generate.py --inference_server=openai_chat --base_model=gpt-3.5-turbo --score_model=None
```
where `<key>` should be replaced by your OpenAI key, which probably starts with `sk-`. OpenAI is not recommended for private document question-answer, but it can be a good reference for testing purposes or when privacy is not required.
Perhaps you want better image caption performance and to focus the local GPU on that; then do:
```bash
OPENAI_API_KEY=<key> python generate.py --inference_server=openai_chat --base_model=gpt-3.5-turbo --score_model=None --captions_model=Salesforce/blip2-flan-t5-xl
```
For Azure OpenAI:
```bash
OPENAI_API_KEY=<key> python generate.py --inference_server="openai_azure_chat:<deployment_name>:<base_url>:<api_version>" --base_model=gpt-3.5-turbo --h2ocolors=False --langchain_mode=UserData
```
where the entry `<deployment_name>` is required for Azure; the others are optional and can be filled with the string `None` or left empty between `:`. Azure OpenAI is a bit safer for private access to Azure-based docs.

Add `--share=True` to make the gradio server visible via a sharable URL.

If you see an error about protobuf, try:
```bash
pip install protobuf==3.20.0
```
See CPU and GPU for some other general aspects about using h2oGPT on CPU or GPU, such as which models to try.
Google Colab
A Google Colab version of a 3B GPU model is at:
A local copy of that GPU Google Colab is h2oGPT_GPU.ipynb.
A Google Colab version of a 7B LLaMa CPU model is at:
A local copy of that CPU Google Colab is h2oGPT_CPU.ipynb.