4 Ways to Deploy LLMs Locally
llama.cpp
https://github.com/ggerganov/llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# build with cuBLAS (CUDA) support
make LLAMA_CUBLAS=1
# or
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
# start inference on a gguf model, offloading 28 layers to the GPU
./main -m /path/to/xxx.gguf -ngl 28
llama-cpp-python (Python bindings with an OpenAI-compatible server)
https://github.com/abetlen/llama-cpp-python
python3 -m venv py3venv
source py3venv/bin/activate
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server \
--host 0.0.0.0 \
--port 1234 \
--model /path/to/xxx.gguf \
--n_gpu_layers 28
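The server exposes an OpenAI-compatible API. A minimal sketch of querying it from Python, assuming the --host/--port values above and that requests is installed (pip install requests):
# query the OpenAI-compatible chat endpoint served by llama_cpp.server
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])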
Ollama
https://github.com/ollama/ollama
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
# start the Ollama server
ollama serve
# Create a file named Modelfile with content "FROM /path/to/xxx.gguf"
ollama create xxx -f /path/to/Modelfile
ollama list
ollama run xxx
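Besides the CLI, ollama serve listens on port 11434 and exposes a local HTTP API. A minimal non-streaming sketch with Python's requests, assuming the xxx model created above:
# send a single non-streaming generation request to the local Ollama API
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "xxx", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])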
A Modelfile is the blueprint for creating and sharing models with Ollama.
Instruction | Description |
---|---|
FROM | Defines the base model to use. |
PARAMETER | Sets the parameters for how Ollama will run the model. |
TEMPLATE | The full prompt template to be sent to the model. |
SYSTEM | Specifies the system message that will be set in the template. |
ADAPTER | Defines the (Q)LoRA adapters to apply to the model. |
LICENSE | Specifies the legal license. |
MESSAGE | Specifies the message history. |
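A minimal Modelfile sketch combining a few of these instructions (the parameter values and system prompt are illustrative):
FROM /path/to/xxx.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a concise, helpful assistant."""
It is built and registered with the same ollama create command shown above.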
LM Studio
https://lmstudio.ai
A desktop GUI for discovering, downloading, and running LLMs locally.
vLLM
https://github.com/vllm-project/vllm
python3 -m venv py3venv
source py3venv/bin/activate
pip install vllm
# start an OpenAI-compatible API server (defaults to port 8000)
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
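A minimal sketch of querying it with requests (facebook/opt-125m is a base model, so the completions endpoint is used rather than the chat endpoint):
# send a completion request to vLLM's OpenAI-compatible server
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 32},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])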
Supported Models:
https://docs.vllm.ai/en/latest/models/supported_models.html