4 Ways to Deploy LLMs Locally

llama.cpp

https://github.com/ggerganov/llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1
# or
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release

# start inference on a GGUF model
./main -m /path/to/xxx.gguf -n 28
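
Because the build above enables cuBLAS, model layers can be offloaded to the GPU at inference time. A minimal sketch with placeholder path and prompt: -ngl sets how many layers to offload, -n caps the number of generated tokens, and -p supplies the prompt.

# offload 28 layers to the GPU and generate up to 128 tokens
./main -m /path/to/xxx.gguf -ngl 28 -n 128 -p "Explain what a GGUF file is."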

https://github.com/abetlen/llama-cpp-python

python3 -m venv py3venv
source py3venv/bin/activate
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install 'llama-cpp-python[server]'
python3 -m llama_cpp.server \
  --host 0.0.0.0 \
  --port 1234 \
  --model /path/to/xxx.gguf \
  --n_gpu_layers 28
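
Once the server is running, it exposes an OpenAI-compatible HTTP API on the chosen host and port. A quick smoke test with curl (the payload is illustrative; the server answers with whatever GGUF model it was started with):

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 64
  }'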

Ollama

https://github.com/ollama/ollama

sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
ollama serve
# Create a file named Modelfile with content "FROM /path/to/xxx.gguf"
ollama create xxx -f /path/to/Modelfile
ollama list
ollama run xxx
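
Besides the interactive CLI, Ollama serves a REST API, by default on localhost:11434. A minimal sketch that queries the model created above (model name and prompt are placeholders):

curl http://localhost:11434/api/generate -d '{
  "model": "xxx",
  "prompt": "Why is the sky blue?",
  "stream": false
}'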

A Modelfile is the blueprint for creating and sharing models with Ollama.

Instruction  Description
FROM         Defines the base model to use.
PARAMETER    Sets the parameters for how Ollama will run the model.
TEMPLATE     The full prompt template to be sent to the model.
SYSTEM       Specifies the system message that will be set in the template.
ADAPTER      Defines the (Q)LoRA adapters to apply to the model.
LICENSE      Specifies the legal license.
MESSAGE      Specify message history.
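
Putting a few of these instructions together, a minimal example Modelfile might look like this (the path, parameter values, and system prompt are placeholders):

FROM /path/to/xxx.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a concise, helpful assistant."""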

LM Studio

https://lmstudio.ai

LM Studio is a desktop GUI for discovering, downloading, and running LLMs locally.
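
Unlike the other options here, model search, download, and chat all happen through the app's interface, so there are no install commands to run. LM Studio can also start a local server that mimics the OpenAI API; assuming it has been started from the app on its usual default port 1234 (treat the port and endpoint as assumptions to verify in the app), a request looks like this:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 32
  }'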

vLLM

https://github.com/vllm-project/vllm

python3 -m venv py3venv
source py3venv/bin/activate
pip install vllm
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
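
The API server mimics the OpenAI API and listens on port 8000 by default. A quick test against the model loaded at startup (the model name in the request must match the --model argument):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 32
  }'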

Supported Models:
https://docs.vllm.ai/en/latest/models/supported_models.html