This article is driven by two events:
- Meta, the largest model supplier of this AI season (heavily criticized in the social and VR fields, yet revered as a living Bodhisattva in the AI world), recently released Llama 2. It's said to compete head-to-head with OpenAI's GPT series and to be easy to fine-tune.
- About a month ago, llama.cpp added support for CLBlast.
So, my AMD Radeon card can now join the fun without much hassle. Below, I'll share how to run llama.cpp + Llama 2 on Ubuntu 22.04 Jammy Jellyfish.
Download the Model
Thanks to TheBloke, who kindly provided the converted Llama 2 models for download:
- TheBloke/Llama-2-70B-GGML
- TheBloke/Llama-2-70B-Chat-GGML
- TheBloke/Llama-2-13B-GGML
- TheBloke/Llama-2-13B-chat-GGML
- TheBloke/Llama-2-7B-GGML
- TheBloke/Llama-2-7B-Chat-GGML
Choose the version that fits your memory capacity; the 70B versions, for example, need roughly 31GB to 70GB of memory depending on the quantization. I downloaded llama-2-13b-chat.ggmlv3.q4_K_M.bin. The "q4" in the name indicates a 4-bit quantized version. Once downloaded, save the model as a .bin file, e.g., ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin.
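For reference, the download can also be done from the command line. This is just a sketch; the URL follows Hugging Face's usual resolve/main pattern, so double-check it against TheBloke's repo page before relying on it:
wget -O ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
  https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_K_M.bin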
Compile llama.cpp
On Ubuntu, install the necessary tools and libraries:
sudo apt install git make cmake vim
Clone the llama.cpp code:
git clone https://github.com/ggerganov/llama.cpp
According to the official website, simply running make should suffice, but I encountered some issues, so I switched to cmake:
mkdir build
cd build
cmake ..
cmake --build . --config Release
The built binaries will be located in llama.cpp/build/bin, with main as the command-line program and server as the web server.
Copy and rename it:
cp ./bin/main ../llama-cpu
cd ..
Test It
Inside llama.cpp/examples there are several example scripts. Copy one and modify it for our own use:
cp examples/chat-13B.sh examples/chat-llama2-13B.sh
vim examples/chat-llama2-13B.sh
Change the MODEL path in examples/chat-llama2-13B.sh to point to your own model, like so:
MODEL="/home/lyric/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin"
Also replace ./main with the renamed binary, ./llama-cpu.
Then run:
./examples/chat-llama2-13B.sh
Depending on your machine's configuration, it may take a while before you can start chatting. Once the prompt appears, the green text is what you type and the white text is Llama 2's response.
Enable GPU Acceleration
Download the driver from AMD: https://repo.radeon.com/amdgpu-install/
Note that the 5.* versions are the newer ones; the 2*.* releases are actually older and not usable. I installed version 5.5: https://repo.radeon.com/amdgpu-install/5.5/ubuntu/jammy/
After installation, run:
amdgpu-install --usecase=opencl,rocm
On Ubuntu, install the necessary libraries:
sudo apt install ocl-icd-dev ocl-icd-opencl-dev \
opencl-headers libclblast-dev
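Before recompiling, it can be worth checking that OpenCL actually sees the card. One quick way (clinfo is an extra diagnostic tool, not something llama.cpp itself needs) is:
sudo apt install clinfo
clinfo | grep -i 'device name'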
Recompile with the -DLLAMA_CLBLAST=ON option:
cd build
cmake .. -DLLAMA_CLBLAST=ON -DCLBlast_dir=/usr/local
cmake --build . --config Release
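To confirm the rebuilt binary is actually linked against CLBlast (assuming the library installed by libclblast-dev is named libclblast, as it is on Ubuntu), you can check its shared-library dependencies from inside build:
ldd bin/main | grep -i clblast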
Copy and rename it:
cp ./bin/main ../llama-cl
cd ..
Then modify the launch script:
vim examples/chat-llama2-13B.sh
Replace ./llama-cpu with the new binary name, ./llama-cl, and add --n-gpu-layers 40 on the line just before the final "$@". For example, mine looks like this:
./llama-cl $GEN_OPTIONS \
--model "$MODEL" \
--threads "$N_THREAD" \
--n_predict "$N_PREDICTS" \
--color --interactive \
--file ${PROMPT_FILE} \
--reverse-prompt "${USER_NAME}:" \
--in-prefix ' ' \
--n-gpu-layers 40 \
"$@"
The --n-gpu-layers option offloads model layers to VRAM to accelerate token generation. I set it to 40 for my card, but you can also pass a very large number, like 100000, and llama.cpp will simply offload as many layers as it can.
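If you want to sanity-check the GPU offload without editing the script, you can also run the binary directly; the prompt text and token count below are just placeholders:
./llama-cl -m ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
  -p "Hello, introduce yourself in one sentence." \
  -n 64 --n-gpu-layers 40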
Then run:
./examples/chat-llama2-13B.sh
Theoretically, using GPU acceleration should significantly reduce waiting time, and you should see output indicating GPU acceleration, like:
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
...
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 710.19 MB (+ 1600.00 MB per state)
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloaded 40/41 layers to GPU
llama_model_load_internal: total VRAM used: 7285 MB
...
This indicates GPU acceleration is in use. On my computer, the generation speed reached over 600 tokens per second, which feels incredibly fast.
Run a Service
llama.cpp also provides a server, which you can learn about in the official documentation.
For me, I simply ran:
./server -m ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
-c 2048 -ngl 40 --port 10081
Then, opening http://localhost:10081 allowed me to use the Web UI.
This Web server supports API requests, for example:
curl --request POST \
--url http://localhost:10081/completion \
--header "Content-Type: application/json" \
--data \
'{"prompt": "Building a website can be done in 10 steps:","n_predict": 128}'
This makes it very convenient to experiment with the model.
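If you only need the generated text rather than the full JSON response, you can pipe it through jq. This assumes the server returns the text in a content field, so verify against your own output first:
curl -s --request POST \
  --url http://localhost:10081/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 steps:", "n_predict": 128}' \
  | jq -r '.content'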
Troubleshooting
OpenCL Permission Issues
There may be cases where OpenCL is only accessible with root permissions. For instance, running clinfo might fail to find any OpenCL platform, while sudo clinfo works just fine.
In such cases, execute the following (replacing LOGIN_NAME with your username):
sudo usermod -a -G video LOGIN_NAME
sudo usermod -a -G render LOGIN_NAME
This grants the current user the necessary permissions.
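Note that group changes only take effect for new login sessions, so log out and back in (or reboot) afterward. To verify, something like the following (again with LOGIN_NAME replaced by your username) should list video and render, and clinfo should then find the GPU without sudo:
groups LOGIN_NAME
clinfo | grep -i 'device name'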
About Llama2's Quirks
OpenAI's ChatGPT has undergone extensive prompt engineering and optimization, but a Llama 2 model run on its own lacks these refinements. If Llama 2 seems unintelligent, put more effort into your prompts; otherwise it may not meet your expectations.
For example, if you want Llama2 to output JSON, your prompt should include several examples of generating JSON, like this intent recognition example:
You read the following text and recognize the user's intent.
Possible intents are:
1. "eating"
2. "sleeping"
3. "fighting"
9999. "unknown intent"
You must return the intent with the highest confidence.
You must return the result in JSON format.
Here is the template:
{ "id": id, "intent": "USER'S INTENT", "confidence": 0.9 }
**instructions: I'm hungry**
{ "id": 1, "intent": "eating", "confidence": 0.9 }
**instructions: I'm tired and want to sleep**
{ "id": 2, "intent": "sleeping", "confidence": 0.9 }
**instructions: Where's the bean? I want to hit it**
{ "id": 3, "intent": "fighting", "confidence": 0.7 }
**instructions: What time is it?**
{ "id": 9999, "intent": "unknown intent", "confidence": 0.9 }
I configured this prompt in the Web UI and ran a few test inputs; the replies came back in the expected JSON format. Not bad~