This article is driven by two events:
- Meta, the largest model supplier of this AI season (heavily criticized in the social and VR fields, yet revered as a living Bodhisattva in the AI world), recently released Llama 2. It's said to compete head-to-head with OpenAI's GPT series and to be easy to fine-tune.
- About a month ago, llama.cpp added support for CLBlast.
So, my AMD Radeon card can now join the fun without much hassle. Below, I'll share how to run llama.cpp + Llama 2 on Ubuntu 22.04 Jammy Jellyfish.
Download the Model
Thanks to TheBloke, who kindly provided the converted Llama 2 models for download:
- TheBloke/Llama-2-70B-GGML
- TheBloke/Llama-2-70B-Chat-GGML
- TheBloke/Llama-2-13B-GGML
- TheBloke/Llama-2-13B-chat-GGML
- TheBloke/Llama-2-7B-GGML
- TheBloke/Llama-2-7B-Chat-GGML
Choose the version that fits your memory capacity; the 70B versions, for example, need roughly 31GB to 70GB of memory depending on the quantization. I downloaded llama-2-13b-chat.ggmlv3.q4_K_M.bin. The "q4" in the name indicates a 4-bit quantized version. Once downloaded, save the model as a .bin file, e.g., ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin.
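For reference, the download can also be done from the command line. This is just a sketch; the URL follows Hugging Face's usual resolve/main pattern, so double-check it against TheBloke's repo page before relying on it:
wget -O ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
  https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_K_M.bin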
Compile llama.cpp
On Ubuntu, install the necessary tools and libraries:
sudo apt install git make cmake vim
Clone the llama.cpp code:
git clone https://github.com/ggerganov/llama.cpp
According to the official website, simply running make should suffice, but I encountered some issues, so I switched to cmake:
mkdir build
cd build
cmake ..
cmake --build . --config Release
The built binaries will be located in llama.cpp/build/bin, with main as the command-line program and server as the web server.
Copy and rename it:
cp ./bin/main ../llama-cpu
cd ..
Test It
Inside llama.cpp/examples there are several example scripts. Copy one and modify it for our own use:
cp examples/chat-13B.sh examples/chat-llama2-13B.sh
vim examples/chat-llama2-13B.sh
Change the MODEL path in examples/chat-llama2-13B.sh to point to your own model, like so:
MODEL="/home/lyric/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin"
Also replace ./main with the renamed binary, ./llama-cpu.
Then run:
./examples/chat-llama2-13B.sh
Depending on your machine's configuration, it may take a while before you can start chatting. Once the prompt appears, the green text is what you type and the white text is Llama 2's response.
Enable GPU Acceleration
Download the driver from AMD: https://repo.radeon.com/amdgpu-install/
Note that the 5.* versions are the newer ones; the 2*.* releases are actually older and not usable. I installed version 5.5: https://repo.radeon.com/amdgpu-install/5.5/ubuntu/jammy/
After installation, run:
amdgpu-install --usecase=opencl,rocm
On Ubuntu, install the necessary libraries:
sudo apt install ocl-icd-dev ocl-icd-opencl-dev \
opencl-headers libclblast-dev
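Before recompiling, it can be worth checking that OpenCL actually sees the card. One quick way (clinfo is an extra diagnostic tool, not something llama.cpp itself needs) is:
sudo apt install clinfo
clinfo | grep -i 'device name'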
Recompile with the -DLLAMA_CLBLAST=ON option:
cd build
cmake .. -DLLAMA_CLBLAST=ON -DCLBlast_dir=/usr/local
cmake --build . --config Release
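To confirm the rebuilt binary is actually linked against CLBlast (assuming the library installed by libclblast-dev is named libclblast, as it is on Ubuntu), you can check its shared-library dependencies from inside build:
ldd bin/main | grep -i clblast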
Copy and rename it:
cp ./bin/main ../llama-cl
cd ..
Then modify the launch script:
vim examples/chat-llama2-13B.sh
Replace ./llama-cpu with the new binary name, ./llama-cl, and add --n-gpu-layers 40 on the line just before the final "$@". For example, mine looks like this:
./llama-cl $GEN_OPTIONS \
--model "$MODEL" \
--threads "$N_THREAD" \
--n_predict "$N_PREDICTS" \
--color --interactive \
--file ${PROMPT_FILE} \
--reverse-prompt "${USER_NAME}:" \
--in-prefix ' ' \
--n-gpu-layers 40 \
"$@"
The --n-gpu-layers option offloads model layers to VRAM to accelerate token generation. I set it to 40 for my card, but you can also pass a very large number, like 100000, and llama.cpp will simply offload as many layers as it can.
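If you want to sanity-check the GPU offload without editing the script, you can also run the binary directly; the prompt text and token count below are just placeholders:
./llama-cl -m ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
  -p "Hello, introduce yourself in one sentence." \
  -n 64 --n-gpu-layers 40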
Then run:
./examples/chat-llama2-13B.sh
Theoretically, using GPU acceleration should significantly reduce waiting time, and you should see output indicating GPU acceleration, like:
ggml_opencl: selecting platform: 'AMD Accelerated Parallel Processing'
...
llama_model_load_internal: using OpenCL for GPU acceleration
llama_model_load_internal: mem required = 710.19 MB (+ 1600.00 MB per state)
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloaded 40/41 layers to GPU
llama_model_load_internal: total VRAM used: 7285 MB
...
This indicates GPU acceleration is in use. On my computer, the generation speed reached over 600 tokens per second, which feels incredibly fast.
Run a Service
llama.cpp also provides a server, which you can learn about in the official documentation.
For me, I simply ran:
./server -m ~/Downloads/llama-2-13b-chat.ggmlv3.q4_K_M.bin \
-c 2048 -ngl 40 --port 10081
Then, opening http://localhost:10081 allowed me to use the Web UI.
This Web server supports API requests, for example:
curl --request POST \
--url http://localhost:10081/completion \
--header "Content-Type: application/json" \
--data \
'{"prompt": "Building a website can be done in 10 steps:","n_predict": 128}'
This makes it very convenient to experiment with the model.
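If you only need the generated text rather than the full JSON response, you can pipe it through jq. This assumes the server returns the text in a content field, so verify against your own output first:
curl -s --request POST \
  --url http://localhost:10081/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 steps:", "n_predict": 128}' \
  | jq -r '.content'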
Troubleshooting
OpenCL Permission Issues
There may be cases where OpenCL is only accessible with root permissions. For instance, running clinfo might fail to find any OpenCL platform, while sudo clinfo works just fine.
In such cases, execute the following (replacing LOGIN_NAME with your username):
sudo usermod -a -G video LOGIN_NAME
sudo usermod -a -G render LOGIN_NAME
This grants the current user the necessary permissions.
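Note that group changes only take effect for new login sessions, so log out and back in (or reboot) afterward. To verify, something like the following (again with LOGIN_NAME replaced by your username) should list video and render, and clinfo should then find the GPU without sudo:
groups LOGIN_NAME
clinfo | grep -i 'device name'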
About Llama2's Quirks
OpenAI's ChatGPT has undergone extensive prompt engineering and optimization, but a Llama 2 model run on its own lacks these refinements. If Llama 2 seems unintelligent, put more effort into your prompts; otherwise it may not meet your expectations.
For example, if you want Llama2 to output JSON, your prompt should include several examples of generating JSON, like this intent recognition example:
You read the following text and recognize the user's intent.
Possible intents are:
1. "eating"
2. "sleeping"
3. "fighting"
9999. "unknown intent"
You must return the intent with the highest confidence.
You must return the result in JSON format.
Here is the template:
{ "id": id, "intent": "USER'S INTENT", "confidence": 0.9 }
**instructions: I'm hungry**
{ "id": 1, "intent": "eating", "confidence": 0.9 }
**instructions: I'm tired and want to sleep**
{ "id": 2, "intent": "sleeping", "confidence": 0.9 }
**instructions: Where's the bean? I want to hit it**
{ "id": 3, "intent": "fighting", "confidence": 0.7 }
**instructions: What time is it?**
{ "id": 9999, "intent": "unknown intent", "confidence": 0.9 }
I configured this prompt in the Web UI and ran a few test inputs; the replies came back in the expected JSON format. Not bad~