Running popular AI software on AMD GPUs

Here is how to get popular AI software running on a recent RDNA or RDNA2 AMD card. This method uses distrobox so it should be the same for every Linux distribution.

There is always a chance these instructions will be outdated in a few months.

Requirements

Hardware

RDNA and RDNA2 should work. Polaris and Vega are also known to work but some parameters might be different.

Software

The main advantage AMD has over Nvidia on Linux is the mainline open source driver working out of the box. Installing amdgpu-pro defeats the purpose, so you might as well buy Nvidia if you are going to use binary blobs.

I recommend a recent kernel so you can use the drivers available in the mainline kernel and avoid the proprietary amdgpu-pro driver. The one included in Ubuntu 22.04 LTS (around 5.15) should suffice, but chances are you are on 6.x.

The first step is to install distrobox. On NixOS I achieved that by adding the distrobox package to my user's list of packages. If distrobox is not available in your package manager, follow the instructions on the project page to get it installed.

On Fedora the package toolbox works in a similar fashion. In fact both programs run on top of Podman rootless containers.

This article was written on a system running the current version of NixOS (23.05). It shouldn't matter what you are running as long as you have a proper kernel and can run Podman rootless containers.

Access rights

Your user must have access to the /dev/kfd and /dev/dri/* devices. Normally these belong to the root user and the render group. Access is set up by default on most desktop distributions, but if you are not in the render group, add yourself to it and restart your session.

Find out which group owns /dev/kfd:

$ ls -l /dev/kfd
crw-rw-rw- 1 root render 242, 0 Sep 13 15:35 /dev/kfd

Find out if I belong to the render group:

$ groups
users wheel video render

In case I'm not, assuming my username is iru, I can add myself with:

$ sudo gpasswd -a iru render

Then just exit desktop or ssh session and log in again. A reboot is not needed.
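
The access check above can also be scripted. This is only a minimal sketch: check_device_access is a hypothetical helper name, and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can read and write a device node.
check_device_access() {
    node="$1"
    if [ ! -e "$node" ]; then
        echo "$node: missing"
        return 2
    fi
    if [ -r "$node" ] && [ -w "$node" ]; then
        echo "$node: ok"
        return 0
    fi
    # stat -c is the GNU coreutils form; %G prints the owning group name.
    echo "$node: no access, owning group is $(stat -c '%G' "$node")"
    return 1
}

# Check the compute device and every DRM node; problems are reported, not fatal.
for dev in /dev/kfd /dev/dri/*; do
    check_device_access "$dev" || true
done
```

If a node is reported as "no access", the group printed next to it is the one to add yourself to.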

Preparing environment and installing libraries

Initializing distrobox

First I recommend creating a separate home for your distrobox. The reason behind it being some compiler environment variables you might not want to mix with your main system. Here I'm going to use ~/Applications.

$ mkdir -p ~/Applications

You will be asked to confirm the commands below.

$ distrobox create -n rocm --home /home/iru/Applications
Creating 'rocm' using image registry.fedoraproject.org/fedora-toolbox:36	 [ OK ]
Distrobox 'rocm' successfully created.
To enter, run:

distrobox enter rocm

rocm

Now we can enter the distrobox container:

$ distrobox enter rocm
Container rocm is not running.
Starting container rocm
run this command to follow along:

 podman logs -f rocm

 Starting container...                  	 [ OK ]
 Installing basic packages...           	 [ OK ]
 Setting up read-only mounts...         	 [ OK ]
 Setting up read-write mounts...        	 [ OK ]
 Setting up read-only mounts...         	 [ OK ]
 Setting up read-write mounts...        	 [ OK ]
 Setting up host's sockets integration...	 [ OK ]
 Integrating host's themes, icons, fonts...	 [ OK ]
 Setting up package manager exceptions...	 [ OK ]
 Setting up rpm exceptions...           	 [ OK ]
 Setting up sudo...                     	 [ OK ]
 Setting up groups...                   	 [ OK ]
 Setting up users...                    	 [ OK ]
 Integrating host's themes, icons, fonts...	 [ OK ]
 Setting up package manager exceptions...	 [ OK ]
 Setting up rpm exceptions...           	 [ OK ]
 Setting up sudo...                     	 [ OK ]
 Setting up groups...                   	 [ OK ]
 Setting up users...                    	 [ OK ]
 Executing init hooks...                	 [ OK ]

Container Setup Complete!

You are now inside a rootless (user level, non administrative) container integrated with your home folder and system devices. You can access your files and folders but with no visibility to the world outside your home folder. The operating system is now Fedora 36 regardless of the underlying distribution.

You can sudo and install programs into this toolbox using dnf.
The root superuser inside this environment is not the same as your system root user.
Programs installed through the system do not affect your underlying system.
Deleting the distrobox will have no effect on your system.
Your home is still mapped therefore if you delete your personal files those will be gone for good.
Programs installed locally through your user folder (i.e. Go, Rust stuff) might work as normal.
Since we opted to set ~/Applications as the home folder the dot files will be separate from your main system.
Distrobox will try to keep some semblance of your main system, like setting the same shell program.

Regardless of how you've arrived here, you should have a Fedora 36 environment with access to /dev/kfd and /dev/dri/* on your system. It doesn't matter if you are running distrobox, toolbox, LXC or even a physical system.

It is a good idea to update the system before we begin installing dependencies.

$ sudo dnf update -y

Installing ROCm

Now let's install ROCm and other packages by following the instructions on the AMD documentation site for ROCm under "Red Hat Enterprise Linux" then "RHEL 9.1".

First we are going to add the driver repo, just in case some dependency needs to be pulled from it, but we are not installing the driver.

I'm sticking to 5.4 as it is the version I've encountered the fewest problems with so far.

$ ver=5.4
$ sudo tee /etc/yum.repos.d/amdgpu.repo <<EOF
[amdgpu]
name=amdgpu
baseurl=https://repo.radeon.com/amdgpu/$ver/rhel/9.1/main/x86_64/
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
$ sudo yum clean all

Now we install the runtime repos.

$ for ver in 5.3.3 5.4.6 5.5.3 5.6.1 5.7; do sudo tee --append /etc/yum.repos.d/rocm.repo <<EOF
[ROCm-$ver]
name=ROCm$ver
baseurl=https://repo.radeon.com/rocm/rhel9/$ver/main
enabled=1
priority=50
gpgcheck=1
gpgkey=https://repo.radeon.com/rocm/rocm.gpg.key
EOF
done
$ sudo yum clean all

Finally we install ROCm. Mesa is also required further ahead for Stable Diffusion (I don't know why) and cmake for llama.cpp.

$ sudo dnf install rocm-hip-sdk5.4.6 mesa-libGL-devel cmake
...
Total download size: 2.1 G
Installed size: 15 G
Is this ok [y/N]:

You will be asked to accept the AMD signing keys for the packages. You might also see something like this:

Failed:
 amdgpu-core-1:5.7.50700-1652687.el9.noarch

That is the proprietary driver which we don't need. There is no concept of kernel inside a container so we can ignore that.

Add the ROCm runtime libraries to the default library path:

$ sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
$ sudo ldconfig
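
One thing to keep in mind is that tee --append adds the same lines again every time it is rerun. A minimal sketch of an idempotent alternative follows; append_line_once is a hypothetical helper name, and the demonstration writes to a scratch file rather than to /etc.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if that exact line is absent.
append_line_once() {
    file="$1"
    line="$2"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Demonstration on a scratch file.
conf="$(mktemp)"
append_line_once "$conf" /opt/rocm/lib
append_line_once "$conf" /opt/rocm/lib64
append_line_once "$conf" /opt/rocm/lib    # repeated call changes nothing
cat "$conf"
```

To use it on /etc/ld.so.conf.d/rocm.conf the function has to run with root privileges, followed by ldconfig as shown above.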

Optionally you might also want to install radeontop, a useful TUI resource monitor.

$ sudo dnf install radeontop

The rocm-smi monitor:

$ rocm-smi
========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
ERROR: GPU[1]	: sclk clock is unsupported
====================================================================================
GPU[1]		: get_power_cap, Not supported on the given system
GPU  Temp (DieEdge)  AvgPwr  SCLK    MCLK     Fan     Perf  PwrCap       VRAM%  GPU%  
0    50.0c           22.0W   500Mhz  96Mhz    40.78%  auto  130.0W         0%   3%    
1    57.0c           31.0W   None    1600Mhz  0%      auto  Unsupported   77%   0%    
====================================================================================
=============================== End of ROCm SMI Log ================================

The rocminfo command:

$ rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          NO

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 5900HX with Radeon Graphics
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 5900HX with Radeon Graphics
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3300                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65245764(0x3e39244) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65245764(0x3e39244) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65245764(0x3e39244) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1031                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6800M                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      3072(0xc00) KB                     
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29663(0x73df)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2500                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            40                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 109                                
  SDMA engine uCode::      80                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    12566528(0xbfc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    12566528(0xbfc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1031         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx90c                             
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      1024(0x400) KB                     
  Chip ID:                 5688(0x1638)                       
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2100                               
  BDFID:                   2048                               
  Internal Node ID:        2                                  
  Compute Unit:            8                                  
  SIMDs per CU:            4                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 464                                
  SDMA engine uCode::      40                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx90c:xnack-   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

Lots of devices here. I'm running one of those laptops with hybrid GPUs, so rocminfo prints details about the CPU (Agent 1), the discrete 6800M GPU (Agent 2) and the integrated graphics unit on the Ryzen processor (Agent 3).

In my case this is the relevant part:

*******                  
Agent 2                  
*******                  
  Name:                    gfx1031                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6800M

From my personal experience APUs don't work.

Next we need to add some environment variables to the shell configuration file.

export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/llvm/bin/clang++
export PATH=/opt/rocm/bin:/opt/rocm/opencl/bin:$PATH

Another variable that needs to be set is HSA_OVERRIDE_GFX_VERSION. AMD doesn't actually support ROCm on consumer grade hardware, so you need this variable to force it to use a certain target. Remember rocminfo? It gave me "gfx1031" for my 6800M, so I'm going to set this variable to "10.3.0", which applies to gfx1030, gfx1031, gfx1032 and so on. For non-RDNA cards such as Vega (GCN5) the value would be "9.0.0". RDNA3 uses "11.0.0".

export CC=/opt/rocm/llvm/bin/clang
export CXX=/opt/rocm/llvm/bin/clang++
export PATH=/opt/rocm/bin:/opt/rocm/opencl/bin:$PATH
export HSA_OVERRIDE_GFX_VERSION="10.3.0"
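
The mapping described above can be written down as a small lookup. This is a sketch only: gfx_to_override is a hypothetical helper name, it covers just the three families mentioned in the text, and the list of Vega target names is my assumption.

```shell
#!/bin/sh
# Hypothetical helper: map a gfx target reported by rocminfo to a value for
# HSA_OVERRIDE_GFX_VERSION. Only the families named in the text are covered.
gfx_to_override() {
    case "$1" in
        gfx103*)              echo "10.3.0" ;;  # RDNA2: gfx1030, gfx1031, gfx1032, ...
        gfx900|gfx902|gfx90c) echo "9.0.0"  ;;  # Vega (GCN5); target list is an assumption
        gfx110*)              echo "11.0.0" ;;  # RDNA3
        *) echo "unknown target: $1" >&2; return 1 ;;
    esac
}

HSA_OVERRIDE_GFX_VERSION="$(gfx_to_override gfx1031)"
export HSA_OVERRIDE_GFX_VERSION
echo "$HSA_OVERRIDE_GFX_VERSION"    # prints 10.3.0
```

Anything outside those patterns returns an error instead of guessing a value.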

Finally, source the shell configuration file (or exit the session and log in again) and verify that the environment variables are properly set.

$ echo $CC
/opt/rocm/llvm/bin/clang
$ echo $CXX
/opt/rocm/llvm/bin/clang++
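
Instead of echoing each variable by hand, the check can be done in one go. A minimal sketch in plain POSIX sh; missing_vars is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given variables that are unset or
# empty, and return non-zero if there are any.
missing_vars() {
    missing=""
    for name in "$@"; do
        # eval performs the indirect variable lookup in plain POSIX sh
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    # Print the names without the leading space
    echo "${missing# }"
    [ -z "$missing" ]
}

missing_vars CC CXX HSA_OVERRIDE_GFX_VERSION || echo "the variables listed above are not set"
```

An empty line and no complaint means all three variables are present in the current shell.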

Installing software

Following are instructions to install Stable Diffusion webui, ooba's text generation ui and Langchain.

Stable Diffusion WebUI

The most popular way to interact with Stable Diffusion is through AUTOMATIC1111's Gradio webui. Stable diffusion depends on Torch.

We start by cloning the Automatic1111 project into our home folder.

$ cd
$ git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
$ cd stable-diffusion-webui

This project changes a lot, so for your sanity I recommend switching to the latest release (1.6.0 at the time of this writing).

$ git checkout v1.6.0

On Nvidia or pure CPU you just need to run webui.sh to have all the software and dependencies installed. AMD requires that you define an alternative source for the libraries, besides setting certain environment variables.

The file we are looking for is webui-user.sh where custom configuration can be added.

The first one is TORCH_COMMAND, which defines how Torch should be installed. To find the right value, go to pytorch.org and scroll down until you see an "INSTALL PYTORCH" heading with a selection box.

Choose Stable (at the time of this article is 2.0.1)
Choose Linux
Choose Pip
Choose ROCm (at the time of this article is 5.4.2)

You will get something like this:

Run this Command:  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2

The above command should also be enough if you just want to play with PyTorch.

Torch+ROCm doesn't necessarily need to match the ROCm version you have installed. Matching versions should in principle give the best compatibility, but that does not always seem to be the case: I've tried the nightly Torch+ROCm 5.6 together with the matching ROCm runtime version and it couldn't detect CUDA support.

So add this to webui-user.sh and make sure it's the only occurrence of TORCH_COMMAND. You probably don't need torchaudio.

export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.4.2"

An optional setting is telling Torch how to manage memory. Apparently using the CUDA runtime's native memory management (in our case ROCm passing as CUDA) is more efficient.

export PYTORCH_CUDA_ALLOC_CONF="backend:cudaMallocAsync"

Now the Command-line args to the program. In my experience the --precision full --no-half --opt-sub-quad-attention arguments will offer the best compatibility for AMD cards.

export COMMANDLINE_ARGS="--precision full --no-half --opt-sub-quad-attention --deepdanbooru"

If your card supports fp16 you can replace --precision full --no-half with --upcast-sampling for better performance and more efficient memory management. This is the case for at least RDNA2.

Here is the list of possible arguments: github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Command-Line-Arguments-and-Settings. A dedicated AMD section exists at github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs and might have up to date instructions.

So my webui-user.sh would look like this:

export HSA_OVERRIDE_GFX_VERSION="10.3.0"
export PYTORCH_CUDA_ALLOC_CONF="backend:cudaMallocAsync"
export COMMANDLINE_ARGS="--precision full --no-half --opt-sub-quad-attention --deepdanbooru"
export TORCH_COMMAND="pip install torch torchvision --extra-index-url https://download.pytorch.org/whl/rocm5.4.2"

A general tip is to copy that file somewhere else or add it to .gitignore. This file is tracked so if you run something like git reset --hard you might lose it.
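
The copy can be automated so it is easy to repeat before every update. A minimal sketch; backup_config and the backup directory are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy a config file into a backup directory with a
# timestamp suffix and print the path of the copy.
backup_config() {
    src="$1"
    dest_dir="$2"
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src").$(date +%Y%m%d-%H%M%S)"
    cp -p -- "$src" "$dest" && echo "$dest"
}

# Back up webui-user.sh if it exists in the current directory.
if [ -f webui-user.sh ]; then
    backup_config webui-user.sh "$HOME/webui-backups"
fi
```

Run it from the stable-diffusion-webui folder before pulling changes or resetting the working tree.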

Now run ./webui.sh.

$ ./webui.sh
################################################################
Install script for stable-diffusion + Web UI
Tested on Debian 11 (Bullseye)
################################################################

################################################################
Running on iru user
################################################################

################################################################
Repo already cloned, using it as install directory
################################################################

################################################################
Create and activate python venv
################################################################

################################################################
Launching launch.py...
################################################################
Cannot locate TCMalloc (improves CPU memory usage)
Python 3.10.11 (main, Apr  5 2023, 00:00:00) [GCC 12.2.1 20221121 (Red Hat 12.2.1-4)]
Version: v1.6.0
Commit hash: 5ef669de080814067961f28357256e8fe27544f4
Installing torch and torchvision
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/rocm5.4.2
Collecting torch
  Downloading https://download.pytorch.org/whl/rocm5.4.2/torch-2.0.1%2Brocm5.4.2-cp310-cp310-linux_x86_64.whl (1536.4 MB)

Some installation mumbo jumbo, yellow font pip warnings and a bunch of large files downloaded later you'll see something like this:

Applying attention optimization: sub-quadratic... done.
Model loaded in 8.2s (calculate hash: 2.6s, load weights from disk: 0.1s, create model: 2.9s, apply weights to model: 2.1s, calculate empty prompt: 0.4s).

If you see a message recommending that you skip CUDA then something went wrong: you are either missing some library or don't have access to the /dev/kfd and /dev/dri/* devices.

If everything is correct you should open your browser pointing it to localhost:7860 and start generating images. It needs to compile some stuff so the first image you try to generate might take a while to start. Subsequent gens will be faster.

If you encounter memory-related errors while starting or generating images, either adjust your parameters or add --medvram or --lowvram to COMMANDLINE_ARGS in webui-user.sh, at the cost of performance of course.

OOM errors typically look like this:

    torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 5.06 GiB (GPU 0; 11.98 GiB total capacity; 7.99 GiB already allocated; 3.39 GiB free; 8.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_HIP_ALLOC_CONF

It must be said that not even Nvidia users can generate high res images out of the box. The general strategy is to generate small images (like 512x512 or 768x768) and then use Hires Fix to upscale them.

SDXL

SDXL works like any model but bigger. If you get OOM errors while trying to generate images with SDXL add --medvram-sdxl to COMMANDLINE_ARGS. Again at a performance cost.

Llama.cpp

Starting from commit 6bbc598 llama.cpp supports ROCm. Previous versions supported AMD through OpenCL. This project also changes and breaks very fast so stick to releases.

Make sure you've followed the step where we set the environment variables, especially the ones for the compilers.

$ cd
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ git checkout b1266

Use cmake. GNU Make compiles a binary that segfaults while loading the models.

$ mkdir build
$ cd build
$ CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake .. -DLLAMA_HIPBLAS=ON
$ cmake --build .
$ cp main ..

The variables are not really needed on the command line since we already defined them in the profile, but I'm writing them anyway to stress that they are required.

If everything went alright you should see this message at the end.

====  Run ./main -h for help.  ====

That's it. Look for scripts under examples to find out how to use the program. You need to pass the -ngl N argument to load a number of layers into the video RAM. A convenient feature of llama.cpp is that you can offload some of the model data into your normal generic RAM, again at the cost of speed. So if you have plenty of vram+ram and are not in a hurry you can run a 70b model.

Current llama.cpp uses a format named GGUF. You can download open source models from Hugging Face. I'm going to use CodeLlama-13B-Instruct-GGUF.

Check the quant method table under the model card. It tells how much memory that quantization uses. As a general rule I use Q4_K_M or Q5_K_M. Anything below Q4_K_M is often weird.

If you are sure the whole model fits in your VRAM (check the quant method table) just give some ridiculous value to -ngl like 1000. If that's not the case you need to make sure you pass enough layers to not OOM. I recommend to leave a bit of space on your VRAM for some swapping otherwise it will be painfully slow. So if you have 12GB VRAM make sure to pass enough layers to stay around 10GB and offload the rest to RAM.
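
A rough starting value for -ngl can be estimated with integer arithmetic. This is a sketch under a loud assumption: it treats every layer as the same size, which is only approximately true, so take the result as a first guess and adjust it. estimate_ngl is a hypothetical helper name and the numbers in the example are purely illustrative.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many layers fit into a VRAM budget.
# Assumes all layers have the same size, which is only roughly true.
estimate_ngl() {
    model_mb="$1"     # size of the quantized model in MB
    n_layers="$2"     # total number of layers in the model
    budget_mb="$3"    # VRAM you are willing to spend on layers, in MB
    fit=$(( budget_mb * n_layers / model_mb ))
    # Never report more layers than the model has
    if [ "$fit" -gt "$n_layers" ]; then
        fit="$n_layers"
    fi
    echo "$fit"
}

# Illustrative numbers only: a 40000 MB model with 80 layers and a 10000 MB budget
estimate_ngl 40000 80 10000    # prints 20
```

If the model is smaller than the budget the function simply returns the total number of layers, which has the same effect as passing a very large value to -ngl.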

$ curl -L -o ./models/codellama-13b-instruct.Q4_K_M.gguf https://huggingface.co/TheBloke/CodeLlama-13B-Instruct-GGUF/resolve/main/codellama-13b-instruct.Q4_K_M.gguf
$ ./examples/chat-13B.sh -ngl 1000 -m ./models/codellama-13b-instruct.Q4_K_M.gguf

A possible hiccup: if you have more than one AMD GPU device (the APU counts too), the program might get stuck right after listing them, like this:

Log start
main: warning: changing RoPE frequency base to 0 (default 10000.0)
main: warning: scaling RoPE frequency by 0 (default 1.0)
main: build = 1269 (51a7cf5)
main: built with cc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4) for x86_64-redhat-linux
main: seed  = 1695477515
ggml_init_cublas: found 2 ROCm devices:
  Device 0: AMD Radeon RX 6800M, compute capability 10.3
  Device 1: AMD Radeon Graphics, compute capability 10.3

So we need to force it to use the proper device by passing HIP_VISIBLE_DEVICES=N to the script:

$ export HIP_VISIBLE_DEVICES=0
$ ./examples/chat-13B.sh -ngl 1000 -m ./models/codellama-13b-instruct.Q4_K_M.gguf

If that fixes it, add HIP_VISIBLE_DEVICES=0 to your shell profile, replacing 0 with the index of the device you want.
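To find the right N without reading the log by eye, the device listing can be parsed. The sketch below assumes the log format shown above, which may change between llama.cpp versions. The pick_discrete helper is a heuristic of my own: it assumes the integrated GPU reports the generic name "AMD Radeon Graphics", as it does in the log above.

```python
import re

# Matches lines such as "  Device 0: AMD Radeon RX 6800M, compute capability 10.3"
DEVICE_RE = re.compile(r"^\s*Device (\d+): (.+?), compute capability", re.MULTILINE)


def parse_rocm_devices(log_text):
    """Return [(index, name), ...] from llama.cpp's ROCm device listing."""
    return [(int(idx), name) for idx, name in DEVICE_RE.findall(log_text)]


def pick_discrete(devices, igpu_name="AMD Radeon Graphics"):
    """Heuristic: first device whose name is not the generic iGPU name."""
    for idx, name in devices:
        if name != igpu_name:
            return idx
    return None
```

Feed it the captured startup output and use the returned index as the value of HIP_VISIBLE_DEVICES.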

If everything went fine you'll be presented with an interactive chat with a bot on your terminal. Pay attention to the messages before the prompt and look for this:

llm_load_tensors: VRAM used: 9014 MB
....................................................................................................
llama_new_context_with_model: kv self size  = 1600.00 MB
llama_new_context_with_model: compute buffer total size =  381.47 MB
llama_new_context_with_model: VRAM scratch buffer: 380.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

What matters here is how much VRAM is used and whether BLAS = 1. If BLAS = 0 you are probably running on the slower CPU (AVX) path. If you are splitting a model larger than your VRAM between VRAM and RAM, adjust the number of layers so the offloaded part fits in your VRAM, again with some slack for swapping.
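These two checks are easy to script if you save the startup output to a file. A minimal sketch, assuming the message format shown above (it is specific to this llama.cpp build and may differ in later releases). The function name is made up for illustration.

```python
import re


def check_startup_log(log_text):
    """Extract VRAM usage (MB) and the BLAS flag from llama.cpp startup output.

    Either value is None when the corresponding message is not found.
    """
    vram = re.search(r"VRAM used:\s*(\d+)\s*MB", log_text)
    blas = re.search(r"\bBLAS\s*=\s*(\d)", log_text)
    return {
        "vram_used_mb": int(vram.group(1)) if vram else None,
        "blas": (blas.group(1) == "1") if blas else None,
    }
```

A result with "blas" set to False, or "vram_used_mb" far below what you expected, is the cue to recheck the build flags or the -ngl value.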

If everything is correct you'll be able to make it output bullshit really fast.

 Text transcript of a never ending dialog, where USER interacts with an AI assistant named ChatLLaMa.
ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer USER's requests immediately and with details and precision.
There are no annotations like (30 seconds passed...) or (to himself), just what USER and ChatLLaMa say aloud to each other.
The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long.
The transcript only includes text, it does not include markup like HTML and Markdown.

USER: Hello, ChatLLaMa!
ChatLLaMa: Hello USER! How may I help you today?
USER: What year is it?
ChatLLaMa: We are in 2023.
USER: Please tell me the largest city in Europe.
ChatLLaMa: The largest city in Europe is Moscow, the capital of Russia.
USER: What can you tell me about Moscow?
ChatLLaMa: Moscow, on the Moskva River in western Russia, is the nation's cosmopolitan capital. In its historic core is the Kremlin, a complex that's home to the president and tsarist treasures in the Armoury. Outside its walls is Red Square, Russia’s symbolic center.
USER: What is a cat?
ChatLLaMa: A cat is a domestic species of small carnivorous mammal. It is the only domesticated species in the family Felidae.
USER: How do I pass command line arguments to a Node.js program?
ChatLLaMa: The arguments are stored in process.argv.

    argv[0] is the path to the Node. js executable.
    argv[1] is the path to the script file.
    argv[2] is the first argument passed to the script.
    argv[3] is the second argument passed to the script and so on.
USER: Name a color.
ChatLLaMa: Blue.
USER: What time is it?
ChatLLaMa: It is 17:46.
USER: How to write hello world in ADA.
ChatLLaMa: To write "Hello World" in Ada, you need to create an empty project called "hello_world", then open your text editor (like Visual Studio Code or Sublime Text) and add the following code to it:
with Ada.Text_IO; use Ada.Text_IO; with Ada.Integer_Text_IO; use Ada.Integer_Text_IO; procedure HelloWorld is begin Put("Hello World"); NewLine; end HelloWorld; begin HelloWorld; end;

Then save the file as "hello_world.adb". Finally run the program by opening the terminal window and typing the command "gnatmake -P hello_world.gpr" and press enter key. Then wait for few seconds until the program finishes running and shows the output result on the screen.

Oobabooga's text-generation-webui and llama-cpp-python

This is a Gradio UI for LLMs. By default it listens on the same port as the Stable Diffusion web UI.

$ cd
$ git clone https://github.com/oobabooga/text-generation-webui.git
$ cd text-generation-webui
$ python3 -m venv ooba
$ source ooba/bin/activate 
# for fish run "source ooba/bin/activate.fish" instead
$ pip install -r requirements.txt

We need to replace the llama-cpp-python lib with one that has ROCm support.

$ pip uninstall llama-cpp-python
$ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache-dir llama-cpp-python

Place your model under the models folder and launch the program. As always make sure the environment variables are passed to the program.

$ HIP_VISIBLE_DEVICES=0 HSA_OVERRIDE_GFX_VERSION="10.3.0" python server.py

You might not need to pass the variables if they are already in your profile, but I'm writing them out to stress that they are necessary here.

Point your browser to localhost:7860 and load the desired model from the web interface. This one can OOM if you load too many layers, so you might need to actually adjust the number of GPU layers instead of passing a ridiculous value.
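Since both web UIs want port 7860, it can help to check whether something is already listening there before launching. A small sketch using only the standard library. The helper name port_in_use is made up, and it only tests the loopback address.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("7860 busy" if port_in_use(7860) else "7860 free")
```

If the port is busy, stop the other UI first or start one of them on a different port.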

You need to activate the venv every time before launching it. Sharing one venv between all the programs, or not using a venv at all, is not a good idea. These projects often pin specific versions and you might end up with a dependency mess otherwise.

Langchain

This follows a similar path to ooba since it also uses llama-cpp-python.

$ python3 -m venv langchain
$ source langchain/bin/activate 
# for fish run "source langchain/bin/activate.fish" instead
$ CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install --no-cache-dir llama-cpp-python
$ pip install langchain

Read the Langchain llamacpp documentation and start coding. Alternatively you can set up an OpenAI-compatible endpoint with pure llama.cpp in server mode or with llama-cpp-python.
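If you go the endpoint route, talking to it needs nothing beyond the standard library. The sketch below builds an OpenAI-style chat completions request. The base URL, port 8000 and the model name "local" are assumptions for illustration, so use whatever address your server prints on startup. The helper names are made up as well.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000", model="local"):
    """Build (but do not send) a POST to an OpenAI-style chat endpoint."""
    payload = {
        "model": model,  # placeholder name, an assumption
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the text of the first reply."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, ask("Name a color.") should return the model's answer as a plain string.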