llama.cpp on AMD Ryzen AI Max+ 395 w/Radeon 8060S
In the BIOS, under Advanced, set the iGPU memory mode to Auto and the iGPU
Memory Size to 0.5GB. Keeping the dedicated carve-out minimal lets the ROCm
software manage the memory split itself, allocating from shared system
memory (GTT) as needed.
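After a reboot, the amdgpu driver reports the resulting split in sysfs, so the
setting can be verified (card0 is assumed here; adjust if more than one GPU is
present):
# Dedicated VRAM carve-out, in bytes (should be small, ~512MB)
cat /sys/class/drm/card0/device/mem_info_vram_total
# GTT, the shared system memory pool the iGPU can allocate from
cat /sys/class/drm/card0/device/mem_info_gtt_total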
These notes use Artix Linux (OpenRC) with the ROCm HIP SDK installed:
pacman -S rocm-hip-sdk
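Before building, it is worth confirming that ROCm actually sees the iGPU;
rocminfo should list it as gfx1151 (on Arch-based distros rocminfo may need to
be installed separately, an assumption here):
pacman -S rocminfo
rocminfo | grep gfx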
Get llama.cpp from GitHub:
git clone --depth=1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build with cmake:
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -- -j$(nproc)
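If CMake does not pick up ROCm's compiler on its own, the llama.cpp HIP build
docs (see references) suggest pointing it at the bundled clang explicitly:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release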
Run with a model. GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 (the name says CUDA, but
the HIP backend honors it too) lets ggml allocate from unified/shared system
memory rather than failing once the dedicated VRAM pool is exhausted; see the
unified-memory link in the references:
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 build/bin/llama-server --host 0.0.0.0 \
--port 8080 \
--flash-attn on \
--cache-prompt \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--gpu-layers 99 \
--ctx-size 32768 \
--mmproj ../models/Huihui-Qwen3.6-35B-A3B-abliterated-mmproj-BF16.gguf \
--model ../models/Huihui-Qwen3.6-35B-A3B-abliterated-Q8_0.gguf
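llama-server exposes an OpenAI-compatible HTTP API, so a quick smoke test from
any machine on the network looks like this (replace localhost with the
server's address):
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}]}'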
Running llama.cpp with Radeon RX 9070 XT (gfx1201)
TODO: write some stuff here
Running an MCP Server for File System Access
On Artix Linux, using podman:
pacman -S podman crun
Update /etc/containers/registries.conf so that docker.io is searched whenever
an unqualified image name appears in a Dockerfile or on the command line:
unqualified-search-registries = ["docker.io"]
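With that in place, a containerized MCP filesystem server can be pulled by its
short name. As a sketch, assuming the reference filesystem server image
mcp/filesystem from the Model Context Protocol servers project, and exposing a
single project directory:
# The server speaks MCP over stdio (-i); only the bind-mounted path is visible to it
podman run -i --rm \
  --mount type=bind,src=$HOME/projects,dst=/projects/projects \
  mcp/filesystem /projects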
References
- https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md#unified-memory
- https://strixhalo.wiki/AI/llamacpp-with-ROCm
- https://www.theregister.com/software/2026/05/02/how-to-roll-your-own-local-ai-coding-agents/5230018
- https://huggingface.co/Abiray/Huihui-Qwen3.6-35B-A3B-abliterated-GGUF/tree/main
