Run llama.cpp on a Mac Pro 6,1 with Dual FirePro D700 GPUs on Ubuntu

Matthew Gribben

The 2013 Mac Pro is still a strange machine: thermally dense, beautifully overbuilt, and awkwardly dependent on two workstation GPUs that most modern ML stacks have forgotten. The D700 version is the most interesting one for local LLM work because it gives you dual AMD FirePro D700 cards with 6 GB of GDDR5 each.

That is 12 GB of aggregate VRAM, but it is not a single 12 GB GPU. Treat it as two separate 6 GB pools that llama.cpp can use well when the Vulkan backend is configured correctly.

Mac Pro 6,1 D700 memory shape

          llama.cpp Vulkan backend
             split-mode: layer
      +--------------+--------------+
      |                             |
FirePro D700 0               FirePro D700 1
Tahiti / GCN 1.0             Tahiti / GCN 1.0
6 GB GDDR5                   6 GB GDDR5

The practical outcome is simple: the D700 machine can comfortably run the class of models that are annoying on a D300. Seven-billion-parameter Q4 models become realistic with useful context sizes. Thirteen-billion-parameter models are still a poor fit if you expect full GPU offload, because the Mac Pro's dual cards do not behave like one contiguous accelerator.

This guide is a D700-specific rewrite of Edward Chalupa's excellent D300 guide. The main flow is the same: Ubuntu, the amdgpu kernel driver, Mesa RADV, llama.cpp built with Vulkan, and a few settings that matter much more than they look.

Hardware target

Apple shipped three GPU tiers in the Mac Pro 6,1. The D700 is the top configuration: each card has 6 GB of GDDR5, 2048 stream processors, a 384-bit memory bus, and 264 GB/s of memory bandwidth.

| GPU          | Architecture family      | VRAM per card | Aggregate VRAM | Practical llama.cpp target    |
|--------------|--------------------------|---------------|----------------|-------------------------------|
| FirePro D300 | GCN 1.0 / Pitcairn-class | 2 GB          | 4 GB           | 3B and small 4B models        |
| FirePro D500 | GCN 1.0 / Tahiti-class   | 3 GB          | 6 GB           | 4B and some compact 7B quants |
| FirePro D700 | GCN 1.0 / Tahiti-class   | 6 GB          | 12 GB          | 7B Q4/Q5, sometimes 8B Q4     |

The important difference is not raw TFLOPS. It is memory headroom. A 7B Q4_K_M GGUF is usually around 4.0-4.5 GB before runtime buffers and KV cache. On a D300 that is a non-starter. On a D700 pair, layer splitting gives the model enough room.

What fits

Use these as planning numbers, not promises. Exact memory depends on architecture, quantization, context size, batch settings, and llama.cpp version.

| Model class | Quant  | Typical GGUF size | D700 verdict                     |
|-------------|--------|-------------------|----------------------------------|
| 3B          | Q8_0   | ~3.0-3.5 GB       | Easy, but underuses the hardware |
| 7B          | Q4_K_M | ~4.0-4.5 GB       | Good default target              |
| 7B          | Q5_K_M | ~5.0-5.5 GB       | Good with conservative context   |
| 8B          | Q4_K_M | ~4.5-5.0 GB       | Usually workable                 |
| 13B         | Q4_K_M | ~7.5-8.5 GB       | Usually not worth it on this bus |

The trap is reading "12 GB VRAM" as "anything under 12 GB fits." It does not. llama.cpp can distribute layers across devices, but each card still has a 6 GB ceiling and the runtime needs additional memory for compute buffers and KV cache.
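The per-card ceiling can be turned into a rough planning check. The sketch below is an optimistic estimate under stated assumptions: it pretends the weights split evenly across both cards and that each card needs a flat overhead for compute buffers and KV cache. The helper name `fits_dual_d700` and the 1024 MiB default overhead are illustrative choices, not measured values.

```shell
#!/bin/sh
# Rough planning check for a dual-D700 layer split.
# Assumptions (illustrative, not measured): weights split evenly across
# both cards, and each card needs a flat overhead for buffers and KV cache.

CARD_MIB=6144   # 6 GB per D700

# fits_dual_d700 MODEL_MIB [OVERHEAD_MIB_PER_CARD]
# Prints "fits" or "too-big" plus the estimated per-card usage.
fits_dual_d700() {
    model_mib=$1
    overhead_mib=${2:-1024}
    # Half the weights (rounded up) plus the assumed per-card overhead.
    per_card=$(( (model_mib + 1) / 2 + overhead_mib ))
    if [ "$per_card" -le "$CARD_MIB" ]; then
        echo "fits per_card=${per_card}MiB"
    else
        echo "too-big per_card=${per_card}MiB"
    fi
}

fits_dual_d700 4400     # roughly a 7B Q4_K_M file
fits_dual_d700 11000    # under 12 GB in total, still over the per-card ceiling
```

A "fits" result here is a necessary condition at best. Real layer splits are not perfectly even, and the estimate says nothing about bus or sync cost.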

Why a 13B Q4 model is awkward

Model weights + buffers + KV cache
+----------------------------------+
| more than one D700 can hold well |
+----------------------------------+

Splitting helps with layers, but the old PCIe path and sync cost make CPU/GPU mixed inference unattractive once full offload fails.

For this machine, optimize for models that fully offload. If the model does not fit with --n-gpu-layers 99, the fallback should usually be CPU-only, not partial offload.
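That rule of thumb can be written down as a small flag picker. This is a sketch, not part of llama.cpp: `offload_args` is a hypothetical helper and `model.gguf` is a placeholder path. It only relies on the `--n-gpu-layers` and `--split-mode` options, and it prints the command lines rather than running them.

```shell
#!/bin/sh
# Pick llama-server offload flags from a yes/no "fully fits" decision.
# offload_args is a hypothetical helper name; the flags themselves are
# llama.cpp's --n-gpu-layers and --split-mode options.

offload_args() {
    case "$1" in
        yes) echo "--n-gpu-layers 99 --split-mode layer" ;;  # full offload
        *)   echo "--n-gpu-layers 0" ;;                      # CPU-only fallback
    esac
}

# Example: build the command lines without running them.
MODEL=${MODEL:-model.gguf}   # placeholder path
echo "llama-server -m $MODEL $(offload_args yes)"
echo "llama-server -m $MODEL $(offload_args no)"
```

The helper deliberately has no branch for partial offload, which matches the advice above for this machine.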

The driver stack

The D700 is old GCN hardware. The old radeon kernel driver can drive displays, but it is the wrong foundation for Vulkan inference. You want this stack:

llama-server
     |   GGML Vulkan backend
Mesa RADV Vulkan driver
     |   userspace Vulkan implementation
Linux amdgpu kernel driver
     |
Dual FirePro D700 GPUs

Mesa documents RADV as the Vulkan driver for AMD GCN/RDNA GPUs, with the caveat that GCN 1-2 hardware may need amdgpu explicitly enabled instead of radeon. Ubuntu 24.04 often does the right thing on this Mac Pro, but you should verify rather than assume.

Step 1: verify both GPUs use amdgpu

Start with PCI detection:

lspci -nnk | grep -A3 -E "VGA|Display|FirePro|AMD"

You want both D700 devices to report:

Kernel driver in use: amdgpu

If either card is bound to radeon, add the Southern Islands amdgpu flags:

sudoedit /etc/default/grub

Set or extend GRUB_CMDLINE_LINUX_DEFAULT:

radeon.si_support=0 amdgpu.si_support=1
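If you would rather script the edit than do it by hand, something like the following could work. It is a sketch under assumptions: `add_si_flags` is a hypothetical helper, it expects the variable's value to be wrapped in double quotes on one line, and the demo runs against a scratch file rather than /etc/default/grub.

```shell
#!/bin/sh
# add_si_flags FILE: append the Southern Islands flags to
# GRUB_CMDLINE_LINUX_DEFAULT in FILE unless they are already there.
# Hypothetical helper; assumes the value is wrapped in double quotes.

add_si_flags() {
    file=$1
    flags="radeon.si_support=0 amdgpu.si_support=1"
    if grep -q 'amdgpu\.si_support=1' "$file"; then
        return 0   # already set, leave the file alone
    fi
    # Insert the flags just before the closing quote; keep a .bak copy.
    sed -i.bak \
        "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $flags\"/" \
        "$file"
}

# Demo on a scratch file, not on the real GRUB config.
demo=$(mktemp)
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$demo"
add_si_flags "$demo"
cat "$demo"
```

Review the resulting line before pointing the helper at the real file with sudo. A bad kernel command line is easier to prevent than to fix on this machine.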

Then update GRUB and reboot:

sudo update-grub
sudo reboot

After reboot, check again. Do not continue until both cards are on amdgpu.
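The "both cards on amdgpu" check can be made mechanical. The sketch below counts display devices whose block in `lspci -nnk` output reports amdgpu as the driver in use. `count_amdgpu_gpus` is a hypothetical helper, and the sample text is a trimmed, made-up stand-in for real output.

```shell
#!/bin/sh
# count_amdgpu_gpus: read `lspci -nnk` output on stdin and print how many
# VGA/Display devices report "Kernel driver in use: amdgpu".
# Hypothetical helper; assumes device header lines start in column one
# and detail lines are indented, as lspci prints them.

count_amdgpu_gpus() {
    awk '
        /^[^ \t]/ { gpu = ($0 ~ /VGA compatible controller|Display controller/) }
        gpu && /Kernel driver in use: amdgpu/ { n++ }
        END { print n + 0 }
    '
}

# Demo with a trimmed, made-up sample instead of real hardware output.
count_amdgpu_gpus <<'EOF'
02:00.0 VGA compatible controller [0300]: (first GPU)
    Kernel driver in use: amdgpu
02:00.1 Audio device [0403]: (HDMI audio)
    Kernel driver in use: snd_hda_intel
06:00.0 Display controller [0380]: (second GPU)
    Kernel driver in use: amdgpu
EOF
```

On the Mac Pro itself you would pipe the real output through it, as in `lspci -nnk | count_amdgpu_gpus`, and expect it to print 2.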

Step 2: install and test Vulkan

Install the Vulkan userspace pieces and the headers llama.cpp needs during build:

sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  curl \
  git \
  glslc \
  libvulkan-dev \
  mesa-vulkan-drivers \
  spirv-headers \
  vulkan-tools
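After the install, it can be worth confirming that the commands those packages provide are actually on PATH. This is a small sketch with a hypothetical helper, `missing_tools`. The tool list is an assumption about which binaries matter for the build, and header-only packages such as libvulkan-dev cannot be checked this way.

```shell
#!/bin/sh
# missing_tools NAME...: print each command name that is not on PATH.
# Hypothetical helper; prints nothing when every tool is present.

missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Commands expected from the packages above (assumed list).
missing_tools cmake git curl glslc vulkaninfo
```

Empty output means every listed command was found.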

Now check what Vulkan sees:

vulkaninfo --summary

For a working D700 setup you should see two RADV devices. They may be labelled as RADV TAHITI, AMD FirePro D700, or similar depending on Mesa and kernel versions.

Expected shape, not exact text:

Devices:
GPU0: RADV TAHITI / AMD FirePro D700
GPU1: RADV...
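To turn the two-device check into a number, you can count the devices whose driver name is radv. This is a sketch resting on one assumption: that each device block in the summary has a `driverName` line, which may differ across vulkan-tools versions. `count_radv_devices` is a hypothetical helper, and the sample is a trimmed, made-up stand-in rather than captured output.

```shell
#!/bin/sh
# count_radv_devices: read `vulkaninfo --summary` output on stdin and
# print how many devices list radv as their driver name.
# Hypothetical helper; assumes one "driverName = ..." line per device.

count_radv_devices() {
    # grep -c exits non-zero on zero matches; keep the exit status clean.
    grep -c -i 'driverName.*radv' || true
}

# Demo with a trimmed, made-up sample (a software rasterizer may also
# show up as a device and should not be counted).
count_radv_devices <<'EOF'
GPU0:
    driverName = radv
GPU1:
    driverName = radv
GPU2:
    driverName = llvmpipe
EOF
```

On the real machine, `vulkaninfo --summary | count_radv_devices` should print 2 for a working D700 pair.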
