RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8 - iMil.net

A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.

Fast forward to 2026: with Qwen 3.5, Gemma, and Qwen 3.6 around, I needed more than 16GB. So I got myself a refurbished RTX 3090 with 24GB. I could then run Qwen 3.6 Q4 quants, first at ~30 tok/s, then at 50-60 with MTP. Not bad, but it still felt limited while my 5080 sat barely used.

So I started digging into what kind of setup could take advantage of both cards together. I already had DDR4 sticks and SSDs ready; I only needed a motherboard capable of handling the two cards.

Enter the Asus Prime X570-Pro. The “Pro” is important: it is what ensures the 16x PCIe lanes can be split into 2x8.

The 5080 being the monster it is, I bought a good-quality PCIe 4 riser to plug it into the second slot.

BIOS

The BIOS part was more complex than I anticipated. First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents using both cards together and requires unnecessary kernel-parameter trickery even for one of them.

The parameters that should be set:

Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled

Go to the Advanced tab -> PCI Subsystem Settings

Set Above 4G Decoding to Enabled

Set ReSize BAR Support to Auto or Enabled.

Still on the Advanced tab -> PCIEX16_1 Link Mode: Gen 4

PCIEX16_2 Link Mode: Gen 4
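After rebooting, a quick sanity check can confirm that the firmware settings took effect. This is a minimal sketch and not part of the original procedure: boot_mode is a hypothetical helper that relies on the Linux convention that /sys/firmware/efi only exists when the kernel was booted through UEFI, and the nvidia-smi query assumes the NVIDIA driver is already installed.

```shell
#!/bin/sh
# Hypothetical helper: prints "uefi" if the kernel was booted via UEFI,
# "bios" otherwise. The optional argument is a root prefix (default: /).
boot_mode() {
    root="${1:-}"
    if [ -d "$root/sys/firmware/efi" ]; then
        echo uefi
    else
        echo bios
    fi
}

echo "Boot mode: $(boot_mode)"

# Current PCIe generation and lane width per GPU, if the driver is present.
# With the x16 lanes split in two, each card should report a width of 8.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi \
        --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current \
        --format=csv || echo "nvidia-smi query failed (driver not loaded?)"
fi
```

Keep in mind that the reported link generation can read lower than Gen 4 while a GPU is idle, since the link is downclocked to save power; check it under load before blaming the BIOS.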

kernel

NVIDIA’s documentation is a mess. Here’s the link to the driver installation procedure (yes, with /tesla in the URL, because why not): https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html

The two GPUs being different models, I unfortunately can’t set up this beauty: https://github.com/aikitoria/open-gpu-kernel-modules. I tested it and the feature is enabled, but it was clear from the start that it would likely fail with different GPUs, let alone different generations.

Nevertheless, for the lucky readers owning 2 cards of the same type: once the patched driver is built and installed, don’t forget to:

Uninstall nvidia-dkms-open

Blacklist the new nova driver
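Those two steps could look like the following on a Debian/Ubuntu-style system. This is a sketch under assumptions, not the author's exact commands: the apt and update-initramfs calls, the file name blacklist-nova.conf, and the module name nova_core are mine. Check the actual module names with lsmod and adapt the package commands to your distribution.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe blacklist file.
# Usage: write_blacklist <conf-file> <module> [<module> ...]
write_blacklist() {
    conf="$1"
    shift
    : > "$conf"                      # start from an empty file
    for mod in "$@"; do
        echo "blacklist $mod" >> "$conf"
    done
}

# Example usage (as root), commented out so nothing changes by accident:
# lsmod | grep -i nova                   # find the real module names first
# apt remove nvidia-dkms-open            # package name as given in the text
# write_blacklist /etc/modprobe.d/blacklist-nova.conf nova_core  # assumed name
# update-initramfs -u                    # Debian/Ubuntu; other distros differ
```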

Only then will the freshly patched driver load at boot. You should see the following:

$ nvidia-smi topo -p2p r
        GPU0    GPU1
GPU0    X       OK
GPU1    OK      X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  DR   = Disabled by regkey
  U    = Unknown

If, like me, you own different NVIDIA cards, just use the nvidia-open driver.

Once rebooted with the nvidia driver loaded, check that it sees both cards:

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

llama.cpp

These are the build flags I use to support both card generations:

# cmake -B build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DGGML_NATIVE=ON \
    -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_FA_ALL_QUANTS=ON \
    -DCMAKE_CUDA_ARCHITECTURES="86;120" \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
    -DGGML_CUDA_NCCL=OFF

The relevant flag is CMAKE_CUDA_ARCHITECTURES="86;120", which enables both the Ampere and Blackwell architectures. Note the -DGGML_CUDA_NCCL=OFF flag: I found that NCCL was actually counterproductive, even if the llama-server logs say otherwise.
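The architecture numbers are simply each card's CUDA compute capability with the dot removed: 8.6 for the RTX 3090 and 12.0 for the RTX 5080. As a sketch, the list can be derived from the driver instead of being hard-coded. caps_to_archs is a hypothetical helper, and the compute_cap query field assumes a reasonably recent nvidia-smi.

```shell
#!/bin/sh
# Hypothetical helper: turn compute capabilities read from stdin
# (one per line, e.g. "8.6") into a CMake list such as "86;120".
caps_to_archs() {
    # Drop dots, spaces and carriage returns, sort numerically,
    # remove duplicates, then join the lines with ';'.
    tr -d '. \r' | sort -n -u | paste -sd ';' -
}

# Example with the two cards from this article:
printf '12.0\n8.6\n' | caps_to_archs   # prints 86;120

# On a live system (requires the NVIDIA driver):
# nvidia-smi --query-gpu=compute_cap --format=csv,noheader | caps_to_archs
```

The result can be passed straight to -DCMAKE_CUDA_ARCHITECTURES, which keeps the build command valid if a card is swapped later.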

Now on to the startup options:

llama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \
  -c 229376 \
  -np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
  -ctk q8_0 -ctv q8_0 --kv-unified \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \
  -sm tensor -ts 2,3 \
  --port 8001 --host 0.0.0.0
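A note on -ts 2,3: llama.cpp reads these values as relative proportions per GPU, and 2:3 happens to be the reduced ratio of the two cards' memory, 16GB to 24GB. Whether that is how the values were chosen is my assumption, as is the device order (5080 first, 3090 second). A small sketch, with split_ratio as a hypothetical helper, reduces any pair of VRAM sizes the same way:

```shell
#!/bin/sh
# Hypothetical helper: reduce two VRAM sizes (in GB) to the smallest
# integer ratio, formatted like the argument to -ts.
split_ratio() {
    a="$1"; b="$2"
    x="$a"; y="$b"
    # Euclid's algorithm: x ends up as gcd(a, b).
    while [ "$y" -ne 0 ]; do
        t=$((x % y))
        x="$y"
        y="$t"
    done
    echo "$((a / x)),$((b / x))"
}

split_ratio 16 24   # prints 2,3
```

A memory-proportional split is only a starting point: the KV cache and compute buffers also take space, so the ratio may need tuning by hand.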

The...
