RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8


RTX 5080 + RTX 3090 Setup: 80+ Tok/s on Qwen 3.6 27B Q8 - iMil.net

A year ago, I bought an RTX 5080 for both gaming and AI experiments. Little did I know back then that I would be giving in to the joys of local LLM setups.

Fast forward to 2026: with Qwen 3.5, Gemma and Qwen 3.6 around, I needed more than 16GB. So I got myself a refurbished RTX 3090 with 24GB. I could then run Qwen 3.6 Q4 quants, first at ~30 tok/s, then at 50-60 with MTP. Not bad, but it still felt limited while my 5080 sat barely used.

So I began digging into what kind of setup could take advantage of the two cards together. I already had DDR4 sticks and SSDs ready; I only needed a motherboard capable of handling both cards.

Enter the Asus Prime X570-Pro. The “Pro” is important: it is what ensures the x16 PCIe lanes can be split into x8/x8.

The 5080 being the monster it is, I bought a good-quality PCIe 4 riser to plug it into the second slot.

BIOS

The BIOS part was more complex than I anticipated. First and foremost: you CAN’T boot the OS in BIOS/MBR mode. Doing so prevents using both cards together and requires unnecessary kernel-parameter trickery even for one of them.
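To confirm from a running Linux system that it was booted in UEFI rather than legacy BIOS mode, a minimal sketch is to test for the `/sys/firmware/efi` directory, which the kernel only exposes on UEFI boots. The `boot_mode` function name and its optional root-prefix argument are hypothetical, added only so the check can be tried against a fake directory tree.

```shell
#!/bin/sh
# Report the firmware boot mode of a Linux system.
# $1 is an optional root prefix (hypothetical parameter, for dry runs only).
boot_mode() {
    root=${1:-}
    # /sys/firmware/efi is only present when the kernel was booted via UEFI.
    if [ -d "$root/sys/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

# Usage on the real machine:
#   boot_mode
```

If this prints `BIOS`, the OS was installed or booted in legacy mode and the CSM setting described here will not be enough on its own.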

The parameters that should be set:

Go to the Boot tab and set CSM (Compatibility Support Module) to Disabled

Go to the Advanced tab -> PCI Subsystem Settings

Set Above 4G Decoding to Enabled

Set ReSize BAR Support to Auto or Enabled.

Still on the Advanced tab, set PCIEX16_1 Link Mode to Gen 4

Set PCIEX16_2 Link Mode to Gen 4
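Once the driver is installed, the x8/x8 split can be checked from the OS. The sketch below assumes the `pcie.link.width.current` query field of `nvidia-smi`; `check_link_width` is a hypothetical helper that reads the CSV on stdin and complains about any card that did not negotiate the expected width.

```shell
#!/bin/sh
# Read "name, width" CSV lines on stdin and fail if any GPU negotiated a
# link width different from the expected one ($1, e.g. 8 for x8/x8).
check_link_width() {
    awk -F', *' -v want="$1" '
        $2 != want {
            printf "%s: x%s (expected x%s)\n", $1, $2, want
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Usage on the real machine:
#   nvidia-smi --query-gpu=name,pcie.link.width.current \
#              --format=csv,noheader | check_link_width 8
```

The function prints nothing and returns 0 when every card reports the expected width.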

kernel

NVIDIA documentation is a mess. Here’s the link to the driver installation procedure, yes, with /tesla in the URL, because why not: https://docs.nvidia.com/datacenter/tesla/driver-installation-guide/introduction.html

The two GPUs being different models, I unfortunately can’t set up this beauty, a patched driver that enables GPU peer-to-peer (P2P) transfers: https://github.com/aikitoria/open-gpu-kernel-modules
I tested it and the feature is enabled, but it was clear from the start that it would likely fail with different GPUs, let alone different generations.

Nevertheless, for the lucky readers owning 2 cards of the same type, once the patched driver is built and installed, don’t forget to:

Uninstall nvidia-dkms-open

Blacklist the new nova driver
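For the blacklisting step, one possible approach is a modprobe.d fragment. This is a sketch under assumptions: the module names `nova_core` and `nova_drm` and the file name `blacklist-nova.conf` are not taken from the text, so check `lsmod | grep nova` on your kernel first. The directory argument of the hypothetical `write_nova_blacklist` helper exists only to allow a dry run outside `/etc`.

```shell
#!/bin/sh
# Write a modprobe.d fragment that blacklists the nova modules.
# ASSUMPTION: module names nova_core / nova_drm; verify on your kernel.
# ASSUMPTION: the file name blacklist-nova.conf is arbitrary.
write_nova_blacklist() {
    dir=${1:-/etc/modprobe.d}
    printf 'blacklist nova_core\nblacklist nova_drm\n' \
        > "$dir/blacklist-nova.conf"
}

# Usage (as root), followed by an initramfs rebuild with your
# distribution's own tool:
#   write_nova_blacklist
```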

Only then will the freshly patched driver load at boot. You should see the following:

$ nvidia-smi topo -p2p r
        GPU0    GPU1
GPU0    X       OK
GPU1    OK      X

Legend:

X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
DR = Disabled by regkey
U = Unknown
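The matrix can also be checked in a script. The sketch below is a hypothetical helper, `p2p_ok`, that reads the output of `nvidia-smi topo -p2p r` on stdin and reports every GPU pair whose status is anything other than `X` or `OK`; it assumes the output keeps the layout shown above.

```shell
#!/bin/sh
# Return 0 only if every GPU pair in the P2P matrix is X (self) or OK.
p2p_ok() {
    awk '
        $1 ~ /^GPU[0-9]+$/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^GPU[0-9]+$/) continue   # header row, skip
                if ($i != "X" && $i != "OK") {
                    # field 2 is the GPU0 column, so the peer is i - 2
                    printf "%s -> GPU%d: %s\n", $1, i - 2, $i
                    bad = 1
                }
            }
        }
        END { exit bad + 0 }
    '
}

# Usage on the real machine:
#   nvidia-smi topo -p2p r | p2p_ok && echo "P2P available"
```

Legend lines are ignored because their first field is a status code rather than a `GPUn` label.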

If, like me, you own different NVIDIA cards, just use the nvidia-open driver.

Once rebooted with the nvidia driver loaded, check that it sees both cards:

$ nvidia-smi
Sat Jun 13 09:29:23 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02          KMD Version: 610.43.02         CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:07:00.0  On |                  N/A |
|  0%   34C    P8             17W /  350W |   23646MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 5080        On  |   00000000:08:00.0 Off |                  N/A |
|  0%   31C    P8             15W /  360W |   15861MiB /  16303MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
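For a scripted version of the same check, `nvidia-smi` can emit machine-readable values. The sketch below assumes the `memory.total` query field with the `csv,noheader,nounits` format; `vram_summary` is a hypothetical helper that counts the cards and adds up their memory.

```shell
#!/bin/sh
# Read one memory.total value (MiB) per line on stdin and print the
# number of GPUs together with their combined VRAM.
vram_summary() {
    awk '{ sum += $1 } END { printf "%d GPUs, %d MiB\n", NR, sum }'
}

# Usage on the real machine:
#   nvidia-smi --query-gpu=memory.total \
#              --format=csv,noheader,nounits | vram_summary
```

With the two totals shown in the table above (24576 MiB and 16303 MiB), this would report 2 GPUs and 40879 MiB.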

llama.cpp

These are the build flags I use to support both card generations:

# cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES="86;120" -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DGGML_CUDA_NCCL=OFF

The relevant flag is CMAKE_CUDA_ARCHITECTURES="86;120", which enables both the Ampere (RTX 3090) and Blackwell (RTX 5080) architectures. Note the -DGGML_CUDA_NCCL=OFF flag: I found out that NCCL was actually counterproductive, even if the llama-server logs say otherwise.
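If you are unsure which values to use for your own cards, the architecture list can be derived from the compute capability the driver reports. This is a minimal sketch assuming the `compute_cap` query field of `nvidia-smi`; `caps_to_archs` is a hypothetical helper name.

```shell
#!/bin/sh
# Turn compute capabilities such as "8.6" and "12.0" (one per line on
# stdin) into a CMAKE_CUDA_ARCHITECTURES value such as "86;120".
caps_to_archs() {
    tr -d '.' | sort -n -u | paste -sd ';' -
}

# Usage on the real machine:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader | caps_to_archs
```

Duplicates are removed, so two identical cards yield a single entry.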

Now for the startup options:

llama-server -m ./models/Huihui-Qwen3.6-27B-abliterated-ggml-model-Q8_0.gguf \
  -c 229376 \
  -np 1 -fa on -ngl 99 -ub 512 -t 6 --no-mmap \
  --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 \
  -ctk q8_0 -ctv q8_0 --kv-unified \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --spec-type ngram-mod,draft-mtp --spec-draft-n-max 3 \
  -sm tensor -ts 2,3 \
  --port 8001 --host 0.0.0.0
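The text does not say why `-ts 2,3` was chosen or which card llama.cpp lists first. One plausible reading, offered here purely as an assumption, is that the ratio mirrors the nominal VRAM sizes of the two cards (16 GB and 24 GB). The sketch below reduces such a pair to its smallest integer ratio; `gcd` and `tensor_split` are hypothetical helper names.

```shell
#!/bin/sh
# Greatest common divisor of two positive integers (Euclid's algorithm).
gcd() {
    a=$1; b=$2
    while [ "$b" -ne 0 ]; do
        t=$((a % b)); a=$b; b=$t
    done
    echo "$a"
}

# Reduce two VRAM sizes (in GB) to the smallest integer ratio, formatted
# the way -ts expects. ASSUMPTION: the split is proportional to VRAM.
tensor_split() {
    g=$(gcd "$1" "$2")
    echo "$(( $1 / g )),$(( $2 / g ))"
}

# Usage:
#   tensor_split 16 24
```

For 16 and 24 the result is `2,3`, which matches the value in the command above. In practice the split may need adjusting, since the KV cache and compute buffers also consume memory on each card.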

The...
