Why isn't Quarkus 2x faster than Spring on my machine?

June 08, 2026

#performance #containers

By Francesco Nigro

A community member ran our Quarkus vs Spring CRUD benchmark on their bare-metal Fedora workstation and asked:

Why do I see only 1.19x instead of 2x?

Our perf-lab shows Quarkus at 2.08x Spring’s throughput, but locally the gap nearly disappears.

This post walks through the investigation that found the culprit.

The gap

The benchmark is a REST/CRUD application backed by PostgreSQL. The app runs on the host, PostgreSQL in a rootless podman container. Each HTTP request executes 2 SQL queries (confirmed via pg_stat_statements).

Spring delivers roughly the same throughput in both environments. Quarkus, by contrast, drops from 24.5K TPS on the perf-lab to 15.5K TPS locally. Something in the local environment is capping Quarkus but not Spring.
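As a sanity check on that claim, the Spring throughput implied by the quoted figures can be back-derived. This is only a rough sketch: it assumes the 1.19x and 2.08x ratios apply to the 15,504 TPS and 24,472 TPS Quarkus runs reported in this post, and the ratios are rounded, so the results are approximate.

```python
# Figures quoted in this post.
LOCAL_QUARKUS_TPS = 15_504
PERFLAB_QUARKUS_TPS = 24_472
LOCAL_RATIO = 1.19     # Quarkus / Spring on the local machine
PERFLAB_RATIO = 2.08   # Quarkus / Spring on the perf-lab


def implied_spring_tps(quarkus_tps, quarkus_over_spring):
    """Back out Spring's throughput from Quarkus' throughput and the ratio."""
    return quarkus_tps / quarkus_over_spring


def summarize():
    """Return the implied Spring numbers and the Quarkus perf-lab/local swing."""
    return {
        "spring_local": implied_spring_tps(LOCAL_QUARKUS_TPS, LOCAL_RATIO),
        "spring_perflab": implied_spring_tps(PERFLAB_QUARKUS_TPS, PERFLAB_RATIO),
        "quarkus_swing": PERFLAB_QUARKUS_TPS / LOCAL_QUARKUS_TPS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

Under those assumptions Spring lands at roughly 13.0K TPS locally and 11.8K TPS on the perf-lab, while Quarkus moves by a factor of about 1.58 between the two environments.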

mpstat: where is the CPU going?

The benchmark collects mpstat data during every run — per-CPU utilization split into %usr (application code), %sys (kernel), %soft (softirq, mainly network packet processing), and %idle. This is part of our active benchmarking practice: observing the system while it runs, not just collecting final TPS numbers.
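For readers who want to do the same on their own runs, here is a minimal sketch of turning mpstat's text output into numbers. It is not the benchmark's own tooling. It assumes the usual sysstat column layout (a header line containing %usr, %sys, %irq, %soft and %idle) and a locale that prints decimals with a dot. The sample text is synthetic, not a measurement.

```python
# Synthetic sample in mpstat's usual layout; the values are illustrative only.
SAMPLE = """\
Linux 6.x (example-host)  06/08/2026  _x86_64_  (8 CPU)

12:00:01  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
12:00:02  all   45.00    0.00   38.00    0.00    0.00   13.00    0.00    0.00    0.00    4.00
Average:  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:  all   45.00    0.00   38.00    0.00    0.00   13.00    0.00    0.00    0.00    4.00
"""


def parse_mpstat(text):
    """Parse mpstat text output into a list of per-sample dicts.

    Column positions come from the most recent header line, so 12-hour and
    24-hour timestamps both work. "Average:" summary lines are skipped.
    """
    wanted = ("%usr", "%sys", "%irq", "%soft", "%idle")
    header = None
    samples = []
    for line in text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == "Average:":
            continue                 # summary rows are not individual samples
        if "%usr" in tokens:
            header = tokens          # remember the current column layout
            continue
        if header is None or len(tokens) != len(header):
            continue                 # banner, blank line, or unrelated output
        row = dict(zip(header, tokens))
        sample = {"cpu": row["CPU"]}
        for column in wanted:
            sample[column] = float(row[column])
        samples.append(sample)
    return samples


def kernel_share(sample):
    """Kernel-side CPU percentage: syscalls plus hard and soft interrupts."""
    return sample["%sys"] + sample["%irq"] + sample["%soft"]


if __name__ == "__main__":
    for s in parse_mpstat(SAMPLE):
        print(s["cpu"], "user:", s["%usr"], "kernel:", kernel_share(s))
```

Summing %sys, %irq and %soft into one "kernel" figure is a simplification chosen for this sketch; keeping them separate is more useful when softirq load is the thing under investigation.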

Both environments run Quarkus at 2.3GHz with the same workload and CPU pinning. The mpstat profiles could not be more different:

| Environment | %usr | %sys | %soft | %idle |
|---|---|---|---|---|
| Local (Fedora, 15,504 TPS) | 39-50% | 34-41% | 9-17% | 3-5% |
| Perf-lab (RHEL, 24,472 TPS) | 87-94% | 5-11% | 0-2% | 0% |

On the perf-lab, over 85% of CPU time goes to application code (%usr). Locally, %sys and %soft together account for 43-58%, so roughly half of the CPU is spent in the kernel instead. Same application, same clock speed, same workload; only the split between user and kernel time differs.

Where is the kernel time going?

A differential flamegraph of the JFR CPU profiles (collected via async-profiler) from the perf-lab and local Quarkus runs shows exactly where the extra kernel time is spent:

Red frames appear more in the local run; blue frames appear more on the perf-lab. The brightest red hotspots are kernel spin locks (_raw_spin_unlock_irqrestore), nftables firewall evaluation (nft_do_chain, nft_meta_get_eval), and TCP packet processing (tcp_clean_rtx_queue, skb_defer_free_flush). The blue band at the bottom is application code that gets more CPU on the perf-lab — because the kernel isn’t eating it. The local kernel is spending cycles on network packet processing and firewall rules that the perf-lab doesn’t need.

The brightest red frame — _raw_spin_unlock_irqrestore — is worth a closer look. The stack trace shows it’s triggered by Agroal (Quarkus’s connection pool) returning a JDBC connection after a query: ConnectionPool.returnConnectionHandler → LinkedTransferQueue.tryTransfer → LockSupport.unpark → kernel futex_wake → try_to_wake_up → spin lock. If network round-trips are slower, JDBC connections are held longer and more threads pile up waiting for a free connection. Every connection return triggers a futex_wake to unpark a waiter — the higher the network latency, the more waiters accumulate, and the more kernel time is spent waking them.
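The link between round-trip latency and pool pressure follows from Little's Law: the average number of connections in use equals the request rate multiplied by how long each request holds its connection. The sketch below makes that concrete. The function names, the pool size parameter and any latency values fed into it are hypothetical; only the 2 queries per request and the TPS figures come from this post.

```python
def hold_time(queries_per_request, round_trip_seconds, query_seconds=0.0):
    """Lower bound on how long one request keeps its connection checked out.

    Assumes the queries run one after another on the same connection.
    """
    return queries_per_request * (round_trip_seconds + query_seconds)


def connections_in_use(requests_per_second, hold_seconds):
    """Little's Law: average occupancy = arrival rate * time in the system."""
    return requests_per_second * hold_seconds


def pool_pressure(requests_per_second, hold_seconds, pool_size):
    """Average demand as a fraction of the pool.

    Values near or above 1.0 mean threads will queue for a free connection.
    """
    return connections_in_use(requests_per_second, hold_seconds) / pool_size


if __name__ == "__main__":
    # Hypothetical per-query round-trip times, in seconds.
    for rtt in (0.0005, 0.001, 0.002):
        busy = connections_in_use(15_504, hold_time(2, rtt))
        print(f"rtt={rtt * 1000:.1f} ms -> about {busy:.0f} connections busy")
```

At 15,504 TPS and 2 queries per request, a hypothetical 1 ms per query keeps about 31 connections busy on average, and doubling the latency doubles that. Once demand approaches the pool size, every returned connection has a waiter to wake, which matches the futex_wake pattern described above.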

The suspect: pasta, the userspace TCP proxy

Rootless podman on Fedora uses pasta (passt) to forward container ports. Unlike rootful podman (which uses kernel-level port forwarding), pasta is a userspace process that proxies every TCP packet:

With pasta (default rootless):
App --> kernel --> pasta (userspace) --> kernel --> container netns --> PostgreSQL

With --network=host:
App --> kernel --> PostgreSQL (same network namespace)

Every JDBC packet traverses two extra kernel/userspace boundary crossings plus a userspace copy in the pasta process. For a chatty protocol like JDBC with small, frequent packets, this adds up fast. The kernel functions visible in the flamegraph — nft_do_chain, tcp_clean_rtx_queue, skb_defer_free_flush — are not pasta’s own CPU time (pasta runs in a separate process), but they are the kernel-side cost of the extra network hops that the application’s syscalls now traverse. The connection pool contention (futex_wake from Agroal) could be a consequence of the added queuing delay: if each round-trip takes longer, connections are held longer, and waiters accumulate.
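One way to see the per-packet cost in isolation is a small ping-pong probe that sends a short payload and waits for it to come back, much like a chatty request/response protocol. The sketch below is not part of the benchmark: the echo server, function names and payload size are all illustrative. As written it runs over loopback; to compare network modes, the echo server would have to run inside a container (bound to 0.0.0.0) with the probe pointed at the published port, and that setup is not shown here.

```python
import socket
import statistics
import threading
import time


def start_echo_server(host="127.0.0.1"):
    """Start a throwaway single-client TCP echo server; return its port."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((host, 0))           # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        with conn, server:
            while True:
                data = conn.recv(4096)
                if not data:         # client closed the connection
                    break
                conn.sendall(data)

    threading.Thread(target=serve, daemon=True).start()
    return port


def measure_round_trips(host, port, count=1000, payload=b"x" * 64):
    """Send `payload` `count` times and return each round-trip time in seconds."""
    samples = []
    with socket.create_connection((host, port)) as sock:
        # Disable Nagle so each small write goes out immediately.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(count):
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            while received < len(payload):
                chunk = sock.recv(len(payload) - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                received += len(chunk)
            samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    port = start_echo_server()
    rtts = measure_round_trips("127.0.0.1", port)
    print(f"median round trip: {statistics.median(rtts) * 1e6:.1f} us")
```

The median is reported rather than the mean because a few slow outliers, such as scheduler hiccups, would otherwise dominate a microsecond-scale measurement.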

Crucially, pasta is single-threaded. It processes all forwarded packets on a single CPU core. If that core saturates, packet processing queues up: latency spikes and throughput hits a ceiling regardless of how many cores the application has available. The alternative is --network=host: the container shares the host's network namespace, so packets stay in the kernel and never pass through a proxy.
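The ceiling imposed by a single-threaded forwarder can be written as simple arithmetic: one core can handle at most 1 / (time per packet) packets per second, and each request needs some number of forwarded packets. The sketch below is a simplified model with hypothetical inputs; the per-packet cost and the packets-per-request count are not measurements from this investigation.

```python
def proxy_ceiling_tps(per_packet_seconds, packets_per_request):
    """Upper bound on requests/second through a single-threaded forwarder.

    Simplified model: one core, a fixed cost per forwarded packet, and no
    batching. Real forwarders may batch, so treat this as a rough bound.
    """
    if per_packet_seconds <= 0 or packets_per_request <= 0:
        raise ValueError("both arguments must be positive")
    packets_per_second = 1.0 / per_packet_seconds
    return packets_per_second / packets_per_request


if __name__ == "__main__":
    # Hypothetical: 10 microseconds per packet, 4 forwarded packets per request.
    print(f"{proxy_ceiling_tps(10e-6, 4):,.0f} requests/second at most")
```

With those made-up inputs the bound is 25,000 requests per second. Adding application cores does not move it; only a lower per-packet cost or fewer packets per request does.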

Quantifying the overhead with pgbench

To measure pasta’s impact on database traffic, we ran pgbench with the same 2-query workload (50 clients — matching the default JDBC...
