Where Is the JVM Tax?


Same buffers, same instructions, same hardware.<br>Sem Sinchenko<br>TL;DR<br>This is a toy benchmark, not a paper, not a Spark benchmark, and not a “Java beats native” victory lap.<br>I implemented simple arithmetic kernels like addInt32 and mulFloat64 in 100% Java over Apache Arrow buffers, using MemorySegment and the JDK Vector API. No JNI, no native extension, no Unsafe, no sidecar process. Then I compared them with a native Apache Arrow reference implementation (a production-grade library). The result was boring in the interesting way: same class of performance, sometimes Java slightly ahead in my local runs, sometimes native slightly ahead. The bottlenecks were exactly what you would expect: memory layout, CPU cache, RAM bandwidth, and hardware.<br>That is why the phrase "JVM tax" annoys me.<br>If you mean Spark scheduler overhead, say Spark tax.<br>If you mean shuffle, say shuffle tax.<br>If you mean object graphs and GC-visible data, say object layout tax or GC tax.<br>But if the claim is that a warmed-up JVM cannot run analytical kernels efficiently over modern columnar memory, then show me where that tax is collected. Same buffers. Same semantics. Same hardware. Where is the JVM tax?
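To make the "RAM bandwidth" bottleneck concrete, here is a back-of-envelope bound. It assumes arrays much larger than the last-level cache, no nulls, and a purely illustrative sustained memory bandwidth of 30 GB/s (a made-up round number, not something measured in this post). addInt32 reads two 4-byte inputs and writes one 4-byte output per element:

$$
\text{bytes per element} = 4 + 4 + 4 = 12,
\qquad
\text{throughput} \le \frac{B_{\text{mem}}}{12\ \text{bytes}}
= \frac{30 \times 10^{9}\ \text{bytes/s}}{12\ \text{bytes}}
= 2.5 \times 10^{9}\ \text{elements/s}
$$

Under these assumptions, any implementation that keeps the memory bus busy, Java or native, sits under the same ceiling. Write-allocate traffic on the output buffer can lower it further.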

Disclaimer<br>This is not a paper.<br>I am not trying to prove that Java is faster than native code. I am not trying to prove that Spark is fast. I am not trying to rewrite arrow-compute in Java just to win an internet argument. I wrote a small toy project, implemented a few simple kernels, ran JMH on my laptop, and compared the results with a native Arrow reference implementation used in the most straightforward way I could find.<br>The Java code is most probably not optimal, being 100% vibe-coded. The native side is probably not optimal either. I did not spend a week staring at JFR or trying to overfit every last benchmark parameter. And honestly, I do not care that much about the last 3%.<br>What I care about is the order of magnitude of the alleged JVM tax. If the tax is supposed to be a serious reason to rewrite Big Data systems away from the JVM, I would expect it to show up as something more interesting than benchmark noise, cache effects, memory bandwidth, and normal implementation details. The code is public. Clone it, run it, break it, improve it, complain about it. That is more useful than another vague comment about "JVM tax".

Introduction: the phrase that annoyed me<br>Every second post about rewriting Big Data systems in native code has some version of this sentence:<br>Stop paying the JVM tax.

And every time I see it, I have the same question:<br>Which tax exactly?

Because very often "JVM tax" is used as a lazy bucket for everything people dislike about old Big Data systems.<br>Spark has real overheads. A lot of them. It has a distributed scheduler. It has shuffle. It has spill. It has stage boundaries. It has task overhead. It has fault tolerance machinery. It has old APIs. It has row-oriented paths. It has blocking execution patterns. It has a shuffle model that was designed for a very different era of hardware and storage. It has a lot of architecture that predates the current fashion of tight columnar/vectorized engines inspired by things like MonetDB/X100.<br>All of that is real. But why is that called JVM tax?<br>If you mean Spark tax, say Spark tax.<br>If you mean shuffle tax, say shuffle tax.<br>If you mean scheduler tax, say scheduler tax.<br>If you mean row-oriented execution tax, say row-oriented execution tax.<br>If you mean GC-visible object graph tax, say GC-visible object graph tax.<br>If you mean JVM tax, then I want to see the JVM part.<br>So I took a deliberately boring question: What happens if Java, through the official Apache Arrow Java SDK, reads Arrow buffers, writes Arrow buffers, and runs a simple vectorized arithmetic kernel over them?<br>No Spark. No scheduler. No shuffle. No stage boundaries. No blocking distributed execution model. No network. No Spring Boot. No Stream. No object-per-value data plane. Just modern Java, the official Apache Arrow Java SDK, java.lang.foreign.MemorySegment, the JDK Vector API, JMH, and the corresponding arithmetic kernels from arrow-rs as a production native Arrow reference point.<br>And no, this is not Java vs Rust. I am not comparing int[] with Vec, or Java collections with Rust collections. I am comparing kernel paths over Arrow-style columnar memory. One happens to be implemented in Java. One happens to be implemented in arrow-rs. The interesting question is not which mascot wins.
The interesting question is whether the alleged JVM tax shows up when the memory layout and the kernel shape are comparable. Not because this explains all of Big Data performance. It obviously does not. But because if the claim is that a warmed-up JVM is inherently bad at executing analytical kernels over modern columnar memory, this is a pretty good place to look for the tax.
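For readers who have not seen this style of Java, here is a minimal sketch of what an addInt32 kernel over off-heap memory can look like with MemorySegment and the Vector API. It is not the code from the repository. The class name AddInt32Sketch and the addArrays helper are hypothetical, the buffers are plain 64-byte-aligned off-heap segments rather than buffers obtained through the Arrow Java SDK, and validity bitmaps are ignored (no nulls assumed). It assumes JDK 22 or newer, run with --add-modules jdk.incubator.vector.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical illustration only; names are not taken from the original project.
final class AddInt32Sketch {
    // Widest vector shape the current CPU supports for int lanes.
    private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
    // Arrow value buffers are little-endian by default.
    private static final ByteOrder LE = ByteOrder.LITTLE_ENDIAN;
    private static final ValueLayout.OfInt I32 = ValueLayout.JAVA_INT.withOrder(LE);

    /** Computes out[i] = a[i] + b[i] for i in [0, n). Overflow wraps; nulls are not handled. */
    static void addInt32(MemorySegment a, MemorySegment b, MemorySegment out, int n) {
        int lanes = SPECIES.length();
        int bound = n - (n % lanes); // largest multiple of lanes that is <= n
        int i = 0;
        // Vectorized main loop: one SIMD load per input, one add, one SIMD store.
        for (; i < bound; i += lanes) {
            long offset = (long) i * Integer.BYTES;
            IntVector va = IntVector.fromMemorySegment(SPECIES, a, offset, LE);
            IntVector vb = IntVector.fromMemorySegment(SPECIES, b, offset, LE);
            va.add(vb).intoMemorySegment(out, offset, LE);
        }
        // Scalar tail for the remaining n % lanes elements.
        for (; i < n; i++) {
            out.setAtIndex(I32, i, a.getAtIndex(I32, i) + b.getAtIndex(I32, i));
        }
    }

    /** Convenience wrapper for experiments: copies arrays off-heap, runs the kernel, copies back. */
    static int[] addArrays(int[] x, int[] y) {
        int n = x.length;
        try (Arena arena = Arena.ofConfined()) {
            long bytes = (long) n * Integer.BYTES;
            MemorySegment a = arena.allocate(bytes, 64);
            MemorySegment b = arena.allocate(bytes, 64);
            MemorySegment out = arena.allocate(bytes, 64);
            for (int i = 0; i < n; i++) {
                a.setAtIndex(I32, i, x[i]);
                b.setAtIndex(I32, i, y[i]);
            }
            addInt32(a, b, out, n);
            int[] result = new int[n];
            for (int i = 0; i < n; i++) {
                result[i] = out.getAtIndex(I32, i);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        int[] sum = addArrays(new int[] {1, 2, 3, 4, 5}, new int[] {10, 20, 30, 40, 50});
        System.out.println(java.util.Arrays.toString(sum));
    }
}
```

The point of the shape is that nothing in the hot loop allocates or touches a Java object per value: the loop body is loads, an add, and a store over raw memory, which is the same work a native kernel has to do.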

Setup: boring modern Big Data<br>The setup is intentionally boring.<br>I did not invent a custom memory layout. I did not write a native helper. I...

