G1 GC Throughput Improvements: 5-15% Performance Gains with Dual Card Tables – Ionut Balosin
By Ionut Balosin
May 19, 2026
#Card Tables, #G1 GC, #Garbage Collection, #GC Optimization, #hotspot, #java, #JDK 26, #JVM Internals, #Memory Management, #performance, #throughput, #Write Barriers
Content
Introduction
The Problem: Synchronized Card Table Updates
The Solution: Dual Card Tables with Atomic Swap
Technical Deep Dive: Write Barrier Code Generation
Performance Analysis
Practical Examples
Migration Considerations
Conclusions
References
Introduction
The Garbage-First (G1) collector balances latency and throughput by performing much of its work concurrently with the application. However, this concurrency comes at a cost: application threads must coordinate with GC threads, and the resulting synchronization overhead lowers throughput. JEP 522 (G1 GC: Improve Throughput by Reducing Synchronization) removes this bottleneck through an elegant architectural change: dual card tables that let application and GC threads work independently.
The impact is substantial. In write-intensive applications (those that frequently store object references), throughput improves by 5-15%. Even applications with modest reference updates see gains of up to 5% from the simpler write barriers. On x64, the write barrier shrinks from roughly 50 instructions to just 12, reducing code footprint and improving instruction cache utilization.
The solution is conceptually simple: instead of one shared card table requiring fine-grained synchronization, G1 maintains two tables. Application threads mark dirty cards in one table without synchronization, while optimizer (refinement) threads process the other. When G1 estimates that scanning the active table during a pause would exceed the pause-time goal, it atomically swaps the two. This design removes the contention while preserving the semantics needed for incremental collection.
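A minimal, single-threaded sketch of the two-table idea may help fix the picture. Everything here is hypothetical and chosen for illustration (the class `ToyDualCardTable`, its methods, and the `CLEAN`/`DIRTY` values); it is not HotSpot code, and it does not model how the real collector makes sure every application thread has switched tables before refinement begins.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of two card tables that trade roles; illustration only. */
final class ToyDualCardTable {
    static final byte CLEAN = 0; // values picked for this sketch only
    static final byte DIRTY = 1;

    // Table that application threads currently mark.
    private final AtomicReference<byte[]> active;
    // Table currently owned by the refinement side (assumed fully cleaned before the next swap).
    private byte[] spare;

    ToyDualCardTable(int cardCount) {
        this.active = new AtomicReference<>(new byte[cardCount]);
        this.spare = new byte[cardCount];
    }

    /** Application side: mark a card in whichever table is active, with no lock. */
    void markDirty(int card) {
        active.get()[card] = DIRTY;
    }

    /** True if the card is dirty in the table that is active right now. */
    boolean isDirty(int card) {
        return active.get()[card] == DIRTY;
    }

    /**
     * Swaps the roles of the two tables and returns the previously active one,
     * which the caller may now refine. Assumed to be called by one coordinator only.
     */
    byte[] swap() {
        byte[] previous = active.getAndSet(spare);
        spare = previous;
        return previous;
    }
}
```

After `swap()`, new marks land in the fresh table, while the returned array holds the marks accumulated so far and can be processed without touching the table the application is writing to.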
For developers, this is transparent – no API changes, no configuration adjustments. For JVM engineers, it demonstrates how architectural rethinking can unlock performance: remove synchronization from the hot path, batch operations, and let each component work at full speed.
The Problem: Synchronized Card Table Updates
G1 reclaims memory by copying live objects from one heap region to another, making the source region available for new allocations. When an object moves, any references to it (stored in other objects’ fields) must be updated to point to the new location. Scanning the entire heap for such references would be prohibitively expensive, so the key challenge is finding the references that need updating.
Card Tables: Tracking Cross-Region References
G1 uses a card table to track which parts of the heap may contain inter-region references. The heap is conceptually divided into fixed-size cards (typically 512 bytes). Each byte in the card table corresponds to one heap card and records whether that card contains interesting references:
Heap Layout:
[Region 0: Objects 0-2MB] [Region 1: Objects 2-4MB] ...
Card Table:
[byte 0: clean] [byte 1: dirty] [byte 2: dirty] ...
A card is “dirty” if it contains at least one reference that might cross region boundaries. During a GC pause, G1 examines only the dirty cards to find references requiring updates. This is efficient: with 512-byte cards and one byte per card, the card table for a 4GB heap is only 8MB, and scanning it is vastly faster than scanning the heap itself.
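As a quick check on the sizes involved, the arithmetic can be worked through directly. The sketch below assumes the 512-byte cards and one-byte-per-card layout described above; the class and method names are made up for this example.

```java
/** Back-of-the-envelope card table sizing; names are illustrative. */
final class CardTableSizing {
    /** 512-byte cards, i.e. 2^9 bytes, as assumed in the text. */
    static final int CARD_SHIFT = 9;

    /** One card-table byte per heap card, rounding up for a partial last card. */
    static long cardTableBytes(long heapBytes) {
        long cardSize = 1L << CARD_SHIFT;
        return (heapBytes + cardSize - 1) >>> CARD_SHIFT;
    }

    public static void main(String[] args) {
        long heap = 4L << 30;              // a 4 GiB heap
        long table = cardTableBytes(heap); // 2^32 / 2^9 = 2^23 bytes
        System.out.println((table >> 20) + " MiB card table for a 4 GiB heap");
        System.out.println("heap is " + (heap / table) + "x larger than its card table");
    }
}
```

The table is always 1/512 of the heap under these assumptions, so it stays small enough to scan quickly even for large heaps.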
Cards are dirtied by write barriers – small code fragments injected into the application by the JIT compiler. Every time the application stores an object reference in a field, the write barrier marks the corresponding card as dirty.
Here’s a conceptual write barrier:
// Application code
obj.field = reference;
// Injected write barrier (conceptual)
byte* card = card_table_base + (address_of(obj) >> 9); // 512-byte cards
*card = DIRTY;
The JIT compiles this into native code that executes after every reference store.
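The same idea can be modeled in plain Java to make the index computation concrete. This is a toy model, not HotSpot code: `ToyCardTable`, `heapBase`, and the `CLEAN`/`DIRTY` values are assumptions made for the sketch, and addresses are simulated as `long` values.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy card table that mimics the conceptual barrier; illustration only. */
final class ToyCardTable {
    static final int CARD_SHIFT = 9; // 512-byte cards, as in the text
    static final byte CLEAN = 0;     // values picked for this sketch only
    static final byte DIRTY = 1;

    private final long heapBase;     // hypothetical start address of the heap
    private final byte[] cards;      // one byte per heap card

    ToyCardTable(long heapBase, long heapBytes) {
        this.heapBase = heapBase;
        long cardCount = (heapBytes + (1L << CARD_SHIFT) - 1) >>> CARD_SHIFT;
        this.cards = new byte[(int) cardCount];
    }

    /** Maps a simulated heap address to the index of its card. */
    int cardIndex(long address) {
        return (int) ((address - heapBase) >>> CARD_SHIFT);
    }

    /** What the conceptual barrier does after "obj.field = reference". */
    void postWriteBarrier(long address) {
        cards[cardIndex(address)] = DIRTY;
    }

    /** Indices of all dirty cards, in ascending order. */
    List<Integer> dirtyCards() {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < cards.length; i++) {
            if (cards[i] == DIRTY) {
                result.add(i);
            }
        }
        return result;
    }
}
```

Two stores that fall within the same 512-byte window dirty the same card, which is why the card table is a coarse summary of where reference stores happened rather than an exact log of them.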
The Synchronization Problem
A barrier this simple is fast, typically 3-5 instructions. However, G1 has a problem: if dirty cards accumulate too quickly, scanning them during the next GC pause would exceed G1’s pause-time goal (default 200ms). To prevent this, G1 runs concurrent refinement threads that process dirty cards in the background, updating remembered sets and clearing the cards.
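A heavily simplified, single-threaded sketch of one refinement pass is shown below. It assumes a plain `byte[]` card table and uses a set of card indices as a stand-in for G1's remembered sets, which are considerably more elaborate in practice; `ToyRefinement` and its constants are hypothetical names.

```java
import java.util.Set;

/** Toy refinement pass: record each dirty card, then clean it. */
final class ToyRefinement {
    static final byte CLEAN = 0; // values picked for this sketch only
    static final byte DIRTY = 1;

    /**
     * Visits every card once. Each dirty card is added to the stand-in
     * remembered set and then reset to CLEAN.
     *
     * @return the number of cards refined in this pass
     */
    static int refine(byte[] cardTable, Set<Integer> rememberedCards) {
        int refined = 0;
        for (int i = 0; i < cardTable.length; i++) {
            if (cardTable[i] == DIRTY) {
                rememberedCards.add(i); // real G1 would scan the card's references here
                cardTable[i] = CLEAN;   // the card no longer needs work at the next pause
                refined++;
            }
        }
        return refined;
    }
}
```

Every card cleaned here is one fewer card to scan during the next pause, which is the point of doing this work in the background.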
This creates a synchronization problem: refinement threads and application threads both access the card table. Application threads write new dirty marks, while refinement threads read and clear old ones. Without coordination, race conditions occur:
Thread 1 (application): Thread 2 (refinement):
Read card value (clean)
Read card value (dirty)
Process card
Write card (clean)
Write card (dirty)
Miss dirty mark!
The refinement thread clears the card before the application thread writes the new dirty mark, losing track of a reference update.
Legacy Solution: Complex Synchronization
To avoid this, G1’s write barriers used elaborate synchronization. Here’s a simplified version of the old x64 write barrier:
; Old G1...