How 25 NOPs fixed a 10-year-old GTA V performance issue on AMD — Daniel Gallego
If you've played GTA V on PC you've probably seen it: stand on Mount Chiliad at night, point the camera at Los Santos, and watch your frame rate fall off a cliff. People have been complaining since at least 2019 on /r/GrandTheftAutoV_PC.
The real answer is two patterns of six bytes each, in a single function, on AMD CPUs.
This is the story of finding it, understanding why it hits AMD so much harder, and the 25-byte fix, which I worked on together with @divocbn.
The known unknown
The bug is famous enough to have folklore. Take a static scene (no chases, no NPCs, no rain), turn the camera toward downtown Los Santos around 2 AM, and your CPU frame time roughly doubles. AMD owners get hit harder than Intel owners.
It's been like this since the original PC port. There are forum threads, Reddit posts, and "fixes" involving regedits and driver tweaks, none of which address the cause, because the cause is not in any .ini file.
The thing to hold on to: both vendors slow down at night, but AMD users slow down much harder. That asymmetry is a clue.
What's actually rendering at night
Up close, GTA V renders street lights as proper deferred light sources, volumetric, shadowing, the usual modern lighting cost. That's untenable for a panoramic view of Los Santos at night, where you can see thousands of points of light.
Beyond a distance threshold, lights switch to a cheap representation: a billboarded sprite per light, batched into a vertex buffer and uploaded as a few draw calls. Coronas, in old engine terminology. These are the distant LOD lights, and they're rendered by a single function: RenderDistantLODLights.
During the day, or out in the countryside, that function does almost nothing. Looking at the city at night, it iterates over every light bucket within draw distance, culls each one, writes each survivor's position and color into a vertex buffer, and dispatches a draw. Standard CPU-side scene work. The cost should be low. It wasn't.
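The per-frame work described above can be sketched roughly as follows. This is a minimal illustration, not the game's code: `Vec3`, `LightBucket`, `DistantVertex`, `BuildDistantLightVertices`, the field layout, and the simple distance-based cull are all assumptions invented for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical bucket: parallel arrays with one position and one packed
// color per light. The two arrays are assumed to have the same length.
struct LightBucket {
    Vec3 center;                        // used for the coarse cull
    std::vector<Vec3> positions;
    std::vector<std::uint32_t> colors;  // packed color, one per light
};

// Hypothetical output vertex: a position+color pair.
struct DistantVertex {
    Vec3 position;
    std::uint32_t color;
};

static float DistanceSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Culls buckets by distance to the camera (a stand-in for the real
// visibility test) and appends one vertex per light in each surviving
// bucket. Returns the number of surviving buckets, i.e. batches to draw.
std::size_t BuildDistantLightVertices(const std::vector<LightBucket>& buckets,
                                      const Vec3& camera,
                                      float drawDistance,
                                      std::vector<DistantVertex>& out) {
    std::size_t batches = 0;
    const float maxDistSq = drawDistance * drawDistance;
    for (const LightBucket& bucket : buckets) {
        if (DistanceSquared(bucket.center, camera) > maxDistSq) continue;  // culled
        for (std::size_t j = 0; j < bucket.positions.size(); ++j) {
            out.push_back({bucket.positions[j], bucket.colors[j]});
        }
        ++batches;
    }
    return batches;
}
```

Nothing in this shape is expensive on its own: it is a linear walk over two arrays with a sequential write, which is why the measured cost was surprising.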
Here is what the function actually contributes: same camera, same time of day, only RenderDistantLODLights toggled off and on.
LOD lights off: 250 FPS
LOD lights on: 120 FPS
And it's not just the main camera pass. RenderDistantLODLights takes a render mode parameter, and the engine calls it once per render phase that needs distant lights:
```cpp
void RenderDistantLODLights(uint32_t mode, float intensity)
{
    // ...
    if (mode == 0x2) // WATER_REFLECTION
        AdjustDistantLights(intensity);
    if (intensity <= 0.0f) return; // nothing to draw (comparison operator assumed)
    // ...
}
```

The five modes are DEFAULT, MIRROR_REFLECTION, WATER_REFLECTION, CUBE_REFLECTION, and SEETHROUGH. So when you look at the LS coast at night, or any scene with reflective water, the function runs twice: once for what you see directly, once for what the water reflects. Mirrors and cube-mapped surfaces add their own passes on top.
The prefetch stalls happen in every pass, doubling or tripling the cost in exactly the scenes that already have the most lights on screen.
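To make the multiplication concrete, here is a small sketch of the mode list and a per-pass cost model. Only `WATER_REFLECTION == 0x2` is confirmed by the snippet above; the other numeric values are assumed from the order in which the modes are listed. `TotalLodLightCostMs` is a hypothetical helper, and it assumes every pass costs the same, which is a simplification.

```cpp
#include <cstdint>
#include <vector>

// Mode names come from the text. Only WATER_REFLECTION = 2 is confirmed;
// the remaining values are assumed from list order.
enum class RenderMode : std::uint32_t {
    DEFAULT = 0,
    MIRROR_REFLECTION = 1,
    WATER_REFLECTION = 2,
    CUBE_REFLECTION = 3,
    SEETHROUGH = 4,
};

// Hypothetical model: the function runs once per active pass, and each
// pass is assumed to cost the same amount of CPU time.
double TotalLodLightCostMs(const std::vector<RenderMode>& activePasses,
                           double costPerPassMs) {
    return static_cast<double>(activePasses.size()) * costPerPassMs;
}
```

Under this model a coastal night scene with `{RenderMode::DEFAULT, RenderMode::WATER_REFLECTION}` active pays twice the single-pass cost, and adding a mirror pass makes it three times.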
Locating the cost
When we profiled GTA5 at night, looking at the LS skyline, the sampler kept pointing at the same address:
| Function | Samples | Module |
| --- | --- | --- |
| GTA5 : 0x1405909D8 (CLODLights::RenderDistantLODLights) | 29.1% | GTA5 |
| GTA5 : 0x140590900 | 7.3% | GTA5 |
| GTA5 : 0x141397F66 | 5.8% | GTA5 |
| CContext::TID3D11DeviceContext_IASetIndexBuffer_ | 5.8% | d3d11 |
| CContext::TID3D11DeviceContext_IASetVertexBuffers_ | 4.4% | d3d11 |
| NtWaitForSingleObject | 2.9% | ntdll |
| CContext::TID3D11DeviceContext_SetShaderResources_ | 2.9% | d3d11 |
| GTA5 : 0x1413BD46E | 1.9% | GTA5 |
| boost::serialization::singleton | 1.5% | amdxx64 |
| GTA5 : 0x1413E6EC4 | 1.5% | GTA5 |
The top entry, GTA5 : 0x1405909D8, which the disassembly identifies as CLODLights::RenderDistantLODLights, sits at 29.1% of all samples in the frame, more than the next four functions combined. In wall-clock terms, around 1.6 ms of CPU per frame. Enormous for a function whose job is just to build a vertex buffer.
We pulled it up in a disassembler, plain x86-64 in GTA5.exe, and walked through the body. The structure is simple: two inner loops, one for street lights and one for everything else, each iterating over light entries and writing a position+color pair into a vertex buffer.
Six prefetchnta instructions stood out. Three at the top of each inner loop, all back-to-back, all targeting addresses derived from the vertex buffer pointer and the per-light position/color arrays:
```cpp
PrefetchDC(pRGBIPrefetch + j);     // per-light color array (cacheable)
PrefetchDC(pPositionPrefetch + j); // per-light position array (cacheable)
PrefetchDC(pOutputPrefetch + j);   // D3D vertex buffer slot (write-combine)
```

Standard prefetching pattern, the kind of thing nobody flags in a hot inner loop. A lookahead of j + 4 entries: 16 bytes for color, 48 bytes for position, 64 bytes for the output. The two array prefetches look reasonable enough; the third targets an address that, if you know D3D11 memory...