Felix86 26.06: Some Gaming


felix86 26.06 – felix86 – Run x86 and x86-64 games on RISC-V

felix86 26.06

Finally, some gaming!

Hardware arrives

This month we received the new SpacemiT K3 board. Until now, felix86 couldn’t run on any out-of-order hardware, such as the SiFive P550 or the SOPHON SG2042. The former has no vector support, and the latter only supports XTheadVector. While we initially considered supporting hardware without RVV 1.0 or with XTheadVector, we ultimately decided to focus on the future of RISC-V consumer hardware, which will have RVV 1.0 because it is mandatory in the RVA23 profile.

If you watched the felix86 talk at the RISC-V NA summit, you might’ve seen a video of gameplay on K1 hardware. You may have noticed the lack of modern 3D games running on the emulator, because a lot of them would run at less than 5 frames per second. Now that we have much faster hardware, there’s more to show!

TLDW: Huge performance improvements over the K1, and RISC-V performance will only go up from here.

Lessons learned from new hardware

We’re grateful to SpacemiT for providing us with a 16 GB Pico-ITX K3 board. Here are some lessons learned within just a week of using this board.

Performance in games went up by a ton

In most games, performance is up 3-4x compared to the K1. In other games, the performance boost is even bigger. For example, Trackmania Nations Forever would run at 3-4 FPS on the K1, but now runs at ~35 FPS.

The performance improvement is 10x in this game

PCIe x4 might bottleneck some games

On heavier 3D titles, GPU usage frequently hits 100%. This may indicate a GPU bottleneck, which may be related to the PCIe x4 slot (via M.2 M-Key) that we use on this hardware. Hopefully a future board will have an x8 or x16 slot.

Zacas is more important than previously thought!

One of the games we tried running on the new hardware was Cuphead. Initially it seemed smoother in the world navigation area, which was expected. But as soon as a stage was entered, the entire game slowed down to 4 FPS. Looking at perf shows us exactly why.

70-80% of execution time spent on a single block?

This is usually a good sign when profiling a game. Looks like it has a clear bottleneck that needs to be optimized. But what could be causing such problems?

Oh… The one instruction we can’t efficiently emulate without a specific extension

You see, RISC-V doesn’t have 128-bit atomics in the base A extension. The Zacas extension introduced the AMOCAS.Q instruction, which performs a 128-bit compare-and-swap. This is equivalent to the CMPXCHG16B instruction you see in the disassembly here. Without the extension, it can’t be emulated efficiently. No hardware comes with this extension, as it was only recently ratified and isn’t mandatory in the RVA23 profile.

In felix86, we would emulate this with a global lock. But what happens when a game has many threads and they all use this instruction? We get excessive locking. In hindsight, this was a naive solution, and we can do better. By hashing the address, we can index an array of spinlocks, which gives us a fast lookup from address to spinlock. This significantly reduces contention while giving us the same atomicity: CAS operations on the same 16B line will contend for the same lock, but operations on different lines won’t, except in the relatively rare case of hash collisions. With this new CMPXCHG16B emulation method, we increased Cuphead’s in-game frame rate from 4 FPS to 25 FPS, a ~6.25x performance improvement.
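The idea of address-hashed spinlocks can be sketched in C++ roughly as follows. This is a minimal illustration and not felix86’s actual code: the names (`U128`, `SpinLock`, `lock_index`, `lock_for`, `emulated_cmpxchg16b`), the table size of 1024 locks, the 64-byte lock padding, and the multiplicative hash are all assumptions made for the example.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

// All names here are hypothetical; this is not felix86's actual code.

struct U128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Assumed table size: 2^10 locks. Must be a power of two for the shift below.
constexpr unsigned kLockBits = 10;
constexpr std::size_t kLockCount = std::size_t{1} << kLockBits;

// Padded to an assumed 64-byte cache line so neighbouring locks don't share one.
struct alignas(64) SpinLock {
    std::atomic<bool> locked{false};

    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) {
            // Busy-wait until the current holder releases the lock.
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

// Maps a guest address to a slot in the lock table.
inline std::size_t lock_index(std::uintptr_t addr) {
    // Drop the low 4 bits so all bytes of one 16-byte line share a lock,
    // then spread the line number with a multiplicative (Fibonacci) hash.
    std::uint64_t line = static_cast<std::uint64_t>(addr) >> 4;
    line *= 0x9E3779B97F4A7C15ull;
    return static_cast<std::size_t>(line >> (64 - kLockBits));
}

inline SpinLock& lock_for(std::uintptr_t addr) {
    static std::array<SpinLock, kLockCount> locks;
    return locks[lock_index(addr)];
}

// Compares the 16 bytes at `mem` with `expected`. On a match, writes `desired`
// and returns true (ZF set on x86). Otherwise copies the current memory value
// into `expected` and returns false, mirroring how CMPXCHG16B updates RDX:RAX.
inline bool emulated_cmpxchg16b(void* mem, U128& expected, const U128& desired) {
    SpinLock& lock = lock_for(reinterpret_cast<std::uintptr_t>(mem));
    lock.lock();
    U128 current;
    std::memcpy(&current, mem, sizeof(current));
    const bool equal = current.lo == expected.lo && current.hi == expected.hi;
    if (equal) {
        std::memcpy(mem, &desired, sizeof(desired));
    } else {
        expected = current;
    }
    lock.unlock();
    return equal;
}
```

Two threads doing a CAS on the same 16-byte line always pick the same lock and serialize, while threads working on different lines only collide when their hashes happen to match. Note that the lock only protects against other emulated CAS operations; ordinary loads and stores to the same address bypass it.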

From 4 FPS in felix86 26.05 to 25 FPS in felix86 26.06 in this game and likely other Unity games

While this is great, hardware support for Zacas will push performance even further, and improve stability in games that use other memory operations on the same address.

And so are unaligned atomics

Unaligned atomics aren’t a remnant of the past. Even modern(-ish) games like God of War (2018) use them frequently. Without hardware support they can’t be emulated both correctly and efficiently, so our emulation may cause instability in games, especially as hardware gets faster.

Unaligned atomics are everywhere!

In a perfect universe, unaligned atomics work even if they span two cache lines, just like in x86. However, even allowing unaligned atomics within a 16-byte boundary (Zama16b) would be better than nothing.
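To make the different cases concrete, here is a small C++ sketch that classifies an atomic access by which boundaries it crosses. The names (`crosses_boundary`, `AtomicClass`, `classify`) are hypothetical, and the 64-byte cache line is an assumption for the example rather than a property of any particular core.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration only.

// True if the access [addr, addr + size) crosses a multiple of `boundary`.
// `boundary` must be a power of two and `size` must not exceed it.
constexpr bool crosses_boundary(std::uintptr_t addr, std::size_t size,
                                std::size_t boundary) {
    return (addr & (boundary - 1)) + size > boundary;
}

enum class AtomicClass {
    Aligned,          // naturally aligned: works with the base A extension
    WithinSixteen,    // misaligned but inside one 16-byte block (the Zama16b case)
    WithinCacheLine,  // crosses a 16-byte boundary but stays in one cache line
    SplitsCacheLine   // spans two cache lines: x86 still treats it as atomic
};

constexpr AtomicClass classify(std::uintptr_t addr, std::size_t size,
                               std::size_t cache_line = 64) {
    if (addr % size == 0) return AtomicClass::Aligned;
    if (!crosses_boundary(addr, size, 16)) return AtomicClass::WithinSixteen;
    if (!crosses_boundary(addr, size, cache_line)) return AtomicClass::WithinCacheLine;
    return AtomicClass::SplitsCacheLine;
}
```

For example, a 4-byte atomic at `0x1001` stays inside one 16-byte block, one at `0x100E` crosses a 16-byte boundary but not a cache line, and one at `0x103E` spans two 64-byte lines.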

Oh and TSO, of course

Faster out-of-order hardware means that games that previously didn’t require TSO emulation on the K1 now do. TSO emulation has performance implications, and enabling it may kill a good amount of the performance gains. RISC-V defines the RVTSO memory model which would work in a fashion similar to x86, and also has a fast-track extension for a dynamic TSO mode called Ssdtso, both of which would help. Apart from those, Zalasr would help us implement something similar to FEX-Emu’s half-barrier TSO mode.
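As a rough illustration of where the cost comes from, the sketch below models one common fence-based way to give guest loads and stores TSO-like ordering on a weakly ordered host: every load is followed by an acquire fence and every store is preceded by a release fence. This is not a description of felix86’s implementation; the helper names (`guest_load32`, `guest_store32`, `Mailbox`, `producer`, `consumer`) are hypothetical, and the C++ fences merely stand in for the barriers a JIT would emit.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helpers; not felix86's actual code.

// Guest load: a plain load followed by a fence that keeps later memory
// accesses from being reordered before it.
inline std::uint32_t guest_load32(const std::atomic<std::uint32_t>& mem) {
    const std::uint32_t value = mem.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return value;
}

// Guest store: a fence that keeps earlier memory accesses from being
// reordered after the store, followed by a plain store.
inline void guest_store32(std::atomic<std::uint32_t>& mem, std::uint32_t value) {
    std::atomic_thread_fence(std::memory_order_release);
    mem.store(value, std::memory_order_relaxed);
}

// Message-passing pattern that x86 code often relies on implicitly.
struct Mailbox {
    std::atomic<std::uint32_t> data{0};
    std::atomic<std::uint32_t> ready{0};
};

// Writes the payload, then raises the flag. Under TSO the two stores
// must become visible in this order.
inline void producer(Mailbox& box, std::uint32_t payload) {
    guest_store32(box.data, payload);
    guest_store32(box.ready, 1);
}

// Returns true and fills `out` once the flag is visible. With the fences
// above, a consumer that sees the flag is guaranteed to see the payload.
inline bool consumer(Mailbox& box, std::uint32_t& out) {
    if (guest_load32(box.ready) == 0) return false;
    out = guest_load32(box.data);
    return true;
}
```

Every guest memory access pays for an extra barrier in this scheme, which is why hardware-level options such as RVTSO, Ssdtso, or the load-acquire and store-release instructions from Zalasr are attractive.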

Enabling TSO is still faster than using the A100 cores

The A100 cores in this board are in-order,...
