What if we made SIMA2 from Temu | Frisson Labs Blog
At Frisson Labs, we want AI companions that are actually fun to play with. Playing the game is the first step. Before an agent can feel present, useful, or worth keeping around, it has to handle the basics of being in the world.
That means it has to:
read the screen<br>figure out what to do<br>recover when it gets stuck
SIMA2 is the best public example of that direction: agents that learn from human play, act from screen observations, use keyboard/mouse controls, and use reasoning for longer-term goals.
Google DeepMind's SIMA 2 overview video.
We are a small lab, so we are not going to outspend Google on game-agent training. The part of SIMA2 we cared about was the shape: it looks at the screen, acts through keyboard/mouse controls, uses learned skills, reasons about goals, and improves from experience. We might not be able to copy the DeepMind training loop, but what if we built the "SIMA2 from Temu" in Roblox and saw what the cheap stack could teach us?
For our "Temu" SIMA2, we approximated:
Screen observations: Roblox screenshots fed into the fast action loop and slower reasoning loop<br>Keyboard/mouse control: timed action sequences representing key presses and mouse movement<br>Fast action loop: Gemini 3.1 Flash Lite planning the next few seconds<br>Slower reasoning loop: Gemini 2.5 Pro updating goals and failures, then directing the fast action loop<br>Learned skills: named behaviors like explore_open_path and recover_if_stuck<br>Learning loop: a recording overlay showing the plan, action batch, and actual keys, so each run could teach us what to patch next
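The list above boils down to two loops running at different rates. Here is a minimal sketch of that shape, with the model calls stubbed out as plain callables. The names `run_agent`, `fast_plan`, `slow_reason`, and the every-fourth-batch cadence are assumptions for illustration, not our actual code, and a real version would likely run the slow loop concurrently rather than inline.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    """What the slow loop hands to the fast loop (hypothetical shape)."""
    goal: str = "explore"
    failures: List[str] = field(default_factory=list)


def run_agent(
    capture: Callable[[], str],
    fast_plan: Callable[[str, AgentState], dict],
    slow_reason: Callable[[str, AgentState], AgentState],
    execute: Callable[[dict], None],
    batches: int,
    reason_every: int = 4,  # assumed cadence, not a number from the post
) -> AgentState:
    state = AgentState()
    for i in range(batches):
        frame = capture()
        # Slow loop: occasionally revise the goal and the failure notes.
        if i % reason_every == 0:
            state = slow_reason(frame, state)
        # Fast loop: plan the next few seconds under the current goal.
        execute(fast_plan(frame, state))
    return state
```

The point of the split is that the expensive model only has to be right occasionally, while the cheap model keeps the avatar moving between updates.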
The game and the run
I picked Slime RNG because it was trending on Roblox and the gameplay was idiot-proof. You click the roll button, the game AFK farms for you, and eventually you unlock the next area. It came with training wheels, so I figured it would be a good place to start. That made the agent's job pretty easy on paper: ignore the gate it could not afford, find slimes or coins, and keep moving.
In this run, the Canyon gate cost 216M coins and the agent had 21.7M. It did some real player-ish things: rolled a slime, avoided the gate, and even bunny-hopped around like a person at a keyboard. Then a lot of the run became the actual problem: it kept drifting the wrong way, getting pinned, and trying to walk through a tree.
The latest 60-second Roblox run with planner notes, Flash batches, and key instructions overlaid.
We used ffmpeg to record our sessions and add debug logs over the video: what the slower reasoning loop wanted, what the fast action loop chose, which keys fired, and where the body walked into the tree anyway. For this run, the details are below.
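One cheap way to get timed debug text onto a recording is to write the log out as a subtitle file and let ffmpeg burn it in. The sketch below only shows the log-to-SRT step. The `(start_ms, end_ms, text)` log shape and the function names are assumptions for illustration, and this is not necessarily how our overlay is built.

```python
from typing import Iterable, Tuple


def srt_timestamp(ms: int) -> str:
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def build_srt(entries: Iterable[Tuple[int, int, str]]) -> str:
    """Turn (start_ms, end_ms, text) log entries into SRT cues."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(entries, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n"
            f"{text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

Given a file written this way, an ffmpeg build with subtitle support can render it over the video with something like `ffmpeg -i run.mp4 -vf subtitles=debug.srt overlay.mp4` (file names here are placeholders).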
Run: all 13 action batches finished<br>Fast loop: 9 Flash movement batches<br>Planner: 3 Pro strategy updates<br>Motion: 7 human movement skills<br>Recovery: 6 long wall turns<br>Latency: 3.5s average Flash call
We also did the least scientific evaluation possible: watched replays as a team and argued about how human the movement felt. There is no clean rubric here. "Looks like a player" is a mix of timing, intent, hesitation, recovery, and the weird keyboard habits people have without noticing. P.S. we liked the bunny hops.
Where we started
The first version was painfully literal:
screenshot -> Gemini 3.1 Flash Lite -> one keyboard command -> local executor -> next screenshot
It technically worked, but the avatar's movement looked awful: the loop would look at the screen, wait around three seconds, press a key, then wait for the next move. So we asked Gemini for a three-second action sequence instead of a single keypress, which let the avatar keep moving while the next request ran.
{<br>"sequence_id": "move_to_path",<br>"ttl_ms": 4000,<br>"interruptible": true,<br>"actions": [<br>{ "type": "key_hold", "key": "w", "start_ms": 0, "duration_ms": 3000 },<br>{ "type": "key_hold", "key": "d", "start_ms": 100, "duration_ms": 900 },<br>{ "type": "key_tap", "key": "space", "start_ms": 1800, "duration_ms": 140 }<br>]<br>}
That small change helped a lot in our vibe evals. Continuous movement made the avatar feel more player-like, but the movement was still obviously bot-like.
Making movement feel natural
Prompting the model to "move like a human" was basically useless. It gave us clean movement, and clean movement looks fake. People playing Roblox are constantly doing dumb little keyboard things: feathering A/D, jumping too much, overcorrecting, holding W while they think.
So we just recorded ourselves playing and turned the raw key down/up events into reusable movement skills:
human_forward_left_jump_chain_zigzag_00<br>human_forward_left_jump_chain_zigzag_01<br>human_forward_right_jump_chain_zigzag_02
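Roughly, the conversion pairs each key-down with its matching key-up and rewrites timestamps relative to the start of the clip. This is a minimal sketch under assumptions: the `(timestamp_ms, kind, key)` event shape and the `events_to_skill` name are made up for illustration, and the output reuses the `key_hold` format from the action sequences shown earlier.

```python
from typing import Dict, List, Tuple

RawEvent = Tuple[int, str, str]  # (timestamp_ms, "down" | "up", key)


def events_to_skill(name: str, raw: List[RawEvent]) -> Dict:
    """Convert a raw key log into a named list of timed key holds."""
    if not raw:
        return {"sequence_id": name, "actions": []}
    raw = sorted(raw, key=lambda e: e[0])
    t0 = raw[0][0]  # the clip starts at the first recorded event
    pressed: Dict[str, int] = {}
    actions = []
    for ts, kind, key in raw:
        if kind == "down":
            # OS key repeat can emit extra "down" events; keep the first.
            pressed.setdefault(key, ts)
        elif kind == "up" and key in pressed:
            start = pressed.pop(key)
            actions.append({
                "type": "key_hold",
                "key": key,
                "start_ms": start - t0,
                "duration_ms": ts - start,
            })
    # Keys still held when the recording ends are dropped in this sketch.
    actions.sort(key=lambda a: a["start_ms"])
    return {"sequence_id": name, "actions": actions}
```

Keeping the human timings untouched is the whole point: the overlapping holds and slightly-too-long jumps are what make the clip read as a person.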
Gemini picked the kind of movement it wanted, and the local executor handled the timing. The movement skills looked way more player-ish than the sequences Gemini made from scratch.
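"Handled the timing" mostly means flattening a sequence into an ordered list of key-down and key-up events that a scheduler can fire. A minimal sketch, assuming the `key_hold` / `key_tap` format shown earlier and treating `ttl_ms` as a hard cutoff. The `expand_sequence` name and the cutoff behavior are assumptions, not a description of our executor.

```python
from typing import Dict, List, Tuple

Event = Tuple[int, str, str]  # (time_ms, "down" | "up", key)


def expand_sequence(sequence: Dict) -> List[Event]:
    """Flatten timed key actions into a sorted down/up event list."""
    ttl_ms = sequence["ttl_ms"]
    events: List[Event] = []
    for action in sequence["actions"]:
        if action["type"] not in ("key_hold", "key_tap"):
            continue  # unknown action types are skipped in this sketch
        start = action["start_ms"]
        if start >= ttl_ms:
            continue  # would begin after the sequence has expired
        # Never hold a key past the sequence's time-to-live.
        end = min(start + action["duration_ms"], ttl_ms)
        events.append((start, "down", action["key"]))
        events.append((end, "up", action["key"]))
    # Sort by time; at equal times, release keys before pressing new ones.
    events.sort(key=lambda e: (e[0], e[1] != "up"))
    return events
```

A real executor would walk this list against a clock and send each event to the OS input layer, dropping the remaining events if a newer interruptible sequence arrives.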
Building a brain to get unstuck
The movement got better, but the agent still kept finding walls. It would run into a...