How the Community Trained Gemma to "Think" with Tunix and TPUs

Google Developers Blog

MAY 28, 2026

Wei Wei

Developer Advocate

Weiren Yu

Product Manager

Tianshu Bao

Senior Staff Software Engineer

Lance Wang

Software Engineer

Chris Achard

Developer Advocate

Large Language Models (LLMs) often benefit from "thinking" before they speak on complex tasks. Frontier LLMs like Gemini 3 and leading open-weight models like Gemma 4 can produce explicit reasoning traces, commonly called Chain-of-Thought, before answering user questions. But how this reasoning capability is trained is often not disclosed. While many tutorials on the Internet show how to train reasoning for simple verifiable tasks such as math or coding, accessible and easy-to-reproduce training recipes (including data, training strategy, runnable code and evaluations) for general reasoning remain scarce.

This motivated us to hold the "Google Tunix Hack: Train a model to show its work" hackathon on Kaggle, where we challenged developers to transform non-reasoning base models (Gemma-2-2B and Gemma-3-1B) into general reasoning models using Tunix and Kaggle TPUs. The response was overwhelming: over 11,000 entrants and 300+ high-quality submissions showed that decent reasoning training can be done by the community even with a very limited compute budget (Kaggle TPU v5e-8 for 9 hours). In this post, we’ll highlight the techniques used by the winners and share key recipes that allow models to reason across key vertical industries, so you can train your own reasoning models.

Highlighting the Winners: Key Innovations

The winning submissions demonstrated a sophisticated understanding of post-training, combining supervised learning, preference optimization, and reinforcement learning in creative ways.

🥇 1st Place: G-RaR (Rubric-Based Reinforcement Learning)

G-RaR trains Gemma models to produce structured reasoning by combining Supervised Fine-Tuning (SFT) with GRPO, driven by a novel rubric-based LLM-as-judge reward system.

How It Improves Reasoning

The model's reasoning power is improved by explicitly training it to "show its work" inside tags before outputting an answer. The underlying technique (for GRPO), G-RaR (Rubrics as Rewards), uses a larger judge model (Gemma-3-12B) to evaluate the quality of these intermediate logical steps based on task-specific rubrics. By converting discrete rubric scores into continuous, normalized reward signals, the technique provides dense, smooth feedback on the model's logic. This allows the model to keep improving its reasoning without relying solely on exact-match correctness, making it highly effective even for open-ended, non-verifiable tasks.

Technical Solution

The team used a two-stage post-training pipeline:

Stage 1 (SFT): The Gemma-2-2B-IT model is fine-tuned via LoRA on a ~33k sample dataset to establish a baseline. This "warm start" teaches the model to reliably output the ...... structure.

Stage 2 (GRPO): The model is then refined using GRPO, based on a composite reward function (Format Reward + Exact Answer Reward + G-RaR Score). To overcome compute constraints, the team used a split-mesh architecture on a single Kaggle TPU v5e-8, placing the policy/reference models on one mesh and the judge model on the other for true parallel execution.
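To make the composite reward more concrete, here is a minimal sketch of how discrete rubric scores could be turned into a normalized reward and combined with format and exact-answer terms. This is not the winning team's code: the rubric criteria, the weights, the `reasoning`/`answer` tag names and every function name are assumptions made for illustration. The last helper shows the standard group-relative normalization that GRPO applies to the rewards of completions sampled for the same prompt.

```python
import re
from statistics import mean, pstdev

# Hypothetical rubric: criterion -> (weight, max score on the judge's discrete scale).
# The weights sum to 1 so the rubric reward stays in [0, 1].
RUBRIC = {
    "logical_steps": (0.5, 4),
    "relevance": (0.3, 4),
    "clarity": (0.2, 4),
}


def rubric_reward(judge_scores):
    """Map discrete per-criterion judge scores to one continuous reward in [0, 1]."""
    total = 0.0
    for name, (weight, max_score) in RUBRIC.items():
        # Clamp to the valid range in case the judge returns an out-of-scale value.
        score = min(max(judge_scores.get(name, 0), 0), max_score)
        total += weight * score / max_score
    return total


def format_reward(completion, tag="reasoning"):
    """1.0 if the completion has a non-empty <tag>...</tag> block (tag name is assumed)."""
    pattern = rf"<{tag}>.+?</{tag}>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0


def exact_answer_reward(completion, reference, tag="answer"):
    """1.0 if the text inside <tag>...</tag> equals the reference answer."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0


def composite_reward(completion, reference, judge_scores, weights=(0.2, 0.3, 0.5)):
    """Weighted sum of format, exact-answer and rubric rewards (weights are assumed)."""
    w_format, w_exact, w_rubric = weights
    return (
        w_format * format_reward(completion)
        + w_exact * exact_answer_reward(completion, reference)
        + w_rubric * rubric_reward(judge_scores)
    )


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the rubric term varies smoothly with the judge's scores, two completions that both miss the exact answer can still receive different rewards, which is what gives the policy a learning signal on open-ended prompts.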

🥈 2nd Place: Pinocchio-1B (Creating a Reasoning Model in 3 Acts)

Evolving a 1B parameter model into a structured reasoning engine ("Pinocchio") via a highly efficient, 9-hour TPU pipeline (SFT → SimPO → GRPO).

How it Improves Reasoning

The model learns to generate a structured trace before answering, shifting from basic pattern matching to logical deduction. This is built sequentially: SFT instills foundational Chain-of-Thought, SimPO locks in strict formatting (preventing verbosity hacks), and GRPO refines logic by using an LLM-as-a-Judge to reward coherence and heavily penalize hallucinations.

Technical Solution

The pipeline consists of three stages:

SFT (Distillation): Trained on 70k prompts using an OSS-120B teacher model and a Gemini task-router.

SimPO (Alignment): Replaced memory-heavy DPO to efficiently enforce strict XML formatting.

GRPO (Refinement): Used Gemini 2.0 Flash as an asynchronous judge to dynamically reward accuracy, logic, and format.
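For readers unfamiliar with SimPO, the sketch below shows the published SimPO objective for a single preference pair in plain Python. It is an illustration under stated assumptions, not the team's implementation: the function name and the default `beta` and `gamma` values are placeholders. SimPO uses the length-normalized log-probability of each response as its implicit reward and needs no reference model, which is where the memory saving over DPO comes from.

```python
import math


def simpo_loss(chosen_token_logps, rejected_token_logps, beta=2.0, gamma=1.0):
    """SimPO loss for one preference pair.

    Each argument is the list of per-token log-probabilities the policy assigns
    to the chosen or rejected response. beta scales the implicit reward and
    gamma is the target margin; both defaults are illustrative assumptions.
    """
    # Length normalization: use the average token log-probability, so longer
    # responses are not favored just for accumulating more tokens.
    reward_chosen = beta * sum(chosen_token_logps) / len(chosen_token_logps)
    reward_rejected = beta * sum(rejected_token_logps) / len(rejected_token_logps)
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)) equals softplus(-margin); this form is numerically stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Because the reward is an average over tokens, padding a response with extra tokens of similar likelihood does not raise its reward, which fits the "preventing verbosity hacks" goal described above.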

Customizing Tunix: The team explicitly extended the Tunix library to support this workflow by:

Injecting a custom SimPO loss function (with length normalization) into the DPOTrainer.

Creating a high-throughput, asynchronous evaluation engine to process GRPO reward signals on the fly.
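The asynchronous evaluation idea can be sketched with Python's standard `asyncio` module. Everything here is hypothetical: `fake_judge` stands in for a remote LLM judge call and uses a toy word-count score, and `score_batch` and `compute_rewards` are illustrative names, not Tunix APIs. The point is only that judge requests for a batch of completions can be in flight concurrently, with a cap on concurrency, rather than being scored one at a time.

```python
import asyncio


async def fake_judge(completion):
    """Stand-in for a remote LLM judge; the scoring rule is a toy assumption."""
    await asyncio.sleep(0.01)  # simulates network latency of a real judge call
    return min(len(completion.split()) / 10.0, 1.0)


async def score_batch(completions, judge=fake_judge, max_concurrency=4):
    """Score all completions concurrently, limiting the number of in-flight calls."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score_one(text):
        async with semaphore:
            return await judge(text)

    # gather preserves input order, so rewards line up with completions.
    return await asyncio.gather(*(score_one(c) for c in completions))


def compute_rewards(completions):
    """Synchronous entry point that a reward function could call."""
    return asyncio.run(score_batch(completions))
```

With a real judge, total wall-clock time for a batch approaches the latency of the slowest few calls instead of the sum of all calls, which matters under a fixed 9-hour budget.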

🥉 3rd Place: IDEA-E Distillation with Curriculum Guided GRPO Training

Distilling the structured "IDEA-E" ethical reasoning framework into a 2B model using curriculum-guided GRPO and a fast TF-IDF reward system.
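The post does not say how the TF-IDF reward is computed. One plausible reading, shown here purely as an assumption, is to score a completion by the cosine similarity between its TF-IDF vector and that of a reference response. The sketch uses only the standard library; all function names and the smoothed IDF formula are illustrative choices, not the team's method.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency computed over a reference corpus."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms missing from the IDF table are ignored."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def tfidf_reward(completion, reference, idf):
    """Reward in [0, 1]: TF-IDF cosine similarity between completion and reference."""
    return cosine(tfidf_vector(completion, idf), tfidf_vector(reference, idf))
```

A reward of this kind needs no model call at all, so it can be evaluated for every sampled completion at negligible cost, which is consistent with the "fast" description above.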

Why it Improves Reasoning

The IDEA-E scaffold forces the model through a...
