Training a small model to write better OCaml with RLVR and GRPO - nilenso blog
Kiran Gangadharan
For a while now, I’ve been interested in exploring the capabilities of small language models. When my colleague Atharva introduced me to RLVR and GRPO for doing RL training without a human feedback loop, I wanted to know more.
In the previous post, we explored the workings of RLVR and GRPO. In this post, I'll walk through a code-generation experiment where I trained a small 1.5B-parameter model with GRPO and improved its ability to generate correct and valid OCaml code, and I'll share what I learned along the way.
It started with a simple hypothesis:
“Can training a small local model on OCaml code generation using RLVR and GRPO give me a much better model that can help me explore the language and write better OCaml code?”
The small model that I eventually chose was trained on public GitHub repositories across 92 programming languages. Since OCaml is a relatively niche language in this mix, the setting was a good fit for evaluating how well RLVR could improve the model’s capabilities.
I had some idea about training models, but not enough to make all the design decisions. I needed an anchor to get started.
Constraints keep you focused
Given the broad spectrum of decisions to make, I wanted to define some requirements upfront to keep things simple and focused.
Local inference: train a small model that can run on my M2 Pro with 16 GB RAM. After some testing, I decided to use Qwen2.5-Coder-1.5B-Instruct as the base model. It performed best on my small eval dataset among similar models in that class. It also had relatively better knowledge of OCaml syntax and could solve trivial problems.
Single GPU: train the model on a single rented GPU. I decided to use an RTX 6000 with 48 GB of VRAM. Full fine-tuning would have been feasible but tight once you account for the model weights, gradients, optimizer states, activations, and GRPO rollouts. LoRA made it more practical by training only ~37M adapter parameters, significantly reducing training memory. It also gave me more room to experiment with different training configurations.
Fast dev feedback loop: sanity test the training code on my Mac quickly before the actual run. I was dabbling with Nix at the time and used it to create the platform-specific environments. I eventually used it for training in production as well.
Training and test dataset: use a small dataset of programming problems with varied difficulty for both training and evaluation. I could not find a good OCaml dataset to use, so I ended up porting a subset of the AceCode-87K dataset, including its tests, to OCaml using Claude with some programmatic verification steps. The evaluation dataset was built from a combination of 99lisp, LeetCode, and AceCode problems that varied in the concepts covered and the reasoning effort required.
Training loop: work with simple abstractions for the actual training loop. Instead of writing a custom GRPO implementation, I used Hugging Face's trl library because it already supported GRPOTrainer with PEFT/LoRA integration, working with Hugging Face datasets, and useful logging hooks.
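To make the single-GPU memory argument concrete, here is a rough back-of-the-envelope sketch. The byte sizes are my assumptions (bf16 weights and gradients, two fp32 Adam moments per trainable parameter), not measurements from the actual run, and the estimate deliberately leaves out activations and GRPO rollouts, which are what make the full fine-tuning case tight.

```python
# Rough training-memory estimate. Every constant here is an assumption
# for illustration, not a number measured during the real training run.
PARAMS = 1.5e9       # approximate parameter count of the base model
LORA_PARAMS = 37e6   # trainable adapter parameters mentioned above
GIB = 1024 ** 3


def training_state_gib(total_params, trainable_params,
                       weight_bytes=2, grad_bytes=2, adam_bytes=8):
    """Memory for weights, gradients, and Adam moments, in GiB.

    Assumes bf16 weights and gradients (2 bytes each) and two fp32 Adam
    moments (8 bytes total) per trainable parameter. Activations and
    rollout buffers are excluded.
    """
    weights = total_params * weight_bytes
    grads = trainable_params * grad_bytes
    optimizer = trainable_params * adam_bytes
    return (weights + grads + optimizer) / GIB


full_ft = training_state_gib(PARAMS, PARAMS)
lora = training_state_gib(PARAMS, LORA_PARAMS)
print(f"full fine-tuning: ~{full_ft:.1f} GiB before activations")
print(f"LoRA:             ~{lora:.1f} GiB before activations")
```

Under these assumptions the static training state is roughly 17 GiB for full fine-tuning versus roughly 3 GiB with LoRA, which leaves far more of the 48 GB for activations, rollouts, and larger generation groups.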
Reward functions shape the learning trajectory
In the previous blog post about RLVR and GRPO, we saw that reward design is central to how the model learns. Because many OCaml solutions failed early, a binary pass/fail reward provided little to no information for learning. I ended up using a graduated reward system that recognized nuanced progress across type-check, compilation, and test phases while penalizing degenerate completions.
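A minimal sketch of what such a graduated reward could look like is below. The `CheckResult` type, the `graduated_reward` function, the weights, and the penalty value are all hypothetical and chosen for illustration; they are not the values used in the experiment. The point is only that a completion that type-checks but fails its tests scores higher than one that does not parse at all.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical outcome of running one completion through the toolchain."""
    type_checks: bool
    compiles: bool
    tests_passed: int
    tests_total: int
    is_degenerate: bool = False  # e.g. empty output or repeated code blocks


def graduated_reward(r: CheckResult) -> float:
    """Give partial credit per phase. Weights are illustrative assumptions."""
    if r.is_degenerate:
        return -0.5  # penalize degenerate completions outright
    reward = 0.0
    if r.type_checks:
        reward += 0.2
    if r.compiles:
        reward += 0.2
    if r.compiles and r.tests_total > 0:
        # Remaining credit scales with the fraction of tests passed.
        reward += 0.6 * (r.tests_passed / r.tests_total)
    return reward


# Three completions for the same prompt now get distinguishable rewards.
broken = graduated_reward(CheckResult(False, False, 0, 4))
typed_only = graduated_reward(CheckResult(True, False, 0, 4))
half_passing = graduated_reward(CheckResult(True, True, 2, 4))
print(broken, typed_only, half_passing)
```

With a binary reward, all three of these completions would score zero and GRPO would have nothing to compare within the group.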
The knobs that shaped learning
Since the previous post already explained the mechanics of GRPO, I’ll skip the parameter glossary here. In this section, we’ll look at a small set of parameters that impacted model learning more than the rest:
Number of generations per prompt controlled how much relative signal GRPO had. Too few samples made most groups look identical; more samples improved comparison but increased rollout cost.
Sampling temperature and top_p controlled exploration. More randomness helped when the model was failing early, but too much produced noisy OCaml and made rewards harder to interpret.
KL penalty kept the trained model from drifting too far from the base model’s capabilities. Too much slowed learning; too little made collapse easier.
Gradient clipping helped keep updates bounded as reward variance increased. Without it, training was more prone to sudden instability.
Completion length affected both cost and behavior. Longer completions gave the model more tokens to solve problems, but also made rambling and code-block spam more likely.
LoRA rank and target modules controlled how much the model could adapt while keeping the run feasible on a single GPU.
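The point about the number of generations per prompt is easier to see with a small sketch of group-relative advantages. This is one common formulation (standardize each reward against its group's mean and standard deviation), written with the standard library only; the function name and the `eps` value are my assumptions, and trl's internal implementation may differ in details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Standardize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every completion fails the same way carries no signal.
identical = group_advantages([0.2, 0.2, 0.2, 0.2])

# A group with varied outcomes tells the model which samples to favor.
mixed = group_advantages([0.0, 0.2, 0.4, 1.0])

print(identical)
print(mixed)
```

When all rewards in a group match, every advantage is zero and that prompt contributes no gradient. Sampling more completions, or sampling them with more randomness, raises the chance that at least one differs, at the cost of more rollouts per step.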
These knobs were tightly coupled. Increasing exploration could improve reward diversity, but it also made training less stable. Increasing stability reduced...