Composer 2.5

meetpateltech1 pts0 comments

Introducing Composer 2.5 · Cursor

Product →<br>Enterprise

Pricing

Resources →

Sign inContactContact salesDownload

Blog / research

Composer 2.5 is now available in Cursor.

It's a substantial improvement in intelligence and behavior over Composer 2. It is better at sustained work on long-running tasks, follows complex instructions more reliably, and is more pleasant to collaborate with.

We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods.

In addition to training Composer 2.5 on more difficult tasks, we improved behavioral aspects of the model like communication style and effort calibration. These dimensions are not well captured by existing benchmarks, but we find that they matter for real-world usefulness.

Composer 2.5 is built on the same open-source checkpoint as Composer 2, Moonshot's Kimi K2.5.

Together with SpaceXAI, we're training a significantly larger model from scratch, using 10x more total compute. With Colossus 2's million H100-equivalents and our combined data and training techniques, we expect this to be a major leap in model capability.

#Try Composer 2.5

Composer 2.5 is priced at $0.50/M input and $2.50/M output tokens.

There's also a faster variant with the same intelligence at $3.00/M input and $15.00/M output tokens, a lower cost than the fast tiers of other frontier models. Similar to Composer 2, fast is the default option. See our model docs for full details.

Composer 2.5 includes double usage for the first week.

#Training Composer 2.5

Composer 2.5 contains several new improvements to our training stack. These changes target both model intelligence and usability.

#Targeted RL with textual feedback

Credit assignment during RL is becoming an increasingly difficult challenge as rollouts can span hundreds of thousands of tokens. When a reward is computed over an entire rollout, it may be hard for the model to tell which specific decision helped or hurt the outcome. This is especially limiting when we want to discourage a localized behavior, such as a bad tool call, a confusing explanation, or a style violation. The final reward can tell us that something went wrong, but it is a noisy signal for where it went wrong.

To address this, we trained Composer 2.5 with targeted textual feedback.1 The idea is to provide feedback directly at the point in the trajectory where the model could have behaved better. For a target model message, we construct a short hint describing the desired improvement, insert that hint into the local context, and use the resulting model distribution as a teacher. We use the policy with the original context as the student and add an on-policy distillation KL loss that moves the student's token probabilities toward the teacher's. This gives us a localized training signal for the behavior we want to change, while still retaining the broader RL objective over the full trajectory.

As an illustration of the text feedback process, consider a long rollout that includes a tool call error where the model attempts to call a tool that is not available. During the rollout, the model will receive a “Tool not found” error and continue making additional valid tool calls. The fact that it hit one error in the process of hundreds of tool calls will have a minimal impact on its final reward.

With text feedback, we can target this specific mistake by inserting a hint in the context of the problematic turn, such as “Reminder: Available tools…” with a list of available tools. This hint changes the probabilities for the teacher, lowering those for the wrong tool and increasing those for a valid replacement. For that turn only, we then update the student weights towards to the new probabilities.

During the Composer 2.5 run, we applied this method to a variety of model behaviors, from coding style to model communication.

#Synthetic data

During RL training, Composer's coding ability improves substantially to the point where it begins to get most training problems correct. To continue increasing intelligence, we both select for and create harder tasks dynamically throughout the run. Composer 2.5 is trained with 25x more synthetic tasks than Composer 2.

We use a range of approaches for creating synthetic tasks that are grounded in real codebases. For example, one synthetic approach is feature deletion. For these tasks the agent is given a codebase with a large set of tests, and asked to delete code and files in such a way that the codebase remains functional while specific testable features are removed. The synthetic task is to reimplement the feature, and the tests are used as a verifiable reward.

One downstream consequence of large scale synthetic task creation is that it can cause unexpected reward hacking. As the model became more adept, Composer 2.5 was able to find increasingly sophisticated workarounds to solve the task at hand. In one example, the model found a leftover Python type-checking cache and...

composer model training tool tasks synthetic

Related Articles