Testing MiniMax M2.7 via API on three real ML and coding workflows

Andrey Lukyanenko

18 May 2026

I recently got access to some MiniMax M2.7 API credits, so I decided to plug this model directly into Claude Code and run it on three workflows I do regularly. The same tasks were run using Claude Opus 4.7 as the comparison baseline.

The three workflows: scaffolding an entry for an active Kaggle competition, drafting and auditing knowledge-base notes for my Obsidian vault, and updating an old PyTorch project that had fallen out of date. I wanted to find out how well M2.7 works inside an agentic loop when the task has clear boundaries. The results were consistent across the three runs: M2.7 was useful when the constraints were explicit and the output format was concrete. It stumbled when important context was left implicit, though some of the same gaps appeared with Opus 4.7 as well.

For the more open-ended cases, I would still keep a human review pass in the loop.

Setup

I added a claude-mm command that points Claude Code at the MiniMax API and ran M2.7 with thinking set to max in the Claude Code interface. I ran on MiniMax’s Plus tier (High-Speed, $40/month), where the context window and per-day throughput were no longer bottlenecks for multi-step agentic work.

claude-mm() {
  ANTHROPIC_BASE_URL="https://api.minimax.io/anthropic" \
  ANTHROPIC_AUTH_TOKEN="$MINIMAX_API_KEY" \
  ANTHROPIC_MODEL="MiniMax-M2.7" \
  ANTHROPIC_DEFAULT_SONNET_MODEL="MiniMax-M2.7" \
  ANTHROPIC_DEFAULT_OPUS_MODEL="MiniMax-M2.7" \
  ANTHROPIC_DEFAULT_HAIKU_MODEL="MiniMax-M2.7" \
  ANTHROPIC_SMALL_FAST_MODEL="MiniMax-M2.7" \
  API_TIMEOUT_MS="3000000" \
  CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC="1" \
  claude "$@"
}

In agentic work, the harness can be as important as the model itself. Most of the failures I describe below had a similar cause: the prompt did not explicitly state a constraint the task depended on, and the model filled the gap with a plausible default. In practice, model quality and harness design are hard to separate. A stronger model may infer missing constraints; a better harness may make those constraints explicit. I treated this as a workflow test, not a pure model benchmark.

Refactoring an old PyTorch project

The first workflow was a refactor: my pytorch_tempest repo is a framework for training neural nets using Hydra + PyTorch Lightning. I wanted to update dependencies, modernize the tooling, and clean up the code issues that had accumulated over time. The merged result is the PR "refactoring old code and updating dependencies".

The changes:

Updated CI versions and pre-commit hooks.

Replaced black and flake8 with ruff for both linting and formatting.

Enabled fsdp_sharding_strategy in the Lightning trainer config.

Refreshed the documentation.

Added uv for environment management.

Switched to modern Python typing (list[X] over List[X], X | None over Optional[X]).

Removed duplicate code paths.

Fixed a lot of small issues.

I guided M2.7 explicitly: I provided step-by-step requirements (“switch black + flake8 to ruff”, “update the pre-commit config”), reviewed each change before moving to the next, and gave feedback when the diff went outside scope. I had enough tests to check whether anything broke after the changes, and rerunning model training took only a few minutes. I ran into some problems with CI, and the agent helped me fix them one by one.

A lot of engineers I know do not want to give an agent free rein over a codebase they care about; they want to supervise the execution and know every existing line of code. M2.7 fits this approach well. You can write short, narrow-scope prompts, conduct line-level review, and then move to the next step.

Knowledge notes for the Obsidian vault

The second workflow was writing and auditing notes for my Obsidian vault, where I keep my ML reference notes. I write most of them by hand; sometimes I have an LLM draft a parallel version to compare against and take inspiration from.

It is important to remember that different models prefer different prompt styles. A 100-line prompt tuned for Opus 4.7 does not transfer one-to-one to M2.7. To handle that, I did a small bootstrap: I asked both models to generate notes from the same starting prompt, then asked M2.7 to read both notes and propose an improved prompt for itself. The next iteration used the M2.7-tuned prompt.
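One round of that bootstrap could be sketched roughly as follows. This is an illustration under assumptions, not the setup I actually ran: the Model type, the bootstrap_prompt function, and the wording of the meta-prompt are all hypothetical, and each model is represented as a plain callable from prompt text to completion text so the sketch stays independent of any particular API client.

```python
from typing import Callable

# Hypothetical abstraction: a model is any callable mapping a prompt to a completion.
Model = Callable[[str], str]


def bootstrap_prompt(start_prompt: str, topic: str, baseline: Model, target: Model) -> str:
    """Run one bootstrap round and return the prompt the target model proposes for itself."""
    task = f"{start_prompt}\n\nTopic: {topic}"

    # Step 1: both models draft a note from the same starting prompt.
    baseline_note = baseline(task)
    target_note = target(task)

    # Step 2: the target model reads both notes and proposes an improved prompt.
    # The wording below is an assumption made for illustration.
    meta_prompt = (
        "Here is the prompt both notes were generated from:\n"
        f"{start_prompt}\n\n"
        f"Note A (baseline model):\n{baseline_note}\n\n"
        f"Note B (your own draft):\n{target_note}\n\n"
        "Compare the two notes and propose an improved prompt for yourself. "
        "Return only the new prompt."
    )
    return target(meta_prompt)
```

In this sketch, baseline would wrap Opus 4.7 and target would wrap M2.7; the returned string becomes the starting prompt for the next iteration.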

I used two prompts (a writer command and a critic agent), each around 100 lines. Here is a condensed version of the first one:

Fill one broken-link stub in the DSWoK vault: research the topic, draft the note in DSWoK voice, run draft-critic-mm, save to the right folder.

1. Read context: writing style guide, frontmatter taxonomy, alias rule.
2. Pick the stub.
3. Locate references: Grep for [[]] across the vault.
4. Pick the destination folder based on topical group.
5. ...
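For readers unfamiliar with broken-link stubs, the "pick the stub" and "locate references" steps amount to finding wikilinks whose target note does not exist yet. A minimal Python sketch of that idea follows. The function name find_stub_candidates is hypothetical, and the sketch assumes links are written by note name (not by folder path) and ignores frontmatter aliases, so a real vault would need more care. In the actual workflow the agent does this with Grep, not with a script.

```python
import re
from pathlib import Path

# Matches [[Target]], [[Target|alias]] and [[Target#heading]]; captures only the target name.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:[#|][^\]]*)?\]\]")


def find_stub_candidates(vault: Path) -> dict[str, list[str]]:
    """Map each link target with no matching .md file to the notes that reference it."""
    # Compare note names case-insensitively (an assumption about how links resolve).
    existing = {p.stem.lower() for p in vault.rglob("*.md")}
    missing: dict[str, list[str]] = {}
    for note in sorted(vault.rglob("*.md")):
        text = note.read_text(encoding="utf-8")
        for match in WIKILINK.finditer(text):
            target = match.group(1).strip()
            if target.lower() not in existing:
                missing.setdefault(target, []).append(note.name)
    return missing
```

The list of referencing notes for each missing target is what gives the writer enough context to choose a destination folder and match the surrounding notes.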
