Show HN: Lazarus, a coding agent for long-horizon tasks

Sai_Praneeth1 pts0 comments

I have been interested in long-horizon coding tasks for a while, especially with benchmarks like FrontierSWE, where even the best coding agents like Codex and Claude Code struggle to complete tasks.These agents come with a collection of tools like bash, file edits, grep, glob, etc.Lazarus takes a different approach. The idea is to give the model exactly one tool: a persistent Python runtime.Model writes Python code, executes it, and receives stdout/stderr. Through Python it inspects repos, reads and edits files, runs builds, executes tests, invokes linters, even build custom harnesses and automate whatever workflows it needs.The motivation for this was: - Tool selection itself is a planning problem.- Specialized tools are often difficult to compose together efficiently.- Long-horizon tasks frequently require custom workflows that predefined tools don t provide.- Python is expressive enough for the model to build those workflows itself.Another decision is avoid agent hierarchies. Lazarus runs a single tool-calling loop rather than managers, planners, and worker agents.The intuition being current models are much better at writing code than coordinating fleets of agents. Agent orchestration consumes context, introduces extra modes of failure, and adds complexity.How does Lazarus manage context? When the usable context window of a model is nearly exhausted, the model gets one final opportunity to execute a Python tool call, containing anything it wants to preserve: notes, plans, functions, summaries, partial results, etc.The loop is then restarted with only:- The original user task- The carryover cell- The carryover cell s outputThis allows the agent to periodically compress its own state and continue working without requiring an ever-growing context window.I evaluated Lazarus on two FrontierSWE tasks: - git-to-zig (rewriting git in zig) - dart-style-haskell (rewriting dart-style formatter in haskell)The runs with scores are available here: https://github.com/ExpressGradient/frontier-swe-lazarus-runsUsing GPT-5.5 at medium reasoning effort, Lazarus achieved scores comparable to reported GPT-5.5 in Codex with xhigh reasoning.The runs were not completed to exhaustion, I stopped them because I ran out of OpenAI credits. So I suspect there is still room for improvement from longer runtimes and higher reasoning.The project is still early, but the results made me wonder whether coding agents have become over-specialized around tool collections and orchestration systems, while under-investing in giving models a programmable environment they can shape themselves.Lazarus: https://github.com/ExpressGradient/lazarus

lazarus runs tasks agents model tool

Related Articles