Will the future be millions of agents running tasks every day?

GitHub - wilmanrojas/sinqua · GitHub

agent-runtime-bench

A controlled, apples-to-apples benchmark of agent runtimes — the orchestration layer that drives an LLM through a write → execute → self-correct loop — across C++, Python, TypeScript, and Rust.

Why this matters

When people compare "coding agents" they almost always compare the model (pass@1 on HumanEval, SWE-bench, etc.). But in production the model runs behind a runtime: the code that fans out hundreds of agents, streams tokens, spawns test processes, retries on failure, and tracks state. That runtime, not the model, decides:

Memory footprint when you run 100+ agents at once,

Concurrency ceiling and tail behavior under load,

Overhead added on top of model latency.

These costs dominate the bill once agents move to scale, yet there is no controlled cross-language comparison of agent runtimes. Published numbers aren't comparable: different hardware, different model, different framework. This project fixes the variables — same tasks, same model, same hardware, same loop logic — and changes only the language runtime, so the runtime's cost is isolated and measurable.

The workload

HumanEval (first 100 problems). For each task the runtime runs a real agentic loop, not one-shot codegen:

build prompt (spec + pytest)
  → LLM completion (streamed)
  → extract the Python code block
  → write solution.py into an isolated workspace
  → run `python3 -B -m pytest`
      pass? → Done
      fail? → feed the pytest error back into the prompt, retry (max 3)
      still failing → Failed

The agent must write code, run the tests, read the failure, and fix itself — which is what exercises the runtime (concurrency, process spawning, I/O, memory), not just the model.

The C++ runtime

| Component | What it does |
| --- | --- |
| ThreadPool | 100 std::jthread workers, per-worker work-stealing deques |
| LLMClient / AsyncLLMClient | libcurl + SSE streaming to any OpenAI-compatible endpoint (sync, and curl_multi async) |
| ToolDispatcher | atomic write_file; bash via fork/exec with separate stdout/stderr, timeout (process-group SIGKILL) and per-call workspace; plus read_file / list_dir / search |
| AgentLoop | the write → pytest → retry loop, one isolated workspace per agent |
| Telemetry | background RSS sampler (peak), per-task metrics, CSV + summary JSON with p50/p95/p99 |
| Dataset loader + runner | loads dataset/humaneval_100.json, fans the tasks across the pool, writes the report |
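To make the ToolDispatcher row more concrete, the following Python sketch shows one common way to get an atomic file write and a shell call with a process-group kill on timeout. It is an illustration only: the repository implements this in C++ with fork/exec, and the function names `write_file_atomic` and `run_bash`, their signatures, and the `shell` parameter are assumptions made for this example. It relies on POSIX behavior.

```python
import os
import signal
import subprocess
import tempfile
from pathlib import Path


def write_file_atomic(path: Path, content: str) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_name)  # never leave a half-written temp file behind
        raise


def run_bash(command: str, workspace: Path, timeout_s: float,
             shell: str = "bash") -> tuple[int, str, str, bool]:
    """Run a shell command; return (exit_code, stdout, stderr, timed_out)."""
    proc = subprocess.Popen(
        [shell, "-c", command], cwd=workspace,       # per-call workspace
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
        start_new_session=True)  # child leads its own session and process group
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group, not just the shell
        out, err = proc.communicate()        # reap the child and drain both pipes
        return proc.returncode, out, err, True
```

Killing the process group rather than the single child matters when the command spawns its own children: a test that backgrounds a process would otherwise keep the pipes open and outlive the timeout.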

No heap-heavy framework: just the standard library, libcurl, nlohmann/json and spdlog. Every component is covered by tests built with -UNDEBUG so assertions stay live even in Release.
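The Telemetry component in the table can be pictured with a short Python sketch: a background thread that polls resident set size and keeps the maximum, plus a percentile helper for the p50/p95/p99 summary. This is a sketch under stated assumptions, not the repository's C++ sampler. `PeakRssSampler`, `read_rss_bytes`, and `percentile` are hypothetical names, the default reader works only on Linux (it reads `/proc/self/statm`), and the nearest-rank percentile definition is an assumption, since the source does not say which method it uses.

```python
import math
import os
import threading
from typing import Callable, Sequence


def read_rss_bytes() -> int:
    """Current resident set size of this process (Linux only, via /proc)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


class PeakRssSampler:
    """Polls an RSS reader on a background thread and keeps the maximum seen."""

    def __init__(self, read_rss: Callable[[], int] = read_rss_bytes,
                 interval_s: float = 0.05) -> None:
        self._read_rss = read_rss
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_bytes = 0

    def _run(self) -> None:
        while True:
            self.peak_bytes = max(self.peak_bytes, self._read_rss())
            if self._stop.wait(self._interval_s):  # True once stop is requested
                break

    def __enter__(self) -> "PeakRssSampler":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._stop.set()
        self._thread.join()


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

A sampler like this only sees memory at its polling instants, so a short spike between two samples can be missed; a shorter interval narrows that gap at the cost of more wakeups.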

Results — C++ baseline

100 HumanEval tasks, qwen2.5-coder:7b, 100-way concurrency, single GPU:

| Metric | Value |
| --- | --- |
| Peak RSS (100 concurrent agents) | ~93 MiB |
| pass@1 (with up to 3 self-review retries) | 96% (96/100) |
| first-attempt pass | 87/100 |
| recovered via self-review | |
| failed after 3 retries | |
| avg retries | 0.27 |
| wall time (100 tasks) | 126 s |

How to read these honestly:

Peak RSS is the runtime number. ~93 MiB for 100 concurrent agents is the headline for the C++ stack — and the metric that will actually differ between languages.

pass@1 and retries are model properties, not runtime properties, so they will be identical across stacks. They're here to prove the harness runs a real agentic loop (the self-review recovered 6 tasks), not to compare runtimes.

Per-task latency is intentionally omitted from the headline. At 100-way concurrency against one GPU, per-task time is dominated by server-side queueing, not the runtime. Throughput (wall time) is likewise model-bound here....
