Will the future be millions of agents running tasks every day?


GitHub - wilmanrojas/sinqua · GitHub



agent-runtime-bench

A controlled, apples-to-apples benchmark of agent runtimes (the orchestration layer that drives an LLM through a write → execute → self-correct loop) across C++, Python, TypeScript, and Rust.

Why this matters

When people compare "coding agents" they almost always compare the model (pass@1 on HumanEval, SWE-bench, etc.). But in production the model runs behind a runtime: the code that fans out hundreds of agents, streams tokens, spawns test processes, retries on failure, and tracks state. That runtime, not the model, decides:

Memory footprint when you run 100+ agents at once,

Concurrency ceiling and tail behavior under load,

Overhead added on top of model latency.

These costs dominate the bill once agents move to scale, yet there is no controlled cross-language comparison of agent runtimes. Published numbers aren't comparable: different hardware, different model, different framework. This project fixes the variables (same tasks, same model, same hardware, same loop logic) and changes only the language runtime, so the runtime's cost is isolated and measurable.

The workload

HumanEval (first 100 problems). For each task the runtime runs a real agentic loop, not one-shot codegen:

    build prompt (spec + pytest)
      → LLM completion (streamed)
      → extract the Python code block
      → write solution.py into an isolated workspace
      → run `python3 -B -m pytest`
      → pass? → Done
        fail? → feed the pytest error back into the prompt, retry (max 3)
      → still failing → Failed

The agent must write code, run the tests, read the failure, and fix itself. That is what exercises the runtime (concurrency, process spawning, I/O, memory), not just the model.

The C++ runtime

| Component | What it does |
| --- | --- |
| ThreadPool | 100 std::jthread workers, per-worker work-stealing deques |
| LLMClient / AsyncLLMClient | libcurl + SSE streaming to any OpenAI-compatible endpoint (sync, and curl_multi async) |
| ToolDispatcher | atomic write_file; bash via fork/exec with separate stdout/stderr, timeout (process-group SIGKILL) and per-call workspace; plus read_file / list_dir / search |
| AgentLoop | the write → pytest → retry loop, one isolated workspace per agent |
| Telemetry | background RSS sampler (peak), per-task metrics, CSV + summary JSON with p50/p95/p99 |
| Dataset loader + runner | loads dataset/humaneval_100.json, fans the tasks across the pool, writes the report |

No heap-heavy framework: just the standard library, libcurl, nlohmann/json and spdlog. Every component is covered by tests built with -UNDEBUG so assertions stay live even in Release.
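Two of the ToolDispatcher techniques named above, an atomic file write and a shell command that is killed as a whole process group on timeout, can be illustrated with a short POSIX-only Python sketch. It is not the repository's C++ implementation (which uses fork/exec directly); `write_file_atomic` and `run_bash` are hypothetical names, and the write-to-temp-then-rename approach is an assumption about what "atomic" means here.

```python
import os
import signal
import subprocess
import tempfile
from pathlib import Path
from typing import Tuple


def write_file_atomic(path: Path, content: str) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(content)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_name)  # do not leave a half-written temp file behind
        raise


def run_bash(command: str, workspace: Path, timeout_s: float) -> Tuple[int, str, str, bool]:
    """Run a command in its own process group; return (code, stdout, stderr, timed_out)."""
    proc = subprocess.Popen(
        ["bash", "-c", command],
        cwd=workspace,
        stdout=subprocess.PIPE,   # stdout and stderr are captured separately
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,   # child leads a new process group (pgid == pid)
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out, err, False
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)  # kill the child and its descendants
        out, err = proc.communicate()        # reap the process, drain the pipes
        return proc.returncode, out, err, True
```

Killing the process group instead of the single child matters when the command spawns its own subprocesses: a plain kill of the shell could leave a runaway test process alive and holding the output pipes open.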

Results — C++ baseline

100 HumanEval tasks, qwen2.5-coder:7b, 100-way concurrency, single GPU:

| Metric | Value |
| --- | --- |
| Peak RSS (100 concurrent agents) | ~93 MiB |
| pass@1 (with up to 3 self-review retries) | 96% (96/100) |
| first-attempt pass | 87/100 |
| recovered via self-review | |
| failed after 3 retries | |
| avg retries | 0.27 |
| wall time (100 tasks) | 126 s |

How to read these honestly:

Peak RSS is the runtime number. ~93 MiB for 100 concurrent agents is the headline for the C++ stack, and the metric that will actually differ between languages.

pass@1 and retries are model properties, not runtime properties, so they will be identical across stacks. They're here to prove the harness runs a real agentic loop (the self-review recovered 6 tasks), not to compare runtimes.

Per-task latency is intentionally omitted from the headline. At 100-way concurrency against one GPU, per-task time is dominated by server-side queueing, not the runtime. Throughput (wall time) is likewise model-bound here....
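Per-task latency is still recorded: the Telemetry component writes p50/p95/p99 to the summary. The README does not say which percentile definition is used, so the sketch below assumes the nearest-rank method; `percentile` and `summarize` are hypothetical helper names, not the project's API.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Build the p50/p95/p99 summary for a list of per-task latencies in seconds."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}
```

With only 100 tasks, nearest-rank p99 is simply the second-slowest task, which is one reason tail numbers from a run this size are better read as indicative than as precise.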
