GitHub - wilmanrojas/sinqua · GitHub
Folders and files (21 commits):
dataset
include
results/raw
runner
scripts
src
test
.gitignore
CMakeLists.txt
README.md
hardware.md
agent-runtime-bench
A controlled, apples-to-apples benchmark of agent runtimes (the orchestration layer that drives an LLM through a write → execute → self-correct loop) across C++, Python, TypeScript, and Rust.
Why this matters
When people compare "coding agents" they almost always compare the model (pass@1 on HumanEval, SWE-bench, etc.). But in production the model runs behind a runtime: the code that fans out hundreds of agents, streams tokens, spawns test processes, retries on failure, and tracks state. That runtime, not the model, decides:
- Memory footprint when you run 100+ agents at once,
- Concurrency ceiling and tail behavior under load,
- Overhead added on top of model latency.
These costs dominate the bill once agents move to scale, yet there is no controlled cross-language comparison of agent runtimes. Published numbers aren't comparable: different hardware, different model, different framework. This project fixes the variables (same tasks, same model, same hardware, same loop logic) and changes only the language runtime, so the runtime's cost is isolated and measurable.
The workload
HumanEval (first 100 problems). For each task the runtime runs a real agentic loop, not one-shot codegen:
    build prompt (spec + pytest)
      → LLM completion (streamed)
      → extract the Python code block
      → write solution.py into an isolated workspace
      → run `python3 -B -m pytest`
      → pass? → Done
        fail? → feed the pytest error back into the prompt, retry (max 3)
      → still failing → Failed
The agent must write code, run the tests, read the failure, and fix itself, which is what exercises the runtime (concurrency, process spawning, I/O, memory), not just the model.
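The loop described above can be sketched in C++ roughly as follows. This is a minimal illustration, not the repository's AgentLoop: the names `extract_python_block`, `run_agent_loop`, `TestOutcome` and `LoopResult` are hypothetical, the LLM call and the pytest run are injected as callbacks so the control flow stands on its own, and "max 3" is read here as up to three retries after the first attempt.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical result of one pytest run, as seen by the loop.
struct TestOutcome {
    bool passed = false;
    std::string error_output;  // fed back into the prompt on failure
};

// Hypothetical summary of one agent's run.
struct LoopResult {
    bool passed = false;
    int retries = 0;  // attempts made after the first one
};

// Returns the body of the first fenced python block in `reply`, if any.
std::optional<std::string> extract_python_block(const std::string& reply) {
    const std::string fence(3, '`');  // three backticks
    auto start = reply.find(fence + "python");
    if (start == std::string::npos) return std::nullopt;
    start = reply.find('\n', start);  // skip the rest of the opening fence line
    if (start == std::string::npos) return std::nullopt;
    ++start;
    const auto end = reply.find(fence, start);
    if (end == std::string::npos) return std::nullopt;
    return reply.substr(start, end - start);
}

using Complete = std::function<std::string(const std::string& prompt)>;
using RunTests = std::function<TestOutcome(const std::string& code)>;

// One attempt plus up to `max_retries` retries, feeding each failure back
// into the next prompt. Assumption: "max 3" means three retries.
LoopResult run_agent_loop(const std::string& task_prompt,
                          const Complete& complete,
                          const RunTests& run_tests,
                          int max_retries = 3) {
    LoopResult result;
    std::string prompt = task_prompt;
    for (int attempt = 0; attempt <= max_retries; ++attempt) {
        result.retries = attempt;
        std::string feedback;
        const auto code = extract_python_block(complete(prompt));
        if (code) {
            const TestOutcome outcome = run_tests(*code);
            if (outcome.passed) {
                result.passed = true;
                return result;
            }
            feedback = outcome.error_output;
        } else {
            feedback = "No Python code block found in the reply.";
        }
        prompt = task_prompt + "\n\nThe previous attempt failed:\n" + feedback;
    }
    return result;  // still failing after all retries
}
```

With a completion callback that returns a failing solution once and a passing one afterwards, `run_agent_loop` reports `passed` with `retries == 1`; a callback that never produces passing code ends with `retries == 3` after four calls.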
The C++ runtime
| Component | What it does |
| --- | --- |
| ThreadPool | 100 std::jthread workers, per-worker work-stealing deques |
| LLMClient / AsyncLLMClient | libcurl + SSE streaming to any OpenAI-compatible endpoint (sync, and curl_multi async) |
| ToolDispatcher | atomic write_file; bash via fork/exec with separate stdout/stderr, timeout (process-group SIGKILL) and per-call workspace; plus read_file / list_dir / search |
| AgentLoop | the write → pytest → retry loop, one isolated workspace per agent |
| Telemetry | background RSS sampler (peak), per-task metrics, CSV + summary JSON with p50/p95/p99 |
| Dataset loader + runner | loads dataset/humaneval_100.json, fans the tasks across the pool, writes the report |
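As a rough illustration of the Telemetry row, the sketch below computes nearest-rank percentiles over a set of per-task measurements and reads the process's peak RSS from the kernel. Both helper names (`percentile`, `peak_rss_kib`) are hypothetical. The repository samples RSS in the background and may define its percentiles differently, so treat this as one possible approach on Linux rather than a description of the actual code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>
#include <sys/resource.h>

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. Returns 0.0 for empty input.
double percentile(std::vector<double> samples, double p) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    auto rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(n) / 100.0));
    if (rank < 1) rank = 1;  // p == 0 maps to the minimum
    if (rank > n) rank = n;  // p == 100 maps to the maximum
    return samples[rank - 1];
}

// Peak resident set size of the current process, or -1 on error.
// On Linux, ru_maxrss is reported in KiB (macOS reports bytes instead).
long peak_rss_kib() {
    struct rusage usage{};
    if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
    return usage.ru_maxrss;
}
```

Reading the kernel's high-water mark avoids missing a short spike between samples, whereas a background sampler can also record how RSS evolves over the run.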
No heap-heavy framework: just the standard library, libcurl, nlohmann/json and spdlog. Every component is covered by tests built with -UNDEBUG so assertions stay live even in Release.
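The bash tool listed for ToolDispatcher combines fork/exec with a timeout enforced by a process-group SIGKILL. A minimal POSIX sketch of that idea follows. The names `run_with_timeout` and `RunResult` are hypothetical, the child is waited on by polling, and stdout/stderr capture and the per-call workspace are left out, so this shows only the timeout mechanism under those assumptions.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

// Hypothetical result type for one shell invocation.
struct RunResult {
    int exit_code = -1;      // set only when the child exited normally
    bool timed_out = false;  // true when the process group was killed
};

// Runs `command` through /bin/sh in its own process group and kills the
// whole group with SIGKILL if it is still running after `timeout`.
RunResult run_with_timeout(const std::string& command,
                           std::chrono::milliseconds timeout) {
    RunResult result;
    const pid_t pid = fork();
    if (pid < 0) return result;  // fork failed
    if (pid == 0) {
        setpgid(0, 0);  // child becomes leader of a new process group
        execl("/bin/sh", "sh", "-c", command.c_str(),
              static_cast<char*>(nullptr));
        _exit(127);  // reached only if exec failed
    }
    setpgid(pid, pid);  // also set from the parent to close the startup race

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    int status = 0;
    for (;;) {
        const pid_t done = waitpid(pid, &status, WNOHANG);
        if (done == pid) break;  // child finished
        if (done < 0 && errno != EINTR) return result;
        if (std::chrono::steady_clock::now() >= deadline) {
            kill(-pid, SIGKILL);       // negative pid targets the whole group
            waitpid(pid, &status, 0);  // reap the shell to avoid a zombie
            result.timed_out = true;
            return result;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
    if (WIFEXITED(status)) result.exit_code = WEXITSTATUS(status);
    return result;
}
```

Killing the group rather than the single child matters because the shell may have spawned its own children (for example the Python interpreter running pytest), which would otherwise outlive the timeout.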
Results — C++ baseline
100 HumanEval tasks, qwen2.5-coder:7b, 100-way concurrency, single GPU:
| Metric | Value |
| --- | --- |
| Peak RSS (100 concurrent agents) | ~93 MiB |
| pass@1 (with up to 3 self-review retries) | 96% (96/100) |
| first-attempt pass | 87/100 |
| recovered via self-review | |
| failed after 3 retries | |
| avg retries | 0.27 |
| wall time (100 tasks) | 126 s |
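A few derived figures follow directly from the table. The helpers below are hypothetical and only make the arithmetic explicit. Note that dividing peak RSS by the agent count gives an average that includes the process's fixed overhead, not the marginal cost of one extra agent.

```cpp
// Average resident memory per agent, in MiB (includes fixed process overhead).
constexpr double per_agent_mib(double peak_rss_mib, int agents) {
    return peak_rss_mib / agents;
}

// End-to-end throughput in tasks per second.
constexpr double tasks_per_second(int tasks, double wall_seconds) {
    return tasks / wall_seconds;
}

// Total number of retries implied by the per-task average.
constexpr double total_retries(double avg_retries, int tasks) {
    return avg_retries * tasks;
}

// Values taken from the table above:
//   per_agent_mib(93.0, 100)     is about 0.93 MiB per agent
//   tasks_per_second(100, 126.0) is about 0.79 tasks/s
//   total_retries(0.27, 100)     is about 27 retries across the run
```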
How to read these honestly:
- Peak RSS is the runtime number. ~93 MiB for 100 concurrent agents is the headline for the C++ stack, and the metric that will actually differ between languages.
- pass@1 and retries are model properties, not runtime properties; they will be identical across stacks. They're here to prove the harness runs a real agentic loop (the self-review recovered 6 tasks), not to compare runtimes.
- Per-task latency is intentionally omitted from the headline. At 100-way concurrency against one GPU, per-task time is dominated by server-side queueing, not the runtime. Throughput (wall time) is likewise model-bound here....