Show HN: Argus – Capture, replay and QA every Claude Code session your team runs

zamtam1 pts0 comments

Sign in

Alpha is open · sign in to start<br>Is Claude Cowork actually working for your users and clients?<br>Argus captures every session of your users across your team or organization, then reads across thousands of them at once — surfacing where skills are holding, where they're quietly breaking, and which patterns are pointing at the next thing you should build.<br>Sign in to Argus Read a sample session<br>Magic-link sign in. No password. Free during alpha.

arguslab.co/northwind/dashboard<br>AArgus/Nnorthwind<br>Sessions captured<br>142+12%

Cost<br>$58.41+8%

Tokens<br>18.4M+14%

Active users<br>8flat

Errors<br>3−47%

SessionsErrors

Spot checks<br>Cache hit rate86%<br>Avg session cost$0.41<br>P50 latency78 ms<br>Tools / session14.2

Recent sessions<br>workspace

Refactor the Stripe webhook handler to retry on idempotency conflicts…runningsara@northwind.io$0.84

Sync Linear backlog with the GitHub release milestone…completedjames@northwind.io$1.42

Draft the weekly customer-success digest from HubSpot + Slack threads…flaggedpriya@northwind.io$4.21

Live preview · Argus dashboard, one workspace

The problem<br>Once it ships, you go blind.<br>Claude Cowork's built-in telemetry tells you a skill was invoked. It can't tell you whether it worked, whether the user took the answer, whether you should ship a fix tomorrow.

№ 01The counter that says nothing.<br>Cowork's built-in telemetry logs your skill as Skill: 3. Three invocations. Across 412 sessions for one client this month, you have a single number per skill — invocation count. Nothing about which versions ran, what they returned, whether the user accepted the answer or had to ask twice.<br>observed412 sessions<br>surfaced1 aggregate counter

№ 02The skill that broke quietly.<br>You shipped weekly-review four weeks ago. Across the first 38 sessions, users accepted the answer on the first turn. Across the next 9, they re-asked, rephrased, switched tools. Something started failing on session 39. Nobody noticed — the cost line stayed flat and no single session looked broken on its own.<br>skillweekly-review · v1.2.0<br>patternfirst-turn 96% → 22%

№ 03The skill that wasn't there yet.<br>Across the team's traffic this month, “turn this Linear ticket into a release-notes entry” came up fourteen times in eight different phrasings. Each one got a different ad-hoc answer; one user gave up. A skill is waiting to be written there. No telemetry surface — yours or anyone else's — will ever find it.<br>pattern&ldquo;linear → release notes&rdquo;<br>sessions14 · 8 phrasings · no skill

The instrument<br>What the counters can't see.<br>Argus runs as a Claude Cowork plugin. It captures every session — prompts, assistant responses, every tool call, every follow-up — in plain text, stitched back into the conversation the user actually had. The qualitative layer that makes everything else possible.

№ 01Did the user accept the answer?<br>A skill that works ends the conversation. A skill that doesn't gets re-prompted, rephrased, abandoned. Argus captures the user's exact follow-ups so the Agent can see, at a glance across hundreds of sessions, which versions of which skill are landing on the first turn — and which aren't.<br>Captured · used in: first-turn acceptance, follow-up patterns

№ 02Did the assistant ask the user to do its job?<br>Skills should answer questions, not ask new ones. When a skill is under-specified, the assistant stalls — “could you clarify…”, “which one did you mean…” — and the user does the work the skill was meant to do. Argus captures the stalls so the Agent can show where they cluster.<br>Captured · used in: stall frequency, under-specified prompts

№ 03What did the tool actually return?<br>The MCP call succeeded. Status 200. But the payload was empty, or a 400-row dump, or a JSON that didn't match what the skill asked for. Argus captures the tool's plain-text output beside the assistant's response, so the Agent can flag the sessions where the skill kept going on bad input.<br>Captured · used in: tool-output mismatches, silent failures

№ 04What did users ask for that no skill could handle?<br>The prompts your customisations don't yet cover. Same export, same lookup, same wrangle — captured verbatim, even when nothing answered them. The Agent reads across the unmet-prompts corpus and surfaces patterns ready to become the next skill.<br>Captured · used in: unmet-prompt clustering, skill candidates

The loop<br>From every session, a catalogue that improves itself.<br>Five moves, read bottom-up — raw work at the foundation, refined knowledge on top. Each layer rests on the one beneath it; the loop settles new and refined skills back into the next session.

Refined<br>knowledge

Raw<br>work

№ 04<br>Layer 5

Refine<br>Rate the work; an agent refines weak skills and drafts the ones your usage is asking for.

Agent · plannedrate · refine · draft

№ 03<br>Layer 4

Review<br>Replay grouped by skill; analyse every invocation across the org.

Org catalogue<br>SkillsMCPsAgentsPlugins

№ 02<br>Layer 3

Clean & structure<br>Redacted and tenant-isolated, then each session is rebuilt into a complete,...

skill session across argus user sessions

Related Articles