Show HN: TruLayer – tracing, evals, and a control loop for production LLMs

TruLayer — Evals, Closed Control Loop & Auto-Rollback for Production AI

Control Loop v0.1: now live

Your AI nails the demo. TruLayer makes it nail production.

Evals score every output. A closed control loop retries with a fallback model, gates high-stakes actions on human approval, and rolls back automatically when a fix introduces a new regression. Every production failure becomes a system fix before it hits the next user.

Free tier includes 1M spans / month · No credit card
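The auto-rollback idea can be sketched as a regression gate. A candidate fix, such as a new prompt version, is kept only if no golden-dataset case scores worse than it did under the current version. This is a minimal illustration under stated assumptions, not TruLayer's implementation. The names `decide_rollout` and `RolloutDecision`, the per-case scores in the range 0 to 1, and the `tolerance` parameter are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch: these names are illustrative, not TruLayer's SDK.


@dataclass
class RolloutDecision:
    keep_fix: bool
    regressed_cases: list[str]


def decide_rollout(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> RolloutDecision:
    """Compare per-case eval scores of the current version and a candidate fix.

    Any case whose score drops by more than `tolerance` counts as a regression,
    and a single regression is enough to roll the fix back.
    """
    regressed = [
        case_id
        for case_id, base_score in baseline.items()
        # A case missing from the candidate run is treated as a regression.
        if candidate.get(case_id, 0.0) < base_score - tolerance
    ]
    return RolloutDecision(keep_fix=not regressed, regressed_cases=sorted(regressed))


baseline = {"refund-001": 1.0, "refund-002": 0.5, "refund-003": 1.0}
candidate = {"refund-001": 1.0, "refund-002": 1.0, "refund-003": 0.0}
print(decide_rollout(baseline, candidate))
```

In this example the fix improves `refund-002` but breaks `refund-003`, so the gate reports `keep_fix=False`. Requiring zero per-case regressions is stricter than comparing average scores. That strictness is what stops a fix from trading one failure class for another.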

Ingestion latency · 1M free spans / month · SOC 2 Type II in progress · 99.9% uptime target

Works with: OpenAI, Anthropic, Claude (MCP), LangChain, LangGraph, AutoGen, CrewAI, LlamaIndex, PydanticAI, DSPy, Haystack, Vercel AI SDK, Mastra, LlamaIndex-TS, and custom LLMs

How teams use TruLayer

Observe. Evaluate. Improve.

Most tools stop at the trace. TruLayer takes you from “something’s wrong” to “here’s the fix, shipped” in one platform.

Observe
See what’s happening. Understand why. Distributed traces, failure clustering, anomaly detection, and semantic search give you everything you need to go from “something’s wrong” to “here’s exactly why.” Configurable retry depth prevents runaway cascades: when a trace has been retried N times without passing eval, it escalates to the human-in-the-loop queue automatically.

Evaluate
Know whether it was correct, not just whether it ran. 25 pre-built evaluators, eval rules on any span, regression testing against golden datasets, and score trends over time.

Improve
Close the loop before the next user hits it. AI-suggested prompt improvements, self-healing actions, human-in-the-loop approval, and remediation diffs make up the full control loop.
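The configurable retry depth can be pictured as a bounded loop. It tries the primary model, then fallback models, and escalates to a human queue once `max_retries` retries have failed eval. This is a minimal sketch under stated assumptions. A "model" here is any callable from prompt to output, an "eval" is any callable from output to bool, and `run_with_retries`, `LoopResult` and `human_queue` are hypothetical names, not TruLayer APIs. `max_retries` plays the role of N in the text above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical types for illustration; not TruLayer's API.
Model = Callable[[str], str]
Eval = Callable[[str], bool]


@dataclass
class LoopResult:
    output: Optional[str]
    attempts: int
    escalated: bool


def run_with_retries(
    prompt: str,
    models: list[Model],
    passes_eval: Eval,
    max_retries: int,
    human_queue: list[str],
) -> LoopResult:
    """One initial attempt plus up to `max_retries` retries, then escalate."""
    for attempt in range(max_retries + 1):
        # Use the primary model first, then each fallback in turn; once the
        # list is exhausted, keep using the last fallback.
        model = models[min(attempt, len(models) - 1)]
        output = model(prompt)
        if passes_eval(output):
            return LoopResult(output=output, attempts=attempt + 1, escalated=False)
    # Retry depth exhausted: hand the case to a human instead of looping forever.
    human_queue.append(prompt)
    return LoopResult(output=None, attempts=max_retries + 1, escalated=True)
```

Capping the loop at `max_retries + 1` calls bounds both latency and cost per trace. That cap is what prevents a failing case from cascading into unbounded retries.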

Who uses TruLayer

AI agents handle millions of decisions. Here’s where they go wrong. TruLayer keeps them on track automatically, at the system level, before the same failure hits the next user.

Customer & Revenue

Customer Support Agents
Thousands of refund decisions a day. One bad policy interpretation costs you twice.
Your refund agent handles thousands of decisions a day. One bad policy interpretation issues a $1,000 refund instead of $500. Another invents a department that doesn’t exist. You find out from a support ticket, not a trace. Eval rules score every refund decision inline. When a rule fires, the control loop retries with a corrected prompt or routes the case to a human queue, so the next customer gets the right answer.

Outbound Sales Agents
Deprecated pricing, opted-out prospects, and a deal that collapsed.
Your SDR agent quoted a pricing tier you deprecated six months ago, then emailed a prospect who had opted out last quarter. The deal collapsed, and legal is now involved. Faithfulness scoring flags outputs that drift from your pricing and compliance context, so the same failure class doesn’t reach the send queue twice.
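An "eval rule" on a refund decision can be thought of as a predicate over the span's structured output. A non-empty list of fired rules is the signal the control loop acts on. This minimal sketch assumes a decision exposes `order_total`, `refund_amount` and `department`. Those field names, the `KNOWN_DEPARTMENTS` set, the `max_refund_ratio` policy parameter and `fired_rules` itself are hypothetical, chosen to mirror the two failures described above.

```python
from dataclasses import dataclass

# Hypothetical policy data for illustration only.
KNOWN_DEPARTMENTS = {"billing", "returns", "shipping"}


@dataclass
class RefundDecision:
    order_total: float
    refund_amount: float
    department: str


def fired_rules(decision: RefundDecision, max_refund_ratio: float = 1.0) -> list[str]:
    """Return the names of the eval rules this decision violates (empty = pass)."""
    fired = []
    # Rule 1: never refund more than the policy allows for this order.
    if decision.refund_amount > decision.order_total * max_refund_ratio:
        fired.append("refund_exceeds_policy")
    # Rule 2: the agent may only route to departments that actually exist.
    if decision.department.lower() not in KNOWN_DEPARTMENTS:
        fired.append("unknown_department")
    return fired


print(fired_rules(RefundDecision(order_total=500.0, refund_amount=1000.0,
                                 department="Refund Escalations")))
```

Both rules fire for the $1,000-instead-of-$500 refund routed to an invented department. A control loop would then retry with a corrected prompt or queue the case for a human. Deterministic rules like these are cheap enough to run on every span. Fuzzier checks such as faithfulness usually need a model-based evaluator instead.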

Engineering & Operations

Agentic Coding Agents
Wrong-scope refactor. Deleted file. Test edited to pass. Found in CI, hours late.
Your coding agent refactored a module that a parallel branch had already rewritten, deleted a file based on a truncated context window, and edited a test to make it pass. CI catches the deletion hours after the agent session closed; the rest surfaces only in staging. Function-call correctness, prompt injection, and faithfulness evaluators score every tool call inline. When a rule fires, the control loop retries with a corrected file scope or routes the next agent run on the same failure path to a human review queue.

AI Ops Agents
Restarted the wrong service. Now you have two incidents.
Your incident response agent restarted a healthy service that shared a label with the degraded one. Now you have two incidents, and it’s 2am. Tool-call correctness evaluators score every automated action inline. When a rule fires, the loop routes to a human before the next runbook step executes, not after the postmortem.
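Gating a high-stakes action on human approval amounts to a check placed between the agent's proposed tool call and its execution. This minimal sketch assumes tool calls arrive as a name plus an argument dict. The tool names `restart_service` and `delete_file`, the `degraded_services` set and the `Gate` class are hypothetical. It shows the shape of the check, not how TruLayer implements it.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical sketch: tool names and arguments are invented for illustration.


@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]


@dataclass
class Gate:
    degraded_services: set[str]
    high_stakes_tools: set[str] = field(
        default_factory=lambda: {"restart_service", "delete_file"}
    )
    approval_queue: list[ToolCall] = field(default_factory=list)

    def allow(self, call: ToolCall) -> bool:
        """Return True if the call may run now; otherwise queue it for a human."""
        if call.name not in self.high_stakes_tools:
            return True
        # Only restart a service that was actually flagged as degraded,
        # not one that merely shares a label with it.
        if call.name == "restart_service" and call.args.get("service") in self.degraded_services:
            return True
        # Everything else that is high-stakes waits for human approval
        # before the next runbook step executes.
        self.approval_queue.append(call)
        return False


gate = Gate(degraded_services={"payments-api"})
print(gate.allow(ToolCall("restart_service", {"service": "payments-web"})))
```

The same structure covers the coding-agent case. A `delete_file` call, or an edit outside the expected file scope, is held rather than executed. Low-stakes read-only tools pass straight through.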

Data & Documents

Document Extraction Agents
Wrong total line. Dropped tax ID. Silent type mismatch. Pipeline reported success.
Your invoice extraction agent read the subtotal line instead of the total-due line and wrote the wrong amount to your ERP. A second invoice dropped a vendor tax ID because the field label varied from the template, yet the PII landed in the CRM anyway. A third had its amount field silently coerced to a string. The type validation failed, the pipeline reported success, and nobody found out until reconciliation. PII leakage and tool-call correctness evaluators run inline on every span. When an extracted field is missing, malformed, or the wrong type, the control loop retries with a targeted prompt, falls back to a stricter extraction template, or routes the record to a human review queue before it propagates downstream.

Finance & Reporting Agents
Last quarter’s forecast. This quarter’s actuals. One board deck.
Your analyst agent...
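For the document-extraction case, the escalation ladder can be sketched in two steps. First, validate each extracted field for presence, type and format. Then map the attempt number to the next remediation step. This is a minimal illustration under stated assumptions. The two-field `SCHEMA`, the tax-ID pattern, the `LADDER` step names and the functions `field_problems` and `next_action` are all hypothetical.

```python
import re
from typing import Any, Optional

# Hypothetical invoice schema: field name -> (expected type, optional regex).
SCHEMA: dict[str, tuple[type, Optional[str]]] = {
    "vendor_tax_id": (str, r"[A-Z0-9-]{5,20}"),
    "total_due": (float, None),  # strict: an int or a str both count as wrong type
}

# One remediation step per failed attempt; the last step repeats.
LADDER = ["retry_targeted_prompt", "stricter_template", "human_review"]


def field_problems(record: dict[str, Any]) -> list[str]:
    """List missing, wrongly typed, or malformed fields (empty = valid)."""
    problems = []
    for name, (expected_type, pattern) in SCHEMA.items():
        value = record.get(name)
        if value is None:
            problems.append(f"{name}: missing")
        elif not isinstance(value, expected_type):
            # e.g. an amount silently coerced to "1,240.00" instead of a float
            problems.append(f"{name}: expected {expected_type.__name__}, got {type(value).__name__}")
        elif pattern is not None and not re.fullmatch(pattern, value):
            problems.append(f"{name}: malformed")
    return problems


def next_action(record: dict[str, Any], attempt: int) -> str:
    """Decide what happens to a record after `attempt` previous failures."""
    if not field_problems(record):
        return "write_downstream"
    return LADDER[min(attempt, len(LADDER) - 1)]


print(field_problems({"total_due": "1,240.00"}))
```

Validation runs before the write, so a record with a dropped tax ID or a stringified amount never reaches the ERP or CRM. It is retried, re-extracted with a stricter template, or handed to a person instead of being reported as a success.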
