What We Learned from One Year of Building Production Agents

We turned 1 year old!

Strands Agents launched over a year ago, growing out of an internal effort by AWS engineers building a network troubleshooting agent. They didn’t use a heavy-duty framework or write enormous amounts of workflow boilerplate. They essentially wired up a system prompt, a Claude 3 model, and tools, and the resulting agent resolved 80% of network root causes. Today that setup is branded as an “agent harness”.

Our philosophy of keeping the architecture minimal has helped teams across AWS and beyond ship production agents that handle customer traffic at scale. Our engineers learned a lot after open sourcing the framework and reaching 25 million downloads. Here are the key lessons:

Workflow boilerplate in agents can easily become outdated

When Strands launched in May 2025, the most advanced models had a 200k context window. The agents we saw developers build were usually chatbots connected to a knowledge base, or workflows that classified data in batches. Agent frameworks provided plenty of scaffolding to optimize these tasks. But what happened when a model got better? The scaffolding could turn into technical debt, and you wound up refactoring the agent.

For example, take this Sonnet 3.7 agent that reviews CloudWatch alarms across AWS accounts. We saw a lot of customers build something like this with another agent framework:

from agentframework import Agent, tool, Graph, GraphNode, GraphEdge, SlidingWindowConversationManager

@tool
def list_cloudwatch_alarms(state_filter: str = "ALARM") -> dict:
    """List CloudWatch alarms filtered by state."""
    ...

@tool
def get_alarm_logs(alarm_name: str) -> dict:
    """Get CloudWatch Logs related to an alarm from the past 7 days."""
    ...

@tool
def format_alarm_report(raw_data: str) -> dict:
    """Format alarm data into a human-readable report."""
    ...

fetch_agent = Agent(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    system_prompt="List all alarms currently in ALARM state.",
    tools=[list_cloudwatch_alarms],
    conversation_manager=SlidingWindowConversationManager(window_size=10, per_turn=True),
)

logs_agent = Agent(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    system_prompt="For each alarm provided, get its logs from the past 7 days to determine what's been happening.",
    tools=[get_alarm_logs],
    conversation_manager=SlidingWindowConversationManager(window_size=10, per_turn=True),
)

formatter_agent = Agent(
    model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    system_prompt="Format the alarm data and logs into a clean report grouped by severity. Include relevant log patterns for each alarm.",
    tools=[format_alarm_report],
    conversation_manager=SlidingWindowConversationManager(window_size=20, per_turn=True),
)

graph = Graph(
    nodes={
        "fetch": GraphNode(agent=fetch_agent),
        "logs": GraphNode(agent=logs_agent),
        "format": GraphNode(agent=formatter_agent),
    },
    edges=[
        GraphEdge(source="fetch", target="logs"),
        GraphEdge(source="logs", target="format"),
    ],
)

result = graph.invoke({"input": "What alarms are firing right now?"})


A lot of developers thought best practices meant using a graph workflow that scaffolded each step into a separate agent. For simpler use cases like this, that seemed over-engineered to us. This type of scaffolding...
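To make the contrast with the graph concrete, here is a minimal sketch of the other shape: one loop, one set of tools, and the model deciding which tool to call next. It is not the Strands API or any real framework's API. The tools return canned data instead of calling CloudWatch, and `scripted_model` and `run_agent` are hypothetical names, with `scripted_model` standing in for a real LLM call so the example runs on its own.

```python
# Hypothetical stub tools; real versions would query CloudWatch.
def list_cloudwatch_alarms(state_filter: str = "ALARM") -> dict:
    """List CloudWatch alarms filtered by state (canned data)."""
    return {"alarms": [{"name": "cpu-high", "state": state_filter},
                       {"name": "disk-full", "state": state_filter}]}


def get_alarm_logs(alarm_name: str) -> dict:
    """Get logs related to an alarm (canned data)."""
    return {"alarm": alarm_name, "events": [f"{alarm_name}: threshold breached"]}


TOOLS = {
    "list_cloudwatch_alarms": list_cloudwatch_alarms,
    "get_alarm_logs": get_alarm_logs,
}


def scripted_model(history: list) -> dict:
    """Stand-in for an LLM call: picks the next action from the results so far."""
    alarms = None
    logged = set()
    for step in history:
        if step["tool"] == "list_cloudwatch_alarms":
            alarms = [a["name"] for a in step["result"]["alarms"]]
        elif step["tool"] == "get_alarm_logs":
            logged.add(step["result"]["alarm"])

    if alarms is None:
        return {"tool": "list_cloudwatch_alarms", "args": {"state_filter": "ALARM"}}
    for name in alarms:
        if name not in logged:
            return {"tool": "get_alarm_logs", "args": {"alarm_name": name}}

    # Everything has been gathered, so produce the final report.
    lines = [event
             for step in history if step["tool"] == "get_alarm_logs"
             for event in step["result"]["events"]]
    return {"final": "\n".join(lines)}


def run_agent(model, tools: dict, max_steps: int = 10):
    """Single loop: ask the model for an action, run the tool, repeat."""
    history = []
    for _ in range(max_steps):
        action = model(history)
        if "final" in action:
            return action["final"], history
        result = tools[action["tool"]](**action["args"])
        history.append({"tool": action["tool"], "result": result})
    raise RuntimeError("agent did not finish within max_steps")


report, history = run_agent(scripted_model, TOOLS)
print(report)
```

Nothing in the loop encodes the fetch, logs, format order. That order lives entirely in the model's decisions, so a more capable model changes behavior without any nodes or edges to rewire.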
