Reducing total token consumption of agentic coding
TL;DR: Two levers reduce cost:
1. Fewer turns (parallel tool calls → fewer API round-trips)
2. Less context (methodology + snippets → only relevant info survives)
Understanding the problem: why tokens grow quadratically
What you see vs. what is sent
[Interactive demo: two panels, "User side" and "Sent to the API", each with a running character count.]
The bottom line after 10 exchanges
[Interactive counters: characters typed by the user; characters sent to the API (cumulative); multiplication factor.]
Cumulative cost complexity: O(n²)
Each new message resends the entire history. The total cost grows like the sum 1 + 2 + 3 + … + n = n(n+1)/2.
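The sum above can be checked with a short simulation. This is a minimal sketch, assuming every message has the same size s and ignoring the system prompt and model replies; the function name `cumulative_chars_sent` is illustrative and not part of any API.

```python
def cumulative_chars_sent(n: int, s: int) -> int:
    """Total characters sent over n exchanges when every call resends the history.

    Assumption: each message is exactly s characters long.
    """
    total = 0
    history = 0
    for _ in range(n):
        history += s      # the new message is appended to the history
        total += history  # the whole history is sent with this call
    return total


n, s = 10, 2_000
typed = n * s                        # what the user actually typed
sent = cumulative_chars_sent(n, s)   # what the API received in total

# Closed form: s * (1 + 2 + ... + n) = s * n * (n + 1) / 2
assert sent == s * n * (n + 1) // 2
print(f"typed={typed}, sent={sent}, factor={sent / typed}")
```

With n = 10 and s = 2,000 the loop and the closed form agree, and the ratio between the two totals is (n + 1) / 2.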
Interactive simulation
Messages (n): 10
Avg size per message (s): 2k
Tool loops per turn (k):
Classic chatbot vs. agentic coding
Classic chatbot
User: "Fix this bug" (200 chars)
API response #1: system + msg = 4,200 chars
User: "Also add tests" (150 chars)
API response #2: system + history = 8,500 chars
User: "Rename the var" (100 chars)
API response #3: system + history = 12,800 chars
Agentic coding
User: "Fix this bug" (200 chars)
API call #1: system + msg = 4,200 chars
→ Read(main.py), tool result: 3,000 chars
API call #2: everything + tool = 7,400 chars
→ Edit(main.py), tool result: 500 chars
API call #3: everything + tools = 8,100 chars
→ RunTests(), tool result: 2,000 chars
API call #4: everything + tools = 10,300 chars
→ Result returned to user: 1 user message → 4 API calls
1 user message = 3-15 API calls
Each tool call (Read, Edit, Bash…) triggers a new API call with the full accumulated context.
Agentic cost: the hidden multiplier
Classic chatbot
n × s → O(n²)
10 msgs × 2k each, history resent every time: 2k × (1 + 2 + … + 10) = ~110k chars total
Agentic (k loops/turn)
n × k × s → O(n² × k)
10 msgs × 5 loops × 2k: 5 × 110k = ~550k chars
With agentic coding, the cumulative total grows k times faster. A 20-message session with 5 tool loops per turn and 2k-character messages sends about 2.1 million characters to the API.
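The two totals can be reproduced in a few lines. This is a sketch of the simplified model used on this page, assuming each of the k tool loops in a turn resends the history as it stood for that turn; it does not model tool results growing the context inside a turn. The function names are illustrative.

```python
def chatbot_total(n: int, s: int) -> int:
    """Cumulative characters for a classic chatbot: s * (1 + 2 + ... + n)."""
    return s * n * (n + 1) // 2


def agentic_total(n: int, s: int, k: int) -> int:
    """Same model, with k API calls per user message instead of one."""
    return k * chatbot_total(n, s)


print(chatbot_total(10, 2_000))      # 10 messages, 2k chars each
print(agentic_total(10, 2_000, 5))   # same session with 5 tool loops per turn
print(agentic_total(20, 2_000, 5))   # 20 messages with 5 tool loops per turn
```

Doubling n from 10 to 20 roughly quadruples the total, while k only scales it linearly, which is why long sessions dominate the bill.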
1. Fewer Turns = Less Token Consumption
Aggressive parallelization of tool calls reduces API round-trips. I prefer a failed tool call to another full API call.
The 3-turn process
Discover → Read → Act. Group independent calls in the same turn to minimize context resends.
Sequential (8 turns)
Turn 1: Glob("**/*.py")
↓ wait
Turn 2: Grep("handler")
↓ wait
Turn 3: GetFolderDescription()
↓ wait
Turn 4: Read(main.py)
↓ wait
Turn 5: Read(config.py)
↓ wait
Turn 6: Edit(main.py)
↓ wait
Turn 7: Write(test.py)
↓ wait
Turn 8: RunTests()
8 turns = 8 × full context resent
DAG parallel (3 turns)
TURN 1: discover
Glob("**/*.py")
Grep("handler")
GetFolderDescription()
↓ need results to know what to read
TURN 2: read
Read(main.py, symbol="handle")
Read(config.py, symbol="Config")
↓ need content to write correct code
TURN 3: act (all at once with depends_on)
WritePlan
Edit(main.py)
Write(test.py)
RunTests()
3 turns = 3 × context → fewer tokens
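Inside the act turn, the calls are emitted together and ordered by depends_on. The sketch below shows how an executor might schedule them, using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The dependency edges are an assumption made for illustration; the page does not spell them out, and `execution_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Hypothetical depends_on edges for the act turn: call -> calls it depends on.
act_turn = {
    "WritePlan": set(),
    "Edit(main.py)": {"WritePlan"},
    "Write(test.py)": {"WritePlan"},
    "RunTests()": {"Edit(main.py)", "Write(test.py)"},
}


def execution_waves(calls: dict[str, set[str]]) -> list[list[str]]:
    """Group calls into waves; every call in a wave can run in parallel."""
    sorter = TopologicalSorter(calls)
    sorter.prepare()  # raises graphlib.CycleError if depends_on forms a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # calls whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for i, wave in enumerate(execution_waves(act_turn), start=1):
    print(f"wave {i}: {wave}")
```

All four calls still travel in a single model response, so the full context is resent once for the whole act phase rather than once per call.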
2. Less Context = Fewer Tokens
The context is append-only. Condense early or pay forever.
The naive approach: keep everything
Each turn resends the entire conversation history. File reads, bash outputs, reasoning: nothing is thrown away.
API Call 1
Context: [sys]
→ Response: "I'll read main.py" + tool_call(Read)
API Call 2
Context: [sys] [asst] [Read(main.py) 400L]
→ Response: "Found bug at L42" + tool_call(Edit)
API Call 3
Context: [sys] [asst] [main.py 400L] [asst] [edit OK]
→ Response: "Running tests" + tool_call(RunTests)
API Call 4
Context: [sys] [asst] [main.py 400L] [asst] [edit] [asst] [pytest 200L]
→ Response: "all tests pass! Task done."
Legend: cached prefix ($0.30/MTok) · fresh content ($3.00/MTok)
Append-only: previous turns stay in place (cached), new content goes at the end (full price). The 400 lines of main.py are sent 3 times, yet you only needed 1 function.
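The two prices can be turned into a rough per-session estimate. This is a minimal sketch, assuming every previously sent token is a cache hit and counting input tokens only; the token counts in the example are made up for illustration, and the function names are hypothetical.

```python
CACHED_USD_PER_MTOK = 0.30  # cached prefix price quoted above
FRESH_USD_PER_MTOK = 3.00   # fresh content price quoted above


def call_cost(cached_tokens: int, fresh_tokens: int) -> float:
    """Input cost in USD of one API call."""
    return (
        cached_tokens * CACHED_USD_PER_MTOK + fresh_tokens * FRESH_USD_PER_MTOK
    ) / 1_000_000


def session_cost(fresh_per_call: list[int]) -> float:
    """Append-only session: each call's cached prefix is everything sent before."""
    total = 0.0
    prefix = 0
    for fresh in fresh_per_call:
        total += call_cost(prefix, fresh)
        prefix += fresh  # this call's fresh content joins the prefix next time
    return total


# Hypothetical fresh tokens per call: prompt, main.py read, edit result, pytest output.
naive_session = [1_000, 4_000, 100, 2_000]
print(f"${session_cost(naive_session):.5f}")
```

Caching makes each resend cheaper, but a large tool result is still paid at the fresh rate once and at the cached rate on every later call.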
With Snippets: same work, smaller context
You still pay to Read once. But instead of carrying 400 lines forever, you save a 20-line Snippet, and the savings start on the next turn. This works with any tool result: Read, Grep, Skills, GetFolderDescription…
API Call 1 (same as naive)
→ Response: "I'll read main.py" → tool_call(Read)
API Call 2 (same as naive: you pay the 400L once)
Context: [Read(main.py) 400L, FRESH]
→ Response: finds bug, emits Snippet(L40-60) + Edit(main.py)
API Call 3 (here is the difference: 400L → 20L Snippet)
Context: [Snippet 20L] [edit OK]
→ Response: "done, running tests" → tool_call(RunTests)
API Call 4 (still small)
Context: [20L] [edit] [pytest 200L]
→ Response: "all tests pass! Task done."
Same structure [S][U][A][...], just 20L instead of 400L from Call 3 onward
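To make the saving concrete, the sketch below extracts a snippet from a stand-in file and counts how many lines of main.py content travel in Calls 2 to 4 under each approach. `extract_snippet` is a hypothetical helper, and treating L40-60 as a half-open range (so that it yields 20 lines) is an assumption.

```python
def extract_snippet(text: str, start: int, end: int) -> str:
    """Return lines start..end-1 of text (1-based start, end exclusive)."""
    lines = text.splitlines()
    return "\n".join(lines[start - 1 : end - 1])


# Stand-in for a 400-line main.py.
main_py = "\n".join(f"line {i}" for i in range(1, 401))
snippet = extract_snippet(main_py, 40, 60)

file_lines = len(main_py.splitlines())
snippet_lines = len(snippet.splitlines())

# Lines of main.py content carried in API Calls 2, 3, and 4.
naive_carried = file_lines * 3                    # full file in every call
snippet_carried = file_lines + snippet_lines * 2  # full file once, then the snippet

print(naive_carried, snippet_carried)
```

The gap widens with every additional call, since the naive context keeps the full file for the rest of the session.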
With Methodology: append-only working memory
Each turn, the model appends a Methodology note (goal, plan, discoveries) to the cached prefix. Old tool_results are destroyed; only Methodology + Snippets + the original user message survive.
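The pruning rule can be sketched as a filter over context blocks. This is a hedged illustration, not the author's implementation: the `Block` type, the kind names, and keeping the system prompt alongside the surviving blocks are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # hypothetical kinds: system, user, methodology, snippet, tool_result
    text: str


# Kinds that survive pruning; "system" is an assumption, the rest follow the text.
KEPT_KINDS = {"system", "user", "methodology", "snippet"}


def prune(context: list[Block]) -> list[Block]:
    """Drop old tool results once the model has responded to them."""
    return [block for block in context if block.kind in KEPT_KINDS]


context = [
    Block("system", "You are a coding agent."),
    Block("user", "Fix this bug"),
    Block("methodology", "Meth#1: goal and plan"),
    Block("tool_result", "<400 lines of main.py>"),
    Block("snippet", "<20 lines around the bug>"),
    Block("methodology", "Meth#2: bug located"),
]

print([block.kind for block in prune(context)])
```

Order is preserved, so the surviving blocks still read as an append-only log of what the agent learned.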
API Call 1
→ Response: tool_call(Read) + Methodology#1
API Call 2 (400L paid once)
Context: [Meth#1] [Snippet] [Read(main.py) 400L, FRESH]
→ Response: Methodology#2 +...