Type Checking in Agentic Workflows – Conner Nilsen – PyCon US 2026 Typing Summit

ocamoss2 pts0 comments

Talk: Type Checking in Agentic Workflows | Pyrefly

Skip to main content

Does adding type checking to an agentic workflow really help agents?

We ran an experiment recently to determine whether there are improvements in the success rate for completing different kinds of tasks. In theory, having a type checker present should help the agent catch type errors earlier, validate fixes incrementally as it works, and reduce the need for slow, test-driven iterative feedback loops.

This talk was originally presented at the PyCon US 2026 Typing Conference. The slides and edited transcript are provided below for your convenience.

📎 Slides (PDF)

Transcript​

The agenda for today: we'll start with the background to give you some context, go through some of our findings, talk about our experiment setup, and end with some of the remaining open questions. The slide deck contains an appendix with more details which you can read later.

Background​

At a high level, our experiment consisted of having a bunch of tasks and having agents attempt to work on those tasks with and without different type checkers. We mainly looked at three success metrics:

Success rate — how often the agent was able to successfully complete different tasks, verified with tests.

Number of steps — how many times an agent searches for information, makes edits, and does other operations.

Task duration — basically wall time: how long does it take an agent to do the task?

Findings​

To answer the initial question of whether type checking helped or not: it depends.

We noticed that type checking helped the agent only when the code base was well typed . In those cases there was less exploration work by the agent, and the success rate increased from about 80% to 84%. We also saw a reduced number of steps, and the agent generally finished tasks faster.

When there was low type coverage, however, there was no meaningful impact: we noticed that the type errors actually distracted the agent and took it off course. In those cases, the agent would typically solve for type checker cleanliness in adjacent code rather than solving the task it was meant to do. This involved fixing import issues, missing attributes, or type signature mismatches, instead of focusing on real fixes.

We also found that how you deliver the feedback to the agent matters at least as much as the feedback itself. Models generally don't just use tools that you tell them about. To get around this, we ended up creating a separate lightweight agent to make sure that the model focused on the type errors it saw on every edit. More on that later.

Models also responded better to feedback delivered as a separate conversation step rather than in edit output—that is, as a separate conversation step rather than bundled in when we told the agent "Hey, your edits were successful," or other follow-up information.

Different models also had different sensitivities to feedback. For example, Claude was very sensitive to errors. It would fix whatever type errors it was shown, but that also meant noisy signals would throw it off course often. GPT Codex, however, was very goal-oriented and needed structural intervention to make sure it actually addressed the errors we provided. We expect these sensitivities to change as the models progress over time. So it's worth exploring how aggressively you want to filter the errors that are surfaced, and where in your agentic loop you want to add this feedback.

Our last finding is that feedback at higher frequencies gives the model more confidence. Consistent feedback verifies that the model is going in the right direction and prevents what we found are "search spirals", where the model goes back and re-verifies things that occur after every edit. We also noticed that empty output, like "type checked X files and found zero errors," acted as a great external reflection cue for agents and helped make sure they stayed on course.

Only running once at the end of a task is too late. In that case, the models never went back to fix anything. It needs to be very consistent that you're providing this feedback to the model.

Experiment Setup​

Now onto how we came up with some of our findings. We ran two similar experiments: one on an open-source benchmark and another on an internal benchmark that we created.

The external experiment tested Pyrefly on SWE-bench Verified , a benchmark for evaluating AI agents on solving real-world engineering tasks. This involves a couple of very popular libraries like Matplotlib, Django, SymPy, and others. Just due to the involvement of different legacy code bases, there's generally low type coverage in these code bases.

We also re-ran the experiment with an internal benchmark called MetaSWEBench , internally curated for evaluating AI agents. With this experiment, we used Pyre as our type checker, since that was what was available at the time the code in MetaSWEBench was committed. This benchmark had generally higher type coverage, just...

type agent feedback errors experiment different

Related Articles