Evaluating performance and efficiency of the GitHub Copilot agentic harness across models and tasks

The GitHub Blog

Shibani Basava & Carlos Castro

June 25, 2026

7 minutes

While the model provides the raw intelligence, the harness shapes how effectively that intelligence is applied. The GitHub Copilot agentic harness is a single shared component of the GitHub Copilot SDK, which powers the GitHub Copilot CLI, GitHub Copilot app, and Copilot code review, along with a wide variety of experiences across GitHub and Microsoft. Improve the harness, and every surface benefits.

The GitHub Copilot agentic harness powers GitHub Copilot experiences.

The harness orchestrates the tools, context, and workflow. A harness should be fast, token-efficient, and predictable for developers. That’s what we designed GitHub Copilot’s agentic harness to be.

In this post, we’ll present data showing the efficiency and performance of the GitHub Copilot agentic harness across a wide range of agentic software engineering tasks.

More optimizations we are making

Read more about our latest optimizations on context handling and model routing to get the most out of each token. We have also shared more about experiments and optimizations around delegation, and how it benefits developers today.

How we iterate with benchmarks

We continuously evaluate the capability and efficiency of the GitHub Copilot agentic harness through a combination of public and internally developed benchmarks. Our public benchmarks include industry standards, while several internal benchmarks are derived from large codebases inside GitHub and Microsoft. We complement this with real-world metrics and online experiments to ensure we understand the harness’s performance in controlled environments and its practical impact on agentic problem solving and task completion.

We control as many variables as possible when comparing GitHub Copilot’s harness with the model provider’s harness: we use the same model and the same benchmark task, and we normalize the context window, reasoning effort, tool selection, and MCP servers.
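A minimal sketch of how such a controlled comparison could be checked in code. Everything here is an illustrative assumption rather than the actual evaluation tooling: the `RunConfig` type, its field names, and the placeholder values are hypothetical. The idea is only that two runs are comparable when the harness differs and every controlled setting matches.

```python
from dataclasses import dataclass, replace

# Settings that must be identical across harnesses for a fair comparison.
# The field names are hypothetical; they mirror the variables listed above.
CONTROLLED_FIELDS = (
    "model", "task_id", "context_window",
    "reasoning_effort", "tools", "mcp_servers",
)

@dataclass(frozen=True)
class RunConfig:
    harness: str
    model: str
    task_id: str
    context_window: int
    reasoning_effort: str
    tools: frozenset
    mcp_servers: frozenset

def mismatched_fields(a: RunConfig, b: RunConfig) -> list:
    """Return the controlled settings on which two runs disagree."""
    return [name for name in CONTROLLED_FIELDS
            if getattr(a, name) != getattr(b, name)]

def is_fair_comparison(a: RunConfig, b: RunConfig) -> bool:
    """Fair means: different harness, identical controlled settings."""
    return a.harness != b.harness and not mismatched_fields(a, b)

# Placeholder values for illustration only, not real benchmark settings.
base = RunConfig(
    harness="harness-a", model="example-model", task_id="task-001",
    context_window=200_000, reasoning_effort="medium",
    tools=frozenset({"shell", "edit"}), mcp_servers=frozenset(),
)
other = replace(base, harness="harness-b")
drifted = replace(other, reasoning_effort="high")

print(is_fair_comparison(base, other))   # True
print(mismatched_fields(base, drifted))  # ['reasoning_effort']
```

Listing the mismatched fields, rather than returning only a boolean, makes it easier to see which setting drifted when a pair of runs is rejected.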

Below we report our latest results for a subset of the benchmarks we track, across four leading models: Claude Sonnet 4.6, Claude Opus 4.7, GPT‑5.4, and GPT‑5.5:

| Benchmark | Domain | Purpose |
| --- | --- | --- |
| SWE-bench Verified | 500 human-validated bug-fix tasks from open-source Python repositories | Established industry-standard benchmark for coding agents |
| SWE-bench Pro | More difficult, multi-step engineering tasks requiring deeper reasoning and broader code changes | Better reflects complex, real-world software engineering work |
| SkillsBench | How effectively an agent uses skills to solve tasks | Evaluates extensibility and skill use and triggering capabilities |
| TerminalBench | Agent performance on terminal-based tasks | Measures effectiveness in command-line workflows used by developers |
| Win-Hill | Internal benchmark for tasks running inside Windows containers | Validates that performance generalizes across operating systems and environments |

Throughout, we compare GitHub Copilot CLI against the model-vendor harnesses that ship those models natively: Claude Code for Sonnet 4.6 and Opus 4.7, and Codex CLI for GPT‑5.4 and GPT‑5.5.

Token efficiency

With the model and task held fixed, the GitHub Copilot harness achieves task completion rates on par with the model-vendor harnesses across multiple benchmarks, while consuming fewer tokens in most configurations.

Token efficiency: GitHub Copilot CLI vs. other model-vendor harnesses
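To make "holding the model and task fixed" concrete, here is a hedged sketch of one way relative token savings could be computed per (model, task) pair. The record layout, the harness labels, and the token counts are hypothetical placeholders, not measured results, and the post does not state that this exact formula is used.

```python
from collections import defaultdict

def token_savings(records, baseline, candidate):
    """Relative token savings of `candidate` over `baseline`.

    `records` is an iterable of dicts with the (assumed) keys
    "harness", "model", "task_id", and "tokens". Only pairs where both
    harnesses ran the same model on the same task are compared.
    """
    by_key = defaultdict(dict)
    for r in records:
        by_key[(r["model"], r["task_id"])][r["harness"]] = r["tokens"]

    savings = {}
    for key, per_harness in by_key.items():
        if baseline in per_harness and candidate in per_harness:
            base = per_harness[baseline]
            if base > 0:  # skip degenerate runs that recorded no tokens
                savings[key] = (base - per_harness[candidate]) / base
    return savings

# Made-up numbers, purely to show the arithmetic.
example_records = [
    {"harness": "vendor", "model": "m1", "task_id": "t1", "tokens": 1000},
    {"harness": "copilot", "model": "m1", "task_id": "t1", "tokens": 800},
    {"harness": "vendor", "model": "m1", "task_id": "t2", "tokens": 500},
]

result = token_savings(example_records, baseline="vendor", candidate="copilot")
print(result)  # only ("m1", "t1") has both harnesses, so t2 is left out
```

A positive value means the candidate used fewer tokens than the baseline on that pair; unmatched pairs are dropped rather than imputed, so the comparison stays like for like.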

Task resolution

Token efficiency only matters if the work actually gets done.

Task resolution rates for the GitHub Copilot agentic harness across these benchmarks are on par with those of the model-vendor harnesses when the model and benchmark task are held fixed. The full potential of the underlying model therefore remains available, alongside multi-model flexibility, token efficiency, and memory and context capabilities.

Task resolution: GitHub Copilot CLI vs. the model-vendor harnesses

These results reflect effective parity: the differences in either direction fall within the run-to-run variance caused by the stochastic nature of the models.
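One simple way to express "within the variance" in code is to compare the gap between mean resolution rates against the run-to-run spread. The sketch below is an assumption about how such a check might look, not the method behind the results above: the threshold `k`, the use of the larger sample standard deviation, and the example rates are all illustrative.

```python
import statistics

def within_run_variance(rates_a, rates_b, k=2.0):
    """Return True if the mean gap is no larger than k times the spread.

    `rates_a` and `rates_b` are resolution rates from repeated runs of
    two harnesses on the same model and benchmark. Each needs at least
    two runs, since a sample standard deviation is undefined otherwise.
    """
    gap = abs(statistics.mean(rates_a) - statistics.mean(rates_b))
    spread = max(statistics.stdev(rates_a), statistics.stdev(rates_b))
    return gap <= k * spread

# Made-up rates for illustration only.
close_a = [0.60, 0.64, 0.62]
close_b = [0.61, 0.63, 0.65]
far_b = [0.80, 0.81, 0.82]

print(within_run_variance(close_a, close_b))  # True
print(within_run_variance(close_a, far_b))    # False
```

With only a handful of runs per configuration this is a coarse screen; a formal significance test would be the natural next step if the gap sat near the threshold.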

TerminalBench: Token efficiency, task completion, and variance

To continuously improve the GitHub Copilot agentic harness on task completion and token efficiency, we regularly perform thorough analyses across benchmarks. Below is an example of variance analysis on TerminalBench 2.0, which not only highlights GitHub Copilot’s strength on task completion and token efficiency, but also shows the run-to-run variance intrinsic to this kind of benchmark.

Resolution rate vs. cost per task. Up and to the left is better: solve more, spend less.

Every marker is one agent-and-model configuration on TerminalBench 2.0, with resolution rate on the...
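A chart where "up and to the left is better" can also be read as a Pareto comparison: one configuration dominates another if it resolves at least as many tasks at no higher cost and is strictly better on one of the two. The sketch below assumes hypothetical dictionary keys and invented values; it is not derived from the TerminalBench data.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` on both axes and better on one."""
    no_worse = (a["resolution_rate"] >= b["resolution_rate"]
                and a["cost_per_task"] <= b["cost_per_task"])
    strictly_better = (a["resolution_rate"] > b["resolution_rate"]
                       or a["cost_per_task"] < b["cost_per_task"])
    return no_worse and strictly_better

def pareto_frontier(configs):
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]

# Invented configurations, for illustration only.
example_configs = [
    {"name": "config-a", "resolution_rate": 0.60, "cost_per_task": 1.00},
    {"name": "config-b", "resolution_rate": 0.60, "cost_per_task": 1.50},
    {"name": "config-c", "resolution_rate": 0.70, "cost_per_task": 2.00},
]

frontier = [c["name"] for c in pareto_frontier(example_configs)]
print(frontier)  # ['config-a', 'config-c']
```

Here `config-b` is dropped because `config-a` resolves the same share of tasks for less, while `config-c` stays because its higher resolution rate is a genuine trade-off against cost.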
