Model Selection for Coding Agents

Selecting a foundation model for an autonomous coding agent is no longer a leaderboard exercise. It is a systems engineering decision. You are wiring a decision node that must parse abstract syntax trees, negotiate tool contracts, recover from execution failures, and emit deterministic diffs. The evaluation surface has shifted from static benchmarks to dynamic harness metrics. This post breaks down how to evaluate LLM providers against four axes: pass rate, token cost, latency, and harness compatibility. We focus on the mechanics that survive production traffic, not synthetic prompts.

Pass Rate and the Failure Taxonomy

Raw benchmark scores obscure what matters in a live harness. Pass rate must be decomposed by failure mode. When an agent attempts a repository migration or refactors a dependency, you need a failure taxonomy that separates reasoning errors from tooling misfires. Classify outcomes into syntax violations, incorrect API or tool usage, hallucinated dependencies, and execution timeouts. Each category maps to a different mitigation strategy.
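A minimal sketch of that taxonomy, assuming the harness targets Python code and exposes a pass flag, a timeout flag, and captured stderr for each attempt. The RunResult record and the substring rules are hypothetical. A real classifier would need more than string matching, and an ImportError can also mean a genuinely missing install rather than a hallucinated package.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    SYNTAX = "syntax_violation"
    API_MISUSE = "incorrect_api_usage"
    HALLUCINATED_DEP = "hallucinated_dependency"
    TIMEOUT = "execution_timeout"
    UNCLASSIFIED = "unclassified"


@dataclass
class RunResult:
    """Hypothetical record emitted by the harness for one attempt."""
    passed: bool
    timed_out: bool = False
    stderr: str = ""


def classify(result: RunResult) -> Optional[Failure]:
    """Return None for a pass, otherwise the first matching category."""
    if result.passed:
        return None
    if result.timed_out:
        return Failure.TIMEOUT
    # Illustrative heuristics only: match on Python exception names in stderr.
    if "SyntaxError" in result.stderr or "IndentationError" in result.stderr:
        return Failure.SYNTAX
    if "ModuleNotFoundError" in result.stderr or "ImportError" in result.stderr:
        return Failure.HALLUCINATED_DEP
    if "TypeError" in result.stderr or "AttributeError" in result.stderr:
        return Failure.API_MISUSE
    return Failure.UNCLASSIFIED
```

Keeping an explicit unclassified bucket matters: if it grows, the taxonomy is missing a category.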

Syntax failures often point to inadequate prefill or token budget constraints. Tool misfires usually indicate misaligned tool contracts or schema drift. Execution timeouts reveal context window fragmentation or excessive intermediate generation.

Maintain strict eval isolation for these measurements. Run each provider against an identical snapshot of the test corpus. Freeze the environment, pin the dependency versions, and eliminate external API variance. Only then can you attribute pass rate deltas to the model itself rather than harness noise.

Track failure rates per attempt, not just per suite. The first pass tells you baseline capability. The tail behavior under repeated attempts reveals where you must inject recovery logic or switch routing strategies.
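Per-attempt tracking can be sketched as a cumulative pass rate: the fraction of tasks solved at or before attempt k. The record shape (task_id, attempt_index, passed) is an assumption about what your harness logs, with attempts numbered from 1.

```python
def pass_rate_by_attempt(records):
    """records: iterable of (task_id, attempt_index, passed) tuples.

    Returns {k: fraction of tasks solved at or before attempt k}.
    """
    first_pass = {}  # task_id -> earliest attempt index that passed
    tasks = set()
    max_attempt = 0
    for task_id, attempt, passed in records:
        tasks.add(task_id)
        max_attempt = max(max_attempt, attempt)
        if passed and attempt < first_pass.get(task_id, attempt + 1):
            first_pass[task_id] = attempt
    if not tasks:
        return {}
    return {
        k: sum(1 for a in first_pass.values() if a <= k) / len(tasks)
        for k in range(1, max_attempt + 1)
    }
```

The value at k=1 is the baseline capability. A curve that flattens early suggests retries are not helping and recovery logic or rerouting is the better lever.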

Token Economics and Retry Budgets

Cost per token is a surface metric. True cost accounting requires measuring effective cost per successful resolution: total spend across every attempt, failed ones included, divided by the number of tasks actually resolved. Coding agents rarely succeed on the first call. They require multi-step reasoning, intermediate file writes, and test verification loops. Each iteration burns context and increases the probability of context window overflow or degradation.
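As a rough sketch, effective cost per successful resolution is total spend over all attempts divided by the number of distinct tasks resolved. The Attempt record and the per-token prices passed in are hypothetical; substitute whatever your provider actually bills.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """Hypothetical per-call usage record."""
    task_id: str
    input_tokens: int
    output_tokens: int
    resolved: bool


def effective_cost(attempts, usd_per_input_token, usd_per_output_token):
    """Total spend (failed attempts included) per distinct resolved task."""
    total = sum(
        a.input_tokens * usd_per_input_token
        + a.output_tokens * usd_per_output_token
        for a in attempts
    )
    resolved = {a.task_id for a in attempts if a.resolved}
    if not resolved:
        return float("inf")  # money spent, nothing resolved
    return total / len(resolved)
```

Failed attempts on tasks that never resolve still count toward the numerator, which is what separates this figure from a per-call average.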

Build a retry budget into your evaluation framework. Allocate a fixed token allowance per task and measure how efficiently each provider converges on a correct state. Some models optimize for brevity and fail to emit sufficient diagnostic steps. Others over-explain, exhausting the budget before reaching a compile-ready state. Map the cost/latency curve across your task distribution.

Identify the inflection point where additional tokens yield diminishing returns on pass rate. That inflection defines your default routing policy and your overflow trigger for secondary models. Track token burn across the entire execution graph, including system prompts, intermediate scratchpads, and tool invocation payloads.
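One simple way to locate that inflection, assuming you have already measured pass rate at several token budgets: walk the curve and stop at the first budget where the marginal gain per 1,000 extra tokens falls below a threshold. The threshold value and the curve data are illustrative assumptions, not recommendations.

```python
def find_inflection(curve, min_gain_per_1k=0.01):
    """curve: list of (token_budget, pass_rate), budgets strictly increasing.

    Returns the first budget after which the marginal pass-rate gain
    per 1,000 additional tokens drops below min_gain_per_1k.
    """
    if not curve:
        raise ValueError("curve must contain at least one point")
    for (b0, p0), (b1, p1) in zip(curve, curve[1:]):
        gain_per_1k = (p1 - p0) / ((b1 - b0) / 1000)
        if gain_per_1k < min_gain_per_1k:
            return b0
    return curve[-1][0]  # gains never dropped below the threshold


# Made-up measurements for illustration.
measured = [(2000, 0.40), (4000, 0.60), (8000, 0.70), (16000, 0.72)]
default_budget = find_inflection(measured)
```

With these made-up numbers the function returns 8000: doubling the budget to 16,000 buys only two points of pass rate. A task that exhausts the default budget is a candidate for the overflow route.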

Latency and Observability Integration

Latency is not a single number. It is p50 think-time, p95 tool call duration, and tail latency under concurrent execution. A model that responds quickly but needs three sequential retries often finishes later than a slower, more deliberate one. Instrument your harness to capture end-to-end duration per tool call, not just raw API round-trip times.
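A small sketch of both points, using a nearest-rank percentile and summing attempt durations to get end-to-end time. The sample durations are invented for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of samples; pct is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def end_to_end(attempt_durations):
    """Sequential retries add up: total wall time across all attempts."""
    return sum(attempt_durations)


# Invented tool call durations in seconds, with one slow outlier.
durations = [0.8, 1.0, 1.2, 1.1, 0.9, 1.0, 1.3, 0.95, 1.05, 9.0]
p50 = percentile(durations, 50)  # 1.0
p95 = percentile(durations, 95)  # 9.0

# A 2 s model needing an initial call plus three retries vs. a 5 s model
# that succeeds on the first call.
fast_with_retries = end_to_end([2.0, 2.0, 2.0, 2.0])  # 8.0
slow_one_shot = end_to_end([5.0])                     # 5.0
```

The p50 hides the outlier entirely while the p95 is dominated by it, which is why both belong on the dashboard.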

Observability must extend into the decision graph. Log prompt payloads, tool call sequences, and parsed responses. Store them in a replayable format. When a coding agent stalls or produces a malformed patch, you need deterministic replay to reconstruct the exact state transition. Replay requires capturing system prompts, sampling parameters, and the exact schema version of each tool.
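A replay record might look like the sketch below. The field names are assumptions, not a standard format. It stores the raw response alongside the inputs, so a replay can feed recorded output back through the orchestrator without calling the provider again.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReplayRecord:
    """Hypothetical record for one model call; adapt fields to your harness."""
    trace_id: str
    system_prompt: str
    messages: tuple             # prompt payloads, in order
    sampling: dict              # e.g. temperature, top_p, max_tokens
    tool_schema_versions: dict  # tool name -> schema version
    tool_calls: tuple           # tool call sequence as emitted
    response: str               # raw provider response


def to_line(record):
    """Serialize to one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def input_fingerprint(record):
    """Hash of the inputs only, excluding outputs and the trace ID."""
    data = asdict(record)
    for key in ("trace_id", "tool_calls", "response"):
        data.pop(key)
    blob = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()
```

Two records with the same input fingerprint but different responses point at sampling nondeterminism or a provider-side change rather than a harness bug.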

Without this, post-mortems become guesswork. Integrate structured tracing early. Route traces through your eval pipeline so you can correlate latency spikes with specific failure categories in your taxonomy. Use distributed trace IDs that survive across provider switches, allowing you to isolate bottlenecks in your orchestrator versus the upstream API.

Harness Compatibility and Tool Contracts

A model is only as good as its ability to honor the harness interface. Tool contracts define the boundaries of agent action. They specify allowed file operations, network constraints, and output schemas. Evaluate providers on structural compliance, not just functional correctness. Can the model consistently emit JSON that validates against your tool schema? When a tool reports a rate limit or a tripped circuit breaker, does it back off without requiring wrapper logic?
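A structural compliance check can be sketched with the standard library alone. The write-file contract below is hypothetical, and the flat name-to-type mapping is a deliberately simplified stand-in for a full schema validator.

```python
import json

# Hypothetical contract for a file-write tool: field name -> required type.
WRITE_FILE_CONTRACT = {"path": str, "content": str, "overwrite": bool}


def validate_tool_call(raw, contract):
    """Return a list of violations; an empty list means the call complies."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be an object"]
    errors = []
    for name, expected in contract.items():
        if name not in payload:
            errors.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    for name in sorted(payload.keys() - contract.keys()):
        errors.append(f"unexpected field: {name}")
    return errors
```

Counting violations per provider across the test corpus gives a structural compliance rate that is independent of whether the resulting patch was correct. Truncated output shows up here as an invalid JSON violation.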

Test contract adherence under adversarial conditions. Introduce malformed intermediate states, truncated outputs, and schema updates mid-run. Measure how quickly the model self-corrects versus how often it compounds errors. Models that maintain strict output discipline reduce your need for post-processing validators and defensive parsing. This compatibility directly impacts your observability overhead and your ability to scale the harness horizontally.

We do not embed academic references here; research.epsilondelta.tech tracks related research papers separately for deeper methodological dives.

Operational Checklist

Define a failure taxonomy covering syntax, tool misuse, dependency errors, and timeouts before running initial evals.
Enforce eval isolation with frozen snapshots, pinned environments, and deterministic seeds.
Allocate explicit retry budgets per task and track effective cost per successful resolution.
Instrument p50/p95 latency across full tool call cycles, capturing think-time, network, and parsing overhead.
Implement replayable logging for every prompt, tool contract invocation, and schema version used.
Stress-test tool contracts under truncation, schema drift, and partial execution states.
Route providers dynamically based on cost/latency inflection points and failure category profiles.
Review pass rate distributions weekly, correlating tail failures with specific contract violations.