AI Agent Harness Research

A daily engineering digest for agent harness design, evaluation infrastructure, observability, replay, and production agent operations. Last updated 2026-05-26 UTC.

10 Recent Research Publications

2026-04

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Harness evolution, observability, token efficiency, SWE-bench Verified, Terminal-Bench 2

2025-11

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Composable software-agent SDK, production harness interfaces, extensible agent components

2025-11

Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Runtime self-evolution for software-engineering agents

2025-09

Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents

Training priors and agentless-to-agent transfer for SWE agents

2025-09

SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios

Security-aware coding-agent evaluation with multi-file repositories and vulnerability checks

2025-06

SWE-Bench-CL: Continual Learning for Coding Agents

Chronological issue sequences for measuring agent learning and forgetting

2025-06

SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling

Training and inference scaling for SWE-bench Verified performance

2025-05

MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research

Research-agent harness with tasks, judge, and modular research scaffold

2025-04

BrowseComp: A Benchmark for Browsing Agents

Deep browsing-agent evaluation for hard-to-find information tasks

2024-07

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Sandboxed generalist software-agent platform and benchmark harness integration

Daily Harness Engineering Blogs (3 published)

2026-05-26

Replayable Agent Harnesses: 2026-05-26

How to design trace capture, deterministic replay, and failure triage for coding agents

2026-05-26

Evaluation Isolation for Agent Tools: 2026-05-26

How to separate model quality from harness bugs, flaky tools, and contaminated benchmarks

2026-05-26

Model Selection for Coding Agents: 2026-05-26

How to evaluate LLM providers on pass rate, token cost, latency, and harness compatibility