AI Agent Harness Research
A daily engineering digest for agent harness design, evaluation infrastructure, observability, replay, and production agent operations. Last updated 2026-05-26 UTC.
10 Recent Research Publications
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
Harness evolution, observability, token efficiency, SWE-bench Verified, Terminal-Bench 2
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents
Composable software-agent SDK, production harness interfaces, extensible agent components
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Runtime self-evolution for software-engineering agents
Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents
Training priors and agentless-to-agent transfer for SWE agents
SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios
Security-aware coding-agent evaluation with multi-file repositories and vulnerability checks
SWE-Bench-CL: Continual Learning for Coding Agents
Chronological issue sequences for measuring agent learning and forgetting
SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling
Training and inference scaling for SWE-bench Verified performance
MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Research-agent harness with tasks, judge, and modular research scaffold
BrowseComp: A Benchmark for Browsing Agents
Deep browsing-agent evaluation for hard-to-find information tasks
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Sandboxed generalist software-agent platform and benchmark harness integration
Daily Harness Engineering Blogs (3 published)
Replayable Agent Harnesses: 2026-05-26
How to design trace capture, deterministic replay, and failure triage for coding agents
Evaluation Isolation for Agent Tools: 2026-05-26
How to separate model quality from harness bugs, flaky tools, and contaminated benchmarks
Model Selection for Coding Agents: 2026-05-26
How to evaluate LLM providers on pass rate, token cost, latency, and harness compatibility