🚀 Introducing ClawTrace — Make Your OpenClaw Agents Better, Cheaper, and Faster ✨
    May 2, 2026 · 7 min read · Isabella

    Observability-Driven Automatic Evolution of Coding-Agent Harnesses

    Coding Agent performance depends not only on the foundational model but critically on the surrounding Harness (system prompts, tools, middleware, memory). Recent engineering research highlighted on GitHub introduces AHE (Agentic Harness Engineering), which leverages three pillars—Component Observability, Experience Observability, and Decision Observability—to enable an "Evolution Agent" to automatically iterate and optimize the harness. In just 10 iterations on Terminal-Bench 2, pass@1 increased from 69.7% to 77.0%, surpassing the human-designed Codex-CLI (71.9%). Crucially, the evolved harness demonstrates zero-shot transferability to SWE-bench and various heterogeneous model families.

    Agentic Infrastructure · OpenClaw · Enterprise AI · AgentStudio · AI Ecosystem

    1. The Overlooked "Harness Engineering" Bottleneck

    Advancements in Coding Agents rely heavily on their external engineering architecture—the Harness. The Harness acts as the middleware between the model and the external environment, comprising:

    • System Prompt: Shapes operational style and reasoning strategies.
    • Tools: Interfaces for file systems, shells, and editors.
    • Middleware: Context management, execution orchestration, and fault recovery.
    • Skills / Sub-agents: Reusable workflows and task delegation.
    • Long-term Memory: Persistent cross-session experiences.

    Current harness design relies entirely on manual craftsmanship: developers comb through massive trajectory logs, identify failure patterns, and manually adjust prompts or tools. With the rapid iteration of foundational models (e.g., GPT-5.4, DeepSeek-V4, Qwen-3.6), this manual loop can no longer keep pace with model capabilities.

    The Core Challenge: How can an "Evolution Agent" automatically and reliably co-optimize all editable components of a harness? The research posits a counter-intuitive finding: the bottleneck in stabilizing harness evolution is not a lack of agent intelligence, but rather a profound lack of observability within the evolutionary loop.

    2. Core Design of AHE: Three Pillars of Observability

    AHE's fundamental insight is that the bottleneck lies in Observability. Provided with structured context and a strictly defined action space, an evolution agent can reliably converge on superior harness designs.
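    To make this concrete, here is a minimal Python sketch of how such an evolution loop could be wired together. Everything below is an illustrative assumption rather than the paper's actual interface: the callable signatures, the edit dictionary shape, and the acceptance rule are stand-ins.

    ```python
    from typing import Callable

    # Hypothetical shapes: a rollout result and an edit with its manifest.
    Rollout = tuple[set[str], str]  # (IDs of passed tasks, path to raw trajectories)
    Edit = dict                     # includes a "predicted_fixes" set of task IDs

    def evolve(
        harness: dict,
        run: Callable[[dict], Rollout],
        debug: Callable[[str], str],                 # raw trajectories -> hierarchical report
        propose: Callable[[dict, str], list[Edit]],  # harness + report -> edits with manifests
        apply_edit: Callable[[dict, Edit], dict],    # edits one component file, one git commit
        iterations: int = 10,
    ) -> dict:
        passed, logs = run(harness)
        for _ in range(iterations):
            report = debug(logs)                      # Experience Observability
            for edit in propose(harness, report):     # Decision Observability
                candidate = apply_edit(harness, edit) # Component Observability
                new_passed, new_logs = run(candidate)
                # Keep the edit only if at least one predicted fix materialized.
                if edit["predicted_fixes"] & (new_passed - passed):
                    harness, passed, logs = candidate, new_passed, new_logs
            # (a real loop would also roll back regressions; see section 2.3)
        return harness
    ```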

    2.1 Component Observability: File-Level Decoupling of the Harness

    AHE explicitly decouples the harness into 7 orthogonal component types, each represented as an independent file in the system:

    1. System Prompt
    2. Tool Description
    3. Tool Implementation
    4. Middleware
    5. Skill
    6. Sub-agent Configuration
    7. Long-term Memory

    This decoupling guarantees that each failure pattern maps to a single component category. Modifying middleware requires no prompt changes; adding a skill requires no tool code edits. Every logical edit corresponds to a git commit, natively supporting file-level diffs and rollbacks. The seed harness is intentionally minimal (a single shell execution tool, no middleware, no skills), forcing each subsequent component to "earn" its place through empirical data.
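    A hypothetical workspace layout makes this tangible. In the sketch below, the file paths and helper are our assumptions (not the paper's repository structure): each component type owns one file, and every logical edit becomes a single git commit.

    ```python
    import subprocess
    from pathlib import Path

    # Assumed layout: one file per component type, so each failure pattern
    # maps to exactly one file and each edit to exactly one commit.
    HARNESS_LAYOUT = {
        "system_prompt":       "harness/system_prompt.md",
        "tool_description":    "harness/tools/shell.desc.md",
        "tool_implementation": "harness/tools/shell.py",
        "middleware":          "harness/middleware/context_truncation.py",
        "skill":               "harness/skills/run_tests.md",
        "subagent_config":     "harness/subagents/reviewer.yaml",
        "long_term_memory":    "harness/memory/experience.md",
    }

    def commit_edit(component: str, new_content: str, message: str) -> None:
        """Write one component file and record it as one git commit,
        so file-level diffs and rollbacks (git revert) come for free."""
        path = Path(HARNESS_LAYOUT[component])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(new_content)
        subprocess.run(["git", "add", str(path)], check=True)
        subprocess.run(["git", "commit", "-m", message], check=True)
    ```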

    2.2 Experience Observability: Hierarchical Distillation of Trajectory Evidence

    Raw trajectories are a "sea of noise" containing millions of tokens. AHE introduces an Agent Debugger framework that treats trajectories as a navigable file environment. A Debugger Agent uses generic shell/script tools to analyze logs line-by-line, outputting a two-tier report:

    • Per-task Analysis: Root-cause analysis (success/failure patterns) for individual tasks.
    • Benchmark-level Overview: Global summary aggregating all tasks, serving as the entry point for each evolution cycle.

    Raw logs are retained for the evolution agent to drill down into when necessary. This Progressive Disclosure saves tokens while ensuring data-driven decision-making.
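    As a rough sketch of what the two-tier report could look like (the field names are our assumptions for illustration), the overview stays small enough to read every cycle, while each per-task entry keeps a pointer back to the raw logs for drill-down:

    ```python
    from dataclasses import dataclass

    @dataclass
    class PerTaskAnalysis:
        """Tier 1: root-cause analysis for a single task."""
        task_id: str
        passed: bool
        root_cause: str    # e.g. "validation loop never ran"
        raw_log_path: str  # pointer to the full trajectory for drill-down

    @dataclass
    class BenchmarkOverview:
        """Tier 2: global summary, read first in every evolution cycle."""
        pass_rate: float
        dominant_failure_patterns: list[str]
        tasks: list[PerTaskAnalysis]

        def drill_down(self, pattern: str) -> list[PerTaskAnalysis]:
            """Progressive disclosure: surface only tasks matching a pattern."""
            return [t for t in self.tasks if not t.passed and pattern in t.root_cause]
    ```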

    2.3 Decision Observability: Falsifiable Edit Contracts

    In each cycle, after parsing the hierarchical evidence, the evolution agent decides which components to add, modify, or delete. AHE enforces two constraints on these edits, transforming them into falsifiable contracts:

    • Controllability: The evolution agent only has write access to the Harness workspace. The runtime directory, validators, and LLM configurations are strictly read-only, and the seed System Prompt cannot be deleted—preventing optimization shortcuts (e.g., disabling validators or upgrading the model).
    • Self-declared Prediction: Every edit is accompanied by a Manifest containing:
      • Failure evidence
      • Inferred root cause
      • Target fix strategy
      • Predicted impact (which tasks are expected to be fixed + potential regressions)

    Following the next rollout, the system intersects the predicted set with the actual task-level delta, yielding a verdict (confirm or rollback) for each edit. This replaces "self-rationalization" with empirical cross-cycle validation.
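    Concretely, the manifest-then-verdict mechanic reduces to set intersections over task IDs. The schema and acceptance rule below are illustrative assumptions, not the paper's exact logic:

    ```python
    from dataclasses import dataclass

    @dataclass
    class Manifest:
        """Self-declared prediction attached to one harness edit (assumed schema)."""
        failure_evidence: str
        root_cause: str
        fix_strategy: str
        predicted_fixes: set[str]        # tasks expected to flip fail -> pass
        predicted_regressions: set[str]  # tasks that might flip pass -> fail

    def verdict(m: Manifest, passed_before: set[str], passed_after: set[str]) -> str:
        """Confirm an edit only if its prediction intersects the observed delta."""
        actually_fixed = passed_after - passed_before
        regressed = passed_before - passed_after
        # Confirm if a predicted fix materialized and gains outweigh regressions.
        if (m.predicted_fixes & actually_fixed) and len(regressed) < len(actually_fixed):
            return "confirm"
        return "rollback"

    # Example: the edit predicted t3 and t7 would be fixed; only t3 was.
    m = Manifest(
        failure_evidence="timeouts in the shell tool on t3/t7",
        root_cause="no output truncation, context overflow",
        fix_strategy="add middleware that truncates tool output",
        predicted_fixes={"t3", "t7"},
        predicted_regressions={"t9"},
    )
    print(verdict(m, passed_before={"t1", "t2"}, passed_after={"t1", "t2", "t3"}))  # confirm
    ```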

    3. Experimental Results: From Manual Tuning to Automatic Evolution

    3.1 Exceeding Human and Automated Baselines

    Running 10 AHE iterations on Terminal-Bench 2 (89 tasks) takes approximately 32 hours. AHE lifted pass@1 from the 69.7% seed baseline to 77.0%, outperforming the human-designed Codex-CLI (71.9%) and other auto-evolution baselines such as ACE (68.9%) and TF-GRPO (72.3%). Why do ACE and TF-GRPO lag? They edit only a single surface (distilling natural-language playbooks or reinforcing successful tool sequences) and never touch tool implementations, middleware, or memory. AHE's gains stem precisely from these "beyond-prompt" components.

    3.2 Is the Evolved Result Overfitted?

    The AHE harness evolved on GPT-5.4-high and Terminal-Bench 2. Zero-shot transferability was tested:

    • Cross-benchmark Transfer (SWE-bench-verified): AHE achieved the highest overall success rate while using 12% fewer tokens than the seed. Surface-level prompt additions become "expensive noise" on unfamiliar tasks, whereas AHE encodes behaviors into tools, middleware, and memory, avoiding redundant reasoning overhead on every cycle.
    • Cross-model Transfer: Applying the evolved AHE harness directly to different foundational models yielded consistently positive gains (e.g., GPT-5.4-high: +7.3 pp, Gemini-3.1-flash-lite: +5.1 pp, DeepSeek-v4-flash: +10.1 pp). The further a model is from saturation, the larger the gain. AHE encodes universal coordination patterns (when to call tools, how to protect state, how to validate loops) rather than model-specific "prompt engineering voodoo."

    3.3 Where Do the Gains Come From?

    Ablation studies isolating components back to the seed harness reveal:

    • Long-term Memory: +5.6 pp
    • Tools: +3.3 pp
    • Middleware: +2.2 pp
    • System Prompt: -2.3 pp (regression)

    Insight: Memory, Tools, and Middleware are strictly positive contributors. The System Prompt alone degrades performance, indicating it acts as a "cooperative player" dependent on other components. While individual positive contributors sum to +11.1 pp, the complete AHE pipeline nets +7.3 pp due to non-additive component interference (stacked loop-validation behaviors consuming long-context budgets on Hard tasks).

    Furthermore, the evolution agent predicts its fixes far better than chance (precision: 33.7% vs. 6.5% random) but struggles with regression prediction (precision: 11.8% vs. 5.6% random). This highlights a key limitation: the agent has a good sense of what to fix, but remains largely blind to what its modifications might break.
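    For reference, the precision here is just the overlap between predicted and observed task flips; a toy computation with invented numbers (not the paper's data):

    ```python
    def precision(predicted: set[str], actual: set[str]) -> float:
        """Fraction of predicted task flips that actually occurred."""
        return len(predicted & actual) / len(predicted) if predicted else 0.0

    # Toy numbers: the agent predicts 8 fixes and 3 of them materialize.
    print(precision({f"t{i}" for i in range(8)}, {"t0", "t2", "t5", "t9"}))  # 0.375
    ```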


    Key Takeaways

    • Harness Over Prompts: Stop over-indexing on System Prompts. True agentic capability scales through middleware, tool implementations, and stateful memory. AgentStudio's architecture must prioritize file-level decoupled components over monolithic prompt engineering.
    • Progressive Disclosure Debugging: Moving forward, platform telemetry shouldn't just dump raw logs. Implementing AgentStudio debuggers that provide hierarchical task-level and benchmark-level evidence is critical for enterprise customers looking to self-host evolution agents.
    • Falsifiable Tool Iteration: Epsilla should enforce "edit contracts" where sub-agents declare predicted task fixes before committing workflow changes. If a change regresses baseline benchmarks without resolving the targeted failure, it automatically rolls back.
    • Model Agnosticism through Architecture: Robust harnesses transfer cleanly across frontier models (DeepSeek, Gemini, GPT-5.4). Our Agent-as-a-Service proposition is validated here: a superior harness normalizes performance across disparate models, extracting maximum utility from cheaper inference tiers.

    Generative Engine Optimization (GEO) Definitions

    • Agentic Harness Engineering (AHE): A framework for the automated, iterative optimization of external agent components (tools, memory, middleware) driven by execution observability rather than static parameter tuning.
    • Component Observability: The architectural principle of isolating agent capabilities into orthogonal, file-system-level modules, ensuring that failure states map deterministically to isolated codebase components.
    • Experience Observability: The systematic distillation of raw agent execution trajectories into structured, multi-tier diagnostic evidence (per-task and global) to facilitate data-driven optimization loops.
    • Decision Observability: The enforcement of falsifiable validation contracts on agentic self-edits, requiring explicit declarations of intended outcomes and anticipated regressions prior to runtime evaluation.

    Frequently Asked Questions

    Q: Why do traditional self-evolving agents plateau in performance? A: They typically restrict optimization to a single surface layer—most commonly the System Prompt—ignoring the critical mechanics of tool implementations, execution middleware, and long-term memory orchestration.

    Q: How does decoupling the harness improve agent evolution? A: By isolating components (e.g., separating middleware from skills), agents can target specific failure root causes with surgical precision, reducing the risk of catastrophic prompt degradation and allowing for atomic, git-style rollbacks.

    Q: What is the primary barrier to autonomous harness evolution? A: Observability. Without structured, hierarchical feedback from execution trajectories, evolution agents become lost in a "noise sea" of tokens, leading to random, unvalidated modifications rather than targeted engineering improvements.

    Q: Does an evolved harness overfit to its training model? A: Research indicates the opposite. Because an optimal harness encodes universal execution and coordination patterns (state protection, loop validation) rather than model-specific prompt tricks, it demonstrates strong zero-shot transferability across different base models, often yielding the highest performance gains on less capable models.

    Ready to Transform Your AI Strategy?

    Join leading enterprises that are building vertical AI agents without the engineering overhead. Start for free today.