TL;DR
The Cline team systematized AI agent evaluation using Harbor and Terminal Bench, lifting their score by 10 percentage points to outpace Claude Code through iterative prompt and configuration optimization.
Key Points
- Cline scored a 47% baseline on Terminal Bench's 89 real-world coding tasks and improved to 57% in one weekend
- Identified key failure patterns: timeout configuration, missing verification logic, mishandled command exit codes, and prematurely terminated long-running processes
- Harbor + Modal parallelization cut full evaluation runs across the 89 tasks from hours to 40-50 minutes
- Practical guide covers failure categorization, Pass@k noise reduction, and systematic hill climbing methodology for any AI coding agent
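The Pass@k noise reduction mentioned above is typically done with the standard unbiased Pass@k estimator (the source does not show Cline's exact implementation, so this is a sketch under that assumption): run each task n times, count c passes, and estimate the probability that at least one of k sampled runs passes, then average across tasks.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: n = runs per task, c = passing runs,
    k = attempt budget. Returns P(at least one of k sampled runs passes)."""
    if n - c < k:
        return 1.0  # too few failures for k samples to all fail
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results as (n, c) pairs, not real benchmark data.
results = [(5, 3), (5, 0), (5, 5)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
# With k=1 this reduces to the mean pass rate, c/n averaged over tasks.
```

Averaging multiple runs this way separates genuine improvements from run-to-run variance, which matters when a prompt tweak shifts the score by only a few points.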
Why It Matters
This establishes a reproducible framework for benchmarking and improving AI agents at scale. Developers building or evaluating coding agents (Claude Code, Cursor, OpenHands, Gemini CLI) can now systematically identify bottlenecks and iterate rapidly, turning evaluation into an optimization pipeline rather than a one-time measurement.
Source: cline.bot