AI Time Horizons Double Every Seven Months, METR Study Shows

TL;DR

METR research finds that the length of software tasks LLMs can complete is doubling roughly every seven months, measured by how long the same tasks take human experts. o3 reaches 50% success on tasks that take an expert about two hours.

Key Points

50%-success time horizon (the expert task length a model completes half the time): GPT-2, 2 seconds; o3, ~2 hours; Opus 4.6, ~12 hours
Trend suggests frontier AIs will handle month-long expert tasks by 2027-2031
Study measured 170 real software tasks (2 seconds to 8 hours) against 12 LLMs released 2019-2025
Critical gap remains: half of the PRs that pass SWE-bench tests are rejected by human maintainers for code quality and convention violations
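
The extrapolation behind the month-long-task estimate can be checked with a few lines of arithmetic. The sketch below assumes a clean exponential trend with the seven-month doubling time quoted above, and it assumes a "month-long expert task" means about 167 working hours; that figure and the function names are illustrative choices, not taken from the summary.

```python
import math

DOUBLING_MONTHS = 7.0     # doubling time quoted in the summary
WORK_MONTH_HOURS = 167.0  # assumption: one working month of expert time


def months_until(target_hours, current_hours, doubling_months=DOUBLING_MONTHS):
    """Months for a time horizon to grow from current_hours to target_hours."""
    if current_hours <= 0 or target_hours <= 0:
        raise ValueError("horizons must be positive")
    doublings = math.log2(target_hours / current_hours)
    # A target already reached needs no further time.
    return max(0.0, doublings * doubling_months)


def horizon_after(months, current_hours, doubling_months=DOUBLING_MONTHS):
    """Projected time horizon (in hours) after the given number of months."""
    return current_hours * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Starting horizons are the approximate values listed in Key Points.
    for name, hours in [("o3 (~2 h)", 2.0), ("Opus 4.6 (~12 h)", 12.0)]:
        m = months_until(WORK_MONTH_HOURS, hours)
        print(f"{name}: about {m:.0f} months ({m / 12:.1f} years) to a month-long task")
```

The result is sensitive to both assumptions: a different definition of a working month, or a doubling time that is slightly shorter or longer than seven months, moves the estimate by a year or more, which is consistent with the wide 2027-2031 range.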

Why It Matters

The study quantifies how fast AI capability is scaling on realistic software engineering work, and it exposes the gap between benchmark success and production code quality. For engineers, it signals both opportunity and risk: AI agents are accelerating toward autonomy on complex tasks, yet automated tests miss the human judgment about maintainability, conventions, and correctness that real-world deployment depends on.

Source: emptysqua.re