TL;DR
A researcher introduces BenchBench, a meta-benchmark that tests how well LLMs create useful benchmarks. GPT 5.2 was the only model to succeed, revealing a gap between solver and creator capabilities.
Key Points
- GPT 5.2 was the only model to create practically solvable benchmarks that still challenged frontier models; GPT 5.5 and Opus 4.6 produced tasks that were either too easy or unsolvable
- Gemini 3.1 Pro showed the highest creativity in benchmark design (spatial traversal, corruption recovery tasks), but its tasks suffered from brittleness and puzzle-like constraints
- BenchBench reveals a capability divergence: the top solver models (GPT 5.5, Opus 4.6) are poor benchmark creators, while Gemini 3.5 Flash is a better creator despite weaker solver performance
- All models converged on bureaucratic-forensics-style tasks (reimbursement validation, policy compliance), suggesting alignment with messy, real-world data scenarios
Why It Matters
This meta-benchmark exposes blind spots in frontier models' self-knowledge and creative capabilities that traditional benchmarks miss. For AI researchers, it adds a new evaluation dimension: whether models can identify their own capability gaps and design appropriate tests. That ability is critical for autonomous AI improvement and for RL training loop design.
Source: www.strangeloopcanon.com