Measuring AI agent autonomy in practice

“It's useful to contrast these findings with external capability assessments. One of the most widely cited capability assessments is METR's 'Measuring AI Ability to Complete Long Tasks,' which estimates that Claude Opus 4.5 can complete, with a 50% success rate, tasks that would take a human nearly 5 hours. The 99.9th percentile turn duration in Claude Code, in contrast, is ~42 minutes, and the median is much shorter. However, the two metrics are not directly comparable. The METR evaluation captures what a model is capable of in an idealized setting with no human interaction and no real-world consequences. Our measurements capture what happens in practice, where Claude pauses to ask for feedback and users interrupt. And METR's five-hour figure measures task difficulty—how long the task would take a human—not how long the model actually runs.”
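To make the percentile figures concrete, here is a minimal sketch of how statistics like a median or 99.9th percentile turn duration could be computed from a log of durations. The `percentile` helper and the sample data are illustrative assumptions, not Anthropic's actual pipeline or data.

```python
def percentile(values, p):
    """Percentile with linear interpolation (0 <= p <= 100)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("empty input")
    k = (len(xs) - 1) * p / 100          # fractional rank
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    frac = k - lo
    return xs[lo] + (xs[hi] - xs[lo]) * frac

# Turn durations in minutes -- made-up illustrative data.
durations = [0.5, 1.2, 2.0, 3.5, 5.0, 8.0, 12.0, 20.0, 42.0]
print(f"median: {percentile(durations, 50):.1f} min")
print(f"p99.9:  {percentile(durations, 99.9):.1f} min")
```

High percentiles like p99.9 are dominated by the few longest turns, which is why a ~42-minute tail can coexist with a much shorter median.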