The multi-dimensional agent benchmark. Same SWE-bench tasks, richer metrics — cost efficiency, tool usage, and Pareto-optimal rankings.
| # | Model | Provider | % Resolved | Avg $/Task | $/Fix | Tool Calls | Avg Latency | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Codex o3 | openai | 77.0% | $0.82 | $1.06 | 11.5 | 62s | 2026-04-04 |
| 2 | Claude Opus 4.6 | anthropic | 72.7% | $0.55 | $0.76 | 8.2 | 45s | 2026-04-08 |
| 3 | Gemini 2.5 Pro | google | 71.0% | $0.36 | $0.51 | 6.4 | 38s | 2026-04-05 |
| 4 | GPT-5.2 | openai | 68.3% | $0.45 | $0.66 | 9.1 | 42s | 2026-04-06 |
| 5 | Claude Sonnet 4.5 | anthropic | 65.3% | $0.28 | $0.43 | 7.1 | 35s | 2026-04-07 |
| 6 | DeepSeek V3 | deepseek | 56.0% | $0.12 | $0.21 | 10.3 | 52s | 2026-04-03 |
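The $/Fix column is derived from the other two cost columns: average cost per task divided by the resolution rate, i.e. the expected spend per successfully resolved task. A minimal sketch (the function name is illustrative, not part of the benchmark's tooling):

```python
# $/Fix = average cost per task / resolution rate.
def cost_per_fix(avg_cost: float, resolved_rate: float) -> float:
    """Expected dollars spent per resolved task."""
    return avg_cost / resolved_rate

# Spot-check against the table: Codex o3 at $0.82/task and 77.0% resolved.
print(round(cost_per_fix(0.82, 0.77), 2))   # → 1.06
# Claude Opus 4.6 at $0.55/task and 72.7% resolved.
print(round(cost_per_fix(0.55, 0.727), 2))  # → 0.76
```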
Models on the Pareto frontier achieve the highest resolution rate for their cost: no other model resolves more while spending less. With cost on the horizontal axis and % resolved on the vertical, closer to the top-left is better.
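The frontier can be computed directly from the table: a model is Pareto-dominated if some other model is at least as cheap and resolves at least as many tasks (and differs on at least one axis). A short sketch over the rows above:

```python
# (avg $ per task, % resolved) taken from the leaderboard table.
models = {
    "Codex o3": (0.82, 77.0),
    "Claude Opus 4.6": (0.55, 72.7),
    "Gemini 2.5 Pro": (0.36, 71.0),
    "GPT-5.2": (0.45, 68.3),
    "Claude Sonnet 4.5": (0.28, 65.3),
    "DeepSeek V3": (0.12, 56.0),
}

def pareto_frontier(entries: dict[str, tuple[float, float]]) -> list[str]:
    """Return models not dominated on (cost ↓, resolution ↑)."""
    frontier = []
    for name, (cost, rate) in entries.items():
        dominated = any(
            c <= cost and r >= rate and (c, r) != (cost, rate)
            for c, r in entries.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
```

On this data, every model except GPT-5.2 sits on the frontier: GPT-5.2 is dominated by Gemini 2.5 Pro, which is both cheaper ($0.36 vs $0.45) and more accurate (71.0% vs 68.3%).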