AgentV Leaderboard — SWE-bench Lite

A multi-dimensional agent benchmark: the same SWE-bench Lite tasks, scored with richer metrics covering cost efficiency, tool usage, and Pareto-optimal rankings.

| # | Model             | Provider  | % Resolved | Avg $/Task | $/Fix | Tools | Latency | Date       |
|---|-------------------|-----------|------------|------------|-------|-------|---------|------------|
| 1 | Codex o3          | openai    | 77.0%      | $0.82      | $1.06 | 11.5  | 62s     | 2026-04-04 |
| 2 | Claude Opus 4.6   | anthropic | 72.7%      | $0.55      | $0.76 | 8.2   | 45s     | 2026-04-08 |
| 3 | Gemini 2.5 Pro    | google    | 71.0%      | $0.36      | $0.51 | 6.4   | 38s     | 2026-04-05 |
| 4 | GPT-5.2           | openai    | 68.3%      | $0.45      | $0.66 | 9.1   | 42s     | 2026-04-06 |
| 5 | Claude Sonnet 4.5 | anthropic | 65.3%      | $0.28      | $0.43 | 7.1   | 35s     | 2026-04-07 |
| 6 | DeepSeek V3       | deepseek  | 56.0%      | $0.12      | $0.21 | 10.3  | 52s     | 2026-04-03 |
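The $/Fix column is consistent with dividing the average cost per task by the resolution rate (for example, $0.82 / 0.77 ≈ $1.06). A minimal TypeScript sketch of that derivation follows; the interface and field names are illustrative, not the AgentV schema.

```ts
// Sketch: derive cost-per-fix from average cost per task and resolution rate.
// Field names are hypothetical, chosen only to mirror the table columns above.
interface LeaderboardEntry {
  model: string;
  resolvedRate: number;   // fraction of tasks resolved, e.g. 0.77
  avgCostPerTask: number; // average USD spent per task, e.g. 0.82
}

function costPerFix(entry: LeaderboardEntry): number {
  // Total spend / number of fixes == average cost per task / resolution rate.
  return entry.avgCostPerTask / entry.resolvedRate;
}

const codex: LeaderboardEntry = { model: "Codex o3", resolvedRate: 0.77, avgCostPerTask: 0.82 };
console.log(costPerFix(codex).toFixed(2)); // "1.06"
```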

Pareto Frontier — Score vs Cost

Models on the frontier line achieve the best resolution rate for their cost. Closer to top-left is better.
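As a rough sketch of how such a frontier can be computed from the table above (not the AgentV implementation), a model sits on the frontier if no other model is at least as cheap while resolving strictly more, or strictly cheaper while resolving at least as much:

```ts
// Sketch: Pareto-frontier check over (cost, resolution rate) points.
// Entries and field names are illustrative, taken from the leaderboard table.
interface Point { model: string; avgCostPerTask: number; resolvedRate: number; }

function paretoFrontier(points: Point[]): Point[] {
  // Keep a point only if no other point dominates it on both axes.
  return points.filter(p =>
    !points.some(q =>
      (q.avgCostPerTask <= p.avgCostPerTask && q.resolvedRate > p.resolvedRate) ||
      (q.avgCostPerTask < p.avgCostPerTask && q.resolvedRate >= p.resolvedRate)
    )
  );
}
```

With the numbers above, every model except GPT-5.2 lands on the frontier, since Gemini 2.5 Pro is both cheaper and resolves more.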

Run it yourself

$ git clone https://github.com/EntityProcess/agentv
$ cd agentv/benchmarks/swe-bench-lite
$ bun run setup.ts
$ agentv eval ./evals/ --target claude
# Then submit your results via PR
