The multi-dimensional agent benchmark. Same SWE-bench tasks, richer metrics — cost efficiency, tool usage, and Pareto-optimal rankings.
| # | Model | Provider | % Resolved | Avg $/Task | $/Fix | Tool Calls | Avg Latency | Date |
|---|---|---|---|---|---|---|---|---|
| 1 | Codex o3 | openai | 77.0% | $0.82 | $1.06 | 11.5 | 62s | 2026-04-04 |
| 2 | Claude Opus 4.6 | anthropic | 72.7% | $0.55 | $0.76 | 8.2 | 45s | 2026-04-08 |
| 3 | Gemini 2.5 Pro | google | 71.0% | $0.36 | $0.51 | 6.4 | 38s | 2026-04-05 |
| 4 | GPT-5.2 | openai | 68.3% | $0.45 | $0.66 | 9.1 | 42s | 2026-04-06 |
| 5 | Claude Sonnet 4.5 | anthropic | 65.3% | $0.28 | $0.43 | 7.1 | 35s | 2026-04-07 |
| 6 | DeepSeek V3 | deepseek | 56.0% | $0.12 | $0.21 | 10.3 | 52s | 2026-04-03 |
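The $/Fix column is derived from the other two cost columns: average cost per task divided by the resolution rate, i.e. the expected spend per successfully resolved task. A minimal sketch (the function name is illustrative, not part of the benchmark's tooling):

```python
# $/Fix = average cost per task / resolution rate.
def cost_per_fix(avg_cost: float, resolved_rate: float) -> float:
    """Expected dollars spent per resolved task."""
    return avg_cost / resolved_rate

# Spot-check against the table: Codex o3 at $0.82/task and 77.0% resolved.
print(round(cost_per_fix(0.82, 0.77), 2))   # → 1.06
# Claude Opus 4.6 at $0.55/task and 72.7% resolved.
print(round(cost_per_fix(0.55, 0.727), 2))  # → 0.76
```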
Models on the Pareto frontier achieve the highest resolution rate for their cost: no other model resolves more while spending less. With cost on the horizontal axis and % resolved on the vertical, closer to the top-left is better.
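The frontier can be computed directly from the table: a model is Pareto-dominated if some other model is at least as cheap and resolves at least as many tasks (and differs on at least one axis). A short sketch over the rows above:

```python
# (avg $ per task, % resolved) taken from the leaderboard table.
models = {
    "Codex o3": (0.82, 77.0),
    "Claude Opus 4.6": (0.55, 72.7),
    "Gemini 2.5 Pro": (0.36, 71.0),
    "GPT-5.2": (0.45, 68.3),
    "Claude Sonnet 4.5": (0.28, 65.3),
    "DeepSeek V3": (0.12, 56.0),
}

def pareto_frontier(entries: dict[str, tuple[float, float]]) -> list[str]:
    """Return models not dominated on (cost ↓, resolution ↑)."""
    frontier = []
    for name, (cost, rate) in entries.items():
        dominated = any(
            c <= cost and r >= rate and (c, r) != (cost, rate)
            for c, r in entries.values()
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
```

On this data, every model except GPT-5.2 sits on the frontier: GPT-5.2 is dominated by Gemini 2.5 Pro, which is both cheaper ($0.36 vs $0.45) and more accurate (71.0% vs 68.3%).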