Topic

#swe-bench

15 articles tagged swe-bench.

Cross-cutting reads on this topic

Claude Fable 5 tops SWE-bench Verified at 95%, but 99 of 100 results are self-reported and the scaffold gap can exceed 28 points. How to read the numbers.
#swe-bench #ai-coding-benchmarks +6 more
2026-06-16
We compare Claude Opus 4.8 and GPT-5.5 on coding, agents, reasoning, and real cost — including where GPT-5.5 still wins and which model fits which job.
#claude-opus-4-8 #gpt-5-5 +6 more
2026-05-28
How to read AI model leaderboards without being fooled by benchmark contamination, eval gaming, and cherry-picked MMLU, GPQA, and SWE-bench scores.
#llm-benchmarks #ai-evaluation +6 more
2026-05-27
Five-way comparison — Claude Code, Cursor, Codex Desktop, Replit Agent 3, Devin. Pricing, agent autonomy, MCP, eval scores, and reference workloads.
#ai-coding-agents #claude-code +8 more
2026-04-28
Six production-tested GPT-5.5 Pro coding workflows — refactor, review, debug, test-gen, migration, codebase Q&A — with cost, latency, and success-rate data.
#gpt-5-5-pro #openai +8 more
2026-04-23
Claude Opus 4.7 scores 64.3% on SWE-bench Pro with 2576px vision, xhigh effort, and the same pricing as Opus 4.6. Full benchmark and migration guide.
#Claude #Anthropic +4 more
2026-04-16
Cursor Composer 2 beats Opus 4.6 on coding benchmarks at 90% lower cost. Built on Kimi K2.5. GPT-5.4 still leads. Full benchmark comparison guide.
#cursor-composer-2 #cursor-ide +5 more
2026-03-19
GPT-5.4 Mini launches for free-tier ChatGPT users with 54.38% on SWE-Bench Pro, only 3 points behind the full GPT-5.4, and 2x faster. Full guide.
#gpt-5-4-mini #openai +5 more
2026-03-17
NVIDIA releases Nemotron 3 Super 120B with 60.47% SWE-Bench Verified and 2.2x throughput. Open-source coding model for enterprise AI agent deployments.
#nemotron-3 #nvidia +4 more
2026-03-11
Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.3-Codex for agentic coding. SWE-Bench, Terminal-Bench, LiveCodeBench, and pricing comparison with recommendations.
#Gemini 3.1 Pro #Claude Opus 4.6 +6 more
2026-02-19
MiniMax M2.5 scores 80.2% SWE-Bench Verified and costs 1/10th of competitors. Complete guide to features, benchmarks, pricing, API access, and model comparison.
#MiniMax M2.5 #AI coding models +5 more
2026-02-12
Xiaomi's MiMo-V2-Flash: 309B open-weight MoE running 150 tok/s with 73.4% SWE-Bench. Deploy the fastest open-source coding model with this guide.
#MiMo-V2-Flash #Xiaomi AI +5 more
2025-12-15
Master Devstral 2 (72.2% SWE-bench) and Mistral Vibe CLI. Open-weight coding models that run locally. Complete autonomous agent guide.
#Devstral 2 #Mistral AI +5 more
2025-12-10
Master Claude Opus 4.5: 80.9% SWE-bench, Memory Tool, self-improving agents. Complete guide with pricing and API integration.
#Claude Opus 4.5 #Anthropic +5 more
2025-11-24
MiniMax M2 achieves 69.4% on SWE-bench at 8% of Claude's cost. Complete guide to China's open-source AI model for agents, coding & multimodal apps.
#AI Development #MiniMax +6 more
2025-10-28