Topic

#swe-bench

15 articles tagged swe-bench.

Cross-cutting reads on this topic

AI Development

SWE-bench in 2026: Benchmarks vs Scaffolding Reality

Claude Fable 5 tops SWE-bench Verified at 95%, but 99 of 100 results are self-reported and the scaffold gap can exceed 28 points. How to read the numbers.

#swe-bench #ai-coding-benchmarks (+6 more)

2026-06-16

AI Development

Claude Opus 4.8 vs GPT-5.5: Benchmarks & Cost Compared

We compare Claude Opus 4.8 and GPT-5.5 on coding, agents, reasoning, and real cost — including where GPT-5.5 still wins and which model fits which job.

#claude-opus-4-8 #gpt-5-5 (+6 more)

2026-05-28

AI Development

LLM Benchmark Methodology 2026: Reading Leaderboards

How to read AI model leaderboards without being fooled by benchmark contamination, eval gaming, and cherry-picked MMLU, GPQA, and SWE-bench scores.

#llm-benchmarks #ai-evaluation (+6 more)

2026-05-27

AI Development

AI Coding Agents: Claude Code vs Cursor vs Codex 2026

Five-way comparison — Claude Code, Cursor, Codex Desktop, Replit Agent 3, Devin. Pricing, agent autonomy, MCP, eval scores, and reference workloads.

#ai-coding-agents #claude-code (+8 more)

2026-04-28

AI Development

GPT-5.5 Pro Coding Workflow Patterns: Developer Guide

Six production-tested GPT-5.5 Pro coding workflows — refactor, review, debug, test-gen, migration, codebase Q&A — with cost, latency, and success-rate data.

#gpt-5-5-pro #openai (+8 more)

2026-04-23

AI Development

Claude Opus 4.7: Anthropic's New Frontier Model Guide

Claude Opus 4.7 scores 64.3% on SWE-bench Pro, with 2576px vision, xhigh effort, and the same pricing as Opus 4.6. Full benchmark and migration guide.

#Claude #Anthropic (+4 more)

2026-04-16

AI Development

Cursor Composer 2: Coding Model That Beats Opus 4.6

Cursor Composer 2 beats Opus 4.6 on coding benchmarks at 90% lower cost. Built on Kimi K2.5. GPT-5.4 still leads. Full benchmark comparison guide.

#cursor-composer-2 #cursor-ide (+5 more)

2026-03-19

AI Development

GPT-5.4 Mini: Free-Tier AI With 54% SWE-Bench Pro Score

GPT-5.4 Mini launches for free-tier ChatGPT users with a 54.38% SWE-Bench Pro score, only 3 points behind the full GPT-5.4. A guide to the 2x faster model.

#gpt-5-4-mini #openai (+5 more)

2026-03-17

AI Development

Nemotron 3 Super 120B: NVIDIA Open-Source Coding Model

NVIDIA releases Nemotron 3 Super 120B with 60.47% SWE-Bench Verified and 2.2x throughput. Open-source coding model for enterprise AI agent deployments.

#nemotron-3 #nvidia (+4 more)

2026-03-11

AI Development

Gemini 3.1 Pro vs Opus 4.6 vs Codex: Agentic Coding

Gemini 3.1 Pro vs Claude Opus 4.6 vs GPT-5.3-Codex for agentic coding. SWE-Bench, Terminal-Bench, LiveCodeBench, and pricing comparison with recommendations.

#Gemini 3.1 Pro, #Claude Opus 4.6 (+6 more)

2026-02-19

AI Development

MiniMax M2.5: Coding Benchmarks, Pricing, and Guide

MiniMax M2.5 scores 80.2% SWE-Bench Verified and costs 1/10th of competitors. Complete guide to features, benchmarks, pricing, API access, and model comparison.

#MiniMax M2.5, #AI coding models (+5 more)

2026-02-12

AI Development

MiMo-V2-Flash: Xiaomi's 309B MoE Open-Weight Model Guide

Xiaomi's MiMo-V2-Flash: a 309B open-weight MoE running at 150 tok/s with 73.4% SWE-Bench. Deploy the fastest open-source coding model with this guide.

#MiMo-V2-Flash, #Xiaomi AI (+5 more)

2025-12-15

AI Development

Devstral 2 & Mistral Vibe CLI: Complete Coding Guide

Master Devstral 2 (72.2% SWE-bench) and Mistral Vibe CLI. Open-weight coding models that run locally. Complete autonomous agent guide.