#ai-benchmarks

36 articles tagged ai-benchmarks.

Z.ai's GLM-5.2 lands with full benchmarks and MIT open weights: #2 on Code Arena Frontend, near Opus 4.8 on agentic coding, at GLM-5.1 pricing.
#glm-5-2 #zhipu-ai +6 more
2026-06-16
Epoch AI found errors in 42% of FrontierMath problems and shipped v2 on June 12, 2026. Scores jumped, rankings held — here is what that means for model choice.
#frontiermath #ai-benchmarks +4 more
2026-06-14
A panel of 8,128 users puts AI agent task completion at 75.3%, yet 54% still trust manual search more. Inside the per-agent variance and the 2026 trust paradox.
#AI agents #task completion +6 more
2026-06-12
Stanford's 2026 AI Index distilled into 20 essential numbers: $581.7B in AI investment, a closing US-China gap, jagged intelligence, and what it means for you.
#Stanford AI Index #AI statistics 2026 +6 more
2026-06-12
Claude Fable 5 and Mythos 5 as agentic coding models, read from the system card: the real coding benchmarks, the candid failure modes, and how to oversee them.
#claude-fable-5 #claude-mythos-5 +6 more
2026-06-09
Claude Fable 5 leads the benchmarks; GPT-5.5 costs half as much and owns Codex. We compare coding, knowledge work, long context, and cost to find the fit.
#claude-fable-5 #gpt-5-5 +6 more
2026-06-09
Anthropic shipped its strongest model as two products: Fable 5, generally available with safeguards, and restricted Mythos 5. Benchmarks, pricing, the catch.
#claude-fable-5 #claude-mythos-5 +6 more
2026-06-09
Opus 4.8 tops the Artificial Analysis index, but GPT-5.5 still leads Terminal-Bench. An evidence-graded roundup of the first 48 hours of independent evals.
#claude-opus-4-8 #ai-benchmarks +5 more
2026-05-30
Claude Opus 4.8 lands May 28 with stronger coding benchmarks, a major honesty gain, new effort controls, and dynamic workflows in Claude Code.
#claude-opus-4-8 #anthropic +6 more
2026-05-28
We compare Claude Opus 4.8 and GPT-5.5 on coding, agents, reasoning, and real cost — including where GPT-5.5 still wins and which model fits which job.
#claude-opus-4-8 #gpt-5-5 +6 more
2026-05-28
Gemini 3.5 Flash beats Claude Opus 4.8 on MCP-Atlas and Finance Agent at a third of the price — but a 61% hallucination rate complicates the routing call.
#claude-opus-4-8 #gemini-3-5-flash +6 more
2026-05-28
Alibaba's Qwen 3.7 Max ships with 1M context, $2.50/$7.50 pricing, and benchmarks topping Opus 4.6 on Terminal-Bench, SWE-Bench Pro, and MCP-Atlas.
#qwen-3-7-max #alibaba-qwen +7 more
2026-05-25
Agentic coding head-to-head: Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7. MCP Atlas, SWE-Bench Pro, Terminal-Bench, plus Antigravity 2.0 launch context.
#gemini-3-5-flash #gpt-5-5 +8 more
2026-05-19
Gemini 3.5 Flash launched today: 83.6% MCP Atlas, 1M context, new thinking_level API. Full benchmarks vs Opus 4.7 and GPT-5.5 with migration notes.
#gemini-3-5-flash #google-gemini +7 more
2026-05-19
Cross-modal benchmark scores — image understanding, video, OCR, ASR, code-with-vision — across GPT-5.5, Gemini 3, Claude 4.7, Qwen 3.5 Omni. 80+ data cells.
#multimodal-ai #vision-language-models +8 more
2026-04-24
Updated NIAH-2 results across 1M-context models — single-needle, multi-needle, and reasoning-over-context. Where models silently fail above 200K tokens.
#long-context #needle-in-haystack +8 more
2026-04-24
Head-to-head: GPT-5.5 and Claude Opus 4.7 on agentic coding, computer use, 1M context, pricing, and the right model for each production workload.
#gpt-5-5 #claude-opus-4-7 +8 more
2026-04-23
OpenAI's GPT-5.5 ships April 23, 2026 with 1M context, Thinking and Pro variants, 82.7% Terminal-Bench, and the same latency as GPT-5.4. Pricing inside.
#gpt-5-5 #gpt-5-5-pro +8 more
2026-04-23
We measured low/medium/high reasoning effort across 5 frontier models on math, code, and analysis. Quality lift, latency tax, and cost-per-correct-answer data.
#reasoning-effort #ai-benchmarks +8 more
2026-04-23
Cross-model hallucination rates on factual recall, citation accuracy, and code reference. 5,000 prompts tested across 5 frontier models with confidence bands.
#ai-hallucination #ai-benchmarks +8 more
2026-04-23
MCP tool-call success across 12 task types — search, file ops, data, calendar, email. Pass-rate, retry-rate, and cost-to-completion for 5 frontier AI models.
#tool-use #mcp +8 more
2026-04-23
Time-to-first-token and tokens-per-second across 30 model+provider pairings. P50/P95 numbers, regional spread, and how reasoning mode taxes cold latency budgets.
#ai-latency #ttft +8 more
2026-04-23
Why $/token is the wrong unit and $/successful-task is the right one. Formulas, worked examples across 6 task families, and a downloadable scoring template.
#ai-evaluation #cost-per-task +8 more
2026-04-23
Frontier model comparison: Qwen 3.6 Plus vs Claude Opus 4.6 vs GPT-5.4. Benchmarks, pricing, context windows, and capabilities for 1M+ token models.
#qwen-3-6-plus #claude-opus-4-6 +5 more
2026-04-02
Comparing omnimodal AI models: Qwen 3.5-Omni, Gemini 3.1 Pro, and GPT-5.4 across text, image, audio, and video tasks. Benchmarks and use case analysis.
#qwen-3-5-omni #gemini-3-1-pro +5 more
2026-03-30
The definitive collection of 150+ agentic AI statistics for 2026 covering market size, adoption rates, ROI metrics, security data, and enterprise benchmarks.
#agentic-ai #statistics +5 more
2026-03-13
Google's Gemini 3.1 Flash-Lite costs $0.25 per million tokens and outperforms GPT-5 Mini on key benchmarks. Complete pricing and performance comparison guide.
#gemini-3-1-flash-lite #google-ai +4 more
2026-03-09
GPT-5.4 ships three variants: Standard, Thinking, and Pro. Native computer use, 1M context, tool search, and 33% fewer factual errors. Complete guide.
#gpt-5-4 #openai +5 more
2026-03-06
Three-way frontier model comparison: GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro benchmarks, agentic AI capabilities, pricing, and which model wins.
#gpt-5-4 #claude-opus-4-6 +6 more
2026-03-05
OpenAI releases GPT-5.4 with native computer use, 1M context, and tool search reducing tokens by 47%. Complete benchmarks, pricing, and developer guide.
#gpt-5-4 #openai +5 more
2026-03-05
OpenAI releases GPT-5.3 Instant with 26.8% fewer hallucinations, 400K context, and anti-cringe tone overhaul. Complete benchmarks, pricing, and migration guide.
#gpt-5-3-instant #openai +4 more
2026-03-03
Google launches Gemini 3.1 Flash-Lite at $0.25 per million input tokens. 2.5x faster, tops 6 benchmarks. Complete pricing and performance comparison guide.
#gemini-flash-lite #google-ai +4 more
2026-03-03
Qwen 3.5 medium series: Flash, 35B-A3B, 122B-A10B, and 27B. Benchmarks vs GPT-5 mini and Claude Sonnet 4.5, pricing from $0.10/M tokens.
#qwen-3-5 #alibaba-ai +6 more
2026-02-25
MiniMax M2.5 scores 80.2% SWE-Bench Verified and costs 1/10th of competitors. Complete guide to features, benchmarks, pricing, API access, and model comparison.
#MiniMax M2.5 #AI coding models +5 more
2026-02-12
Compare GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, DeepSeek V3.2. Complete benchmark analysis with SWE-bench, pricing, and use cases.
#LLM Comparison #GPT-5.2 +5 more
2025-12-07
Moonshot AI's Kimi K2 Thinking achieves SOTA with 1T parameters, INT4 training, 200-300 tool calls. First open model competitive with GPT-5/Claude.
#Kimi K2 Thinking #Open Source AI +6 more
2025-11-07