Topic

#ai-benchmarks

36 articles tagged ai-benchmarks.

Cross-cutting reads on this topic

AI Development

GLM-5.2 Benchmarks: Open Weights vs Claude Opus 4.8

Z.ai's GLM-5.2 lands with full benchmarks and MIT open weights: #2 on Code Arena Frontend, near Opus 4.8 on agentic coding, at GLM-5.1 pricing.

#glm-5-2 #zhipu-ai +6 more

2026-06-16

AI Development

FrontierMath v2: When AI Benchmarks Get Error-Corrected

Epoch AI found errors in 42% of FrontierMath problems and shipped v2 on June 12, 2026. Scores jumped, rankings held — here is what that means for model choice.

#frontiermath #ai-benchmarks +4 more

2026-06-14

AI Development

AI Agent Task Completion in 2026: What 8,128 Users Reveal

A panel of 8,128 users puts AI agent task completion at 75.3%, yet 54% still trust manual search more. Inside the per-agent variance and the 2026 trust paradox.

#AI agents #task completion +6 more

2026-06-12

AI Development

Stanford AI Index 2026: The 20 Numbers That Matter

Stanford's 2026 AI Index distilled into 20 essential numbers: $581.7B in AI investment, a closing US-China gap, jagged intelligence, and what it means for you.

#Stanford AI Index #AI statistics 2026 +6 more

2026-06-12

AI Development

Claude Fable 5 & Mythos 5: Agentic Coding Deep Dive

Claude Fable 5 & Mythos 5 as an agentic coding model, read from the system card: the real coding benchmarks, the candid failure modes, and how to oversee it.

#claude-fable-5 #claude-mythos-5 +6 more

2026-06-09

AI Development

Claude Fable 5 vs GPT-5.5: Benchmarks & Cost Compared

Claude Fable 5 leads the benchmarks; GPT-5.5 costs half as much and owns Codex. We compare coding, knowledge work, long context, and cost to find the fit.

#claude-fable-5 #gpt-5-5 +6 more

2026-06-09

AI Development (Popular)

Claude Fable 5 & Mythos 5: The Frontier, Split in Two

Anthropic shipped its strongest model as two products: Fable 5, generally available with safeguards, and restricted Mythos 5. Benchmarks, pricing, the catch.

#claude-fable-5 #claude-mythos-5 +6 more

2026-06-09

AI Development

Claude Opus 4.8, 48 Hours In: The Early Eval Roundup

Opus 4.8 tops the Artificial Analysis index, but GPT-5.5 still leads Terminal-Bench. An evidence-graded roundup of the first 48 hours of independent evals.

#claude-opus-4-8 #ai-benchmarks +5 more

2026-05-30

AI Development

Claude Opus 4.8: Benchmarks, Effort & Dynamic Workflows

Claude Opus 4.8 lands May 28 with stronger coding benchmarks, a major honesty gain, new effort controls, and dynamic workflows in Claude Code.

#claude-opus-4-8 #anthropic +6 more

2026-05-28

AI Development

Claude Opus 4.8 vs GPT-5.5: Benchmarks & Cost Compared

We compare Claude Opus 4.8 and GPT-5.5 on coding, agents, reasoning, and real cost — including where GPT-5.5 still wins and which model fits which job.

#claude-opus-4-8 #gpt-5-5 +6 more

2026-05-28

AI Development

Claude Opus 4.8 vs Gemini 3.5 Flash: AI Agent Routing

Gemini 3.5 Flash beats Claude Opus 4.8 on MCP-Atlas and Finance Agent at a third of the price — but a 61% hallucination rate complicates the routing call.

#claude-opus-4-8 #gemini-3-5-flash +6 more

2026-05-28

AI Development

Qwen 3.7 Max: Alibaba's New Flagship AI Model 2026

Alibaba's Qwen 3.7 Max ships with 1M context, $2.50/$7.50 pricing, and benchmarks topping Opus 4.6 on Terminal-Bench, SWE-Bench Pro, and MCP-Atlas.

#qwen-3-7-max #alibaba-qwen +7 more

2026-05-25

AI Development

Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7: Agentic Coding

Agentic coding head-to-head: Gemini 3.5 Flash vs GPT-5.5 vs Opus 4.7. MCP Atlas, SWE-Bench Pro, Terminal-Bench, plus Antigravity 2.0 launch context.

#gemini-3-5-flash #gpt-5-5 +8 more

2026-05-19

AI Development

Gemini 3.5 Flash: Benchmarks, Thinking & API Guide 2026

Gemini 3.5 Flash launched today: 83.6% MCP Atlas, 1M context, new thinking_level API. Full benchmarks vs Opus 4.7 and GPT-5.5 with migration notes.

#gemini-3-5-flash #google-gemini +7 more

2026-05-19

AI Development

Multimodal AI Benchmarks 2026: Vision, Audio, Code

Cross-modal benchmark scores — image understanding, video, OCR, ASR, code-with-vision — across GPT-5.5, Gemini 3, Claude 4.7, Qwen 3.5 Omni. 80+ data cells.

#multimodal-ai #vision-language-models +8 more

2026-04-24

AI Development

Long-Context Retrieval 2026: Needle-in-Haystack Test

Updated NIAH-2 results across 1M-context models — single-needle, multi-needle, and reasoning-over-context. Where models silently fail above 200K tokens.

#long-context #needle-in-haystack +8 more

2026-04-24

AI Development

GPT-5.5 vs Claude Opus 4.7: Benchmarks & Pricing

Head-to-head: GPT-5.5 and Claude Opus 4.7 on agentic coding, computer use, 1M context, pricing, and the right model for each production workload.

#gpt-5-5 #claude-opus-4-7 +8 more

2026-04-23

AI Development

GPT-5.5 Complete Guide: Thinking, Pro & 1M Context

OpenAI's GPT-5.5 ships April 23, 2026 with 1M context, Thinking and Pro variants, 82.7% Terminal-Bench, and same latency as GPT-5.4. Pricing inside.

#gpt-5-5 #gpt-5-5-pro +8 more

2026-04-23

AI Development

Reasoning Effort: Cost vs Quality Benchmarks 2026

We measured low/medium/high reasoning effort across 5 frontier models on math, code, and analysis. Quality lift, latency tax, and cost-per-correct-answer data.

#reasoning-effort #ai-benchmarks +8 more

2026-04-23

AI Development

AI Hallucination Rate Benchmarks 2026: 5-Model Study

Cross-model hallucination rates on factual recall, citation accuracy, and code reference. 5,000 prompts tested across 5 frontier models with confidence bands.

#ai-hallucination #ai-benchmarks +8 more

2026-04-23

AI Development

Tool-Use Success Rates: 5 Frontier Models Tested

MCP tool-call success across 12 task types — search, file ops, data, calendar, email. Pass-rate, retry-rate, and cost-to-completion for 5 frontier AI models.

#tool-use #mcp +8 more

2026-04-23

AI Development

AI Model Latency Benchmarks 2026: TTFT & TPS Data

Time-to-first-token and tokens-per-second across 30 model+provider pairings. P50/P95 numbers, regional spread, and how reasoning mode taxes cold latency budgets.

#ai-latency #ttft +8 more

2026-04-23

AI Development

Cost-Per-Successful-Task: A New AI Evaluation Metric

Why $/token is the wrong unit and $/successful-task is the right one. Formulas, worked examples across 6 task families, and a downloadable scoring template.

#ai-evaluation #cost-per-task +8 more

2026-04-23

AI Development

Qwen 3.6 Plus vs Claude Opus 4.6 vs GPT-5.4 Compared

Frontier model comparison: Qwen 3.6 Plus vs Claude Opus 4.6 vs GPT-5.4. Benchmarks, pricing, context windows, and capabilities for 1M+ token models.

#qwen-3-6-plus #claude-opus-4-6 +5 more

2026-04-02

AI Development

Qwen 3.5-Omni vs Gemini 3.1 vs GPT-5.4 Comparison

Comparing omnimodal AI models: Qwen 3.5-Omni, Gemini 3.1 Pro, and GPT-5.4 across text, image, audio, and video tasks. Benchmarks and use case analysis.

#qwen-3-5-omni #gemini-3-1-pro +5 more

2026-03-30

AI Development

Agentic AI Statistics 2026: 150+ Data Points Collection

The definitive collection of 150+ agentic AI statistics for 2026 covering market size, adoption rates, ROI metrics, security data, and enterprise benchmarks.

#agentic-ai #statistics +5 more

2026-03-13

AI Development

Gemini 3.1 Flash-Lite: Cheapest AI That Beats GPT-5 Mini

Google's Gemini 3.1 Flash-Lite costs $0.25 per million input tokens and outperforms GPT-5 Mini on key benchmarks. Complete pricing and performance comparison guide.

#gemini-3-1-flash-lite #google-ai +4 more

2026-03-09

AI Development

GPT-5.4 Complete Guide: Standard, Thinking, and Pro

GPT-5.4 ships three variants: Standard, Thinking, and Pro. Native computer use, 1M context, tool search, and 33% fewer factual errors. Complete guide.

#gpt-5-4 #openai +5 more

2026-03-06

AI Development

GPT-5.4 vs Opus 4.6 vs Gemini 3.1 Pro: Best AI Model?

Three-way frontier model comparison: GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro benchmarks, agentic AI capabilities, pricing, and which model wins.

#gpt-5-4 #claude-opus-4-6 +6 more

2026-03-05

AI Development

GPT-5.4: Computer Use, Tool Search, Benchmarks, Pricing

OpenAI releases GPT-5.4 with native computer use, 1M context, and tool search reducing tokens by 47%. Complete benchmarks, pricing, and developer guide.

#gpt-5-4 #openai +5 more

2026-03-05

AI Development

GPT-5.3 Instant: Benchmarks, Pricing, Migration

OpenAI releases GPT-5.3 Instant with 26.8% fewer hallucinations, 400K context, and anti-cringe tone overhaul. Complete benchmarks, pricing, and migration guide.

#gpt-5-3-instant #openai +4 more

2026-03-03

AI Development

Gemini 3.1 Flash-Lite: Cheapest AI Beats GPT-5 Mini

Google launches Gemini 3.1 Flash-Lite at $0.25 per million input tokens. 2.5x faster, tops 6 benchmarks. Complete pricing and performance comparison guide.

#gemini-flash-lite #google-ai +4 more

2026-03-03

AI Development

Qwen 3.5 Medium Models: Benchmarks, Pricing, and Guide

Qwen 3.5 medium series: Flash, 35B-A3B, 122B-A10B, and 27B. Benchmarks vs GPT-5 mini and Claude Sonnet 4.5, pricing from $0.10/M tokens.

#qwen-3-5 #alibaba-ai +6 more

2026-02-25

AI Development

MiniMax M2.5: Coding Benchmarks, Pricing, and Guide

MiniMax M2.5 scores 80.2% SWE-Bench Verified and costs 1/10th of competitors. Complete guide to features, benchmarks, pricing, API access, and model comparison.

#MiniMax M2.5 #AI coding models +5 more

2026-02-12

AI Development

LLM Comparison Guide: December 2025 Rankings

Compare GPT-5.2, Gemini 3 Pro, Claude Opus 4.5, DeepSeek V3.2. Complete benchmark analysis with SWE-bench, pricing, and use cases.

#LLM Comparison #GPT-5.2 +5 more

2025-12-07

AI Development

Kimi K2 Thinking: 1T Open-Source Reasoning AI Model

Moonshot AI's Kimi K2 Thinking achieves SOTA with 1T parameters, INT4 training, 200-300 tool calls. First open model competitive with GPT-5/Claude.

#Kimi K2 Thinking #Open Source AI +6 more

2025-11-07
