AI Benchmarks for Code

21d

Kimi K2.7-Code cuts thinking tokens 30% — but practitioners say the benchmarks don't check out

Kimi K2.7-Code claims 30% fewer thinking tokens and a drop-in API swap path, but independent benchmarks show kernel regressions and no DeepSWE submission.

4h on MSN

Benchmark raises its Datadog target as the AI super cycle builds

Benchmark raises its Datadog target, marking the platform's boldest Wall Street bet yet.

1d on MSN

Meta's upcoming 'Watermelon' AI model matches OpenAI's GPT-5.5 on key benchmarks, Alexandr Wang reportedly tells employees

Meta Platforms Inc.'s (META) forthcoming AI model, Watermelon, has reportedly reached the same performance level as OpenAI’s ...

22d

Xiaomi's new open source, agentic AI coding harness MiMo Code beats Claude Code at ultra-long, 200+ step tasks

The persistent memory system addresses a real and widely felt pain point in agentic development workflows — one that competitors are also racing to solve.

17h

Meta to release new AI model with advanced coding capabilities ‘soon’

Meta Platforms Inc. is gearing up to release a new version of its flagship Muse Spark artificial intelligence model. Alexandr ...

SD Times

Beyond Benchmarks: Measuring the True Cost of AI-Generated Code

Value stream management involves people in the organization to examine workflows and other processes to ensure they are deriving the maximum value from their efforts while eliminating waste — of ...

7d on MSN

Top AI models might be confident—doesn’t mean they’re right

“Mostly right is the wrong bar,” Pearl CEO Andy Kurtzig says, as research tests top AI models against professional judgment.

InfoWorld

Why benchmarks are key to AI progress

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high. Benchmarks are often reduced to leaderboard ...

techtimes

Open-Weight AI Enters GitHub Copilot: Kimi K2.7 Code Costs Less, Audits Differently

GitHub moved the AI coding landscape on Wednesday when it made Kimi K2.7 Code — a Beijing-built, open-weight model from Moonshot AI — generally available in the GitHub Copilot model picker, marking ...

1mon

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark

Microsoft's new vulnerability-scanning system, codenamed MDASH, scored 88.45% on the CyberGym benchmark, surpassing single-model systems from Anthropic and OpenAI by using more than 100 specialized AI ...

Forbes

AI Models Still Struggle With Reasoning — And Here’s Why

What looks like intelligence in AI models may just be memorization. A closer look at benchmarks ...

24d

AI Coding Agents Write 180% More Code But Ship Only 30% More Software

AI coding agents boost code output by 180% but shipping rises only 30%, MIT finds. Why private data access beats benchmark scores as the real AI investment moat.
