Large Language Models Benchmarks

Z.ai’s open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost

It allows engineering teams to host frontier-level AI on their own sovereign infrastructure, entirely eliminating vendor lock ...

AI has passed the test but not the exam: Why ‘Humanity’s Last Exam’ matters

There is a temptation, when AI systems begin to outperform human baselines on established tests, to interpret this as a sign ...

Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again

VibeThinker-3B, a 3-billion-parameter AI model, is challenging OpenAI, Google and DeepSeek on math and coding benchmarks while reigniting ...

Geeky Gadgets

AI Benchmarks Are Broken: The Leaderboard Illusion

What if the tools we trust to measure progress are actually holding us back? In the rapidly evolving world of large language models (LLMs), AI benchmarks and leaderboards have become the gold standard ...

ascopubs.org

RadOncRAG: A Novel Retrieval-Augmented Generation Framework Improves Large Language Model Benchmark Performance in Radiation Oncology

Large language models (LLMs) show promise in assisting knowledge-intensive fields such as oncology, where up-to-date information and multidisciplinary expertise are critical. Traditional LLMs risk ...

These LLMs are the best at resisting Russian propaganda

Google unveiled TurboQuant, a method that cuts the memory bottleneck slowing large AI models

How to Build Custom LLM Benchmarks for Your AI Applications

Leading AI models ace many vaccine questions but falter on clinical rules