AI Benchmarks for Code

KushoAI Launches APIEval-20, the First Open Benchmark for AI API Test Generation

No existing benchmark measured whether AI agents can find real API bugs from a schema and payload alone; 100+ downloads in first week by developers and contributors; freely available on ...

Qodo raises $70M for code verification as AI coding scales

As AI floods software development with code, Qodo is betting the real challenge is making sure it actually works.

Digital Trends

If you code Android apps with AI, Google’s new benchmark makes it easier to pick the right model

For Android app developers relying on AI to code, picking the right model can be tricky. Not all models are built the same, and many are not specifically trained for Android development workflows. To ...

Hosted on MSN

AI benchmark numbers are meaningless — here's what to look for instead

Every time a new AI model launches, the cacophony of AI benchmarking sites whirs into life and bombards us with colorful charts, imperceptible and marginal improvements to uncontextualized numbers ...

Decrypt

Is AGI Here? Not Even Close, New AI Benchmark Suggests

ARC-AGI-3 dropped the same week Jensen Huang declared AGI achieved. Gemini scored 0.37%. GPT-5.4 got 0.26%. Humans hit 100%.

Business Insider

AI Helps Low-Performing Engineering Teams 4x More Than High-Performing Ones, New Benchmarks Show

LONDON, March 05, 2026 (GLOBE NEWSWIRE) -- The 2026 Engineering Productivity Benchmarks from Plandek, a Developer Productivity Intelligence (DPI) platform, has analyzed data from more than 2,000 ...

Forbes

The Messy Cost Of AI Code

AI-driven coding promised speed, but its code often fractures under pressure, leaving teams to carry the weight of failures that slow products and raise real costs. Buoyed by the rise of AI, many ...

VentureBeat

Has this stealth startup finally cracked the code on enterprise AI agent reliability? Meet AUI's Apollo-1

For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet even as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, ...

Ars Technica

Anthropic says its new AI model “maintained focus” for 30 hours on multistep tasks

Claude is popular with some software developers thanks to Claude Code, and Anthropic is confident about the latest version of Sonnet’s coding capability: “Claude Sonnet 4.5 is the best coding model in ...

SlashGear

Is OpenAI Falling Behind In The Artificial Intelligence 'Arms Race'?

Describing AI development as an "arms race" might seem needlessly bombastic, but there's a reason why this term has entered common usage. It encapsulates the speed and intensity at which companies are ...

VentureBeat

Kilo launches AI-powered Slack bot that ships code from a chat message

Kilo Code, the open-source AI coding startup backed by GitLab cofounder Sid Sijbrandij, is launching a Slack integration that allows software engineering teams to execute code changes, debug issues, ...
