Key Takeaways
- OpenAI's GPT-5.1 topped a new coding benchmark with 24.6% accuracy, beating Claude 4.5 Sonnet
- Google's Gemini 3 Pro placed fourth and took nearly 3 hours per task
- Elon Musk's Grok models scored 0% after frequently giving up on coding tasks
Why It Matters
When the best AI model in the world can successfully complete coding tasks only about a quarter of the time, it's either a humbling reminder of how far we have to go or proof that humans still have job security for a few more months. The new "Vibe Code Bench" from Vals AI tested whether AI models could build entire web applications from scratch: think less "fix this bug" and more "build me the next Instagram, but call it Zeeter." The results suggest that while AI can write poetry and pass bar exams, asking it to create a functional app is still like asking your cat to do your taxes.
OpenAI's victory here is particularly sweet because the company has been playing catch-up with Anthropic's Claude in coding capabilities for most of 2025. Not only did GPT-5.1 outperform Claude 4.5 Sonnet, but it did so at less than half the cost: $2.57 per test versus Claude's $6.66. That's the kind of efficiency that makes CFOs weep tears of joy and competitors quietly update their pricing strategies. Meanwhile, Google's Gemini 3 Pro took nearly 3 hours per task, long enough that you could probably learn to code yourself in the time it takes to finish one app.
Perhaps the most entertaining subplot involves Elon Musk's Grok, which apparently has the coding equivalent of performance anxiety. The model would start working, encounter an error, declare the situation "unrecoverable," and essentially rage-quit like a frustrated teenager. This "zero accountability" approach landed both Grok versions at the bottom with a perfect 0% accuracy score. The benchmark reveals that persistence and error recovery—qualities humans take for granted—remain the secret sauce separating the AI wheat from the chaff. Until these models learn to push through setbacks instead of throwing digital tantrums, human developers can sleep soundly knowing their jobs aren't disappearing overnight.



