vibecoding goes brrrrrrr!!!
So, for a while I’ve been spending some $ on coding models, and to my surprise, their effectiveness at following complex instructions, scaffolding, handling complex projects, and sitting with the vagueness of the human mind is not bad.
So far I’ve tested gpt-5-codex, gpt-5-high, grok-4 and grok-4-fast, Opus 4.1 and Sonnet 4/4.5, GLM 4.5/4.6, Qwen 3 Max, DeepSeek V3.1 Terminus/V3.2, and Kimi K2. My impressions:
- Claude Code with GLM 4.5/GLM 4.6 ⇒ Best alternative to the Sonnet models, and tool calling is magic (Berkeley Function Calling Leaderboard: Rank 1).
- gpt-5-codex ⇒ Seems to be the best at implementing complex projects.
- gpt-5-high ⇒ The best at planning; really good with long context, and it reasons very well.
- grok code fast 1 ⇒ Fastest for iterating, but not very bright (an effective clown). Have a stronger model plan and produce a git diff, then ask this one to apply it; there’s constant back-and-forth, but it’s effective for latency and throughput.
- grok-4 ⇒ idk, bro. It didn’t feel as good at programming as the other models, but it’s really good with web and research.
- Qwen 3 Max ⇒ It’s alright, and it holds up well against the competition.
- DeepSeek V3.1 Terminus and V3.2 ⇒ Not bad, and it’s worth it given the API cost atp.
- Sonnet 4 and Opus 4.1 ⇒ Not as good as I thought, but it’s alright, ig.
- Sonnet 4.5 ⇒ Really good. Pairing it with gpt-5-high is enough to take a project end-to-end, though while gpt-5-high keeps the goal in mind on its own, Sonnet 4.5 sometimes needs reminding.
- Kimi K2 ⇒ It’s just so creative—love it.