Overview

Nate B Jones runs comprehensive blind evaluations of GPT-5.4 against Claude Opus and Gemini, revealing that while GPT-5.4 excels at complex quantitative modeling and file processing, it has a critical flaw: it builds elaborate systems without judgment, often missing obvious logical errors that other models catch easily.

Key Takeaways

  • Always use thinking mode rather than auto mode - GPT-5.4’s default auto mode performs significantly worse on factual accuracy and retrieval tasks, sometimes giving last-place results on questions where thinking mode would place first
  • GPT-5.4 excels at complex data processing and quantitative modeling - it handles 99% of file types, including handwritten receipts via OCR, making it superior for comprehensive business document analysis
  • The model treats tasks as pipelines to execute rather than problems to understand - it will build technically perfect systems while missing obvious data quality issues like fake customers named ‘Mickey Mouse’
  • OpenAI is positioning this as infrastructure for autonomous agents - the emphasis on tool search, computer use, and sustained workflows signals their strategic move toward competing with systems like OpenClaw
  • For writing and creative work, Claude Opus still outperforms significantly - GPT-5.4 struggles with tone, voice, and editorial tasks despite improvements over previous versions

Topics Covered