Overview
Nate B Jones runs comprehensive blind evaluations of GPT-5.4 against Claude Opus and Gemini. The results show that while GPT-5.4 excels at complex quantitative modeling and file processing, it has a critical flaw: it builds elaborate systems without exercising judgment, often missing obvious logical errors that other models catch easily.
Key Takeaways
- Always use thinking mode over auto mode - GPT-5.4’s default auto mode performs significantly worse on factual accuracy and retrieval tasks, sometimes giving last-place results on questions where thinking mode would place first
- GPT-5.4 excels at complex data processing and quantitative modeling - it handles 99% of file types including handwritten receipts via OCR, making it superior for comprehensive business document analysis
- The model treats tasks as pipelines to execute rather than problems to understand - it will build technically perfect systems while missing obvious data quality issues like fake customers named ‘Mickey Mouse’
- OpenAI is positioning this as infrastructure for autonomous agents - the emphasis on tool search, computer use, and sustained workflows signals their strategic move toward competing with systems like OpenClaw
- For writing and creative work, Claude Opus still outperforms significantly - GPT-5.4 struggles with tone, voice, and editorial tasks despite improvements from previous versions
Topics Covered
- 0:00 - The Car Wash Test: GPT-5.4 fails a simple logic test that other AI models pass - recommending walking to a car wash instead of driving, overlooking that the car itself needs to be there
- 2:30 - Evaluation Methodology: Overview of blind evaluation suite comparing GPT-5.4 against Claude Opus and Gemini across six structured tests
- 4:30 - Writing and Creative Performance: GPT-5.4 struggles with creative writing, tone recognition, and business communication compared to competitors
- 7:00 - The Eval From Hell: Complex schema migration test with messy business data - GPT-5.4 excels at file discovery but fails data hygiene
- 9:30 - Thinking Mode vs Auto Mode: Critical performance gap between GPT-5.4’s thinking mode (competitive) and auto mode (often last place)
- 14:00 - Where GPT-5.4 Wins: Three key strengths: quantitative modeling, file type processing, and understanding the AI competitive landscape
- 18:30 - Where GPT-5.4 Fails: Weak writing, infrastructure built without judgment, and slow processing compared to competitors
- 23:30 - OpenAI’s Strategic Direction: Analysis of Peter Steinberger hire and OpenAI’s focus on agentic systems and autonomous workflows
- 29:00 - Practical Recommendations: Specific guidance on when to use GPT-5.4 vs competitors based on task type and performance requirements
- 32:00 - The Future is Agentic: GPT-5.4 as infrastructure for sustained AI workflows rather than single-turn conversations