Overview

This article describes an experiment applying Karpathy’s autoresearch methodology to LLM inference optimization. The author built a system in which an AI agent automatically searches for inference speed improvements under strict evaluation constraints. The most revealing outcome wasn’t the speedups found, but what the agent could and couldn’t optimize, which points to where the real hardware performance bottlenecks lie.

The Breakdown

  • Bounded optimization framework - Creates a constrained search space where an AI agent can only modify inference.py while prepare.py locks down the evaluation harness, preventing the agent from “winning” by changing the benchmark
  • Automatic hill-climbing on inference speed - The agent continuously edits code, commits changes, runs benchmarks, and keeps improvements that increase generation tokens per second while maintaining quality gates
  • Multi-dimensional evaluation harness - Tests across five different prompt types (explanation, long-context summarization, reasoning, creative generation, code generation) on Apple Silicon hardware to capture real performance characteristics
  • Reversibility and observability design - Every experiment is logged and bad ideas can be cheaply discarded, making the search process transparent, unlike typical “AI optimization” demos that show only final results
  • Real hardware performance insights - The experiment reveals what kinds of optimizations actually work on production hardware versus what fails, exposing the gap between theoretical speedups and practical inference engineering
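The accept/reject loop described above can be sketched in a few lines. This is a minimal illustration, not the author’s actual code: the candidate patches, the quality-gate threshold, and the `benchmark` stand-in (which in the real system would run the locked prepare.py harness against the agent’s edited inference.py) are all hypothetical.

```python
QUALITY_GATE = 0.9  # hypothetical minimum quality score to accept an edit

def benchmark(candidate):
    """Stand-in for the locked evaluation harness (prepare.py in the article).
    Returns (generation tokens/sec, quality score in [0, 1])."""
    return candidate["tok_per_sec"], candidate["quality"]

def hill_climb(baseline, candidates):
    """Keep only edits that beat the current best tokens/sec AND pass the
    quality gate; log every experiment so bad ideas are cheap to discard."""
    best = baseline
    history = []
    for cand in candidates:
        tps, quality = benchmark(cand)
        accepted = quality >= QUALITY_GATE and tps > best["tok_per_sec"]
        history.append({"name": cand["name"], "tok_per_sec": tps,
                        "quality": quality, "accepted": accepted})
        if accepted:
            best = cand  # "commit" the change and climb from here
    return best, history

# Hypothetical experiments: the middle one is fast but fails the quality gate.
baseline = {"name": "baseline", "tok_per_sec": 40.0, "quality": 0.95}
candidates = [
    {"name": "kv-cache tweak", "tok_per_sec": 46.0, "quality": 0.94},
    {"name": "aggressive pruning", "tok_per_sec": 80.0, "quality": 0.70},
    {"name": "batch scheduling", "tok_per_sec": 52.0, "quality": 0.92},
]
best, log = hill_climb(baseline, candidates)
print(best["name"])  # -> batch scheduling
```

The key design point the sketch preserves is the separation of concerns: the agent may change what `benchmark` measures (the candidate code) but never how it measures, so it cannot “win” by editing the benchmark itself.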