Overview

This article describes an experiment applying Karpathy’s autoresearch methodology to LLM inference optimization. The author built a system in which an AI agent automatically searches for inference speed improvements under strict evaluation constraints. The most revealing outcome wasn’t the speedups found, but what the agent could and couldn’t optimize, which points to where the real hardware performance bottlenecks lie.

The Breakdown

  • Bounded optimization framework - Creates a constrained search space where an AI agent can only modify inference.py while prepare.py locks down the evaluation harness, preventing the agent from “winning” by changing the benchmark
  • Automatic hill-climbing on inference speed - The agent continuously edits code, commits changes, runs benchmarks, and keeps improvements that increase generation tokens per second while maintaining quality gates
  • Multi-dimensional evaluation harness - Tests across five different prompt types (explanation, long-context summarization, reasoning, creative generation, code generation) on Apple Silicon hardware to capture real performance characteristics
  • Reversibility and observability design - Every experiment is logged and bad ideas can be cheaply discarded, making the search process transparent, unlike typical “AI optimization” demos that show only final results
  • Real hardware performance insights - The experiment reveals what kinds of optimizations actually work on production hardware versus what fails, exposing the gap between theoretical speedups and practical inference engineering
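The accept/reject loop described above can be sketched in a few lines. This is a minimal illustration, not the author’s actual code: the candidate patches, the quality-gate threshold, and the `benchmark` stand-in (which in the real system would run the locked prepare.py harness against the agent’s edited inference.py) are all hypothetical.

```python
QUALITY_GATE = 0.9  # hypothetical minimum quality score to accept an edit

def benchmark(candidate):
    """Stand-in for the locked evaluation harness (prepare.py in the article).
    Returns (generation tokens/sec, quality score in [0, 1])."""
    return candidate["tok_per_sec"], candidate["quality"]

def hill_climb(baseline, candidates):
    """Keep only edits that beat the current best tokens/sec AND pass the
    quality gate; log every experiment so bad ideas are cheap to discard."""
    best = baseline
    history = []
    for cand in candidates:
        tps, quality = benchmark(cand)
        accepted = quality >= QUALITY_GATE and tps > best["tok_per_sec"]
        history.append({"name": cand["name"], "tok_per_sec": tps,
                        "quality": quality, "accepted": accepted})
        if accepted:
            best = cand  # "commit" the change and climb from here
    return best, history

# Hypothetical experiments: the middle one is fast but fails the quality gate.
baseline = {"name": "baseline", "tok_per_sec": 40.0, "quality": 0.95}
candidates = [
    {"name": "kv-cache tweak", "tok_per_sec": 46.0, "quality": 0.94},
    {"name": "aggressive pruning", "tok_per_sec": 80.0, "quality": 0.70},
    {"name": "batch scheduling", "tok_per_sec": 52.0, "quality": 0.92},
]
best, log = hill_climb(baseline, candidates)
print(best["name"])  # -> batch scheduling
```

The key design point the sketch preserves is the separation of concerns: the agent may change what `benchmark` measures (the candidate code) but never how it measures, so it cannot “win” by editing the benchmark itself.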