Intelligence Arbitrage
175 experiments measuring whether cheaper models can match frontier quality on real tasks — with the right prompting.
The Problem
The AI cost conversation is dominated by vibes. Teams assume they need GPT-4o for everything without testing whether a cheaper model — with better prompting — could do the same job. The flip side is also common: teams switch to cheaper models, quality drops, and they switch back without understanding why. Neither approach is systematic.
The Build
175 structured experiments across classification, extraction, summarization, and reasoning tasks. Each experiment holds the task constant while varying the model (GPT-4o, GPT-4o-mini, Claude Haiku, Claude Sonnet) and the prompting strategy (zero-shot, few-shot, chain-of-thought, DSPy-optimized). Every configuration is scored against a held-out ground-truth set, and the findings are compiled into a final report with per-task cost/accuracy curves.
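The experiment matrix above can be sketched as a simple grid runner. This is an illustrative sketch, not the project's actual harness: the model identifiers, the `experiment_grid` and `score_cell` helpers, and the `run_fn` callback are all assumptions standing in for the real model calls.

```python
from itertools import product

# Assumed labels for the four models, four strategies, and four task
# families named in the text; not the project's real configuration.
MODELS = ["gpt-4o", "gpt-4o-mini", "claude-haiku", "claude-sonnet"]
STRATEGIES = ["zero-shot", "few-shot", "chain-of-thought", "dspy-optimized"]
TASKS = ["classification", "extraction", "summarization", "reasoning"]

def experiment_grid():
    """One cell per (task, model, strategy): the task is held constant
    while the model and prompting strategy vary."""
    return [{"task": t, "model": m, "strategy": s}
            for t, m, s in product(TASKS, MODELS, STRATEGIES)]

def score_cell(cell, run_fn, ground_truth):
    """Exact-match accuracy of one cell on a held-out set.
    run_fn(cell, example) is a placeholder for the actual model call."""
    correct = sum(run_fn(cell, x) == y for x, y in ground_truth)
    return correct / len(ground_truth)
```

With 4 tasks, 4 models, and 4 strategies this yields 64 cells; repeated runs per cell would account for the remaining experiments.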
What Makes It Different
The research question is practical: for which task types does prompt optimization eliminate the need for a frontier model? The answer varies significantly by task. Extraction tasks show the biggest arbitrage opportunity. Open-ended reasoning shows the least. The findings directly informed the Arbitrage Factory product.
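The per-task cost/accuracy curves reduce to a Pareto question: which configurations are not dominated by a cheaper, equally accurate alternative? A minimal sketch, assuming scored results arrive as (cost, accuracy, label) tuples (the data below is invented for illustration):

```python
def pareto_frontier(points):
    """points: iterable of (cost, accuracy, label) tuples.
    Returns the cheap-to-accurate frontier: no kept point is beaten
    on both cost and accuracy by any other point."""
    frontier, best_acc = [], float("-inf")
    # Sort by cost ascending, then accuracy descending so that among
    # equal-cost points only the most accurate one can survive.
    for cost, acc, label in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((cost, acc, label))
            best_acc = acc
    return frontier
```

An arbitrage opportunity shows up when a cheap configuration (say, a mini model with few-shot prompting) lands on the frontier within tolerance of the frontier model's best accuracy.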