Small models that loop match 5x-larger models — inference costs drop ~60%

Mira Castellanos, Dawei Lin, Jonas Brekke et al.~35s readarXiv:2605.12473

Bottom line: a new architecture lets a 1.3B-parameter model match 7B-class models on reasoning benchmarks by re-running its own layers on hard inputs — cutting inference compute roughly 60% at equal quality.

The method, Mixture-of-Recursions, adds a router that decides per token how many internal passes to spend. Easy tokens exit after one pass; hard reasoning steps loop up to eight times. Code and weights are public and the approach is compatible with standard transformer stacks.

Caveats: results are unverified above 1.3B parameters, and hard inputs answer slightly slower, which matters for latency-sensitive products.

If your AI spend is inference-heavy, this is the cost curve to watch — quality per dollar, not model size, is becoming the metric.

Recommended action: have your ML team benchmark Mixture-of-Recursions against your current model on one production workload this quarter.