srinivas raghav's blog

least action, long context, and swarm learners

an intuition-first thought experiment about “doing more with less” in AI, with sutton's interview in hindsight

related reading: a paper correlating least action with current learning techniques

tl;dr: physics has this beautiful rule called the principle of least action: nature picks the path that “costs” least. if we look at learning through that lens, the per-token loss is like velocity, its change is acceleration, and learning typically decelerates as models digest more context. that doesn’t doom us to “parrots”—it just says naïve in-context gains taper. the fun part: imagine inference-time learning where many small “child” models adapt locally, sync what matters globally, and aim to minimize total action (compute, latency, and error) per task. that could look less like one giant brain and more like a honeybee colony: small, fast learners + smart coordination.


what “least action” actually means

if we treat compute + time + mistakes as the “cost” of a solution, a least-action learner tries to solve tasks with the smallest total bill, not merely the smallest instantaneous loss or the fastest (but sloppy) guess.
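
for reference, the physics version: for a path $q(t)$ with lagrangian $L(q, \dot{q}, t)$ (kinetic minus potential energy), the action is the time integral of $L$, and the paths nature actually takes are the ones that make it stationary:

$$ S[q] = \int_{t_0}^{t_1} L(q, \dot{q}, t)\, dt, \qquad \delta S = 0 $$

the learning analogue swaps energy for compute, time, and mistakes, but keeps the shape: score whole solution paths, not single steps.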


information kinematics for language models

empirically, as context grows the loss usually drops but the rate of drop slows → deceleration. intuitively: early tokens reveal easy structure, later tokens add smaller bits of novelty.
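
a toy way to make “velocity” and “acceleration” concrete, assuming you can pull a per-token loss curve out of your model (the numbers below are stand-ins):

```python
import numpy as np

# per-token losses over context position (stand-in numbers; in practice,
# read these off your model's next-token loss at each position)
loss = np.array([4.10, 3.20, 2.70, 2.40, 2.25, 2.15, 2.10, 2.07])

velocity = np.diff(loss)          # first difference: how fast loss is falling
acceleration = np.diff(velocity)  # second difference: is the fall slowing?

# deceleration = loss still falling (velocity < 0) while the fall slows
# (acceleration > 0): each new token buys a little less than the last
print("velocity:    ", velocity)
print("acceleration:", acceleration)
```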

takeaway: in-context learning often improves quickly, then flattens. that’s a property of information exposure—not necessarily a hard cap on intelligence.


what deceleration means

deceleration says: this particular stream is giving diminishing marginal info. three ways around that (a tiny dispatcher sketch follows the list):

  1. better signals: feedback, consequences, or richer tasks create higher-value bits.
  2. better mechanisms: memory, tools, retrieval, and planning increase how much “useful work” we squeeze from each bit.
  3. better coordination: multiple agents dividing labor (swarm) can reduce wasted compute.
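
tying the three together; the predicates and thresholds below are illustrative, not a real API:

```python
# stub predicates; a real system would query its environment here.
# all names are illustrative, not a real API.
def feedback_available() -> bool:
    return False

def tools_available() -> bool:
    return True

def choose_strategy(loss_velocity: float, flat: float = -0.01) -> str:
    """map measured (de)celeration to one of the three workarounds above.

    loss_velocity is the per-token change in loss (negative = still improving);
    the flatness threshold is illustrative, not tuned.
    """
    if loss_velocity < flat:
        return "keep reading: the stream is still paying"
    if feedback_available():               # 1. better signals
        return "seek feedback / a richer task"
    if tools_available():                  # 2. better mechanisms
        return "reach for memory, retrieval, tools"
    return "split the task across agents"  # 3. better coordination

print(choose_strategy(-0.50))   # steep drop -> keep consuming context
print(choose_strategy(-0.002))  # flattened  -> escalate
```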

lagrangian vs hamiltonian

two languages for the same physics, and complementary intuitions for learning systems: the lagrangian view judges whole trajectories at once (variational and global, like a training objective), while the hamiltonian view evolves state step by step from local rules (like inference-time dynamics).
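
for reference, the two formulations side by side:

$$ \frac{d}{dt}\frac{\partial L}{\partial \dot{q}} = \frac{\partial L}{\partial q} \qquad \text{vs} \qquad \dot{q} = \frac{\partial H}{\partial p}, \quad \dot{p} = -\frac{\partial H}{\partial q} $$

same trajectories, two bookkeeping schemes: one global optimization over paths, one local update rule per step.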


the thought experiment: inference-time learning as a swarm

what if models didn’t only learn during pre-training, but adapted a bit during inference—and did so collectively?

ingredients

least-action objective (rough sketch)

we want solutions that minimize a task’s total “cost”:

$$ \text{action} \;=\; \alpha \cdot \underbrace{\text{compute}}_{\text{how hard we worked}} \;+\; \beta \cdot \underbrace{\text{latency}}_{\text{how long users waited}} \;+\; \gamma \cdot \underbrace{\text{residual error}}_{\text{how wrong we stayed}} $$

each child chooses micro-updates and tool calls that reduce this action. “big updates” are expensive; “tiny tweaks that help” are cheap. globally, the colony learns what to synchronize (only high-value skills), also minimizing action.
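
a minimal sketch of a child picking its next move by this action score; the weights and candidate moves are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Move:
    name: str
    compute: float  # work done, in some normalized unit
    latency: float  # seconds the user waits
    error: float    # expected residual error after the move

# weights are made up for illustration; a product would tune them per task
ALPHA, BETA, GAMMA = 1.0, 2.0, 5.0

def action(m: Move) -> float:
    return ALPHA * m.compute + BETA * m.latency + GAMMA * m.error

candidates = [
    Move("answer from cache",       compute=0.1, latency=0.1, error=0.30),
    Move("tiny adapter tweak",      compute=0.5, latency=0.3, error=0.10),
    Move("full retrieval + rerank", compute=2.0, latency=1.5, error=0.05),
]

best = min(candidates, key=action)
print(best.name, "->", round(action(best), 2))  # tiny adapter tweak -> 1.6
```

with these (arbitrary) weights, the cheap tweak beats both the sloppy cache hit and the heavyweight retrieval, which is exactly the “tiny tweaks that help are cheap” behavior we want.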

how it learns what to prioritize
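
one concrete rule, consistent with the objective above: score each child's delta by the action it's expected to save colony-wide minus the action it costs to sync, and only distill the positive scorers. a sketch (every name and number here is hypothetical):

```python
# score each child delta by colony-wide action saved minus the action
# spent syncing it; only positive scorers get distilled.
deltas = [
    {"skill": "date math",    "action_saved": 0.80, "bytes": 2_000},
    {"skill": "niche trivia", "action_saved": 0.05, "bytes": 50_000},
    {"skill": "sql quoting",  "action_saved": 0.60, "bytes": 4_000},
]

SYNC_COST_PER_BYTE = 1e-5  # illustrative price of shipping a delta

def sync_value(d) -> float:
    return d["action_saved"] - SYNC_COST_PER_BYTE * d["bytes"]

worth_syncing = sorted(
    (d for d in deltas if sync_value(d) > 0),
    key=sync_value,
    reverse=True,
)
for d in worth_syncing:
    print(d["skill"], round(sync_value(d), 2))  # date math 0.78, sql quoting 0.56
```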


an architecture sketch (maybe one pass)

  1. router: breaks a user task into sub-goals; picks a subset of children.
  2. children: do fast local reasoning with short-term memory + tiny adapters.
  3. tool belt: retrieval, code exec, calculators, APIs; children can call tools.
  4. sync hub: periodically ingests child deltas (compressed gradients/skills), runs distillation, and emits new tiny adapter packs.
  5. governor: enforces the least-action budget (compute/time caps, safety rails).
  6. ledger: logs velocity/acceleration of loss, latency, and compute per solved sub-goal.
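
a deliberately toy pass through that loop; every function below is a placeholder standing in for a real component:

```python
# one toy pass through the colony loop; all names are placeholders.

def router(task):                  # 1. split the task, pick children
    return [f"{task}::part{i}" for i in range(2)]

def tool_belt(subgoal):            # 3. retrieval / code exec / APIs, stubbed
    return f"answer({subgoal})"

def child_solve(subgoal):          # 2. fast local reasoning + a tool call
    answer = tool_belt(subgoal)
    delta = {"subgoal": subgoal, "action_saved": 0.1}  # tiny local tweak
    return answer, delta

def sync_hub(deltas):              # 4. would distill high-value deltas
    pass

def ledger(task, spent):           # 6. log cost per solved sub-goal
    print(f"{task}: action spent = {spent}")

def one_pass(task, budget=3.0):    # 5. governor: hard least-action cap
    spent, answers, deltas = 0.0, [], []
    for sg in router(task):
        if spent >= budget:
            break
        answer, delta = child_solve(sg)
        spent += 1.0               # toy unit of compute per sub-goal
        answers.append(answer)
        deltas.append(delta)
    sync_hub(deltas)
    ledger(task, spent)
    return answers

print(one_pass("summarize q3 report"))
```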

result: each inference is also a mini-learning episode; each episode leaves the colony a bit smarter, without heavy global retraining.


what we’d measure

  1. kinematics: loss velocity and acceleration per episode: is each child still learning, and where does it flatten?
  2. action per solved sub-goal: the weighted compute + latency + residual-error bill from the ledger.
  3. sync value: did a distilled adapter pack actually lower colony-wide action, or just add weight?


failure modes (and how a least-action view helps)

  1. children gaming the budget: fast-but-wrong answers look cheap until the residual-error term of the action bites; the accounting makes sloppiness visible.
  2. sync flooding: everyone ships low-value deltas; pricing the sync itself into the action keeps only high-value skills moving.
  3. starvation: a too-strict governor caps compute below what hard tasks need; that shows up in the ledger as residual error refusing to drop.


closing: leverage, not laziness

least action isn’t about being lazy—it’s about maximizing leverage: the right bit of work, at the right place, for the biggest downstream effect. even if single-model in-context learning decelerates, a swarm that adapts, coordinates, and distills can keep the overall system on a low-action trajectory. that feels less like a parrot and more like a hive that keeps getting better.

“if you know the way broadly, you will see it in all things.” — musashi

the way here might be: learn locally, share globally, spend compute only where it moves the needle.