least action, long context, and swarm learners
an intuition-first thought experiment about “doing more with less” in AI, taking Sutton's interview in hindsight
and a paper connecting the principle of least action to current learning techniques
tl;dr: physics has this beautiful rule called the principle of least action: nature picks the path that “costs” least. if we look at learning through that lens, the per-token loss is like velocity, its change is acceleration, and learning typically decelerates as models digest more context. that doesn’t doom us to “parrots”—it just says naïve in-context gains taper. the fun part: imagine inference-time learning where many small “child” models adapt locally, sync what matters globally, and aim to minimize total action (compute, latency, and error) per task. that could look less like one giant brain and more like a honeybee colony: small, fast learners + smart coordination.
what “least action” actually means
- you can describe a system by a single score over its path, called action.
- action = “how costly was that whole journey?” (in mechanics it’s built from energies).
- nature tends to follow paths where action is stationary (usually minimal among nearby paths).
- this is more general than “shortest distance” or “fastest time”—it’s “best tradeoff.”
if we treat compute + time + mistakes as the “cost” of a solution, a least-action learner tries to solve tasks with the smallest total bill, not just the smallest instantaneous loss or the fastest-but-sloppy guess.
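for reference, the textbook statement (standard classical mechanics, nothing specific to this post): the action is a single score over a whole path, and the realized path makes it stationary.

```latex
% action: one score for a whole path, built from a lagrangian L(q, \dot q, t)
S[q] = \int_{t_0}^{t_1} L\big(q(t), \dot q(t), t\big)\, dt
% nature's paths make S stationary (usually minimal among nearby paths)
\delta S = 0
```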
information kinematics for language models
- think of the running surprise of a text stream as position in “information space.”
- the velocity per step is the per-token cross-entropy loss (how surprising the next token is).
- acceleration is how quickly that per-token loss is changing.
empirically, as context grows the loss usually drops but the rate of drop slows → deceleration. intuitively: early tokens reveal easy structure, later tokens add smaller bits of novelty.
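to make the kinematics concrete, here's a tiny sketch (a hypothetical helper; it assumes you already have per-token cross-entropy losses from whatever model you're probing):

```python
import numpy as np

def information_kinematics(per_token_loss):
    """Treat cumulative surprisal as 'position', per-token loss as 'velocity',
    and its change as 'acceleration'."""
    loss = np.asarray(per_token_loss, dtype=float)   # nats (or bits) per token
    position = np.cumsum(loss)                       # total surprisal so far
    velocity = loss                                  # per-token loss
    acceleration = np.diff(loss, prepend=loss[0])    # change in per-token loss
    return position, velocity, acceleration

# toy example: loss drops quickly, then flattens -> negative, shrinking acceleration
toy_loss = 3.0 / np.sqrt(np.arange(1, 101))
pos, vel, acc = information_kinematics(toy_loss)
print(vel[:3], acc[:3])   # early steps: big drops; later steps: near-zero change
```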
takeaway: in-context learning often improves quickly, then flattens. that’s a property of information exposure—not necessarily a hard cap on intelligence.
what deceleration means
deceleration says: this particular stream is giving diminishing marginal info. three ways around that:
- better signals: feedback, consequences, or richer tasks create higher-value bits.
- better mechanisms: memory, tools, retrieval, and planning increase how much “useful work” we squeeze from each bit.
- better coordination: multiple agents dividing labor (swarm) can reduce wasted compute.
lagrangian vs hamiltonian
- lagrangian view (supervised vibe): describe learning by a scalar function of “where you are” and “how fast you’re changing” (loss + step geometry). stationary action ⇒ update rules (think gradient flows, natural gradients).
- hamiltonian view (rl vibe): track states and their conjugate “momenta” (value/advantage), write coupled dynamics; optimal control pops out. this maps cleanly to rl’s tradeoff of reward vs effort.
two languages; same physics; complementary intuitions for learning systems.
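to see the two languages side by side, a compact sketch (the generic equations are textbook mechanics; the specific toy Lagrangian with a loss “potential” is an assumption added for illustration):

```latex
% lagrangian view: stationary action yields the euler–lagrange equation
S[\theta] = \int L(\theta, \dot\theta)\, dt, \qquad
\delta S = 0 \;\Rightarrow\; \frac{d}{dt}\frac{\partial L}{\partial \dot\theta} = \frac{\partial L}{\partial \theta}

% toy choice for learning: kinetic term minus a loss "potential"
L(\theta, \dot\theta) = \tfrac{1}{2}\|\dot\theta\|^2 - \mathrm{loss}(\theta)
\;\Rightarrow\; \ddot\theta = -\nabla\,\mathrm{loss}(\theta)
% (adding strong friction relaxes this to the familiar gradient flow \dot\theta \propto -\nabla\,\mathrm{loss})

% hamiltonian view: legendre transform to momenta p, then coupled first-order dynamics
H(\theta, p) = p^\top \dot\theta - L, \qquad
\dot\theta = \frac{\partial H}{\partial p}, \quad \dot p = -\frac{\partial H}{\partial \theta}
```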
the thought experiment: inference-time learning as a swarm
what if models didn’t only learn during pre-training, but adapted a bit during inference—and did so collectively?
ingredients
- child models at the edge: small, specialized learners running near the data/user.
- local adaptation loop: ephemeral updates, short-term memory, or parameter-efficient tweaks (adapters/LoRA-like), guided by recent feedback.
- collective sync: after a window, each child shares summaries (gradients/skills/constraints) with a hub; the hub distills and redistributes. no single “head node” makes all decisions; it’s a colony.
least-action objective (rough sketch)
we want solutions that minimize a task’s total “cost”:
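one way to write that bill (a rough sketch; the step-wise sum and the weights alpha, beta, gamma are placeholders chosen for illustration, following the tl;dr's compute + latency + error framing):

```latex
% per-task "action": the total bill over an episode (weights are illustrative placeholders)
\mathcal{A}(\text{task}) \;=\; \sum_{t \,\in\, \text{episode}}
\big( \alpha\,\mathrm{compute}_t + \beta\,\mathrm{latency}_t + \gamma\,\mathrm{error}_t \big)
```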
each child chooses micro-updates and tool calls that reduce this action. “big updates” are expensive; “tiny tweaks that help” are cheap. globally, the colony learns what to synchronize (only high-value skills), also minimizing action.
how it learns what to prioritize
- frame “when to adapt, when to sync, what to keep” as a reinforcement learning problem: reward = action reduction (see the sketch after this list).
- exploration is local and cheap; exploitation is coordinated and conservative.
- over time the colony discovers: which tweaks generalize, which are just noise.
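here's the promised sketch: a minimal bandit-style loop where the reward for “keep / sync / discard” is the measured drop in a task's action (all names are hypothetical; this is an illustration, not a spec):

```python
import random
from collections import defaultdict

# a child's end-of-window decision about a local tweak, framed as a tiny bandit:
# reward = reduction in measured "action" (compute + latency + error) for the task.

CHOICES = ["keep_local", "sync_to_hub", "discard"]

class SyncPolicy:
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon                  # cheap local exploration
        self.value = defaultdict(float)         # running value estimate per choice
        self.count = defaultdict(int)

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(CHOICES)                        # explore locally
        return max(CHOICES, key=lambda c: self.value[c])         # exploit conservatively

    def update(self, choice, cost_before, cost_after):
        reward = cost_before - cost_after       # how much the total bill dropped
        self.count[choice] += 1
        n = self.count[choice]
        self.value[choice] += (reward - self.value[choice]) / n  # incremental mean

# usage: after a window, measure task cost with and without the tweak, then learn.
policy = SyncPolicy()
choice = policy.choose()
policy.update(choice, cost_before=12.4, cost_after=11.1)
```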
an architecture sketch (one possible pass)
- router: breaks a user task into sub-goals; picks a subset of children.
- children: do fast local reasoning with short-term memory + tiny adapters.
- tool belt: retrieval, code exec, calculators, APIs; children can call tools.
- sync hub: periodically ingests child deltas (compressed gradients/skills), runs distillation, and emits new tiny adapter packs.
- governor: enforces the least-action budget (compute/time caps, safety rails).
- ledger: logs velocity/acceleration of loss, latency, and compute per solved sub-goal.
result: each inference is also a mini-learning episode; each episode leaves the colony a bit smarter, without heavy global retraining.
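a minimal sketch of one pass through that architecture (every class and method name here is hypothetical; each piece is a stub standing in for the real component):

```python
# one pass through the colony; router -> children -> sync hub, with a budget cap.

class Child:
    def __init__(self, skill):
        self.skill = skill
        self.adapter_delta = None                      # tiny, ephemeral local tweak

    def solve(self, subgoal, tools=None):
        answer = f"[{self.skill}] draft for: {subgoal.strip()}"   # fast local reasoning
        self.adapter_delta = {"skill": self.skill, "learned_from": subgoal}
        return answer

class SyncHub:
    def __init__(self):
        self.adapter_packs = []                        # distilled skills to redistribute

    def ingest(self, deltas):
        # distill: keep only deltas worth redistributing (placeholder: keep all non-empty)
        self.adapter_packs.append([d for d in deltas if d is not None])

def route(task, children):
    # router: split the task into sub-goals and pick a subset of children
    subgoals = task.split(";")
    return list(zip(subgoals, children[: len(subgoals)]))

def one_pass(task, children, hub, budget=3):
    answers = []
    for subgoal, child in route(task, children)[:budget]:   # governor: cap the work
        answers.append(child.solve(subgoal))
    hub.ingest([c.adapter_delta for c in children])          # periodic sync
    return answers

children = [Child("math"), Child("code")]
hub = SyncHub()
print(one_pass("estimate the cost; write a script", children, hub))
```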
what we’d measure
- loss velocity & acceleration in-session (does the colony still improve or has it plateaued?).
- compute-per-useful-bit (how many flops per 1% error reduction?).
- latency-regret (extra waiting vs a naive single-shot baseline).
- stability (flip-rate of answers under small perturbations).
- distill yield (what fraction of local tweaks survive global distill?).
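to make these measurable, a toy ledger sketch (the field names and units are assumptions for illustration):

```python
def compute_per_useful_bit(flops_spent, error_before, error_after):
    """FLOPs per 1% (absolute) error reduction; None if nothing improved."""
    gain = (error_before - error_after) * 100.0
    return flops_spent / gain if gain > 0 else None

def latency_regret(swarm_latency_s, single_shot_latency_s):
    """Extra waiting versus a naive single-shot baseline, in seconds."""
    return swarm_latency_s - single_shot_latency_s

def flip_rate(answers, perturbed_answers):
    """Fraction of answers that change under small input perturbations."""
    flips = sum(a != b for a, b in zip(answers, perturbed_answers))
    return flips / max(len(answers), 1)

def distill_yield(local_tweaks, promoted_tweaks):
    """Fraction of local tweaks that survive global distillation."""
    return len(promoted_tweaks) / max(len(local_tweaks), 1)
```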
failure modes (and how a least-action view helps)
- stale or noisy sync → wasteful updates. fix: penalize sync that doesn’t reduce action.
- catastrophic drift → safety regressions. fix: governor with rollbacks and canaries.
- privacy/cost → compress summaries, on-device adaptation, differential privacy.
- over-fitting to local quirks → decay ephemeral adapters unless promoted by global reward.
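the last fix is easy to sketch (the decay rate and promotion threshold below are made-up numbers, just to show the shape):

```python
# decay ephemeral adapters unless promoted by global reward.
def update_adapter_weights(adapters, global_reward, decay=0.9, promote_threshold=0.05):
    """adapters: dict of name -> {"weight": float, "promoted": bool};
    global_reward: dict of name -> measured action reduction after distillation."""
    surviving = {}
    for name, a in adapters.items():
        if global_reward.get(name, 0.0) > promote_threshold:
            a["promoted"] = True          # earned its keep globally: keep at full strength
        else:
            a["weight"] *= decay          # fade out local quirks over time
        if a["promoted"] or a["weight"] > 1e-3:
            surviving[name] = a
    return surviving
```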
closing: leverage, not laziness
least action isn’t about being lazy—it’s about maximizing leverage: the right bit of work, at the right place, for the biggest downstream effect. even if single-model in-context learning decelerates, a swarm that adapts, coordinates, and distills can keep the overall system on a low-action trajectory. that feels less like a parrot and more like a hive that keeps getting better.
“if you know the way broadly, you will see it in all things.” — musashi
the way here might be: learn locally, share globally, spend compute only where it moves the needle.