srinivas raghav's blog

peeking inside how llama-3 understands hindi

why poke at hindi representations?

let's say you want to "remove" hindi from the model. sounds crazy, right? but there are real reasons to try: maybe the model's hindi performance just isn't good enough and you'd rather keep the model running without it, or maybe you simply want to know which internal pieces carry the language before you start steering them.

now our bet: if you truly turn off the hindi bits, the model's broader language ability wobbles. it starts mindlessly repeating or breaking down, much like damaging the main language area of a brain. proving this teaches us how tightly languages are woven into these models, and whether we can steer them without chaos (steering here meaning nudging the model to prefer one language over another).

we're chasing three hunches:

  1. removal hypothesis - if we turn off the hindi knobs, does the model go off the rails?
  2. middle-layer hypothesis - are the middle layers the real hindi control panel?
  3. hierarchy hypothesis - do different layers carry different kinds of hindi info (script, grammar, meaning)?

the cast: models & datasets

we split everything into train / validation / test, keep random seeds fixed, and save the indices so anyone can rerun the exact same experiments.
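a minimal sketch of what that reproducible split could look like; the seed, fractions, and file name here are illustrative, not the exact ones we use:

import numpy as np

def make_split(n_examples, seed=42, frac=(0.8, 0.1, 0.1)):
    # shuffle once with a fixed seed, then carve out train / validation / test
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train = int(frac[0] * n_examples)
    n_val = int(frac[1] * n_examples)
    split = {
        "train": idx[:n_train],
        "val": idx[n_train:n_train + n_val],
        "test": idx[n_train + n_val:],
    }
    # save the indices so anyone can rerun the exact same experiments
    np.savez("split_indices.npz", **split)
    return split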


our toolbox: gated topk sparse autoencoders

to discover the hidden hindi features, we train sparse autoencoders (saes) on llama's internal activations. each sae takes a 4096-d vector (what llama thinks mid-sentence), expands it into a much bigger dictionary (32k-64k neurons), but only lets a tiny number (like 64) fire per token; that enforced sparsity is the crux of using an sae.
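here's a minimal sketch of such a gated topk sae in pytorch. the class name, the sigmoid gate path, and the layer shapes are illustrative assumptions, not our exact training code; the point is the shape of the computation: encode wide, gate, keep the top k, reconstruct.

import torch
import torch.nn as nn

class GatedTopKSAE(nn.Module):
    def __init__(self, d_model=4096, expansion=8, k=64):
        super().__init__()
        d_dict = d_model * expansion            # e.g. 32k dictionary features
        self.k = k                              # active features per token
        self.encoder = nn.Linear(d_model, d_dict)
        self.gate = nn.Linear(d_model, d_dict)  # separate gate path
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        pre = torch.relu(self.encoder(x))       # candidate feature magnitudes
        g = torch.sigmoid(self.gate(x))         # gate values in [0, 1]
        acts = pre * g                          # gated activations
        # keep only the top-k features per token, zero out the rest
        topk = torch.topk(acts, self.k, dim=-1)
        sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
        x_hat = self.decoder(sparse)            # reconstruction of the hidden state
        return x_hat, sparse, g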

why "gated topk"?

the gate path decides whether a feature should fire at all (separately from how strongly it fires), and the topk constraint caps how many features can be active per token, which keeps the dictionary sparse and the features crisp. our training objective looks like:

\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \sum_i g_i

where x is the original hidden state, x̂ is the reconstruction, and g_i are the gate values. we tune the "expansion factor" (8x for middle layers, up to 16x for late layers) so we have enough capacity to capture fine-grained language cues without getting messy.
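a sketch of how that objective could be computed for the module sketched above; the lambda value here is a placeholder, not a tuned hyperparameter:

def sae_loss(sae, x, lam=3e-4):
    # squared reconstruction error plus an l1-style penalty on the gate values,
    # matching the objective above; lam trades reconstruction against sparsity
    x_hat, sparse, g = sae(x)
    recon = ((x - x_hat) ** 2).sum(dim=-1).mean()   # ||x - x_hat||_2^2, averaged over tokens
    gate_penalty = g.sum(dim=-1).mean()             # sum_i g_i, averaged over tokens
    return recon + lam * gate_penalty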


pipeline in a nutshell

  1. collect & clean: normalize text, ditch super short lines, keep both devanagari and transliterated copies, align translations, tag code-mixed tokens (a tiny cleaning sketch follows this list).
  2. train saes: focus on layers 10-20 (middle) and 28-32 (upper), train gated topk saes, watch reconstruction loss and average active features.
  3. feature discovery loop:
    • run both hindi and english through the saes
    • compute stats (hindi vs english activation gaps, gradient × activation scores; see the scoring sketch after this list)
    • pull top snippets for each candidate feature and ask a helper model (gemini flash or similar) to draft a label like "hindi polite verb endings"
    • verify by toggling the feature on/off and seeing whether the output changes as expected
  4. steering experiments (a hook-based sketch follows this list):
    • soft dial: scale the feature from 1.0 -> 0.7 -> 0.4 -> 0.1
    • activation patching: paste english activations into a hindi run to see if hindi disappears
    • residual projection: remove just that feature direction from the hidden state
    • gated toggle: slam the gate shut and observe the fallout
  5. evaluation: log metrics (perplexity, bleu, chrf, repetition rate) plus a short human check for fluency and meaning
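a tiny sketch of the cleaning step from item 1. the length threshold and the devanagari-ratio heuristic are assumptions for illustration, not the exact pipeline:

import unicodedata

def devanagari_ratio(text):
    # fraction of alphabetic characters that fall in the devanagari block
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    deva = sum(1 for c in letters if '\u0900' <= c <= '\u097f')
    return deva / len(letters)

def clean_lines(lines, min_chars=20):
    for line in lines:
        line = unicodedata.normalize("NFC", line.strip())
        if len(line) < min_chars:       # ditch super short lines
            continue
        ratio = devanagari_ratio(line)
        # rough script tag: devanagari hindi, transliterated/english, or code-mixed
        tag = "deva" if ratio > 0.8 else ("latin" if ratio < 0.2 else "mixed")
        yield line, tag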
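the scoring step from item 3, sketched as a ranking over dictionary features. the way we combine the activation gap with gradient × activation here is a simple illustrative choice, not the exact weighting:

def rank_hindi_features(acts_hi, acts_en, grads_hi):
    # acts_*: [tokens, d_dict] sae activations for each corpus
    # grads_hi: gradient of a hindi language-modelling loss w.r.t. acts_hi
    gap = acts_hi.mean(dim=0) - acts_en.mean(dim=0)   # fires more on hindi than english?
    attribution = (grads_hi * acts_hi).mean(dim=0)    # gradient x activation
    score = gap.clamp(min=0) * attribution.abs()      # hindi-selective AND influential
    return torch.argsort(score, descending=True)      # best candidate features first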
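and the soft-dial steering from item 4, sketched as a standard pytorch forward hook that scales one feature's contribution in the residual stream; the exact intervention code in the project may differ:

def steer_feature(layer_module, sae, feat_idx, scale=0.4):
    # soft dial: re-encode the layer's hidden state through the sae, scale one
    # feature, and add the implied difference back into the residual stream
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        x_hat, sparse, _ = sae(hidden)
        steered = sparse.clone()
        steered[..., feat_idx] *= scale               # e.g. 1.0 -> 0.7 -> 0.4 -> 0.1
        delta = sae.decoder(steered) - x_hat          # change implied by the dial
        new_hidden = hidden + delta
        return (new_hidden,) + output[1:] if isinstance(output, tuple) else new_hidden
    return layer_module.register_forward_hook(hook)

attach the hook, generate, then call .remove() on the returned handle to restore the untouched model.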

here's a bird's-eye pseudocode sketch:

for layer in target_layers:
    # 1. fit a gated topk sae on this layer's cached hidden states
    sae = train_gated_topk_sae(hiddens[layer], expansion=expansion[layer])

    # 2. find candidate hindi features by contrasting the two corpora
    features = discover_features(
        sae,
        corpora={"hi": hindi_data, "en": english_data},
        ranking="grad_times_activation"
    )

    # 3. label each candidate and keep only the ones that pass a causal check
    verified = []
    for feat in features:
        label = label_feature_with_gemini(sae, feat)
        outcome = causal_toggle(model, layer, feat, scales=[1.0, 0.5, 0.1])
        if outcome.passes_thresholds():
            verified.append((feat, label, outcome.metrics))

    # 4. save what survived and run the full steering sweeps on it
    store_verified_features(layer, verified)
    run_steering_sweeps(model, layer, verified)

# 5. score everything: perplexity, bleu, chrf, repetition rate, human checks
evaluate_all_metrics()
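as a concrete example of one of the logged metrics, a simple repetition-rate check (the metric behind "the model goes crazy and loops") might look like this; the n-gram window size is arbitrary:

def repetition_rate(token_ids, n=4):
    # fraction of n-gram windows that repeat: a cheap signal that the model
    # has collapsed into mindless looping after an intervention
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)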

making sense of the hypotheses

removal hypothesis

middle-layer hypothesis

hierarchy hypothesis


evaluation: numbers + human sense-checks


what we expect to learn