peeking inside how llama-3 understands hindi
why poke at hindi representations?
let's say you want to "remove" hindi from the model. sounds crazy, right? but there are real reasons you might: maybe the model's hindi output just isn't good and you'd rather it not attempt the language at all, or you want tighter control over which languages it speaks.
now our bet: if you truly turn off the hindi bits, the model's language ability as a whole wobbles. it starts mindlessly repeating or breaking down, a bit like how damaging the brain's core language areas scrambles speech in general. proving this teaches us how tightly languages are woven into these models, and whether we can steer them without chaos, where steering means nudging the model to prefer one language over another.
we're chasing three hunches:
- removal hypothesis - if we turn off the hindi knobs, does the model go off the rails?
- middle-layer hypothesis - are the middle layers the real hindi control panel?
- hierarchy hypothesis - do different layers carry different kinds of hindi info (script, grammar, meaning)?
the cast: models & datasets
- models:
- llama-3.1-8b-instruct (main)
- mistral-7b-instruct (to make sure results aren't llama-only)
- data (all free on hugging face, no weird gatekeeping):
- ai4bharat/indiccorp for raw hindi (and maybe bengali) text
- cfilt/iitb-english-hindi for translation pairs
- facebook/flores for gentle evaluation prompts in 200 languages
- ai4bharat/hinglish_social_media_2020 for code-mixed fun
we split everything into train / validation / test, keep random seeds fixed, and save the indices so anyone can rerun the exact same experiments.
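a minimal sketch of that setup with the hugging face datasets library, using the iit-bombay pairs as the example (the dataset ids are the ones listed above; exact split and config names may differ per dataset, and the seed and ratios here are illustrative):

import json
import numpy as np
from datasets import load_dataset

SEED = 42
rng = np.random.default_rng(SEED)

# translation pairs; the other corpora from the list above get the same treatment
raw = load_dataset("cfilt/iitb-english-hindi", split="train")

# shuffle row indices once with a fixed seed, then carve out 80/10/10
idx = rng.permutation(len(raw))
n_train, n_val = int(0.8 * len(idx)), int(0.1 * len(idx))
splits = {
    "train": idx[:n_train],
    "validation": idx[n_train:n_train + n_val],
    "test": idx[n_train + n_val:],
}

# save the exact indices so anyone can rebuild the same splits
with open("splits_iitb.json", "w") as f:
    json.dump({k: v.tolist() for k, v in splits.items()}, f)

train_ds = raw.select(splits["train"])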
our toolbox: gated topk sparse autoencoders
to discover the hidden hindi features, we train sparse autoencoders (saes) on llama's internal activations. each sae takes a 4096-d vector (what llama thinks mid-sentence) and expands it into a bigger dictionary (32k-64k neurons), but only allows a tiny number (like 64) to fire per token, which is the crux of using an sae.
why "gated topk"?
- topk: keep only the k strongest features per token.
- gates: each feature learns whether it should fire at all, which keeps activations cleaner.
our training objective looks like:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \sum_i g_i$$

where $x$ is the original hidden state, $\hat{x}$ is the reconstruction, and $g_i$ are the gate values. we tune the "expansion factor" (8x for middle layers, up to 16x for late layers) so we have enough capacity to capture fine-grained language cues without getting messy.
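here's a minimal pytorch sketch of the idea, not our exact training code, just the shape of it: a 4096-d input, an expanded dictionary, a gate per feature, and a hard top-k on the gated activations.

import torch
import torch.nn as nn

class GatedTopKSAE(nn.Module):
    def __init__(self, d_model=4096, expansion=8, k=64):
        super().__init__()
        d_dict = d_model * expansion            # 32k dictionary at 8x
        self.k = k
        self.enc = nn.Linear(d_model, d_dict)
        self.gate = nn.Linear(d_model, d_dict)  # learns "should this feature fire at all?"
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x):
        # gate decides eligibility, encoder decides magnitude
        g = torch.sigmoid(self.gate(x))
        a = torch.relu(self.enc(x)) * g
        # keep only the k strongest features per token, zero the rest
        topk = torch.topk(a, self.k, dim=-1)
        sparse = torch.zeros_like(a).scatter_(-1, topk.indices, topk.values)
        return self.dec(sparse), sparse, g

def sae_loss(x, x_hat, g, lam=1e-3):
    # reconstruction error plus sparsity pressure on the gates
    return ((x - x_hat) ** 2).sum(-1).mean() + lam * g.sum(-1).mean()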
pipeline in a nutshell
- collect & clean: normalize text, ditch super short lines, keep both devanagari and transliterated copies, align translations, tag code-mixed tokens.
- train saes: focus on layers 10-20 (middle) and 28-32 (upper), train gated topk saes, watch reconstruction loss and average active features.
- feature discovery loop:
- run both hindi and english through the saes
- compute stats (hindi vs english activation gaps, gradient x activation scores; see the ranking sketch right after this list)
- pull top snippets for each candidate feature and ask a helper model (gemini flash or similar) to draft a label like "hindi polite verb endings"
- verify by toggling the feature on/off and checking that the output changes as expected
- steering experiments (a minimal hook sketch follows the pseudocode below):
- soft dial: scale the feature from 1.0 -> 0.7 -> 0.4 -> 0.1
- activation patching: paste english activations into a hindi run to see if hindi disappears
- residual projection: remove just that feature direction from the hidden state
- gated toggle: slam the gate shut and observe the fallout
- evaluation: log metrics (perplexity, bleu, chrf, repetition rate) plus a short human check for fluency and meaning
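as promised, here's roughly what the ranking step could look like, with the activation-gap score written out (the grad x activation variant just weights each activation by the gradient of the language-modeling loss). `sae` is the gated topk sketch from earlier, and `hiddens_hi` / `hiddens_en` are assumed caches of hidden states for hindi and english text at the layer in question:

import torch

@torch.no_grad()
def activation_gap_scores(sae, hiddens_hi, hiddens_en):
    _, acts_hi, _ = sae(hiddens_hi)   # (n_hindi_tokens, d_dict) sparse activations
    _, acts_en, _ = sae(hiddens_en)   # (n_english_tokens, d_dict)
    # mean activation per feature on each language, then the gap
    return acts_hi.mean(dim=0) - acts_en.mean(dim=0)

scores = activation_gap_scores(sae, hiddens_hi, hiddens_en)
candidate_feats = torch.argsort(scores, descending=True)[:200]  # top hindi-leaning features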
here's a bird's-eye pseudocode sketch:
for layer in target_layers:
    # train one sae per target layer on cached hidden states
    sae = train_gated_topk_sae(hiddens[layer], expansion=expansion[layer])
    # rank candidate features by how differently they fire on hindi vs english
    features = discover_features(
        sae,
        corpora={"hi": hindi_data, "en": english_data},
        ranking="grad_times_activation",
    )
    verified = []
    for feat in features:
        label = label_feature_with_gemini(sae, feat)
        # causal check: dial the feature and watch the output
        outcome = causal_toggle(model, layer, feat, scales=[1.0, 0.5, 0.1])
        if outcome.passes_thresholds():
            verified.append((feat, label, outcome.metrics))
    store_verified_features(layer, verified)
    run_steering_sweeps(model, layer, verified)
evaluate_all_metrics()
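and here's roughly what the "soft dial" intervention looks like with a pytorch forward hook, assuming a hugging face llama (`model`), the sae sketch from earlier, a verified feature index `feat`, and a decoder layer index `layer` (all assumptions from the loop above):

import torch

def make_soft_dial_hook(sae, feat, scale):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # run the sae in float32, then cast the edit back to the model's dtype
        _, acts, _ = sae(hidden.float())
        direction = sae.dec.weight[:, feat]                      # (d_model,)
        # remove part of the feature's contribution: scale=0.4 keeps 40% of it
        delta = (scale - 1.0) * acts[..., feat:feat + 1] * direction
        hidden = hidden + delta.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

handle = model.model.layers[layer].register_forward_hook(make_soft_dial_hook(sae, feat, scale=0.4))
out = model.generate(**inputs, max_new_tokens=64)
handle.remove()

setting scale to 0.0 gives the hard "gated toggle" case, and replacing the scaled term with the full feature contribution subtracted out is the residual-projection variant.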
making sense of the hypotheses
removal hypothesis
- test: use the verified hindi features; dial them down gently and sharply.
- collect: the freak-out metrics, i.e. perplexity spikes, translation score drops, repetition, and human "this is nonsense" flags (sketch below).
- baseline: compare with prompt-level "don't use hindi" hacks to show how blunt those are.
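a sketch of the freak-out check we have in mind: `perplexity_on(model, texts, hook=None)` is a hypothetical helper that scores held-out hindi text, optionally with a steering hook attached, and we report the spike relative to the unsteered baseline (the prompt-level baseline goes through the same scoring).

baseline_ppl = perplexity_on(model, hindi_test_texts)

for scale in [0.7, 0.4, 0.1]:
    hook = make_soft_dial_hook(sae, feat, scale)
    steered_ppl = perplexity_on(model, hindi_test_texts, hook=hook)
    print(f"scale={scale}: perplexity spike x{steered_ppl / baseline_ppl:.2f}")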
middle-layer hypothesis
- test: first do a quick sensitivity sweep across all 32 layers using lightweight saes or probes (sketch below), then do deeper steering runs on layers 10-20 and 28-32.
- collect: chart effect size versus layer; expect middle layers to be the main levers.
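for the quick all-layer sweep, something as simple as a per-layer linear probe for "is this token hindi?" gives a cheap sensitivity curve before committing to the heavier sae runs. a sketch, assuming `acts[layer]` holds cached hidden states as (n_tokens, 4096) arrays and `labels` marks tokens as hindi (1) or english (0):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

layer_scores = {}
for layer in range(32):
    X_tr, X_te, y_tr, y_te = train_test_split(
        acts[layer], labels, test_size=0.2, random_state=42
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    layer_scores[layer] = probe.score(X_te, y_te)

# expectation: separability (and later, steering effect size) peaks around layers 10-20
best = max(layer_scores, key=layer_scores.get)
print(f"most linearly separable layer: {best} ({layer_scores[best]:.3f})")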
hierarchy hypothesis
- test: cluster features by labels (script, morphology, semantics, task). see where each set lives. toggle them separately. even try cross-language swaps. (bucketing sketch below.)
- collect: evidence that script lives early, grammar in the middle, and task cues later, each with different failure modes when disabled.
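a rough sketch of how the clustering step could go: bucket verified features by keywords in their auto-generated labels, then toggle each bucket as a group. the keyword lists are assumptions, and `verified` is the list of (feat, label, metrics) tuples from the pipeline above.

BUCKETS = {
    "script": ["devanagari", "script", "matra", "character"],
    "morphology": ["ending", "suffix", "plural", "tense", "postposition"],
    "semantics": ["meaning", "topic", "honorific", "semantic"],
}

groups = {name: [] for name in BUCKETS}
for feat, label, _ in verified:
    for name, keywords in BUCKETS.items():
        if any(kw in label.lower() for kw in keywords):
            groups[name].append(feat)
            break

# toggle one bucket at a time and log which layers / failure modes light up
for name, feats in groups.items():
    print(name, len(feats), "features")
    # e.g. run_steering_sweeps(model, layer, feats)  # same machinery as before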
evaluation: numbers + human sense-checks
- perplexity on held-out hindi/bengali: how surprised the model is after steering.
- translation metrics: bleu & chrf on iit-bombay and flores (sketch after this list).
- dialog sanity: repetition rate, diversity, simple coherence score.
- code-mixed stress: run hinglish to see if we break code switching.
- cross-language check: repeat key tests on bengali (or another language) to show generalization.
- baselines: run the same evals for prompt suppression and linear probes.
- human review: 50 cases per condition, rate fluency and meaning with a lightweight rubric.
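a sketch of the automatic side of that scoreboard, assuming the sacrebleu package for bleu/chrf and a simple n-gram statistic for repetition; `hyps` (steered model outputs) and `refs` (reference translations) are assumed inputs:

import sacrebleu

bleu = sacrebleu.corpus_bleu(hyps, [refs]).score
chrf = sacrebleu.corpus_chrf(hyps, [refs]).score

def repetition_rate(text, n=3):
    # fraction of word trigrams that are repeats; higher = more degenerate output
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)

rep = sum(repetition_rate(h) for h in hyps) / len(hyps)
print(f"bleu={bleu:.1f} chrf={chrf:.1f} repetition={rep:.2%}")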
what we expect to learn
- trying to yank out hindi causes whole regions of the model to collapse, proof that nothing is isolated.
- middle layers turn out to be the main control knobs, fitting with "language meaning lives in the middle" stories.
- features line up in a hierarchy: script, grammar, semantics, and we can show it with causal toggles and swaps.