the messy middle – or why you can’t untie a whole language from a transformer without losing the plot
So, here's something interesting I've been working on: when it comes to multilingual AI models, the middle layers seem way more important than the later ones.
I know, it sounds a bit weird, but it makes sense. I’ve seen metrics on languages with english(llm majorly holding) and hindi (minor) like cosine distance that show a huge, clear difference between languages right near the end of the model’s process. You can practically see the model separating Hindi from English. And that’s the trap, and it took stupid me, 6hrs of a100 to figure out that shit.
The model does understand the difference between languages in those final layers. But that's like looking at a cake puffing up in the oven and saying, "Yep, that's a cake." It’s obvious, but the baking is pretty much done. Trying to change the model's output after it has already decided on a language is just as pointless as trying to unbake that cake.
The Real Learning Happens in the Messy Middle
Think about how we learn a new language. The most intense learning happens when you're moving from a beginner to being fluent. You're not just memorizing words; you're connecting concepts, grammar, and culture in a messy, entangled way. It seems like the middle layers of an LLM are doing the exact same thing. This is where the model is still figuring things out, where the "language decision" is still forming.
This is also where things get complicated. The features for different languages are heavily tangled up with each other. A language like Hindi can be written in the Roman alphabet (Hinglish), so the model has to understand the semantics beyond just the script and literally i mean literally we can write one language in any other language man. This is why a simple prompt-based approach to control a model’s language often isn’t very robust.
My Little Experiment with Surgical Tools
This got me thinking, what if we go inside the model and mess with its internal weights? The idea is to find the specific "neurons" or features that represent a language and just... turn them off.
I decided to try this out. I used some tools called Sparse Autoencoders (SAEs) from Goodfire, which are trained to find important features in a model. I experimented on a Llama 3.1 8B model sae. i am assumming scaling but i maybe wrong about it.
And it kind of worked atleast in the 8b model! I saw a significant suppression of the Devanagari script and other aspects of Hindi using their auto-steer. But here’s the catch: the model’s output became a mess. It started repeating itself or just spitting out broken, nonsensical text.
It proved my hunch: these language features are so deeply entangled that trying to surgically remove one just causes the whole system to degrade. It’s like trying to pull a single thread out of a sweater—you don’t just get the thread, you get a hole.
so, i tried experimenting and i mean i have not really reached anywhere promising in the direction i wanted, but surprisingly middle layers work well.
So, If We Can't Remove It, What Can We Do?
This leads to a tough spot. If any attempt to surgically remove a language causes the model to break, is it a dead end?
Maybe not. Maybe the goal shouldn't be a clean removal but a robust degradation. If breaking the model is unavoidable, can we find techniques that consistently degrade its ability in a specific language, so it can be "jailbreak" prone. We’d need to formally test these methods to see how resistant they are to adversarial attacks.
It’s a tricky problem. For now, it seems like surgically removing a language from a model, especially in the later layers, just won't work.
But it leaves me with a far-out question: Is a model just a mimicking machine, or will it actively try to preserve language because it sees it as a cultural thing, the way a human does? Probably a long shot, but it's fun to think about.