Table of Contents
📖 Introduction
🏗️ Encoder-Decoder vs Decoder-Only Architectures
🎯 Multilingual Training Strategies in 2024-2025
Bitext Pretraining
Cross-Lingual Embedding Alignment
Language Adapters and LoRA
Prompt Tuning & In-Context Bootstrapping
🌍 Transfer between High-Resource and Low-Resource Languages
🛠️ Production Implementations and Tools (2024+)
🔤 Multilingual Model Design: Tokenizers & Routing
📖 Introduction
Multilingual large language models (LLMs) enable a single model to serve many languages, but they face unique challenges of balancing resources and knowledge across diverse tongues. Transfer learning across languages is key to building truly multilingual LLMs, allowing high-resource languages (like English) to enrich understanding and generation in low-resource languages. This report dives deep into state-of-the-art strategies for cross-lingual transfer, covering architectural trade-offs, training methods, adapter techniques, and practical tooling. We focus strictly on recent advances, avoiding older practices, to outline how modern multilingual LLMs are trained and optimized.
🏗️ Encoder-Decoder vs Decoder-Only Architectures
Multilingual LLMs typically use either an encoder-decoder (sequence-to-sequence) Transformer or a decoder-only (autoregressive) Transformer. Both architectures can learn from multiple languages, but they differ in how they allocate model capacity and generalize across languages:
Encoder-Decoder models (e.g. mT5, mBART) have a dedicated encoder to process input text and a separate decoder to generate output. This separation is advantageous for tasks like translation or cross-lingual summarization, where input and output may be in different languages. The encoder focuses on understanding the source-language context, while the decoder concentrates on fluent generation in the target language, capturing language-specific details during training. Because of this division, encoder-decoder models naturally handle sequence-to-sequence transfer learning (e.g. translating from a high-resource language to a low-resource one) and often excel at cross-lingual tasks that require heavy comprehension followed by generation.
Decoder-Only models (e.g. GPT-style LLMs like PaLM, BLOOM, LLaMA) use a single Transformer stack that both interprets the prompt and generates text. They are optimized for language modeling and text completion, making them very effective for open-ended generation. In multilingual settings, decoder-only LLMs can handle any language input/output in one sequence, but they rely on prompt engineering to delineate the task (since there’s no explicit encoder/decoder split). These models have proven capable as general multilingual solvers (ChatGPT and GPT-4 are decoder-only), but they may require more parameters or context to match the focused translation abilities of an encoder-decoder. For example, a 500M-parameter decoder-only model (XGLM) can perform multilingual generation, but a smaller 300M-parameter encoder-decoder model (mT5) can already rival it on translation quality by virtue of its architecture (Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder).
Parameter sharing and resource allocation: In an encoder-decoder setup, the encoder and decoder each hold part of the model’s parameters. All languages share both components, but the model can implicitly allocate more capacity to processing vs generation as needed. In contrast, a decoder-only model’s entire capacity is used for both understanding and generation in a single pass. This makes decoder-only models simpler and often more scalable (many of the largest LLMs are decoder-only), but it can mean that if the model is primarily trained on one language, its internal representations may be biased towards that language for both input and output. Encoder-decoder models offer a structural modularity that can aid multilingual transfer: e.g. the encoder might learn a language-agnostic semantic representation, and the decoder can be conditioned to generate in the desired language via a language token or context signal.
Cross-lingual generalization: Encoder-decoder models explicitly trained on translation tasks tend to develop strong cross-lingual mappings, since the encoder must align semantics across languages for the decoder to generate from. This can improve their ability to transfer knowledge between languages for other tasks as well. Decoder-only models can also generalize across languages (especially if pre-trained on mixed multilingual data), but interestingly, recent research finds that many decoder-only LLMs internally rely on an English-centric representation even when operating in other languages (Do Multilingual LLMs Think In English?). For instance, a 2025 study showed that a multilingual LLM often converts non-English inputs into an English-like latent space to perform reasoning before outputting in the target language. This suggests decoder-only models inherently use high-resource languages as a hidden pivot. While this can be an efficient form of transfer, it may also indicate a limitation in truly capturing other languages’ nuances. Encoder-decoder models, by explicitly forcing alignment through cross-lingual objectives, might achieve more balanced multilingual representations in some cases.
It’s not a question of one type being strictly better – rather, they have different trade-offs. Encoder-decoder architectures provide structured resource allocation (separate modules for input vs. output) and are well-suited for applications like machine translation or any-to-any language generation. They can achieve strong results even with fewer parameters focused on cross-lingual tasks (Are Decoder-Only Language Models Better than Encoder-Only Language Models in Understanding Word Meaning? - ACL Anthology). Decoder-only architectures offer a unified model good for interactive and generative tasks (chat, completion) and are easier to scale to very large sizes. In practice, massive multilingual models (e.g. BLOOM with 176B parameters) have been built as decoder-only, achieving broad multilingual coverage, while others like mT5 (13B) use encoder-decoder to excel particularly in translation and structured tasks. The choice often comes down to the intended use cases and the training data available. Many modern systems combine ideas from both: for example, using a decoder-only base but fine-tuning it on translation tasks, or employing an encoder-decoder for a translation module within a larger pipeline.
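To make the architectural split concrete, here is a minimal sketch that loads one model of each type with Hugging Face Transformers and runs a small translation-style input through each. The checkpoints and prompt wording are illustrative assumptions, not a benchmark, and the untuned mT5 checkpoint would normally be fine-tuned on a translation task before its outputs are useful.

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# Encoder-decoder: mT5-style seq2seq model (separate encoder and decoder stacks).
enc_dec_name = "google/mt5-small"  # illustrative checkpoint
enc_dec_tok = AutoTokenizer.from_pretrained(enc_dec_name)
enc_dec = AutoModelForSeq2SeqLM.from_pretrained(enc_dec_name)

# Decoder-only: GPT-style causal LM (one stack handles both prompt and generation).
dec_only_name = "bigscience/bloom-560m"  # illustrative checkpoint
dec_only_tok = AutoTokenizer.from_pretrained(dec_only_name)
dec_only = AutoModelForCausalLM.from_pretrained(dec_only_name)

src = "translate English to German: The weather is nice today."

# Seq2seq: the encoder reads the source, the decoder generates the target.
# (Raw mT5 is not instruction-tuned; this just demonstrates the interface.)
enc_inputs = enc_dec_tok(src, return_tensors="pt")
print(enc_dec_tok.decode(enc_dec.generate(**enc_inputs, max_new_tokens=40)[0],
                         skip_special_tokens=True))

# Causal LM: the task is expressed purely in the prompt.
prompt = "English: The weather is nice today.\nGerman:"
dec_inputs = dec_only_tok(prompt, return_tensors="pt")
print(dec_only_tok.decode(dec_only.generate(**dec_inputs, max_new_tokens=40)[0],
                          skip_special_tokens=True))
```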
🎯 Multilingual Training Strategies in 2024-2025
Recent years have introduced several training strategies to improve multilingual transfer learning. Below we examine the most effective approaches of 2024/2025, focusing on how they leverage cross-lingual data and model adaptations:
Bitext Pretraining
Bitext pretraining uses parallel text (translations of the same content in different languages) to teach the model direct cross-lingual correspondences. The intuition is that by aligning sentences across languages, the model will learn language-agnostic representations of meaning. Early cross-lingual models like Facebook’s XLM showed that adding a Translation Language Modeling (TLM) objective – where the model sees bilingual sentence pairs and must predict masked words across them – improved cross-lingual understanding. In 2024, bitext pretraining remains a key strategy, especially for improving low-resource languages via high-resource data.
However, recent findings paint a nuanced picture. A 2024 study by Ji et al. evaluated continued pretraining of multilingual models on a machine translation objective (i.e. explicitly training on bitext after an initial multilingual pretraining). Surprisingly, they found that continued training on translation did not always improve cross-lingual representation learning for other tasks (Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning? - ACL Anthology). The explicit sentence-level alignment, while obviously useful for translation itself, sometimes made the model’s representations more separable by language, which hurt cross-lingual transfer on tasks like classification. In other words, focusing too much on perfect translation alignment may cause the model to isolate languages in its latent space (beneficial for translating accurately, but detrimental for, say, transferring knowledge for question answering). This counterintuitive result suggests that bitext objectives must be applied with care. They help the model learn to translate, but they might reduce the natural blending of languages that aids general cross-lingual tasks.
In practice, bitext is still extremely valuable for direct transfer to low-resource languages. For example, the Swallow project (2024) took an English-centric LLaMA-2 model and continued pretraining it on a large Japanese corpus plus some Japanese-English parallel data (Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities). The result was a model with drastically improved Japanese capabilities. Notably, incorporating the bitext (parallel corpus) further boosted its translation ability, on top of the gains from monolingual Japanese training. This shows that for improving generation in the target language or translation quality, bitext is very effective. The caution is that if the goal is cross-lingual understanding or zero-shot transfer, one must monitor that the model doesn’t overfit to “translating” at the expense of generalization.
Modern approaches sometimes use bitext in a multi-stage curriculum: e.g. first do some bilingual alignment training, then switch to multilingual language modeling, etc., to get the benefits of alignment without locking in an overly segregated representation. Bitext is also heavily used in fine-tuning for specific tasks (e.g. translating an English instruction dataset into a low-resource language to fine-tune a model). This is common in 2024 for building multilingual instruction-tuned LLMs – but as we will discuss, recent research like SDRRL shows that solely relying on translated fine-tuning data can be limiting (Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages - ACL Anthology).
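As a minimal sketch of how bitext can be folded into a causal-LM continued-pretraining mix, the snippet below formats parallel sentence pairs as translation-style examples alongside plain monolingual text. The prompt template, mixing ratio, and toy corpora are assumptions for illustration, not the recipe of any particular paper.

```python
import random

# Toy corpora; in practice these come from large crawled and parallel datasets.
monolingual_ja = ["今日は天気がいいです。", "明日は雨が降るでしょう。"]
bitext = [("The weather is nice today.", "今日は天気がいいです。"),
          ("It will probably rain tomorrow.", "明日は雨が降るでしょう。")]

def make_bitext_example(src: str, tgt: str) -> str:
    # A simple template pairing the two sides of a parallel sentence.
    return f"English: {src}\nJapanese: {tgt}"

def build_mixture(mono, pairs, bitext_ratio=0.2, n=1000, seed=0):
    """Sample a training mixture of monolingual and bitext examples."""
    rng = random.Random(seed)
    mixture = []
    for _ in range(n):
        if rng.random() < bitext_ratio:
            mixture.append(make_bitext_example(*rng.choice(pairs)))
        else:
            mixture.append(rng.choice(mono))
    return mixture

train_texts = build_mixture(monolingual_ja, bitext)
# Each string in train_texts is then tokenized and used for standard
# next-token-prediction continued pretraining.
```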
In summary, bitext pretraining provides a strong cross-lingual signal and remains essential for translation and for seeding knowledge in languages that lack large monolingual corpora. The latest results suggest using it in combination with other objectives and being mindful of how it shapes the model’s latent space.
Cross-Lingual Embedding Alignment
Another strategy is to explicitly align the model’s representations (embeddings) across languages, either during pretraining or as a post-hoc adaptation. The goal is to ensure that words or sentences with the same meaning end up close in the model’s embedding space regardless of language. If achieved, this greatly facilitates transfer learning, as the model essentially “understands” different languages in one semantic space.
Pre-hoc alignment: A cutting-edge example from 2024 is the PREALIGN framework. Instead of hoping a multilingual model will naturally learn alignment, PREALIGN injects alignment signals at the very start of training. It collects English-to-other-language translation pairs for words and phrases, and during early pretraining, it substitutes some words with their translations and trains the model to predict them in context. By doing so, the model is forced to put, say, “guitar” and “guitarra” on similar contextual footing from the get-go. This proactive alignment significantly improved the model’s cross-lingual knowledge transfer ability at earlier training stages. In essence, PREALIGN establishes a common cross-lingual embedding space before the bulk of language modeling training, so the model can learn shared concepts more easily down the line. Experiments showed that such early alignment yields better cross-lingual performance than letting the model drift and trying to align it afterwards.
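The core data operation here – dictionary-based code-switching of pretraining text – is easy to sketch. The snippet below randomly substitutes words with dictionary translations before tokenization; the tiny lexicon and substitution rate are illustrative assumptions, and PREALIGN itself adds further training objectives on top of this kind of signal.

```python
import random

# Toy English-to-Spanish lexicon; real systems use large bilingual dictionaries
# or word-aligned parallel corpora.
LEXICON = {"guitar": "guitarra", "house": "casa", "water": "agua"}

def code_switch(text: str, lexicon: dict, p: float = 0.3, seed: int = 0) -> str:
    """Replace a fraction of dictionary words with their translations."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower().strip(".,!?")
        if key in lexicon and rng.random() < p:
            out.append(lexicon[key])  # cross-lingual substitution
        else:
            out.append(word)
    return " ".join(out)

print(code_switch("He plays the guitar at home", LEXICON, p=1.0))
# -> "He plays the guitarra at home"
```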
Post-hoc alignment: Many works have added an alignment step after a model is pre-trained. Classic methods include taking a multilingual model like mBERT and fine-tuning it on parallel sentences with a contrastive loss so that translations have similar embeddings (e.g. LaBSE in 2020). By 2024, these approaches evolved: e.g. aligning not just final embeddings but also internal representation subspaces (see the Lens method later). Post-hoc alignment can also be done via linear mapping of embedding spaces if one has a bilingual dictionary. While such methods can yield a multilingual embedding useful for tasks like retrieval or sentence similarity, they can be costly or limited if done after the fact. Researchers noted that purely post-hoc alignment often requires long training on parallel data and the gains can be limited, hence the shift toward integrating alignment into the pretraining process (as PREALIGN does).
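A minimal sketch of the post-hoc contrastive recipe, assuming a multilingual encoder and a batch of translation pairs, is shown below: mean-pooled sentence embeddings of source and target sentences are pulled together with an InfoNCE-style loss over in-batch negatives. The model name and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "xlm-roberta-base"  # placeholder multilingual encoder
tok = AutoTokenizer.from_pretrained(model_name)
enc = AutoModel.from_pretrained(model_name)

def embed(sentences):
    """Mean-pool the last hidden states into sentence embeddings."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # (B, H)

def contrastive_alignment_loss(src_sents, tgt_sents, temperature=0.05):
    """InfoNCE over in-batch negatives: translation pairs are the positives."""
    src = F.normalize(embed(src_sents), dim=-1)
    tgt = F.normalize(embed(tgt_sents), dim=-1)
    logits = src @ tgt.T / temperature               # (B, B) similarity matrix
    labels = torch.arange(len(src_sents))            # i-th source matches i-th target
    return F.cross_entropy(logits, labels)

loss = contrastive_alignment_loss(
    ["The cat sleeps.", "I like coffee."],
    ["El gato duerme.", "Me gusta el café."],
)
loss.backward()  # in a real loop, followed by optimizer.step()
```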
Overall, cross-lingual embedding alignment is crucial for models to truly share knowledge between languages. The most effective 2024 techniques use existing cross-lingual supervision (dictionaries, parallel corpora) to guide the model’s hidden representations to unify languages. This can be seen as a form of representation transfer learning. By aligning embedding spaces, a model trained on English facts can immediately recognize those facts expressed in, say, Arabic, because the key tokens and sentence embeddings are mapped close together. As we’ll see, some adapter-based methods and even prompt techniques also aim to induce aligned representations.
Language Adapters and LoRA
Parameter-efficient adaptation techniques have become very popular for multilingual transfer. Instead of retraining or fine-tuning the entire large model for a new language, adapters and LoRA modules allow adding or adjusting a small number of parameters to inject new linguistic knowledge or handle imbalances.
Language adapters: These are small bottleneck layers inserted into the model (often between Transformer layers) that can be trained for a specific language (or task) while keeping the base model mostly fixed. In multilingual setups, one can assign an adapter for each language (or group of languages). A prime example is the X-MOD architecture (Pfeiffer et al., 2022), which pre-trains a Transformer with a mix of shared parameters and language-specific adapters. During pretraining, each language has some dedicated capacity, and new languages can be added later by learning new adapters without touching the rest of the model. X-MOD demonstrated that this modular approach can “lift the curse of multilinguality” by giving low-resource languages their own parameters to avoid being overshadowed by high-resource ones. Crucially, adding a new language in X-MOD (post-pretraining) has minimal impact on existing languages – you train new embeddings and adapters for the new language and plug them in. This inexpensive expansion property is very attractive: it means we don’t have to redo massive pretraining to support an under-represented language.
Adapters in 2024 are often combined with techniques like Adapter Fusion or Mixing, where the model can learn to route between language-specific and shared adapters. They are also used in fine-tuning: e.g. one can fine-tune an English model on a new language by inserting a fresh adapter and training it on that language’s text (this avoids disturbing the original weights). The MAD-X framework (2020) and its successors followed this recipe for cross-lingual transfer in tasks, and it’s still relevant: train a language adapter on unlabeled text, then a task adapter on English task data, and combine them to get cross-lingual performance.
LoRA (Low-Rank Adaptation): LoRA is a 2021 method that exploded in popularity by 2023 for fine-tuning LLMs efficiently. It injects trainable low-rank matrices into the model’s layers (usually into the attention and feedforward projections) instead of tuning the full weight matrices (Optimizing Large Language Models: Deep Dive into LoRA and QLoRA Techniques | by Mohamed Elrefaey | Medium). For multilingual transfer, LoRA provides a lightweight way to adjust a model to new languages. One can train a LoRA adapter on a target language corpus – essentially learning a small delta for that language – and activate it during inference to improve generation in that language. This approach has been applied to, for example, improve an English GPT model on Marathi by fine-tuning with LoRA on a translated instruction dataset (Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning). The results show the model produced more fluent Marathi outputs after LoRA tuning, though care must be taken: the cited study observed that while Marathi generation improved, some reasoning performance dropped, hinting at slight trade-offs or evaluation caveats. Still, the advantage is clear: LoRA uses a tiny fraction of the original model’s parameters, so one can maintain separate low-rank adapters for many languages without bloating the main model.
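A minimal sketch of language adaptation with LoRA via Hugging Face PEFT is shown below; the base checkpoint, rank, and target module names are assumptions and should be adjusted for the actual model family.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_name = "bigscience/bloom-560m"  # placeholder base model
tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)

# Low-rank adapters on the attention projections; the base weights stay frozen.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                               # rank of the update matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # module names depend on the architecture
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the full model

# `model` can now be passed to transformers.Trainer (or a custom loop)
# with a target-language corpus; afterwards, save just the small adapter:
# model.save_pretrained("lora-marathi")
```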
In practice, adapter and LoRA strategies are often combined. For example, a recent (late 2024) approach called FLARE integrates source and target language representations within the LoRA adapters’ bottlenecks (Language Fusion for Parameter-Efficient Cross-lingual Transfer | OpenReview). Essentially, FLARE performs a latent fusion of an English representation and a target-language representation inside the adapter, via a lightweight transformation. This yielded impressive gains in cross-lingual understanding: FLARE improved XLM-R and LLaMA-based models such that the performance gap between English and a target language was reduced to around 8–12% on average (whereas without FLARE the gap was considerably larger). Notably, it did this without adding any new parameters – it just uses the existing LoRA capacity more cleverly to mix languages. This speaks to a trend: use the model’s internal knowledge of a high-resource language to guide the adaptation for a low-resource language, via small adapter modules.
Modern toolkits (discussed later) make it very easy to plug in adapters or apply LoRA to large models. By 2025, it’s common to maintain a library of LoRA files or adapter modules for different languages and tasks, which can be selectively loaded on top of a base model. This avoids deploying a dozen separate large models. Crucially, adapter-based approaches also help mitigate catastrophic forgetting – since the base model stays mostly intact, adding a new language via an adapter or LoRA doesn’t erase the old languages. X-MOD’s success in extending to new languages with minimal impact on existing ones is a testament to this.
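The per-language adapter library described above can be managed directly with PEFT, as in the sketch below. The adapter paths and names are hypothetical; the point is that one frozen base model serves several languages by swapping small adapter weights at request time.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # placeholder

# Load one saved LoRA adapter per language on top of the same frozen base.
model = PeftModel.from_pretrained(base, "adapters/lora-swahili", adapter_name="sw")
model.load_adapter("adapters/lora-marathi", adapter_name="mr")
model.load_adapter("adapters/lora-japanese", adapter_name="ja")

def generate_in(lang: str, prompt_ids):
    """Route a request through the adapter for the requested language."""
    model.set_adapter(lang)  # activates only that language's LoRA weights
    return model.generate(prompt_ids, max_new_tokens=64)
```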
Prompt Tuning & In-Context Bootstrapping
Beyond explicit parameter training, prompt-based methods leverage the model’s own knowledge and context to induce cross-lingual transfer. Two major ideas here are prompt tuning (learned prompts) and in-context learning signals.
Prompt tuning (soft prompts): This involves learning a sequence of virtual tokens (a prompt) that, when prepended to an input, condition the model to perform better on a target language or task. Instead of updating model weights, you optimize these prompt embeddings. In multilingual scenarios, one could learn a prompt that specifically steers the model to operate in a certain language. For example, a soft prompt could be trained on Chinese text so that when applied, it “activates” the model’s Chinese knowledge. This is lighter-weight than full fine-tuning and can be used to quickly adapt one foundation model to multiple languages by storing a small prompt per language. By 2024, prompt tuning is a mature technique, though it’s been used more for tasks than raw language adaptation. One challenge is that it may be less effective if the base model has very limited exposure to the target language to begin with. Nevertheless, it’s part of the toolkit – especially when combined with other methods (e.g. one might use a learned prompt in conjunction with a translated example to cue the model).
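Soft prompt tuning is also available through PEFT; the sketch below attaches a small set of trainable virtual tokens to a frozen causal LM, here initialized from a natural-language instruction. The base model and initialization text are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base_name = "bigscience/bloom-560m"  # placeholder base model
tok = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)

# Only the virtual-token embeddings are trained; the LLM itself stays frozen.
prompt_cfg = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Respond fluently in Chinese.",
    tokenizer_name_or_path=base_name,
)
model = get_peft_model(base, prompt_cfg)
model.print_trainable_parameters()  # only the soft-prompt parameters are trainable

# Train as usual on target-language text or tasks; at inference the learned
# soft prompt is prepended automatically to every input.
```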
In-context multilingual bootstrapping: This refers to using cleverly constructed input prompts (possibly including examples in multiple languages) to “bootstrap” performance in a low-resource language. A simple but powerful instance is cross-lingual few-shot prompting. Instead of providing examples in the same language as the test query, one can provide similar examples in a high-resource language that the model understands well, and just one or two indicators of the target language. Recent research found that this approach can yield better results. For instance, when prompting GPT-4 to translate or answer questions in a low-resource language, giving it examples (exemplars) in another language (like English) actually guided it better than using only target-language examples (Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder). The model picks up the pattern from the English examples and applies it to the low-resource language query. This counterintuitive finding – that cross-lingual exemplars can be more helpful – suggests that the model’s reasoning is stronger in the familiar language, so it generalizes the demonstrated behavior to the new language.
Another bootstrapping trick is chain-of-thought reasoning across languages. Suppose we want the model to perform a complex task in a low-resource language. We might prompt it to first think in English (where it has the strongest reasoning abilities), then output the final answer in the target language. For example: “Translate the user’s question to English, solve it step by step, then give the answer in Swahili.” The LLM essentially uses English as an intermediate and Swahili as final output. This aligns with the finding that models often “think in English” (Do Multilingual LLMs Think In English?). By explicitly structuring the prompt to do so, we bootstrap the solution quality. Such approaches have been observed to reduce errors and hallucinations for complex queries in other languages.
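Both prompting patterns above are easy to express as plain templates. The sketch below builds a cross-lingual few-shot prompt and an English-pivot chain-of-thought prompt as strings; the wording is illustrative and would be sent to whatever chat or completion API the deployment uses.

```python
# Cross-lingual few-shot: English exemplars demonstrate the task,
# while the final query and requested output are in the target language (Swahili).
few_shot_prompt = """Answer the question in one sentence.

Q: What is the capital of France?
A: Paris is the capital of France.

Q: Which planet is closest to the Sun?
A: Mercury is the closest planet to the Sun.

Jibu swali lifuatalo kwa Kiswahili.
Q: Mji mkuu wa Tanzania ni upi?
A:"""

# English-pivot chain-of-thought: reason in English, answer in Swahili.
def pivot_prompt(user_question_sw: str) -> str:
    return (
        "Translate the user's question to English, solve it step by step in "
        "English, then give only the final answer in Swahili.\n\n"
        f"Question (Swahili): {user_question_sw}\n"
        "Reasoning (English):"
    )

print(pivot_prompt("Ni nchi gani ina idadi kubwa zaidi ya watu duniani?"))
```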
We also see self-translation approaches in fine-tuning: an LLM can generate synthetic training data by translating an English dataset into the target language (leveraging its own strengths), and then that data is used to fine-tune it. This is a form of bootstrapping where the model’s knowledge of one language helps create resources for another. Many multilingual instruction-tuned models in 2024 followed this recipe: start with a large pool of English instructions/answers, translate them (using either an external MT system or the model itself in iteration), and fine-tune the model on the mixture. This was done, for example, in Meta’s XGLM and BLOOMZ/mT0 models, enabling strong performance on non-English instructions by bootstrapping from English data.
It’s important to note that while these prompt and in-context methods don’t change model weights, they rely on the base model having some latent cross-lingual ability to begin with. Thus, they often go hand-in-hand with the previously discussed pretraining and adapter methods. In 2025, a practitioner might do something like: use LoRA to give the model a bit of grounding in a new language, then use prompt engineering to elicit the best results in that language for a particular task.
🌍 Transfer between High-Resource and Low-Resource Languages
A core concern in multilingual modeling is how knowledge is transferred from resource-rich languages (like English, Chinese) to resource-poor languages (like Maltese, Lao). This is often called tackling the curse of multilinguality – as more languages (especially low-resource ones) are packed into one model of fixed size, each language tends to get less representational capacity. Several strategies address this imbalance:
Data balancing: Modern pretraining pipelines do not simply feed the raw internet proportion of each language (or else English would dominate most models). Instead, techniques like exponential smoothing of data weights are used. For example, mBERT and XLM-R introduced sampling strategies so that low-resource languages are upsampled and high-resource ones are downsampled, ensuring the model sees a more balanced mix. This prevents the smallest languages from being nearly invisible during training and also helps their vocabulary representation (so that common words from those languages appear enough to get dedicated subword tokens). By 2025, such data scheduling is standard: multilingual corpora like CC100 or mC4 are sampled with temperature-based or smoothed distributions to give every language a fighting chance.
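A minimal sketch of this exponentiated (temperature-style) sampling scheme follows, computing per-language sampling probabilities q_l ∝ p_l^α from raw token counts. The α value and corpus sizes are illustrative assumptions, roughly in the range used by XLM-R-style recipes.

```python
def language_sampling_probs(token_counts: dict, alpha: float = 0.3) -> dict:
    """Exponentiated (temperature-style) sampling: q_l proportional to p_l**alpha.

    alpha=1.0 reproduces the raw data proportions; alpha<1 upsamples
    low-resource languages and downsamples high-resource ones.
    """
    total = sum(token_counts.values())
    weights = {lang: (count / total) ** alpha for lang, count in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Toy corpus sizes (in tokens): English dwarfs Swahili and Lao.
counts = {"en": 1_000_000_000, "sw": 5_000_000, "lo": 1_000_000}
print(language_sampling_probs(counts, alpha=1.0))   # raw proportions
print(language_sampling_probs(counts, alpha=0.3))   # smoothed, much flatter
```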
Shared vs. language-specific components: As discussed with adapters, one can allocate some language-specific parameters to low-resource languages. This might be during pretraining (X-MOD-style modules) or post-hoc (fine-tune an adapter for the new language). The key idea is to avoid a strict competition for the same model capacity. High-resource languages can share the global parameters, while a low-resource language gets a small private corner of the model to capture what’s unique to it. This significantly improves its performance without degrading the others. For example, X-MOD reported that adding new languages via new adapters did not harm the original ones – a form of transfer where low-resource languages gain knowledge and high-resource languages lose nothing.
Knowledge distillation from high to low: A new line of work exemplified by SDRRL (Self-Distillation from Resource-Rich Languages) explicitly uses high-resource language capability to teach the model to perform better in low-resource languages (Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages - ACL Anthology). The SDRRL method (ACL 2024) takes a multilingual LLM and has it generate training signals in a resource-rich language (like solving tasks in English) and then uses those as supervision for the model in the target language. Essentially, the model is made to explain or predict its English-inferred answer in the other language, thereby transferring the internal reasoning. This approach showed enhanced performance in many languages while minimally impacting the original English capability. It outperformed simply fine-tuning on translated data, which the authors note can plateau because it ignores what the model already “knows”. Self-distillation, by contrast, leverages the model’s strengths directly.
Linguistic relatedness and pivoting: Low-resource languages often benefit from related high-resource languages. For instance, training a model on Spanish can indirectly help it perform better on Portuguese. Some multilingual training setups group languages by family or script and share sub-networks or training phases among them. In translation, a common tactic is pivoting: to translate from a very low-resource language A to B, the model might actually translate A→English→B internally. LLMs sometimes learn to do this implicitly (as we saw with internal English representations). System designers can encourage it by including many A→English and English→B examples, even if A→B direct data is scarce.
Continual learning and vocab expansion: When adding support for a new language after initial training, careful continual training can be done (as with Swallow for Japanese). A noteworthy point from Swallow (2024) was that they expanded the model’s vocabulary to include the new language’s characters, rather than forcing them through English-centric subword pieces (Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities). They added Japanese tokens to LLaMA-2’s tokenizer and showed this reduced the subword length of Japanese text by 56%, greatly improving efficiency. Importantly, this vocab expansion did not degrade English performance. In multilingual models generally, researchers have found that giving each language a sufficient share of the vocabulary (proportional to its needs) correlates with better performance in that language. If a low-resource language is forced to be composed of many broken subwords (because the vocab was mostly allocated to other languages), it will perform worse. Thus, to transfer effectively, one might update the tokenizer to properly support the new language, then do a bit of additional training to integrate it. This is a form of transfer learning at the token level.
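A minimal sketch of token-level expansion with Hugging Face Transformers is shown below: new target-language tokens are added to the tokenizer and the embedding matrix is resized before continual pretraining. The token list and base checkpoint are placeholders; real pipelines typically train a new SentencePiece model on target-language text and merge it rather than hand-listing tokens.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForCausalLM.from_pretrained(base_name)

# Hypothetical new tokens for the target language (normally the new pieces
# produced by a tokenizer trained on target-language text).
new_tokens = ["天気", "今日", "明日"]
num_added = tok.add_tokens(new_tokens)
print(f"Added {num_added} tokens")

# Grow the embedding (and output) matrix to cover the new vocabulary entries;
# the new rows are freshly initialized and learned during continual pretraining.
model.resize_token_embeddings(len(tok))
```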
In essence, transferring between high- and low-resource languages is about maximizing knowledge sharing while minimizing interference. High-resource languages supply universal linguistic features and world knowledge. Modern multilingual LLMs harness this by sharing a lot of parameters across languages – ensuring, for example, that the concept of “President” learned from tens of thousands of English sentences can help interpret a rare sentence in Swahili. At the same time, techniques like adapters, specialized vocab, and balanced training prevent the dominant languages from completely steamrolling the unique traits of low-resource ones. The outcome, in state-of-the-art models, is that low-resource languages often see a huge boost simply by being in a multilingual model with related high-resource languages, as opposed to training a separate model from scratch on their limited data (Machine Translation with Large Language Models: Decoder Only vs. Encoder-Decoder). The gap remains – many low-resource languages still lag in performance – but the gap is closing with these transfer learning innovations.
🛠️ Production Implementations and Tools (2024+)
Implementing multilingual transfer learning in practice has been greatly simplified by modern frameworks and library support as of 2024:
PyTorch and Hugging Face Transformers: The transformers library (v4.x in 2024) provides out-of-the-box support for multilingual models like mBERT, XLM-R, mT5, BLOOM, etc. It also makes it easy to resize token embeddings (to add new tokens for a new language) and to load adapter modules. For example, one can call
model.resize_token_embeddings(new_size)
to integrate an expanded vocabulary after adding new SentencePiece tokens for a language. Hugging Face’s PEFT (Parameter-Efficient Fine-Tuning) library allows applying LoRA, prefix tuning, and other adapter techniques to any model with just a few lines of code. This means a developer can take a 7B English model and do LoRA fine-tuning on a Spanish corpus to get a bilingual model without heavy infrastructure. As a Medium summary put it, enhancing underrepresented languages via LoRA or QLoRA has become a straightforward recipe for practitioners (Deep Dive into LoRA and QLoRA Techniques | by Mohamed Elrefaey).
Distributed training for large models: Multilingual LLMs are often huge (billions of parameters), so training/fine-tuning requires multi-GPU setups. PyTorch 2.x introduced features like Fully Sharded Data Parallel (FSDP) and better support for mixed precision, which help in fine-tuning large models on multiple GPUs by sharding model weights. Similarly, DeepSpeed and Megatron-LM (adopted in NVIDIA’s NeMo) are used for training giant multilingual models from scratch, with ZeRO optimization to handle the memory. By 2024, many of these tools have been battle-tested: e.g. BLOOM (176B) was trained with Megatron-DeepSpeed. For a researcher or engineer, this means one can leverage these frameworks to do continual pretraining – as in Swallow’s Japanese adaptation, which presumably used distributed training to process its 100B tokens.
NVIDIA NeMo and other toolkits: NVIDIA’s NeMo toolkit specifically provides end-to-end workflows for localized LLM training. In a May 2024 technical blog, NVIDIA demonstrated how to take an English 1.3B GPT model and train it on Thai data to make a localized bilingual model (Training Localized Multilingual LLMs with NVIDIA NeMo, Part 1 | NVIDIA Technical Blog). The process included training a new tokenizer (Thai SentencePiece), merging it with the original, adjusting the model architecture for the new vocab, and then continual pretraining – all using NeMo’s curated workflow. This kind of ready-made pipeline lowers the barrier for non-experts to apply multilingual transfer learning. NeMo also includes data curation tools (for scraping and cleaning text in various languages) and supports training with distributed GPU backends. Essentially, it wraps up best practices (like those we’ve discussed: vocab expansion, balanced data sampling, etc.) into a reproducible procedure. Other platforms like Google’s Seq2Seq framework or Facebook’s fairseq have also been updated to handle massive multilingual corpora efficiently (e.g. fairseq was used for NLLB-200 training with specialized memory optimizations).
Pretrained multilingual models and checkpoints: Another practical enabler is the availability of many pretrained multilingual LLM checkpoints on model hubs. Instead of training from zero, one can download a model like SEA-LION (covering 11 Southeast Asian languages) or SeaLLM (another regional model) (Multilingual/Bilingual Large Language Models (LLMs): Tailoring AI Applications for Southeast Asia). These are community- or organization-built models that already incorporate multilingual knowledge. Using transfer learning, a developer might start from such a checkpoint and fine-tune it on their domain or add one more language. For instance, SEA-LION includes Lao, which is low-resource, and if one needs to add a related dialect, they could fine-tune SEA-LION’s Lao adapter on some new data. The existence of projects like Sailor (Sea AI Lab’s multilingual models, 0.5B–7B) and other regional LLMs means you don’t always have to start with an English model – you might choose a base that was trained with your language in mind.
APIs and evaluation suites: By 2025 there are also robust evaluation frameworks (like BLEU and chrF for translation, or multilingual benchmarks like XNLI, MLQA, and FLORES-101) integrated in libraries. This makes it easier to quantify how well transfer learning is working between languages in your application. For example, you can fine-tune a model on a task in English and then run the evaluation on XNLI in multiple languages to see the zero-shot transfer results (a minimal sketch follows below).
No-code and low-code solutions: An emerging convenience is GUI or no-code platforms that incorporate multilingual models. Some AutoML tools allow a user to supply data in multiple languages and under the hood they might use multilingual embeddings or translation to augment training. While not as flexible as coding with PyTorch, this indicates how transfer learning techniques are disseminating to broader AI usage.
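To make the XNLI zero-shot evaluation point above concrete, here is a minimal sketch using the datasets library and an NLI classifier fine-tuned on English. The checkpoint name is hypothetical, the label order is assumed to match XNLI, and a real evaluation would batch inputs and cover the full test sets.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "my-org/xlmr-finetuned-on-english-nli"  # hypothetical English-fine-tuned NLI model
tok = AutoTokenizer.from_pretrained(ckpt)
clf = AutoModelForSequenceClassification.from_pretrained(ckpt)

def xnli_accuracy(lang: str, n_examples: int = 200) -> float:
    """Zero-shot accuracy on (part of) the XNLI test split of a given language."""
    ds = load_dataset("xnli", lang, split=f"test[:{n_examples}]")
    correct = 0
    for ex in ds:
        inputs = tok(ex["premise"], ex["hypothesis"], truncation=True, return_tensors="pt")
        with torch.no_grad():
            pred = clf(**inputs).logits.argmax(-1).item()
        correct += int(pred == ex["label"])
    return correct / len(ds)

for lang in ["en", "fr", "sw", "ur"]:
    print(lang, xnli_accuracy(lang))
```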
In summary, the tooling landscape in 2024-2025 strongly supports multilingual LLM development. Powerful open-source frameworks abstract away the complexity of distributed training and provide interfaces for the specific needs of multilingual adaptation (tokenizer handling, adapter injection). This lets researchers focus on what they want to transfer, rather than how to implement it from scratch. A concrete example: using the Hugging Face Transformers + PEFT stack, one could load a 13B BLOOM model, add new tokenizer entries for an unseen language, initialize an adapter, and continue training on new text – all in a notebook environment. Such reproducible workflows have accelerated progress and adoption of truly multilingual LLMs in production settings.
🔤 Multilingual Model Design: Tokenizers & Routing
Designing a model to be truly multilingual requires careful thought at a granular level: how text is tokenized into units, how the vocabulary is allocated, and how the model might route or condition on language.
Tokenizer and Vocabulary Design: Subword tokenization (BPE, WordPiece, SentencePiece) is standard. For multilingual models, we train a shared vocabulary across all languages. A naive approach would concatenate all text and learn the most frequent subwords globally – but this would heavily bias towards languages with more data (often English). To mitigate this, techniques such as temperature sampling during tokenizer training or simply capping per-language data are used. The goal is to ensure that each language, especially scripts with unique characters, gets representation in the vocabulary. Researchers have empirically found that the fraction of vocab tokens allocated to a language correlates with that language’s downstream performance (Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities). Intuitively, if a language’s words are consistently broken into several pieces because the vocab doesn’t have them, the model has a harder time learning patterns in that language. Hence multilingual vocabularies often include certain rare characters or common words from even low-resource languages, sacrificing a bit of efficiency on high-resource ones for the sake of inclusivity.
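One quick diagnostic for this kind of vocabulary bias is tokenizer “fertility”: the average number of subword tokens per word in each language. The sketch below computes it for a few sample sentences with a multilingual tokenizer; the checkpoint and sentences are illustrative.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # illustrative tokenizer

samples = {
    "en": "The weather is nice today and we are going for a walk.",
    "fi": "Sää on tänään mukava ja menemme kävelylle.",
    "sw": "Hali ya hewa ni nzuri leo na tunaenda kutembea.",
}

def fertility(text: str) -> float:
    """Average subword tokens per whitespace-separated word."""
    n_words = len(text.split())
    n_tokens = len(tok.tokenize(text))
    return n_tokens / n_words

for lang, text in samples.items():
    print(f"{lang}: {fertility(text):.2f} tokens/word")
# Higher fertility means the language is split into more pieces,
# which tends to correlate with weaker downstream performance.
```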
In 2024, we also see the continued exploration of tokenization-free or character-level models to avoid vocab bias. The ByT5 model (Google, 2021) showed that processing text at the byte or character level, albeit slower, gave very robust multilingual results even for languages with complex scripts or orthographies. This idea carries into recent models that sometimes prefer char/byte processing for languages where word segmentation is difficult (Thai, Chinese without word boundaries). Still, most large models use subwords for efficiency, so the focus is on training the tokenizer in a balanced way. For instance, the mT5 SentencePiece model was learned from a multilingual corpus with proportional sampling so that it didn’t just become an English-centric vocabulary.
An interesting case of tokenizer design is when expanding to new languages. As discussed, the Swallow project expanded LLaMA’s vocabulary with Japanese tokens (Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities). There are also efforts to post-train tokenizers: e.g. adapters for vocabulary – one 2024 arXiv work analyzed adding new tokens via adapters and found that languages using the Latin script and those using other scripts benefit differently (Adapters for Altering LLM Vocabularies: What Languages Benefit ...). In general, adding new tokens (for new languages or terminology) is feasible: one initializes their embeddings (maybe from similar characters, or randomly) and continues training a bit so the model learns them. The evidence suggests this can be done without catastrophic forgetting of the original vocab, as long as it’s a moderate expansion.
Multilingual routing and language tags: Many multilingual systems explicitly use a language ID token. For example, Google’s translation models often prepend a token like <2fr>
to indicate the target language is French. In large LLMs that perform many tasks, sometimes the prompt will include an instruction like “Answer in French.” But models may or may not reliably follow that. Including a special token in the prompt that was seen during training can hardwire the behavior. NLLB-200 (No Language Left Behind) by Meta, a 200-language translation model, used language codes as part of the input so that one model could translate any direction. These tags help the model route the input to the correct linguistic output space. In a sense, the model learns a conditional distribution P(output | lang=X). Even in decoder-only models, using such tokens or a short language prefix can improve generation in the desired language.
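As a concrete illustration of language-tag routing, the sketch below uses an NLLB-200 checkpoint in Transformers: the source language is set on the tokenizer, and the target language tag is forced as the first generated token so the model routes generation into French. The checkpoint size and sentence are illustrative.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "facebook/nllb-200-distilled-600M"  # illustrative NLLB-200 checkpoint
tok = AutoTokenizer.from_pretrained(ckpt, src_lang="eng_Latn")  # source-language tag
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

inputs = tok("The weather is nice today.", return_tensors="pt")

# Force the decoder to start with the target-language tag, routing output to French.
out = model.generate(
    **inputs,
    forced_bos_token_id=tok.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=40,
)
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```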
Beyond fixed tags, there is research into dynamic routing: for example, mixture-of-experts models where certain expert sublayers are specialized by language or language family. A recent trend is to assign experts to handle specific scripts or linguistic features – e.g. one expert for all Arabic-script languages. The Switch Transformer (2021) and related MoE models can in theory allocate different experts on the fly. In multilingual MoE, a low-resource language could automatically gravitate to an expert that was also used by a high-resource relative, thereby transferring knowledge. Some works (around 2022) like “Sparse Mixture of Experts for Multilingual NLP” showed gains here, but MoEs also introduce complexity (and in 2023 their popularity waned in favor of simpler adapter methods).
Language-specific signals in modeling: Aside from explicit tokens, sometimes training regimes incorporate language signals. For instance, one might train a model to predict the language of a piece of text as an auxiliary task (so it has an internal notion of language identity). Or use constrained decoding for certain scripts. These are implementation details that ensure the model doesn’t confuse languages. A phenomenon known as language interference can happen if languages are very similar – the model might start generating a mix (e.g. some words in Spanish while writing Italian). This is usually mitigated by strong signals and context. Lens (2024) as mentioned introduced a clever way to separate language-specific vs. language-agnostic subspaces in the model (Lens: Rethinking Multilingual Enhancement for Large Language Models) . By explicitly pushing representations of different languages apart in one subspace while aligning them in another, the model can keep languages distinct when outputting (avoiding mangling them together) yet share semantics. This kind of nuanced routing at representation level is at the research frontier.
In production models, a simpler but effective practice is: always specify the language in the prompt or context. For example, if deploying a multilingual assistant, include a system message such as: “You are a multilingual assistant. The user’s language: <language>. You should respond in <language>.” During fine-tuning, developers often intermix languages and ensure the model sees instructions on which language to use. If the fine-tuning data is well balanced, the model learns to handle the code-switching.
Finally, multilingual evaluation routing: An often overlooked aspect is that when evaluating or using the model for a particular language, you might activate certain settings – for instance, enabling byte-level fallback for an unseen script, or using a custom decoding dictionary to avoid rare-token glitches (some languages might not use certain punctuation, etc.). These are minor engineering details, but they contribute to the polish of a multilingual system.
To sum up, multilingual model design involves making sure the infrastructure of text processing doesn’t favor one language too much. It means giving each language a fair representation in the vocabulary, possibly dedicated parameters or signals, and guiding the model with clear markers of which language is in play. The advances of 2024 emphasize modularity with seamless integration: we can plug in new vocabulary, plug in new adapter modules, and use special tokens to control language, all in one unified model. The result is an LLM that behaves as a universal linguistic engine – it knows dozens of languages, and through design choices we can prompt it to activate the right subset of that knowledge at the right time.