Table of Contents
🔄 Federated Approaches for Training LLMs
🔒 Preserving User Privacy in Federated LLM Training
🏥 Federated LLMs in Healthcare and Banking
🛠 Frameworks and Toolkits (2024–2025)
🚀 Optimization and Scalability
🔄 Federated Approaches for Training LLMs
Training large language models (LLMs) in a federated manner requires adapting traditional FL algorithms to the scale of modern models. A common approach is federated fine-tuning: start from a pre-trained LLM and collaboratively fine-tune it on decentralized private data (OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning). This avoids training from scratch (which is prohibitive) and lets each client run a few local epochs of supervised fine-tuning (e.g. instruction tuning) before model averaging. Standard federated averaging (FedAvg) is used to coordinate these updates, but vanilla FedAvg can struggle with billion-parameter models due to client resource constraints (Safely Learning with Private Data: A Federated Learning Framework for Large Language Model). To address this, researchers integrate parameter-efficient tuning methods like LoRA or adapter tuning so that only a small subset of weights (or low-rank adaptation matrices) is updated on each client (FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models). By freezing most of the LLM's layers and only training a fraction of parameters, the communication payload and memory usage per client drop drastically while retaining most of the model's knowledge (Federated Learning with Layer Skipping: Efficient Training of Large Language Models for Healthcare NLP). For example, one 2025 study proposes Layer-Skipping FL – only fine-tuning selected layers of a LLaMA-based model while leaving others frozen – which cut communication costs by ~70% with under 2% loss in accuracy compared to full-model training. This kind of partial model update (a variant of split learning) allows smaller clients to participate: an alternative hybrid approach places the LLM's heavy layers on a central server and only trains the input and output layers on clients, reducing the client-side load. In one such design (FL-GLM), the client keeps the embedding and final layers local while the server handles the rest of the transformer forward/backward pass, so that each client trains far fewer parameters. More radically, recent research even explores backpropagation-free training to lighten client work: FwdLLM, for instance, has clients perform only "perturbed forward passes" instead of full gradient backprop, combined with clever server-side updates – yielding up to 1000× faster convergence and ~15× less memory usage on mobile hardware. In summary, federated LLM training in 2024–2025 has gravitated toward fine-tuning (not from scratch) and uses algorithmic tricks (freeze most layers, low-rank updates, or forward-only training) to make distributed training of LLMs feasible on heterogeneous, resource-constrained clients.
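To make the recipe concrete, here is a minimal sketch of parameter-efficient federated fine-tuning in plain PyTorch: each client trains only small adapter tensors (LoRA-style), and the server averages just those. The function names, the `adapter_state_dict()` convention, and the HF-style `model(**batch).loss` interface are illustrative assumptions, not APIs from the papers cited above.

```python
# Sketch: FedAvg over parameter-efficient (LoRA-style) updates only.
# Assumes each client exposes adapter_state_dict() returning just its
# small trainable tensors (e.g. LoRA A/B matrices); names are illustrative.
from typing import Dict, List, Tuple
import torch

def fedavg_adapters(
    client_updates: List[Tuple[Dict[str, torch.Tensor], int]]
) -> Dict[str, torch.Tensor]:
    """Weighted average of adapter tensors; weights = local sample counts."""
    total = sum(n for _, n in client_updates)
    keys = client_updates[0][0].keys()
    return {
        k: sum(sd[k] * (n / total) for sd, n in client_updates)
        for k in keys
    }

def local_finetune(model, adapter_params, dataloader, epochs=1, lr=1e-4):
    """One client's local step: only the (few) adapter parameters are trained."""
    opt = torch.optim.AdamW(adapter_params, lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in dataloader:
            loss = model(**batch).loss   # causal-LM loss, HF-style interface
            loss.backward()
            opt.step()
            opt.zero_grad()
```

Because only the adapter tensors travel over the network, each round exchanges megabytes rather than the gigabytes a full-model update would require.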
🔒 Preserving User Privacy in Federated LLM Training
Privacy preservation is a core motivation for federated learning, especially with LLMs training on sensitive text data. A basic guarantee is that raw data never leaves the client device – only model weight updates are exchanged. However, even gradient updates can inadvertently leak information (e.g. through reconstruction attacks). To counter this, federated LLM frameworks incorporate secure aggregation and differential privacy (DP). Secure aggregation ensures the server only sees aggregated model updates, not individual client updates, often using cryptographic protocols so that individual contributions remain encrypted. This means even if the server or an adversary intercepts messages, they cannot isolate any single user's update. On top of that, DP can be applied by adding noise to each client's gradients before upload, providing formal privacy guarantees at some cost to model accuracy. In a 2025 healthcare FL study, the authors report that their federated LLM maintained robust performance when combining their layer-skipping approach with DP-SGD, underlining that privacy measures can coexist with effective training. Another strategy is partial model training for privacy: by keeping the most data-sensitive parts of the model local, one can limit what information the server even has access to. For example, FL-GLM places the input embedding layer on each client so that the server never receives raw embedding gradients (which could potentially be inverted to reveal the underlying text). Additionally, FL-GLM uses encrypted communication (client-specific key encryption for model transfers) so that even peer clients or eavesdroppers cannot reverse-engineer updates in transit. Some production systems also leverage trusted execution environments (enclaves) and homomorphic encryption for added security, though these come with performance trade-offs. Lastly, protecting the model itself can be important when a pretrained LLM is proprietary – techniques like federated masking or secure model fusion are being explored to protect the intellectual property of models while still allowing collaborative training. In summary, modern FL for LLMs layers multiple privacy safeguards: keeping data local, securing update transmissions, perturbing or encrypting gradients, and even redesigning the training workflow so that sensitive features never leave the client. These measures are crucial for applications in which both the data and the model must remain confidential.
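As a rough illustration of the differential-privacy step, the sketch below clips a client's update to a fixed L2 norm and adds Gaussian noise before upload. The `clip_norm` and `noise_multiplier` values are placeholders; real deployments calibrate them to a target (ε, δ) budget with a DP accountant, which is omitted here.

```python
# Sketch: clip a client's update and add Gaussian noise before upload.
# clip_norm and noise_multiplier are illustrative; real systems derive them
# from a privacy budget using a DP accountant (not shown).
from typing import Dict
import torch

def privatize_update(update: Dict[str, torch.Tensor],
                     clip_norm: float = 1.0,
                     noise_multiplier: float = 0.8) -> Dict[str, torch.Tensor]:
    # Global L2 norm across all tensors in the update.
    total_norm = torch.sqrt(sum(t.pow(2).sum() for t in update.values()))
    scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
    clipped = {k: t * scale for k, t in update.items()}
    # Noise std is calibrated to the clipping bound (the update's sensitivity).
    std = clip_norm * noise_multiplier
    return {k: t + torch.randn_like(t) * std for k, t in clipped.items()}
```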
🏥 Federated LLMs in Healthcare and Banking
Industries with highly sensitive data have emerged as prime beneficiaries of federated LLM training. In healthcare, patient records and clinical notes contain private health information that hospitals cannot share with a central repository. Federated learning allows institutions (hospitals, clinics, research labs) to jointly train powerful medical NLP models without violating data residency or privacy rules (Federated Learning with Layer Skipping: Efficient Training of Large Language Models for Healthcare NLP). A recent study applied FL to fine-tune a medical language model on distributed hospital data (clinical narratives from MIMIC-III and other datasets) and achieved performance on par with a centrally trained model, while each hospital's data stayed on-site. Their "layer-skipping" federated approach handled the non-IID nature of clinical text across hospitals and even retained accuracy under differential privacy, demonstrating feasibility for real-world healthcare deployments. This is particularly relevant for tasks like clinical named entity recognition or report classification, where no single hospital has enough data to train an LLM alone – but collaboratively they can reach state-of-the-art results. Similarly, in finance and banking, federated learning is addressing data silo problems. Banks and financial institutions sit on troves of customer and transaction data that are protected by regulation (due to privacy and competition concerns). Federated training enables, for example, multiple banks to jointly fine-tune a financial LLM (say, for risk analysis or financial assistant applications) without pooling their raw data. Notably, experiments with LLaMA-2 (7B) on a private financial dataset showed that a federated fine-tuned model outperformed even GPT-4 on domain-specific benchmarks – a result no single institution training alone could match (OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning). In other words, FL gave the data owners a way to leverage each other's data strength to build a superior model, motivating collaboration in domains where data cannot be shared directly. We're also seeing federated LLM pilots in other sensitive domains: law (where firms securely co-train legal language models), telecommunications (learning from distributed customer interaction data), and government agencies (collaborating on intelligence models without sharing raw text). In all cases, the industry trend is clear – federated learning is becoming a practical solution for training large language models on private, siloed data that would otherwise remain untapped due to confidentiality. By keeping data localized and models shared, sectors like healthcare and banking can benefit from advanced LLMs while complying with strict privacy regulations.
🛠 Frameworks and Toolkits (2024–2025)
The FL community has developed several frameworks that simplify federated training of LLMs, each adding capabilities to address the challenges described above:
Flower 1.x – "The friendly federated learning framework." Flower is a Python framework that is very flexible and framework-agnostic, making it easy to plug in PyTorch, TensorFlow, or JAX (Top 7 Open-Source Frameworks for Federated Learning - www.apheris.com). In 2024, Flower introduced LLM FlowerTune examples to demonstrate federated fine-tuning of LLMs on real data. The first release showed how to fine-tune a LLaMA2-7B model on an instruction dataset (Alpaca-GPT4) across simulated clients (LLM FlowerTune: Federated LLM Fine-tuning with Flower). Notably, it leverages Hugging Face's PEFT library and 8-bit model compression so that the entire federated training can run on a single GPU with modest memory. Flower's design emphasizes pluggable strategy algorithms – you can switch between FedAvg, FedProx, FedOpt (FedAdam/Yogi/etc.) easily – and it supports both research simulations and production deployments. In 2025, Flower collaborated with NVIDIA to integrate Flower's high-level API with NVIDIA FLARE's backend, meaning Flower-driven FL apps can run on FLARE's robust runtime for scalable, production-grade deployment (Supercharging the Federated Learning Ecosystem by Integrating Flower and NVIDIA FLARE | NVIDIA Technical Blog). This gives users the best of both: Flower's ease of use with FLARE's enterprise-grade communication and multi-job scheduling.
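For a flavor of Flower's pluggable-strategy design, the minimal server-side sketch below starts a federation with FedAvg; swapping in FedProx or FedAdam mostly means changing the strategy class (plus its extra arguments). The address, client counts, and round count are illustrative, and the client-side NumPyClient implementation is omitted.

```python
# Sketch (Flower 1.x): the aggregation strategy is a pluggable object.
# FedProx, FedAdam, FedYogi, etc. live alongside FedAvg in fl.server.strategy.
import flwr as fl

strategy = fl.server.strategy.FedAvg(
    fraction_fit=0.5,           # sample 50% of connected clients each round
    min_available_clients=4,    # wait until at least 4 clients are connected
)

fl.server.start_server(
    server_address="0.0.0.0:8080",
    config=fl.server.ServerConfig(num_rounds=10),
    strategy=strategy,
)
```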
NVIDIA FLARE 2.x – NVIDIA's FLARE (Federated Learning Application Runtime Environment) is built for production-grade federated learning, initially geared towards medical AI. In its recent releases (2.4.0 through 2.6.0), NVFLARE introduced features squarely aimed at large-model training. One major addition is a streaming communication API that can transmit huge model weights in chunks, bypassing gRPC message size limits (Efficient Federated Learning in the Era of LLMs with Message Quantization and Streaming | NVIDIA Technical Blog). FLARE can now send multi-gigabyte model updates reliably by streaming them piecewise rather than as one giant message. Version 2.6.0 added built-in message quantization: before sending, model weights are automatically quantized (using 8-bit or 4-bit precision with libraries like bitsandbytes) and then de-quantized on the receiving end. This cuts down payload sizes dramatically – for example, 8-bit compression shrinks the update to ~25% of its full-precision size, and 4-bit to ~12.5% – with negligible impact on model convergence. Together, these features target the bandwidth and memory bottlenecks of federated LLM training. NVFLARE also supports concurrent training jobs and dynamic client availability, which are important for scaling to many participants. Its recent integration with Flower (mentioned above) shows the ecosystem moving toward interoperability, using FLARE's robust orchestration for real-world deployments of federated LLM training.
TensorFlow Federated (TFF) – Google's TFF is an open-source framework for simulating FL algorithms, primarily used in research and experimentation. TFF excels at defining custom federated computations and has been used to prototype new FL optimization methods. However, it operates in a simulation mode (clients as Python processes or executors) and isn't aimed at large-scale model training on real distributed nodes. In practice, TFF is suitable for experimenting with federated optimization on smaller models and datasets; for large models or production, teams often turn to the more systems-oriented frameworks above. (Notably, Google's own federated efforts for Gboard etc. use internal infrastructure beyond TFF.) Still, TFF's contributions to algorithmic research (e.g. tuning FedOpt on language tasks) have influenced how we fine-tune LLMs under FL settings.
PySyft – PySyft (by OpenMined) is a specialized library focusing on secure and private machine learning. It provides APIs for federated learning, secure multi-party computation (SMPC), and differential privacy, making it a popular choice for research prototypes that need privacy guarantees beyond the standard FL setup (Top 7 Open-Source Frameworks for Federated Learning - www.apheris.com). PySyft allows data scientists to perform computations on remote data they "cannot see", using techniques like additive secret sharing to distribute trust. While powerful, PySyft is more geared toward experimental setups (it has been used to demonstrate concepts like training on encrypted data or combining FL with homomorphic encryption). In the context of LLMs, PySyft can be used to orchestrate federated fine-tuning with added privacy constraints, but it may not handle massive models out-of-the-box as smoothly as Flower or FLARE. Efforts are ongoing to integrate PySyft with PyTorch Lightning and other tools for easier large-model support, but as of 2024 it remains a research-centric toolkit.
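To illustrate the additive secret sharing idea that underpins SMPC-style aggregation, here is a toy example in plain Python (deliberately not PySyft's API): each value is split into random shares that individually reveal nothing, yet the shares can be combined to recover only the aggregate.

```python
# Sketch: additive secret sharing over a prime field. Each client splits its
# (integer-encoded) value into random shares; any single share is uniformly
# random, but summing the parties' partial sums yields only the aggregate.
import secrets

PRIME = 2**61 - 1  # field modulus (illustrative choice)

def share(value: int, n_parties: int) -> list[int]:
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % PRIME

# Two clients secret-share their values among three compute parties.
a_shares, b_shares = share(42, 3), share(58, 3)
# Each party adds the shares it holds; combining the partial results reveals
# only the aggregate (100), never the individual inputs.
aggregate = reconstruct([x + y for x, y in zip(a_shares, b_shares)])
assert aggregate == 100
```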
FATE (Federated AI Technology Enabler) – FATE is an industrial-grade FL platform widely used in China's finance sector. In late 2023 it introduced FATE-LLM, extending its capabilities to support large language models (FATE-LLM: A Industrial Grade Federated Learning Framework for Large Language Models). FATE-LLM enables both homogeneous and heterogeneous FL for LLMs (the latter meaning parties could even train different models and aggregate knowledge). It emphasizes efficient training via PEFT – supporting methods like LoRA and P-Tuning v2 natively so that clients fine-tune only small portions of the model. Uniquely, FATE-LLM also includes mechanisms to protect model IP: for example, using secure aggregation and masking so that a pre-trained model's proprietary weights aren't fully exposed to others during federated training. Of course, standard privacy measures (encryption, differential privacy) are built in as well for data protection. FATE's focus is enterprise use cases: it provides a complete stack (orchestration, deployment, even a UI) to run cross-company federated training. Banks have used FATE to collaboratively train models without sharing customer data, and with FATE-LLM this extends to large Transformer-based models. The framework is open-source (Linux Foundation AI & Data), and its 2024 releases are aligning with the needs of training and fine-tuning LLMs across organizational boundaries.
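As a generic illustration of how little of the model a PEFT method actually trains (and therefore how little each party needs to exchange), the sketch below uses Hugging Face's PEFT library with a small stand-in model; it is not FATE-LLM's own API, and the model name and target modules are illustrative.

```python
# Sketch (Hugging Face PEFT, not FATE-LLM's API): wrap a causal LM with a LoRA
# config so that only the low-rank adapter matrices require gradients.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in for a larger LLM
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["c_attn"],                        # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
# prints something like: trainable params: ~295K || all params: ~125M || trainable%: ~0.24
# Only that small fraction of weights would be shipped to the aggregator each round.
```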
🚀 Optimization and Scalability
When training LLMs across distributed clients, optimization techniques and system design go hand-in-hand to achieve feasible run times and convergence. On the algorithm side, a lot of attention has been given to federated optimization algorithms beyond plain FedAvg. The FedOpt family, which introduces server-side momentum or adaptive learning rates, has proven beneficial for large models. For instance, experiments in federated instruction tuning found that an adaptive optimizer like FedAdagrad outperformed vanilla FedAvg on multiple evaluation benchmarks (OpenFedLLM: Training Large Language Models on Decentralized Private Data via Federated Learning). By using FedAdagrad/FedAdam (which adjust learning rates per weight) at the aggregator, training can converge faster on heterogeneous data. Clients with heavier data or different distributions don't dominate as easily, and the global model can reach higher quality. Another consideration is client selection and parallelism. In cross-device FL (many clients), not all clients can participate in every round; modern schedulers sample clients in a way that balances coverage and freshness of data. In cross-silo FL (fewer, more reliable clients like companies), one can often use all clients each round, but then parallelizing the server workload becomes key. Techniques like client batching (processing subsets of client updates concurrently) and hierarchical aggregation (where intermediate aggregators combine updates before a top-level server) are used to scale to more clients or heavier models (Safely Learning with Private Data: A Federated Learning Framework for Large Language Model). FL-GLM's implementation, for example, allows the server to handle multiple client training processes in parallel (instead of one-by-one), significantly improving training efficiency.
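The server-side adaptive update is simple to sketch: treat the weighted average of client deltas as a pseudo-gradient and run it through Adam-style moment estimates, as in the FedAdam formulation. The hyperparameters below are illustrative defaults, not the settings used in the cited experiments.

```python
# Sketch: FedAdam-style server optimizer. The averaged client delta is treated
# as a pseudo-gradient; the server keeps Adam-style first/second moments.
import numpy as np

class FedAdamServer:
    def __init__(self, init_weights, lr=0.01, beta1=0.9, beta2=0.99, tau=1e-3):
        self.w = {k: v.copy() for k, v in init_weights.items()}
        self.m = {k: np.zeros_like(v) for k, v in init_weights.items()}
        self.v = {k: np.zeros_like(v) for k, v in init_weights.items()}
        self.lr, self.b1, self.b2, self.tau = lr, beta1, beta2, tau

    def step(self, avg_delta):
        """avg_delta: weighted average of (client_weights - global_weights)."""
        for k, d in avg_delta.items():
            self.m[k] = self.b1 * self.m[k] + (1 - self.b1) * d
            self.v[k] = self.b2 * self.v[k] + (1 - self.b2) * d * d
            self.w[k] += self.lr * self.m[k] / (np.sqrt(self.v[k]) + self.tau)
        return self.w
```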
On the systems side, communication efficiency is paramount. Large models can produce updates sized in the gigabytes, so exchanging them round after round naively would grind any network to a halt (Efficient Federated Learning in the Era of LLMs with Message Quantization and Streaming | NVIDIA Technical Blog). One solution is to reduce how often full model transfers happen – e.g. do more local epochs so fewer rounds are needed – but too many local updates can hurt convergence on non-IID data. Instead, recent work focuses on compressing and streaming the updates. As mentioned, quantizing model weights to 8-bit or 4-bit before sending cuts bandwidth usage by 4×–8× with negligible impact on final accuracy. Some frameworks also support sending deltas or sparse updates: if only a small portion of weights (like a LoRA matrix or a few layers) is being trained, the system will transmit just those instead of the entire model. This way, a 7B-parameter model's update might be only tens of MBs rather than tens of GBs. When full-model updates are unavoidable, streaming protocols help by breaking the transmission into chunks and pipelining them. NVIDIA reports that by streaming model weights in 1 GB chunks (with their ObjectStreamer API), the peak memory during transfer is vastly reduced – e.g. sending a 140 GB model might only require ~1 GB of RAM at a time instead of holding the whole payload in memory. This enables federated training of models as large as 70B parameters (which would be 130+ GB in fp16) across network-constrained nodes, as long as the nodes can handle the model for training. Asynchronous aggregation is another scaling tactic: rather than locking the global update step to wait for all clients, the server can accept stragglers' updates later or continuously incorporate incoming updates. This is useful if client devices have very different availability or if network latency is highly variable. Asynchronous FL (and related strategies like FedBuff or FedAsync) can keep the training process going without idle time, though careful tuning is needed to maintain consistency of the model updates.
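A minimal sketch of the quantization idea (generic per-tensor symmetric int8, not NVFLARE's quantization filter or the bitsandbytes API) shows where the roughly 4× saving over fp32 comes from:

```python
# Sketch: per-tensor symmetric int8 quantization of a model update before
# transmission, and dequantization on the receiving side.
import numpy as np

def quantize_int8(tensor: np.ndarray):
    scale = np.abs(tensor).max() / 127.0 or 1.0       # avoid division by zero
    q = np.clip(np.round(tensor / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

update = np.random.randn(4096, 4096).astype(np.float32)   # a ~64 MiB fp32 tensor
q, scale = quantize_int8(update)
print(update.nbytes / q.nbytes)                            # -> 4.0 (4x smaller on the wire)
print(np.max(np.abs(update - dequantize_int8(q, scale))))  # small reconstruction error
```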
In practice, achieving scalability for federated LLM training means using a combination of these techniques. A 2025 NVIDIA technical report demonstrated that integrating message quantization and streaming in FL can drastically improve throughput and memory usage without harming model convergence. Similarly, open-source federated frameworks now include many of these optimizations out of the box (for example, NVFLARE's filtering plugins for quantization, or Flower's support for on-the-fly compression and adaptive client sampling). The end goal is to make federated training of large models as communication-efficient and load-balanced as possible. By reducing overhead per round, we can increase the number of clients (or the size of the model) that can participate given a fixed bandwidth and compute budget. And by improving the optimization algorithm, we reduce the number of rounds needed for the model to converge. Both angles are crucial: large language models already push the limits of hardware, so federated setups must smartly minimize any additional inefficiencies. The progress in 2024 and 2025 shows that with quantization, selective updates, better optimizers, and robust engineering, federated learning can scale to meet the demands of modern LLMs – bringing privacy-preserving, distributed AI a step closer to practical reality in enterprise settings.