Table of Contents
Deploying LLMs on Resource-Constrained Devices
Model Distillation for Smaller LLMs
Low-Bit Quantization (4-bit, 8-bit, Mixed Precision)
Edge-Cloud Offloading and Hybrid Inference
Industry Applications and Case Studies
Optimization Techniques: Pruning, Sparse Compute, LoRA, Decomposition
Hardware Acceleration on Edge Devices
Framework and Software Support
This literature review covers recent advances (2024–2025) in making large language models (LLMs) efficient for mobile and edge deployment. We focus on model distillation, quantization, edge-cloud offloading, real-world applications, optimization techniques (pruning, sparsity, low-rank adaptation), hardware acceleration, and framework support. The emphasis is on new developments and practical methods rather than re-explaining basic concepts.
Model Distillation for Smaller LLMs
Knowledge Distillation (KD) remains a key approach to obtaining smaller LLMs that retain performance. Recent work has pushed beyond single-teacher distillation to leverage multiple teachers and staged training. Multi-teacher distillation improves knowledge diversity: for example, TinyLLM (WSDM 2025) distills a student from multiple large teachers, encouraging the student not only to match answers but also to learn the reasoning (chain-of-thought) behind them (Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation). By integrating rationales from diverse teacher LLMs, the TinyLLM student achieved superior reasoning performance, even outperforming the larger teachers on some reasoning tasks despite its much smaller size.
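To make the multi-teacher idea concrete, here is a minimal PyTorch-style sketch of a distillation objective that averages soft targets from several teachers and mixes in the usual hard-label loss. It is not TinyLLM's exact objective (which also distills rationales); the temperature, weighting, and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """Simplified multi-teacher distillation loss (illustrative only).

    student_logits: (batch, vocab) logits from the small model
    teacher_logits_list: list of (batch, vocab) logits, one per teacher
    labels: (batch,) ground-truth token ids
    """
    # Soft targets: average the teachers' temperature-scaled distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # KL divergence between the student and the averaged teacher distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * (temperature ** 2)

    # Standard cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1 - alpha) * ce
```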
In low-data scenarios, distillation can surprisingly improve generalization. Zhao et al. (ACL 2024) found that a student model distilled from a few-shot prompted LLM can generalize better than its teacher on unseen examples (Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation - ACL Anthology). Their method, Multistage Collaborative KD (MCKD), uses an LLM to generate high-quality pseudo-labels and then iteratively trains students in stages, each generating improved labels for the next . This progressive distillation strategy yielded significant gains in tasks like semantic parsing under limited supervision.
Another trend is distilling a “family” of models from one expensive training run. Kundu et al. (NAACL 2024) propose training a supernet LLM that can be sliced into multiple smaller sub-models via weight sharing (Efficiently Distilling LLMs for Edge Applications). Their approach, called Multistage Low-Rank Fine-tuning of Super-transformers (MLFS), uses low-rank adaptation during supernet training to efficiently produce a palette of fine-tuned LLMs of different sizes. This addresses practical needs where “a single fine-tuned model is not optimal across the spectrum of devices” and one may want a range of model sizes for different hardware. Through distillation-based supernet training, they obtain high-quality encoder models down to a fraction of the original size, suitable for edge deployment. (They note that decoder-only models are harder to compress to the same degree.)
Overall, modern LLM distillation techniques move beyond one-size-fits-all KD. They employ task-specific distillation (focusing on a narrow task or domain for efficiency (LLM distillation demystified: a complete guide | Snorkel AI)), multi-teacher and multi-stage schemes for richer supervision, and even progressive layer-drop or supernet training to yield multiple student models. These techniques allow developers to isolate the needed capabilities of a large model and transfer them to a smaller, faster model tailored for an edge use case . The result is often a student with comparable accuracy on target tasks but with dramatically lower inference cost and latency.
Low-Bit Quantization (4-bit, 8-bit, Mixed Precision)
Quantization is critical for squeezing LLMs onto limited hardware. While 8-bit weight quantization is already common, 2024 research pushed toward 4-bit and mixed-precision schemes that minimize accuracy loss. A challenge with ultra-low bit-widths is preserving model quality despite quantization error and outlier activation values. One solution is to use higher precision only where needed. For example, ResQ (arXiv, Dec 2024) proposes a mixed-precision post-training quantization method that keeps the highest-variance components in higher precision and quantizes the rest to 4-bit (ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals). ResQ uses PCA to find a low-rank subspace of activations with high variance and retains it in 8-bit, while quantizing the remaining dimensions to 4-bit, combined with a random rotation to suppress outliers. This strategy achieved near full-precision accuracy on LLaMA models: ResQ outperforms prior uniform and mixed schemes (e.g. SpinQuant) with up to 33% lower perplexity on WikiText and up to 3× faster inference vs FP16.
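The intuition behind ResQ can be illustrated with a toy sketch: find a high-variance low-rank subspace of the activations, keep the projection onto it at higher precision, and quantize the residual more aggressively. This is a simplified fake-quantization illustration, not the paper's algorithm; the rank and bit-widths below are assumptions.

```python
import torch

def fake_quant(x, bits):
    """Symmetric uniform fake-quantization (quantize then dequantize), for illustration."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale

def mixed_precision_activations(x, rank=64, hi_bits=8, lo_bits=4):
    """Toy ResQ-style scheme: keep a high-variance low-rank subspace at hi_bits,
    quantize the residual at lo_bits. x has shape (tokens, hidden)."""
    x_centered = x - x.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(x_centered, full_matrices=False)
    basis = vh[:rank]                      # top-`rank` principal directions

    proj = x @ basis.T                     # high-variance coordinates
    residual = x - proj @ basis            # everything outside the subspace

    return fake_quant(proj, hi_bits) @ basis + fake_quant(residual, lo_bits)
```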
Another advance is fine-grained quantization with hardware-awareness. Atom (MLSys 2024) introduced a 4-bit weight and activation quantization approach optimized for modern GPUs (Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving). It uses a novel grouping and reordering of weight outliers and per-channel scaling for activations to minimize error, while leveraging Tensor Core support for INT4 arithmetic. The result was negligible accuracy loss (perplexity increase <0.4) on LLaMA, even at W4A4 (4-bit weights and activations) . In a serving context, Atom demonstrated up to 7.7× throughput improvement over FP16 and ~2.5× over conventional INT8 on GPT-style models . This shows that carefully designed 4-bit schemes can unlock significant speedups by fully utilizing low-bit hardware instructions.
Because activation outliers can ruin low-bit performance, many 2024 techniques explicitly address them. Earlier methods like SmoothQuant (2022) and OmniQuant (2023) were expanded upon: OmniQuant, for instance, improved 4-bit perplexity but still incurred noticeable loss. Newer methods combine ideas: QLoRA (2023) fine-tunes models on top of 4-bit quantized weights, and in 2024 Meta applied quantization-aware LoRA fine-tuning to produce quantized Llama 3.2 1B/3B models for mobile devices (Unleashing The Power Of AI On Mobile - AI blog - Arm Community blogs - Arm Community). These models used a mix of QLoRA and an improved PTQ method (SpinQuant) to maintain quality. The result: instruction-tuned 1B and 3B Llama 3.2 models that retain the original accuracy and safety while achieving 2–4× speedup and roughly 56% size reduction compared to the BF16 models.
Mixed-precision activation quantization is also becoming standard. Some recent frameworks keep weights at 4-bit and activations at 8-bit (to avoid non-linear outlier issues). For example, Arm and Meta integrated a 4-bit weight + 8-bit activation per-block quantization scheme into PyTorch ExecuTorch, paired with Arm’s i8mm instructions, yielding a further ~20% speed boost on Snapdragon CPUs . This kind of hybrid quantization balances accuracy and performance. In fact, ONNX Runtime 1.17 (Feb 2024) added built-in support for INT4 inference, alongside FP16 and FP32, and reported significant speedups on small LLMs versus PyTorch and even optimized C++ libraries (ONNX Runtime | Accelerating Phi-2, CodeLlama, Gemma and other Gen AI models with ONNX Runtime) . With proper kernel fusion and graph optimization, ONNX Runtime with int4 achieved up to 20× faster inference than PyTorch on certain models (like Phi-2 2.7B) at minimal accuracy cost .
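As a concrete illustration of the weight side of such schemes, the sketch below implements plain group-wise (per-block) symmetric 4-bit weight quantization in PyTorch. It shows the general pattern only; block size, rounding, and packing details differ across real kernels such as those used by ExecuTorch or ONNX Runtime.

```python
import torch

def quantize_weights_groupwise(w, bits=4, group_size=32):
    """Group-wise symmetric weight quantization: each block of `group_size`
    values along the input dimension gets its own scale. Illustrative only."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    qmax = 2 ** (bits - 1) - 1

    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = w_groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(w_groups / scales).clamp(-qmax, qmax).to(torch.int8)
    return q, scales  # 4-bit values stored in int8 containers here (no packing)

def dequantize(q, scales):
    return (q.float() * scales).reshape(q.shape[0], -1)

# Usage: quantize a linear layer's weight and check the reconstruction error.
w = torch.randn(4096, 4096)
q, s = quantize_weights_groupwise(w)
print((w - dequantize(q, s)).abs().mean())
```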
In summary, latest quantization techniques allow compressing LLMs to 8-bit or even 4-bit such that they can fit in memory-constrained devices and execute faster, with only minor accuracy drops. Key innovations include per-group and per-channel quantization to handle outliers, mixed precision (using higher bits for critical parts), and even quantization-aware training (e.g. fine-tuning or LoRA on quantized weights) to recover any lost fidelity. These methods make it feasible to run models like Llama-2 7B or Code-Llama on phones at interactive speeds. For instance, Apple’s Core ML team showed an 8B Llama 3.1 model running at ~33 tokens/sec on an M1 Max (Mac) by combining a 4-bit weight compression, fused attention kernels, and a stateful KV cache (On Device Llama 3.1 with Core ML - Apple Machine Learning Research) – techniques that are applicable to mobile devices as well. Quantization thus stands as a cornerstone for on-device LLM deployment.
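A rough back-of-envelope calculation shows why low-bit weights translate directly into decode speed on bandwidth-limited devices (assumptions: weights are streamed once per generated token, memory bandwidth is the bottleneck, and the 400 GB/s figure is illustrative, not a measured spec):

```python
# Back-of-envelope: memory footprint and a bandwidth-bound decode-speed ceiling.
# Ignores KV cache, activations, and compute; all numbers are assumptions.
params = 8e9                     # 8B-parameter model
gb_fp16 = params * 2 / 1e9       # ~16 GB of weights
gb_int4 = params * 0.5 / 1e9     # ~4 GB of weights

bandwidth_gbps = 400             # assumed device memory bandwidth (GB/s)
print(f"fp16: {gb_fp16:.0f} GB -> ~{bandwidth_gbps / gb_fp16:.0f} tokens/s upper bound")
print(f"int4: {gb_int4:.0f} GB -> ~{bandwidth_gbps / gb_int4:.0f} tokens/s upper bound")
```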
Edge-Cloud Offloading and Hybrid Inference
Even with compression, some LLMs remain too heavy or too slow for standalone edge execution. Hybrid inference strategies have emerged, where computation is split between the device and the cloud to optimize latency and resource usage. A straightforward form is model partitioning: running the early layers of an LLM on the device and offloading the rest to a cloud server. By offloading after processing some tokens or layers locally, one can reduce data transmitted and leverage the device’s compute for part of the work. However, deciding how to split the model and when to offload is non-trivial – it must account for network latency, throughput, and the device’s capabilities.
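A minimal sketch of that decision, assuming a naive cost model in which per-layer compute and activation sizes are known ahead of time and the network is characterized by a single uplink rate and round-trip time (all inputs are assumptions, not measurements):

```python
def best_split(layer_flops, activation_bytes, input_bytes,
               device_flops, cloud_flops, uplink_bytes_per_s, rtt_s):
    """Pick the split index k (layers [0, k) on device, [k, N) in the cloud)
    that minimizes a naive end-to-end latency estimate."""
    n = len(layer_flops)
    best_k, best_t = n, float("inf")
    for k in range(n + 1):
        t_device = sum(layer_flops[:k]) / device_flops
        t_cloud = sum(layer_flops[k:]) / cloud_flops
        if k == n:                       # fully local: nothing to upload
            t_net = 0.0
        else:                            # upload the prompt (k=0) or the layer-k input activations
            payload = input_bytes if k == 0 else activation_bytes[k - 1]
            t_net = rtt_s + payload / uplink_bytes_per_s
        t = t_device + t_net + t_cloud
        if t < best_t:
            best_k, best_t = k, t
    return best_k, best_t

# Example with made-up numbers: 24 layers, 2 MB activations per layer, 200 KB prompt.
print(best_split([1e9] * 24, [2e6] * 24, 2e5,
                 device_flops=2e11, cloud_flops=5e12,
                 uplink_bytes_per_s=1e6, rtt_s=0.05))
```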
One advanced formulation is to jointly optimize model partitioning and quantization per layer. In 2024, researchers proposed DILEMMA (Distributed LLM Placement and Layer-wise Quantization) for multi-edge cloud systems (DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems). DILEMMA treats the problem as an optimization: each transformer layer of an LLM can be assigned to either an edge server or the cloud, and quantized to a certain bitwidth, with the goal of minimizing total inference delay under accuracy constraints . Their approach considers a two-tier network (multiple edge servers connected via device-to-device links, plus a cloud) and uses integer linear programming to decide the placement of each layer and its quantization level . The system will, for example, keep as much of the model on the edge as possible (for speed) but automatically offload the most complex or memory-intensive parts to the cloud if the edge resources are insufficient . By also determining layer precision, it compresses transmissions between nodes. Such layer-wise offloading schemes showed they can preserve model performance while significantly reducing end-to-end latency in a distributed edge setting .
Another approach is adaptive offloading based on query complexity. Yuan et al. (2024) developed a Local-Cloud Inference Offloading (LCIO) system that uses a lightweight local LLM for simple tasks and a powerful cloud LLM for difficult tasks or multi-modal inputs. They employ a resource-constrained reinforcement learning agent to dynamically decide “where to make the inference (local vs. cloud) and which multi-modal data sources to use” for each user query. The RL policy considers the context (e.g. whether the query involves vision input or requires a long dialogue history) and the system state (device load, network cost) to maximize a reward that combines response quality, latency, and cost. For example, simple text-only queries might be answered entirely on-device by a GPT-2-class model for real-time response, whereas a complex query with images is offloaded to a larger cloud model. LCIO showed it can substantially cut latency and cloud costs while maintaining user satisfaction, by learning to route each request to the most appropriate model tier. This kind of adaptive offloading is promising for personal assistants that need to both work offline (for quick/easy tasks) and tap into cloud AI (for heavy tasks).
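The routing decision itself reduces to comparing reward estimates across tiers. The sketch below is a hand-written heuristic stand-in for LCIO's learned RL policy; the quality, latency, and cost numbers and the weighting coefficients are all assumed for illustration.

```python
from dataclasses import dataclass

@dataclass
class Query:
    has_image: bool
    prompt_tokens: int
    history_tokens: int

def route(q: Query, net_cost_per_call=0.002, lam_latency=0.5, lam_cost=100.0):
    """Heuristic stand-in for a learned local-vs-cloud routing policy.
    Returns 'local' or 'cloud' by comparing crude reward estimates."""
    # Assumed quality/latency/cost estimates for each tier (illustrative numbers).
    local = {"quality": 0.6 if not q.has_image else 0.0, "latency": 0.3, "cost": 0.0}
    cloud = {"quality": 0.9, "latency": 1.2 + q.prompt_tokens / 5000, "cost": net_cost_per_call}

    def reward(m):
        return m["quality"] - lam_latency * m["latency"] - lam_cost * m["cost"]

    return "local" if reward(local) >= reward(cloud) else "cloud"

print(route(Query(has_image=False, prompt_tokens=50, history_tokens=0)))     # -> 'local'
print(route(Query(has_image=True, prompt_tokens=400, history_tokens=2000)))  # -> 'cloud'
```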
In practical deployment, hybrid strategies often involve compressing the intermediate data that is sent over the network. When splitting an LLM at layer k, the device must transmit the layer-k activations to the cloud; techniques like activation compression or quantization are used to reduce this payload. Prior systems (e.g. for CNNs) have applied quantization or even pruning to intermediate feature maps to save bandwidth. For LLMs, one can similarly send 8-bit or 4-bit activations instead of floats. Google’s research has noted using weight sharing and caching to minimize repeated data transfer in on-device pipelines (Large Language Models On-Device with MediaPipe and TensorFlow Lite - Google Developers Blog). Additionally, frameworks like MediaPipe (Google) are exploring streaming inference APIs that hide the complexity of splitting models between device and web workers.
It’s worth noting industry trends: Google’s Android 14 introduces “AICore” to run a Gemini Nano LLM on-device, automatically offloading to accelerators and cloud as needed . AICore provides a system service where certain interactions with Google’s LLM (Gemini) can be processed locally on high-end phones, with fallback to cloud if beyond local capacity . This is paired with use-case optimized LoRA adapters and safety filters on-device to customize and control outputs . The early access Gemini API allows developers to harness this hybrid edge-cloud setup for generative tasks, showing how future applications might seamlessly blend on-device AI with cloud AI.
In summary, offloading techniques make it feasible to use smaller edge models for speed and privacy while relying on cloud only when necessary. By splitting model execution cleverly and/or selecting the right model for the task, systems achieve a balance: real-time responsiveness and reduced cloud dependency, without sacrificing the capabilities of a large model when it’s truly needed. Ongoing research is making these decisions more adaptive (via learned policies) and more efficient (via quantized transmissions and optimized layer placement).
Industry Applications and Case Studies
Leading tech companies have begun deploying language models on consumer devices, demonstrating the practicality of these research advances. One prominent example is Meta’s effort to bring Llama-based models to smartphones. In late 2024, Meta released quantized Llama 3.2 models (1B and 3B parameters) optimized for mobile devices (Unleashing The Power Of AI On Mobile - AI blog - Arm Community blogs - Arm Community). Using techniques like QLoRA fine-tuning and SpinQuant for 4-bit weights, they preserved the original model’s quality and safety. When evaluated on phones using the PyTorch ExecuTorch runtime on an Arm CPU, the quantized 1B model achieved over 350 tokens/sec throughput on a Samsung Galaxy S24+. These 1B/3B models deliver fast on-device inference (2–4× faster than before) and use roughly half the memory of the float models. This case study from Meta and Arm shows that even multi-billion-parameter LLMs can run entirely on recent smartphones at practical speeds by combining quantization and optimized runtime software.
Another high-profile deployment is Apple’s on-device LLM integrations. While Apple has not yet publicly shipped a full ChatGPT-like model on iPhones, its Core ML framework and silicon are clearly geared for it. In a November 2024 technical report, Apple detailed how it got an 8B Llama 3.1 model running in real time on an M1-class device (On Device Llama 3.1 with Core ML - Apple Machine Learning Research). By converting the model to Core ML format and applying key optimizations (4-bit weights, fused attention, and an efficient stateful KV cache), they achieved ~33 tokens/sec generation on the Mac’s Neural Engine/GPU. This implies that a similar approach could be taken on iPhone hardware (which shares the Apple Neural Engine), enabling use cases like on-device text generation for dictation or translation. Indeed, Apple already runs smaller transformer models on-device for features like autocorrect and dictation in iOS; the trend is clearly toward increasing the sophistication of on-device language models to improve user experience without cloud latency.
On the Android side, Google is deploying Gemini Nano, a scaled-down version of its flagship Gemini model, to select Pixel devices via the Android AICore feature (Large Language Models On-Device with MediaPipe and TensorFlow Lite - Google Developers Blog). This allows features like smart reply, summarization, or even multimodal assistance to run partially on-device, enhancing privacy and responsiveness. Google also released an experimental MediaPipe LLM Inference API that lets developers run open LLMs (e.g. Falcon, StableLM) in mobile apps fully offline . While currently for research, this points to future Android libraries where app developers can easily embed an LLM and rely on under-the-hood optimizations (new ops, quantization, weight caching) to make it feasible on a phone .
Beyond the tech giants, startups and open-source projects have driven LLM-on-edge innovation. The MLC (Machine Learning Compilation) project showcased LLMs running locally in diverse environments (mobile, web browser via WebGPU, etc.). By compiling models with TVM-based optimization, MLC was able to run 4-bit quantized models in the ~3B-parameter range (e.g. the Phi series) on an Android phone at interactive speeds (MLC | MLC-LLM: Universal LLM Deployment Engine with ML Compilation). They provide apps (like MLC Chat) that let users chat with Llama-2-family models on iOS/Android completely offline. This has proven the viability of community-driven LLM deployment: one can download a quantized model and interact with it on a mid-range device, thanks to efficient runtimes.
Qualcomm has also partnered to enable on-device LLMs. Their Snapdragon 8 Gen chips boast an AI engine that supports popular 7B-13B models. In March 2024 Qualcomm announced support for models like Baichuan-7B, Llama 2, and Google’s Gemini Nano on its smartphone SoCs (Qualcomm unveils Snapdragon 8s Gen 3 for enhanced AI and gaming). They demonstrated stable execution of these models within the power and memory limits of a phone. Qualcomm has even collaborated with enterprises like Personal AI to bring domain-specific LLMs to phones and PCs. In Oct 2024, Personal AI (an enterprise AI company) reported a partnership with Qualcomm to deploy custom “small language models” (SLMs) on Snapdragon-powered devices for corporate users (Personal AI Partners with Qualcomm: Bringing Small Language Models to Billions of Devices). Personal AI’s system allows every user (e.g. an employee) to have a personalized mini-LLM that runs locally on their phone or laptop, trained on their own data, ensuring privacy . Notably, “Personal AI has been selected as Qualcomm’s small language model provider of choice for running proprietary AI models on device for legal and financial enterprises.” . This indicates real-world demand for on-device LLMs in industry where data confidentiality is paramount (e.g. lawyers or bankers using a personal LLM that never sends data to cloud).
These case studies underscore that the technology has matured to the point of production deployment. We see 7B and even larger models (with compression) running on flagship smartphones and PCs, powering features from intelligent text input to voice assistants, all without a round-trip to a server. Companies leverage a combination of the techniques from research: distillation (to smaller models), quantization (to fit memory), efficient runtimes (to utilize CPUs/NPUs), and selective offloading. The result is improved user experience – lower latency, offline capability, and better privacy. As hardware improves and methods like those in this review continue to advance, we can expect the gap between cloud AI and mobile AI to further close, enabling truly pervasive intelligent applications at the edge.
Optimization Techniques: Pruning, Sparse Compute, LoRA, Decomposition
Beyond distillation and quantization, a variety of model compression and acceleration techniques are being applied to LLMs:
Pruning and Sparsity: Large models often have redundant weights. Pruning removes less important weights or neurons, yielding a sparser model that requires fewer operations. However, unstructured pruning (arbitrarily zeroing weights) can be hard to accelerate on general hardware, and can conflict with techniques like LoRA that add new weights. In late 2023, LoRAPrune introduced a framework to make pruning compatible with LoRA fine-tuning (LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | OpenReview). It uses LoRA’s small trainable weight updates to gauge importance: rather than looking at gradients of the full model (which is memory-intensive), it uses the LoRA gradients to decide which neurons or attention heads to prune . LoRAPrune also enforces structured pruning (removing entire channels or heads) so that the resulting model is smaller and faster at inference . Impressively, at a 50% pruning rate it reduced perplexity by 8–16 points (on WikiText2/PTB) compared to prior pruning methods, while cutting memory usage in half . Similarly, others have explored one-shot magnitude pruning of LLMs combined with post-hoc recovery; techniques like Wanda (2023) pruned GPT models without any fine-tuning by zeroing weights with minimal impact on activations.
Structured sparsity patterns (like 2:4 sparsity where 50% of weights in each block are zero) are gaining traction since they can be accelerated on NVIDIA GPUs. Neural Magic released a SparseGPT approach and a pruned “Sparse Llama” model with 2:4 pattern that runs faster on GPUs while retaining accuracy (2:4 Sparse Llama: Smaller Models for Efficient GPU Inference). Experiments show that transformers can lose 20–50% of weights with minimal accuracy drop if pruned carefully. Moreover, combining pruning with quantization yields compound benefits – some recent pipelines first prune a model then quantize the remaining weights to 8-bit or lower, achieving a multi-fold reduction in model size.
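For reference, structured pruning of a single linear layer can be sketched in a few lines: score whole output channels, keep the top fraction, and rebuild a smaller dense layer. This illustrates the general pattern only; LoRAPrune's LoRA-gradient importance criterion and the recovery fine-tuning step are not shown, and downstream layers must be resized to match.

```python
import torch
import torch.nn as nn

def prune_output_channels(linear: nn.Linear, keep_ratio: float = 0.5) -> nn.Linear:
    """Structured pruning: keep the output channels (rows of the weight matrix)
    with the largest L2 norm and build a smaller dense layer from them."""
    scores = linear.weight.detach().norm(dim=1)          # one importance score per output channel
    k = max(1, int(keep_ratio * linear.out_features))
    keep = torch.topk(scores, k).indices.sort().values   # indices of channels to keep

    pruned = nn.Linear(linear.in_features, k, bias=linear.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(linear.weight[keep])
        if linear.bias is not None:
            pruned.bias.copy_(linear.bias[keep])
    return pruned

layer = nn.Linear(4096, 11008)
smaller = prune_output_channels(layer, keep_ratio=0.5)   # consumers of this output must be adjusted too
```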
Low-Rank Adaptation (LoRA) and Adapters: LoRA was originally proposed to cheaply fine-tune large models by adding small low-rank update matrices, but it’s also being leveraged for compression. A low-rank factorization of weight matrices is a form of model compression – it approximates heavy dense layers with two slimmer matrices. Some works decompose a pre-trained LLM’s weights (e.g. via SVD) and show that a surprisingly low rank (e.g. rank-64 instead of 4096) can preserve most of the model’s knowledge, especially if one allows a tiny fine-tuning to compensate. LoRA itself doesn’t speed up inference unless the low-rank matrices replace the original weights. To address this, researchers integrated LoRA into the model weights during training. LoRA integration with sparsity is exemplified by LoSA (LoRA Sparse Adaptation) in 2025 (Dynamic Low-Rank Sparse Adaptation for Large Language Models). LoSA trains a model to be sparse and incorporates the low-rank updates such that after fine-tuning, the LoRA weights can be merged into the pruned model . In each layer, LoSA decides a proper sparsity level and LoRA rank based on a representation mutual information criterion (essentially gauging layer importance) . The result was a highly sparse LLM with no extra inference cost (LoRA merged) and minimal accuracy loss. For example, on LLaMA-2-7B, LoSA achieved a 2.6× speedup (CPU) with 50% weights pruned, while reducing perplexity by 68 points and improving zero-shot accuracy by 16% over a baseline sparse model without LoRA (i.e. it recovered a lot of lost accuracy) . This demonstrates that carefully fine-tuning a pruned model with low-rank adaptation can restore performance even at high sparsity, essentially pushing the Pareto frontier of how much we can compress an LLM.
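The reason a merged LoRA adapter adds no inference overhead is easy to see in code: the low-rank update is simply folded back into the dense weight. This is the standard LoRA merge, not LoSA's sparsity-aware variant; shapes and scaling follow the usual convention.

```python
import torch
import torch.nn as nn

def merge_lora(linear: nn.Linear, lora_A: torch.Tensor, lora_B: torch.Tensor,
               scaling: float = 1.0) -> None:
    """Fold a LoRA update into the base weight in place: W <- W + scaling * B @ A.
    lora_A: (r, in_features), lora_B: (out_features, r). After merging, inference
    uses a single dense matrix, so the adapter adds zero runtime cost."""
    with torch.no_grad():
        linear.weight += scaling * (lora_B @ lora_A)

layer = nn.Linear(4096, 4096, bias=False)
r = 16
A = torch.randn(r, 4096) * 0.01
B = torch.zeros(4096, r)      # LoRA convention: B is initialized to zero
merge_lora(layer, A, B)
```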
Tensor Decomposition: Beyond low-rank adapters, one can decompose the existing weight tensors of an LLM to smaller tensors. Common approaches include matrix factorization (e.g. factorizing a large dense layer W=U×VW=U×V where U,VU,V are smaller) or tensor train decomposition for multi-dimensional tensors. A 2024 method called TRAWL (Tensor Reduced and Approximated Weights for LLMs) systematically applied tensor decomposition strategies across transformer layers (TRAWL: Tensor Reduced and Approximated Weights for Large Language Models). Interestingly, TRAWL reported that in some cases decomposing weights can even improve accuracy, likely by removing noise or forcing a form of regularization . Using a layer-by-layer intervention (especially on final layers), they observed up to 16% enhancement in accuracy without additional fine-tuning . While counterintuitive, this suggests that over-parameterization in LLMs can sometimes be pruned away or factorized to yield a model that generalizes better (a known phenomenon in smaller networks as well). In general, tensor decomposition offers a trade-off: one can choose a rank that provides a desired compression ratio and accept a small increase in error, or even do a light fine-tuning after decomposition to adjust. These methods are complementary to quantization – e.g. one might first compress a model via SVD (reducing parameter count by 2×), then quantize the factors to int8, achieving an overall 8× reduction.
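A minimal sketch of truncated-SVD factorization of one weight matrix shows the rank/compression trade-off (this is generic low-rank compression, not TRAWL's layer-selection strategy; the rank below is an arbitrary choice):

```python
import torch
import torch.nn as nn

def factorize_linear(linear: nn.Linear, rank: int):
    """Replace W (out x in) by two smaller matrices via truncated SVD:
    W ~= (U_r * S_r) @ V_r^T, i.e. Linear(in -> rank) followed by Linear(rank -> out)."""
    W = linear.weight.detach()                       # (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(linear.in_features, rank, bias=False)
    second = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    with torch.no_grad():
        first.weight.copy_(Vh[:rank])                # (rank, in)
        second.weight.copy_(U[:, :rank] * S[:rank])  # (out, rank)
        if linear.bias is not None:
            second.bias.copy_(linear.bias)
    return nn.Sequential(first, second)

# Parameter count drops from out*in to rank*(out+in);
# e.g. a 4096x4096 layer at rank 512 is ~4x smaller.
approx = factorize_linear(nn.Linear(4096, 4096), rank=512)
```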
Efficient Attention & Other Tricks: Researchers also optimize specific components of LLMs. Attention pruning (dropping some attention heads or limiting the attention window) can cut computation. Some 2024 works prune entire heads that have negligible contribution (a concept from BERT pruning now applied to LLMs). Others use sparsity in attention matrices: techniques like Big Bird and LongT5 (earlier) used sparse attention patterns to handle long sequences efficiently, and similar ideas are being tested on general LLMs to reduce the quadratic cost. Additionally, speculative decoding is a runtime optimization in which a small draft model proposes several tokens that the larger target model then verifies in a single parallel pass – this speeds up generation without changing the target model itself, and was noted by Qualcomm as a promising on-device technique (What's next in on-device generative AI? - Qualcomm). While speculative decoding is not model compression, it is an algorithmic way to get tokens faster from a given model and could be combined with the above methods.
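To illustrate the mechanism, here is a greatly simplified greedy version of one speculative-decoding round. Production implementations use rejection sampling so the output distribution exactly matches the target model, and the model interfaces assumed here (token ids in, logits out) are placeholders.

```python
import torch

@torch.no_grad()
def speculative_decode_step(draft_model, target_model, prefix_ids, k=4):
    """One greedy speculative-decoding round:
    1) the small draft model proposes k tokens autoregressively,
    2) the large target model scores the whole extended prefix in one pass,
    3) proposals are accepted until the first disagreement, at which point the
       target model's own token is taken instead.
    Both models are assumed to map a (1, seq) id tensor to (1, seq, vocab) logits."""
    draft_ids = prefix_ids
    for _ in range(k):
        next_id = draft_model(draft_ids)[:, -1].argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)
    proposals = draft_ids[:, prefix_ids.shape[1]:]                  # (1, k) draft tokens

    # A single target-model pass verifies all k proposals in parallel.
    target_logits = target_model(draft_ids)[:, prefix_ids.shape[1] - 1:-1]
    target_choice = target_logits.argmax(dim=-1)                    # (1, k)

    accepted = []
    for i in range(k):
        if proposals[0, i] == target_choice[0, i]:
            accepted.append(proposals[0, i])
        else:
            accepted.append(target_choice[0, i])                    # correct and stop
            break
    return torch.cat([prefix_ids, torch.stack(accepted).unsqueeze(0)], dim=-1)
```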
In summary, a holistic approach to LLM optimization often involves stacking multiple techniques: e.g., start with a smaller distilled model, apply structured pruning to remove 30% of its neurons, apply quantization to 8-bit, and use an optimized runtime for sparse ops. Each step compounds efficiency gains. The research community in 2024 is actively exploring these combinations – as seen with LoRA+pruning (LoRAPrune, LoSA) and quantization+pruning (DILEMMA and others). The end goal is to maximize throughput/$ and throughput/Watt for LLM inference, making it feasible to deploy advanced language capabilities in cost- and power-constrained environments (like mobiles, AR glasses, or IoT devices).
Hardware Acceleration on Edge Devices
The strides in algorithms go hand-in-hand with advances in hardware for AI acceleration. Modern edge devices come with increasingly powerful and specialized AI processors that can be leveraged for LLM inference:
Mobile NPUs and DSPs: Smartphones now commonly include Neural Processing Units (NPUs) or signal processors optimized for neural network ops. Examples include the Apple Neural Engine, Qualcomm Hexagon AI engine, Google’s Edge TPU (in Pixel phones), Huawei’s Ascend NPU, etc. These processors are designed to execute matrix multiplications and convolutions with high parallelism and low power. Initially they were tuned for vision models and smaller networks, but they are becoming capable of handling transformer workloads. For instance, the Snapdragon 8 Gen 3 pairs a Hexagon NPU that handles INT8 and INT4 math with Arm v9 CPU cores that provide low-bit dot-product and matrix-multiply instructions (e.g. i8mm), both aimed at accelerating quantized transformers (Unleashing The Power Of AI On Mobile - AI blog - Arm Community blogs - Arm Community). Arm reported that by integrating its new low-bit kernels (via the KleidiAI library) into PyTorch ExecuTorch, a quantized Llama model ran about 20% faster on Arm Cortex-A cores using the INT8/INT4 instructions compared to FP16. These low-bit accelerations are crucial: the difference between 4-bit and 16-bit compute on Arm can approach an order of magnitude in speed, because a single core can execute multiple INT8 operations per cycle via SIMD, effectively multiplying throughput for quantized models.
Mobile NPUs also typically have fast on-chip memory to store model weights or activations, reducing memory bandwidth bottlenecks. Offloading parts of an LLM onto these NPUs can drastically improve performance per Watt. For example, the Pixel 7’s Tensor processor was used to run on-device speech models; similar use for language models is expected. Qualcomm has demonstrated running a 10B parameter model fully on the Hexagon accelerator in under a few seconds per response (by leveraging INT8 and weight compression). Apple’s Neural Engine, with 16 cores and 15+ TOPS throughput, can run medium-sized transformers efficiently if the model is converted to Core ML format. Indeed, Apple’s Core ML tools can target the Neural Engine with 8-bit weights; developers have reported running 7B LLaMA on iPhone at a few tokens per second using 4-bit quantization on ANE. As Apple continues to scale up the Neural Engine’s memory (it’s rumored future chips will allow larger models to be loaded), we may see high-end iPhones able to handle models with tens of billions of parameters in a split fashion (part on CPU/GPU, part on ANE).
Edge TPUs and Micro-accelerators: Outside of phones, there are tiny accelerator chips (like Google Coral EdgeTPU, NVIDIA Jetson Orin, etc.) used in IoT or embedded scenarios. These typically excel at fixed-point arithmetic. For instance, NVIDIA’s Jetson platform can use TensorRT to run transformers with mixed INT8/FP16 precision, harnessing the same Tensor Cores as datacenter GPUs. Although Jetsons are still relatively power-hungry, they show that data-center class optimizations (like FP8 transformer kernels) are filtering down to edge. Another trend is FPGA-based accelerators for LLMs at the edge; these can be programmed to implement sparse matrix multiplies or low-precision calculations efficiently and could be reconfigured as model architectures evolve.
Memory considerations are paramount: LLMs require a lot of memory for parameters and caching. Hardware solutions are addressing this by increasing on-device RAM and using faster memory interfaces. Some NPUs support model streaming from flash storage with decompression on the fly, so that even if a model is 20GB, an edge device might still run it at slower speed by not holding it all in RAM (analogous to how PagedAttention in 2023 allowed running GPT-3 off SSD). While not mainstream yet, research prototypes of compute-in-memory (analog AI chips) aim to perform matrix multiplies directly where weights are stored, massively reducing data movement energy.
TPUs and Specialized Chips in the Cloud: Though not edge devices, it’s worth noting cloud hardware because hybrid solutions use them. Google’s TPUv4/v5, NVIDIA H100, etc., now feature support for extremely low precision (down to INT4 or even binary) with clever algorithms to maintain accuracy. These improvements trickle to edge in form of IP cores and instruction set extensions. For example, Arm’s Helium and SME extensions (Scalable Matrix Extension) add matrix-multiply units to Cortex-M microcontrollers and Cortex-A CPUs respectively (Unleashing The Power Of AI On Mobile - AI blog - Arm Community blogs - Arm Community). This effectively brings a bit of “TPU” flavor onto general-purpose processors that will end up in future phones, IoT devices, and even appliances. Arm’s roadmap highlights INT8 dot products (SDOT), BF16 support, and now INT4 accelerations in its CPU ISA – all aimed at making neural inference faster per clock.
In 2024, we saw closer collaboration between software and hardware teams to realize LLM on edge. The PyTorch ExecuTorch project worked with Arm, Apple, and Qualcomm to ensure the runtime can delegate ops to the appropriate hardware (CPU, GPU, NPU) with minimal overhead . ExecuTorch can export a model and run it on “most edge devices” out-of-the-box, and is extensible to specialized hardware like NPUs for enhanced performance . This means if a phone has a particular AI coprocessor, ExecuTorch can offload parts of the model to it via a delegate backend. Likewise, TensorFlow Lite and MediaPipe leverage NNAPI on Android, which interfaces with device-specific NPUs transparently. The bottom line: hardware acceleration is being fully exploited by modern LLM frameworks – whether it’s using the GPU for fast parallel math, the NPU for low-precision inferencing, or custom instruction sets on the CPU. As a result, the performance gap between local inference and cloud inference is shrinking for suitably optimized models. We’ve reached a point where a high-end phone can run a multi-billion parameter model in a few hundred milliseconds per token – which, for certain interactive applications, is acceptable.
Framework and Software Support
Finally, the software ecosystem has evolved rapidly to support efficient LLM deployment. Major ML frameworks and libraries have introduced features to facilitate the techniques discussed:
PyTorch: The PyTorch team in 2024 introduced torchao (TorchAO) for quantization and sparsity, and ExecuTorch for edge deployment. ExecuTorch is a PyTorch-native runtime focused on mobile/edge, which can take a PyTorch model (exported via torch.export) and run it with an optimized C++ backend on various devices (ExecuTorch Alpha: Taking LLMs and AI to the Edge with Our Community and Partners | PyTorch). It supports 4-bit quantization (via GPTQ) out of the box and has integrations for hardware accelerators through delegate backends. In one blog post, the PyTorch team demonstrated Llama-2 7B quantized to 4-bit running on an iPhone 15 Pro and recent Samsung Galaxy phones, indicating full-stack support from model conversion to on-device execution. PyTorch also partnered with Arm to incorporate the KleidiAI low-bit kernels (as mentioned), so developers using ExecuTorch automatically benefit from Arm-optimized int4/int8 operations (Unleashing The Power Of AI On Mobile - AI blog - Arm Community blogs - Arm Community). Additionally, torch.compile (PyTorch 2.x) helps here by doing JIT compilation and operator fusion, which can speed up LLM inference even on CPU. For example, Microsoft’s benchmarks found that PyTorch with torch.compile can come close to ONNX Runtime at FP16 in some cases, and int4 quantization can then give further speedups (ONNX Runtime | Accelerating Phi-2, CodeLlama, Gemma and other Gen AI models with ONNX Runtime).
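A sketch of the export path, assuming the executorch package is installed; the API surface has been evolving, so the names below follow the ExecuTorch alpha tutorials and may differ in current releases.

```python
# Sketch of the PyTorch-export-to-ExecuTorch flow (API names may differ across versions).
import torch
from executorch.exir import to_edge  # assumes the executorch package is installed

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 128)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 128),)

exported = torch.export.export(model, example_inputs)   # capture a graph with torch.export
edge_program = to_edge(exported)                         # convert to ExecuTorch's edge dialect
et_program = edge_program.to_executorch()                # lower to the on-device program format

with open("tiny_model.pte", "wb") as f:                  # .pte file consumed by the C++ runtime
    f.write(et_program.buffer)
```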
Another interesting PyTorch addition is support for FlashAttention and memory-efficient attention in the core library, which reduces the memory and time overhead of the attention operation – crucial for long context or running on limited memory. PyTorch’s quantization toolkit (formerly fbgemm/quantization) expanded to handle larger models and added features like FP8 simulation and per-channel quant for linear layers, aligned with industry needs for LLMs.
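FlashAttention-style kernels are reachable from ordinary PyTorch code through torch.nn.functional.scaled_dot_product_attention, which dispatches to a fused backend when one is available:

```python
import torch
import torch.nn.functional as F

# Fused attention: PyTorch picks a FlashAttention or memory-efficient kernel when the
# backend supports it, avoiding materialization of the full seq x seq attention matrix.
batch, heads, seq, head_dim = 1, 32, 2048, 128
q = torch.randn(batch, heads, seq, head_dim)   # use float16/bfloat16 on GPU to hit the fused kernels
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 2048, 128])
```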
TensorFlow / TFLite: Google’s TensorFlow team, via TensorFlow Lite and MediaPipe, now supports on-device text generation pipelines. The MediaPipe LLM Inference API (experimental in 2024) provides a simple interface to load a TFLite model of an LLM and get token outputs, handling all the behind-the-scenes optimization (Large Language Models On-Device with MediaPipe and TensorFlow Lite - Google Developers Blog). Under the hood, they implemented various “optimizations across the on-device stack – including new ops, quantization, caching, and weight sharing” to make this feasible. For instance, TFLite added ops such as transformer LayerNorm and fused attention so that these blocks execute efficiently in the native runtime. They also support NNAPI delegation, so if a phone’s chipset has an accelerator, the heavy parts of the model can run there. TFLite had already supported 8-bit quantization; recently, they introduced support for 4-bit weight quantization (still in developer preview) to further shrink model size. Google’s blog noted that weight sharing (using the same weights for multiple decoder layers) was one technique to reduce memory, though it may come with some accuracy loss.
Moreover, Google’s Gemini API for Android provides a high-level interface where developers don’t even handle the model directly – they just call the API and the system decides if it can run on device or must call the cloud. This indicates frameworks are moving toward seamless hybrid integration: the developer doesn’t have to manage multiple models or devices, the runtime does it.
ONNX and Microsoft Olive: ONNX Runtime (ORT) has become a popular solution for optimized inference of transformer models across platforms. ORT 1.17+ includes many transformer-specific graph optimizations (like attention fusion, gelu fusion) and integrates Intel’s oneDNN and NVIDIA’s TensorRT for acceleration. It also added int4 support as noted, which means you can quantize an LLM to 4-bit and run it through ORT on CPU or GPU with considerable speedup (ONNX Runtime | Accelerating Phi-2, CodeLlama, Gemma and other Gen AI models with ONNX Runtime). Microsoft’s Olive toolkit works with ORT to automate the optimization of models: you give Olive a PyTorch or TensorFlow model, and it tries conversions, quantization, and finds the best runtime configuration for your target (CPU, GPU, or mobile) . Olive has recipes for LLMs, so it can apply int8 quantization with minimal accuracy loss, or split the model for disk paging if needed. This kind of tool support lowers the barrier for engineers to deploy efficient models – you don’t have to be an expert in quantization or ONNX; Olive will produce an optimized model (e.g. a quantized ONNX) that you can drop into your app.
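Running an already-exported (and optionally quantized) model with ONNX Runtime takes only a few lines; the model path, input name, and provider list below are placeholders that depend on how the graph was exported and which ORT build is installed.

```python
import numpy as np
import onnxruntime as ort

# Run an exported (and optionally quantized) transformer graph with ONNX Runtime.
# "model_int4.onnx" is a placeholder path; provider availability depends on the build.
session = ort.InferenceSession(
    "model_int4.onnx",
    providers=["CPUExecutionProvider"],   # e.g. "CUDAExecutionProvider" on GPU builds
)

# Input name and shape depend on the exported graph; "input_ids" is a common convention.
input_ids = np.array([[1, 15043, 29892, 920, 526, 366, 29973]], dtype=np.int64)
outputs = session.run(None, {"input_ids": input_ids})
print(outputs[0].shape)
```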
Other frameworks and libraries: There are many more – for example, Hugging Face’s HuggingFace Optimum integrates with both ONNX and OpenVINO to run transformers with mixed precision. NVIDIA’s TensorRT now has an open-source extension called TensorRT-LLM specifically for large language models, providing efficient implementations of transformer blocks on GPU and even supporting GPU-CPU memory streaming for models larger than GPU memory. On Apple, Core ML Tools can convert PyTorch/TensorFlow models to Core ML format, with support for neural engine optimizations and quantization (Apple added support for 8-bit weight quantization in Core ML 4). Apple’s converter can also automatically apply post-training quantization to reduce model size by half or more, which was used in their Llama 3.1 demo (On Device Llama 3.1 with Core ML - Apple Machine Learning Research) .
There are also specialized libraries: Llama.cpp (C++ library) allows running LLaMA models on CPU with 4-bit integers and has been heavily optimized with techniques like quantized multiplication and AVX512 acceleration – it’s popular for running smaller LLMs on laptops or even single-board computers. Similarly, GGML format models use quantization to enable running on-device. These community projects have influenced the mainstream frameworks to adopt similar optimizations.
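llama.cpp itself is a C/C++ library and CLI; one common way to drive it from Python is the llama-cpp-python bindings, sketched below with a placeholder GGUF model file.

```python
from llama_cpp import Llama  # Python bindings for llama.cpp

# Load a 4-bit GGUF model (placeholder filename); n_ctx is the context window,
# n_threads controls CPU parallelism.
llm = Llama(model_path="llama-2-7b.Q4_K_M.gguf", n_ctx=2048, n_threads=8)

result = llm("Q: What is the capital of France?\nA:", max_tokens=32, stop=["\n"])
print(result["choices"][0]["text"])
```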
In conclusion, the ecosystem of tools in 2024–2025 is rich and rapidly evolving to support efficient LLM deployment. Developers now have at their disposal:
easy-to-use distillation libraries (e.g. Hugging Face has a distilled GPT-2 and guides to distill larger models),
built-in quantization flows in PyTorch and TensorFlow (including QAT and PTQ for 8-bit and 4-bit),
runtime optimizers like ONNX Runtime, TensorRT, and XNNPack that cater to transformers,
and edge-specific frameworks like ExecuTorch and TFLite that abstract away hardware details while squeezing maximum performance via techniques like operator fusion and delegation.
The convergence of model compression techniques and framework support means that deploying an LLM on a resource-limited device is no longer an academic exercise but a practical reality. Engineers can mix-and-match distillation, quantization, pruning, etc., often using off-the-shelf tools, to obtain a model ready for the edge. Case studies from Meta, Google, Apple, and others validate that these methods work in production. As we move forward, we expect continued improvements – e.g. better auto-distillation (one-click creation of a smaller model for a given task), more adaptive runtime systems, and hardware with higher capacity – all enabling powerful LLM applications to run anywhere, anytime, even on a mobile device in your hand.
Sources:
Kundu et al., “Efficiently Distilling LLMs for Edge Applications,” NAACL 2024 (Efficiently Distilling LLMs for Edge Applications) .
Tian et al., “Beyond Answers: Transferring Reasoning to Smaller LLMs (TinyLLM),” WSDM 2025 ( Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation).
Zhao et al., “Multistage Collaborative Knowledge Distillation from an LLM,” ACL 2024 (Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation - ACL Anthology) .
Zhao et al., “Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving,” MLSys 2024 (Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving) .
Saxena et al., “ResQ: Mixed-Precision Quantization of LLMs with Low-Rank Residuals,” arXiv 2412.14363 (likely ICML 2025) (ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals) .
PyTorch Team, “ExecuTorch Alpha: LLMs to the Edge,” PyTorch Blog, Apr 2024 (ExecuTorch Alpha: Taking LLMs and AI to the Edge with Our Community and Partners | PyTorch) .
Iodice & Desai (Arm & Meta), “LLM Inference for Llama 3.2 on Mobile (ExecuTorch + KleidiAI),” Arm Community Blog, Oct 2024 (Unleashing The Power Of AI On Mobile - AI blog - Arm Community blogs - Arm Community) .
Yuan et al., “Local-Cloud Inference Offloading for LLMs (LCIO),” arXiv 2307.10169 / IEEE TMC 2024.
Hosseinzadeh et al., “DILEMMA: Joint LLM Quantization and Distributed Inference over Edge,” arXiv 2503.01704, Mar 2025 (DILEMMA: Joint LLM Quantization and Distributed LLM Inference Over Edge Computing Systems).
Sherwood & Lee, “LLMs On-Device with MediaPipe and TFLite,” Google Developers Blog, Mar 2024 (Large Language Models On-Device with MediaPipe and TensorFlow Lite - Google Developers Blog) .
Apple ML Team, “On-Device Llama 3.1 with Core ML,” Nov 2024 (On Device Llama 3.1 with Core ML - Apple Machine Learning Research) .
Personal AI, “Personal AI & Qualcomm: On-Device SLMs,” Oct 2024 (Personal AI Partners with Qualcomm: Bringing Small Language Models to Billions of Devices) .
Huang et al., “Dynamic Low-Rank Sparse Adaptation (LoSA) for LLMs,” arXiv 2502.14816, Feb 2025 (Dynamic Low-Rank Sparse Adaptation for Large Language Models) .
Zhang et al., “LoRAPrune: Structured Pruning meets Low-Rank Tuning,” ICLR 2024 (LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning | OpenReview) .
Luo et al., “TRAWL: Tensor Reduced and Approximated Weights for LLMs,” arXiv 2406.17261, Jun 2024 (TRAWL: Tensor Reduced and Approximated Weights for Large Language Models) .
Parinita Rahi et al., “Accelerating Phi-2, CodeLlama, Gemma with ONNX Runtime,” ONNX Blog, Feb 2024 (ONNX Runtime | Accelerating Phi-2, CodeLlama, Gemma and other Gen AI models with ONNX Runtime) .
Matt Casey, “LLM Distillation Demystified,” Snorkel AI Blog, Feb 2024 (LLM distillation demystified: a complete guide | Snorkel AI).