Introduction
Transformer-Based LLMs (GPT, T5, BERT)
Retrieval-Augmented LLMs
Mixture-of-Experts (MoE) Models
Mamba-Based Architectures (State-Space Models)
Cloud vs Edge Deployment Considerations
Introduction
Document digitization often produces lengthy text that must be split into manageable chunks before processing with Large Language Models (LLMs). This is necessary because most LLMs have limited context windows and high computational costs for long inputs. Recent research (2024–2025) has explored various LLM architectures to improve efficiency and scalability for tasks like document understanding. We review four major architecture types – Transformer-based models (e.g. GPT, T5, BERT), Retrieval-Augmented models, Mixture-of-Experts (MoE) models, and Mamba-based state-space models – analyzing them in terms of inference/training efficiency, adaptability, accuracy, token/context handling, speed, cost, and memory. We also distinguish which architectures are best suited for cloud versus edge deployment.
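As a concrete starting point, the snippet below is a minimal chunking sketch in Python: it splits a long document into overlapping, token-bounded pieces before they are sent to an LLM. The `tokenizer` object, the 512-token limit, and the 64-token overlap are illustrative assumptions, not tied to any particular model or library.

```python
# Minimal sketch: split a long document into overlapping, token-bounded chunks.
# `tokenizer` is assumed to expose encode()/decode() (e.g. any BPE tokenizer);
# the names and sizes below are illustrative, not tied to a specific library.

def chunk_document(text, tokenizer, max_tokens=512, overlap=64):
    """Return a list of text chunks, each at most `max_tokens` tokens long."""
    token_ids = tokenizer.encode(text)
    chunks = []
    step = max_tokens - overlap          # slide the window with some overlap
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(token_ids):
            break
    return chunks
```

Overlapping chunks are a common precaution so that sentences or facts are not cut in half at chunk boundaries.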
Transformer-Based LLMs (GPT, T5, BERT)
Transformer-based models leverage self-attention and have become the foundation of modern NLP. They achieve state-of-the-art accuracy across diverse tasks due to large model size and training on vast data (A Survey on Mixture of Experts). However, standard Transformers have quadratic time/memory complexity in sequence length, making long documents expensive to process (An Empirical Study of Mamba-based Language Models). Key characteristics:
Inference Efficiency & Speed: On short inputs, Transformers can be fast with optimized libraries, but on long sequences their O(n²) attention leads to slowdowns and high memory usage. For example, attention requires storing a key/value cache that grows with sequence length, consuming large memory for long documents. Various optimizations (FlashAttention, Longformer-style sparse attention) have been proposed to mitigate this (Falcon Mamba: The First Competitive Attention-free 7B Language Model), but the fundamental scaling remains quadratic.
Training Scalability & Cost: Scaling Transformers to billions of parameters has unlocked emergent capabilities like in-context learning. This comes at tremendous computational cost – training giant models (e.g. hundreds of billions of parameters) requires massive cloud compute. Fine-tuning or instruction-tuning is feasible but also resource-intensive for large models. In practice, only cloud infrastructures can train and deploy the largest GPT-style models due to cost.
Fine-Tuning Adaptability: Transformer LLMs adapt well to new tasks via fine-tuning or prompt tuning. Encoder-based models like BERT were traditionally fine-tuned per task, while decoder models (GPT) can be instruction-tuned to handle many tasks in one model. Techniques like LoRA and adapters help update large models with modest compute. This flexibility has made Transformers widely useful for document QA, summarization, etc.
Model Accuracy: When sufficiently scaled, Transformer LLMs achieve very high accuracy and fluency. They excel at understanding and generating text, reasoning through prompts, and have dominated benchmarks. For instance, GPT models and T5 variants are state-of-the-art in summarizing long documents (provided the input is chunked to fit the context window). Their strong in-context learning ability means they can effectively use document chunks given in the prompt.
Token/Context Efficiency: Standard Transformers have fixed context limits (e.g. 512 tokens for BERT, 2048–32k for GPT variants). Processing a document longer than this requires chunking into multiple passes. This is inherently token-inefficient, as the model cannot ingest the whole document at once. Extended-context Transformers (using sparse attention or recurrence) can handle longer inputs but often at some cost to accuracy or speed. In summary, vanilla Transformers are not very token-efficient for very long texts – they rely on splitting or truncating inputs.
Memory Usage: Large Transformer models consume a lot of memory. At inference, billions of parameters must be loaded (e.g. a 175B model occupies tens of GBs), and attention activations scale with sequence length (a back-of-the-envelope estimate appears at the end of this section). This limits deployment on memory-constrained devices. Techniques like model pruning and quantization are often applied to reduce memory for smaller deployments, but for best performance the full model is typically kept in a data center.
Cloud vs Edge: Transformer LLMs in their full size (e.g. GPT-3, PaLM) are optimized for cloud deployment due to high resource needs. On the edge, smaller Transformer variants or distilled models are used. For example, compressed models like DistilBERT or quantized 7B-parameter GPTs can run on mobile/embedded hardware with reduced precision. Quantization significantly shrinks model size and memory footprint, enabling LLM deployment on resource-constrained devices (LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment). Nonetheless, even with compression, Transformers on the edge handle only limited contexts (a few hundred tokens) unless paired with other strategies like retrieval.
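To make these memory figures concrete, here is a back-of-the-envelope estimate (referenced under Memory Usage above) for a hypothetical 7B-parameter decoder with 32 layers and hidden size 4096 – typical values for models of that size, used here purely as assumptions.

```python
# Back-of-the-envelope memory estimates for a hypothetical 7B decoder
# (32 layers, hidden size 4096, fp16 weights) -- illustrative numbers only.

GB = 1024**3

def weight_memory(n_params, bytes_per_param):
    return n_params * bytes_per_param / GB

def kv_cache_memory(seq_len, n_layers=32, hidden=4096, bytes_per_val=2):
    # 2x for keys and values, per layer, per token
    return 2 * n_layers * hidden * bytes_per_val * seq_len / GB

print(f"weights fp16 : {weight_memory(7e9, 2):.1f} GB")    # ~13.0 GB
print(f"weights int4 : {weight_memory(7e9, 0.5):.1f} GB")  # ~3.3 GB
print(f"KV cache @ 2k tokens : {kv_cache_memory(2_048):.2f} GB")
print(f"KV cache @ 32k tokens: {kv_cache_memory(32_768):.2f} GB")
```

Under these assumptions the weight footprint shrinks roughly 4× going from fp16 to 4-bit, while the key/value cache grows linearly with context length – which is why long inputs can dominate memory even after the weights are quantized.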
Retrieval-Augmented LLMs
Retrieval-Augmented Generation (RAG) combines an LLM with an external knowledge source. The model retrieves relevant text chunks from a document database or index and feeds them into the prompt, instead of relying solely on its internal parameters for knowledge (Retrieval-Augmented Generation for Large Language Models: A Survey). This approach is highly relevant for document question answering and search applications. Key points:
Inference Efficiency: RAG adds a retrieval step (using e.g. vector similarity or keyword search) before generation. This introduces some latency and system complexity (Don’t Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks), but it allows the LM component to be smaller and faster since it doesn’t need to memorize all facts. In practice, a well-optimized retriever (using embeddings or a keyword index) can fetch relevant chunks in milliseconds. The overall inference speed can still be high because the LLM focuses on a narrow, relevant context rather than a huge input. The trade-off is the extra retrieval call, which can be offset by caching.
Training Scalability & Adaptability: A major benefit of retrieval augmentation is that the knowledge base can be updated independently of the model. The LLM can be pre-trained generically, and for a new domain you simply index the domain documents. Fine-tuning the model is optional – some systems train the retriever and generator jointly for better end-to-end performance, but it’s not strictly required. This makes RAG very scalable: you can handle growing document collections by scaling your index, without retraining a larger model. It’s also adaptable: for a new task or document set, updating the retrieval data or doing light fine-tuning on how to use retrieved evidence can suffice.
Model Accuracy: RAG improves accuracy on knowledge-intensive tasks and reduces hallucination, because the model is grounded in retrieved text. Instead of relying on potentially outdated parametric knowledge, the LLM gets up-to-date information from documents, which is especially useful for QA on a specific corpus. This leads to more factual and traceable outputs. However, errors can occur if the retrieval fetches irrelevant or incorrect passages. Recent research proposes better retrieval ranking and integration to mitigate this. Overall, for document comprehension QA, retrieval-augmented models often outperform equivalently sized vanilla models that lack access to the full document.
Token Efficiency: Retrieval augmentation is very token-efficient for large documents. Rather than feeding an entire document of, say, 10,000 tokens into the LLM (which might be impossible or slow), a retrieval step selects only the most relevant few hundred tokens. The LLM then only processes those, making much better use of its context window. This approach aligns naturally with document chunking: the document is split into chunks (e.g. paragraphs) which are stored, and only the needed chunks are inserted into the prompt (a minimal sketch of this loop appears at the end of this section). This way, the LLM “sees” the important parts without wasting tokens on unrelated text.
Cost-Effectiveness: By enabling the use of a smaller LLM to tackle large knowledge problems, RAG can be cost-efficient. The heavy lifting of storing knowledge is offloaded to a database (which is cheaper and easier to scale than increasing model parameters). Inference cost is lower with a 7B or 13B model plus retrieval, compared to running a 175B model that tries to encode all knowledge internally. The extra cost of maintaining a search index is usually modest. However, RAG systems are generally cloud-based (to store big corpora and handle search). For edge scenarios, retrieval can be applied on-device but the knowledge base must be limited in size.
Cloud vs Edge: Retrieval-Augmented models are widely used in cloud deployments (e.g. a cloud service that indexes enterprise documents and uses an LLM to answer queries). In the cloud, they can tap into vast repositories (internet-scale indexes) with powerful retrievers. On edge devices, RAG can be used with a local dataset – for instance, a phone could index a user’s PDFs and use a small LLM to answer questions. This allows on-device query answering without an internet connection. The advantage on edge is that a lot of knowledge can be accessed without storing it in the model’s weights. The limitation is the device storage and compute: only relatively small corpora and smaller LMs can be used. Still, for privacy-sensitive or offline applications, an edge LLM with retrieval is attractive. One must account for the added latency of retrieval, but if the knowledge base is on-device (and of manageable size), the latency and memory use remain reasonable. Overall, retrieval augmentation is a flexible strategy that bridges the gap between limited model context lengths and the need to handle very large documents. It trades a bit of complexity for significant gains in effective context size and accuracy.
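Below is a minimal sketch of the chunk–embed–retrieve–prompt loop mentioned under Token Efficiency. It assumes a generic `embed` function that maps a piece of text to a fixed-size NumPy vector (any sentence-embedding model could play that role); all names and the top-k value are illustrative.

```python
import numpy as np

# Minimal RAG sketch: embed document chunks once, then at query time pick the
# top-k most similar chunks and paste them into the prompt. `embed` is assumed
# to return a 1-D numpy vector for a piece of text (any embedding model works);
# all names here are illustrative.

def build_index(chunks, embed):
    vectors = np.stack([embed(c) for c in chunks])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def retrieve(query, chunks, index, embed, k=3):
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                         # cosine similarity
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def build_prompt(query, retrieved):
    context = "\n\n".join(retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```

In production the brute-force dot product would typically be replaced by an approximate nearest-neighbor index, but the flow – chunk once, retrieve per query, prompt with only the retrieved text – stays the same.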
Mixture-of-Experts (MoE) Models
Mixture-of-Experts architectures expand model capacity by having multiple sub-models (“experts”) and a gating network that routes each input token to one or a few experts. Notably, this allows models with trillions of parameters to be trained without a proportional increase in computation per token (A Survey on Mixture of Experts). In other words, MoE can achieve a much larger effective model size while keeping inference cost per token relatively low. Key analysis:
Inference Efficiency & Speed: By activating only a small fraction of the model’s weights for each input, MoE models attain high capacity with minimal computation overhead. For example, DeepSeek-V2 (2024) is a 236B-parameter MoE language model that activates ~21B parameters per token (sparsely) and supports a 128K context window (DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model). Thanks to this sparsity, it significantly boosted generation throughput (5.76× faster than a comparable dense model) and cut memory usage (KV cache reduced by 93%) while maintaining strong performance. In general, an MoE with top-1 expert routing performs inference about as fast as a single-expert model of equivalent small size, aside from some overhead for the gating mechanism (a minimal routing sketch appears at the end of this section). If each token uses two experts (top-2), inference cost roughly doubles, but is still far below using all experts at once. Thus, MoEs are quite efficient at inference relative to their overall size.
Training Scalability & Cost-Effectiveness: MoEs shine in scaling up model capacity. They enable training models with far more parameters without linear growth in FLOPs. This can yield better accuracy for the same training compute budget. In the DeepSeek-V2 example, the MoE design saved about 42% of training cost compared to a dense model of lower capacity, yet achieved stronger results. However, MoE training is complex – it requires distributed training across many devices (each hosting different experts) and techniques to ensure balanced expert utilization. The gating function must be trained carefully to avoid overloading some experts while others sit underused. Despite these challenges, MoEs have been used by industry (e.g. Google’s Switch Transformer, GLaM) to reach scales previously infeasible. The approach is cost-effective in large-scale cloud training settings, but not for small-scale training.
Fine-Tuning Adaptability: Adapting an MoE model to new tasks or data can be more involved than for a standard model. One reason is that only a subset of experts may have seen certain data patterns during pretraining. Fine-tuning may require careful adjustments so that the right experts are engaged for the new domain. In some cases, experts can be fine-tuned or added independently for new skills (expert specialization can be an advantage here). There is research on pruning or swapping experts without retraining the whole model (Not All Experts are Equal: Efficient Expert Pruning and Skipping for ...). Overall, MoEs can be fine-tuned, but best practices are still evolving; sometimes dense fine-tuning of a smaller proxy model or adapter-style techniques are used to avoid destabilizing the mixture.
Model Accuracy: When properly trained, MoE LLMs achieve accuracy on par with or better than dense models of equivalent computational cost. They effectively leverage a larger parameter space – for instance, having 1T parameters with sparse usage can outperform a 100B dense model on many tasks. The survey of MoE indicates this approach has “emerged as an effective method for substantially scaling up model capacity with minimal overhead” and has attracted significant attention. Recent open-source MoE models rank among top performers; DeepSeek-V2, with only 21B active weights at inference, still achieves “top-tier performance among open-source models”. This shows that MoEs can maintain high accuracy while using only a fraction of their parameters per input, essentially giving the benefit of a much larger model without paying the full cost each time.
Token Efficiency & Context Length: The MoE approach itself doesn’t inherently increase context length (it is usually built on Transformer layers). But because MoEs reduce per-token compute, they can afford to process longer sequences to some extent. In practice, some MoE models incorporate other innovations for long context (DeepSeek’s 128K context is aided by a “Multi-head Latent Attention” mechanism compressing the cache). Generally, MoEs could be combined with long-context Transformer techniques to handle document-length inputs more efficiently than a dense model could. The token efficiency primarily comes from not needing to split a model into smaller pieces for capacity – the model can “focus” huge capacity on the tokens that need it, potentially capturing subtle long-range dependencies without running every token through every weight.
Memory Usage: Total memory required to host an MoE is massive (since it contains many experts). But at inference, one only needs to load or activate the experts needed for the input. In distributed cloud setups, different experts might reside on different servers, and the system routes data accordingly. This means the active memory footprint per token can be much smaller than the full model size. For example, if only 10% of parameters are used for a given input, memory and compute for that inference are cut dramatically. That said, to serve an MoE model with low latency, typically all experts are kept in RAM across the cluster. The benefit is more about compute scaling than reducing total storage. For edge or single-device scenarios, an MoE’s full size is prohibitive – you cannot store a trillion parameters on a phone. Even if only some are used at a time, the device would lack the capacity for the whole set. Thus, MoE memory advantages apply only in a distributed (cloud) context.
Cloud vs Edge: MoE LLMs are squarely aimed at cloud deployment. They require distributed infrastructure, both for training and inference, to manage the expert shards. The cloud is ideal since it can pool memory and compute from many machines, allowing the MoE to flex its capacity. In that environment, MoEs are very cost-effective, delivering high accuracy per compute dollar. Conversely, MoEs are ill-suited for edge devices. The necessity of coordinating many experts and the sheer model size make it impractical. Edge devices benefit more from model compression techniques than from MoE designs. So, we can consider MoE as a solution for scaling large models in data centers, but not a candidate for on-device use. If an edge device needs more knowledge than a small model can hold, retrieval augmentation (using an external store) is a far simpler solution than trying to run a massive MoE locally.
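The toy NumPy layer below illustrates the top-k routing idea referenced under Inference Efficiency & Speed: a gating projection scores the experts, each token is dispatched to its two highest-scoring experts, and only those experts’ weights are used. The sizes, the random weights, and the single-matrix “experts” are illustrative assumptions, not any production model’s implementation.

```python
import numpy as np

# Minimal top-k expert routing for one MoE feed-forward layer (illustrative).
# Each token is sent to its top_k highest-scoring experts; only those experts'
# weights are touched, which is where the compute savings come from.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

W_gate = rng.normal(size=(d_model, n_experts)) * 0.02
experts = [rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_layer(tokens):                        # tokens: (n_tokens, d_model)
    logits = tokens @ W_gate                  # gating scores per expert
    top = np.argsort(logits, axis=1)[:, -top_k:]          # chosen experts
    # softmax over the selected experts only
    sel = np.take_along_axis(logits, top, axis=1)
    weights = np.exp(sel - sel.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):        # route each token to its experts
        for j, e in enumerate(top[i]):
            out[i] += weights[i, j] * (token @ experts[e])
    return out

print(moe_layer(rng.normal(size=(4, d_model))).shape)     # (4, 64)
```

With top_k = 2 out of 8 experts, each token touches only a quarter of the expert parameters, which is the source of the capacity-versus-compute trade-off described above.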
Mamba-Based Architectures (State-Space Models)
Mamba is a new architecture (derived from structured state-space models, SSMs) designed to overcome Transformer limitations in sequence length and efficiency (An Empirical Study of Mamba-based Language Models). Mamba-based LLMs eschew the attention mechanism altogether, using SSM layers (which mix recurrence and convolution operations) to handle long-range dependencies in linear time. This is a significant departure aiming to allow very long input sequences with much lower computational cost. Key analysis:
Inference Efficiency & Speed: Mamba models have O(n) time complexity in sequence length, versus O(n²) for standard Transformers (a toy recurrence illustrating this linear scaling appears at the end of this section). This means they scale far better as input length grows. In practical terms, a pure Mamba 8B model or a hybrid model (Mamba + some attention) can generate text much faster for long contexts. An empirical study found an 8B hybrid (mostly Mamba layers) was up to 8× faster in generation than an 8B Transformer model. Similarly, the Falcon Mamba 7B (2024) – an attention-free 7B model – is “significantly faster at inference” than comparable Transformers and uses far less memory on long sequences (Falcon Mamba: The First Competitive Attention-free 7B Language Model). This speed advantage comes from avoiding the exhaustive attention computation and large key-value caches. For tasks involving lengthy documents, Mamba’s efficiency is a game-changer, enabling near-linear scaling.
Training Scalability: Until recently, SSM-based models were tested only at small scales, but 2024 results show they can be scaled up. Researchers trained Mamba variants up to 8B parameters on trillions of tokens. Training a state-space model requires different optimizations (for example, careful kernel implementations to parallelize the recurrence). Mamba introduced GPU-friendly techniques like kernel fusion and not materializing large intermediate states (Mamba (deep learning architecture) - Wikipedia), allowing it to train effectively. A hybrid approach (mixing Mamba layers with some attention and feed-forward layers) has proven especially successful, combining the strengths of both architectures. As a result, an 8B hybrid Mamba-Transformer actually outperformed an 8B pure Transformer on a suite of tasks (+2.65 average score). This suggests training scalability is good – Mamba layers can be integrated and scaled much like Transformer layers. The main caveat noted is that pure SSM models sometimes lag on tasks requiring explicit short-term memory or copying (where attention excels), though these gaps close with either hybrid designs or second-generation SSM improvements (Mamba-2).
Fine-Tuning Adaptability: Mamba-based LLMs can be fine-tuned with the same methods used for Transformers (supervised fine-tuning, RLHF, etc.). For instance, DeepSeek-V2 – though a latent-attention MoE model rather than an SSM – applied SFT and RL on top of its long-context architecture to maximize performance (DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model), and Falcon Mamba 7B is a pretrained base model that could be fine-tuned or used as is for various tasks. While literature on fine-tuning pure Mamba LLMs is limited (since they are very new), early evidence indicates they are as adaptable as Transformers. There might be differences in how they handle instructions or few-shot prompts due to weaker in-context learning in a pure SSM model. However, the hybrid models and improved architectures (e.g. Mamba-2, which likely enhances such capabilities) show that adaptability can match Transformer-based LLMs. We can expect techniques like LoRA or prompt-tuning to work similarly on state-space models.
Model Accuracy: The big question for any new architecture is whether it can match the gold standard (Transformers) in accuracy. The answer from 2024 studies is encouraging: Mamba-based LLMs achieve competitive accuracy. Falcon Mamba 7B, for example, “matches or surpasses” other 7–11B Transformer models on the Open LLM benchmark, despite using no attention. Moreover, a hybrid 8B Mamba-Transformer outperformed an 8B Transformer on a broad set of tasks. Pure Mamba models did show slightly lower performance on tasks heavily reliant on in-context learning or exact copying from input (e.g. few-shot learning tasks), as attention seems beneficial there. But for many language understanding and generation tasks, state-space models equaled or exceeded Transformers. A reasonable conclusion is that Mamba architectures can reach the same quality bar on most benchmarks, especially when augmented with a bit of attention or other tweaks, and they bring significant efficiency perks.
Token Efficiency (Long-Context Handling): This is where Mamba truly stands out. State-space models can maintain very long contexts without blowing up computation. Experiments have demonstrated Mamba-based models handling inputs up to 16K, 32K, even 128K tokens while maintaining performance. In document digitization scenarios, this could eliminate the need for chunking a long text into many pieces – a Mamba model might digest an entire document in one go if the context window (and hardware memory) allows. This linear scalability means token utilization is vastly improved: you don’t pay an extra quadratic cost for each additional token, just a linear cost. Thus, feeding a 50-page document is feasible where a standard transformer would choke. In practice, extremely long sequence support might require trading off some model size or using special memory compression (as seen with DeepSeek’s latent cache compression), but the ability to efficiently extend context is a core strength of Mamba. In summary, Mamba architectures are highly token-efficient for long sequences, directly addressing the chunking pain point.
Inference Speed and Memory Usage: We’ve noted the speed advantages – up to 8× faster generation in one study, and significantly faster than baseline at 7B scale. Memory-wise, Mamba avoids storing large attention matrices. The Wikipedia description highlights that Mamba’s GPU implementation “avoids materializing expanded states… improving performance and memory usage,” leading to much lower memory consumption for long inputs (Mamba (deep learning architecture) - Wikipedia). For instance, Falcon Mamba 7B requires substantially less memory than an attention-based 7B for long text generation. Lower memory and no attention cache also mean that batch processing of long sequences is more practical. Overall, the runtime memory footprint grows roughly linearly with sequence length (mostly from intermediate states in the SSM), which is a dramatic improvement over transformers. This makes Mamba-based models appealing for processing big documents on hardware with limited RAM.
Cost-Effectiveness: Mamba models can offer cost savings in scenarios dealing with long texts. In cloud settings, serving a document-heavy workload with a linear-time model can cut inference GPU-hours significantly (8× speedup means you might need 1/8th the compute for the same throughput in some cases). Training cost for Mamba vs Transformer is roughly comparable per token (both are large dense models), but if the model can be smaller or if hybrid layers yield better accuracy, there could be training cost reductions as well. Importantly, being able to operate on longer sequences natively can simplify pipelines (less need for elaborate retrieval or chunking mechanisms), which can reduce system complexity and maintenance costs.
Cloud vs Edge: Mamba-based LLMs are a promising technology for cloud deployment when long-context tasks are required (e.g. analyzing long documents or logs). A cloud server can handle a moderately large Mamba model (such as 8B) and utilize its efficiency to process inputs that a similarly sized Transformer would struggle with. Because Mamba architectures are new, current top-performing ones are in the multi-billion parameter range, which still leans toward cloud hardware (GPUs) for real-time use. That said, the efficiency gains open the door for edge deployment of LLMs on longer text inputs than previously possible. A smaller Mamba model (say 1–3B parameters) could potentially run on a laptop or mobile device and handle inputs that a normal transformer of that size could not (due to memory limits for attention). If edge devices incorporate state-space models, users could, for example, run an on-device assistant that reads an entire local document without needing to send data to the cloud. This would offer the benefits of low latency and privacy. The edge viability will also depend on model compression – a 7B Falcon Mamba might be borderline for a phone unless quantized. But given that quantization is model-agnostic (can apply to Mamba too), we may see attention-free, long-sequence models optimized for devices. In summary, Mamba-based architectures provide a path to long-document LLM processing with high efficiency, making them valuable in cloud now and potentially transformational for edge use as hardware catches up.
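To ground the linear-scaling claim referenced under Inference Efficiency & Speed, the toy recurrence below runs a diagonal state-space model over a sequence: a fixed-size hidden state is updated once per token, so cost and memory grow linearly with length. This is a conceptual sketch only – real Mamba layers use input-dependent (selective) parameters and fused GPU scan kernels.

```python
import numpy as np

# Toy diagonal state-space recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C.h_t.
# One pass over the sequence costs O(seq_len * d_state) -- linear in length --
# versus the O(seq_len^2) pairwise interactions of full attention. This is a
# conceptual sketch, not Mamba's hardware-aware selective-scan implementation.

rng = np.random.default_rng(0)
d_state = 16

A = np.full(d_state, 0.9)                  # decay per state dimension
B = rng.normal(size=d_state) * 0.1
C = rng.normal(size=d_state) * 0.1

def ssm_scan(x):                           # x: (seq_len,) scalar input stream
    h = np.zeros(d_state)
    ys = np.empty_like(x)
    for t, x_t in enumerate(x):            # single linear pass, fixed-size state
        h = A * h + B * x_t
        ys[t] = C @ h
    return ys

print(ssm_scan(rng.normal(size=100_000)).shape)   # handles long inputs cheaply
```

Contrast this with attention, where every new token must interact with all previous tokens and the key/value cache keeps growing with the sequence.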
Cloud vs Edge Deployment Considerations
To synthesize the findings:
Cloud-Optimized Architectures: Generally, the largest and most complex models are confined to cloud environments. This includes massive Transformer models (GPT-3, PaLM, etc.) which require specialized hardware and ample memory, as well as MoE models which distribute their many experts across servers. These achieve the highest accuracy and scaling but at high cost. Cloud deployment also suits retrieval-augmented systems that need to maintain and query huge document indices (e.g. a cloud service indexing millions of documents for QA). Mamba-based models at multi-billion scale are also currently a better fit for cloud, where their long-sequence advantages can be fully utilized with proper GPU support. In the cloud, one can horizontally scale to serve many users and use techniques like sharding and caching to optimize throughput. The cloud-first approach has been the default for LLMs due to their heavy resource demands, but it comes with expenses and dependency on internet connectivity.
Edge-Optimized Architectures: Edge deployment favors efficiency and compactness. Small Transformer variants (or moderately sized ones with compression) are the primary choice today. Techniques such as 8-bit or 4-bit quantization and pruning are essential to shrink models for the edge (LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment); the sketch below gives a rough sense of the memory budgets involved. For example, quantized LLaMA 7B models have been run on smartphones, albeit with some latency. Cost-effectiveness on the edge is measured by whether the model can deliver acceptable performance within device constraints. Retrieval augmentation is a powerful strategy here: rather than a 20B model that won’t fit on an edge device, one could use a 2B model plus a local knowledge base relevant to the user’s documents. This leverages storage (which is relatively cheap on devices) to compensate for a smaller model size. Mamba-based models could become edge-friendly because of their lower memory usage on long inputs – an edge device might handle a longer user document with a Mamba model than it could with an attention model (which might run out of memory). We’re also seeing academic proposals for dynamic model deployment on the edge, where parts of a model adapt to hardware capabilities. Overall, edge LLM research focuses on model compression, efficient inference, and privacy. Running LLMs locally can yield faster responses and offline functionality, and it keeps sensitive data on-device (A Review on Edge Large Language Models: Design, Execution, and Applications). The trade-off is that edge models usually can’t match the full accuracy of giant cloud models. Therefore, choosing an architecture for edge involves balancing performance with efficiency – often favoring simplicity and smaller scale.
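As a rough illustration of the quantization arithmetic mentioned above (see also the earlier Transformer memory estimate), the helper below checks whether a model’s weights fit into a device’s RAM at a given bit width. The 7B parameter count, 8 GB RAM budget, and 1 GB working-memory margin are illustrative assumptions, not measured figures.

```python
# Rough "does it fit?" check for running a quantized model on a device.
# Figures are illustrative: weights dominate, with a small working-memory
# margin assumed for activations and a short edge-scale KV cache.

GB = 1024**3

def fits_on_device(n_params, bits_per_weight, device_ram_gb, margin_gb=1.0):
    weights_gb = n_params * bits_per_weight / 8 / GB
    return weights_gb + margin_gb <= device_ram_gb, weights_gb

for bits in (16, 8, 4):
    ok, size = fits_on_device(7e9, bits, device_ram_gb=8)
    print(f"7B @ {bits}-bit -> {size:.1f} GB weights, fits in 8 GB RAM: {ok}")
```

Under these assumptions a 7B model only fits once quantized to 8 bits or below, which matches the observation that quantized 7B models have been run on smartphones.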
In conclusion, a spectrum of LLM architectures now exists to handle document-centric applications. Transformer-based LLMs remain the workhorse, offering high accuracy but requiring strategies (chunking or long-context variants) to process big documents efficiently. Retrieval-augmented models excel at injecting external knowledge, making them ideal when document repositories are too large to fit in context. Mixture-of-Experts models push the envelope in model size and are primarily cloud solutions for maximum accuracy per compute, though not directly addressing document length. Mamba (state-space) architectures introduce an exciting alternative that natively handles long sequences with high efficiency, potentially reducing or eliminating the need for chunking. Each comes with different profiles of inference speed, training cost, and memory use. Cloud deployments will continue leveraging the largest models and advanced mixtures for ultimate accuracy, whereas edge deployments will lean on optimized, efficient models (possibly augmented by retrieval) to bring LLM capabilities to user devices within limited resource budgets. The ongoing research from 2024 and 2025 suggests that we can expect even more hybrid approaches – such as combining retrieval with long-context models or sparsity with state-space models – to better handle document digitization tasks in a range of environments.
References: The insights above are drawn from recent literature, including surveys and empirical studies on RAG (Retrieval-Augmented Generation for Large Language Models: A Survey), MoE models (A Survey on Mixture of Experts), and state-space (Mamba) models (An Empirical Study of Mamba-based Language Models), as well as research on deploying LLMs efficiently on cloud vs edge (LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment).