Fine-Tuning Architectures (PEFT) for LLMs
Parameter-Efficient Fine-Tuning (PEFT) methods enable adapting large language models by updating only a small set of additional parameters instead of all model weights. This drastically reduces the memory and compute needed for fine-tuning. Popular PEFT techniques include low-rank adaptation and prompt/adapter injection. Key approaches in recent research (2024–2025) are:
LoRA (Low-Rank Adaptation): Introduces trainable low-rank update matrices into the model’s layers (e.g. replacing a weight W with W + BA, where B and A are small matrices of rank r ≪ dim(W)). The original model weights stay frozen, and only these adapter matrices are learned. This bottleneck structure (a down-projection followed by an up-projection, added to the frozen weight’s output) allows the model to be fine-tuned with only a few million parameters. Once training is done, the low-rank weights can be merged into the base model with no inference latency penalty (a short merge sketch follows the fine-tuning example below). Variants like AdaLoRA and DyLoRA further improve LoRA by dynamically adjusting the rank per layer during training to allocate capacity where it is needed, enhancing efficiency on a fixed parameter budget (e.g. training over a range of ranks instead of a single fixed rank).
QLoRA (Quantized LoRA): A 2023 method that combines quantization with LoRA to minimize memory usage. QLoRA first quantizes the pretrained model to 4-bit weights, then fine-tunes LoRA adapters on top of this compressed model (QLoRA: Efficient Finetuning of Quantized LLMs). Gradients are backpropagated through the frozen 4-bit model into the low-rank adapters. This approach preserves full 16-bit fine-tuning quality while allowing, for example, a 65B model to be fine-tuned on a single 48GB GPU. QLoRA introduced techniques such as a new 4-bit NormalFloat (NF4) data type and double quantization to maintain accuracy under aggressive compression, and it achieved near state-of-the-art results on benchmarks with a fraction of the hardware. Recent research pushes this further: LowRA (2025) demonstrated accurate LoRA fine-tuning at an effective precision below 2 bits per parameter, using fine-grained mixed-precision assignments and custom kernels (LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits). This cuts memory usage dramatically (30–50% less memory than 4-bit LoRA) with minimal performance loss.
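For reference, these QLoRA options are exposed through Hugging Face’s BitsAndBytesConfig. The sketch below shows the NF4 data type and double quantization mentioned above (the model name and compute dtype are illustrative choices, not requirements):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style 4-bit loading: NormalFloat4 data type plus double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat (NF4) data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16  # dtype used for matmuls during training
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b", quantization_config=bnb_config, device_map="auto"
)
# LoRA adapters are then attached on top of this frozen 4-bit model,
# as in the fine-tuning example below.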
Fine-Tuning with Hugging Face PEFT (Example): Below is a Python example using Hugging Face Transformers and the PEFT library to apply LoRA fine-tuning to an LLM. The base model is loaded in 4-bit precision (using bitsandbytes for quantization) and then wrapped with a LoRA adapter configuration:
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType

# Load a base model in 4-bit (quantized) mode
model_name = "facebook/opt-1.3b"  # example base model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_4bit=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Prepare the quantized model for training (casts norm/embedding layers, enables checkpointing)
base_model = prepare_model_for_kbit_training(base_model)

# Prepare a LoRA config (low-rank adaptation)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32, lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # apply LoRA to attention projection matrices
)

# Wrap the base model with the LoRA adapters
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts

# ... (Prepare training data as `my_dataset`) ...
training_args = TrainingArguments(output_dir="outputs", per_device_train_batch_size=4, num_train_epochs=3)
trainer = Trainer(model=model, args=training_args, train_dataset=my_dataset, tokenizer=tokenizer)
trainer.train()
This code freezes the core model weights and inserts LoRA adapter weights into the query/value projection of each transformer layer. Only the LoRA adapter parameters (a few million vs. billions in the full model) will be updated during training, making fine-tuning memory-efficient.
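To make the merge step concrete, here is a minimal PyTorch sketch of folding a trained LoRA update back into the frozen weight (the dimensions, rank, and scaling below are illustrative values, not tied to any particular model):
import torch

d, r, alpha = 1024, 16, 32             # hidden size, LoRA rank, scaling (illustrative)
W = torch.randn(d, d)                  # frozen pretrained weight
A = torch.randn(r, d) * 0.01           # trained down-projection (r x d)
B = torch.randn(d, r) * 0.01           # trained up-projection (d x r)

# During training the layer computes W @ x + (alpha / r) * (B @ (A @ x)),
# with W frozen and only A, B receiving gradients.
x = torch.randn(d)
y_train = W @ x + (alpha / r) * (B @ (A @ x))

# After training, the low-rank update is folded into the base weight once:
W_merged = W + (alpha / r) * (B @ A)   # same shape as W -> no extra inference cost
y_merged = W_merged @ x                # identical output, single matmul at inference
assert torch.allclose(y_train, y_merged, atol=1e-4)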
Retrieval-Augmented Generation (RAG)
RAG architectures enhance LLMs by coupling them with a retrieval system to ground generation on external data. The model is augmented with a non-parametric memory (e.g. a vector database of documents or knowledge) and an embedding-based retriever (Retrieval-Augmented Generation for Large Language Models: A Survey). This helps mitigate hallucinations and provide up-to-date or domain-specific information beyond the LLM’s fixed training data.
How RAG Works: At query time, the system converts the user query into an embedding and performs a vector similarity search over the external database (using tools like FAISS, ScaNN, etc.) to fetch relevant text passages. These retrieved passages are then fed into the LLM alongside the original query to augment the context. The LLM’s generation is conditioned on both its internal knowledge and the retrieved evidence, leading to more factual and informed responses. In essence, RAG merges the LLM’s parametric knowledge with a large external knowledge store, allowing continuous updates and reducing the model’s reliance on stale training data.
A typical RAG pipeline involves the following steps:
Embedding & Retrieval: Encode the input query into a vector and query the vector store for nearest neighbors. The vector store (e.g. a FAISS index) holds embeddings of proprietary documents or knowledge-base entries and returns the top-k relevant documents based on cosine similarity or inner product.
Augmentation: The retrieved documents (or their relevant snippets) are then combined with the original query, for example by prepending them to the prompt or as separate input segments. Some architectures feed the documents through an encoder and give the LLM cross-attention to those encoder representations (as in the original RAG model by Lewis et al., 2020). Simpler implementations just concatenate the text.
Generation: The LLM generates an answer conditioned on the query plus the retrieved context. The generation mechanism remains the same (e.g. causal decoding), but the presence of retrieved facts helps ensure accuracy and allows referencing information not stored in the model weights.
Modern RAG systems often use a similarity-search library or vector database (e.g. FAISS, or a distributed store like Milvus) to store and query embeddings efficiently, enabling retrieval in a few milliseconds even over millions of documents. The retrieval component can be updated independently (e.g. by adding new documents) without retraining the LLM, making RAG attractive for enterprise use with proprietary data.
Latest Advancements: Research in 2024 has introduced adaptive retrieval techniques that make the retrieval step more context-aware. For example, the model can decide whether or not to retrieve for a given query, to avoid distracting the LLM with external text about things it already knows. Huang et al. (2024) propose an Adaptive RAG approach that retrieves only when the query asks for knowledge the LLM lacks. They determine this by inspecting the LLM’s own token embedding space: if the internal embeddings suggest the answer is not in the model’s stored knowledge, retrieval is triggered (Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models). This embedding-informed strategy lets the system skip retrieval for questions the model can answer on its own, improving efficiency without degrading answers with unnecessary context. Similarly, Liu et al. (2024) develop a controller based on the model’s inherent confidence, which monitors the LLM’s certainty during generation and triggers retrieval only when confidence is low (CtrlA: Adaptive Retrieval-Augmented Generation via Inherent Control). These adaptive retrieval models dynamically switch between pure generation and retrieval-augmented generation, achieving a better balance of accuracy vs. speed. Other improvements include training retrievers end-to-end with the LLM (so the retriever learns to fetch what the model truly needs) and employing multi-hop retrieval for complex queries. Emerging systems like Open-RAG (2024) even integrate a mixture-of-experts mechanism to let the model reason over retrieved evidence in multiple steps (Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models), illustrating the trend of tightly coupling retrieval with the model’s reasoning module.
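As a rough illustration of the confidence-gated idea, the sketch below generates an answer directly and falls back to retrieval only when the model’s own token probabilities look uncertain. This is a simplified stand-in, not the method of either paper; the retriever callable and the confidence proxy are assumptions:
import torch

def adaptive_rag_answer(question, llm, tokenizer, retriever, conf_threshold=0.6):
    # First attempt: answer from the model's parametric knowledge alone
    inputs = tokenizer(question, return_tensors="pt")
    out = llm.generate(**inputs, max_new_tokens=64,
                       output_scores=True, return_dict_in_generate=True)
    # Crude confidence proxy: mean top-token probability over the generated tokens
    top_probs = [torch.softmax(s, dim=-1).max().item() for s in out.scores]
    confidence = sum(top_probs) / max(len(top_probs), 1)
    if confidence >= conf_threshold:
        return tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    # Low confidence: retrieve evidence and regenerate with augmented context
    docs = retriever(question)  # hypothetical callable returning a list of passages
    prompt = "\n".join(docs) + "\n" + question
    aug_inputs = tokenizer(prompt, return_tensors="pt")
    aug_out = llm.generate(**aug_inputs, max_new_tokens=64)
    return tokenizer.decode(aug_out[0], skip_special_tokens=True)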
RAG Implementation Example (FAISS): Below is a simple example demonstrating how to use a vector store for RAG. We use FAISS to index document embeddings and retrieve relevant text for a query, then show how it can be fed to an LLM:
import faiss
import numpy as np

# Suppose we have document embeddings and their texts
document_embeddings = np.load("doc_embeds.npy").astype("float32")  # shape (N_docs, dim); FAISS expects float32
documents = open("documents.txt").read().splitlines()  # list of N_docs texts

# Build a FAISS index for efficient similarity search
dim = document_embeddings.shape[1]
index = faiss.IndexFlatIP(dim)  # using inner-product similarity
index.add(document_embeddings)

# Encode a user query into the same embedding space (using a suitable embedding model)
query = "What are the revenue projections for product X in 2025?"
query_embedding = embed_model.encode(query)  # embed_model: e.g. a SentenceTransformer or LLM embedding model
query_embedding = np.asarray(query_embedding, dtype="float32").reshape(1, -1)

# Retrieve the top-5 most similar documents
D, I = index.search(query_embedding, 5)
retrieved_texts = [documents[i] for i in I[0]]
print("Top documents:\n", retrieved_texts)

# Augment the query with retrieved context for the LLM
augmented_prompt = query + "\n" + "\n".join(retrieved_texts)
response = llm_model.generate(augmented_prompt)  # llm_model: any text-generation interface
print("LLM Response:", response)
In this snippet, embed_model could be a transformer model that generates embeddings (e.g. InstructorXL or a smaller LLM used for embedding). We add all document embeddings to a FAISS index and then find the nearest neighbors to the query. The retrieved texts are concatenated to the query before passing into llm_model for generation. In practice, one might use a dedicated RagRetriever and RagSequenceForGeneration model (as available in 🤗 Transformers), which handle the retrieval and generation steps in one framework. The example above illustrates the core idea: use vector similarity search to supply an LLM with external knowledge, enabling customized Q&A or generation based on proprietary data.
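Because the index lives outside the model, the knowledge base can be refreshed without retraining the LLM, as noted earlier. A minimal sketch, reusing embed_model and index from the example above (new_docs is a hypothetical list of new document texts):
# Embed the new documents and append them to the existing FAISS index
new_docs = ["Q3 revenue update for product X", "New pricing policy effective 2025"]
new_embeds = np.asarray(embed_model.encode(new_docs), dtype="float32")
index.add(new_embeds)        # the retriever now serves the new documents immediately
documents.extend(new_docs)   # keep the index-position -> text mapping in sync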
Adapters and Modular Architectures in LLMs
Adapters are lightweight neural modules inserted into an LLM’s architecture to allow efficient customization without altering the core model weights. During fine-tuning, only the adapter parameters are trained (the original pretrained weights remain frozen), greatly reducing the number of updated parameters and preserving the base model for reuse. After training, the adapter can be plugged in to modify the model’s behavior on a new task or domain. This modular design means multiple adapters (for different tasks or data domains) can be attached to the same base LLM as needed.
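For example, with the 🤗 peft library several trained adapters can be attached to a single base model and switched at runtime. A brief sketch (the adapter paths and names are placeholders for adapters you have already trained and saved):
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("gpt2-medium")
# Load a first trained adapter from disk and give it a name
model = PeftModel.from_pretrained(base, "adapters/legal-qa", adapter_name="legal")
# Attach a second adapter to the same frozen base model
model.load_adapter("adapters/medical-qa", adapter_name="medical")

model.set_adapter("legal")    # route generation through the legal-domain adapter
# ... generate legal-domain text ...
model.set_adapter("medical")  # switch domains without reloading the base model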
Prefix and Prompt Tuning: Another modular customization approach is prefix tuning, which does not add new layers but instead prepends learnable vectors to the model’s hidden states at each layer. In prefix tuning, a set of trainable continuous vectors is inserted as a prefix to the key and value sequences in the self-attention mechanism of every transformer layer. The model treats these as additional “virtual” tokens that guide the attention, effectively priming the model for the new task. Only these prefix vectors are trained (often via a small MLP that generates them); after training they are stored (on the order of a few thousand parameters) and used to influence generation. This technique can store 1000× fewer parameters than fine-tuning the whole model, enabling one LLM to support many tasks by switching out prefixes (Prefix tuning for conditional generation). Variants like prompt tuning or P-tuning operate similarly but often only add learnable tokens at the input layer instead of every layer. These methods shine especially with very large models (billions of parameters), where tuning a few virtual tokens can effectively steer the model. Recent research has also introduced adaptive prefix tuning (APT), which learns per-layer gating to adjust the influence of the prefix at different layers, further improving efficiency and control.
Control Mechanisms & Dynamic Adapters: Dynamic adapters are adapter modules that are conditionally applied, or whose configuration changes based on the input. Instead of a one-size-fits-all adapter, the model can select different adapter “experts” or settings on the fly. This idea is often implemented with a Mixture-of-Experts (MoE) or gating mechanism: multiple LoRA or adapter modules are trained (each specializing in a subset of data or a particular style), and a gating network chooses which adapter to apply for a given input segment. Liu et al. (2024) describe dynamic adapters as “conditionally computed lightweight adapters” that allow selective fine-tuning of the model and greatly increase adaptability (LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design). By retaining the pretrained model’s original weights and only swapping in different adapters or combining their outputs, the model can handle a wider range of tasks or domains without a separate full model for each. This modular approach has been shown to maintain the base model’s strengths while substantially boosting capacity on new tasks.
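The gating idea can be sketched in a few lines of PyTorch. This is a conceptual illustration only (the module layout, sizes, and dense softmax router are assumptions, not the LoRA-Switch design):
import torch
import torch.nn as nn

class GatedLoRALinear(nn.Module):
    """A frozen linear layer plus several LoRA 'experts' mixed by a gating network."""
    def __init__(self, in_dim, out_dim, num_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)             # frozen pretrained weight
        self.lora_A = nn.Parameter(torch.randn(num_experts, rank, in_dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, out_dim, rank))
        self.gate = nn.Linear(in_dim, num_experts)          # router over adapter experts

    def forward(self, x):                                   # x: (batch, in_dim)
        gate_weights = torch.softmax(self.gate(x), dim=-1)  # (batch, num_experts)
        down = torch.einsum("bd,erd->ber", x, self.lora_A)  # per-expert down-projection
        up = torch.einsum("ber,eor->beo", down, self.lora_B)
        delta = torch.einsum("be,beo->bo", gate_weights, up)  # gated mixture of updates
        return self.base(x) + delta
A production system would typically route per token with a sparse (top-1 or top-2) gate rather than a dense mixture, which is exactly the overhead question the next paragraph addresses.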
One challenge with dynamic or multiple adapters is the potential overhead of routing and combining experts at runtime. Recent work has addressed this with system-level optimizations. LoRA-Switch (2024), for instance, introduced a token-wise routing mechanism that merges the chosen low-rank adapters for each token into the model weights during inference, avoiding multiple sequential adapter passes per layer. This brought the latency overhead of dynamic MoE adapters down significantly (improving decoding speed by roughly 2.4×) while preserving their accuracy gains. Such advances indicate that adapter-based tuning can scale not just in parameter efficiency but also in runtime efficiency, making it practical to deploy multiple adaptive experts within an LLM.
Integrating Adapters – Example: Using the 🤗 peft library, we can attach an adapter to a pretrained model with just a few lines of code. For example, to apply prefix tuning on a GPT-2 model:
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("gpt2-medium")

# Configure prefix tuning: e.g., 20 virtual tokens as a prefix in each layer
prefix_config = PrefixTuningConfig(task_type=TaskType.CAUSAL_LM, num_virtual_tokens=20)

# Wrap the model with the prefix tuning adapter
model_with_prefix = get_peft_model(base_model, prefix_config)

# model_with_prefix now has additional prefix-tuning parameters that can be trained.
model_with_prefix.print_trainable_parameters()  # reports the added prefix parameter count
In this snippet, PrefixTuningConfig defines the adapter type (for a causal language model) and the length of the prefix. get_peft_model injects the prefix tuning vectors into each transformer layer of GPT-2. We could then fine-tune model_with_prefix on a new task (e.g. domain-specific text generation); only the prefix vectors (and possibly a small MLP, if configured) will be updated during training. The core GPT-2 weights remain untouched. After training, the prefix adapter (which might be only a few thousand parameters) can be stored or shared and applied to the GPT-2 model whenever we want it to perform the new task. Similarly, other adapter types (LoRA, AdaLoRA, etc.) can be integrated by choosing the appropriate PeftConfig. This modular approach allows customizing large models with proprietary data in a lightweight manner, reusing the same base LLM for many purposes by simply loading different adapters as needed.
References: Recent surveys and papers provide comprehensive overviews of these techniques, for example He et al., 2024 on parameter-efficient fine-tuning and Gao et al., 2024 on RAG (Retrieval-Augmented Generation for Large Language Models: A Survey), as well as specific works like Dettmers et al., 2023 for QLoRA (QLoRA: Efficient Finetuning of Quantized LLMs), Huang et al., 2024 for adaptive RAG (Embedding-Informed Adaptive Retrieval-Augmented Generation of Large Language Models), and Liu et al., 2024 for dynamic adapters (LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design). These advancements from 2024–2025 underscore a common theme: efficiently fine-tuning and extending LLMs by isolating small, trainable components (low-rank matrices, prefixes, or adapters) while leveraging powerful pretrained models as unchanged backbones. This enables organizations to customize LLMs with proprietary data and domain knowledge at low cost, and to continually update or switch out those customizations without retraining or serving an entire new model each time. The result is a flexible, modular LLM paradigm that combines the strengths of large foundation models with the agility of smaller task-specific adaptations.