Table of Contents
Introduction
Architecture Changes for Multimodal LLMs
Data Processing Pipeline Integration
Training Strategies for Multimodal Alignment
Computational Challenges
Inference Optimizations
Latest Research and Industry Insights (2024–2025)
Real World Applications
Conclusion
This review focuses on architecture changes, data processing, training strategies, computational challenges, inference optimizations, and real-world applications.
Introduction
Large Language Models (LLMs) are rapidly evolving beyond text-only interfaces into multimodal systems that can interpret and generate images, audio, and other data modalities alongside natural language. In 2024, this trend became mainstream – “all major LLM providers include the possibility to process images and sometimes audio and videos alongside texts,” turning what was once novel into a standard feature (A Dizzying Year for Language Models: 2024 in Review). Multimodal LLMs (sometimes called M-LLMs) unlock more natural interactions, allowing users to ask questions about a picture, get descriptions of audio, or have a conversation grounded in both text and visuals. For example, one can upload a photograph and ask an LLM to describe the scene or answer questions about it, a capability popularized by models like GPT-4 Vision in late 2023 (Visual ChatGPT: Multimodal Capabilities and Use Cases for 2024). This review surveys recent literature (2024–2025) on extending LLM architectures, data pipelines, and training techniques to support images and audio inputs, discusses the computational challenges and inference optimizations for such systems, highlights the latest research breakthroughs and industry insights, and examines real-world applications enabled by multimodal LLMs. All references are from 2024–2025 to ensure an up-to-date perspective.
Architecture Changes for Multimodal LLMs
Early efforts to add modalities to LLMs have converged on two main architectural paradigms (Understanding Multimodal LLMs). (A) Unified embedding/decoder architectures integrate all modalities into a single model (often a decoder-only transformer) by mapping non-text inputs into the same token space as text. (B) Cross-modal encoder/attention architectures use dedicated modules (encoders or adapters) for each new modality and fuse them with the language model via cross-attention or other bridging layers.
Unified Transformer Approaches: In this design, a multimodal model “utilizes a single decoder… much like an unmodified LLM architecture” . Images or audio are converted into a sequence of tokens with the same embedding size as text tokens, and concatenated into the text input stream fed to the LLM . The LLM itself does not structurally change – it treats visual or acoustic tokens as just another part of the sequence. Meta’s LLaMA 3.2 (2024) follows this pattern: it introduced 11B and 90B models that “support vision tasks, with a new model architecture that integrates image encoder representations into the language model” without altering the core transformer layers (Introducing Llama 3.2 models from Meta in Amazon Bedrock: A new generation of multimodal vision and lightweight models | AWS News Blog). Another example is AnyGPT (Zhan et al., 2024), which represents images, speech, and even music in a discrete token format so that a standard LLM can consume them. AnyGPT demonstrated that it can be trained “without any alterations to the current LLM architecture or training paradigms… relying exclusively on data-level preprocessing” to unify various modalities as token sequences ( AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling). The appeal of unified architectures is their simplicity – modalities are handled as if they were additional language “tokens” (akin to how multilingual models handle different languages). This means minimal code changes: the same transformer blocks process all inputs. Researchers report that such models can seamlessly perform multimodal generation too (e.g. generating image tokens to create images, or speech tokens for audio) in an “any-to-any” fashion , something that architectures which bolt on encoders (and usually only generate text) struggle with. However, unified models require a way to convert raw modalities into tokens – a challenge moved to the data preprocessing pipeline (discussed next).
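To make the unified approach concrete, the following sketch (hypothetical module and variable names, assuming a frozen vision encoder and a decoder-only LLM that accepts precomputed input embeddings) projects image features into the LLM's token-embedding space and simply concatenates them with the text embeddings:

```python
import torch
import torch.nn as nn

class UnifiedMultimodalInput(nn.Module):
    """Project image features into the LLM's token-embedding space and
    concatenate them with text embeddings (unified-decoder style)."""

    def __init__(self, vision_dim: int, llm_hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_hidden_dim)  # the simplest possible "connector"

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, vision_dim) from a frozen ViT/CLIP encoder
        # text_embeds: (batch, num_text_tokens, llm_hidden_dim) from the LLM's embedding table
        image_tokens = self.proj(image_feats)                 # (batch, num_patches, llm_hidden_dim)
        return torch.cat([image_tokens, text_embeds], dim=1)  # one sequence for the unchanged decoder

# Usage sketch (hypothetical objects): the fused sequence is passed to the decoder as-is.
# fused = UnifiedMultimodalInput(1024, 4096)(vit_features, llm.get_input_embeddings()(input_ids))
# logits = llm(inputs_embeds=fused).logits
```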
Cross-Modal Attention Approaches: Many multimodal LLMs augment a text-centric model by adding dedicated vision or audio modules that feed into the language model. A typical design is a dual-encoder or encoder-decoder fusion: e.g. an image passes through a visual encoder (often a convolutional net or Vision Transformer) to produce feature embeddings, and these embeddings are then “integrated… within the [LLM’s] attention layers” via learned projection and cross-attention heads (Understanding Multimodal LLMs). DeepMind’s Flamingo architecture (2022) established a template: it kept a large language model (a frozen pretrained decoder) and added Perceiver Resampler layers that ingest image features and output a small set of latent vectors, which are then injected into the LLM’s layers through cross-attention at each timestep. The open-source IDEFICS model (Hugging Face, 2023) follows this design. IDEFICS uses a vision transformer (OpenCLIP ViT) to encode images, then applies Flamingo-style cross-attention so the LLM can attend to visual context interleaved with text (Demystifying Multimodal LLMs). In essence, the language model “sees” by querying the image representation through attention instead of directly receiving image tokens. Cross-modal architectures often result in a two-stream model: one path for language, one for vision (or audio), merging at defined points. This can be thought of as early fusion vs. late fusion in the network – some models inject visual features right at the first transformer layer (early fusion), while others let the text model process some tokens before mixing in visual information at a middle layer (later fusion) . The advantage of this approach is that it can leverage specialized feature extractors (e.g. a strong vision backbone) and potentially freeze large parts of the language model, reducing the amount of training needed. The downside is complexity: additional parameters and careful design to align the modalities. There’s also a limitation that many such models only generate text outputs – they answer questions about images but cannot produce an image, since the language decoder has no mechanism to output visual tokens in this setup (unlike the unified token approach).
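A minimal sketch of this kind of gated cross-attention fusion (heavily simplified relative to Flamingo, which also adds a Perceiver Resampler and per-layer placement) might look like the following; the zero-initialized gate means training starts from the unchanged language model:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Let (frozen) text hidden states attend to visual features; a learned gate
    initialized at zero means the block starts as a no-op, preserving the LLM."""

    def __init__(self, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> no visual influence at init

    def forward(self, text_hidden: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, hidden_dim); visual_feats: (batch, vis_len, hidden_dim)
        attended, _ = self.cross_attn(query=text_hidden, key=visual_feats, value=visual_feats)
        return text_hidden + torch.tanh(self.gate) * attended  # gated residual fusion
```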
Vision-Language Fusion Strategies: Recent research has explored hybrid architectures that combine strengths of both paradigms. NVIDIA’s NVLM 1.0 (2024) provides a comparative study – they implemented one model with a decoder-only approach (similar to LLaVA/LLaMA) and one with a Flamingo-style cross-attention, and found each had pros and cons. Based on this, they propose a “novel architecture that enhances both training efficiency and multimodal reasoning capabilities” by melding ideas from both ( NVLM: Open Frontier-Class Multimodal LLMs). While details are beyond our scope, NVLM introduced techniques like a “1-D tile-tagging” scheme for handling high-resolution images (breaking an image into patches tagged with spatial metadata for better OCR and spatial reasoning) . This reflects a broader trend: architecture search for multimodality. Rather than simply stacking a vision encoder onto an LLM, new designs add hierarchical fusion layers or experts to handle different aspects of multimodal understanding (spatial relationships, fine-grained details, etc.). The goal is a flexible architecture that can reason as coherently as unified models, yet remain efficient to train like the two-stream models. We can expect more “frontier-class” architectures in 2025 that push this integration further (Noteworthy LLM Research Papers of 2024), potentially incorporating not just vision and audio, but other modalities like video, 3D spatial data, or sensor inputs into a single coherent model.
Data Processing Pipeline Integration
Extending LLMs to new modalities requires careful design of the data pipeline – how raw text, images, and audio are preprocessed into a form the model can consume. Each modality has distinct preprocessing steps, but they must ultimately produce compatible representations (tokens or embeddings) that can be merged for joint modeling.
Text: For traditional LLMs, text is tokenized into subword units (using BPE, WordPiece, etc.) and then mapped to vector embeddings. This remains the case in multimodal LLMs – the text pipeline doesn’t fundamentally change. A sentence like “A cat on a chair” might become tokens
[A] [cat] [on] [a] [chair]
and then 768-dimensional embedding vectors if the LLM uses that hidden size. Modern LLMs can handle long sequences (OpenAI’s GPT-4, Meta’s LLaMA 3, etc., support contexts of 100k+ tokens), so text inputs may include lengthy descriptions or transcripts alongside other modalities (A Dizzying Year for Language Models: 2024 in Review). One challenge is ensuring that when text is combined with images or audio, special tokens or delimiters mark modality boundaries – e.g. using a reserved [IMAGE] token to indicate that an image embedding follows.
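For example, with a Hugging Face tokenizer (GPT-2’s BPE tokenizer is used here purely for illustration, and the [IMAGE] placeholder is a hypothetical special token), the text side of the pipeline looks like:

```python
from transformers import AutoTokenizer

# Any subword tokenizer works the same way; GPT-2's BPE tokenizer is just an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical special token marking where image embeddings will be spliced in.
tokenizer.add_special_tokens({"additional_special_tokens": ["[IMAGE]"]})

ids = tokenizer("[IMAGE] A cat on a chair")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))

# The model's embedding table then maps each id to a hidden-size vector (e.g. 768-d for GPT-2).
# After adding tokens, remember to resize the embeddings: model.resize_token_embeddings(len(tokenizer))
```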
Images: There are two common ways to preprocess images for an LLM: (1) encode into continuous embeddings via a vision model, or (2) convert into discrete tokens via a codebook. The first approach is used in models like Flamingo/IDEFICS: pass the image through a pretrained Vision Transformer (ViT) or CNN to get feature maps or patch embeddings. For example, an image can be split into a grid of patches (e.g. 16×16 pixels each), each patch is embedded into a vector (using a linear projection + positional encoding, as in ViT) (Understanding Multimodal LLMs), resulting in a sequence of image feature vectors. A linear projection may further map these image vectors into the same dimension as the text embeddings. At that point, they can either be concatenated with text tokens or fed through cross-attention. The second approach (discrete tokens) is exemplified by AnyGPT and some image generation models: use an image tokenizer (like VQ-VAE or a vector-quantized encoder) to compress the image into a sequence of integer codes. For instance, OpenAI’s DALL-E uses 32×32 = 1024 tokens to represent an image, and Google’s Parti uses a similar idea. AnyGPT similarly tokenizes images “akin to the addition of new languages”, so an image might become a sequence of, say, 256 visual tokens that the model treats like words (AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling). This tokenization can be learned (a VQGAN trained to reconstruct images from codes) and allows the LLM to generate images by outputting those tokens. The trade-off: using continuous embeddings (approach 1) retains rich visual features but doesn’t allow image output, while discrete tokens allow generation but require a good tokenizer and increase sequence length. In either case, images often need downsampling – high-res images might be too large, so pipelines resize images (e.g. to 224×224) before encoding to limit the number of patches or tokens.
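The discrete-token route can be sketched as a nearest-neighbor lookup against a learned codebook (a real system would use a trained VQ-VAE/VQGAN; the function below is a toy illustration):

```python
import torch

def quantize_to_tokens(patch_embeds: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """patch_embeds: (num_patches, dim) continuous features from an image encoder.
    codebook: (vocab_size, dim) learned code vectors (e.g. from a VQGAN).
    Returns integer token ids that an LLM can read - or even generate."""
    distances = torch.cdist(patch_embeds, codebook)   # (num_patches, vocab_size) pairwise distances
    return distances.argmin(dim=-1)                   # nearest codebook entry per patch

# e.g. a 16x16 grid of patches becomes 256 "visual words" appended to the LLM's vocabulary.
```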
Audio: Raw audio (waveforms) is effectively a 1D time-series, which can be very long. A common preprocessing step is to convert audio to a spectrogram or Mel-frequency cepstral coefficients – essentially turning it into an “image” representing frequency over time – which can then be processed by a CNN or transformer similarly to how images are processed. Alternatively, like with images, audio can be discretized into tokens. Recent work on neural audio codecs (e.g. SoundStream, Encodec) produces sequences of tokens representing audio waveforms (MLLM Tutorial). For example, Meta’s Encodec (2022) can represent 1 second of audio as 50 tokens. An LLM could ingest these tokens to handle audio input or output. An approach described as a “Unified Speech Tokenizer” was introduced by Zhang et al. (2023) to enable LLMs to generate spoken responses by outputting tokens that a vocoder can convert to speech . In practice, many systems sidestep raw audio handling by leveraging speech-to-text and text-to-speech models. For instance, OpenAI’s ChatGPT voice mode in 2024 uses Whisper (ASR) to convert user speech to text for the LLM, and a TTS system to speak back the LLM’s answer – effectively grounding audio into text rather than truly processing audio signals. However, research prototypes like SpeechGPT have shown LLMs directly taking audio feature inputs and producing text or audio outputs within one model . In summary, audio data either enters the LLM pipeline as transcribed text (simple but indirect) or as acoustic feature tokens/embeddings (direct but requiring specialized preprocessing).
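A minimal example of the spectrogram route using torchaudio (the file name and parameter values are illustrative):

```python
import torch
import torchaudio

# Load a waveform and convert it to an 80-bin log-mel spectrogram, which an audio
# encoder (CNN or transformer) can process much like an image.
waveform, sample_rate = torchaudio.load("speech.wav")            # (channels, num_samples)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=80
)(waveform)
log_mel = torch.log(mel + 1e-6)                                   # (channels, 80, num_frames)
# Each frame (or a small stack of frames) then becomes one "audio token" embedding for the LLM.
```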
Modality Integration: After individual preprocessing, the modalities must be combined for the model. In a unified architecture, this often means concatenating token sequences with special separator tokens. For example, a multimodal training sample might form:
[ImageTokens] <|image|> [TextTokens]
where <|image|> is a marker the model learns to interpret. In cross-attention models, the integration happens inside the model: the image encoder produces embeddings, and at certain transformer layers the text queries attend to those image embeddings. Either way, an important part of the pipeline is alignment – ensuring that the model knows which text corresponds to which image or audio. Datasets of captioned images or narrated videos are used for this. Some pipelines explicitly pair each image token with the nearby text tokens that describe it (through positional encodings or cross-modal attention masks). Others rely on training the model with many image-text examples so it implicitly learns alignment. Recent large-scale datasets like OBELICS (introduced with IDEFICS) contain “141 million interleaved image-text documents” scraped from the web to teach models to handle interleaved multimodal inputs (Demystifying Multimodal LLMs). In practice, data preprocessing is often the most engineering-heavy aspect: images might need cropping, normalization, and augmentations; audio might need volume normalization or trimming; and all modalities must be synchronized if they appear together (e.g. an audio description corresponding to a specific image frame).
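As a toy illustration of such interleaving (the helper function and marker id below are hypothetical; a production pipeline would also track attention masks and image positions):

```python
import torch

IMAGE_MARKER_ID = 32000  # hypothetical id reserved for the <|image|> separator token

def build_interleaved_inputs(segments, tokenizer, embed_text, embed_image):
    """segments: list of ("text", str) or ("image", tensor) items in document order.
    Returns a single embedding sequence with an <|image|> marker before each image."""
    pieces = []
    for kind, content in segments:
        if kind == "text":
            ids = torch.tensor(tokenizer(content)["input_ids"])
            pieces.append(embed_text(ids))                          # (num_tokens, hidden)
        else:  # "image"
            marker = embed_text(torch.tensor([IMAGE_MARKER_ID]))    # (1, hidden)
            pieces.append(torch.cat([marker, embed_image(content)], dim=0))
    return torch.cat(pieces, dim=0)  # one interleaved sequence for the LLM
```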
To illustrate two ends of the pipeline design spectrum: LENS (2023) is a framework that does not train a unified model at all – instead, it processes each modality with separate models and only combines at the final stage. Given an image, LENS “extracts rich textual information using existing vision modules (CLIP for tags, BLIP for captions)”, essentially converting the image to a handful of keywords and a caption. These textual descriptors are then fed into a frozen LLM (like GPT-4 or LLaMA) which performs the reasoning and generates an answer or description. This is an extreme case of treating image preprocessing as feature extraction for a text-only model. On the other hand, a model like Kosmos-2 (Microsoft, 2023) takes a more integrated pipeline: it can accept an image in the middle of a chat prompt, and the training data includes annotations where an <image> token in the text corresponds to actual image pixels, so the model learns to jointly attend to text and image patches (MLLM Tutorial). Both pipelines need to eventually produce a single combined sequence or context that the model can work with. The choices made at the data preprocessing stage (discrete vs continuous tokens, early conversion to text vs direct use of features) significantly influence the model’s capabilities and training complexity. The trend in 2024 is toward unification – processing all modalities in a more homogeneous way so that one model can handle them – but with pragmatic tweaks like using powerful pretrained encoders for images/audio to jump-start the process.
Training Strategies for Multimodal Alignment
Designing an architecture and preprocessing pipeline is only half the battle – the model must be trained to align and jointly understand multiple modalities. This requires carefully devised training objectives and curricula. We highlight several key training strategies: contrastive alignment, multistage training (pre-train & fine-tune), cross-modal attention tuning, and instruction tuning with multimodal data.
Contrastive Learning for Cross-Modal Alignment: A fundamental technique to tie together different modalities is contrastive learning, as popularized by CLIP (2021) which aligned images and text by making matching image-caption pairs have high embedding cosine similarity. Many multimodal LLM pipelines incorporate a contrastive phase to initialize encoders. For example, the visual encoder in IDEFICS is OpenCLIP – a public image-text model already trained to map images and their captions into a joint embedding space (Demystifying Multimodal LLMs). By starting with a CLIP backbone, the LLM is given a head start: the image features it sees are already somewhat aligned with text semantics (e.g. CLIP’s image embedding for a cat photo is near the embedding of the word “cat”). This greatly eases subsequent fusion. Some works extend contrastive alignment to audio-text (e.g. Meta’s ImageBind project created a joint embedding space for image, text, audio, and even sensor data by binding each modality to a common representation (MLLM Tutorial)). The goal is a shared latent space where modality invariances hold – a caption and an image or an audio clip of the same event should all map to similar vectors. During 2024, such approaches have been scaled up to enable zero-shot transfer between modalities (for instance, binding depth maps and thermal images to the same space via language as the intermediary). In training an LLM, contrastive losses are often used in a pre-training stage: before (or in parallel to) the main generative training, the model may be asked to produce embeddings for modalities and match or distinguish pairs. This can be applied at the encoder level (image encoder vs text encoder) or within a unified model (e.g. train the model such that the hidden state after an <image> token is close to the hidden state of the image’s caption). The literature shows that contrastive pre-training significantly improves multimodal representation quality, and many state-of-the-art M-LLMs still rely on it as a preliminary step (AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling).
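For reference, a CLIP-style contrastive objective over a batch of paired image and text embeddings reduces to a few lines (simplified; real training uses a learned temperature and very large batches):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds, text_embeds: (batch, dim), one matching pair per row.
    Matching pairs are pulled together; all other pairings in the batch are pushed apart."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)  # the diagonal holds the true pairs
    # Symmetric cross-entropy over the image->text and text->image directions.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```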
Two-Stage (Pre-train then Fine-tune) Training: Given the immense cost of training giant multimodal models from scratch, a common recipe is two-stage: first pre-train on a broad dataset with self-supervised objectives, then fine-tune on downstream tasks or instructions. In the context of multimodal LLMs, pre-training often involves predicting masked tokens in text and images (multimodal masked language modeling) or next-token prediction on interleaved data, possibly combined with contrastive objectives as above. The fine-tuning stage typically involves instruction tuning or supervised learning on high-quality QA pairs, captions, etc. A clear example is BLIP-2 (2023) – it first learned a Q-Former (a small transformer) to connect a frozen CLIP image encoder with a frozen GPT-variant language model by training on image-text pairs, and then it fine-tuned the whole system on tasks like VQA (Visual Question Answering). By keeping most of the model frozen initially, BLIP-2 drastically reduced training cost and avoided instability. Another example, from 2024, is AnyGPT’s training procedure: after building the unified tokenization pipeline, they “build a multimodal text-centric dataset for multimodal alignment pre-training” – essentially ensuring the LLM’s embedding space can represent different modalities well – and then “synthesize a large-scale any-to-any multimodal instruction dataset (108k samples of multi-turn conversations)” to fine-tune the model to actually follow prompts mixing modalities. This mirrors the approach used in open models like LLaVA, MiniGPT-4, and others: first do vision-language pretraining, then do instruction tuning. The pretraining might be on millions of image-caption pairs (to learn generic grounding), while instruction tuning might involve a smaller curated set of dialogues (often GPT-4 generated) where the assistant is taught to follow human instructions about images. This strategy has been very successful – for instance, MiniGPT-4 (2023) showed that even a single epoch of fine-tuning on 3,500 high-quality multimodal instruction examples can unlock strong chat abilities in a pretrained vision-language model. The key insight is that general perception and task-specific instruction following can be learned in sequence. By 2025, we see nearly all multimodal LLMs fine-tuned with some form of instruction or conversational data, often with human or AI-generated dialogues that say things like “User: (image) What’s happening here? Assistant: …”.
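An individual record in such an instruction-tuning set is essentially a conversation with an image placeholder; the snippet below shows a representative (hypothetical) format – field names vary across datasets such as LLaVA’s:

```python
# One hypothetical multimodal instruction-tuning record (field names vary across datasets).
sample = {
    "image": "images/000123.jpg",
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is happening in this picture?"},
        {"role": "assistant", "content": "A dog is leaping to catch a frisbee in a park."},
    ],
}
# During fine-tuning, the image is injected at the <image> placeholder and the loss is
# usually applied only to the assistant's tokens.
```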
Cross-Modal Attention Tuning and Freezing: Training very large models on multimodal data can be unstable – language models might forget linguistic skills if not handled carefully, and gradients from image tokens might disrupt language representations. One strategy to mitigate this is freezing subsets of parameters (like the majority of the LLM) and only training newly introduced multimodal parameters (like a vision encoder, or the cross-attention projections). For example, Flamingo fixed the entire 70B language model and only trained the gating of visual features and a small set of added parameters, requiring a high-quality multimodal dataset but far fewer updated weights. LoRA (Low-Rank Adaptation) techniques have also been applied: instead of full fine-tuning, insert low-rank adapters for multimodal alignment – this was done in some 2023 works to cheaply add vision to LLaMA by training a LoRA on image-text data. The Visual Instruction Tuning approach (Liu et al., 2023) found that starting from a strong pretrained LLM and fine-tuning on image-grounded instruction data produces better results than training from scratch (MLLM Tutorial). Empirically, many 2024 models maintain language model weights largely intact (ensuring they keep their vast knowledge and fluency) and focus on adjusting the new modality interfaces. This results in faster convergence and also allows reusing the same LLM backbone for different variants (text-only, text+image, etc.). Cross-modal attention modules themselves can be specialized – some research introduces gated cross-attention that can be toggled on/off, or uses modality-specific prefixes (learned prefix tokens for each modality) to condition the model without altering its core weights. All these tricks fall under making the training efficient and stable, given limited resources. In one study, Li et al. (2024) introduced CuMo, which adds Mixture-of-Expert layers to the vision encoder of a multimodal model. By doing so, they improved scaling during training (experts increase capacity) while keeping inference costs similar (since only a few experts activate per input) (NeurIPS Poster CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts) . This kind of innovation in training seeks to achieve bigger effective models without a proportional increase in computation.
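In code, this freezing recipe is straightforward (module names below are placeholders for whatever LLM, vision encoder, and connector a given project uses):

```python
import torch

# Freeze the pretrained backbones; train only the new multimodal interface.
# `llm`, `vision_encoder`, and `connector` are placeholders for the actual modules.
for p in llm.parameters():
    p.requires_grad = False
for p in vision_encoder.parameters():
    p.requires_grad = False

trainable = list(connector.parameters())          # e.g. projection layer or cross-attention blocks
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(f"Updating {sum(p.numel() for p in trainable):,} parameters "
      f"out of {sum(p.numel() for m in (llm, vision_encoder, connector) for p in m.parameters()):,}")
```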
Joint Multimodal and Unimodal Training: An insight from NVIDIA’s NVLM work is that including unimodal (pure text) training data alongside multimodal data can be highly beneficial ( NVLM: Open Frontier-Class Multimodal LLMs). They found that “dataset quality and task diversity are more important than scale” during multimodal pretraining . In practice, NVLM mixed a high-quality text-only corpus into the training of their multimodal model, and observed that not only did the model retain its language skills, it actually improved on text-only benchmarks compared to the original LLM . This is a notable result: early multimodal models sometimes showed regressions on pure language tasks (since they devote capacity to other modalities or overfit to smaller multimodal data). The latest training strategy is to use massive, diverse multitask training – combine text corpora, image-caption data, OCR datasets, video transcripts, audio transcripts, math problems with images, etc., so the model learns a broad spectrum of tasks. Google’s PaLM-E (2023) for robotics, for instance, was trained on web text, image-caption pairs, and robot demonstration data, all in one model, enabling it to handle instructions like “pick up the green object to the left of the red bowl” by linking vision and language understanding. In 2024, we also see a focus on multimodal chain-of-thought training: giving models rationales that involve multiple modalities (e.g. explaining an answer by referencing the image). This requires new datasets and careful loss functions (sometimes using causal language modeling on a concatenated input where modalities and rationales are all linearized).
Overall, training a multimodal LLM is a balancing act between alignment (making sure modalities connect correctly) and retention (not losing the strong abilities of the base LLM). Techniques like contrastive pretraining, staged training (first align, then instruct), parameter-efficient tuning, and curated data mixes have become the de facto approach. As one 2024 review succinctly put it, thanks to these strategies “LLMs can natively handle non-textual sources of information such as images and audio… it is both simpler and more effective to process the input documents as they are rather than converting them to purely text” (A Dizzying Year for Language Models: 2024 in Review). In other words, we train the models to adapt to the data, instead of force-fitting the data to the models.
Computational Challenges
Extending LLMs to multimodality comes with significant computational challenges, both during training and inference. These arise from increased input sizes (images/audio are high-bandwidth data), additional model components, and the need for efficient hardware utilization. We discuss a few key challenges: memory and sequence length scaling, model size and training cost, and latency/throughput considerations.
Explosion of Input Size: A single image of moderate resolution (e.g. 224×224) might be represented by hundreds of patches or tokens. A 5-second audio clip sampled at 16kHz contains 80,000 samples; even a heavily compressed token representation could be thousands of tokens. When these are added to an LLM’s context, the sequence length grows dramatically. For example, an LLM prompt that included 5 images each represented by 256 tokens would have an extra 1280 tokens – equivalent to a few pages of text – just from images. This stresses GPU memory and attention computation. Standard self-attention cost grows quadratically with sequence length, so long multimodal contexts can slow inference or require splitting into chunks. One solution is using models optimized for long contexts (some research models use efficient attention mechanisms or chunked processing to handle sequences with image+text up to 50k or more tokens). Meta’s LLaMA 3.2 models maintain a 128k token context window even for multimodal versions (Introducing Llama 3.2 models from Meta in Amazon Bedrock: A new generation of multimodal vision and lightweight models | AWS News Blog), which is huge – they rely on FlashAttention and other memory-efficient attention implementations to make this feasible on hardware like GPUs. Still, processing multiple modalities together often requires large-memory GPU instances. A recent benchmark showed that many existing multimodal LLMs “deteriorate significantly” in performance when asked to handle multiple images beyond a small number, due to memory constraints or lack of optimization (LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture). Researchers from CUHK addressed this by creating LongLLaVA (Wang et al., 2024), a variant of LLaVA optimized for multi-image input. They implemented a hybrid architecture with memory-efficient attention (a mix of recurrent Mamba blocks and standard Transformer blocks) and a progressive training schedule to stabilize learning with lots of images. The result was the ability to “process nearly a thousand images on a single A100 80GB GPU” (with 8-bit quantization) while maintaining good throughput. This is an extreme case, but it demonstrates the point: specialized engineering (both architectural and low-level optimization) is needed to handle large multimodal contexts without running out of memory or time.
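A back-of-the-envelope calculation makes the quadratic blow-up concrete (numbers are illustrative):

```python
# Rough cost of adding images to the context: tokens grow linearly with the number of
# images, but self-attention work grows quadratically with total sequence length.
text_tokens = 1000
tokens_per_image = 256
for num_images in (0, 5, 50):
    seq_len = text_tokens + num_images * tokens_per_image
    relative_attn_cost = seq_len ** 2 / text_tokens ** 2
    print(f"{num_images:2d} images -> {seq_len:5d} tokens, ~{relative_attn_cost:.0f}x attention cost")
# 5 images -> 2280 tokens (~5x); 50 images -> 13800 tokens (~190x the text-only cost).
```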
Model Size and Scaling: Multimodal LLMs tend to be large – you essentially combine a language model with (potentially) a vision model and more. If a language model has 70B parameters and a vision transformer has 2B, a naive combination might be 72B. Training such a model end-to-end is enormously costly in compute and data. One challenge is that simply increasing parameters might not yield proportional gains on multimodal tasks, especially if those extra parameters are not efficiently utilized. A notable approach to scaling came in the form of Mixture-of-Experts (MoE) layers applied to multimodal models. Li et al. (2024) added sparse MoE blocks to both the vision encoder and the connecting layers of a V+L model, creating CuMo (NeurIPS Poster CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts). During training, multiple expert subnetworks are trained, but at inference only the top-k experts (by a gating function) are activated for a given input. This means the model can have a very high parameter count (since many experts exist) but inference cost remains similar to a smaller model (only k experts’ weights are used) . CuMo was shown to “outperform state-of-the-art multimodal LLMs across various VQA and visual-instruction benchmarks within each model size group, all while training exclusively on open-source datasets” . This indicates that smart sparsity is a key to scaling: instead of one monolithic dense model, use conditional computation to allocate capacity where needed. Another scaling challenge is training speed – images are heavier to load and augment compared to text. Projects like PaLI (Google’s vision-language model) and others mentioned that I/O and data pipeline can be a bottleneck when training on billions of image-text pairs. Solutions involve using faster storage, caching datasets in memory, and doing on-the-fly tokenization on GPUs. There’s also active research on multimodal scaling laws – understanding how performance scales with model size and data size. It’s not a given that doubling parameters always helps multimodal tasks; some studies in 2024 observed that after a certain point, adding more images to training yielded diminishing returns compared to adding more diverse tasks or higher quality data ( NVLM: Open Frontier-Class Multimodal LLMs). This shifts the focus from raw scaling to efficient scaling: getting the most out of each parameter through training techniques like MoE, and using data that maximally informs the model.
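The heart of such a sparse MoE layer is a small router that activates only the top-k experts per token; the sketch below is a simplified illustration (production implementations add load-balancing losses, capacity limits, and fused kernels):

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Route each token to its top-k experts; only those experts run for that token."""

    def __init__(self, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(),
                          nn.Linear(4 * hidden_dim, hidden_dim))
            for _ in range(num_experts)])
        self.router = nn.Linear(hidden_dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (num_tokens, hidden_dim)
        scores = self.router(x).softmax(dim=-1)           # (tokens, experts) routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep only the top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```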
Hardware Utilization: Because of their size and data complexity, multimodal LLMs push the limits of hardware. During training, one must distribute models across multiple GPUs (or TPUs). Techniques like pipeline parallelism (different layers on different devices) and tensor parallelism (sharding matrix operations across devices) are standard for large LLMs; these still apply, but the presence of a vision encoder might introduce an imbalance (for example, one part of the pipeline might be doing heavy CNN computations while another does transformer work). Frameworks such as PyTorch TorchMultimodal (released in beta by Meta) provide building blocks to ease this – e.g. enabling mixed parallelism and optimized kernels for combined image-text batches (Scaling Multimodal Foundation Models in TorchMultimodal with Pytorch Distributed | PyTorch). Even inference needs optimization: serving a multimodal LLM is more complex than a text one because you may have to run an image encoder (which could be a large ViT) in addition to the LLM decoding. To reduce latency, one strategy is to precompute and cache embeddings. For instance, if an application often queries the same image, the system can store the image’s embedding so that subsequent queries only run the (faster) language model part. Another technique is quantization: representing model weights in 8-bit or 4-bit precision. Many open multimodal models (e.g. LLaVA, Vicuna-Vision variants) are released with INT8 quantization to allow users to run them on consumer GPUs with limited memory. Researchers have shown you can quantize vision and language models with minimal loss in accuracy, which is a big win for deployment (A Dizzying Year for Language Models: 2024 in Review).
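A simple example of the caching idea: memoize image embeddings keyed by a content hash so that repeated queries about the same image skip the vision encoder entirely (the encoder object here is a placeholder):

```python
import hashlib
import torch

_embedding_cache: dict[str, torch.Tensor] = {}

def get_image_embedding(image_bytes: bytes, vision_encoder) -> torch.Tensor:
    """Return cached embeddings when the same image is queried again,
    so only the (much cheaper) language-model pass runs on repeat queries."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _embedding_cache:
        with torch.no_grad():
            _embedding_cache[key] = vision_encoder(image_bytes)  # run the expensive encoder once
    return _embedding_cache[key]
```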
A specific inference challenge is real-time processing for modalities like audio or video. An LLM generating text can do so streamingly, token by token. But if the prompt is a live audio stream (say, a user speaking to the AI assistant), the model either needs to handle streaming input or the audio must be buffered until a chunk is ready to send. Solutions include chunking audio into short windows and processing each with overlaps (similar to how real-time ASR systems work), or using a separate streaming ASR and feeding text in incrementally. Similarly for video, processing every frame with a giant model is infeasible, so systems sample keyframes or use a smaller vision model to filter information for the LLM. An example is the use of a visual encoder with frame selection before feeding an LLM for video question answering tasks (MLLM Tutorial) .
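A typical chunking scheme for streaming audio can be sketched as follows (window and overlap lengths are illustrative):

```python
def chunk_audio(samples, sample_rate=16000, window_s=5.0, overlap_s=1.0):
    """Yield overlapping windows of a long waveform so each chunk can be
    encoded or transcribed incrementally instead of buffering the whole stream."""
    window = int(window_s * sample_rate)
    step = int((window_s - overlap_s) * sample_rate)
    for start in range(0, len(samples), step):
        yield samples[start:start + window]
```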
In summary, the computational load of multimodal LLMs has driven innovations in software and hardware optimization: from advanced distributed training techniques to new model sparsity methods. With each new generation of hardware (GPUs like NVIDIA H100 or Google TPU v5) and libraries (DeepSpeed, FasterTransformer, etc.), models that handle more modalities or higher resolution data become tractable. The community is increasingly focused on not just making these models large but also efficient – for example, achieving the same accuracy with a 13B multimodal model that a naive approach would need a 100B model for. This is crucial for real-world use, where inference might need to happen in real-time on edge devices. Meta’s release of 1B and 3B parameter LLaMA 3.2 models for mobile/edge use (text-only, but presumably a hint at future small multimodal models) shows an eye towards efficiency (Introducing Llama 3.2 models from Meta in Amazon Bedrock - AWS) . As one AWS expert noted during LLaMA 3.2’s launch, these models “are designed to be more efficient for AI workloads, with reduced latency and improved performance, making them suitable for a wide range of applications” – a statement that certainly applies to multimodal workloads as well.
Inference Optimizations
When deploying multimodal LLMs, various techniques can optimize inference speed and accuracy. We will discuss early fusion vs. late fusion approaches, and the use of retrieval-augmented generation (RAG) in multimodal contexts.
Early Fusion vs. Late Fusion: Early fusion refers to combining modalities at the input level or in the first layers of the model, so the model processes them jointly from the start. Late fusion means each modality is processed mostly separately (by specialized models or branches), and the outputs are combined at a later stage (e.g. combining final representations or outputs). Early fusion is what end-to-end multimodal LLMs do – e.g. GPT-4 with vision immediately conditions on image content within its first transformer layers. This can yield very coherent understanding, since the model can intertwine information from modalities at a low level. However, early fusion requires running the full large model for every query, and if you only have a hammer (the big model), every query – even a simple one – incurs the same cost. Late fusion pipelines can be more efficient for certain tasks. A prime example we discussed is the LENS framework (Demystifying Multimodal LLMs). In LENS, the heavy lifting of image analysis is offloaded to pre-existing vision models (CLIP and BLIP) which are relatively fast and can even be reused across queries. The LLM (like GPT-3.5 or 4) then only processes text – a modality it’s highly optimized for – which is a lighter task after the image has been summarized in words. This is a form of late fusion: the fusion happens in conceptual space (both modalities are converted to text before combining). As a result, LENS achieved “competitive performance comparable to Flamingo and BLIP-2, despite not being explicitly trained to handle both images and text” . This zero-shot capability is impressive and useful: if one cannot fine-tune a giant multimodal model, one can still compose existing models to get a multimodal system. The trade-off, of course, is that errors in the first stage (vision analysis) propagate. In LENS, they noticed if CLIP or BLIP gave an incorrect caption (say, “three cats in the picture” vs “five cats in the picture”), the frozen LLM cannot fix that – it can only rely on what it was given . Thus, late fusion systems might be brittle if the intermediary is imperfect.
From an optimization perspective, late fusion can be very efficient in a modular pipeline setting. You could have a dedicated GPU running the vision encoder for all incoming images, and a separate server running the LLM for text, communicating via a text interface. This modularity also makes updating components easier (you could swap in a better image captioner without retraining the LLM). We see this idea in production systems like Bing’s multimodal question answering: an image you upload is processed by a vision model to extract text (OCR) and tags, then a prompt is constructed for an LLM that includes those results. Essentially, it’s a carefully orchestrated late fusion ensuring the LLM sees all relevant info in textual form. Another late fusion variant is ensemble fusion, where separate models make predictions that are then combined (e.g. one model generates an image caption, another independently answers a question using the caption, and a simple algorithm reconciles multiple answers). However, ensembles are less common due to complexity and latency.
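Such a late-fusion pipeline can be composed from off-the-shelf components in a few lines; the sketch below is illustrative and not the original LENS implementation – the captioning model, LLM, and prompt wording are all swappable assumptions:

```python
from transformers import pipeline

# Stage 1: a vision specialist turns the image into text.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("photo.jpg")[0]["generated_text"]

# Stage 2: a frozen text-only LLM reasons over the extracted description.
prompt = (f"Image description: {caption}\n"
          f"Question: How many animals are in the picture?\n"
          f"Answer:")
llm = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
print(llm(prompt, max_new_tokens=64)[0]["generated_text"])
```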
Retrieval-Augmented Generation (RAG): RAG is a strategy where the model isn’t required to have all knowledge in its weights; instead, it retrieves relevant external data (texts, images, etc.) from a knowledge base at inference time to help generate the answer (A Dizzying Year for Language Models: 2024 in Review) . This approach has proven powerful for text LLMs (improving factual accuracy by retrieving documents). For multimodal models, RAG opens up some new possibilities and optimizations:
Retrieving Text for Images: Suppose a user shows a photo of a historical monument and asks a question. A multimodal LLM might identify the monument in the image (say, “this is the Colosseum in Rome”), but it might not know all historical facts about it. Rather than making the multimodal model extremely large to encode all Wikipedia knowledge, a RAG system can take the identified object (“Colosseum”) as a query to a text database (like Wikipedia or a vector store of documents) and retrieve a passage about it. This passage (now text) can be fed into the LLM along with the user’s question to generate a detailed answer. This retrieval step is modality-aligned: the model connected visual recognition to a text knowledge base. Recent systems (e.g. ColBERT-QA or MMR in 2024) have done exactly this – combining a vision model with a text retriever to answer knowledge-intensive VQA. One example referenced in a 2024 review is “multimodal retrieval models like ColPali” that assist multimodal LLMs. ColPali (2024) is a retrieval system that indexes both images and text, enabling queries that span modalities.
Retrieving Images for Text: Conversely, if a user asks an LLM about an image that it hasn’t seen, retrieval can help there too. For instance, Visual ChatGPT could retrieve similar images from a database to use as exemplars. If someone describes an object via text, the system might retrieve an image of that object to help ground the LLM’s response (especially if the LLM has a vision module that can process that retrieved image). This is less common but conceptually possible – a sort of “reference lookup.”
Unified Multimodal Embeddings for Retrieval: One approach to RAG is to embed all information into the same vector space, regardless of modality (An Easy Introduction to Multimodal Retrieval-Augmented Generation | NVIDIA Technical Blog). NVIDIA’s 2024 technical blog describes two main approaches: “(1) embed all modalities into the same vector space, (2) ground all modalities into one primary modality” for retrieval . The first approach uses a model (like CLIP or its extensions) to generate embeddings such that, say, an image of a cat and the text “cat” end up nearby in the vector space. Then one can index a multimodal datastore (images with their captions, or audio with transcripts, etc.) in one index. A query – which could be an image or text or audio – is embedded into that space and nearest neighbors are retrieved irrespective of modality. For example, a system could store a bunch of images and their descriptions as vectors; if the user provides an image query, the system finds similar images and obtains their descriptions to help the LLM answer. The second approach (grounding into one modality) is basically what LENS does (convert everything to text and use text search, or convert everything to images and use image similarity search). Most practical implementations use text as the grounding modality because we have mature text search tools. A 2024 Medium guide on Multimodal RAG notes that a common method is to “convert images to text (via captioning or OCR) then use a text vector database,” leveraging the power of language models to handle the retrieved text (LlamaIndex Newsletter 2024-10-01).
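A compact sketch of the shared-embedding-space approach using CLIP (the corpus, query image, and retrieval-by-dot-product are toy placeholders for a real vector database):

```python
import torch
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Index a toy corpus of text passages in the shared image-text space.
passages = ["The Colosseum is an ancient amphitheatre in Rome.",
            "The Eiffel Tower was completed in 1889 in Paris."]
text_inputs = processor(text=passages, return_tensors="pt", padding=True)
passage_embeds = torch.nn.functional.normalize(model.get_text_features(**text_inputs), dim=-1)

# Embed an image query into the same space and retrieve the closest passage,
# which is then added to the LLM prompt (the RAG step).
image_inputs = processor(images=Image.open("query.jpg"), return_tensors="pt")
query_embed = torch.nn.functional.normalize(model.get_image_features(**image_inputs), dim=-1)
best = (query_embed @ passage_embeds.T).argmax().item()
print("Retrieved context:", passages[best])
```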
In terms of efficiency, RAG can significantly cut down the needed model size or internal knowledge. Instead of a 100B parameter model trying to recall niche facts, a 7B model with retrieval can often do better by reading from a large text corpus (A Dizzying Year for Language Models: 2024 in Review). It also allows updating information without retraining – update the knowledge base, not the model. For multimodal, this is crucial because visual data can be very detailed; it’s unrealistic to bake in all possible visual facts (like every species of plant or every product model number). But with retrieval, a multimodal assistant can, for example, see a plant leaf, match it to a similar leaf image in a botanical database, get the species name from that, and answer the user. All steps except the initial query embedding and final answer generation happen outside the core model, making the system more scalable.
One concrete use-case from 2024: Google’s Cloud AI introduced a Multimodal Embeddings API that developers can use to generate embeddings for text, images, and video in the same space (See the Similarity: Personalizing Visual Search with Multimodal Embeddings - Google Developers Blog). They demonstrated a visual search where a query image of an outfit could be used to find text descriptions of similar outfits, enabling a shopping assistant to recommend items . This kind of service essentially provides the building blocks for multimodal RAG – the heavy lifting (embedding computation) is done by a model provided via API, and the developer can build a vector index to enable fast similarity search.
Efficiency Considerations: Early fusion (end-to-end models) may require running a huge model even for simple tasks, but it often yields the most accurate and contextually nuanced results. Late fusion and RAG introduce additional steps (which themselves have a cost, e.g. running a separate captioning model or a vector search), but these steps can be optimized independently and scaled out. A typical multimodal pipeline might even combine approaches: e.g. use late fusion as a first-pass (fast) and fall back to the full multimodal LLM (slow but thorough) if the query is complex. Such cascading systems are not yet common in published research but are a logical next step for industrial applications where cost matters.
In summary, inference optimizations for multimodal LLMs revolve around structuring the problem in clever ways: either break it into smaller pieces handled by specialist models (late fusion pipelines), or augment the LLM with external knowledge (RAG) so it doesn’t have to do everything itself. These techniques can dramatically improve throughput and also help with accuracy (since retrieving a grounded fact is often more reliable than the LLM guessing). As multimodal systems move from demo to deployment, these optimizations are becoming essential for delivering results within acceptable latency and resource limits.
Latest Research and Industry Insights (2024–2025)
The past two years have seen rapid advancements in multimodal LLM research, accompanied by strong industry interest. Here we summarize some of the most relevant studies from 2024 and early 2025 and highlight insights from both research papers and industry (including framework developers and AI companies).
Open-Source and “Frontier” Multimodal Models: A notable trend is the release of large multimodal models with open access. In 2023, Hugging Face’s team introduced IDEFICS 80B, the first open multimodal LLM at that scale (Demystifying Multimodal LLMs). IDEFICS reproduced DeepMind’s Flamingo approach using LLaMA and OpenCLIP, and matched the closed-source Flamingo on vision-language benchmarks (Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model). Similarly, Alibaba released Qwen-VL (2023) and Qwen-VL-Chat, multimodal versions of their Qwen LLM, demonstrating strong performance on image understanding tasks. These open models empower the community to experiment without needing to rely on restricted models like GPT-4V. On the frontier side, NVLM (NVIDIA, late 2024) stands out. NVLM-1.0 is described as “a family of frontier-class multimodal LLMs” that achieve state-of-the-art results, rivaling models like GPT-4 Vision (NVLM: Open Frontier-Class Multimodal LLMs). Importantly, NVLM provides a head-to-head comparison of the two architectural paradigms: a decoder-only variant vs. a cross-attention variant. Their findings led to a hybrid architecture that capitalizes on both – presumably integrating image tokens directly at some layers and cross-attention at others (though we’ll await their full technical details). NVLM’s results also brought an intriguing insight: after multimodal training, the model’s “text-only performance” actually improved beyond its original text model. This suggests multimodal training (with the right data mix) can regularize or enhance an LLM’s linguistic capabilities, rather than diluting them. It also underscores the importance of high-quality data: NVLM’s authors note that carefully curated multimodal and text data (including a high-quality text corpus added to multimodal training) was key to their success, more so than sheer scale of data.
Any-to-Any Multimodal Interaction: Beyond image-to-text Q&A, researchers are pushing towards models that can flexibly handle any input and output modality. A prime example is AnyGPT (2024) ( AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling). It was trained on a synthetic dataset of multi-turn conversations that “interweave various modalities,” allowing the model to, say, receive an audio question about an image and answer in text, or take text input and produce an audio response . AnyGPT achieved this with the unified token approach, proving that “discrete representations can effectively and conveniently unify multiple modalities within a language model” . In other words, by treating images, audio, and text all as sequences of tokens, a single decoder-only LLM can conduct dialogues that involve image descriptions, audio understanding, and more. While AnyGPT is still a research prototype (with a relatively modest size), it points towards more generalist AI systems. Similarly, we have seen Video-LMs (like Video-LLaMA, 2023) that extend image capabilities to video by feeding sequences of frame features (MLLM Tutorial), and efforts like SpeechGPT which integrate TTS/STT into LLMs for spoken dialogue. The frontier is moving from “multimodal = image+text” to “multimodal = image+text+audio+video+…,” all in one model. This is evident from Microsoft’s Kosmos project: Kosmos-1 (early 2023) was vision+text, Kosmos-2 (late 2023) grounded language models into more modalities and even actions (like controlling robots) . While these were initial forays, we anticipate Kosmos-3 or others in 2024–2025 might further expand the envelope (e.g. adding audio/video). The literature is converging on the idea that a sufficiently large and well-trained model can absorb many modalities – it’s largely a matter of training data and compute.
Framework and Industry Support: On the engineering side, major ML frameworks are actively adding features to support multimodal models. PyTorch’s TorchMultimodal library (announced in late 2022 and improved through 2023) provides prebuilt components for vision-language fusion and recipes to scale foundation models (Scaling Multimodal Foundation Models in TorchMultimodal with Pytorch Distributed | PyTorch). While that blog was from 2022, by 2024 we’ve seen PyTorch users leveraging these tools for training custom Flamingo-like models or doing efficient data loading for image-text datasets. Google’s TensorFlow has been used in projects like PaLM-E and Gemini (rumored), but Google has also abstracted multimodal tech into cloud services. A Google Developers Blog in Dec 2024 showcased the Vertex AI Multimodal Embeddings API, which “allows you to represent text, images, and video into the same shared vector space” (See the Similarity: Personalizing Visual Search with Multimodal Embeddings - Google Developers Blog). This kind of service indicates that cloud providers anticipate widespread use of multimodal intelligence in applications – instead of every company training its own model, they can call an API to convert inputs into embeddings and then build on top of that (for search, recommendation, etc.). On the AWS side, Amazon Bedrock (a managed service for foundation models) added support for Meta’s Llama 3.2 models right after their release (Introducing Llama 3.2 models from Meta in Amazon Bedrock: A new generation of multimodal vision and lightweight models | AWS News Blog). That includes the 11B and 90B vision-enabled Llama variants. The AWS blog announcing it emphasized that “these models are designed to inspire builders with image reasoning and are more accessible for edge applications” . We see an ecosystem forming where pretrained multimodal models are available as a service, much like text LLMs are. This will accelerate industry adoption because not everyone has the means to train or even fine-tune a 10B+ multimodal model, but they can use one via API.
Industry Applications and Integrations: By 2024, companies have started integrating multimodal LLMs into real products. OpenAI’s ChatGPT with vision and voice is one high-profile example (though the model behind it, GPT-4V, is closed). Microsoft’s Bing Chat leverages GPT-4V to let users query images (like “What is interesting about this picture?”), bringing multimodal AI to millions of end-users. Google has integrated image understanding into Bard (their conversational AI), allowing Bard to parse images users upload (e.g. find funny captions for a photo). These moves by OpenAI/Google in late 2023 were noted by many commentators as the start of a “multimodal race.” Indeed, in an end-of-2024 review, Dataiku observed that “it was truly in 2024 that multimodal LLMs became mainstream”, with such features becoming common (A Dizzying Year for Language Models: 2024 in Review). Meta, not to be left behind, showcased multimodal LLaMA models and is likely using them in-house for content understanding on Facebook/Instagram (automatically describing images, moderating content, etc.). Tesla’s humanoid robot and autonomous-driving efforts rely on vision and language understanding – while not exactly an LLM application, cross-modal AI research (like pairing language prompts with visual world understanding) is relevant there too. Another domain is healthcare: in late 2024, Microsoft and Epic Systems announced they are exploring GPT-4’s multimodal capabilities on medical images like X-rays (with proper fine-tuning and safeguards) for clinical decision support.
Latest Research Directions: A few emerging research directions deserve mention. One is Multimodal Reasoning and Tool Use – combining LLMs with external tools to perform complex tasks. For instance, a multimodal model might use an OCR tool as an API when needed, or a calculator for measurements in an image. Early works like HuggingGPT (2023) demonstrated orchestrating multiple models (vision, speech, etc.) under an LLM’s control (MLLM Tutorial). By 2025, we might see a generalization of this: multimodal LLMs that know when to invoke a plugin or a code interpreter to handle part of the input (especially for high-resolution images or long videos that are impractical to feed entirely into the model). Another active area is evaluation and safety for multimodal LLMs – new benchmarks in 2024 like MMBench, xGQA, etc., test models on fine-grained multimodal understanding, and studies are probing where these models still fail (e.g. spatial reasoning, counting objects, distinguishing subtle details). The findings often guide improvements: for example, if a model struggles with counting objects in an image, researchers might incorporate a dedicated region-counting loss or an auxiliary training task to address that. On the safety side, multimodal models introduce new concerns: e.g. generating harmful content from images (like identifying a person in an image and violating privacy, or the model hallucinating sensitive information about an image). Ensuring multimodal LLMs don’t produce vision-language bias or incorrect claims about images is a fresh challenge – some 2024 papers address this by learning to say “I don’t know” when uncertain about an image, or by grounding responses in retrieved references (to not hallucinate). OpenAI’s GPT-4V, for instance, refuses to identify people in images due to policy, showing how ethical guardrails are being implemented at deployment (Understanding Multimodal LLMs) .
In essence, the latest research and industry trends point towards convergence and integration: models are becoming more capable (handling more modalities and tasks), while industry is integrating them into real workflows (from developer APIs to consumer apps). The collaboration between research and industry is tight – e.g. NVIDIA’s NVLM is both a research contribution and likely influences their product roadmap for AI platforms; Google’s research on Gemini (its next-gen multimodal model) directly ties into its competitive strategy against GPT-4. The period of 2024–2025 is likely to be remembered as when multimodal LLMs transitioned from impressive demos to ubiquitous infrastructure.
Real World Applications
Multimodal LLMs are unlocking a wide array of real-world applications, transforming how AI is used in various industries. Here we highlight some key application domains and examples of how industry and research institutions are deploying multimodal LLMs:
Interactive Chat Assistants and Customer Support: Perhaps the most visible application is in AI assistants that can “see” and “hear.” OpenAI’s ChatGPT Vision is a prime example, allowing users to upload images during a conversation and receive detailed answers or descriptions. For instance, a user can upload a photo of a broken appliance and ask “How can I fix this?” – the assistant can recognize the appliance and its parts from the image and then provide troubleshooting steps in text. Microsoft has integrated a similar capability in Bing Chat, enabling users to ask questions about images (e.g. “What do you find funny about this meme?”). These assistants use a multimodal LLM under the hood (GPT-4V in these cases) to interpret the image and connect it to the conversation (Visual ChatGPT: Multimodal Capabilities and Use Cases for 2024) . The addition of voice input/output means users can now speak to these assistants and hear the response, making the interaction hands-free and more natural. For customer support, companies are exploring multimodal bots that can handle screenshots or audio from customers – imagine contacting tech support and simply uploading a screenshot of an error message or a photo of a defective product, and the AI agent understands the issue and guides you. Early adopters in e-commerce and IT support have begun testing such systems to reduce the friction of explaining problems in text. As one 2024 industry review put it, “multimodal features… enable more natural and more reactive user interfaces,” since users can show or say things instead of typing everything (A Dizzying Year for Language Models: 2024 in Review) .
Accessibility for the Visually and Hearing Impaired: Multimodal LLMs are being used to assist people with disabilities. A powerful use case is image description for the blind. Traditional screen readers rely on alt-text, which is often missing or inadequate. Now, a multimodal model can analyze an image in real time and generate a descriptive caption or answer questions about it. For example, Microsoft’s Seeing AI app (which predates GPT-4 but is moving towards LLM integration) can describe scenes to blind users; with an LLM, the descriptions become more detailed and conversational. A user might ask, “What’s in this photo I just took?” and the AI could respond, “It’s a street scene. I see a green traffic light, three or four people waiting to cross, and a dog on a leash. The buildings suggest you might be in a downtown area.” The assistant can then handle follow-up questions: “Is anyone I know in the photo?” (to which it should honestly respond if it recognizes someone, though privacy guardrails would prevent naming strangers). This is essentially a virtual assistant that combines image captioning and VQA for accessibility, which researchers have noted can “provide comprehensive assistance to visually impaired individuals” (Demystifying Multimodal LLMs). On the flip side, for the hearing impaired, multimodal LLMs can help by understanding sign language or transcribing audio in real time. While LLMs do not yet widely handle video, a connected system could use a vision model to interpret sign language (treating it as a sequence of human poses) and then use an LLM to respond. Additionally, an LLM with speech recognition can caption live speech or meetings far more accurately and flexibly than prior systems – even explaining tone or emphasis (“the speaker sounded angry when saying…”). These assistive technologies are in pilot stages in 2024, but hold considerable promise for deployment in 2025.
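For the speech-captioning half of this picture, a minimal sketch might chain a transcription model with an LLM that reformats the transcript into readable captions. The example below assumes OpenAI’s Whisper transcription endpoint and a hypothetical audio file; detecting tone or emphasis would require an audio-native model and is not attempted here.

```python
# Minimal captioning sketch: transcribe a recorded clip with Whisper, then have
# an LLM rewrite it as short caption lines. Assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

def caption_clip(audio_path: str) -> str:
    # Step 1: speech-to-text on the audio clip (e.g. a meeting segment).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    # Step 2: turn the raw transcript into short, readable caption lines.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": ("Rewrite this transcript as short caption lines, keeping "
                        "apparent speaker changes:\n\n" + transcript.text),
        }],
    )
    return response.choices[0].message.content

# e.g. print(caption_clip("meeting_segment.wav"))
```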
Medical and Healthcare: The medical field is seeing exploratory applications of multimodal LLMs in diagnostics and clinical decision support. One scenario is medical imaging analysis: radiologists can benefit from a second pair of “eyes” on X-rays, MRIs, or ultrasounds. A multimodal LLM like GPT-4 (which has reportedly been tested on medical board exams and some imaging tasks) could be given an MRI scan alongside a question such as “Do you detect any abnormalities that could explain the patient’s dizziness?” The model could highlight a region (if visual output is enabled) and say, “There is a small hyperintense area in the cerebellum that could represent an infarct (stroke).” In fact, researchers have prototyped systems where the LLM is given the radiology image and the draft report text and asked to improve or check the report. By answering specific questions like “Where is the tumor located in this scan?”, an M-LLM can assist in ensuring key details aren’t missed. Of course, these applications require extreme accuracy and are currently human-in-the-loop. On another front, telehealth or primary care chatbots can use multimodal inputs – a patient might send a photo of a rash or an audio recording of a cough. The AI could analyze the image of the rash and compare it with known examples (via its training or retrieval) and provide a possible identification (e.g. “This looks like eczema” or “I’m not sure, you should see a dermatologist”). It can listen to a cough and, if sufficiently trained on audio, indicate whether it sounds like a wet or dry cough, which can be a useful signal. Startups and research groups in 2024 have been testing GPT-4 on such multimodal medical tasks, with promising results in narrow cases (but also clear warnings that these models aren’t 100% reliable yet). Still, the potential to democratize medical expertise – by giving general practitioners or even patients an AI that can interpret medical images or sounds – is huge, but it is being approached cautiously given the risks.
Content Creation and Media: Multimodal LLMs are also helping creators and marketers. For example, in e-commerce marketing, product teams are using models to auto-generate product descriptions from product images. As described earlier, an M-LLM can look at a product photo and produce a rich narrative: “Elegant black cocktail dress with lace detailing, perfect for evening events.” (Demystifying Multimodal LLMs) This saves time for online retailers and ensures consistency. Likewise, for social media content, an AI might generate hashtags or captions after “looking at” the photo or video a user wants to post. Another burgeoning area is video summarization: content producers can feed a video (or key frames plus transcript) into a model and get a summary or even a trailer script. Internally, studios are experimenting with GPT-4 on tasks like analyzing dailies (raw film footage) to summarize scenes or to generate dialogue continuations. We also have seen creative tools like Adobe Firefly and others combining image generation with language – e.g. you sketch a layout and the system (with an LLM backbone) produces an image and writes copy to fit. While pure image generators (Stable Diffusion, DALL-E) are not LLMs, when you integrate them with LLMs you get systems like Visual ChatGPT where you can have a dialogue that involves generating and editing images through text instructions (MLLM Tutorial). For instance, a user could say “Here is a rough layout of a flyer [image]. Please design a catchy tagline and suggest a background image.” The LLM (connected to design tools) could output a few tagline options and even fetch or generate a background image, then compose them into a draft flyer. This level of multimodal generation – mixing text generation with image manipulation – is making its way into products like Canva’s Magic Studio and Adobe’s GenAI features.
Surveillance and Security: Some applications are less publicly discussed but are ongoing in government or corporate security contexts. Multimodal models can analyze feeds from CCTV cameras combined with other sensor data and contextual information. For example, an LLM-based system could monitor security cameras in a store and generate an alert like: “Alert: I see a person in aisle 3 who has been standing and looking around nervously for 5 minutes, which might indicate suspicious behavior.” The model would be taking visual data (perhaps processed by a person-detection model) and converting it into a semantic description with reasoning. Or in cybersecurity, one might analyze both network logs (text) and audio from VOIP calls to detect fraud patterns – a stretch for current LLMs, but conceptually within reach as multimodal capabilities grow to include structured data and audio.
Robotics and Autonomy: Leading research labs are integrating multimodal LLMs into robots and autonomous systems. Robotics often deals with vision (to perceive the world) and language (for instructions or explanations). Models like PaLM-E (2023) showed that a single model can take in robot sensor data (images) and high-level commands in text and output a plan or action commands. For instance, a user might tell a robot, “Fetch the blue mug from the kitchen table and bring it to me,” and the robot’s AI brain (a multimodal model) will parse that, use the robot’s camera input to locate the blue mug among objects (vision), possibly use retrieval or memory to recall what the blue mug looks like, then reason about the steps needed (go to kitchen, navigate to table, identify mug, grasp it, etc.). Some steps may require interfacing with low-level controllers, but the high-level policy can be handled by an LLM that knows how to break down the task and even respond in natural language if needed (e.g. “I couldn’t find the blue mug, should I look for another item?”). This is still a cutting-edge area, but as noted by researchers, “integrating and modeling diverse information modalities, including language, visual, auditory, and sensory data” is a path toward more human-like AI reasoning (MLLM Tutorial). It’s telling that the CVPR 2024 tutorial on MLLMs is subtitled “From Multimodal LLM to Human-level AI” – the implication is that to reach human-level intelligence, AI must master all our modalities. Practical robotics in 2025 might start to showcase this, at least in controlled environments (like warehouse robots that can read labels and gauge product conditions by image).
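To illustrate the high-level planning step in this robotics scenario, the sketch below gives a camera frame and a natural-language instruction to a vision-capable chat model and asks for a JSON list of steps over a small set of primitives. The primitives, model name, and JSON contract are assumptions for illustration; systems like PaLM-E are trained end-to-end rather than prompted this way, and a real planner would validate every step.

```python
# Minimal sketch of an LLM as a high-level robot planner: camera frame plus
# instruction in, JSON plan over assumed primitives out. Assumes the OpenAI
# Python SDK; navigate_to/locate/grasp/hand_over are hypothetical controllers.
import base64
import json
from openai import OpenAI

client = OpenAI()

PRIMITIVES = "navigate_to(location), locate(object), grasp(object), hand_over()"

def plan_task(instruction: str, camera_jpeg: bytes) -> list[dict]:
    image_b64 = base64.b64encode(camera_jpeg).decode("utf-8")
    prompt = (
        f"You control a robot with these primitives: {PRIMITIVES}.\n"
        f"Instruction: {instruction}\n"
        'Reply with only a JSON array of steps, each like {"action": "...", "argument": "..."}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    # Naive parse; a deployed planner would validate each step against the
    # primitive list and ask the model to re-plan on failure.
    return json.loads(response.choices[0].message.content)

# e.g. plan_task("Fetch the blue mug from the kitchen table and bring it to me",
#                open("kitchen_frame.jpg", "rb").read())
```

Each returned step would then be dispatched to the robot’s low-level controllers, with the LLM re-queried when a step fails or the scene changes.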
In industry, multimodal LLM integration is often happening behind the scenes. For instance, social media companies use AI to moderate content: an AI might review an image and its caption together to decide if it’s inappropriate (text-only or vision-only systems might miss the combined context). Multimodal LLMs are well-suited for understanding a meme where the humor is in the interplay between the picture and the text on it. This helps enforce policies on hate symbols or misinformation (e.g. an image might seem innocent, but the caption turns it into something violating policy, or vice versa). Additionally, enterprises are using multimodal LLMs for document processing: Many documents (financial reports, academic papers, etc.) contain text, tables, and charts. A multimodal model can parse a PDF, extract the text and also “read” the charts, then answer questions or summarize insights. For example, IBM has been looking into AI that can take an annual report (full of tables and graphs) and answer analysts’ questions – which requires combining natural language understanding with chart/image interpretation (like reading a bar graph). This is essentially a multimodal retrieval and QA problem and is an area of active development in 2024.
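As a rough sketch of the document-processing case, the example below renders the first pages of a report to images with PyMuPDF (an assumed dependency) and asks a vision-capable model about the text, tables, and charts together. The model name, file name, and page limit are illustrative; a production pipeline would retrieve the most relevant pages rather than truncating.

```python
# Minimal sketch of multimodal document QA: render report pages to PNG with
# PyMuPDF and ask a vision-capable chat model about them. Assumes the OpenAI
# Python SDK and the `pymupdf` package.
import base64
import fitz  # PyMuPDF
from openai import OpenAI

client = OpenAI()

def ask_report(pdf_path: str, question: str, max_pages: int = 3) -> str:
    doc = fitz.open(pdf_path)
    page_parts = []
    for i in range(min(max_pages, len(doc))):
        png_bytes = doc[i].get_pixmap(dpi=150).tobytes("png")  # page -> PNG image
        b64 = base64.b64encode(png_bytes).decode("utf-8")
        page_parts.append({"type": "image_url",
                           "image_url": {"url": f"data:image/png;base64,{b64}"}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": [{"type": "text", "text": question}] + page_parts}],
    )
    return response.choices[0].message.content

# e.g. ask_report("annual_report.pdf", "What was the trend in operating margin?")
```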
To conclude the applications section, it’s clear that AI companies and research labs are rapidly deploying multimodal LLMs in both consumer-facing and enterprise products. The value proposition is compelling: humans communicate with a mix of language, visuals, and audio, so AI that can do the same will be more useful and easier to interact with. As one data science blog noted, “Now that LLMs can natively handle non-textual sources of information such as images and audio files, [it’s] simpler and more effective to process input documents as they are rather than forcing a purely textual representation” (A Dizzying Year for Language Models: 2024 in Review). We are seeing this philosophy in action across domains from social media to healthcare. Real-world adoption is still in its early stages for many of these applications (due to concerns like accuracy and safety), but the trajectory suggests that by the end of 2025, it will be normal to interact with AI systems using the full range of human communication modalities – speaking, showing, pointing, and not just typing. Multimodal LLMs are a cornerstone of that future.
Conclusion
Extending LLMs to support images, audio, and other modalities is a fast-evolving frontier at the intersection of natural language processing and perception AI. Over 2024–2025, we have seen significant progress in architectures – from unified tokenization approaches that feed images and audio directly into transformer decoders, to sophisticated cross-modal attention mechanisms that fuse modality-specific encoders with language models. These architectural innovations are enabled by advances in the data processing pipeline, which now routinely includes image patch embedding, audio tokenization, and clever schemes to integrate different data types into a single model input.
Equally important are the training strategies that align and bind modalities together: contrastive learning has proven vital for creating joint vision-language representations, and large-scale multimodal instruction tuning (often leveraging generative AI to create training examples) has taught models to follow human multimodal instructions. At the same time, researchers have tackled the daunting computational challenges through techniques like model compression (quantization, Mixture-of-Experts for sparse activation) and optimized attention for long contexts, making it feasible to train and run multimodal models that just a year or two ago seemed impractically large. Inference-time optimizations – whether early fusion models that generate richly contextual answers, or late fusion pipelines and retrieval augmentation that improve efficiency – ensure that these models can be deployed in real-world settings under reasonable latency and resource constraints.
The latest research suggests that multimodal LLMs are not only getting more capable, but also more robust and integrated. Studies like NVLM (2024) indicate that carefully integrating modalities can enhance, not detract from, a model’s core abilities ( NVLM: Open Frontier-Class Multimodal LLMs). Meanwhile, industry leaders (OpenAI, Google, Meta, Microsoft) are rapidly infusing multimodal capabilities into products – heralding a new era where AI systems can see, hear, and talk in our everyday tools. From assisting a visually impaired user by narrating their surroundings (Demystifying Multimodal LLMs), to helping a doctor analyze a patient’s chart and MRI together, to enabling a student to query both textbooks and diagrams in one go, the applications of multimodal LLMs are broad and impactful.
Despite the progress, challenges remain. Ensuring accuracy and truthfulness across modalities is harder – an image might be interpreted incorrectly, or an audio clip could be noisy, leading the model to err. Safety concerns like visual privacy, deepfake audio, or multimodal misinformation are on researchers’ radar and require continued effort in aligning these powerful models with human values and policies. Computationally, training a truly general multimodal LLM that covers vision, hearing, and language as well as (if not better than) specialized models is an open challenge – it may require innovations like neural-symbolic hybrids or even more advanced hardware to reach human-level proficiency across the board.
Nonetheless, the direction is set: AI that can seamlessly weave together text, images, and sounds. The literature from 2024 and 2025 paints an exciting picture of transformers that can caption images, converse about videos, transcribe and understand audio, and use all of that to make better decisions or provide better answers. If the pace continues, we can expect multimodal LLMs to become as ubiquitous as text-only LLMs, fundamentally changing human-computer interaction. As one tutorial on the subject enthused, this convergence of modalities in large models is driving AI “beyond language, toward human-level understanding and reasoning” (MLLM Tutorial). In practical terms, that means AI assistants that are more helpful and intuitive, and AI-powered tools that operate on the same information streams we do. The combined advances in architecture, training, and deployment reviewed here bring us closer to AI that can truly see what we see, hear what we hear, and speak as we speak, making technology more accessible and powerful for all.