The paper addresses the challenge of enabling LLMs to process and generate multimodal content (images, video, audio) without any training for those modalities. It introduces MILS (Multimodal Iterative LLM Solver), a method that leverages LLMs' test-time reasoning capabilities to achieve this.
MILS uses an LLM as a "GENERATOR" to create text outputs, and a multimodal model (like CLIP) as a "SCORER" to evaluate the outputs, iteratively refining the output via feedback.
-----
https://arxiv.org/abs/2501.18096
📌 MILS cleverly bypasses the need for paired multimodal training data. It repurposes existing, powerful LLMs and pre-trained vision-language models, applying them across diverse tasks (captioning, generation, editing) rather than a single benchmark.
📌 The iterative refinement process is key. MILS is not just a one-shot generation. The feedback loop between the LLM GENERATOR and the multimodal SCORER progressively improves output quality.
📌 MILS' ability to invert multimodal inputs into text is a powerful capability. It goes beyond simply combining embeddings and opens possibilities for cross-modal editing and creation.
----------
Methods Explored in this Paper 🔧:
→ MILS is a training-free, iterative optimization approach.
→ It has two core components: a GENERATOR (typically an LLM) and a SCORER (a pre-trained multimodal model like CLIP or SigLIP).
→ The GENERATOR produces candidate text outputs (e.g., captions, prompts). The SCORER evaluates each candidate against the test sample (image, video, or audio) and returns a similarity score.
→ The SCORER's scores are verbalized as text feedback for the GENERATOR, which uses them to produce improved candidates in the next iteration. This repeats until convergence; a minimal sketch follows below.
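Here is a minimal sketch of that loop in Python, assuming CLIP (via Hugging Face transformers) as the SCORER and a hypothetical `llm_generate` stub standing in for any instruction-tuned LLM. It illustrates the feedback mechanism, not the paper's exact implementation:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score(image: Image.Image, captions: list[str]) -> list[float]:
    """SCORER: CLIP image-text similarity for each candidate caption."""
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]  # one image vs. every caption
    return sims.tolist()

def llm_generate(prompt: str, k: int = 8) -> list[str]:
    """GENERATOR (hypothetical stub): return k candidate captions from an LLM."""
    raise NotImplementedError("wrap your LLM API of choice here")

image = Image.open("test.jpg")
candidates = llm_generate("Propose diverse one-sentence captions for an image.")
for step in range(10):  # fixed step budget here; the paper iterates to convergence
    ranked = sorted(zip(score(image, candidates), candidates), reverse=True)
    # Verbalize the scores: this text feedback is all the LLM ever sees.
    feedback = "\n".join(f"{s:.2f}: {c}" for s, c in ranked)
    candidates = llm_generate("Write improved captions, beating these scores:\n" + feedback)
best_caption = max(zip(score(image, candidates), candidates))[1]
```

Note that only the SCORER ever touches pixels; the LLM operates purely on text, which is why no multimodal training of the LLM is needed.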
-----
Key Insights 💡:
→ MILS shows emergent zero-shot capabilities across different tasks (captioning, generation, editing) and modalities (image, video, audio); the LLM itself needs no explicit multimodal training.
→ MILS can also improve text-to-image generation by using the LLM as a "prompt rewriter" that iteratively refines prompts against the SCORER.
→ Because MILS operates without gradient updates, it can invert multimodal inputs into text, enabling unusual tasks like "cross-modal arithmetic" (sketched below).
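As a hedged illustration of that inversion idea, here is a sketch of cross-modal arithmetic. It reuses the hypothetical `llm_generate` stub above and assumes two further hypothetical helpers: `mils_invert` (the MILS loop, run on an input of any modality and returning its best caption) and `text_to_image` (any off-the-shelf text-to-image model):

```python
def cross_modal_add(input_a, input_b):
    """Invert both inputs to text, merge the descriptions, re-render as an image."""
    caption_a = mils_invert(input_a)  # e.g., an image -> text (hypothetical helper)
    caption_b = mils_invert(input_b)  # e.g., an audio clip -> text
    merged = llm_generate(
        f"Combine into a single scene description:\n1. {caption_a}\n2. {caption_b}"
    )[0]
    return text_to_image(merged)  # hypothetical text-to-image call
```

Because all intermediate state is plain text, the "arithmetic" happens in language space rather than in a shared embedding space.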
-----
Results 📊:
→ Image Captioning on MSCOCO: Achieves 8.0 BLEU, 33.3 CIDEr, 15.0 METEOR, and 9.6 SPICE.
→ Video Captioning on MSR-VTT: Achieves 2.3 CIDEr and 14.4 METEOR, beating prior work on METEOR.
→ Audio Captioning on Clotho: Achieves 2.7 BLEU, 23.1 ROUGE, 12.4 METEOR, and 7.6 SPICE.