"Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences"

This podcast on the paper was generated with Google's Illuminate, Google's platform for creating podcasts from arXiv papers.

A faster way to analyze DNA, protein, and chemical sequences using an enhanced LSTM architecture

Bio-xLSTM, proposed in this paper, scales linearly with sequence length instead of quadratically, enabling analysis of much longer sequences

It enables constant-memory processing and seamless in-context learning, making it a practical tool for real-world biological applications.

https://arxiv.org/abs/2411.04165

🤔 Original Problem:

Current biological sequence modeling relies heavily on Transformer architectures, whose quadratic runtime complexity makes them impractical for long genomic sequences and for in-context learning tasks in the protein and chemical domains.
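A back-of-the-envelope illustration (not from the paper) of why that matters at genomic scale: the L×L attention score matrix grows with the square of the context length, while a recurrent state stays a fixed size. The 32k context length is from the paper's DNA experiments; the model width of 128 is an assumed illustrative value.

```python
# Rough size comparison: quadratic attention scores vs. a fixed-size recurrent state.
seq_len = 32_000                                  # DNA context length used in the paper
d_model = 128                                     # illustrative model width (assumption)

attn_entries = seq_len ** 2                       # one score per token pair (per head)
recurrent_state = d_model * d_model               # fixed-size matrix memory

print(f"attention scores: {attn_entries:,}")      # 1,024,000,000
print(f"recurrent state:  {recurrent_state:,}")   # 16,384
```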

-----

🛠️ Solution in this Paper:

→ The paper introduces Bio-xLSTM, a suite of recurrent neural architectures specifically optimized for biological sequences.

→ Bio-xLSTM uses two core components: sLSTM and mLSTM layers, integrated into residual blocks with layer normalization and causal convolutions (see the residual-block sketch after this list).

→ It achieves linear scaling with sequence length through efficient parallel, recurrent, or chunkwise processing modes.

→ For DNA, it implements reverse-complement equivariance through parameter sharing and handles sequences up to 32k tokens.

→ For proteins, it enables homology-aware generation through in-context learning with unaligned sequences.

→ For chemical sequences, it combines efficient representation learning with domain-specific generation capabilities.
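A minimal sketch, assuming a pre-norm residual block of the kind described above (LayerNorm, then a causal convolution, then a recurrent sequence mixer, then a residual connection). This is not the authors' code: `CausalConv1d`, `BioXLSTMBlockSketch`, and all dimensions are illustrative, and a plain `nn.LSTM` stands in for the paper's sLSTM/mLSTM layers.

```python
import torch
import torch.nn as nn


class CausalConv1d(nn.Module):
    """1D convolution that only sees current and past positions."""
    def __init__(self, channels: int, kernel_size: int = 4):
        super().__init__()
        self.pad = kernel_size - 1                      # left-pad so no future tokens leak in
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                               # x: (batch, seq_len, channels)
        x = x.transpose(1, 2)                           # -> (batch, channels, seq_len)
        x = nn.functional.pad(x, (self.pad, 0))         # pad only on the left (past side)
        return self.conv(x).transpose(1, 2)             # back to (batch, seq_len, channels)


class BioXLSTMBlockSketch(nn.Module):
    """Pre-norm residual block: LayerNorm -> causal conv -> recurrent mixer -> +residual."""
    def __init__(self, d_model: int = 128, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.causal_conv = CausalConv1d(d_model, kernel_size)
        # Placeholder sequence mixer; the paper uses sLSTM/mLSTM layers here.
        self.mixer = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x):                               # x: (batch, seq_len, d_model)
        h = self.norm(x)
        h = nn.functional.silu(self.causal_conv(h))
        h, _ = self.mixer(h)
        return x + h                                    # residual connection


tokens = torch.randn(2, 1024, 128)                      # (batch, seq_len, d_model)
print(BioXLSTMBlockSketch()(tokens).shape)              # torch.Size([2, 1024, 128])
```

Stacking such blocks gives a causal sequence model whose per-token cost does not depend on how far back the context reaches, which is the property the DNA, protein, and chemistry variants all rely on.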

-----

💡 Key Insights:

→ Linear scaling with sequence length makes it ideal for long genomic sequences

→ Supports both parallel and recurrent modes of operation (see the recurrence sketch after this list)

→ Can handle varying sequence lengths effectively while maintaining expressivity

→ Enables in-context learning without parameter updates
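A minimal NumPy sketch of an mLSTM-style recurrent step, assuming the update rules from the xLSTM formulation (C_t = f_t·C_{t-1} + i_t·v_t k_tᵀ, n_t = f_t·n_{t-1} + i_t·k_t, h_t = C_t q_t / max(|n_tᵀ q_t|, 1)). The learned projections, output gate, and exponential-gate stabilization used in practice are omitted, and the gate values here are illustrative constants, not the paper's.

```python
import numpy as np

d = 16                                   # head dimension (illustrative)
rng = np.random.default_rng(0)

C = np.zeros((d, d))                     # matrix memory: fixed size, independent of sequence length
n = np.zeros(d)                          # normalizer state

def mlstm_step(x_t, C, n):
    # Illustrative query/key/value; real models use learned weight matrices.
    q, k, v = x_t, x_t / np.sqrt(d), x_t
    i_t = np.exp(0.1)                    # input gate (exponential gating)
    f_t = 0.9                            # forget gate
    C = f_t * C + i_t * np.outer(v, k)   # constant-size state update
    n = f_t * n + i_t * k
    h = C @ q / max(abs(n @ q), 1.0)     # normalized read-out
    return h, C, n

# Processing T tokens costs O(T) time with an O(1)-size state,
# which is what makes 32k-token DNA contexts and long in-context prompts feasible.
for _ in range(32_000):
    h, C, n = mlstm_step(rng.standard_normal(d), C, n)
print(h.shape, C.shape)                  # (16,) (16, 16)
```

Because the state never grows, prompts with many context sequences (e.g. homologs for protein generation) can be consumed without any parameter updates, which is the basis of the in-context learning results.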

-----

📊 Results:

→ DNA-xLSTM outperforms Transformer and Mamba baselines in both causal and masked language modeling tasks

→ Prot-xLSTM shows superior performance in homology-conditioned sequence generation

→ Chem-xLSTM demonstrates effective domain-conditioned molecule generation without fine-tuning
