A faster way to analyze DNA, protein, and chemical sequences using an enhanced LSTM architecture
Bio-xLSTM, proposed in this paper, scales linearly rather than quadratically with sequence length, enabling analysis of much longer biological sequences.
It enables constant-memory processing and seamless in-context learning, making it a practical tool for real-world biological applications.
https://arxiv.org/abs/2411.04165
🤔 Original Problem:
Current biological sequence modeling relies heavily on Transformer architectures, whose runtime and memory grow quadratically with sequence length, making them impractical for long genomic sequences and for in-context learning in the protein and chemical domains.
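To make the scaling contrast concrete, here is a toy numerical sketch (my own illustration, not code from the paper, and a simplified unnormalized linear recurrence rather than the exact mLSTM equations): full self-attention materializes an L×L score matrix, while a recurrent memory keeps only a fixed d×d state per head, and the same recurrence can be evaluated either step by step or in parallel.

```python
import torch

L, d = 4096, 64                                  # sequence length, head dimension
q, k, v = (torch.randn(L, d) for _ in range(3))

# Full self-attention materializes an L x L score matrix: cost grows as L^2.
scores = q @ k.T                                 # (4096, 4096), just to show the size

# A linear recurrence keeps one fixed d x d state, updated once per token: cost grows as L.
state = torch.zeros(d, d)
out_recurrent = []
for t in range(L):
    state = state + torch.outer(k[t], v[t])      # constant-size memory update
    out_recurrent.append(state.T @ q[t])         # read-out for position t
out_recurrent = torch.stack(out_recurrent)       # (L, d)

# The same recurrence has a parallel form (here via a cumulative sum),
# which is why training can run in parallel while generation stays recurrent.
running_kv = torch.cumsum(torch.einsum('ld,le->lde', k, v), dim=0)
out_parallel = torch.einsum('lde,ld->le', running_kv, q)
print(torch.max(torch.abs(out_recurrent - out_parallel)))   # agrees up to float rounding
```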
-----
🛠️ Solution in this Paper:
→ The paper introduces Bio-xLSTM, a suite of recurrent neural architectures specifically optimized for biological sequences.
→ Bio-xLSTM uses two core components: sLSTM and mLSTM layers integrated into residual blocks with layer normalization and causal convolutions (see the sketch after this list).
→ It achieves linear scaling with sequence length through efficient parallel, recurrent, or chunkwise processing modes.
→ For DNA, it implements reverse-complement equivariance through parameter sharing and handles sequences up to 32k tokens.
→ For proteins, it enables homology-aware generation through in-context learning with unaligned sequences.
→ For chemical sequences, it combines efficient representation learning with domain-specific generation capabilities.
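Below is a minimal PyTorch sketch of the block layout and the reverse-complement weight sharing described above. It is an assumption-laden illustration, not the released Bio-xLSTM code: a stock nn.LSTM stands in for the actual sLSTM/mLSTM cell, and a one-hot channel order of [A, C, G, T] is assumed so that flipping the channel axis complements the bases.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees current and past positions."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(dim, dim, kernel_size)

    def forward(self, x):                        # x: (batch, length, dim)
        x = x.transpose(1, 2)
        x = self.conv(F.pad(x, (self.pad, 0)))   # left-pad so no future positions leak in
        return x.transpose(1, 2)

class XLSTMBlock(nn.Module):
    """Pre-norm residual block: LayerNorm -> causal conv -> recurrent cell -> residual add."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = CausalConv1d(dim)
        self.cell = nn.LSTM(dim, dim, batch_first=True)  # placeholder for sLSTM/mLSTM

    def forward(self, x):
        h, _ = self.cell(self.conv(self.norm(x)))
        return x + h

class RCEquivariantModel(nn.Module):
    """Shares one set of weights between a DNA sequence and its reverse complement."""
    def __init__(self, alphabet_size=4, hidden=128, n_blocks=2):
        super().__init__()
        self.embed = nn.Linear(alphabet_size, hidden)
        self.blocks = nn.Sequential(*[XLSTMBlock(hidden) for _ in range(n_blocks)])
        self.head = nn.Linear(hidden, alphabet_size)

    def _run(self, x):
        return self.head(self.blocks(self.embed(x)))

    def forward(self, x):                        # x: one-hot DNA, (batch, length, 4)
        rc = torch.flip(x, dims=[1, 2])          # reverse positions, complement bases
        fwd = self._run(x)
        rev = torch.flip(self._run(rc), dims=[1, 2])  # map RC output back to the original frame
        return 0.5 * (fwd + rev)                 # reverse-complement equivariant combination

dna = F.one_hot(torch.randint(0, 4, (1, 512)), num_classes=4).float()
print(RCEquivariantModel()(dna).shape)           # torch.Size([1, 512, 4])
```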
-----
💡 Key Insights:
→ Linear scaling with sequence length makes it ideal for long genomic sequences
→ Supports both parallel and recurrent modes of operation
→ Can handle varying sequence lengths effectively while maintaining expressivity
→ Enables in-context learning without parameter updates (sketched below)
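Here is a hedged sketch of what homology-conditioned in-context generation looks like in practice: unaligned family members are concatenated into the prompt, and the model samples a new sequence with frozen weights. The tokenizer interface, separator token, and model call are hypothetical placeholders, not the paper's released API.

```python
# Hypothetical illustration of in-context generation with frozen weights.
import torch

def build_prompt(context_sequences, tokenizer, sep_token="<sep>"):
    """Concatenate unaligned homologs into one context; no parameter updates needed."""
    text = sep_token.join(context_sequences) + sep_token
    return torch.tensor([tokenizer.encode(text)])        # (1, prompt_length)

@torch.no_grad()
def generate_family_member(model, prompt_ids, eos_id, max_new_tokens=300):
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1]                        # next-token logits
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return ids[:, prompt_ids.shape[1]:]                   # newly generated tokens only
```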
-----
📊 Results:
→ DNA-xLSTM outperforms Transformer and Mamba baselines in both causal and masked language modeling tasks
→ Prot-xLSTM shows superior performance in homology-conditioned sequence generation
→ Chem-xLSTM demonstrates effective domain-conditioned molecule generation without fine-tuning