A faster way to analyze DNA, protein, and chemical sequences using an enhanced LSTM architecture
Bio-xLSTM, proposed in this paper, scales linearly rather than quadratically with sequence length, enabling analysis of much longer biological sequences.
It enables constant-memory processing and seamless in-context learning, making it a practical tool for real-world biological applications.
https://arxiv.org/abs/2411.04165
🤔 Original Problem:
Current biological sequence modeling relies heavily on Transformer architectures, whose runtime and memory grow quadratically with sequence length, making them impractical for long genomic sequences and for in-context learning in the protein and chemical domains.
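To make the scaling contrast concrete, here is a toy numerical sketch (my own illustration, not code from the paper, and a simplified unnormalized linear recurrence rather than the exact mLSTM equations): full self-attention materializes an L×L score matrix, while a recurrent memory keeps only a fixed d×d state per head, and the same recurrence can be evaluated either step by step or in parallel.

```python
import torch

L, d = 4096, 64                                  # sequence length, head dimension
q, k, v = (torch.randn(L, d) for _ in range(3))

# Full self-attention materializes an L x L score matrix: cost grows as L^2.
scores = q @ k.T                                 # (4096, 4096), just to show the size

# A linear recurrence keeps one fixed d x d state, updated once per token: cost grows as L.
state = torch.zeros(d, d)
out_recurrent = []
for t in range(L):
    state = state + torch.outer(k[t], v[t])      # constant-size memory update
    out_recurrent.append(state.T @ q[t])         # read-out for position t
out_recurrent = torch.stack(out_recurrent)       # (L, d)

# The same recurrence has a parallel form (here via a cumulative sum),
# which is why training can run in parallel while generation stays recurrent.
running_kv = torch.cumsum(torch.einsum('ld,le->lde', k, v), dim=0)
out_parallel = torch.einsum('lde,ld->le', running_kv, q)
print(torch.max(torch.abs(out_recurrent - out_parallel)))   # agrees up to float rounding
```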
-----
🛠️ Solution in this Paper:
→ The paper introduces Bio-xLSTM, a suite of recurrent neural architectures specifically optimized for biological sequences.
→ Bio-xLSTM uses two core components: sLSTM and mLSTM layers integrated into residual blocks with layer normalization and causal convolutions (see the sketch after this list).
→ It achieves linear scaling with sequence length through efficient parallel, recurrent, or chunkwise processing modes.
→ For DNA, it implements reverse-complement equivariance through parameter sharing and handles sequences up to 32k tokens.
→ For proteins, it enables homology-aware generation through in-context learning with unaligned sequences.
→ For chemical sequences, it combines efficient representation learning with domain-specific generation capabilities.
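Below is a minimal PyTorch sketch of the block layout and the reverse-complement weight sharing described above. It is an assumption-laden illustration, not the released Bio-xLSTM code: a stock nn.LSTM stands in for the actual sLSTM/mLSTM cell, and a one-hot channel order of [A, C, G, T] is assumed so that flipping the channel axis complements the bases.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution that only sees current and past positions."""
    def __init__(self, dim, kernel_size=4):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(dim, dim, kernel_size)

    def forward(self, x):                        # x: (batch, length, dim)
        x = x.transpose(1, 2)
        x = self.conv(F.pad(x, (self.pad, 0)))   # left-pad so no future positions leak in
        return x.transpose(1, 2)

class XLSTMBlock(nn.Module):
    """Pre-norm residual block: LayerNorm -> causal conv -> recurrent cell -> residual add."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = CausalConv1d(dim)
        self.cell = nn.LSTM(dim, dim, batch_first=True)  # placeholder for sLSTM/mLSTM

    def forward(self, x):
        h, _ = self.cell(self.conv(self.norm(x)))
        return x + h

class RCEquivariantModel(nn.Module):
    """Shares one set of weights between a DNA sequence and its reverse complement."""
    def __init__(self, alphabet_size=4, hidden=128, n_blocks=2):
        super().__init__()
        self.embed = nn.Linear(alphabet_size, hidden)
        self.blocks = nn.Sequential(*[XLSTMBlock(hidden) for _ in range(n_blocks)])
        self.head = nn.Linear(hidden, alphabet_size)

    def _run(self, x):
        return self.head(self.blocks(self.embed(x)))

    def forward(self, x):                        # x: one-hot DNA, (batch, length, 4)
        rc = torch.flip(x, dims=[1, 2])          # reverse positions, complement bases
        fwd = self._run(x)
        rev = torch.flip(self._run(rc), dims=[1, 2])  # map RC output back to the original frame
        return 0.5 * (fwd + rev)                 # reverse-complement equivariant combination

dna = F.one_hot(torch.randint(0, 4, (1, 512)), num_classes=4).float()
print(RCEquivariantModel()(dna).shape)           # torch.Size([1, 512, 4])
```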
-----
💡 Key Insights:
→ Linear scaling with sequence length makes it ideal for long genomic sequences
→ Supports both parallel and recurrent modes of operation
→ Can handle varying sequence lengths effectively while maintaining expressivity
→ Enables in-context learning without parameter updates (sketched below)
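Here is a hedged sketch of what homology-conditioned in-context generation looks like in practice: unaligned family members are concatenated into the prompt, and the model samples a new sequence with frozen weights. The tokenizer interface, separator token, and model call are hypothetical placeholders, not the paper's released API.

```python
# Hypothetical illustration of in-context generation with frozen weights.
import torch

def build_prompt(context_sequences, tokenizer, sep_token="<sep>"):
    """Concatenate unaligned homologs into one context; no parameter updates needed."""
    text = sep_token.join(context_sequences) + sep_token
    return torch.tensor([tokenizer.encode(text)])        # (1, prompt_length)

@torch.no_grad()
def generate_family_member(model, prompt_ids, eos_id, max_new_tokens=300):
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1]                        # next-token logits
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return ids[:, prompt_ids.shape[1]:]                   # newly generated tokens only
```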
-----
📊 Results:
→ DNA-xLSTM outperforms Transformer and Mamba baselines in both causal and masked language modeling tasks
→ Prot-xLSTM shows superior performance in homology-conditioned sequence generation
→ Chem-xLSTM demonstrates effective domain-conditioned molecule generation without fine-tuning