
Were RNNs All We Needed?

The podcast on this paper is generated with Google's Illuminate.

A super interesting paper that gets fresh value out of good old RNNs, with a huge computational efficiency win 🥇

Finds that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need to backpropagate through time (BPTT) and can be efficiently trained in parallel.

This change makes LSTMs and GRUs competitive with Transformers and Mamba for long sequence tasks.
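
Concretely, the minimal GRU in the paper drops the reset gate and removes h_{t-1} from the update gate and candidate state (minLSTM follows the same pattern, with its forget and input gates normalized to sum to one). A sketch of the resulting recurrence, written out from the paper's description:

```latex
% minGRU: gate and candidate depend only on the input x_t
z_t         = \sigma\left(\mathrm{Linear}_{d_h}(x_t)\right)    % update gate
\tilde{h}_t = \mathrm{Linear}_{d_h}(x_t)                        % candidate state (no tanh)
h_t         = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
```

Because z_t and the candidate state no longer depend on h_{t-1}, this is a first-order linear recurrence in h, which a parallel scan can evaluate across the whole sequence instead of unrolling BPTT step by step.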

• Training speedup: 175x (minGRU) and 235x (minLSTM) at sequence length 512

• Comparable performance to Mamba in selective copying, RL, and language modeling

• Uses less training memory than Mamba, which needs 56% more memory than minGRU

📚 https://arxiv.org/abs/2410.01201

Original Problem 🔍:

Transformers' quadratic complexity limits their scalability to long sequences, while traditional RNNs like LSTMs and GRUs are slow to train because of sequential processing and backpropagation through time.

-----

Solution in this Paper 🛠️:

• Introduces minLSTM and minGRU:

- Remove hidden state dependencies from gates

- Eliminate output range constraints (no tanh)

- Ensure time-independent output scale

• Trainable via a parallel scan algorithm (see the sketch after this list)

• Significantly reduced parameters:

- minGRU: O(2d_h*d_x) vs O(3d_h*(d_x+d_h)) for GRU

- minLSTM: O(3d_h*d_x) vs O(4d_h*(d_x+d_h)) for LSTM
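
Below is a minimal PyTorch-style sketch of minGRU and of how its recurrence can be evaluated in parallel. The module and variable names are mine, not the authors' released code, and the closed-form cumulative-product evaluation is shown for clarity only; the paper uses a numerically stable log-space parallel scan. minLSTM is analogous, with separate forget and input gates normalized to sum to one.

```python
# Minimal sketch of minGRU (assumed names; not the authors' implementation).
# Gates depend only on x_t, so h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t
# is a first-order linear recurrence h_t = a_t * h_{t-1} + b_t that can be
# evaluated for all t at once instead of step by step.
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, d_x: int, d_h: int):
        super().__init__()
        self.lin_z = nn.Linear(d_x, d_h)   # update gate: O(d_h * d_x) params
        self.lin_h = nn.Linear(d_x, d_h)   # candidate state: O(d_h * d_x) params

    def forward(self, x: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_x), h0: (batch, d_h)
        z = torch.sigmoid(self.lin_z(x))   # gate uses x only, no h_{t-1}
        h_tilde = self.lin_h(x)            # no tanh, no hidden-state term
        a, b = 1.0 - z, z * h_tilde        # h_t = a_t * h_{t-1} + b_t

        # Naive parallel evaluation via cumulative products (illustration only;
        # the paper uses a log-space parallel scan for numerical stability):
        a_cum = torch.cumprod(a, dim=1)    # prod_{k<=t} a_k
        h = a_cum * (h0.unsqueeze(1) + torch.cumsum(b / a_cum, dim=1))
        return h                           # hidden states for all t: (batch, seq_len, d_h)

# Example usage:
# layer = MinGRU(d_x=32, d_h=64)
# x = torch.randn(8, 512, 32)
# h = layer(x, h0=torch.zeros(8, 64))
```

The two Linear layers also explain the parameter counts listed above: minGRU needs O(2d_h*d_x) weights because neither the gate nor the candidate takes h_{t-1} as input, versus O(3d_h*(d_x+d_h)) for a standard GRU.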

-----

Key Insights from this Paper 💡:

• Simplified decade-old RNNs can match recent complex models

• Many recent sequence models share fundamental similarities

• Efficient training possible without sacrificing performance

• Reevaluation of research directions in sequence modeling needed
