Has Sepp Hochreiter done it again? After months of announcements, a group around the inventor of the LSTM finally published a paper presenting xLSTM to the world.
Until the appearance of the Transformer in 2017, the LSTM had been the go-to technology for a wide variety of sequence-related tasks, including text generation. Three limitations relegated LSTMs to second place behind Transformers:
- the inability to revise storage decisions,
- limited storage capacity, and
- the need for sequential rather than parallel processing.
The group proposes two new types of LSTM memory cells, dubbed sLSTM and mLSTM (see the figure from the original paper below). Both sLSTM and mLSTM are placed in residual blocks (adding skip connections, similar to Transformers). These blocks can be stacked in various combinations and thus constitute the complete xLSTM architecture; a schematic sketch of the stacking follows the list below.
- The input and forget gates in the sLSTM cell get exponential activation functions, equipping it with the ability to revise storage decisions. Like normal LSTMs, sLSTM can have multiple memory cells, and the recurrent connections between them allow memory mixing. sLSTM can also have multiple heads, again a Transformer idea injected into the LSTM; memory mixing across heads is not possible. (A minimal sketch of the gating recurrence follows the list.)
- Instead of a scalar, the mLSTM memory cell is a matrix, for enhanced storage capacity. For retrieval, mLSTM adapts the key, value, and query concept from Transformers. Consequently, there is no memory mixing, but multiple heads are possible here as well. (A minimal sketch of the matrix-memory update follows the list.)
- The memory mixing in sLSTM requires sequential calculations, ruling out parallelization. Sepp Hochreiter's team does propose a fast CUDA kernel, but the speed handicap remains.
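To make the exponential gating concrete, here is a minimal NumPy sketch of a single sLSTM memory cell as I read the paper's description: the input and forget gates are exponential, a normalizer state n and a stabilizer state m keep them numerically manageable, and the recurrent term is what forces the sequential scan. The scalar cell state, the weight shapes, and the parameter names are simplifications for illustration, not the paper's exact multi-head formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, W, r, b):
    """One step of a single (scalar) sLSTM memory cell.

    x    : input vector at this time step
    W    : (4, d_in) input weights for the z, i, f, o pre-activations
    r, b : length-4 recurrent weights and biases
    The recurrent term r * h_prev is the memory mixing that keeps the
    computation sequential across time steps.
    """
    z_t, i_t, f_t, o_t = W @ x + r * h_prev + b

    z = np.tanh(z_t)                 # cell input
    o = sigmoid(o_t)                 # output gate

    # Exponential input/forget gates can overflow, so a stabilizer state m
    # rescales them; the rescaling cancels out in the hidden state h.
    m = max(f_t + m_prev, i_t)
    i = np.exp(i_t - m)              # stabilized exponential input gate
    f = np.exp(f_t + m_prev - m)     # stabilized exponential forget gate

    c = f * c_prev + i * z           # cell state
    n = f * n_prev + i               # normalizer state
    h = o * (c / n)                  # hidden state
    return h, c, n, m

# Sequential scan over a toy sequence -- the loop that a fast CUDA kernel
# can speed up but not fully parallelize.
rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 8))                      # 10 steps, input dim 8
W, r, b = rng.normal(size=(4, 8)), rng.normal(size=4), np.zeros(4)
h = c = n = m = 0.0
for x in xs:
    h, c, n, m = slstm_step(x, h, c, n, m, W, r, b)
```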
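And a similarly simplified sketch of the mLSTM idea: the memory is a d x d matrix updated with outer products of value and key vectors, and retrieval multiplies that matrix with a query vector. Again, the names, shapes, and the omitted gate stabilization are my illustrative assumptions, not the paper's full formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlstm_step(x, C_prev, n_prev, Wq, Wk, Wv, Wo, wi, wf):
    """One step of a single-head mLSTM cell with a d x d matrix memory.

    Wq, Wk, Wv, Wo : (d, d_in) projections for query, key, value, output gate
    wi, wf         : (d_in,) weights for the scalar input/forget gates
    No previous hidden state enters the gates, hence no memory mixing.
    """
    d = Wk.shape[0]
    q = Wq @ x                         # query
    k = (Wk @ x) / np.sqrt(d)          # key, scaled as in attention
    v = Wv @ x                         # value

    i = np.exp(wi @ x)                 # exponential input gate (stabilization omitted)
    f = sigmoid(wf @ x)                # forget gate
    o = sigmoid(Wo @ x)                # output gate (vector)

    C = f * C_prev + i * np.outer(v, k)   # matrix memory: store value under key
    n = f * n_prev + i * k                # normalizer vector

    h_tilde = (C @ q) / max(abs(n @ q), 1.0)  # retrieve with the query
    return o * h_tilde, C, n
```

Because the gates depend only on the current input and not on a previous hidden state, the recurrence is linear and can be unrolled into a parallel, attention-like form for training.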
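Finally, a schematic sketch of how the pieces combine into the full architecture, under the simplifying assumption that an sLSTM or mLSTM layer is just a callable mapping a sequence to a sequence; the normalization and projection layers of the paper's residual blocks are reduced to placeholders here.

```python
import numpy as np

def residual_block(seq, layer, norm=lambda s: s):
    """Pre-norm residual block: output = input + layer(norm(input)).
    `layer` stands for an sLSTM or mLSTM layer scanning the sequence."""
    return seq + layer(norm(seq))

def xlstm_forward(seq, layers):
    """An xLSTM stack: residual blocks in any sLSTM/mLSTM combination."""
    for layer in layers:
        seq = residual_block(seq, layer)
    return seq

# Example: a toy stack of three placeholder layers on a (10, 8) sequence.
layers = [lambda s: 0.1 * s for _ in range(3)]
print(xlstm_forward(np.ones((10, 8)), layers).shape)   # (10, 8)
```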
In the experimental section, xLSTM is pitched against other methods, most notably Transformers. Overall, xLSTM compares favorably in many tasks, including large language modeling. Ablation studies show that both components, sLSTM and mLSTM, contribute to the improvement over the regular LSTM. Another important observation is that xLSTM scales well, so its performance is not limited to smaller datasets. Speed tests are not shown.
It will be interesting to observe to what extent xLSTM will gain traction, and what, in business speak, its key success factors will be.