
Hi! Blog author here. This was an attempt, a couple of years ago, to understand and write about this paper in detail. Here is a video going through this topic as well: https://youtu.be/dKJEpOtVgXc?si=PDNO0B0qi6ARHaeb

Section 2 of the blog post is no longer very relevant. A lot of advances (DSS, S4D) simplified that part of the process. Arguably also this all should be updated for Mamba (same authors).

Thanks for your spectacular resources! I see that you began an Annotated Mamba repository -- any chance you could share when that blog page might go live?

This was an excellent write-up, thanks. It'll help me understand the Mamba work a lot more.

I still find it really confusing how a linear model can perform so well.
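For concreteness, the linear model being discussed is a discrete state space recurrence, and one identity behind its power is that unrolling the recurrence over a sequence gives an ordinary causal convolution. A minimal NumPy sketch of that equivalence (toy sizes and random matrices, assumed notation, not the post's actual parameters):

```python
import numpy as np

# Toy illustration (assumed notation, random matrices): a discrete
# linear state space model
#   x[k] = A x[k-1] + B u[k],   y[k] = C x[k]
# unrolled over a sequence equals a 1-D causal convolution with
# kernel K[i] = C @ A^i @ B -- the identity S4 builds on.

rng = np.random.default_rng(0)
N, L = 4, 8                        # state size, sequence length
A = 0.3 * rng.normal(size=(N, N))  # scaled down so the recurrence stays stable
B = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)

# Recurrent view: step the hidden state through the sequence.
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())

# Convolutional view: precompute the kernel, then convolve causally.
K = [(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(L)]
y_conv = [sum(K[i] * u[k - i] for i in range(k + 1)) for k in range(L)]

assert np.allclose(y_rec, y_conv)
```

The expressive power comes not from the linear map itself but from how the structured `A` shapes a very long effective kernel, and from nonlinear layers stacked between SSM blocks.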

A lot of intimidating math that will make all self-attention tutorials seem like a walk in the park in comparison. Luckily, subsequent state space models building on S4 (DSS, S4D, and newer ones like Mamba) simplified the primitives and the math used.
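A rough sketch of the simplification DSS/S4D make (toy values, not the papers' exact parameterization): with a diagonal state matrix, the SSM kernel needs no matrix powers at all, since it reduces to a weighted sum of geometric sequences.

```python
import numpy as np

# Illustration with assumed toy values: if A = diag(a), the kernel
# K[i] = C @ A^i @ B collapses to sum_n C_n * a_n**i * B_n, i.e. a
# Vandermonde-style product instead of repeated matrix multiplication.

rng = np.random.default_rng(1)
N, L = 4, 8                          # state size, kernel length
a = rng.uniform(-0.9, 0.9, size=N)   # diagonal of A, kept inside the unit disk
B = rng.normal(size=N)
C = rng.normal(size=N)

# Direct evaluation, one kernel entry at a time.
K_direct = np.array([(C * a**i * B).sum() for i in range(L)])

# Vectorized: the powers a_n**i form an (N, L) Vandermonde-like matrix,
# so the whole kernel is a single vector-matrix product.
V = a[:, None] ** np.arange(L)
K_vand = (C * B) @ V

assert np.allclose(K_direct, K_vand)
```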

The math is not designed to intimidate; rather, it approaches the question of how to build a sequence model in a principled way from state space models, which draw on an arguably longer literature than neural networks.

Some of the concepts are better explained here than anywhere else, and they make it straightforward to make sense of Mamba, which is increasingly popular.

Well, but this stuff is also much more principled and much better understood (by construction) than why/how a transformer works. The price of actual understanding, and of being able to make precise statements, is that the statements will be precise and detailed (i.e., likely involve math).

Can someone point me to DSS and S4D papers?

What do I need to learn to start understanding those articles? Are there any good courses on the topic for beginners?

All machine learning is convolution.

I did not mean it in a negative way; this is a great resource. But the math will be intimidating regardless for most devs who don't have a solid math/signal-processing background. It's way beyond the simple linear algebra plus the chain rule from calculus required to understand basic neural network training.

DSS: Diagonal State Spaces are as Effective as Structured State Spaces (https://arxiv.org/abs/2203.14343)

S4D: On the Parameterization and Initialization of Diagonal State Space Models (https://arxiv.org/abs/2206.11893)

Thank you!

## 12 Comments: