Very small discrepancy in kernel construction
Read the code and it all looks good. There's a very tiny difference in the parameterization of Λ: you use

`Lambda = -self.Lambda_real.exp() + 1j * self.Lambda_imag.exp()`

but DSS/S4D don't put an exp on the imaginary part:

`Lambda = -self.Lambda_real_log.exp() + 1j * self.Lambda_imag`
The exp is to ensure the real part is negative, which can be important for unbounded settings (e.g. generation). I doubt the choice between these two parameterizations makes any practical difference, though.
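For anyone comparing the two side by side, here's a minimal sketch of what I mean (PyTorch; the module name and shapes are made up for illustration, only the two `Lambda` lines mirror the snippets above):

```python
import torch
import torch.nn as nn

class DiagLambda(nn.Module):
    """Illustrative sketch of the two parameterizations, not code from either repo."""

    def __init__(self, d_state, exp_on_imag=False):
        super().__init__()
        # store the log of the decay rate so exp() keeps the real part strictly negative
        self.Lambda_real_log = nn.Parameter(torch.randn(d_state))
        self.Lambda_imag = nn.Parameter(torch.randn(d_state))
        self.exp_on_imag = exp_on_imag

    def forward(self):
        if self.exp_on_imag:
            # parameterization in this repo: exp on both the real and imaginary parts
            return -self.Lambda_real_log.exp() + 1j * self.Lambda_imag.exp()
        # DSS/S4D parameterization: exp only on the real part, imaginary part left free
        return -self.Lambda_real_log.exp() + 1j * self.Lambda_imag
```

The only functional difference is that the exp constrains the imaginary part (the oscillation frequencies) to be positive, which is why I don't expect it to matter much.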
I can also share my own experiments with this model. I ran two datasets (sequential CIFAR and Speech Commands) and saw quite different behavior from what the paper found for LM.
- I found that the gating actually hurts performance, controlling for parameters/speed (on GPU). From talking to the authors, my understanding is that the gating is mainly a computational trick: it's used to move more parameters into the FFN and fewer into the SSM, since the FFT is so slow on TPU.
- The random initialization works fine on CIFAR (just 1-2% worse), but it's really bad on SC. I think it might vary a lot depending on the characteristics of the data, so I'd believe that it works on LM, but I'd be shocked if it works on PathX. Curious to see how your experiments turn out. (A rough sketch of the two inits I compared is below.)
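For reference, this is roughly what I mean by the two inits (a sketch, not the exact code from my runs; `init_lambda` is a made-up helper, and the structured branch follows the S4D-Lin recipe Λ_n = -1/2 + iπn):

```python
import math
import torch

def init_lambda(d_state, mode="s4d-lin"):
    """Sketch of the structured vs. random initialization of the diagonal Lambda."""
    if mode == "s4d-lin":
        real = -0.5 * torch.ones(d_state)               # fixed decay of -1/2 (S4D-Lin)
        imag = math.pi * torch.arange(d_state).float()  # linearly spaced frequencies
    else:
        real = -torch.rand(d_state)                     # random decay rates in (-1, 0]
        imag = torch.randn(d_state)                     # random frequencies
    return real + 1j * imag
```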