• State Space Models (SSMs) are a class of mathematical models used in various fields for describing systems that evolve over time. These models are characterized by their ability to represent dynamic systems through state variables and equations that capture the relationships between these variables.
  • This primer offers an overview of State Space Models and their application in deep learning.


  • A common theme in the search for new architectures that avoid the drawbacks of the Transformer architecture (quadratic time and space complexity in sequence length, large parameter count, etc.) is to design a mathematical framework/system for mapping the input sequence to an output sequence, such that:
    1. The mapping can be computed in parallel during training.
    2. The output can be expressed as a recurrence equation during inference. A constant state size further improves inference speed and memory requirements, since we no longer need a linearly growing KV cache.
    3. Framing the input-to-output sequence mapping through mathematical models such as State Space Models enables both 1 and 2.
    4. The convolutional form can be accelerated with Fast Fourier Transformations, since convolution in the time domain corresponds to pointwise multiplication in the frequency domain. Hyena Hierarchy and StripedHyena are two examples that leverage this observation.
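To make point 4 concrete, here is a minimal NumPy sketch (not the actual Hyena implementation; the sequence and kernel values are hypothetical) showing that convolution in the time domain equals pointwise multiplication in the frequency domain:

```python
import numpy as np

# Toy input sequence and convolution kernel (hypothetical values).
u = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.25, 0.125, 0.0625])

# Direct (linear) convolution: O(L^2) for sequence length L.
direct = np.convolve(u, k)

# FFT-based convolution: O(L log L). Zero-pad to the full output
# length so the circular convolution matches the linear one.
n = len(u) + len(k) - 1
fft_based = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, fft_based)
```

The zero-padding to length `L_u + L_k - 1` is what turns the FFT's circular convolution into the linear convolution used by these architectures.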

State Space Models: Overview

  1. Definition: A State Space Model typically consists of two sets of equations:
    • State Equations: These describe how the state of the system evolves over time.
    • Observation Equations: These link the state of the system to the measurements or observations that are made.
  2. Components:
    • State Variables: Represent the system’s internal state at a given time.
    • Inputs/Controls: External inputs that affect the state.
    • Outputs/Observations: What is measured or observed from the system.
  3. Usage: SSMs are widely used in control theory, econometrics, signal processing, and other areas where it’s crucial to model dynamic behavior over time.
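For the linear time-invariant case, the state and observation equations described above are commonly written as follows (a standard textbook formulation; here $x$ is the state, $u$ the input, $y$ the output, and $A, B, C, D$ are the system matrices):

```latex
\begin{aligned}
x'(t) &= A\,x(t) + B\,u(t) && \text{(state equation: how the state evolves)} \\
y(t)  &= C\,x(t) + D\,u(t) && \text{(observation equation: how outputs are read out)}
\end{aligned}
```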

SSMs in Deep Learning

  1. Combination with Neural Networks:
    • SSMs can be combined with neural networks to create powerful hybrid models. The neural network component can learn complex, nonlinear relationships in the data, which are then modeled dynamically through the state space framework.
    • This is particularly useful in scenarios where you have time-series data or need to model sequential dependencies.
  2. Time Series Analysis and Forecasting:
    • In deep learning, SSMs are often applied to time series analysis and forecasting. They can effectively capture temporal dynamics and dependencies, which are crucial in predicting future values based on past and present data.
    • Recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks are examples of deep learning models that can be viewed as a form of state space model.
  3. Reinforcement Learning:
    • In reinforcement learning, SSMs can be used to model the environment in which an agent operates. The state space represents the state of the environment, and the agent’s actions influence the transition between states.
    • This is particularly relevant in scenarios where the environment is partially observable or the dynamics are complex.
  4. Data Imputation and Anomaly Detection:
    • SSMs in deep learning can be applied to tasks like data imputation (filling in missing data) and anomaly detection in time-series data. They are capable of understanding normal patterns and detecting deviations.
  5. Customization and Flexibility:
    • Deep learning allows for the customization of the standard state space model structure, enabling the handling of more complex and high-dimensional data, which is common in modern applications.
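The RNN-as-SSM view in point 2 above can be made concrete: an Elman-style RNN cell is a (nonlinear) state space model in which the hidden state plays the role of the state variable and the output layer is the observation equation. A minimal NumPy sketch (all weights are hypothetical random values; in practice they are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, d_out = 4, 3, 2

# Hypothetical weights; in a trained RNN these are learned.
W = rng.normal(size=(d_state, d_state))  # state -> state
U = rng.normal(size=(d_state, d_in))     # input -> state
V = rng.normal(size=(d_out, d_state))    # state -> output

def rnn_step(h, x):
    """One step of an Elman RNN, read as a state space model:
    state equation:       h_t = tanh(W h_{t-1} + U x_t)
    observation equation: y_t = V h_t
    """
    h = np.tanh(W @ h + U @ x)
    return h, V @ h

h = np.zeros(d_state)
for x in rng.normal(size=(5, d_in)):  # a length-5 input sequence
    h, y = rnn_step(h, x)
print(y.shape)  # (2,)
```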

Theory of SSMs


  • The integration of State Space Models with deep learning represents a powerful approach to modeling dynamic systems, especially in scenarios involving time-series data or environments with temporal dependencies. The flexibility and adaptability of these models make them suitable for a wide range of applications, from forecasting and anomaly detection to complex reinforcement learning environments.
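Concretely, discretizing the continuous linear SSM turns it into exactly the kind of constant-state recurrence described at the start of this primer, and the same map unrolls into a convolution, which is what enables parallel training. A minimal sketch with hypothetical matrices, using Euler discretization for simplicity (real SSM layers such as S4 use more accurate schemes, e.g. zero-order hold or the bilinear transform):

```python
import numpy as np

# Continuous-time linear SSM: x'(t) = A x(t) + B u(t), y(t) = C x(t).
# Hypothetical 2-dimensional state, scalar input/output.
A = np.array([[-0.5, 0.0], [0.0, -0.2]])
B = np.array([[1.0], [1.0]])
C = np.array([[0.3, 0.7]])
dt = 0.1  # step size

# Euler discretization of the dynamics.
A_bar = np.eye(2) + dt * A
B_bar = dt * B

def ssm_recurrence(u_seq):
    """Inference mode: x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k,
    with a constant-size state (no linearly growing cache)."""
    x = np.zeros((2, 1))
    ys = []
    for u in u_seq:
        x = A_bar @ x + B_bar * u
        ys.append((C @ x).item())
    return ys

def ssm_kernel(L):
    """Training mode: the same map is a convolution with kernel
    K = (C B_bar, C A_bar B_bar, C A_bar^2 B_bar, ...)."""
    K, M = [], B_bar
    for _ in range(L):
        K.append((C @ M).item())
        M = A_bar @ M
    return np.array(K)

u_seq = [1.0, -1.0, 0.5, 2.0]
rec = ssm_recurrence(u_seq)
conv = np.convolve(u_seq, ssm_kernel(len(u_seq)))[:len(u_seq)]
assert np.allclose(rec, conv)
```

The assertion at the end checks the central point: the recurrent and convolutional forms compute the same outputs, so a model can train in parallel via the kernel and serve via the recurrence.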

Further Reading

  • Related resources to understand RWKV, another architecture that converts self-attention to a linear operation:
    • Intro to RWKV presents an overview of the RWKV language model, an RNN that combines the benefits of transformers and RNNs, offering efficient training, reduced memory use during inference, and excellent scaling up to 14 billion parameters, while being an open-source project open to community contribution.
    • Annotated RWKV offers 100 lines of code to implement a basic version of RWKV.


If you found our work useful, please cite it as:

@article{Chadha2020DistilledStateSpaceModels,
  title   = {State Space Models},
  author  = {Chadha, Aman and Jain, Vinija},
  journal = {Distilled AI},
  year    = {2020},
  note    = {\url{}}
}