Module cache

§Mamba-2 Inference Caches

This module defines the state that must be preserved between calls during autoregressive (token-by-token) generation. During training or prefill the full sequence is available at once and the chunked SSD algorithm is used (see Mamba2::forward). During decoding the model processes one token per step and the SSM operates in its pure recurrent form (see Mamba2::step):

  hₜ = Āₜ hₜ₋₁ + B̄ₜ xₜ        (state update)
  yₜ = Cₜᵀ hₜ + D xₜ            (output)
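As a rough illustration of the recurrence above, the following sketch runs one decoding step for a single head in plain Rust. It is not the crate's `Mamba2::step`: the `HeadState` type and its fields are hypothetical, the decay `a_bar` is assumed to be a single scalar per head (as in Mamba-2), and `a_bar` and `b_bar` are assumed to be already discretized by the caller.

```rust
/// Hypothetical single-head SSM state, stored row-major as
/// `per_head_dim` rows by `state_rank` columns. Illustration only.
pub struct HeadState {
    pub per_head_dim: usize,
    pub state_rank: usize,
    pub h: Vec<f32>,
}

impl HeadState {
    /// Start from an all-zero state, i.e. an empty context.
    pub fn zeros(per_head_dim: usize, state_rank: usize) -> Self {
        Self {
            per_head_dim,
            state_rank,
            h: vec![0.0; per_head_dim * state_rank],
        }
    }

    /// One recurrent step for input `x` (length `per_head_dim`).
    ///
    /// * `a_bar`: discretized scalar decay for this head (assumption)
    /// * `b_bar`: discretized input projection, length `state_rank`
    /// * `c`:     output projection, length `state_rank`
    /// * `d`:     skip-connection weight
    pub fn step(&mut self, a_bar: f32, b_bar: &[f32], c: &[f32], d: f32, x: &[f32]) -> Vec<f32> {
        let n_state = self.state_rank;
        assert_eq!(x.len(), self.per_head_dim);
        assert_eq!(b_bar.len(), n_state);
        assert_eq!(c.len(), n_state);

        let mut y = vec![0.0; self.per_head_dim];
        for p in 0..self.per_head_dim {
            let row = &mut self.h[p * n_state..(p + 1) * n_state];
            let mut acc = 0.0f32;
            for n in 0..n_state {
                // State update: h[p][n] = a_bar * h[p][n] + b_bar[n] * x[p]
                row[n] = a_bar * row[n] + b_bar[n] * x[p];
                // Output read-out: accumulate c[n] * h[p][n]
                acc += c[n] * row[n];
            }
            // Add the skip connection D * x.
            y[p] = acc + d * x[p];
        }
        y
    }
}
```

Note that `h` is updated in place and never grows: each call costs the same amount of work and memory no matter how many tokens came before.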

Two pieces of state are required per layer:

Convolution cache: the last conv_kernel inputs to the depthwise Conv1d, kept so that every decoding step can apply the causal filter without re-processing previous tokens.
SSM hidden state: the matrix hₜ ∈ ℝ^{per_head_dim×state_rank} (one per head), which compresses the entire past context into a fixed-size representation, no matter how many tokens have been generated. This is the key memory-efficiency advantage of SSMs over attention: a Transformer's KV-cache grows linearly with sequence length, O(sequence·d_model) per layer (where d_model is the Transformer's hidden width), whereas the SSM state stays at O(per_head_dim·state_rank) per head.
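The convolution cache can be pictured as a small rolling window. The sketch below handles a single channel and uses a hypothetical `ConvWindow` type; the real cache presumably keeps one such window per channel (and per batch element) in a tensor, and its exact layout is not shown on this page.

```rust
/// Hypothetical rolling window for one channel of a causal depthwise
/// convolution. Illustration only, not the crate's cache type.
pub struct ConvWindow {
    /// The most recent `conv_kernel` inputs, oldest first.
    window: Vec<f32>,
}

impl ConvWindow {
    /// Start with zeros, which matches left zero-padding of the sequence.
    pub fn zeros(conv_kernel: usize) -> Self {
        assert!(conv_kernel > 0, "kernel size must be at least 1");
        Self { window: vec![0.0; conv_kernel] }
    }

    /// Shift the newest input `x` into the window and return the filter
    /// output for the current step. `kernel` is ordered oldest tap first.
    pub fn step(&mut self, kernel: &[f32], x: f32) -> f32 {
        assert_eq!(kernel.len(), self.window.len());
        // Drop the oldest input by rotating it to the end, then overwrite it.
        self.window.rotate_left(1);
        if let Some(last) = self.window.last_mut() {
            *last = x;
        }
        // Dot product of the cached inputs with the filter taps.
        self.window.iter().zip(kernel).map(|(w, k)| w * k).sum()
    }
}
```

Each step touches only `conv_kernel` values, so the filter never needs to revisit earlier tokens.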

Structs§

Mamba2Cache: The mutable state carried between decoding steps for a single Mamba-2 layer.
Mamba2CacheConfig: Configuration / factory for a single Mamba2Cache.
Mamba2Caches: A collection of per-layer caches for a complete Mamba-2 network.
Mamba2CachesConfig: Configuration / factory for Mamba2Caches.
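To make the config / cache / collection relationship concrete, here is a minimal sketch of how such types might fit together. Every name in it (`LayerShape`, `LayerCacheSketch`, `CachesSketch`) is hypothetical, the field layout is an assumption, and the batch dimension is omitted; consult the struct pages for the actual definitions.

```rust
/// Hypothetical shape parameters for one layer; the names are illustrative.
#[derive(Clone, Copy)]
pub struct LayerShape {
    pub conv_channels: usize,
    pub conv_kernel: usize,
    pub num_heads: usize,
    pub per_head_dim: usize,
    pub state_rank: usize,
}

/// Hypothetical per-layer cache: flat buffers stand in for tensors.
pub struct LayerCacheSketch {
    /// conv_channels * conv_kernel values
    pub conv: Vec<f32>,
    /// num_heads * per_head_dim * state_rank values
    pub ssm: Vec<f32>,
}

impl LayerShape {
    /// Factory: build a zero-initialised cache from the shape description.
    pub fn init(&self) -> LayerCacheSketch {
        LayerCacheSketch {
            conv: vec![0.0; self.conv_channels * self.conv_kernel],
            ssm: vec![0.0; self.num_heads * self.per_head_dim * self.state_rank],
        }
    }
}

/// Hypothetical collection holding one cache per layer.
pub struct CachesSketch {
    pub layers: Vec<LayerCacheSketch>,
}

impl CachesSketch {
    /// Build identical caches for `num_layers` layers.
    pub fn new(shape: LayerShape, num_layers: usize) -> Self {
        Self {
            layers: (0..num_layers).map(|_| shape.init()).collect(),
        }
    }

    /// Total number of cached values; independent of tokens generated.
    pub fn num_elements(&self) -> usize {
        self.layers.iter().map(|l| l.conv.len() + l.ssm.len()).sum()
    }
}
```

The total size is fixed at construction time, which reflects the constant-memory property described above.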
