Struct Mamba2

pub struct Mamba2<B: Backend> {
    pub in_proj: Linear<B>,
    pub conv1d: Conv1d<B>,
    pub dt_bias_h: Param<Tensor<B, 1>>,
    pub dt_limit: (f64, f64),
    pub a_log_h: Param<Tensor<B, 1>>,
    pub d_h: Param<Tensor<B, 1>>,
    pub norm: RmsNormGated<B>,
    pub out_proj: Linear<B>,
    pub init_state_hpr: Option<Param<Tensor<B, 3>>>,
    pub state_rank: usize,
    pub ngroups: usize,
}

The Mamba-2 SSM block.

Implements the full SSD layer as described in §5 of the paper. Supports two execution modes:

Self::forward — chunkwise SSD for training / prefill (exploits tensor cores; linear in sequence length T)
Self::step — pure recurrent form for token-by-token decoding (O(H·P·N) per step; no KV-cache)
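
The recurrent form used by step can be illustrated for a single head in plain Rust. This is only a sketch under stated assumptions: the function name ssm_step, the nested-Vec state layout [P][N], and the use of f64 are hypothetical and are not the crate's tensor-based implementation. It shows why one step costs O(P·N) per head, hence O(H·P·N) overall.

```rust
/// One recurrent SSM step for a single head (illustrative only).
/// `state` is the hidden state h with shape [P][N]; `x` has length P;
/// `b` and `c` have length N; `a_bar` is the scalar decay exp(dt * A).
/// Updates `state` in place and returns y (length P).
fn ssm_step(
    state: &mut [Vec<f64>],
    x: &[f64],
    b: &[f64],
    c: &[f64],
    dt: f64,
    a_bar: f64,
) -> Vec<f64> {
    let mut y = vec![0.0; x.len()];
    for (p, row) in state.iter_mut().enumerate() {
        for (n, h) in row.iter_mut().enumerate() {
            // h[p][n] = a_bar * h[p][n] + (dt * B[n]) * x[p]
            *h = a_bar * *h + dt * b[n] * x[p];
            // y[p] = sum over n of C[n] * h[p][n]
            y[p] += c[n] * *h;
        }
    }
    y
}
```

Calling ssm_step once per token with the same state buffer reproduces the sequence output without storing past tokens, which is why no KV-cache is needed.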

§Architecture (one forward pass through the block)

  u  [B, T, D]
  ├─ in_proj ──────────────────────────────────┐
  │                                            │
  │  z [B,T,I]   xbc [B,T,V]   dt_raw [B,T,H] │
  │                │                           │
  │            causal Conv1d                   │
  │                │ SiLU                      │
  │           split into                       │
  │       x [B,T,H,P]  B [B,T,G,N]  C [B,T,G,N]
  │                                            │
  │     Δ = softplus(dt_raw + dt_bias)         │
  │     Ā = exp(Δ · A)   [scalar per head]     │
  │     B̄ = Δ · B                              │
  │                                            │
  │  ┌──── chunked_selective_scan ─────────┐   │
  │  │  (Steps 1–4, see below)             │   │
  │  └─────────────────────────────────────┘   │
  │     y [B,T,H,P]                            │
  │     + D skip                               │
  │     RmsNormGated(·, z)                     │
  └─ out_proj ─────────────────────────────────┘
  output  [B, T, D]

Fields§

§in_proj: Linear

Input projection: maps d_model → d_inner + conv_dim + nheads.

The output is split into three parts:

z [B, T, d_inner] — multiplicative gate for the output norm
xbc [B, T, conv_dim] — input to the causal convolution, which is then split into (x, B, C) after activation
dt_raw [B, T, nheads] — raw (pre-softplus) discretisation step Δ
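
The widths of the three slices can be worked out from the shapes in the architecture diagram. The sketch below assumes d_inner = nheads · per_head_dim and conv_dim = d_inner + 2 · ngroups · state_rank, which is inferred from x [B,T,H,P], B [B,T,G,N] and C [B,T,G,N] rather than stated explicitly; the helper name in_proj_split is hypothetical.

```rust
/// Widths of the three slices of the in_proj output, in order (z, xbc, dt_raw).
/// ASSUMPTION: d_inner = H * P and conv_dim = d_inner + 2 * G * N,
/// inferred from the tensor shapes in the architecture diagram.
fn in_proj_split(
    nheads: usize,
    per_head_dim: usize,
    ngroups: usize,
    state_rank: usize,
) -> (usize, usize, usize) {
    // x and z each carry H * P channels.
    let d_inner = nheads * per_head_dim;
    // xbc is x, B and C concatenated along the channel axis.
    let conv_dim = d_inner + 2 * ngroups * state_rank;
    (d_inner, conv_dim, nheads)
}
```

For example, with 8 heads of dimension 64, one group, and state rank 128, the slices are 512, 768, and 8 channels wide, so in_proj would have 1288 output features under these assumptions.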

§conv1d: Conv1d

Causal depthwise Conv1d applied to the xbc projection.

Input/output channels: conv_dim
Kernel size: conv_kernel (typically 4)
Groups: conv_dim (fully depthwise — each channel is independent)
Padding: none (left-padding is applied manually so the convolution is strictly causal)

The convolution provides a local conv_kernel-token context window before the SSM, which helps the model capture short-range dependencies that the SSM’s recurrent form handles less efficiently.
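
A causal depthwise convolution with manual left-padding can be sketched without any tensor library. The function below is a hypothetical plain-Rust illustration (nested Vec layout, f64 values, tap ordering with the last tap on the current token), not the Conv1d module itself.

```rust
/// Causal depthwise 1-D convolution over one sequence (illustrative only).
/// `input` is [T][C], `weight` is [C][K] (one kernel per channel), `bias` is [C].
/// Output position t depends only on input positions t-K+1 ..= t.
fn causal_depthwise_conv1d(
    input: &[Vec<f64>],
    weight: &[Vec<f64>],
    bias: &[f64],
) -> Vec<Vec<f64>> {
    let t_len = input.len();
    let mut out = vec![vec![0.0; bias.len()]; t_len];
    for t in 0..t_len {
        for (ch, kernel) in weight.iter().enumerate() {
            let k_len = kernel.len();
            let mut acc = bias[ch];
            for (k, w) in kernel.iter().enumerate() {
                // Tap k looks (k_len - 1 - k) tokens into the past.
                let back = k_len - 1 - k;
                // Positions before the start of the sequence act as zeros
                // (the implicit left-padding).
                if t >= back {
                    acc += w * input[t - back][ch];
                }
            }
            out[t][ch] = acc;
        }
    }
    out
}
```

Because each channel has its own kernel and never reads another channel, this matches the "groups = conv_dim" description, and because no tap reads a future position the output is strictly causal.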

§dt_bias_h: Param<Tensor<B, 1>>

Per-head bias for the discretisation step size Δ.

Shape: [nheads]

The step size is computed as Δₜ = softplus(dt_raw_t + dt_bias) and then clamped to dt_limit. The bias is initialised so that the corresponding initial Δ values are log-uniformly distributed in [dt_min, dt_max].
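
One common way to obtain such an initialisation is to draw the target step size log-uniformly and store its inverse softplus as the bias. The sketch below assumes that approach and assumes dt_raw contributes zero at initialisation; the function names are hypothetical and the uniform sample u is passed in so that no random-number crate is needed.

```rust
/// softplus(x) = ln(1 + e^x)
fn softplus(x: f64) -> f64 {
    x.exp().ln_1p()
}

/// Inverse of softplus for y > 0: x = y + ln(1 - e^{-y}).
fn inv_softplus(y: f64) -> f64 {
    y + (-(-y).exp_m1()).ln()
}

/// Map u in [0, 1] to a log-uniform step size in [dt_min, dt_max],
/// then return the bias b such that softplus(b) equals that step size.
/// ASSUMPTION: this mirrors a common initialisation scheme; the crate's
/// actual initialiser is not shown in this page.
fn dt_bias_from_uniform(u: f64, dt_min: f64, dt_max: f64) -> f64 {
    let log_dt = dt_min.ln() + u * (dt_max.ln() - dt_min.ln());
    inv_softplus(log_dt.exp())
}
```

Applying softplus to the returned bias recovers the sampled step size, so u = 0 gives dt_min and u = 1 gives dt_max.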

§dt_limit: (f64, f64)

Hard clamp applied to Δ after softplus: Δ ∈ [dt_limit.0, dt_limit.1].

Prevents degenerate discretisations (e.g. Δ → 0 causes Ā → 1, meaning the state never decays; Δ → ∞ causes Ā → 0, meaning the state is immediately wiped each step).

§a_log_h: Param<Tensor<B, 1>>

Per-head log-magnitude of the continuous-time decay parameter A.

Shape: [nheads]

The actual (negative) decay rate is A = -exp(a_log). The discrete decay is Āₜ = exp(Δₜ · A) = exp(-Δₜ · exp(a_log)) ∈ (0, 1).

Storing the log of the magnitude and negating ensures A < 0 (decaying system) unconditionally and avoids any sign-constraint during gradient descent.
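
The parameterisation can be checked numerically with a scalar sketch. The helper name discrete_decay is hypothetical, dt is assumed to be the post-softplus step size, and the clamp follows the dt_limit description.

```rust
/// Discrete per-step decay for one head: A_bar = exp(dt * A), A = -exp(a_log).
/// `dt` is assumed to be the post-softplus step size; it is clamped to
/// `dt_limit` before use.
fn discrete_decay(a_log: f64, dt: f64, dt_limit: (f64, f64)) -> f64 {
    // exp(a_log) > 0 for every real a_log, so A is always negative.
    let a = -a_log.exp();
    let dt = dt.clamp(dt_limit.0, dt_limit.1);
    (dt * a).exp()
}
```

Whatever value gradient descent assigns to a_log, the result stays strictly between 0 and 1 as long as the clamped Δ is positive and finite.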

§d_h: Param<Tensor<B, 1>>

Per-head skip (D) coefficient.

Shape: [nheads]

Adds a direct path from the (post-convolution, pre-SSM) input to the output: yₜ += D · xₜ. Initialised to ones.

§norm: RmsNormGated

Gated RMSNorm applied to the SSM output, conditioned on the gate z.

Input channel dimension: d_inner.

This combines the multiplicative gate (from z) and a normalisation step into a single fused operation, matching the architecture in §5.2 of the paper.
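
A gated RMSNorm over a single feature vector might look like the following. This is a sketch under assumptions that this page does not confirm: the gate is taken to be SiLU(z) applied before normalisation, and the eps and weight parameters are hypothetical names.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// Gated RMSNorm over one feature vector (illustrative only).
/// ASSUMPTION: gate first (y * silu(z)), then RMS-normalise and scale by
/// `weight`; the order and the `eps` parameter are not taken from the crate.
fn rms_norm_gated(y: &[f64], z: &[f64], weight: &[f64], eps: f64) -> Vec<f64> {
    // Apply the multiplicative gate element-wise.
    let gated: Vec<f64> = y.iter().zip(z).map(|(y, z)| y * silu(*z)).collect();
    // Root-mean-square over the feature dimension.
    let mean_sq = gated.iter().map(|v| v * v).sum::<f64>() / gated.len() as f64;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    gated.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```

With unit weights and eps = 0 the output has a root-mean-square of 1, and any channel whose gate input is 0 is zeroed out.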

§out_proj: Linear

Output projection: maps d_inner → d_model.

§init_state_hpr: Option<Param<Tensor<B, 3>>>

Optional learnable initial hidden state h₀.

Shape: [nheads, per_head_dim, state_rank] (i.e. [H, P, N])

When None, the initial state is zero (the standard default). When Some, the stored tensor is used as the initial condition for every forward call. It is shared across the batch: a single [H, P, N] tensor is broadcast over the batch dimension rather than stored per sequence.

§state_rank: usize

State rank N: the number of latent dimensions in the SSM hidden state h ∈ ℝ^{P×N} per head (matching the [H, P, N] layout of init_state_hpr). Corresponds to the paper’s N.

§ngroups: usize

Number of B/C groups G for grouped SSM heads (analogous to grouped-query attention). G divides nheads; all nheads/G heads within a group share the same B and C projections while having independent X, A, and Z projections.
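
The head-to-group sharing can be made concrete with a small index helper. The sketch assumes heads are assigned to groups in contiguous blocks of nheads / ngroups, which is the usual grouped-query convention but is not stated on this page; group_of_head is a hypothetical name.

```rust
/// Index of the B/C group used by head `head` (illustrative only).
/// ASSUMPTION: heads are assigned to groups in contiguous blocks of
/// nheads / ngroups heads each.
fn group_of_head(head: usize, nheads: usize, ngroups: usize) -> usize {
    assert!(ngroups > 0 && nheads % ngroups == 0, "ngroups must divide nheads");
    assert!(head < nheads, "head index out of range");
    head / (nheads / ngroups)
}
```

With nheads = 8 and ngroups = 2, heads 0 to 3 read B and C from group 0 and heads 4 to 7 from group 1; ngroups = 1 means every head shares one B and C.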

Implementations§

impl<B: Backend> Mamba2<B>

pub fn step( &self, input_bm: Tensor<B, 2>, cache: Option<Mamba2Cache<B>>, ) -> (Tensor<B, 2>, Mamba2Cache<B>)

§Shapes

impl<B: Backend> Mamba2<B>

pub fn d_inner(&self) -> usize

pub fn nheads(&self) -> usize

pub fn per_head_dim(&self) -> usize

pub fn conv_dim(&self) -> usize

impl<B: Backend + Mamba2BackendExt> Mamba2<B>

pub fn forward( &self, input_bsm: Tensor<B, 3>, cache: Option<Mamba2Cache<B>>, ssd_path: Mamba2SsdPath, ) -> (Tensor<B, 3>, Mamba2Cache<B>)

§Full dataflow

§Sequence padding

§Shapes

impl<B: Backend> Mamba2<B>

pub fn ssd_minimal(input: Mamba2SsdInput<B>) -> (Tensor<B, 5>, Tensor<B, 4>)

§The four steps

§Step 1 — Intra-chunk outputs (Y_diag)

§Step 2 — Chunk state (state_bnhpr)

§Step 3 — Inter-chunk state scan (state passing)

§Step 4 — State-to-output (Y_off)

§Final output (with D skip-connection)

impl<B: Backend> Mamba2<B>

pub fn ssd_serial(input: Mamba2SsdInput<B>) -> (Tensor<B, 5>, Tensor<B, 4>)

impl<B: Backend + Mamba2BackendExt> Mamba2<B>

pub fn ssd_serial_recalculated( input: Mamba2SsdInput<B>, ) -> (Tensor<B, 5>, Tensor<B, 4>)

Trait Implementations§

impl<B> AutodiffModule<B> for Mamba2<B>
where B: AutodiffBackend + Backend, <B as AutodiffBackend>::InnerBackend: Backend,

type InnerModule = Mamba2<<B as AutodiffBackend>::InnerBackend>

fn valid(&self) -> Self::InnerModule

fn from_inner(module: Self::InnerModule) -> Self

impl<B: Backend> Clone for Mamba2<B>

fn clone(&self) -> Self

fn clone_from(&mut self, source: &Self)

impl<B: Debug + Backend> Debug for Mamba2<B>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl<B: Backend> Display for Mamba2<B>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

impl<B> HasAutodiffModule<B> for Mamba2<B::InnerBackend>
where B: AutodiffBackend + Backend, <B as AutodiffBackend>::InnerBackend: Backend,

type TrainModule = Mamba2<B>

impl<B: Backend> Module<B> for Mamba2<B>

type Record = Mamba2Record<B>

fn load_record(self, record: Self::Record) -> Self

fn into_record(self) -> Self::Record

fn num_params(&self) -> usize

fn visit<Visitor: ModuleVisitor<B>>(&self, visitor: &mut Visitor)

fn map<Mapper: ModuleMapper<B>>(self, mapper: &mut Mapper) -> Self

fn collect_devices(&self, devices: Devices<B>) -> Devices<B>

fn to_device(self, device: &B::Device) -> Self

fn fork(self, device: &B::Device) -> Self

fn devices(&self) -> Vec<<B as BackendTypes>::Device>

fn no_grad(self) -> Self

fn train<AB>(self) -> Self::TrainModule
where AB: AutodiffBackend<InnerBackend = B>, Self: HasAutodiffModule<AB>,

fn quantize_weights(self, quantizer: &mut Quantizer) -> Self

impl<B: Backend> ModuleDisplay for Mamba2<B>

fn format(&self, passed_settings: DisplaySettings) -> String

fn custom_settings(&self) -> Option<DisplaySettings>

fn custom_content(&self, _content: Content) -> Option<Content>

impl<B: Backend> ModuleDisplayDefault for Mamba2<B>

fn content(&self, content: Content) -> Option<Content>

fn num_params(&self) -> usize

Auto Trait Implementations§

impl<B> !Freeze for Mamba2<B>

impl<B> !RefUnwindSafe for Mamba2<B>

impl<B> Send for Mamba2<B>

impl<B> Sync for Mamba2<B>

impl<B> Unpin for Mamba2<B>
where <B as BackendTypes>::Device: Unpin, <B as BackendTypes>::FloatTensorPrimitive: Unpin, <B as BackendTypes>::QuantizedTensorPrimitive: Unpin,

impl<B> UnsafeUnpin for Mamba2<B>
where <B as BackendTypes>::Device: UnsafeUnpin, <B as BackendTypes>::FloatTensorPrimitive: UnsafeUnpin, <B as BackendTypes>::QuantizedTensorPrimitive: UnsafeUnpin,

impl<B> !UnwindSafe for Mamba2<B>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T
where T: Clone,

impl<T, U> Into<U> for T
where U: From<T>,

impl<T> ToOwned for T
where T: Clone,

impl<T> ToString for T
where T: Display + ?Sized,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,