Skip to main content

Module minimal

Module minimal 

Source
Expand description

§The Chunkwise SSD Algorithm

During training (and prefill), a naive sequential recurrence cannot exploit GPU tensor cores. The chunkwise SSD algorithm (§4 of the paper) achieves this by splitting the sequence into chunks of length Q and decomposing the computation into four steps:

  Step 1  (intra-chunk, quadratic form)   →  Y_diag
  Step 2  (input → chunk state)           →  state_bnhpr
  Step 3  (inter-chunk state scan)        →  state_bnhpr, final_state
  Step 4  (chunk state → output)          →  Y_off

  Y = Y_diag + Y_off

Steps 1, 2, 4 are fully parallel across chunks and use batched matrix multiplications (exploiting tensor cores). Step 3 is a short sequential scan over T/Q elements rather than T.