Expand description
§The Chunkwise SSD Algorithm
During training (and prefill), a naive sequential recurrence cannot exploit GPU tensor cores. The chunkwise SSD algorithm (§4 of the paper) achieves this by splitting the sequence into chunks of length Q and decomposing the computation into four steps:
Step 1 (intra-chunk, quadratic form) → Y_diag
Step 2 (input → chunk state) → state_bnhpr
Step 3 (inter-chunk state scan) → state_bnhpr, final_state
Step 4 (chunk state → output) → Y_off
Y = Y_diag + Y_offSteps 1, 2, 4 are fully parallel across chunks and use batched matrix
multiplications (exploiting tensor cores). Step 3 is a short sequential
scan over T/Q elements rather than T.