Meet Mamba-3. A research paper submitted to ICLR 2026 introduced Mamba-3, which addresses several limitations of current sub-quadratic sequence models through three methodological changes grounded in classical state-space theory.
Code and a detailed implementation are not yet publicly available, as the paper is under review.
Core Modifications
1. Trapezoidal Discretization
The paper replaces Euler's rule (a first-order approximation) with a generalized trapezoidal rule (a second-order approximation) for discretizing the continuous-time SSM.
This results in:
- A recurrence that incorporates both current and previous inputs with data-dependent weights
- Ability to replace the short causal convolution when combined with learnable biases on the B and C projections
- Lower approximation error: O(Δt²) vs. O(Δt) for Euler's method (a numerical sketch follows this list)
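To make the difference concrete, here is a minimal NumPy sketch (ours, not the paper's unreleased code) comparing the two rules on a scalar, time-invariant SSM dh/dt = a·h + b·x. Mamba-3's actual rule is a generalized, data-dependent variant of the trapezoidal step shown here.

```python
import numpy as np

def euler_step(h, x_t, a, b, dt):
    # First-order rule: uses only the current input x_t.
    return (1 + dt * a) * h + dt * b * x_t

def trapezoid_step(h, x_t, x_prev, a, b, dt):
    # Second-order (bilinear) rule: averages the derivative at both
    # endpoints, so the recurrence mixes current AND previous inputs.
    decay = (1 + 0.5 * dt * a) / (1 - 0.5 * dt * a)
    gain = 0.5 * dt * b / (1 - 0.5 * dt * a)
    return decay * h + gain * (x_t + x_prev)

# Drive both with a constant input and compare to the exact solution.
a, b, dt, steps = -1.0, 1.0, 0.1, 50
x = np.ones(steps + 1)
h_e = h_t = 0.0
for t in range(1, steps + 1):
    h_e = euler_step(h_e, x[t], a, b, dt)
    h_t = trapezoid_step(h_t, x[t], x[t - 1], a, b, dt)
exact = (b / -a) * (1 - np.exp(a * dt * steps))  # closed form for x ≡ 1
print(f"euler error:     {abs(h_e - exact):.2e}")   # ~1.6e-03
print(f"trapezoid error: {abs(h_t - exact):.2e}")   # ~4.3e-05
```

On this toy problem the trapezoidal state lands roughly 40× closer to the exact solution at the same step size, which is the O(Δt²) vs. O(Δt) gap in action.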
2. Complex-Valued State Spaces
Mamba-2 simplified the transition matrix to a real scalar, which removed the model's ability to solve simple state-tracking tasks. Mamba-3 reintroduces complex-valued SSMs:
- Enables rotational dynamics in hidden states
- Mathematically equivalent to applying data-dependent rotary embeddings to B and C projections
- Can be computed efficiently using the "RoPE trick"
- Recovers performance on parity and modular arithmetic tasks (100% vs. <1% for Mamba-2; a rotation sketch follows this list)
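Here is a minimal NumPy sketch of the "RoPE trick" (our illustration; the per-step angles and the single 2-D state channel are stand-ins for the paper's data-dependent rotations): a recurrence that explicitly rotates the hidden state is identical to a plain cumulative-sum recurrence whose B and C projections have been pre-rotated by the cumulative angle.

```python
import numpy as np

def rot(theta):
    # 2-D rotation matrix; a complex eigenvalue e^{i*theta} in real form.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
T = 8
theta = rng.uniform(0, 0.5, T)   # per-step, data-dependent angles
B = rng.normal(size=(T, 2))      # input projections B_t
C = rng.normal(size=(T, 2))      # output projections C_t
x = rng.normal(size=T)

# (a) Explicit rotational recurrence: h_t = R(theta_t) h_{t-1} + B_t x_t
h, y_rec = np.zeros(2), np.zeros(T)
for t in range(T):
    h = rot(theta[t]) @ h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# (b) "RoPE trick": rotate B_t and C_t by the cumulative angle Theta_t;
# the state update itself becomes a rotation-free cumulative sum.
Theta = np.cumsum(theta)
h, y_rope = np.zeros(2), np.zeros(T)
for t in range(T):
    h = h + (rot(-Theta[t]) @ B[t]) * x[t]
    y_rope[t] = (rot(-Theta[t]) @ C[t]) @ h

print(np.allclose(y_rec, y_rope))  # True: the two forms are equivalent
```

Because 2-D rotations commute, the product of per-step rotations between steps s and t collapses to rot(Θ_t − Θ_s), which is what lets the rotation be folded into the projections and computed in parallel.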
3. MIMO Formulation
Changes the state update from an outer-product (rank-1) form to a matrix-multiplication (rank-r) form:
- Increases arithmetic intensity from ~2.5 to ~2r (where r is the MIMO rank)
- Better utilizes GPU accelerators during decode
- No increase in state size, so inference speed is maintained
- An optional feature that can be enabled when compute efficiency is prioritized (a sketch follows this list)
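A minimal NumPy sketch of the SISO vs. MIMO update (the shapes and names here are our assumptions, not the paper's notation): both variants write the same (n × p) state, but the rank-r update performs r multiply-adds per state element instead of one, raising arithmetic intensity without growing the state.

```python
import numpy as np

n, p, r = 128, 64, 4             # state dims and MIMO rank (illustrative)
rng = np.random.default_rng(0)
S = np.zeros((n, p))             # matrix-valued SSM state, same size in both

# SISO: rank-1 outer-product update, one multiply-add per state element,
# so decode-time updates are memory-bandwidth bound.
B_t = rng.normal(size=n)
x_t = rng.normal(size=p)
S_siso = S + np.outer(B_t, x_t)

# MIMO: rank-r matmul update, r multiply-adds per state element at the
# same state size, which keeps GPU matrix units busier during decode.
B_blk = rng.normal(size=(n, r))  # r input channels
X_blk = rng.normal(size=(r, p))  # r channels' worth of inputs
S_mimo = S + B_blk @ X_blk

assert S_siso.shape == S_mimo.shape  # identical state size either way
```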
Experimental Results
Language Modeling (100B FineWeb-Edu tokens):
- Outperforms Mamba-2, Transformer, and Gated DeltaNet baselines at all tested scales (180M, 440M, 820M, and 1.5B parameters)
- Example: Mamba-3-1.5B achieves 56.4% average accuracy vs. 55.7% for Mamba-2
State-Tracking Tasks:
- Parity: 100.0% (Mamba-2: 0.9%)
- Arithmetic without brackets: 98.5% (Mamba-2: 47.8%)
- Arithmetic with brackets: 87.8% (Mamba-2: 0.9%)
Inference Performance:
- Faster single-step decode than Mamba-2 despite the more complex SSM
- The MIMO variant improves the Pareto frontier: better perplexity at the same state size
- At the 440M scale with 100B tokens, MIMO achieves 12.72 perplexity vs. 12.87 for SISO