vllm.model_executor.models.mamba_cache
MambaCacheManager ¶
Bases: ConstantSizeCache
Source code in vllm/model_executor/models/mamba_cache.py
__init__ ¶
__init__(
    vllm_config: VllmConfig,
    num_mamba_layers: int,
    conv_state_shape: tuple[int, int],
    temporal_state_shape: tuple[int, int],
    conv_state_dtype: dtype,
    temporal_state_dtype: dtype,
)
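To illustrate what the constructor conceptually sets up, here is a hedged pure-Python sketch, not vLLM's actual implementation: one constant-size conv-state buffer and one temporal (SSM) state buffer per Mamba layer, each sized for a maximum batch so no reallocation is ever needed. The class name `MambaCacheSketch`, the `max_batch_size` parameter, and the `_make_buffer` helper are illustrative assumptions; the real code allocates torch tensors on the GPU, for which nested lists stand in here.

```python
class MambaCacheSketch:
    """Illustrative stand-in for MambaCacheManager's buffer allocation."""

    def __init__(self, num_mamba_layers, max_batch_size,
                 conv_state_shape, temporal_state_shape):
        # One constant-size buffer pair per Mamba layer, sized for the
        # maximum batch, so CUDA graph replays never need reallocation.
        self.conv_state = [
            self._make_buffer((max_batch_size, *conv_state_shape))
            for _ in range(num_mamba_layers)
        ]
        self.temporal_state = [
            self._make_buffer((max_batch_size, *temporal_state_shape))
            for _ in range(num_mamba_layers)
        ]

    @staticmethod
    def _make_buffer(shape):
        # Zero-filled nested-list stand-in for torch.zeros(shape).
        if len(shape) == 1:
            return [0.0] * shape[0]
        return [MambaCacheSketch._make_buffer(shape[1:])
                for _ in range(shape[0])]


cache = MambaCacheSketch(num_mamba_layers=2, max_batch_size=4,
                         conv_state_shape=(8, 3),
                         temporal_state_shape=(8, 16))
# Two layers, each with a batch dimension of 4:
print(len(cache.conv_state), len(cache.conv_state[0]))  # 2 4
```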
_copy_cache ¶
current_run_tensors ¶
current_run_tensors(**kwargs) -> MambaCacheParams
Return the conv and SSM state tensors for the current run.
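The idea behind a constant-size cache is that each request is pinned to a fixed slot in the preallocated buffers, and a run receives the slot indices for its batch. The following is a minimal sketch of that slot-mapping pattern under stated assumptions; the `SlotAllocator` class and its method names are hypothetical, not vLLM's API.

```python
class SlotAllocator:
    """Illustrative request-to-cache-slot mapping for a constant-size cache."""

    def __init__(self, max_batch_size):
        self.free = list(range(max_batch_size))
        self.slot_of = {}  # request_id -> fixed index into the cache buffers

    def indices_for_run(self, request_ids):
        # A request keeps its slot across steps; new requests take a free one.
        indices = []
        for rid in request_ids:
            if rid not in self.slot_of:
                self.slot_of[rid] = self.free.pop(0)
            indices.append(self.slot_of[rid])
        return indices

    def release(self, request_id):
        # When a request finishes, its slot returns to the free pool.
        self.free.append(self.slot_of.pop(request_id))


alloc = SlotAllocator(max_batch_size=4)
print(alloc.indices_for_run(["a", "b"]))  # [0, 1]
print(alloc.indices_for_run(["b", "c"]))  # [1, 2] -- "b" keeps its slot
```

Because slots are stable across decode steps, the conv and SSM states written in one step are found at the same indices in the next.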
get_seqlen_agnostic_capture_inputs ¶
get_seqlen_agnostic_capture_inputs(batch_size: int)
Provide the CUDA graph capture runs with a buffer of the adjusted size. The buffer is used to maintain the Mamba cache during CUDA graph replay runs.
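CUDA graphs replay against fixed tensor addresses, so capture happens once against a buffer sized for the maximum batch, and each replay copies the run's real state indices into that same buffer in place. A minimal sketch of that pattern, with plain lists standing in for GPU tensors (the names `MAX_BATCH`, `capture_buffer`, and `prepare_replay` are illustrative assumptions):

```python
MAX_BATCH = 4
capture_buffer = [0] * MAX_BATCH  # fixed-address buffer, captured once


def prepare_replay(state_indices):
    # Write this run's cache indices into the front of the captured
    # buffer in place, so the replayed graph reads fresh data at the
    # same addresses it was captured with.
    for i, idx in enumerate(state_indices):
        capture_buffer[i] = idx


prepare_replay([3, 1])
print(capture_buffer)  # [3, 1, 0, 0]
```

Only batches up to `MAX_BATCH` can reuse the captured graph; that is why the buffer is allocated at the adjusted (maximum) size up front.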