vllm.model_executor.layers.mamba.mamba2_metadata
Mamba2Metadata dataclass ¶
Source code in vllm/model_executor/layers/mamba/mamba2_metadata.py
chunk_offsets instance-attribute ¶

chunk_offsets: Tensor

With the continuous batching layout of x in vLLM, two supporting tensors
(batch_ptr, token_chunk_offset_ptr) are used so that a Triton program can
handle a request in parallel. BLOCK_M is the number of tokens handled by one
Triton program (it can be customized for different hardware).

nums_dict: tracks the data associated with a given value of BLOCK_M
    BLOCK_M: number of tokens handled by a Triton program
    cu_seqlen: total tokens per batch (used as a flag to update the other
        data on each new input)
    batch_ptr: tracks the batch id handled by the Triton program
    token_chunk_offset_ptr: tracks the token group index handled by the
        Triton program

(The Triton implementation of causal_conv1d parallelizes over three axes:
the feature axis, the batch axis, and the sequence axis.)
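To make the roles of batch_ptr and token_chunk_offset_ptr concrete, here is a
minimal sketch (not the vLLM implementation) that derives both tensors from
per-request sequence lengths; build_program_maps and seqlens are hypothetical
names introduced for illustration.

import math
import torch

def build_program_maps(seqlens: list[int], BLOCK_M: int):
    # One Triton program covers BLOCK_M tokens of one request, so a request
    # of seqlen tokens needs ceil(seqlen / BLOCK_M) programs.
    batch_ptr, token_chunk_offset_ptr = [], []
    for batch_id, seqlen in enumerate(seqlens):
        for group_idx in range(math.ceil(seqlen / BLOCK_M)):
            batch_ptr.append(batch_id)
            token_chunk_offset_ptr.append(group_idx)
    return torch.tensor(batch_ptr), torch.tensor(token_chunk_offset_ptr)

# Two requests of 5 and 3 tokens with BLOCK_M=4:
#   batch_ptr              -> tensor([0, 0, 1])
#   token_chunk_offset_ptr -> tensor([0, 1, 0])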
token_chunk_offset_ptr class-attribute instance-attribute ¶

token_chunk_offset_ptr: Optional[tensor] = None
__init__ ¶
__init__(
has_initial_states: Tensor,
prep_initial_states: bool,
chunk_size: int,
seq_idx: Tensor,
chunk_indices: Tensor,
chunk_offsets: Tensor,
nums_dict: Optional[dict] = None,
cu_seqlen: Optional[int] = None,
batch_ptr: Optional[tensor] = None,
token_chunk_offset_ptr: Optional[tensor] = None,
) -> None
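A hypothetical construction, shown only to illustrate which fields are
required and which default to None (the tensor contents are placeholders,
not values vLLM would actually produce):

import torch
from vllm.model_executor.layers.mamba.mamba2_metadata import Mamba2Metadata

meta = Mamba2Metadata(
    has_initial_states=torch.tensor([False, True]),
    prep_initial_states=True,
    chunk_size=256,
    seq_idx=torch.zeros(1, 8, dtype=torch.int32),
    chunk_indices=torch.tensor([0, 1], dtype=torch.int32),
    chunk_offsets=torch.tensor([0, 0], dtype=torch.int32),
)
# nums_dict, cu_seqlen, batch_ptr and token_chunk_offset_ptr stay None here;
# they can be filled in later (e.g. by update_metadata below).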
get_platform_metadata_classes ¶
get_platform_metadata_classes() -> tuple[
type[AttentionMetadata], ...
]
Returns the appropriate metadata classes for the current platform.
Source code in vllm/model_executor/layers/mamba/mamba2_metadata.py
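Since isinstance() accepts a tuple of types, the returned tuple can be used
directly to check whether a given metadata object is one of the
platform-specific variants. A small usage sketch; attn_metadata and
is_attention_metadata are placeholder names introduced here:

from vllm.model_executor.layers.mamba.mamba2_metadata import (
    get_platform_metadata_classes)

def is_attention_metadata(attn_metadata) -> bool:
    # One isinstance() call covers every platform-specific metadata class.
    metadata_classes = get_platform_metadata_classes()
    return isinstance(attn_metadata, metadata_classes)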
prepare_mamba2_metadata ¶
prepare_mamba2_metadata(
chunk_size: int,
attn_metadata: AttentionMetadata,
mamba2_metadata=None,
) -> Mamba2Metadata
Source code in vllm/model_executor/layers/mamba/mamba2_metadata.py
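One way to read the chunk_indices and chunk_offsets fields this function
prepares: in a prefill batch, the chunked scan has to respect both chunk_size
boundaries and sequence boundaries. The following is an assumption-level
illustration of that idea (not the vLLM code), with a hypothetical
naive_chunk_boundaries helper:

import torch

def naive_chunk_boundaries(seqlens: list[int], chunk_size: int):
    total = sum(seqlens)
    seq_starts = torch.tensor([0] + seqlens).cumsum(0)[:-1]
    chunk_starts = torch.arange(0, total, chunk_size)
    # A new logical chunk begins wherever either boundary kind occurs.
    return torch.unique(torch.cat([seq_starts, chunk_starts]))

# Two requests of lengths 6 and 3 with chunk_size=4:
print(naive_chunk_boundaries([6, 3], chunk_size=4))  # tensor([0, 4, 6, 8])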
update_metadata ¶
update_metadata(
x: Tensor,
query_start_loc: Tensor,
mamba2_metadata: Union[
Mamba2Metadata, Mamba2AttentionMetadata
],
)
This is triggered upon handling a new input at the first layer.
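A hedged sketch of that flow: when the total token count from
query_start_loc differs from the cached cu_seqlen, the per-BLOCK_M program
maps are rebuilt for the new batch layout. Field names follow the
Mamba2Metadata docstring, but the body is illustrative, not the vLLM
implementation.

import math
import torch

def sketch_update(x: torch.Tensor, query_start_loc: torch.Tensor, meta):
    # x is accepted for signature parity; the sketch only needs
    # query_start_loc to recover per-request sequence lengths.
    cu_seqlen = int(query_start_loc[-1])  # total tokens in this input
    if meta.cu_seqlen == cu_seqlen:
        return meta                       # same layout: nothing to rebuild
    meta.cu_seqlen = cu_seqlen
    seqlens = (query_start_loc[1:] - query_start_loc[:-1]).tolist()
    for BLOCK_M, data in (meta.nums_dict or {}).items():
        batch_ptr, offsets = [], []
        for batch_id, seqlen in enumerate(seqlens):
            # ceil(seqlen / BLOCK_M) programs per request, as in the
            # docstring's description of nums_dict.
            for group_idx in range(math.ceil(seqlen / BLOCK_M)):
                batch_ptr.append(batch_id)
                offsets.append(group_idx)
        data["batch_ptr"] = torch.tensor(batch_ptr)
        data["token_chunk_offset_ptr"] = torch.tensor(offsets)
    return meta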