vllm.attention.backends.abstract
AttentionBackend
Bases: ABC
Abstract class for attention backends.
copy_blocks abstractmethod staticmethod
full_cls_name classmethod
get_builder_cls abstractmethod staticmethod
get_builder_cls() -> Type[AttentionMetadataBuilder]
get_impl_cls abstractmethod staticmethod
get_impl_cls() -> Type[AttentionImpl]
get_kv_cache_shape abstractmethod staticmethod
get_kv_cache_stride_order staticmethod
get_metadata_cls abstractmethod staticmethod
get_metadata_cls() -> Type[AttentionMetadata]
get_state_cls abstractmethod staticmethod
get_state_cls() -> Type[AttentionState]
make_metadata classmethod
make_metadata(*args, **kwargs) -> AttentionMetadata
swap_blocks abstractmethod staticmethod
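A minimal sketch of a concrete backend, assuming hypothetical component classes MyAttentionImpl, MyAttentionMetadata, MyMetadataBuilder, and MyAttentionState are defined elsewhere; only the class-wiring methods shown above are implemented here.

from typing import Type

from vllm.attention.backends.abstract import (AttentionBackend, AttentionImpl,
                                              AttentionMetadata,
                                              AttentionMetadataBuilder,
                                              AttentionState)


class MyAttentionBackend(AttentionBackend):
    """Hypothetical backend that wires its component classes together."""

    @staticmethod
    def get_impl_cls() -> Type[AttentionImpl]:
        return MyAttentionImpl  # hypothetical AttentionImpl subclass

    @staticmethod
    def get_metadata_cls() -> Type[AttentionMetadata]:
        return MyAttentionMetadata  # hypothetical AttentionMetadata subclass

    @staticmethod
    def get_builder_cls() -> Type[AttentionMetadataBuilder]:
        return MyMetadataBuilder  # hypothetical AttentionMetadataBuilder subclass

    @staticmethod
    def get_state_cls() -> Type[AttentionState]:
        return MyAttentionState  # hypothetical AttentionState subclass

As its signature suggests, make_metadata then constructs an instance of the class returned by get_metadata_cls, so callers never need to name the concrete metadata type.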
AttentionImpl
__init__ abstractmethod
__init__(
num_heads: int,
head_size: int,
scale: float,
num_kv_heads: Optional[int] = None,
alibi_slopes: Optional[List[float]] = None,
sliding_window: Optional[int] = None,
kv_cache_dtype: str = "auto",
logits_soft_cap: Optional[float] = None,
attn_type: str = DECODER,
kv_sharing_target_layer_name: Optional[str] = None,
) -> None
forward abstractmethod
forward(
layer: AttentionLayer,
query: Tensor,
key: Tensor,
value: Tensor,
kv_cache: Tensor,
attn_metadata: T,
output: Optional[Tensor] = None,
output_scale: Optional[Tensor] = None,
output_block_scale: Optional[Tensor] = None,
) -> Tensor
fused_output_quant_supported
fused_output_quant_supported(quant_key: QuantKey)
Whether this attention implementation supports fused output quantization. This is used by the AttnFusionPass to fuse output quantization only onto implementations that support it.
:param quant_key: QuantKey object that describes the quantization op
:return: whether fusion is supported for this type of quantization
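A minimal sketch of an AttentionImpl subclass following the abstract signatures above; my_attention_kernel is a hypothetical stand-in for the backend-specific kernel, and a real implementation would also write the new keys and values into kv_cache using the slot mapping in attn_metadata.

from typing import List, Optional

import torch

from vllm.attention.backends.abstract import (AttentionImpl, AttentionLayer,
                                              AttentionType)


class MyAttentionImpl(AttentionImpl):
    def __init__(
        self,
        num_heads: int,
        head_size: int,
        scale: float,
        num_kv_heads: Optional[int] = None,
        alibi_slopes: Optional[List[float]] = None,
        sliding_window: Optional[int] = None,
        kv_cache_dtype: str = "auto",
        logits_soft_cap: Optional[float] = None,
        attn_type: str = AttentionType.DECODER,
        kv_sharing_target_layer_name: Optional[str] = None,
    ) -> None:
        self.num_heads = num_heads
        self.head_size = head_size
        self.scale = scale
        # GQA/MQA: fall back to MHA when no separate KV head count is given.
        self.num_kv_heads = num_kv_heads if num_kv_heads is not None else num_heads

    def forward(
        self,
        layer: AttentionLayer,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
        kv_cache: torch.Tensor,
        attn_metadata,
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
        output_block_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        # Dispatch to the backend-specific kernel (hypothetical helper); a real
        # implementation also appends key/value to kv_cache per the slot mapping.
        return my_attention_kernel(query, key, value, kv_cache, attn_metadata,
                                   scale=self.scale, out=output)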
AttentionLayer
Bases: Protocol
AttentionMetadata dataclass
Attention metadata for prefill and decode batched together.
decode_metadata abstractmethod property
decode_metadata: Optional[AttentionMetadata]
Return the attention metadata that's required to run decode attention.
multi_modal_placeholder_index_maps instance-attribute
prefill_metadata abstractmethod property
prefill_metadata: Optional[AttentionMetadata]
Return the attention metadata that's required to run prefill attention.
__init__
__init__(
num_prefills: int,
num_prefill_tokens: int,
num_decode_tokens: int,
slot_mapping: Tensor,
multi_modal_placeholder_index_maps: Optional[Dict[str, IndexMap]],
enable_kv_scales_calculation: bool,
) -> None
asdict_zerocopy
Similar to dataclasses.asdict, but avoids deepcopying.
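A minimal sketch of how a runner typically consumes the batched metadata, assuming attn_metadata is a concrete AttentionMetadata subclass built elsewhere for a batch that mixes prefill and decode requests.

from vllm.attention.backends.abstract import AttentionMetadata


def run_attention_for_batch(attn_metadata: AttentionMetadata) -> None:
    if attn_metadata.num_prefill_tokens > 0:
        # Metadata restricted to the prefill portion of the batch.
        prefill_meta = attn_metadata.prefill_metadata
        ...  # run prefill attention with prefill_meta
    if attn_metadata.num_decode_tokens > 0:
        # Metadata restricted to the decode portion of the batch.
        decode_meta = attn_metadata.decode_metadata
        ...  # run decode attention with decode_meta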
AttentionMetadataBuilder
Abstract class for attention metadata builders.
__init__ abstractmethod
__init__(
input_builder: ModelRunnerInputBuilderBase,
) -> None
Create the builder, remembering relevant configuration and parameters.
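A minimal sketch of a builder subclass, assuming only the __init__ shown here; concrete builders additionally expose backend-specific preparation and build steps that are not documented on this page.

from vllm.attention.backends.abstract import AttentionMetadataBuilder


class MyMetadataBuilder(AttentionMetadataBuilder):
    def __init__(self, input_builder) -> None:
        # Keep a handle to the model runner's input builder so that later
        # metadata construction can read sequence lengths, block tables, etc.
        self.input_builder = input_builder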
AttentionState
Holds attention backend-specific objects reused during the lifetime of the model runner.
__init__ abstractmethod
__init__(runner: ModelRunnerBase)
begin_forward abstractmethod
begin_forward(model_input: ModelRunnerInputBase) -> None
get_graph_input_buffers abstractmethod
get_graph_input_buffers(
attn_metadata: T,
is_encoder_decoder_model: bool = False,
) -> Dict[str, Any]
Get attention-specific input buffers for CUDA graph capture.
graph_capture_get_metadata_for_batch abstractmethod
graph_capture_get_metadata_for_batch(
batch_size: int, is_encoder_decoder_model: bool = False
) -> T
Get attention metadata for CUDA graph capture of batch_size.
graph_clone abstractmethod
graph_clone(batch_size: int) -> AttentionState[T]
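A minimal sketch of how these hooks fit into CUDA graph capture, assuming attn_state is a concrete AttentionState and capture_cuda_graph is a hypothetical helper that records a graph reading from the returned static buffers.

from vllm.attention.backends.abstract import AttentionState


def capture_for_batch_size(attn_state: AttentionState, batch_size: int):
    # Metadata shaped for a fixed, padded batch of `batch_size` sequences.
    attn_metadata = attn_state.graph_capture_get_metadata_for_batch(batch_size)
    # Static input buffers the captured graph will read from on every replay.
    input_buffers = attn_state.get_graph_input_buffers(attn_metadata)
    # Each captured batch size gets its own cloned state object.
    state_clone = attn_state.graph_clone(batch_size)
    return capture_cuda_graph(state_clone, attn_metadata, input_buffers)  # hypothetical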
AttentionType
Attention type. Uses strings to be compatible with torch.compile.
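A minimal illustration, assuming AttentionType exposes string-valued members such as DECODER (its members are not listed on this page); as the docstring notes, plain strings keep the value torch.compile-friendly.

from vllm.attention.backends.abstract import AttentionType

attn_type = AttentionType.DECODER  # the default `attn_type` in AttentionImpl.__init__
assert isinstance(attn_type, str)  # members are plain strings, not an Enum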
MLAAttentionImpl
Bases: AttentionImpl[T], Generic[T]
forward abstractmethod
forward(
layer: AttentionLayer,
hidden_states_or_cq: Tensor,
kv_c_normed: Tensor,
k_pe: Tensor,
kv_cache: Tensor,
attn_metadata: T,
output: Optional[Tensor] = None,
output_scale: Optional[Tensor] = None,
output_block_scale: Optional[Tensor] = None,
) -> Tensor
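A minimal skeleton of an MLAAttentionImpl subclass following the signature above, interpreting the parameter names in the usual MLA sense (a latent, compressed KV representation plus a decoupled rotary key part); the actual kernel is left as a placeholder.

from typing import Optional

import torch

from vllm.attention.backends.abstract import AttentionLayer, MLAAttentionImpl


class MyMLAImpl(MLAAttentionImpl):
    def forward(
        self,
        layer: AttentionLayer,
        hidden_states_or_cq: torch.Tensor,  # hidden states or compressed queries
        kv_c_normed: torch.Tensor,          # normalized latent (compressed) KV
        k_pe: torch.Tensor,                 # decoupled rotary key component
        kv_cache: torch.Tensor,
        attn_metadata,
        output: Optional[torch.Tensor] = None,
        output_scale: Optional[torch.Tensor] = None,
        output_block_scale: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        raise NotImplementedError  # backend-specific MLA kernel goes here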