vllm.v1.kv_cache_interface
AttentionSpec dataclass ¶
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
ChunkedLocalAttentionSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
__init__ ¶
__init__(
block_size: int,
num_kv_heads: int,
head_size: int,
dtype: dtype,
use_mla: bool,
attention_chunk_size: int,
) -> None
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
Source code in vllm/v1/kv_cache_interface.py
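A minimal construction sketch using the __init__ signature documented above; the shape values (block size, head count, head size, chunk size) and the torch.float16 dtype are illustrative assumptions, not values taken from the source.

```python
import torch

from vllm.v1.kv_cache_interface import ChunkedLocalAttentionSpec

spec = ChunkedLocalAttentionSpec(
    block_size=16,              # tokens per KV cache block (assumed value)
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    use_mla=False,
    attention_chunk_size=8192,  # local attention chunk length in tokens (assumed value)
)
```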
EncoderOnlyAttentionSpec dataclass ¶
FullAttentionSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
attention_chunk_size class-attribute instance-attribute ¶
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window attention layers are regarded as full attention by the KV cache manager (blocks are allocated for all tokens), while they are computed as sliding window attention in the model runner. In this case, FullAttentionSpec is used and the sliding window size is recorded. Defaults to None when sliding window attention is not used.
__init__ ¶
__init__(
block_size: int,
num_kv_heads: int,
head_size: int,
dtype: dtype,
use_mla: bool,
sliding_window: Optional[int] = None,
attention_chunk_size: Optional[int] = None,
) -> None
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
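As a concrete illustration of the case described above, here is a minimal sketch of a sliding window layer represented as a FullAttentionSpec when the hybrid allocator is disabled. All values and the torch.float16 dtype are illustrative assumptions; only the constructor signature comes from this page.

```python
import torch

from vllm.v1.kv_cache_interface import FullAttentionSpec

full_spec = FullAttentionSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    use_mla=False,
    sliding_window=4096,  # window size is recorded here, but the KV cache
                          # manager still allocates blocks for all tokens
)
```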
merge classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
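A hedged sketch of merging two FullAttentionSpec objects. Only the description above is documented here, so the call form (a single list argument) is an assumption, as are the field values.

```python
import torch

from vllm.v1.kv_cache_interface import FullAttentionSpec

common = dict(block_size=16, num_kv_heads=8, head_size=128,
              dtype=torch.float16, use_mla=False)
spec_a = FullAttentionSpec(**common)                       # plain full attention layer
spec_b = FullAttentionSpec(**common, sliding_window=4096)  # SWA layer recorded as full attention

# Assumed call form: a single list of specs, per the docstring above.
merged = FullAttentionSpec.merge([spec_a, spec_b])
```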
merge_window_sizes classmethod ¶
Source code in vllm/v1/kv_cache_interface.py
KVCacheConfig dataclass ¶
The KV cache configuration of a model.
Source code in vllm/v1/kv_cache_interface.py
kv_cache_groups instance-attribute ¶
kv_cache_groups: list[KVCacheGroupSpec]
The KV cache groups of the model. For models with only one type of attention, there is only one group that contains all layers. For models with multiple types of attention, there will be multiple groups; see _get_kv_cache_config_uniform_page_size for more details.
kv_cache_tensors instance-attribute ¶
kv_cache_tensors: list[KVCacheTensor]
How the model runner should initialize the KV cache tensors for each layer.
num_blocks instance-attribute ¶
num_blocks: int
The number of KV cache blocks.
__init__ ¶
__init__(
num_blocks: int,
kv_cache_tensors: list[KVCacheTensor],
kv_cache_groups: list[KVCacheGroupSpec],
) -> None
KVCacheGroupSpec dataclass ¶
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
Source code in vllm/v1/kv_cache_interface.py
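A hedged end-to-end sketch of assembling a KVCacheConfig from the pieces above. Only the KVCacheConfig constructor signature is documented on this page; the KVCacheGroupSpec(layer_names=..., kv_cache_spec=...) and KVCacheTensor(size=..., shared_by=...) field names, the layer names, the block count, and the per-block byte arithmetic are all assumptions for illustration.

```python
import torch

from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig,
                                         KVCacheGroupSpec, KVCacheTensor)

spec = FullAttentionSpec(block_size=16, num_kv_heads=8, head_size=128,
                         dtype=torch.float16, use_mla=False)

# A single group: every layer shares the same block table and the same spec.
# The field names below are assumed, not documented on this page.
group = KVCacheGroupSpec(
    layer_names=["model.layers.0.self_attn.attn",
                 "model.layers.1.self_attn.attn"],
    kv_cache_spec=spec,
)

num_blocks = 1024  # assumed; normally derived from available GPU memory
# Assumed per-block layout: key + value (factor of 2), fp16 (2 bytes per element).
block_bytes = 2 * spec.block_size * spec.num_kv_heads * spec.head_size * 2
tensors = [
    KVCacheTensor(size=num_blocks * block_bytes, shared_by=[name])
    for name in group.layer_names
]

config = KVCacheConfig(
    num_blocks=num_blocks,
    kv_cache_tensors=tensors,
    kv_cache_groups=[group],
)
```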
KVCacheSpec dataclass ¶
A base class for specifying the KV cache format of one layer.
Source code in vllm/v1/kv_cache_interface.py
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
The maximum possible memory usage of this KV cache in bytes.
Returns:
| Type | Description |
| --- | --- |
| int | The KV cache size in bytes |
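For a concrete sense of the quantity this method reports, here is a back-of-the-envelope sketch for a full-attention layer. The one-key-plus-one-value layout (the factor of 2) and all numbers are assumptions for illustration, not the method's actual implementation.

```python
import torch


def approx_max_kv_bytes(max_model_len: int, block_size: int,
                        num_kv_heads: int, head_size: int,
                        dtype: torch.dtype) -> int:
    # Assumed layout: one key and one value vector per token.
    bytes_per_elem = torch.empty((), dtype=dtype).element_size()
    bytes_per_token = 2 * num_kv_heads * head_size * bytes_per_elem
    # Blocks must cover every token of the longest possible sequence.
    num_blocks = -(-max_model_len // block_size)  # ceiling division
    return num_blocks * block_size * bytes_per_token


# 32k context, 8 KV heads of size 128, fp16 -> 134217728 bytes (128 MiB) per layer
print(approx_max_kv_bytes(32768, 16, 8, 128, torch.float16))
```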
merge classmethod ¶
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
Source code in vllm/v1/kv_cache_interface.py
KVCacheTensor dataclass ¶
A class for specifying how the workers should initialize the KV cache.
Source code in vllm/v1/kv_cache_interface.py
MambaSpec dataclass ¶
Bases: KVCacheSpec
Source code in vllm/v1/kv_cache_interface.py
SlidingWindowSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
__init__ ¶
__init__(
block_size: int,
num_kv_heads: int,
head_size: int,
dtype: dtype,
use_mla: bool,
sliding_window: int,
) -> None
__post_init__ ¶
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
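A minimal construction sketch using the __init__ signature above; the shape values and the torch.float16 dtype are illustrative assumptions, and torch.float16 stands in for the model's KV cache dtype.

```python
import torch

from vllm.v1.kv_cache_interface import SlidingWindowSpec

swa_spec = SlidingWindowSpec(
    block_size=16,
    num_kv_heads=8,
    head_size=128,
    dtype=torch.float16,
    use_mla=False,
    sliding_window=4096,  # only roughly this many trailing tokens need live KV blocks
)
```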