vllm.config.cache
CacheDType module-attribute
¶
CacheDType = Literal[
"auto", "fp8", "fp8_e4m3", "fp8_e5m2", "fp8_inc"
]
PrefixCachingHashAlgo module-attribute
¶
PrefixCachingHashAlgo = Literal[
"builtin", "sha256", "sha256_cbor_64bit"
]
CacheConfig ¶
Configuration for the KV cache.
Source code in vllm/config/cache.py
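In normal use this config is constructed by the engine from the corresponding engine arguments rather than by hand, but the fields documented below can also be set directly on an instance. A minimal sketch, assuming the dataclass fields documented here are accepted as keyword arguments:
from vllm.config import CacheConfig

# Sketch only: the engine normally builds this from engine arguments such as
# block_size, gpu_memory_utilization, swap_space and kv_cache_dtype.
cache_config = CacheConfig(
    block_size=16,               # tokens per contiguous KV-cache block
    gpu_memory_utilization=0.9,  # fraction of GPU memory for the model executor
    swap_space=4,                # GiB of CPU swap space per GPU
    cache_dtype="auto",          # follow the model's data type
)
print(cache_config.block_size, cache_config.cache_dtype)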
block_size class-attribute
instance-attribute
¶
block_size: SkipValidation[BlockSize] = None
Size of a contiguous cache block in number of tokens. This is ignored on neuron devices and set to --max-model-len. On CUDA devices, only block sizes up to 32 are supported. On HPU devices, block size defaults to 128.
This config has no static default. If left unspecified by the user, it will be set in Platform.check_and_update_config() based on the current platform.
cache_dtype class-attribute
instance-attribute
¶
cache_dtype: CacheDType = 'auto'
Data type for kv cache storage. If "auto", will use model data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. ROCm (AMD GPU) supports fp8 (=fp8_e4m3). Intel Gaudi (HPU) supports fp8 (using fp8_inc).
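For example, an fp8 KV cache can be requested when creating an engine. A minimal sketch, assuming an fp8-capable GPU; the model name is only a placeholder, and the kv_cache_dtype engine argument is assumed to populate this field:
from vllm import LLM

# Sketch: store the KV cache in fp8 (fp8_e4m3 on CUDA/ROCm) instead of the model dtype.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8")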
calculate_kv_scales class-attribute
instance-attribute
¶
calculate_kv_scales: bool = False
This enables dynamic calculation of k_scale and v_scale when kv_cache_dtype is fp8. If False, the scales will be loaded from the model checkpoint if available. Otherwise, the scales will default to 1.0.
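A sketch of enabling dynamic scale calculation together with an fp8 KV cache; the model name is a placeholder, and calculate_kv_scales is assumed to be forwarded from the engine arguments to this config:
from vllm import LLM

# Sketch: compute k_scale/v_scale at runtime instead of reading them from the
# checkpoint (or falling back to 1.0).
llm = LLM(
    model="facebook/opt-125m",
    kv_cache_dtype="fp8",
    calculate_kv_scales=True,
)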
cpu_kvcache_space_bytes class-attribute
instance-attribute
¶
(CPU backend only) CPU key-value cache space, in bytes.
cpu_offload_gb class-attribute
instance-attribute
¶
cpu_offload_gb: float = 0
The space in GiB to offload to CPU, per GPU. Default is 0, which means no offloading. Intuitively, this argument can be seen as a virtual way to increase the GPU memory size. For example, if you have one 24 GB GPU and set this to 10, virtually you can think of it as a 34 GB GPU. Then you can load a 13B model with BF16 weights, which requires at least 26 GB GPU memory. Note that this requires a fast CPU-GPU interconnect, as part of the model is loaded from CPU memory to GPU memory on the fly in each model forward pass.
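Continuing the example above, a 13B BF16 model (roughly 26 GB of weights) can be served from a single 24 GB GPU by offloading part of the weights to CPU memory. A sketch; the model name is illustrative only:
from vllm import LLM

# Sketch: treat ~10 GiB of CPU RAM as a virtual extension of the 24 GB GPU.
# Offloaded weights are streamed to the GPU during every forward pass, so a
# fast CPU-GPU interconnect is required.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)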
enable_prefix_caching class-attribute
instance-attribute
¶
Whether to enable prefix caching. Disabled by default for V0. Enabled by default for V1.
gpu_memory_utilization class-attribute
instance-attribute
¶
gpu_memory_utilization: float = 0.9
The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9. This is a per-instance limit, and only applies to the current vLLM instance. It does not matter if you have another vLLM instance running on the same GPU. For example, if you have two vLLM instances running on the same GPU, you can set the GPU memory utilization to 0.5 for each instance.
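For instance, an instance that should share a GPU with another vLLM instance can cap itself at half of the device memory. A sketch with a placeholder model:
from vllm import LLM

# Sketch: this instance limits itself to 50% of GPU memory, leaving the other
# half for a second vLLM instance started separately on the same device.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.5)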
is_attention_free class-attribute
instance-attribute
¶
is_attention_free: bool = False
Whether the model is attention-free. This is primarily set in ModelConfig and that value should be manually duplicated here.
kv_sharing_fast_prefill class-attribute
instance-attribute
¶
kv_sharing_fast_prefill: bool = False
This feature is work in progress; currently, no prefill optimization takes place when this flag is enabled.
In some KV sharing setups, e.g. YOCO (https://arxiv.org/abs/2405.05254), some layers can skip tokens corresponding to prefill. This flag enables the attention metadata for eligible layers to be overridden with the metadata necessary for implementing this optimization in some models (e.g. Gemma3n).
mamba_cache_dtype class-attribute
instance-attribute
¶
mamba_cache_dtype: MambaDType = 'auto'
The data type to use for the Mamba cache (both the conv as well as the ssm state). If set to 'auto', the data type will be inferred from the model config.
mamba_page_size_padded class-attribute
instance-attribute
¶
Optional override for mamba page size; used by hybrid mamba/attention models to ensure exact alignment with attention page size.
mamba_ssm_cache_dtype class-attribute
instance-attribute
¶
mamba_ssm_cache_dtype: MambaDType = 'auto'
The data type to use for the Mamba cache (ssm state only, conv state will still be controlled by mamba_cache_dtype). If set to 'auto', the data type for the ssm state will be determined by mamba_cache_dtype.
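A sketch of pinning both Mamba-related dtypes on a CacheConfig instance: the conv state follows the model dtype while the ssm state is kept in float32 for extra precision. Whether these knobs are also exposed as engine arguments depends on the vLLM version:
from vllm.config import CacheConfig

# Sketch: "auto" lets the conv state follow the model dtype; the ssm state is
# stored separately in float32.
cache_config = CacheConfig(
    block_size=16,
    mamba_cache_dtype="auto",
    mamba_ssm_cache_dtype="float32",
)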
num_cpu_blocks class-attribute
instance-attribute
¶
The number of blocks to allocate for CPU memory.
num_gpu_blocks class-attribute
instance-attribute
¶
The number of blocks to allocate for GPU memory.
num_gpu_blocks_override class-attribute
instance-attribute
¶
Number of GPU blocks to use. This overrides the profiled num_gpu_blocks if specified. Does nothing if None. Used for testing preemption.
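For example, preemption handling can be exercised in tests by capping the KV cache far below the profiled capacity. A sketch with an arbitrary block count and a placeholder model, assuming num_gpu_blocks_override is exposed as an engine argument:
from vllm import LLM

# Sketch: force a tiny KV cache (16 blocks) so that long requests trigger
# preemption regardless of how many blocks profiling would actually allow.
llm = LLM(model="facebook/opt-125m", num_gpu_blocks_override=16)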
prefix_caching_hash_algo class-attribute
instance-attribute
¶
prefix_caching_hash_algo: PrefixCachingHashAlgo = 'builtin'
Set the hash algorithm for prefix caching:
- "builtin" is Python's built-in hash.
- "sha256" is collision resistant but with certain overheads. This option uses Pickle for object serialization before hashing.
- "sha256_cbor_64bit" provides a reproducible, cross-language compatible hash. It serializes objects using canonical CBOR and hashes them with SHA-256. The resulting hash consists of the lower 64 bits of the SHA-256 digest.
sliding_window class-attribute
instance-attribute
¶
Sliding window size for the KV cache. This is primarily set in ModelConfig and that value should be manually duplicated here.
swap_space class-attribute
instance-attribute
¶
swap_space: float = 4
Size of the CPU swap space per GPU (in GiB).
__post_init__ ¶
_verify_args ¶
_verify_args() -> Self
Source code in vllm/config/cache.py
_verify_cache_dtype ¶
Source code in vllm/config/cache.py
_verify_prefix_caching ¶
Source code in vllm/config/cache.py
compute_hash ¶
compute_hash() -> str
WARNING: Whenever a new field is added to this config, ensure that it is included in the factors list if it affects the computation graph.
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
Source code in vllm/config/cache.py
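The config classes generally follow the same pattern: collect the graph-affecting fields into a factors list and hash its string form. A rough sketch of that idea, not the exact implementation, with cache_dtype assumed to be one such factor:
import hashlib

# Rough sketch: only fields that can change the compiled computation graph
# (here, assumed to include cache_dtype) contribute to the hash.
def compute_hash_sketch(cache_dtype: str) -> str:
    factors: list[object] = [cache_dtype]
    return hashlib.sha256(str(factors).encode()).hexdigest()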
metrics_info ¶
verify_with_parallel_config ¶
verify_with_parallel_config(
parallel_config: ParallelConfig,
) -> None