vllm.forward_context
batchsize_logging_interval module-attribute
batchsize_logging_interval: float = VLLM_LOG_BATCHSIZE_INTERVAL
BatchDescriptor
Bases: NamedTuple
Batch descriptor for cudagraph dispatching. We should keep the number of fields as small as possible while still properly and uniquely describing the padded batch for cudagraph.
Source code in vllm/forward_context.py
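For illustration, a minimal sketch of such a descriptor, assuming two fields (a padded token count and a uniform-decode flag); the names are illustrative and may differ from the actual definition:

```python
from typing import NamedTuple

class BatchDescriptorSketch(NamedTuple):
    # Padded number of tokens in the batch (assumed field).
    num_tokens: int
    # Whether the batch is a uniform decode batch (assumed field).
    uniform_decode: bool = False

    @property
    def non_uniform(self) -> "BatchDescriptorSketch":
        # Drop the uniformity flag so the descriptor matches a cudagraph
        # captured for general (non-uniform) batches of the same size.
        return BatchDescriptorSketch(self.num_tokens, uniform_decode=False)
```

Because NamedTuple instances hash and compare by value, two forward passes that pad to the same descriptor can dispatch to the same captured cudagraph, which is why keeping the number of fields minimal matters.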
non_uniform property
non_uniform: BatchDescriptor
Return a non-uniform version of the current batch descriptor.
DPMetadata dataclass
Source code in vllm/forward_context.py
__init__
__init__(
max_tokens_across_dp_cpu: Tensor,
cu_tokens_across_dp_cpu: Tensor,
local_sizes: Optional[list[int]] = None,
) -> None
chunked_sizes
Context manager to compute and temporarily set the per-rank local token sizes for a specific chunk during chunked forward execution.
This is necessary to ensure each DP (data parallel) rank processes its designated portion of tokens in lockstep with others, even when the token counts are uneven or some ranks have completed their input early.
For chunked execution, we break up the total tokens on each rank into multiple chunks (of at most max_chunk_size_per_rank), and for a given chunk_idx, this context manager sets self.local_sizes to the number of tokens to process in that chunk on each rank.
It uses the cumulative sizes (cu_tokens_across_dp_cpu) to derive the number of tokens per rank, and calls _compute_chunked_local_num_tokens to determine the chunk-wise split.
self.local_sizes is only valid inside the context.
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_chunk_size_per_rank | int | The max number of tokens each rank is allowed to process in this chunk. | required |
| chunk_idx | int | The index of the chunk to compute sizes for. | required |
Source code in vllm/forward_context.py
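For illustration, a hedged usage sketch built directly from per-rank token counts; the numbers, the chunk budget, and the two-chunk loop are assumptions rather than values taken from the module:

```python
import torch
from vllm.forward_context import DPMetadata

# Suppose 4 DP ranks hold 5, 3, 8 and 2 tokens respectively.
tokens_per_rank = torch.tensor([5, 3, 8, 2])
dp_metadata = DPMetadata(
    max_tokens_across_dp_cpu=tokens_per_rank.max(),
    cu_tokens_across_dp_cpu=torch.cumsum(tokens_per_rank, dim=0),
)

max_chunk_size_per_rank = 4  # assumed per-chunk budget
# The largest rank holds 8 tokens, so two chunks of at most 4 are needed.
for chunk_idx in range(2):
    with dp_metadata.chunked_sizes(max_chunk_size_per_rank, chunk_idx):
        # Inside the context, dp_metadata.local_sizes holds this chunk's
        # per-rank token counts; ranks that have already finished may still
        # be assigned a minimal size so all ranks stay in lockstep.
        ...
```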
get_chunk_sizes_across_dp_rank
make staticmethod
make(
parallel_config: ParallelConfig,
attn_metadata: Any,
num_tokens: int,
num_tokens_across_dp: Optional[Tensor] = None,
) -> DPMetadata
Source code in vllm/forward_context.py
num_tokens_across_dp staticmethod
Gather the num_tokens across all DP ranks and return results in a CPU tensor of size dp_size.
Source code in vllm/forward_context.py
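This kind of gather can be expressed as a zero-filled CPU tensor plus an all-reduce; a sketch of that pattern, assuming a CPU-capable (e.g. gloo) process group rather than the module's own group plumbing:

```python
import torch
import torch.distributed as dist

def num_tokens_across_dp_sketch(num_tokens: int, dp_size: int, dp_rank: int,
                                group: dist.ProcessGroup) -> torch.Tensor:
    # Each rank writes its own count into its slot; summing across ranks
    # then yields the full per-rank vector on every rank.
    counts = torch.zeros(dp_size, dtype=torch.int32, device="cpu")
    counts[dp_rank] = num_tokens
    dist.all_reduce(counts, group=group)
    return counts
```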
ForwardContext dataclass
Source code in vllm/forward_context.py
attn_metadata instance-attribute
attn_metadata: Union[
AttentionMetadata, dict[str, AttentionMetadata]
]
batch_descriptor class-attribute instance-attribute
batch_descriptor: Optional[BatchDescriptor] = None
cudagraph_runtime_mode class-attribute instance-attribute
cudagraph_runtime_mode: CUDAGraphMode = NONE
no_compile_layers instance-attribute
Type AttentionMetadata for v0; type Dict[str, AttentionMetadata] for v1, a map from the layer_name of each attention layer to its attention metadata, set dynamically for each forward pass.
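For illustration, a sketch of how an attention layer might read its own entry from that mapping during a forward pass; the layer_name key below is a hypothetical stand-in:

```python
from vllm.forward_context import get_forward_context

layer_name = "model.layers.0.self_attn.attn"  # hypothetical key for this layer

# Fetch this layer's metadata for the current pass. The dict-vs-single-object
# branch mirrors the Union type of attn_metadata above.
forward_context = get_forward_context()
attn_metadata = forward_context.attn_metadata
if isinstance(attn_metadata, dict):
    attn_metadata = attn_metadata[layer_name]  # v1: per-layer mapping
```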
__init__
__init__(
no_compile_layers: dict[str, Any],
attn_metadata: Union[
AttentionMetadata, dict[str, AttentionMetadata]
],
virtual_engine: int,
dp_metadata: Optional[DPMetadata] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
) -> None
_compute_chunked_local_num_tokens
_compute_chunked_local_num_tokens(
num_tokens_across_dp_cpu: list[int],
max_num_tokens: int,
chunk_idx: int,
) -> list[int]
Source code in vllm/forward_context.py
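A minimal sketch of what this helper plausibly computes: the chunk_idx-th slice of each rank's tokens, capped at max_num_tokens, where ranks that have run out of tokens keep a small non-zero size so all ranks stay in lockstep; the exact fallback value is an assumption:

```python
def compute_chunked_local_num_tokens_sketch(
    num_tokens_across_dp_cpu: list[int],
    max_num_tokens: int,
    chunk_idx: int,
) -> list[int]:
    local_sizes = []
    for rank_tokens in num_tokens_across_dp_cpu:
        # Tokens this rank still has left after the previous chunks.
        remaining = rank_tokens - chunk_idx * max_num_tokens
        # Process at most max_num_tokens; a finished rank still gets a
        # minimal size (assumed 1 here) so every rank executes this chunk.
        local_sizes.append(max(min(remaining, max_num_tokens), 1))
    return local_sizes

# Example: [5, 3, 8, 2] tokens per rank with chunks of at most 4 gives
# chunk 0 -> [4, 3, 4, 2] and chunk 1 -> [1, 1, 4, 1] under this sketch.
```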
get_forward_context
get_forward_context() -> ForwardContext
Get the current forward context.
Source code in vllm/forward_context.py
set_forward_context
set_forward_context(
attn_metadata: Any,
vllm_config: VllmConfig,
virtual_engine: int = 0,
num_tokens: Optional[int] = None,
num_tokens_across_dp: Optional[Tensor] = None,
cudagraph_runtime_mode: CUDAGraphMode = NONE,
batch_descriptor: Optional[BatchDescriptor] = None,
)
A context manager that stores the current forward context, which can include attention metadata and related per-pass state. Here we can inject common logic for every model forward pass.
Source code in vllm/forward_context.py
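A hedged end-to-end sketch of the intended usage; the model-runner surroundings (config, metadata, token count, model call) are assumptions, not taken from this module:

```python
from vllm.forward_context import set_forward_context

# Sketch: a model runner wraps each forward pass in set_forward_context,
# and layers read the same state back via get_forward_context().
with set_forward_context(
    attn_metadata,        # per-layer metadata prepared for this pass (assumed)
    vllm_config,          # assumed to be available on the runner
    num_tokens=num_input_tokens,
    cudagraph_runtime_mode=cudagraph_runtime_mode,
    batch_descriptor=batch_descriptor,
):
    # Layers may call get_forward_context() internally to fetch
    # attn_metadata, dp_metadata, batch_descriptor, etc.
    hidden_states = model(input_ids, positions)
```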