vllm.v1.worker.utils
AttentionGroup dataclass ¶
__init__ ¶
__init__(
backend: type[AttentionBackend],
metadata_builder: AttentionMetadataBuilder,
layer_names: list[str],
) -> None
CpuGpuBuffer ¶
__init__ ¶
copy_to_cpu ¶
NOTE: Because this method is non-blocking, explicit synchronization is needed to ensure the data is copied to CPU.
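A minimal sketch of the synchronization pattern this note calls for, assuming a PyTorch CUDA setup; the pinned CPU buffer and tensor names are illustrative stand-ins, not the actual CpuGpuBuffer fields:

```python
import torch

# Illustrative stand-ins for the GPU tensor and pinned CPU buffer that a
# CpuGpuBuffer pairs together (names are hypothetical).
gpu_data = torch.arange(16, device="cuda")
cpu_data = torch.empty(16, dtype=gpu_data.dtype, pin_memory=True)

# Non-blocking copy: returns immediately, work is queued on the current stream.
cpu_data.copy_(gpu_data, non_blocking=True)

# Explicit synchronization is required before reading cpu_data.
torch.cuda.current_stream().synchronize()
assert torch.equal(cpu_data, gpu_data.cpu())
```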
MultiModalBudget ¶
Helper class to calculate budget information for multi-modal models.
max_items_per_batch_by_modality instance-attribute ¶
max_items_per_prompt_by_modality instance-attribute ¶
__init__ ¶
__init__(
model_config: ModelConfig,
scheduler_config: SchedulerConfig,
mm_registry: MultiModalRegistry,
) -> None
get_max_items ¶
bind_kv_cache ¶
bind_kv_cache(
kv_caches: dict[str, Tensor],
forward_context: dict[str, Attention],
runner_kv_caches: list[Tensor],
) -> None
Bind the allocated KV cache to both ModelRunner and forward context so that the KV cache can be used in the forward pass.
This function:
1. Fills the ModelRunner's KV cache list (runner_kv_caches) with kv_caches.
2. Associates each attention layer in the forward_context with its corresponding KV cache in kv_caches.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| kv_caches | dict[str, Tensor] | The allocated kv_caches with layer names as keys. | required |
| forward_context | dict[str, Attention] | The global forward context containing all Attention layers, with layer names as keys. | required |
| runner_kv_caches | list[Tensor] | The kv_cache list declared by ModelRunner. | required |
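A minimal sketch of the two steps described above, not vLLM's implementation; the stand-in Attention class and layer names are hypothetical:

```python
from dataclasses import dataclass

import torch


@dataclass
class FakeAttention:  # hypothetical stand-in for vLLM's Attention layer
    kv_cache: torch.Tensor | None = None


kv_caches = {  # hypothetical layer names
    "model.layers.0.attn": torch.zeros(2, 16, 64),
    "model.layers.1.attn": torch.zeros(2, 16, 64),
}
forward_context = {name: FakeAttention() for name in kv_caches}
runner_kv_caches: list[torch.Tensor] = []

# 1) Fill the ModelRunner's KV cache list with the allocated caches.
for name in sorted(kv_caches):
    runner_kv_caches.append(kv_caches[name])

# 2) Associate each attention layer with its corresponding KV cache.
for name, attn in forward_context.items():
    attn.kv_cache = kv_caches[name]
```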
gather_mm_placeholders ¶
Reconstructs the embeddings from the placeholder tokens.
This is the inverse operation of [scatter_mm_placeholders][].
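A hedged sketch of what "reconstructing from the placeholder tokens" means here, using plain boolean-mask indexing rather than vLLM's actual code:

```python
import torch

# Placeholder positions for a prompt; True marks slots holding mm embeddings.
is_embed = torch.tensor([True, False, True])
placeholders = torch.randn(3, 8)  # contiguous placeholder embeddings

# Gathering keeps only the rows that carry multimodal embeddings.
embeddings = placeholders[is_embed]  # shape (2, 8)
```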
initialize_kv_cache_for_kv_sharing ¶
initialize_kv_cache_for_kv_sharing(
shared_kv_cache_layers: dict[str, str],
kv_cache_groups: list[KVCacheGroupSpec],
kv_caches: dict[str, Tensor],
attn_groups: Optional[
list[list[AttentionGroup]]
] = None,
runner_only_attn_layers: Optional[set[str]] = None,
) -> None
Sets up KV cache sharing by reusing the allocated KV caches in kv_caches for layers that do not allocate their own KV cache, based on the mapping in shared_kv_cache_layers. Adds these layers to the corresponding KV cache group, which is needed to ensure that attention metadata is assigned later.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| shared_kv_cache_layers | dict[str, str] | Layer pairings for cross-layer KV sharing. If an Attention layer appears as a key in this dict, it reuses the KV cache of the target layer given by its value instead of allocating its own. | required |
| kv_cache_groups | list[KVCacheGroupSpec] | The KV cache groups of the model. | required |
| kv_caches | dict[str, Tensor] | The allocated kv_caches with layer names as keys. Note that layers in shared_kv_cache_layers.keys() are not originally included, as it only contains layers which have their own KV cache allocation. | required |
| attn_groups | Optional[list[list[AttentionGroup]]] | Optional list of attention groups. Layers in the same KV cache group may be placed in different attention groups if they have different attention backends. Currently only provided by the GPU model runner. | None |
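A minimal sketch of the cache-reuse step described above (not vLLM's code); the layer names are hypothetical:

```python
import torch

kv_caches = {"layers.0.attn": torch.zeros(2, 16, 64)}        # owns its cache
shared_kv_cache_layers = {"layers.1.attn": "layers.0.attn"}  # layer 1 reuses 0's

# Layers that do not allocate their own KV cache alias the target layer's tensor.
for layer_name, target_layer in shared_kv_cache_layers.items():
    kv_caches[layer_name] = kv_caches[target_layer]

assert kv_caches["layers.1.attn"] is kv_caches["layers.0.attn"]  # same storage
```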
sanity_check_mm_encoder_outputs ¶
sanity_check_mm_encoder_outputs(
mm_embeddings: MultiModalEmbeddings,
expected_num_items: int,
) -> None
Perform sanity checks for the result of vllm.model_executor.models.SupportsMultiModal.get_multimodal_embeddings.
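A hedged sketch of the kind of check this performs, assuming the embeddings are returned as a sequence with one entry per multimodal item (not vLLM's exact assertions):

```python
from collections.abc import Sequence

import torch


def sanity_check_sketch(
    mm_embeddings: Sequence[torch.Tensor], expected_num_items: int
) -> None:
    # One embedding tensor is expected per multimodal item in the batch.
    assert len(mm_embeddings) == expected_num_items, (
        f"expected {expected_num_items} multimodal items, "
        f"got {len(mm_embeddings)} embeddings"
    )
```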
scatter_mm_placeholders ¶
Scatter the multimodal embeddings into a contiguous tensor that represents the placeholder tokens.
See [vllm.multimodal.processing.PromptUpdateDetails.is_embed][].
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| embeds | Tensor | The multimodal embeddings. Shape: | required |
| is_embed | Optional[Tensor] | A boolean mask indicating which positions in the placeholder tokens need to be filled with multimodal embeddings. Shape: | required |
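A hedged sketch of the scatter step and its round trip with gather_mm_placeholders above, using boolean-mask assignment; shapes and names are illustrative only:

```python
import torch

embeds = torch.randn(2, 8)                    # multimodal embeddings
is_embed = torch.tensor([True, False, True])  # placeholder positions to fill

# Scatter: write the embeddings into a contiguous placeholder tensor.
placeholders = torch.zeros(is_embed.shape[0], embeds.shape[1])
placeholders[is_embed] = embeds

# gather_mm_placeholders is the inverse: it recovers the original embeddings.
assert torch.equal(placeholders[is_embed], embeds)
```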