vllm.multimodal

Modules:

Name	Description
`audio`
`base`
`cache`
`hasher`
`image`
`inputs`
`parse`
`processing`
`profiling`
`registry`
`utils`
`video`

BatchedTensorInputs `module-attribute` ¶

BatchedTensorInputs: TypeAlias = Mapping[str, NestedTensors]

A dictionary containing nested tensors which have been batched via MultiModalKwargs.batch.

MULTIMODAL_REGISTRY `module-attribute` ¶

MULTIMODAL_REGISTRY = MultiModalRegistry()

The global MultiModalRegistry is used by model runners to dispatch data processing according to the target model.

Info

mm_processing

ModalityData `module-attribute` ¶

ModalityData: TypeAlias = Union[_T, list[_T]]

Either a single data item, or a list of data items.

The number of data items allowed per modality is restricted by --limit-mm-per-prompt.

MultiModalDataDict `module-attribute` ¶

MultiModalDataDict: TypeAlias = Mapping[
    str, ModalityData[Any]
]

A dictionary containing an entry for each modality type to input.

The built-in modalities are defined by MultiModalDataBuiltins.

MultiModalHashDict `module-attribute` ¶

MultiModalHashDict = Mapping[str, list[str]]

A dictionary containing hashes for items in each modality.

MultiModalPlaceholderDict `module-attribute` ¶

MultiModalPlaceholderDict: TypeAlias = Mapping[
    str, Sequence[PlaceholderRange]
]

A dictionary containing placeholder ranges for each modality.

NestedTensors `module-attribute` ¶

NestedTensors: TypeAlias = Union[
    list["NestedTensors"],
    list["torch.Tensor"],
    "torch.Tensor",
    tuple["torch.Tensor", ...],
]

Uses a list instead of a tensor if the dimensions of each element do not match.

all `module-attribute` ¶

__all__ = [
    "BatchedTensorInputs",
    "ModalityData",
    "MultiModalDataBuiltins",
    "MultiModalDataDict",
    "MultiModalHashDict",
    "MultiModalHasher",
    "MultiModalKwargs",
    "MultiModalKwargsItems",
    "MultiModalPlaceholderDict",
    "MultiModalPlaceholderMap",
    "NestedTensors",
    "MULTIMODAL_REGISTRY",
    "MultiModalRegistry",
]

MultiModalDataBuiltins ¶

Bases: TypedDict

Type annotations for modality types predefined by vLLM.

Source code in vllm/multimodal/inputs.py

@final
class MultiModalDataBuiltins(TypedDict, total=False):
    """Type annotations for modality types predefined by vLLM."""

    image: ModalityData[ImageItem]
    """The input image(s)."""

    video: ModalityData[VideoItem]
    """The input video(s)."""

    audio: ModalityData[AudioItem]
    """The input audio(s)."""

audio `instance-attribute` ¶

audio: ModalityData[AudioItem]

The input audio(s).

image `instance-attribute` ¶

image: ModalityData[ImageItem]

The input image(s).

video `instance-attribute` ¶

video: ModalityData[VideoItem]

The input video(s).

MultiModalHasher ¶

Source code in vllm/multimodal/hasher.py

class MultiModalHasher:

    @classmethod
    def serialize_item(cls, obj: object) -> Union[bytes, memoryview]:
        # Simple cases
        if isinstance(obj, str):
            return obj.encode("utf-8")
        if isinstance(obj, (bytes, memoryview)):
            return obj
        if isinstance(obj, (int, float)):
            return np.array(obj).tobytes()

        if isinstance(obj, Image.Image):
            exif = obj.getexif()
            if Image.ExifTags.Base.ImageID in exif and isinstance(
                    exif[Image.ExifTags.Base.ImageID], uuid.UUID):
                # If the image has exif ImageID tag, use that
                return exif[Image.ExifTags.Base.ImageID].bytes
            return cls.item_to_bytes(
                "image", np.asarray(convert_image_mode(obj, "RGBA")))
        if isinstance(obj, torch.Tensor):
            tensor_obj: torch.Tensor = obj.cpu()
            tensor_dtype = tensor_obj.dtype
            if tensor_dtype == torch.bfloat16:
                tensor_obj = tensor_obj.contiguous()
                tensor_obj = tensor_obj.view(
                    (tensor_obj.numel(), )).view(torch.uint8)
                return cls.item_to_bytes(
                    "tensor", {
                        "original_dtype": str(tensor_dtype),
                        "original_shape": tuple(tensor_obj.shape),
                        "data": tensor_obj.numpy()
                    })
            return cls.item_to_bytes("tensor", tensor_obj.numpy())
        if isinstance(obj, np.ndarray):
            # If the array is non-contiguous, we need to copy it first
            arr_data = obj.data if obj.flags.c_contiguous else obj.tobytes()
            return cls.item_to_bytes("ndarray", {
                "dtype": obj.dtype.str,
                "shape": obj.shape,
                "data": arr_data,
            })

        logger.warning(
            "No serialization method found for %s. "
            "Falling back to pickle.", type(obj))

        return pickle.dumps(obj)

    @classmethod
    def item_to_bytes(
        cls,
        key: str,
        obj: object,
    ) -> bytes:
        return b''.join(kb + vb for kb, vb in cls.iter_item_to_bytes(key, obj))

    @classmethod
    def iter_item_to_bytes(
        cls,
        key: str,
        obj: object,
    ) -> Iterable[tuple[bytes, Union[bytes, memoryview]]]:
        # Recursive cases
        if isinstance(obj, (list, tuple)):
            for i, elem in enumerate(obj):
                yield from cls.iter_item_to_bytes(f"{key}.{i}", elem)
        elif isinstance(obj, dict):
            for k, v in obj.items():
                yield from cls.iter_item_to_bytes(f"{key}.{k}", v)
        else:
            key_bytes = key.encode("utf-8")
            value_bytes = cls.serialize_item(obj)
            yield key_bytes, value_bytes

    @classmethod
    def hash_kwargs(cls, **kwargs: object) -> str:
        hasher = blake3()

        for k, v in kwargs.items():
            for k_bytes, v_bytes in cls.iter_item_to_bytes(k, v):
                hasher.update(k_bytes)
                hasher.update(v_bytes)

        return hasher.hexdigest()

hash_kwargs `classmethod` ¶

hash_kwargs(**kwargs: object) -> str

Source code in vllm/multimodal/hasher.py

@classmethod
def hash_kwargs(cls, **kwargs: object) -> str:
    hasher = blake3()

    for k, v in kwargs.items():
        for k_bytes, v_bytes in cls.iter_item_to_bytes(k, v):
            hasher.update(k_bytes)
            hasher.update(v_bytes)

    return hasher.hexdigest()

item_to_bytes `classmethod` ¶

item_to_bytes(key: str, obj: object) -> bytes

Source code in vllm/multimodal/hasher.py

@classmethod
def item_to_bytes(
    cls,
    key: str,
    obj: object,
) -> bytes:
    return b''.join(kb + vb for kb, vb in cls.iter_item_to_bytes(key, obj))

iter_item_to_bytes `classmethod` ¶

iter_item_to_bytes(
    key: str, obj: object
) -> Iterable[tuple[bytes, Union[bytes, memoryview]]]

Source code in vllm/multimodal/hasher.py

@classmethod
def iter_item_to_bytes(
    cls,
    key: str,
    obj: object,
) -> Iterable[tuple[bytes, Union[bytes, memoryview]]]:
    # Recursive cases
    if isinstance(obj, (list, tuple)):
        for i, elem in enumerate(obj):
            yield from cls.iter_item_to_bytes(f"{key}.{i}", elem)
    elif isinstance(obj, dict):
        for k, v in obj.items():
            yield from cls.iter_item_to_bytes(f"{key}.{k}", v)
    else:
        key_bytes = key.encode("utf-8")
        value_bytes = cls.serialize_item(obj)
        yield key_bytes, value_bytes

serialize_item `classmethod` ¶

serialize_item(obj: object) -> Union[bytes, memoryview]

Source code in vllm/multimodal/hasher.py

@classmethod
def serialize_item(cls, obj: object) -> Union[bytes, memoryview]:
    # Simple cases
    if isinstance(obj, str):
        return obj.encode("utf-8")
    if isinstance(obj, (bytes, memoryview)):
        return obj
    if isinstance(obj, (int, float)):
        return np.array(obj).tobytes()

    if isinstance(obj, Image.Image):
        exif = obj.getexif()
        if Image.ExifTags.Base.ImageID in exif and isinstance(
                exif[Image.ExifTags.Base.ImageID], uuid.UUID):
            # If the image has exif ImageID tag, use that
            return exif[Image.ExifTags.Base.ImageID].bytes
        return cls.item_to_bytes(
            "image", np.asarray(convert_image_mode(obj, "RGBA")))
    if isinstance(obj, torch.Tensor):
        tensor_obj: torch.Tensor = obj.cpu()
        tensor_dtype = tensor_obj.dtype
        if tensor_dtype == torch.bfloat16:
            tensor_obj = tensor_obj.contiguous()
            tensor_obj = tensor_obj.view(
                (tensor_obj.numel(), )).view(torch.uint8)
            return cls.item_to_bytes(
                "tensor", {
                    "original_dtype": str(tensor_dtype),
                    "original_shape": tuple(tensor_obj.shape),
                    "data": tensor_obj.numpy()
                })
        return cls.item_to_bytes("tensor", tensor_obj.numpy())
    if isinstance(obj, np.ndarray):
        # If the array is non-contiguous, we need to copy it first
        arr_data = obj.data if obj.flags.c_contiguous else obj.tobytes()
        return cls.item_to_bytes("ndarray", {
            "dtype": obj.dtype.str,
            "shape": obj.shape,
            "data": arr_data,
        })

    logger.warning(
        "No serialization method found for %s. "
        "Falling back to pickle.", type(obj))

    return pickle.dumps(obj)

MultiModalKwargs ¶

Bases: UserDict[str, NestedTensors]

A dictionary that represents the keyword arguments to torch.nn.Module.forward.

Source code in vllm/multimodal/inputs.py

class MultiModalKwargs(UserDict[str, NestedTensors]):
    """
    A dictionary that represents the keyword arguments to
    [`torch.nn.Module.forward`][].
    """

    @staticmethod
    @deprecated("`MultiModalKwargs.from_hf_inputs` is deprecated and "
                "will be removed in v0.13. "
                "Please use `MultiModalKwargsItems.from_hf_inputs` and "
                "access the tensor data using `.get_data()`.")
    def from_hf_inputs(
        hf_inputs: "BatchFeature",
        config_by_key: Mapping[str, MultiModalFieldConfig],
    ):
        return MultiModalKwargsItems.from_hf_inputs(hf_inputs, config_by_key) \
            .get_data()

    @staticmethod
    @deprecated("`MultiModalKwargs.from_items` is deprecated and "
                "will be removed in v0.13. "
                "Please use `MultiModalKwargsItems.from_seq` and "
                "access the tensor data using `.get_data()`.")
    def from_items(
        items: Sequence[MultiModalKwargsItem],
        *,
        pin_memory: bool = False,
    ):
        return MultiModalKwargsItems.from_seq(items) \
            .get_data(pin_memory=pin_memory)

    @staticmethod
    def _try_stack(nested_tensors: NestedTensors,
                   pin_memory: bool = False) -> NestedTensors:
        """
        Stack the inner dimensions that have the same shape in
        a nested list of tensors.

        Thus, a dimension represented by a list means that the inner
        dimensions are different for each element along that dimension.
        """
        if isinstance(nested_tensors, torch.Tensor):
            return nested_tensors

        # TODO: Remove these once all models have been migrated
        if isinstance(nested_tensors, np.ndarray):
            return torch.from_numpy(nested_tensors)
        if isinstance(nested_tensors, (int, float)):
            return torch.tensor(nested_tensors)

        stacked = [
            MultiModalKwargs._try_stack(t, pin_memory) for t in nested_tensors
        ]
        if not is_list_of(stacked, torch.Tensor, check="all"):
            # Only tensors (not lists) can be stacked.
            return stacked

        tensors_ = cast(list[torch.Tensor], stacked)
        if len(tensors_) == 1:
            # An optimization when `tensors_` contains only one tensor:
            # - produce exactly same result as `torch.stack(tensors_)`
            # - will achieve zero-copy if the tensor is contiguous
            return tensors_[0].unsqueeze(0).contiguous()

        if any(t.shape != tensors_[0].shape for t in tensors_):
            # The tensors have incompatible shapes and can't be stacked.
            return tensors_

        outputs = torch.empty(len(tensors_),
                              *tensors_[0].shape,
                              dtype=tensors_[0].dtype,
                              device=tensors_[0].device,
                              pin_memory=pin_memory)
        return torch.stack(tensors_, out=outputs)

    @staticmethod
    def batch(inputs_list: list["MultiModalKwargs"],
              pin_memory: bool = False) -> BatchedTensorInputs:
        """
        Batch multiple inputs together into a dictionary.

        The resulting dictionary has the same keys as the inputs.
        If the corresponding value from each input is a tensor and they all
        share the same shape, the output value is a single batched tensor;
        otherwise, the output value is a list containing the original value
        from each input.
        """
        if len(inputs_list) == 0:
            return {}

        # We need to consider the case where each item in the batch
        # contains different modalities (i.e. different keys).
        item_lists = defaultdict[str, list[NestedTensors]](list)

        for inputs in inputs_list:
            for k, v in inputs.items():
                item_lists[k].append(v)

        return {
            k: MultiModalKwargs._try_stack(item_list, pin_memory)
            for k, item_list in item_lists.items()
        }

    @staticmethod
    def as_kwargs(
        batched_inputs: BatchedTensorInputs,
        *,
        device: torch.types.Device,
    ) -> BatchedTensorInputs:
        json_inputs = cast(JSONTree[torch.Tensor], batched_inputs)

        json_mapped = json_map_leaves(
            lambda x: x.to(device=device, non_blocking=True),
            json_inputs,
        )

        return cast(BatchedTensorInputs, json_mapped)

    def __getitem__(self, key: str):
        if key not in self:
            raise KeyError(f"Keyword argument {key!r} not found. "
                           f"Available keys: {set(self.keys())}")

        return super().__getitem__(key)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, self.__class__):
            return False

        for k in self:
            if k not in other:
                return False
            if not nested_tensors_equal(self[k], other[k]):
                return False

        return True

eq ¶

__eq__(other: object) -> bool

Source code in vllm/multimodal/inputs.py

def __eq__(self, other: object) -> bool:
    if not isinstance(other, self.__class__):
        return False

    for k in self:
        if k not in other:
            return False
        if not nested_tensors_equal(self[k], other[k]):
            return False

    return True

getitem ¶

__getitem__(key: str)

Source code in vllm/multimodal/inputs.py

def __getitem__(self, key: str):
    if key not in self:
        raise KeyError(f"Keyword argument {key!r} not found. "
                       f"Available keys: {set(self.keys())}")

    return super().__getitem__(key)

_try_stack `staticmethod` ¶

_try_stack(
    nested_tensors: NestedTensors, pin_memory: bool = False
) -> NestedTensors

Stack the inner dimensions that have the same shape in a nested list of tensors.

Thus, a dimension represented by a list means that the inner dimensions are different for each element along that dimension.

Source code in vllm/multimodal/inputs.py

@staticmethod
def _try_stack(nested_tensors: NestedTensors,
               pin_memory: bool = False) -> NestedTensors:
    """
    Stack the inner dimensions that have the same shape in
    a nested list of tensors.

    Thus, a dimension represented by a list means that the inner
    dimensions are different for each element along that dimension.
    """
    if isinstance(nested_tensors, torch.Tensor):
        return nested_tensors

    # TODO: Remove these once all models have been migrated
    if isinstance(nested_tensors, np.ndarray):
        return torch.from_numpy(nested_tensors)
    if isinstance(nested_tensors, (int, float)):
        return torch.tensor(nested_tensors)

    stacked = [
        MultiModalKwargs._try_stack(t, pin_memory) for t in nested_tensors
    ]
    if not is_list_of(stacked, torch.Tensor, check="all"):
        # Only tensors (not lists) can be stacked.
        return stacked

    tensors_ = cast(list[torch.Tensor], stacked)
    if len(tensors_) == 1:
        # An optimization when `tensors_` contains only one tensor:
        # - produce exactly same result as `torch.stack(tensors_)`
        # - will achieve zero-copy if the tensor is contiguous
        return tensors_[0].unsqueeze(0).contiguous()

    if any(t.shape != tensors_[0].shape for t in tensors_):
        # The tensors have incompatible shapes and can't be stacked.
        return tensors_

    outputs = torch.empty(len(tensors_),
                          *tensors_[0].shape,
                          dtype=tensors_[0].dtype,
                          device=tensors_[0].device,
                          pin_memory=pin_memory)
    return torch.stack(tensors_, out=outputs)

as_kwargs `staticmethod` ¶

as_kwargs(
    batched_inputs: BatchedTensorInputs, *, device: Device
) -> BatchedTensorInputs

Source code in vllm/multimodal/inputs.py

@staticmethod
def as_kwargs(
    batched_inputs: BatchedTensorInputs,
    *,
    device: torch.types.Device,
) -> BatchedTensorInputs:
    json_inputs = cast(JSONTree[torch.Tensor], batched_inputs)

    json_mapped = json_map_leaves(
        lambda x: x.to(device=device, non_blocking=True),
        json_inputs,
    )

    return cast(BatchedTensorInputs, json_mapped)

batch `staticmethod` ¶

batch(
    inputs_list: list[MultiModalKwargs],
    pin_memory: bool = False,
) -> BatchedTensorInputs

Batch multiple inputs together into a dictionary.

The resulting dictionary has the same keys as the inputs. If the corresponding value from each input is a tensor and they all share the same shape, the output value is a single batched tensor; otherwise, the output value is a list containing the original value from each input.

Source code in vllm/multimodal/inputs.py

@staticmethod
def batch(inputs_list: list["MultiModalKwargs"],
          pin_memory: bool = False) -> BatchedTensorInputs:
    """
    Batch multiple inputs together into a dictionary.

    The resulting dictionary has the same keys as the inputs.
    If the corresponding value from each input is a tensor and they all
    share the same shape, the output value is a single batched tensor;
    otherwise, the output value is a list containing the original value
    from each input.
    """
    if len(inputs_list) == 0:
        return {}

    # We need to consider the case where each item in the batch
    # contains different modalities (i.e. different keys).
    item_lists = defaultdict[str, list[NestedTensors]](list)

    for inputs in inputs_list:
        for k, v in inputs.items():
            item_lists[k].append(v)

    return {
        k: MultiModalKwargs._try_stack(item_list, pin_memory)
        for k, item_list in item_lists.items()
    }

from_hf_inputs `staticmethod` ¶

from_hf_inputs(
    hf_inputs: BatchFeature,
    config_by_key: Mapping[str, MultiModalFieldConfig],
)

Source code in vllm/multimodal/inputs.py

@staticmethod
@deprecated("`MultiModalKwargs.from_hf_inputs` is deprecated and "
            "will be removed in v0.13. "
            "Please use `MultiModalKwargsItems.from_hf_inputs` and "
            "access the tensor data using `.get_data()`.")
def from_hf_inputs(
    hf_inputs: "BatchFeature",
    config_by_key: Mapping[str, MultiModalFieldConfig],
):
    return MultiModalKwargsItems.from_hf_inputs(hf_inputs, config_by_key) \
        .get_data()

from_items `staticmethod` ¶

from_items(
    items: Sequence[MultiModalKwargsItem],
    *,
    pin_memory: bool = False,
)

Source code in vllm/multimodal/inputs.py

@staticmethod
@deprecated("`MultiModalKwargs.from_items` is deprecated and "
            "will be removed in v0.13. "
            "Please use `MultiModalKwargsItems.from_seq` and "
            "access the tensor data using `.get_data()`.")
def from_items(
    items: Sequence[MultiModalKwargsItem],
    *,
    pin_memory: bool = False,
):
    return MultiModalKwargsItems.from_seq(items) \
        .get_data(pin_memory=pin_memory)

MultiModalKwargsItems ¶

Bases: UserDict[str, Sequence[MultiModalKwargsItem]]

A dictionary of MultiModalKwargsItems by modality.

Source code in vllm/multimodal/inputs.py

class MultiModalKwargsItems(UserDict[str, Sequence[MultiModalKwargsItem]]):
    """
    A dictionary of
    [`MultiModalKwargsItem`][vllm.multimodal.inputs.MultiModalKwargsItem]s
    by modality.
    """

    @staticmethod
    def from_hf_inputs(
        hf_inputs: "BatchFeature",
        config_by_key: Mapping[str, MultiModalFieldConfig],
    ):
        # NOTE: This skips fields in `hf_inputs` that are not in `config_by_key`
        # We assume that those fields are not used in vLLM
        elems_by_key = dict[str, Sequence[MultiModalFieldElem]]()
        keys_by_modality = defaultdict[str, set[str]](set)
        for key, config in config_by_key.items():
            batch = hf_inputs.get(key)
            if batch is not None:
                elems = config.build_elems(key, batch)
                if len(elems) > 0:
                    elems_by_key[key] = elems
                    keys_by_modality[config.modality].add(key)

        items = list[MultiModalKwargsItem]()
        for modality, keys in keys_by_modality.items():
            elems_in_modality = {k: elems_by_key[k] for k in keys}
            batch_sizes = {k: len(v) for k, v in elems_in_modality.items()}

            if len(set(batch_sizes.values())) > 1:
                raise ValueError(
                    f"Cannot merge different batch sizes for {modality=}! "
                    f"Found: {batch_sizes=}")

            batch_size = next(iter(batch_sizes.values()))
            for item_idx in range(batch_size):
                elems = [v[item_idx] for v in elems_in_modality.values()]
                items.append(MultiModalKwargsItem.from_elems(elems))

        return MultiModalKwargsItems.from_seq(items)

    @staticmethod
    def from_seq(items: Sequence[MultiModalKwargsItem]):
        items_by_modality = full_groupby(items, key=lambda x: x.modality)
        return MultiModalKwargsItems(items_by_modality)

    def __getitem__(self, modality: str):
        if modality not in self:
            raise KeyError(f"Modality {modality!r} not found. "
                           f"Available modalities: {set(self.keys())}")

        return super().__getitem__(modality)

    def get_data(self, *, pin_memory: bool = False) -> "MultiModalKwargs":
        elems_by_key = defaultdict[str, list[MultiModalFieldElem]](list)
        for items in self.values():
            for item in items:
                for key, elem in item.items():
                    elems_by_key[key].append(elem)

        return MultiModalKwargs({
            key:
            elems[0].field.reduce_data(elems, pin_memory=pin_memory)
            for key, elems in elems_by_key.items() if len(elems) > 0
        })

getitem ¶

__getitem__(modality: str)

Source code in vllm/multimodal/inputs.py

def __getitem__(self, modality: str):
    if modality not in self:
        raise KeyError(f"Modality {modality!r} not found. "
                       f"Available modalities: {set(self.keys())}")

    return super().__getitem__(modality)

from_hf_inputs `staticmethod` ¶

from_hf_inputs(
    hf_inputs: BatchFeature,
    config_by_key: Mapping[str, MultiModalFieldConfig],
)

Source code in vllm/multimodal/inputs.py

@staticmethod
def from_hf_inputs(
    hf_inputs: "BatchFeature",
    config_by_key: Mapping[str, MultiModalFieldConfig],
):
    # NOTE: This skips fields in `hf_inputs` that are not in `config_by_key`
    # We assume that those fields are not used in vLLM
    elems_by_key = dict[str, Sequence[MultiModalFieldElem]]()
    keys_by_modality = defaultdict[str, set[str]](set)
    for key, config in config_by_key.items():
        batch = hf_inputs.get(key)
        if batch is not None:
            elems = config.build_elems(key, batch)
            if len(elems) > 0:
                elems_by_key[key] = elems
                keys_by_modality[config.modality].add(key)

    items = list[MultiModalKwargsItem]()
    for modality, keys in keys_by_modality.items():
        elems_in_modality = {k: elems_by_key[k] for k in keys}
        batch_sizes = {k: len(v) for k, v in elems_in_modality.items()}

        if len(set(batch_sizes.values())) > 1:
            raise ValueError(
                f"Cannot merge different batch sizes for {modality=}! "
                f"Found: {batch_sizes=}")

        batch_size = next(iter(batch_sizes.values()))
        for item_idx in range(batch_size):
            elems = [v[item_idx] for v in elems_in_modality.values()]
            items.append(MultiModalKwargsItem.from_elems(elems))

    return MultiModalKwargsItems.from_seq(items)

from_seq `staticmethod` ¶

from_seq(items: Sequence[MultiModalKwargsItem])

Source code in vllm/multimodal/inputs.py

@staticmethod
def from_seq(items: Sequence[MultiModalKwargsItem]):
    items_by_modality = full_groupby(items, key=lambda x: x.modality)
    return MultiModalKwargsItems(items_by_modality)

get_data ¶

get_data(*, pin_memory: bool = False) -> MultiModalKwargs

Source code in vllm/multimodal/inputs.py

def get_data(self, *, pin_memory: bool = False) -> "MultiModalKwargs":
    elems_by_key = defaultdict[str, list[MultiModalFieldElem]](list)
    for items in self.values():
        for item in items:
            for key, elem in item.items():
                elems_by_key[key].append(elem)

    return MultiModalKwargs({
        key:
        elems[0].field.reduce_data(elems, pin_memory=pin_memory)
        for key, elems in elems_by_key.items() if len(elems) > 0
    })

MultiModalPlaceholderMap ¶

Relates multi-modal embeddings to their corresponding placeholders.

Note: This is only used in V0.

Source code in vllm/multimodal/base.py

class MultiModalPlaceholderMap:
    """
    Relates multi-modal embeddings to their corresponding placeholders.

    Note: This is only used in V0.
    """

    class IndexMap(NamedTuple):
        src: list[int]
        dest: list[int]

    src_ranges: list[range]
    """
    The indices of the multi-modal embeddings that will replace the
    corresponding placeholder embeddings pointed to by ``dest_ranges``.
    """

    src_len: int
    """
    The total number of flattened multi-modal embeddings.
    """

    dest_ranges: list[range]
    """
    The indices of the placeholder embeddings that will be replaced by the
    multimodal embeddings.
    """

    dest_len: int
    """
    The total number of embeddings in the destination tensor.
    """

    def __init__(self):
        self.src_ranges = []
        self.src_len = 0
        self.dest_ranges = []
        self.dest_len = 0

    @classmethod
    def from_seq_group(
        cls, seq_group: "SequenceGroupMetadata", positions: range
    ) -> tuple[MultiModalKwargs, dict[str, "MultiModalPlaceholderMap"]]:
        """
        Returns the multi-modal items that intersect with the portion of a
        prompt (``seq_group``) represented by ``positions``, as well as a
        ``MultiModalPlaceholderMap`` that relates the multi-modal embedding
        vectors to their corresponding placeholders.

        Examples:

        ```
        Prompt:    |AAAA BBBB What's in these images?|
        Positions: |.................................|

            images      = [A, B]
            src_ranges  = [(0, 4), (4, 8)]
            dest_ranges = [(0, 4), (5, 9)]

        Prompt:    |AAAA BBBB What's in these images?|
        Positions: |  .....                          |

            images      = [A, B]
            src_ranges  = [(2, 4), (4, 6)]
            dest_ranges = [(0, 2), (3, 5)]

        Prompt:    |AAAA BBBB What's in these images?|
        Positions: |     .........                   |

            images      = [B]
            src_ranges  = [(0, 4)]
            dest_ranges = [(0, 4)]

        Prompt:    |AAAA BBBB What's in these images?|
        Positions: |          .......................|

            images      = []
            src_ranges  = []
            dest_ranges = []
        ```
        """
        seq_mm_data = seq_group.multi_modal_data
        seq_mm_placeholders = seq_group.multi_modal_placeholders

        if not seq_mm_data or not seq_mm_placeholders:
            return MultiModalKwargs(), {}

        placeholder_maps = dict[str, MultiModalPlaceholderMap]()

        for modality, placeholders in seq_mm_placeholders.items():
            placeholder_map = MultiModalPlaceholderMap()

            if positions:
                placeholder_map.append_items_from_seq_group(
                    positions,
                    # Dummy, since we don't care about intersecting items
                    [None] * len(placeholders),
                    placeholders,
                )

            placeholder_maps[modality] = placeholder_map

        return seq_mm_data, placeholder_maps

    def append_items_from_seq_group(
        self,
        positions: range,
        multi_modal_items: list[_T],
        multi_modal_placeholders: Sequence[PlaceholderRange],
    ) -> list[_T]:
        """
        Adds the multi-modal items that intersect ```positions`` to this
        placeholder map and returns the intersecting items.
        """
        intersecting_items = []

        if len(multi_modal_items) != len(multi_modal_placeholders):
            raise ValueError(
                "Multi-modal placeholders and items must have the same length."
            )
        for placeholder_dict, mm_item in zip(multi_modal_placeholders,
                                             multi_modal_items):
            placeholder = range(
                placeholder_dict.offset,
                placeholder_dict.offset + placeholder_dict.length,
            )
            intersection = range(
                max(positions.start, placeholder.start),
                min(positions.stop, placeholder.stop),
            )

            if not intersection:
                # Skip this multi-modal item.
                continue

            token_embedding_range = range(
                intersection.start - positions.start,
                intersection.stop - positions.start,
            )

            multimodal_embedding_range = range(
                intersection.start - placeholder.start + self.src_len,
                intersection.stop - placeholder.start + self.src_len,
            )

            intersecting_items.append(mm_item)
            self.dest_ranges.append(token_embedding_range)
            self.src_ranges.append(multimodal_embedding_range)
            self.src_len += len(placeholder)

        self.dest_len += len(positions)
        return intersecting_items

    def extend(self, other: "MultiModalPlaceholderMap"):
        """
        Adds the placeholders from another ``MultiModalPlaceholderMap`` to this
        instance based on the source and destination tensors being
        concatenated.
        """

        self.src_ranges.extend(
            range(self.src_len + r.start, self.src_len + r.stop)
            for r in other.src_ranges)
        self.src_len += other.src_len
        self.dest_ranges.extend(
            range(self.dest_len + r.start, self.dest_len + r.stop)
            for r in other.dest_ranges)
        self.dest_len += other.dest_len

    def index_map(self) -> "IndexMap":
        """
        Finalizes the placeholder map into lists of indices that can be used to
        index the source and destination tensors.
        """

        src_indices = [i for r in self.src_ranges for i in r]
        dest_indices = [i for r in self.dest_ranges for i in r]

        if len(src_indices) != len(dest_indices):
            raise ValueError(
                f"The number of source ({len(src_indices)}) and destination "
                f"indices ({len(dest_indices)}) must be the same.")

        return self.IndexMap(src=src_indices, dest=dest_indices)

dest_len `instance-attribute` ¶

dest_len: int = 0

The total number of embeddings in the destination tensor.

dest_ranges `instance-attribute` ¶

dest_ranges: list[range] = []

The indices of the placeholder embeddings that will be replaced by the multimodal embeddings.

src_len `instance-attribute` ¶

src_len: int = 0

The total number of flattened multi-modal embeddings.

src_ranges `instance-attribute` ¶

src_ranges: list[range] = []

The indices of the multi-modal embeddings that will replace the corresponding placeholder embeddings pointed to by dest_ranges.

IndexMap ¶

Bases: NamedTuple

Source code in vllm/multimodal/base.py

class IndexMap(NamedTuple):
    src: list[int]
    dest: list[int]

dest `instance-attribute` ¶

dest: list[int]

src `instance-attribute` ¶

src: list[int]

init ¶

__init__()

Source code in vllm/multimodal/base.py

def __init__(self):
    self.src_ranges = []
    self.src_len = 0
    self.dest_ranges = []
    self.dest_len = 0

append_items_from_seq_group ¶

append_items_from_seq_group(
    positions: range,
    multi_modal_items: list[_T],
    multi_modal_placeholders: Sequence[PlaceholderRange],
) -> list[_T]

Adds the multi-modal items that intersect `positions to this placeholder map and returns the intersecting items.

Source code in vllm/multimodal/base.py

def append_items_from_seq_group(
    self,
    positions: range,
    multi_modal_items: list[_T],
    multi_modal_placeholders: Sequence[PlaceholderRange],
) -> list[_T]:
    """
    Adds the multi-modal items that intersect ```positions`` to this
    placeholder map and returns the intersecting items.
    """
    intersecting_items = []

    if len(multi_modal_items) != len(multi_modal_placeholders):
        raise ValueError(
            "Multi-modal placeholders and items must have the same length."
        )
    for placeholder_dict, mm_item in zip(multi_modal_placeholders,
                                         multi_modal_items):
        placeholder = range(
            placeholder_dict.offset,
            placeholder_dict.offset + placeholder_dict.length,
        )
        intersection = range(
            max(positions.start, placeholder.start),
            min(positions.stop, placeholder.stop),
        )

        if not intersection:
            # Skip this multi-modal item.
            continue

        token_embedding_range = range(
            intersection.start - positions.start,
            intersection.stop - positions.start,
        )

        multimodal_embedding_range = range(
            intersection.start - placeholder.start + self.src_len,
            intersection.stop - placeholder.start + self.src_len,
        )

        intersecting_items.append(mm_item)
        self.dest_ranges.append(token_embedding_range)
        self.src_ranges.append(multimodal_embedding_range)
        self.src_len += len(placeholder)

    self.dest_len += len(positions)
    return intersecting_items

extend ¶

extend(other: MultiModalPlaceholderMap)

Adds the placeholders from another MultiModalPlaceholderMap to this instance based on the source and destination tensors being concatenated.

Source code in vllm/multimodal/base.py

def extend(self, other: "MultiModalPlaceholderMap"):
    """
    Adds the placeholders from another ``MultiModalPlaceholderMap`` to this
    instance based on the source and destination tensors being
    concatenated.
    """

    self.src_ranges.extend(
        range(self.src_len + r.start, self.src_len + r.stop)
        for r in other.src_ranges)
    self.src_len += other.src_len
    self.dest_ranges.extend(
        range(self.dest_len + r.start, self.dest_len + r.stop)
        for r in other.dest_ranges)
    self.dest_len += other.dest_len

from_seq_group `classmethod` ¶

from_seq_group(
    seq_group: SequenceGroupMetadata, positions: range
) -> tuple[
    MultiModalKwargs, dict[str, MultiModalPlaceholderMap]
]

Returns the multi-modal items that intersect with the portion of a prompt (seq_group) represented by positions, as well as a MultiModalPlaceholderMap that relates the multi-modal embedding vectors to their corresponding placeholders.

Examples:

Prompt:    |AAAA BBBB What's in these images?|
Positions: |.................................|

    images      = [A, B]
    src_ranges  = [(0, 4), (4, 8)]
    dest_ranges = [(0, 4), (5, 9)]

Prompt:    |AAAA BBBB What's in these images?|
Positions: |  .....                          |

    images      = [A, B]
    src_ranges  = [(2, 4), (4, 6)]
    dest_ranges = [(0, 2), (3, 5)]

Prompt:    |AAAA BBBB What's in these images?|
Positions: |     .........                   |

    images      = [B]
    src_ranges  = [(0, 4)]
    dest_ranges = [(0, 4)]

Prompt:    |AAAA BBBB What's in these images?|
Positions: |          .......................|

    images      = []
    src_ranges  = []
    dest_ranges = []

Source code in vllm/multimodal/base.py

@classmethod
def from_seq_group(
    cls, seq_group: "SequenceGroupMetadata", positions: range
) -> tuple[MultiModalKwargs, dict[str, "MultiModalPlaceholderMap"]]:
    """
    Returns the multi-modal items that intersect with the portion of a
    prompt (``seq_group``) represented by ``positions``, as well as a
    ``MultiModalPlaceholderMap`` that relates the multi-modal embedding
    vectors to their corresponding placeholders.

    Examples:

    ```
    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |.................................|

        images      = [A, B]
        src_ranges  = [(0, 4), (4, 8)]
        dest_ranges = [(0, 4), (5, 9)]

    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |  .....                          |

        images      = [A, B]
        src_ranges  = [(2, 4), (4, 6)]
        dest_ranges = [(0, 2), (3, 5)]

    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |     .........                   |

        images      = [B]
        src_ranges  = [(0, 4)]
        dest_ranges = [(0, 4)]

    Prompt:    |AAAA BBBB What's in these images?|
    Positions: |          .......................|

        images      = []
        src_ranges  = []
        dest_ranges = []
    ```
    """
    seq_mm_data = seq_group.multi_modal_data
    seq_mm_placeholders = seq_group.multi_modal_placeholders

    if not seq_mm_data or not seq_mm_placeholders:
        return MultiModalKwargs(), {}

    placeholder_maps = dict[str, MultiModalPlaceholderMap]()

    for modality, placeholders in seq_mm_placeholders.items():
        placeholder_map = MultiModalPlaceholderMap()

        if positions:
            placeholder_map.append_items_from_seq_group(
                positions,
                # Dummy, since we don't care about intersecting items
                [None] * len(placeholders),
                placeholders,
            )

        placeholder_maps[modality] = placeholder_map

    return seq_mm_data, placeholder_maps

index_map ¶

index_map() -> IndexMap

Finalizes the placeholder map into lists of indices that can be used to index the source and destination tensors.

Source code in vllm/multimodal/base.py

def index_map(self) -> "IndexMap":
    """
    Finalizes the placeholder map into lists of indices that can be used to
    index the source and destination tensors.
    """

    src_indices = [i for r in self.src_ranges for i in r]
    dest_indices = [i for r in self.dest_ranges for i in r]

    if len(src_indices) != len(dest_indices):
        raise ValueError(
            f"The number of source ({len(src_indices)}) and destination "
            f"indices ({len(dest_indices)}) must be the same.")

    return self.IndexMap(src=src_indices, dest=dest_indices)

MultiModalRegistry ¶

A registry that dispatches data processing according to the model.

Source code in vllm/multimodal/registry.py

class MultiModalRegistry:
    """
    A registry that dispatches data processing according to the model.
    """

    def __init__(self) -> None:
        self._processor_factories = ClassRegistry[nn.Module,
                                                  _ProcessorFactories]()

    def _get_processor_cache(self, model_config: "ModelConfig"):
        model_id = model_config.model
        capacity_gb = model_config.mm_processor_cache_gb
        return _get_processor_cache(model_id, capacity_gb)

    def reset_processor_cache(self, model_config: "ModelConfig") -> bool:
        """Reset the multi-modal processing cache."""
        if processor_cache := self._get_processor_cache(model_config):
            processor_cache.reset()

        return True  # Success

    def enable_mm_input_cache(self, model_config: "ModelConfig") -> bool:
        """Whether the multi-modal input cache should be enabled.
        NOTE: This is put under MultiModalRegistry on purpose to respect 
        text-only mode for multimodal models.
        """

        if not self.supports_multimodal_inputs(model_config):
            return False

        mm_config = model_config.get_multimodal_config()

        return mm_config.mm_processor_cache_gb > 0

    def supports_multimodal_inputs(self, model_config: "ModelConfig") -> bool:
        """
        Checks if the model supports multimodal inputs.
        Returns True if the model is multimodal with any non-zero supported 
        modalities, otherwise returns False, effectively running in 
        text-only mode.
        """
        if not model_config.is_multimodal_model:
            return False

        info = self._create_processing_info(model_config, tokenizer=None)
        supported_modalities = info.get_supported_mm_limits()

        mm_config = model_config.get_multimodal_config()

        # Check if all supported modalities have limit == 0
        if all(
                mm_config.get_limit_per_prompt(modality) == 0
                for modality in supported_modalities):
            logger.info_once(
                "All limits of multimodal modalities supported by the model "
                "are set to 0, running in text-only mode.")
            return False

        return True

    def get_max_tokens_per_item_by_modality(
        self,
        model_config: "ModelConfig",
    ) -> Mapping[str, int]:
        """
        Get the maximum number of tokens per data item from each modality based
        on underlying model configuration.
        """
        if not model_config.is_multimodal_model:
            return {}

        processor = self.create_processor(model_config, disable_cache=False)
        profiler = MultiModalProfiler(processor)

        seq_len = model_config.max_model_len
        mm_limits = self.get_mm_limits_per_prompt(model_config)

        return profiler.get_mm_max_contiguous_tokens(
            seq_len,
            {
                modality: 1
                for modality, limit in mm_limits.items() if limit > 0
            },
        )

    def get_max_tokens_per_item_by_nonzero_modality(
        self,
        model_config: "ModelConfig",
    ) -> Mapping[str, int]:
        """
        Get the maximum number of tokens per data item from each modality based
        on underlying model configuration, excluding modalities that user
        explicitly disabled via `limit_mm_per_prompt`.

        Note:
            This is currently directly used only in V1 for profiling the memory
            usage of a model.
        """
        mm_limits = self.get_mm_limits_per_prompt(model_config)

        return {
            key: max_tokens_per_mm_item
            for key, max_tokens_per_mm_item in
            self.get_max_tokens_per_item_by_modality(model_config).items()
            if mm_limits[key] > 0
        }

    def get_max_tokens_by_modality(
        self,
        model_config: "ModelConfig",
    ) -> Mapping[str, int]:
        """
        Get the maximum number of tokens from each modality
        for profiling the memory usage of a model.
        """
        mm_limits = self.get_mm_limits_per_prompt(model_config)

        return {
            key: mm_limits[key] * max_tokens_per_mm_item
            for key, max_tokens_per_mm_item in
            self.get_max_tokens_per_item_by_modality(model_config).items()
        }

    def get_max_multimodal_tokens(self, model_config: "ModelConfig") -> int:
        """
        Get the maximum number of multi-modal tokens
        for profiling the memory usage of a model.
        """
        return sum(self.get_max_tokens_by_modality(model_config).values())

    def get_mm_limits_per_prompt(
        self,
        model_config: "ModelConfig",
    ) -> Mapping[str, int]:
        """
        Get the maximum number of multi-modal input instances for each modality
        that are allowed per prompt for a model class.
        """
        if not model_config.is_multimodal_model:
            return {}

        processor = self.create_processor(model_config, disable_cache=False)
        profiler = MultiModalProfiler(processor)
        return profiler.get_mm_limits()

    def register_processor(
        self,
        processor: MultiModalProcessorFactory[_I],
        *,
        info: ProcessingInfoFactory[_I],
        dummy_inputs: DummyInputsBuilderFactory[_I],
    ):
        """
        Register a multi-modal processor to a model class. The processor
        is constructed lazily, hence a factory method should be passed.

        When the model receives multi-modal data, the provided function is
        invoked to transform the data into a dictionary of model inputs.
        """

        def wrapper(model_cls: N) -> N:
            if self._processor_factories.contains(model_cls, strict=True):
                logger.warning(
                    "Model class %s already has a multi-modal processor "
                    "registered to %s. It is overwritten by the new one.",
                    model_cls, self)

            self._processor_factories[model_cls] = _ProcessorFactories(
                info=info,
                dummy_inputs=dummy_inputs,
                processor=processor,
            )

            return model_cls

        return wrapper

    def _get_model_cls(self, model_config: "ModelConfig"):
        # Avoid circular import
        from vllm.model_executor.model_loader import get_model_architecture

        model_cls, _ = get_model_architecture(model_config)
        return model_cls

    def _create_processing_ctx(
        self,
        model_config: "ModelConfig",
        tokenizer: Optional[AnyTokenizer] = None,
    ) -> InputProcessingContext:
        if tokenizer is None and not model_config.skip_tokenizer_init:
            tokenizer = cached_tokenizer_from_config(model_config)
        return InputProcessingContext(model_config, tokenizer)

    def _create_processing_info(
        self,
        model_config: "ModelConfig",
        *,
        tokenizer: Optional[AnyTokenizer] = None,
    ) -> BaseProcessingInfo:
        model_cls = self._get_model_cls(model_config)
        factories = self._processor_factories[model_cls]
        ctx = self._create_processing_ctx(model_config, tokenizer)
        return factories.info(ctx)

    def create_processor(
        self,
        model_config: "ModelConfig",
        *,
        tokenizer: Optional[AnyTokenizer] = None,
        disable_cache: Optional[bool] = None,
    ) -> BaseMultiModalProcessor[BaseProcessingInfo]:
        """
        Create a multi-modal processor for a specific model and tokenizer.
        """
        if not model_config.is_multimodal_model:
            raise ValueError(f"{model_config.model} is not a multimodal model")

        if disable_cache is None:
            disable_cache = not model_config.enable_mm_processor_cache

        model_cls = self._get_model_cls(model_config)
        factories = self._processor_factories[model_cls]

        ctx = self._create_processing_ctx(model_config, tokenizer)
        cache = None if disable_cache else self._get_processor_cache(
            model_config)

        return factories.build_processor(ctx, cache=cache)

    def get_decoder_dummy_data(
        self,
        model_config: "ModelConfig",
        seq_len: int,
        mm_counts: Optional[Mapping[str, int]] = None,
    ) -> DummyDecoderData:
        """
        Create dummy data for profiling the memory usage of a model.

        The model is identified by ``model_config``.
        """
        processor = self.create_processor(model_config, disable_cache=False)
        profiler = MultiModalProfiler(processor)
        dummy_data = profiler.get_decoder_dummy_data(seq_len, mm_counts)

        # Having more tokens is over-conservative but otherwise fine
        token_ids = dummy_data.prompt_token_ids
        if len(token_ids) < seq_len:
            raise AssertionError(
                f"Expected at least {seq_len} dummy tokens for profiling, "
                f"but found {len(token_ids)} tokens instead.")

        return dummy_data

    def get_encoder_dummy_data(
        self,
        model_config: "ModelConfig",
        seq_len: int,
        mm_counts: Optional[Mapping[str, int]] = None,
    ) -> DummyEncoderData:
        """
        Create dummy data for profiling the memory usage of a model.

        The model is identified by ``model_config``.
        """
        processor = self.create_processor(model_config, disable_cache=False)
        profiler = MultiModalProfiler(processor)
        dummy_data = profiler.get_encoder_dummy_data(seq_len, mm_counts)

        # Having more tokens is over-conservative but otherwise fine
        token_ids = dummy_data.prompt_token_ids
        if len(token_ids) < seq_len:
            logger.warning_once(
                "Expected at least %d dummy encoder tokens for profiling, but found %d tokens instead.",  # noqa: E501
                seq_len,
                len(token_ids),
            )

        return dummy_data

_processor_factories `instance-attribute` ¶

_processor_factories = ClassRegistry[
    Module, _ProcessorFactories
]()

init ¶

__init__() -> None

Source code in vllm/multimodal/registry.py

def __init__(self) -> None:
    self._processor_factories = ClassRegistry[nn.Module,
                                              _ProcessorFactories]()

_create_processing_ctx ¶

_create_processing_ctx(
    model_config: ModelConfig,
    tokenizer: Optional[AnyTokenizer] = None,
) -> InputProcessingContext

Source code in vllm/multimodal/registry.py

def _create_processing_ctx(
    self,
    model_config: "ModelConfig",
    tokenizer: Optional[AnyTokenizer] = None,
) -> InputProcessingContext:
    if tokenizer is None and not model_config.skip_tokenizer_init:
        tokenizer = cached_tokenizer_from_config(model_config)
    return InputProcessingContext(model_config, tokenizer)

_create_processing_info ¶

_create_processing_info(
    model_config: ModelConfig,
    *,
    tokenizer: Optional[AnyTokenizer] = None,
) -> BaseProcessingInfo

Source code in vllm/multimodal/registry.py

def _create_processing_info(
    self,
    model_config: "ModelConfig",
    *,
    tokenizer: Optional[AnyTokenizer] = None,
) -> BaseProcessingInfo:
    model_cls = self._get_model_cls(model_config)
    factories = self._processor_factories[model_cls]
    ctx = self._create_processing_ctx(model_config, tokenizer)
    return factories.info(ctx)

_get_model_cls ¶

_get_model_cls(model_config: ModelConfig)

Source code in vllm/multimodal/registry.py

def _get_model_cls(self, model_config: "ModelConfig"):
    # Avoid circular import
    from vllm.model_executor.model_loader import get_model_architecture

    model_cls, _ = get_model_architecture(model_config)
    return model_cls

_get_processor_cache ¶

_get_processor_cache(model_config: ModelConfig)

Source code in vllm/multimodal/registry.py

def _get_processor_cache(self, model_config: "ModelConfig"):
    model_id = model_config.model
    capacity_gb = model_config.mm_processor_cache_gb
    return _get_processor_cache(model_id, capacity_gb)

create_processor ¶

create_processor(
    model_config: ModelConfig,
    *,
    tokenizer: Optional[AnyTokenizer] = None,
    disable_cache: Optional[bool] = None,
) -> BaseMultiModalProcessor[BaseProcessingInfo]

Create a multi-modal processor for a specific model and tokenizer.

Source code in vllm/multimodal/registry.py

def create_processor(
    self,
    model_config: "ModelConfig",
    *,
    tokenizer: Optional[AnyTokenizer] = None,
    disable_cache: Optional[bool] = None,
) -> BaseMultiModalProcessor[BaseProcessingInfo]:
    """
    Create a multi-modal processor for a specific model and tokenizer.
    """
    if not model_config.is_multimodal_model:
        raise ValueError(f"{model_config.model} is not a multimodal model")

    if disable_cache is None:
        disable_cache = not model_config.enable_mm_processor_cache

    model_cls = self._get_model_cls(model_config)
    factories = self._processor_factories[model_cls]

    ctx = self._create_processing_ctx(model_config, tokenizer)
    cache = None if disable_cache else self._get_processor_cache(
        model_config)

    return factories.build_processor(ctx, cache=cache)

enable_mm_input_cache ¶

enable_mm_input_cache(model_config: ModelConfig) -> bool

Whether the multi-modal input cache should be enabled. NOTE: This is put under MultiModalRegistry on purpose to respect text-only mode for multimodal models.

Source code in vllm/multimodal/registry.py

def enable_mm_input_cache(self, model_config: "ModelConfig") -> bool:
    """Whether the multi-modal input cache should be enabled.
    NOTE: This is put under MultiModalRegistry on purpose to respect 
    text-only mode for multimodal models.
    """

    if not self.supports_multimodal_inputs(model_config):
        return False

    mm_config = model_config.get_multimodal_config()

    return mm_config.mm_processor_cache_gb > 0

get_decoder_dummy_data ¶

get_decoder_dummy_data(
    model_config: ModelConfig,
    seq_len: int,
    mm_counts: Optional[Mapping[str, int]] = None,
) -> DummyDecoderData

Create dummy data for profiling the memory usage of a model.

The model is identified by model_config.

Source code in vllm/multimodal/registry.py

def get_decoder_dummy_data(
    self,
    model_config: "ModelConfig",
    seq_len: int,
    mm_counts: Optional[Mapping[str, int]] = None,
) -> DummyDecoderData:
    """
    Create dummy data for profiling the memory usage of a model.

    The model is identified by ``model_config``.
    """
    processor = self.create_processor(model_config, disable_cache=False)
    profiler = MultiModalProfiler(processor)
    dummy_data = profiler.get_decoder_dummy_data(seq_len, mm_counts)

    # Having more tokens is over-conservative but otherwise fine
    token_ids = dummy_data.prompt_token_ids
    if len(token_ids) < seq_len:
        raise AssertionError(
            f"Expected at least {seq_len} dummy tokens for profiling, "
            f"but found {len(token_ids)} tokens instead.")

    return dummy_data

get_encoder_dummy_data ¶

get_encoder_dummy_data(
    model_config: ModelConfig,
    seq_len: int,
    mm_counts: Optional[Mapping[str, int]] = None,
) -> DummyEncoderData

Create dummy data for profiling the memory usage of a model.

The model is identified by model_config.

Source code in vllm/multimodal/registry.py

def get_encoder_dummy_data(
    self,
    model_config: "ModelConfig",
    seq_len: int,
    mm_counts: Optional[Mapping[str, int]] = None,
) -> DummyEncoderData:
    """
    Create dummy data for profiling the memory usage of a model.

    The model is identified by ``model_config``.
    """
    processor = self.create_processor(model_config, disable_cache=False)
    profiler = MultiModalProfiler(processor)
    dummy_data = profiler.get_encoder_dummy_data(seq_len, mm_counts)

    # Having more tokens is over-conservative but otherwise fine
    token_ids = dummy_data.prompt_token_ids
    if len(token_ids) < seq_len:
        logger.warning_once(
            "Expected at least %d dummy encoder tokens for profiling, but found %d tokens instead.",  # noqa: E501
            seq_len,
            len(token_ids),
        )

    return dummy_data

get_max_multimodal_tokens ¶

get_max_multimodal_tokens(model_config: ModelConfig) -> int

Get the maximum number of multi-modal tokens for profiling the memory usage of a model.

Source code in vllm/multimodal/registry.py

def get_max_multimodal_tokens(self, model_config: "ModelConfig") -> int:
    """
    Get the maximum number of multi-modal tokens
    for profiling the memory usage of a model.
    """
    return sum(self.get_max_tokens_by_modality(model_config).values())

get_max_tokens_by_modality ¶

get_max_tokens_by_modality(
    model_config: ModelConfig,
) -> Mapping[str, int]

Get the maximum number of tokens from each modality for profiling the memory usage of a model.

Source code in vllm/multimodal/registry.py

def get_max_tokens_by_modality(
    self,
    model_config: "ModelConfig",
) -> Mapping[str, int]:
    """
    Get the maximum number of tokens from each modality
    for profiling the memory usage of a model.
    """
    mm_limits = self.get_mm_limits_per_prompt(model_config)

    return {
        key: mm_limits[key] * max_tokens_per_mm_item
        for key, max_tokens_per_mm_item in
        self.get_max_tokens_per_item_by_modality(model_config).items()
    }

get_max_tokens_per_item_by_modality ¶

get_max_tokens_per_item_by_modality(
    model_config: ModelConfig,
) -> Mapping[str, int]

Get the maximum number of tokens per data item from each modality based on underlying model configuration.

Source code in vllm/multimodal/registry.py

def get_max_tokens_per_item_by_modality(
    self,
    model_config: "ModelConfig",
) -> Mapping[str, int]:
    """
    Get the maximum number of tokens per data item from each modality based
    on underlying model configuration.
    """
    if not model_config.is_multimodal_model:
        return {}

    processor = self.create_processor(model_config, disable_cache=False)
    profiler = MultiModalProfiler(processor)

    seq_len = model_config.max_model_len
    mm_limits = self.get_mm_limits_per_prompt(model_config)

    return profiler.get_mm_max_contiguous_tokens(
        seq_len,
        {
            modality: 1
            for modality, limit in mm_limits.items() if limit > 0
        },
    )

get_max_tokens_per_item_by_nonzero_modality ¶

get_max_tokens_per_item_by_nonzero_modality(
    model_config: ModelConfig,
) -> Mapping[str, int]

Get the maximum number of tokens per data item from each modality based on underlying model configuration, excluding modalities that user explicitly disabled via limit_mm_per_prompt.

Note

This is currently directly used only in V1 for profiling the memory usage of a model.

Source code in vllm/multimodal/registry.py

def get_max_tokens_per_item_by_nonzero_modality(
    self,
    model_config: "ModelConfig",
) -> Mapping[str, int]:
    """
    Get the maximum number of tokens per data item from each modality based
    on underlying model configuration, excluding modalities that user
    explicitly disabled via `limit_mm_per_prompt`.

    Note:
        This is currently directly used only in V1 for profiling the memory
        usage of a model.
    """
    mm_limits = self.get_mm_limits_per_prompt(model_config)

    return {
        key: max_tokens_per_mm_item
        for key, max_tokens_per_mm_item in
        self.get_max_tokens_per_item_by_modality(model_config).items()
        if mm_limits[key] > 0
    }

get_mm_limits_per_prompt ¶

get_mm_limits_per_prompt(
    model_config: ModelConfig,
) -> Mapping[str, int]

Get the maximum number of multi-modal input instances for each modality that are allowed per prompt for a model class.

Source code in vllm/multimodal/registry.py

def get_mm_limits_per_prompt(
    self,
    model_config: "ModelConfig",
) -> Mapping[str, int]:
    """
    Get the maximum number of multi-modal input instances for each modality
    that are allowed per prompt for a model class.
    """
    if not model_config.is_multimodal_model:
        return {}

    processor = self.create_processor(model_config, disable_cache=False)
    profiler = MultiModalProfiler(processor)
    return profiler.get_mm_limits()

register_processor ¶

register_processor(
    processor: MultiModalProcessorFactory[_I],
    *,
    info: ProcessingInfoFactory[_I],
    dummy_inputs: DummyInputsBuilderFactory[_I],
)

Register a multi-modal processor to a model class. The processor is constructed lazily, hence a factory method should be passed.

When the model receives multi-modal data, the provided function is invoked to transform the data into a dictionary of model inputs.

Source code in vllm/multimodal/registry.py

def register_processor(
    self,
    processor: MultiModalProcessorFactory[_I],
    *,
    info: ProcessingInfoFactory[_I],
    dummy_inputs: DummyInputsBuilderFactory[_I],
):
    """
    Register a multi-modal processor to a model class. The processor
    is constructed lazily, hence a factory method should be passed.

    When the model receives multi-modal data, the provided function is
    invoked to transform the data into a dictionary of model inputs.
    """

    def wrapper(model_cls: N) -> N:
        if self._processor_factories.contains(model_cls, strict=True):
            logger.warning(
                "Model class %s already has a multi-modal processor "
                "registered to %s. It is overwritten by the new one.",
                model_cls, self)

        self._processor_factories[model_cls] = _ProcessorFactories(
            info=info,
            dummy_inputs=dummy_inputs,
            processor=processor,
        )

        return model_cls

    return wrapper

reset_processor_cache ¶

reset_processor_cache(model_config: ModelConfig) -> bool

Reset the multi-modal processing cache.

Source code in vllm/multimodal/registry.py

def reset_processor_cache(self, model_config: "ModelConfig") -> bool:
    """Reset the multi-modal processing cache."""
    if processor_cache := self._get_processor_cache(model_config):
        processor_cache.reset()

    return True  # Success

supports_multimodal_inputs ¶

supports_multimodal_inputs(
    model_config: ModelConfig,
) -> bool

Checks if the model supports multimodal inputs. Returns True if the model is multimodal with any non-zero supported modalities, otherwise returns False, effectively running in text-only mode.

Source code in vllm/multimodal/registry.py

def supports_multimodal_inputs(self, model_config: "ModelConfig") -> bool:
    """
    Checks if the model supports multimodal inputs.
    Returns True if the model is multimodal with any non-zero supported 
    modalities, otherwise returns False, effectively running in 
    text-only mode.
    """
    if not model_config.is_multimodal_model:
        return False

    info = self._create_processing_info(model_config, tokenizer=None)
    supported_modalities = info.get_supported_mm_limits()

    mm_config = model_config.get_multimodal_config()

    # Check if all supported modalities have limit == 0
    if all(
            mm_config.get_limit_per_prompt(modality) == 0
            for modality in supported_modalities):
        logger.info_once(
            "All limits of multimodal modalities supported by the model "
            "are set to 0, running in text-only mode.")
        return False

    return True

vllm.multimodal

BatchedTensorInputs module-attribute ¶

MULTIMODAL_REGISTRY module-attribute ¶

ModalityData module-attribute ¶

MultiModalDataDict module-attribute ¶

MultiModalHashDict module-attribute ¶

MultiModalPlaceholderDict module-attribute ¶

NestedTensors module-attribute ¶

__all__ module-attribute ¶

MultiModalDataBuiltins ¶

audio instance-attribute ¶

image instance-attribute ¶

video instance-attribute ¶

MultiModalHasher ¶

hash_kwargs classmethod ¶

item_to_bytes classmethod ¶

iter_item_to_bytes classmethod ¶

serialize_item classmethod ¶

MultiModalKwargs ¶

__eq__ ¶

__getitem__ ¶

_try_stack staticmethod ¶

as_kwargs staticmethod ¶

batch staticmethod ¶

from_hf_inputs staticmethod ¶

from_items staticmethod ¶

MultiModalKwargsItems ¶

__getitem__ ¶

from_hf_inputs staticmethod ¶

from_seq staticmethod ¶

get_data ¶

MultiModalPlaceholderMap ¶

dest_len instance-attribute ¶

dest_ranges instance-attribute ¶

src_len instance-attribute ¶

src_ranges instance-attribute ¶

IndexMap ¶

dest instance-attribute ¶

src instance-attribute ¶

__init__ ¶

append_items_from_seq_group ¶

extend ¶

from_seq_group classmethod ¶

index_map ¶

MultiModalRegistry ¶

_processor_factories instance-attribute ¶

__init__ ¶

_create_processing_ctx ¶

_create_processing_info ¶

_get_model_cls ¶

_get_processor_cache ¶

create_processor ¶

enable_mm_input_cache ¶

get_decoder_dummy_data ¶

get_encoder_dummy_data ¶

get_max_multimodal_tokens ¶

get_max_tokens_by_modality ¶

get_max_tokens_per_item_by_modality ¶

get_max_tokens_per_item_by_nonzero_modality ¶

get_mm_limits_per_prompt ¶

register_processor ¶

reset_processor_cache ¶

supports_multimodal_inputs ¶

BatchedTensorInputs `module-attribute` ¶

MULTIMODAL_REGISTRY `module-attribute` ¶

ModalityData `module-attribute` ¶

MultiModalDataDict `module-attribute` ¶

MultiModalHashDict `module-attribute` ¶

MultiModalPlaceholderDict `module-attribute` ¶

NestedTensors `module-attribute` ¶

all `module-attribute` ¶

audio `instance-attribute` ¶

image `instance-attribute` ¶

video `instance-attribute` ¶

hash_kwargs `classmethod` ¶

item_to_bytes `classmethod` ¶

iter_item_to_bytes `classmethod` ¶

serialize_item `classmethod` ¶

eq ¶

getitem ¶

_try_stack `staticmethod` ¶

as_kwargs `staticmethod` ¶

batch `staticmethod` ¶

from_hf_inputs `staticmethod` ¶

from_items `staticmethod` ¶

getitem ¶

from_hf_inputs `staticmethod` ¶

from_seq `staticmethod` ¶

dest_len `instance-attribute` ¶

dest_ranges `instance-attribute` ¶

src_len `instance-attribute` ¶

src_ranges `instance-attribute` ¶

dest `instance-attribute` ¶

src `instance-attribute` ¶

init ¶

from_seq_group `classmethod` ¶

_processor_factories `instance-attribute` ¶

init ¶