vllm.model_executor.layers.quantization.kernels.scaled_mm
Modules:
Name | Description |
---|---|
ScaledMMLinearKernel | |
aiter | |
cpu | |
cutlass | |
triton | |
xla | |
_POSSIBLE_KERNELS module-attribute ¶
_POSSIBLE_KERNELS: dict[
PlatformEnum, list[type[ScaledMMLinearKernel]]
] = {
CPU: [CPUScaledMMLinearKernel],
CUDA: [CutlassScaledMMLinearKernel],
ROCM: [
AiterScaledMMLinearKernel,
TritonScaledMMLinearKernel,
],
TPU: [XLAScaledMMLinearKernel],
}
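The mapping is keyed by `PlatformEnum`, and the kernels in each list are ordered by preference, so earlier entries are tried first during selection. Below is a minimal sketch of inspecting the candidates for one platform; the import paths are assumptions based on the module layout shown above and are not a stable public API.

```python
# Sketch only: _POSSIBLE_KERNELS is a private module attribute, and the
# import paths below are assumptions based on the package layout above.
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    _POSSIBLE_KERNELS,
)
from vllm.platforms.interface import PlatformEnum

# Candidate kernels registered for CUDA, in preference order (best first).
for kernel_cls in _POSSIBLE_KERNELS[PlatformEnum.CUDA]:
    print(kernel_cls.__name__)  # e.g. CutlassScaledMMLinearKernel
```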
choose_scaled_mm_linear_kernel ¶
choose_scaled_mm_linear_kernel(
config: ScaledMMLinearLayerConfig,
compute_capability: Optional[int] = None,
) -> type[ScaledMMLinearKernel]
Choose a ScaledMMLinearKernel that can implement the given config for the given compute capability. Attempts to choose the best kernel in terms of performance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config | ScaledMMLinearLayerConfig | Description of the linear layer to be implemented. | required |
compute_capability | Optional[int] | The compute capability of the target device; if None, it is queried from the current platform. | None |
Raises:
Type | Description |
---|---|
ValueError | If no kernel can implement the given config. |
Returns:
Type | Description |
---|---|
type[ScaledMMLinearKernel] | Chosen kernel. |
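A hedged usage sketch follows. The `ScaledMMLinearLayerConfig` field names are illustrative assumptions (they are not documented in this section), and the config's import path is assumed to mirror the `ScaledMMLinearKernel` submodule listed above.

```python
# Usage sketch. The config fields and the exact import path of
# ScaledMMLinearLayerConfig are assumptions, not a verified public API.
from vllm.model_executor.layers.quantization.kernels.scaled_mm import (
    choose_scaled_mm_linear_kernel,
)
from vllm.model_executor.layers.quantization.kernels.scaled_mm.ScaledMMLinearKernel import (  # noqa: E501
    ScaledMMLinearLayerConfig,
)

# Assumed fields: per-channel weight scales, a static (pre-computed)
# activation scale, and symmetric activation quantization.
config = ScaledMMLinearLayerConfig(
    is_channelwise=True,
    is_static_input_scheme=True,
    input_symmetric=True,
)

# With compute_capability=None the current platform is queried; pass an
# explicit value (e.g. 90 for SM90) to override.
kernel_cls = choose_scaled_mm_linear_kernel(config, compute_capability=None)
print(kernel_cls.__name__)  # e.g. CutlassScaledMMLinearKernel on CUDA
```

If no kernel registered for the current platform in `_POSSIBLE_KERNELS` can implement the config, the call raises `ValueError` as noted above.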