vllm.model_executor.layers.quantization.utils.quant_utils
This file is used for /tests and /benchmarks
kFp8DynamicTensorSym module-attribute

kFp8DynamicTensorSym = QuantKey(
    FP8_DTYPE, kDynamicTensorScale, symmetric=True
)
kFp8DynamicTokenSym module-attribute

kFp8DynamicTokenSym = QuantKey(
    FP8_DTYPE, kDynamicTokenScale, symmetric=True
)
kFp8StaticTensorSym module-attribute

kFp8StaticTensorSym = QuantKey(
    FP8_DTYPE, kStaticTensorScale, symmetric=True
)
kNvfp4GroupScale module-attribute

kNvfp4GroupScale = ScaleDesc(
    FP8_DTYPE, False, GroupShape(1, 16)
)
kNvfp4Quant module-attribute

kNvfp4Quant = QuantKey(
    FP4_DTYPE,
    scale=kNvfp4GroupScale,
    scale2=kStaticTensorScale,
)
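These module-level keys act as canonical descriptors that backends can compare against when choosing a kernel path. A minimal sketch, assuming only dataclass equality on QuantKey (the dispatch helper itself is hypothetical):

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    kFp8DynamicTensorSym,
    kFp8DynamicTokenSym,
    kFp8StaticTensorSym,
)

def describe_fp8_scheme(key) -> str:
    # Hypothetical helper: dataclass equality makes preset matching trivial.
    if key == kFp8StaticTensorSym:
        return "FP8, static per-tensor scale, symmetric"
    if key == kFp8DynamicTensorSym:
        return "FP8, dynamic per-tensor scale, symmetric"
    if key == kFp8DynamicTokenSym:
        return "FP8, dynamic per-token scale, symmetric"
    raise ValueError(f"unsupported quantization key: {key}")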
GroupShape
Bases: _GroupShape
This class describes the quantization group shape. It includes static members for common shapes (per-tensor, per-token).
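A short sketch of the convention. PER_TENSOR and PER_TOKEN below are the common-shape members the description refers to; the -1 encoding (meaning "span the whole dimension") is an assumption about their definition:

from vllm.model_executor.layers.quantization.utils.quant_utils import GroupShape

per_tensor = GroupShape.PER_TENSOR  # assumed GroupShape(-1, -1): one scale for the whole tensor
per_token = GroupShape.PER_TOKEN    # assumed GroupShape(1, -1): one scale per row/token
nvfp4 = GroupShape(1, 16)           # 16-element groups along the last dim, as in kNvfp4GroupScale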
QuantKey dataclass

Class for identifying the type of quantization.
dtype: quantized data type
scale: scale descriptor
scale2: second-level scale descriptor
symmetric: symmetric if True, asymmetric if False
ScaleDesc dataclass

Class for describing a single quantization scaling factor.
dtype: data type of the scale
static: static scale if True, dynamic if False
group_shape: group shape of the scale
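The two dataclasses compose: a QuantKey carries up to two ScaleDesc values. Rebuilding the NVFP4 preset from the module attributes above shows the pattern (a sketch; FP8_DTYPE, FP4_DTYPE, and kStaticTensorScale are the module-level constants used in those definitions):

from vllm.model_executor.layers.quantization.utils.quant_utils import (
    FP4_DTYPE, FP8_DTYPE, GroupShape, QuantKey, ScaleDesc,
    kNvfp4Quant, kStaticTensorScale,
)

# Dynamic FP8 scale, one per 16-element group (matches kNvfp4GroupScale).
group_scale = ScaleDesc(FP8_DTYPE, False, GroupShape(1, 16))

# FP4 data with a group scale plus a static per-tensor second-level scale.
nvfp4 = QuantKey(FP4_DTYPE, scale=group_scale, scale2=kStaticTensorScale)
assert nvfp4 == kNvfp4Quant  # dataclass equality matches the module-level preset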
__str__
_GroupShape
_normalize_quant_group_shape

_normalize_quant_group_shape(
    x: Tensor, group_shape: GroupShape
)
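The intended behavior, as assumed here, is to resolve -1 placeholders in a GroupShape against the tensor's concrete shape (an assumption; the helper is private and undocumented on this page):

import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape, _normalize_quant_group_shape,
)

x = torch.randn(8, 128)
# Assumed: (1, -1) resolves to (1, 128) for this x, i.e. one group per row.
normalized = _normalize_quant_group_shape(x, GroupShape(1, -1))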
awq_pack
cutlass_fp4_supported

cutlass_fp4_supported() -> bool
get_pack_factor
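The packing helpers on this page store several narrow values per 32-bit word. Assuming get_pack_factor takes the element bit-width (an assumption; its signature is not shown here), the relationship is simply 32 // num_bits:

from vllm.model_executor.layers.quantization.utils.quant_utils import get_pack_factor

assert get_pack_factor(4) == 8  # eight 4-bit values per int32 (assumed signature)
assert get_pack_factor(8) == 4  # four 8-bit values per int32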
gptq_pack
gptq_quantize_weights

gptq_quantize_weights(
    w: Tensor,
    quant_type: ScalarType,
    group_size: int,
    act_order: bool,
    test_perm: Optional[Tensor] = None,
)
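A hedged usage sketch. The quantized type comes from vllm.scalar_type; the return values are not documented on this page, so they are left unnamed:

import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import gptq_quantize_weights
from vllm.scalar_type import scalar_types

w = torch.randn(4096, 4096, dtype=torch.half)
# 4-bit GPTQ-style quantization with 128-element groups, no activation reordering.
results = gptq_quantize_weights(
    w, quant_type=scalar_types.uint4b8, group_size=128, act_order=False
)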
group_broadcast
is_layer_skipped

is_layer_skipped(
    prefix: str,
    ignored_layers: list[str],
    fused_mapping: Mapping[
        str, list[str]
    ] = MappingProxyType({}),
) -> bool
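A usage sketch: a quantization config can consult this helper while building per-layer methods, leaving ignored modules unquantized (the layer names are illustrative):

from vllm.model_executor.layers.quantization.utils.quant_utils import is_layer_skipped

ignored = ["lm_head", "model.layers.0.mlp"]
if is_layer_skipped("lm_head", ignored):
    # Fall back to an unquantized linear method for this layer.
    print("skipping quantization for lm_head")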
pack_cols
pack_quantized_values_into_int32

pack_quantized_values_into_int32(
    w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)
pack_rows
permute_rows
quantize_weights

quantize_weights(
    w: Tensor,
    quant_type: ScalarType,
    group_size: Optional[int],
    zero_points: bool = False,
    ref_zero_points_after_scales: bool = False,
)
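A hedged usage sketch of the reference quantizer. The return values are not documented on this page; with zero_points left False the quantization is symmetric, per the signature defaults:

import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import quantize_weights
from vllm.scalar_type import scalar_types

w = torch.randn(1024, 1024, dtype=torch.half)
# Symmetric 4-bit quantization with one scale per 128-element group.
results = quantize_weights(w, scalar_types.uint4b8, group_size=128)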
scaled_dequantize

scaled_dequantize(
    x_q: Tensor,
    x_s: Tensor,
    group_shape: Optional[GroupShape] = None,
    out_dtype: dtype = float32,
) -> tuple[Tensor, Tensor]
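A hedged sketch of group dequantization. The scale tensor holds one entry per group; the shapes and dtypes below are assumptions for illustration:

import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    GroupShape, scaled_dequantize,
)

x_q = torch.randint(-8, 8, (4, 64), dtype=torch.int8)  # quantized values
x_s = torch.rand(4, 4, dtype=torch.float32)            # one scale per (1, 16) group
x = scaled_dequantize(x_q, x_s, group_shape=GroupShape(1, 16))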
scaled_quantize
sort_weights
swizzle_blockscale

Pad and block-interleave the FP4 block-scales so that they match the data layout expected by the CUTLASS / FlashInfer kernels.

Parameters
scale: torch.Tensor

Returns
torch.Tensor: The swizzled tensor with the same logical shape as scale.
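A hedged usage sketch. Per the description, the result keeps the same logical shape as the input; the FP8 scale dtype mirrors kNvfp4GroupScale above:

import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import swizzle_blockscale

scales = torch.rand(128, 64).to(torch.float8_e4m3fn)  # FP4 block-scales
swizzled = swizzle_blockscale(scales)
assert swizzled.shape == scales.shape  # same logical shape, interleaved layout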
unpack_cols
unpack_quantized_values_into_int32

unpack_quantized_values_into_int32(
    w_q: Tensor, wtype: ScalarType, packed_dim: int = 0
)
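This function is the inverse of pack_quantized_values_into_int32 along packed_dim. A round-trip sketch (the 4-bit type and the shapes are assumptions; eight 4-bit values fit in each int32):

import torch
from vllm.model_executor.layers.quantization.utils.quant_utils import (
    pack_quantized_values_into_int32, unpack_quantized_values_into_int32,
)
from vllm.scalar_type import scalar_types

wtype = scalar_types.uint4b8
w_q = torch.randint(0, 16, (256, 512), dtype=torch.int32)
packed = pack_quantized_values_into_int32(w_q, wtype, packed_dim=1)  # assumed -> (256, 64)
restored = unpack_quantized_values_into_int32(packed, wtype, packed_dim=1)
assert torch.equal(restored, w_q)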