vllm.model_executor.layers.quantization.deepspeedfp
DeepSpeedFPConfig ¶
Bases: QuantizationConfig
Config for DeepSpeed FP quantizer. It supports fp6 and fp8.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
weight_bits | int | the target quantization bits, 6 or 8. | 8 |
group_size | int | the group size for quantization. | 512 |
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
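To illustrate what `weight_bits` and `group_size` mean, here is a minimal, self-contained sketch of group-wise quantization in NumPy: each contiguous group of `group_size` values shares one absmax-derived scale. This is only an illustration of the scheme, not the actual DeepSpeed FP6/FP8 kernels vLLM calls.

```python
import numpy as np

def groupwise_quantize(w: np.ndarray, group_size: int = 512, bits: int = 8):
    """Illustrative group-wise quantization: each contiguous group of
    `group_size` values shares one scale (group absmax / max int level)."""
    flat = w.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    scales = np.abs(flat).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                  # avoid divide-by-zero
    q = np.round(flat / scales).astype(np.int8)
    return q, scales

def groupwise_dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, s = groupwise_quantize(w, group_size=512, bits=8)
w_hat = groupwise_dequantize(q, s)
# Rounding error is at most half a quantization step per element.
assert np.abs(w - w_hat).max() <= s.max() / 2 + 1e-6
```

A larger `group_size` means fewer scales to store (less overhead) but a coarser fit within each group.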
__init__ ¶
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
from_config classmethod ¶
from_config(config: dict[str, Any]) -> DeepSpeedFPConfig
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
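A sketch of the kind of parsing `from_config` performs: pull the relevant fields out of a checkpoint's quantization-config dict and validate them. The key names used here are assumptions for illustration, not necessarily vLLM's exact ones.

```python
from typing import Any

def from_config_sketch(config: dict[str, Any]) -> dict[str, int]:
    """Hypothetical parser mirroring from_config: extract weight_bits and
    group_size from a config dict, falling back to the documented defaults."""
    weight_bits = int(config.get("bits", 8))
    group_size = int(config.get("group_size", 512))
    if weight_bits not in (6, 8):
        raise ValueError(f"Unsupported weight_bits: {weight_bits}")
    return {"weight_bits": weight_bits, "group_size": group_size}

cfg = from_config_sketch({"bits": 6, "group_size": 128})
# cfg == {"weight_bits": 6, "group_size": 128}
```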
get_config_filenames staticmethod ¶
get_linear_method ¶
get_linear_method() -> DeepSpeedFPLinearMethod
get_name classmethod ¶
get_name() -> QuantizationMethods
get_quant_method ¶
get_quant_method(
layer: Module, prefix: str
) -> Optional[DeepSpeedFPLinearMethod]
DeepSpeedFPLinearMethod ¶
Bases: LinearMethodBase
Linear method for DeepSpeedFP quantizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
quant_config | DeepSpeedFPConfig | the DeepSpeedFP quantization config. | required |
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
__init__ ¶
__init__(quant_config: DeepSpeedFPConfig)
apply ¶
create_weights ¶
create_weights(
layer: Module,
input_size_per_partition: int,
output_partition_sizes: list[int],
input_size: int,
output_size: int,
params_dtype: dtype,
weight_loader=None,
**extra_weight_attrs,
)
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
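The shape bookkeeping in `create_weights` can be sketched as follows: the layer's output shards (`output_partition_sizes`) are concatenated into one output dimension, and storage is allocated in quantized form. This sketch uses a plain int8 NumPy array as a stand-in; the real method registers a `DeepSpeedFPParameter` on the torch module.

```python
import numpy as np

def create_weights_sketch(input_size_per_partition: int,
                          output_partition_sizes: list[int]) -> np.ndarray:
    """Illustrative allocation only: one int8 byte per weight element stands
    in for the quantized payload the real parameter would hold."""
    output_size_per_partition = sum(output_partition_sizes)
    return np.zeros((output_size_per_partition, input_size_per_partition),
                    dtype=np.int8)

# e.g. a fused QKV-style layer with two 1024-row output shards
w = create_weights_sketch(4096, [1024, 1024])
# w.shape == (2048, 4096)
```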
DeepSpeedFPParameter ¶
Bases: Parameter
DeepSpeedFP quantized parameter class that implements fp8/fp6 quantization using DeepSpeed. Weights are stored in quantized form on the GPU and can be dequantized on the fly when needed by the model.
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
__new__ ¶
__new__(
orig_shape: Size,
params_dtype: dtype,
quant_config: DeepSpeedFPConfig,
)
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
ds_dequantize ¶
ds_dequantize(fp_out=None) -> Tensor
Return a tensor containing the dequantized weights of this parameter.
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
ds_quantize_ ¶
ds_quantize_(tensor: Tensor)
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
ds_selective_dequantize ¶
ds_selective_dequantize(indices, fp_out=None) -> Tensor
Return a tensor in which only the weights at the given indices are dequantized (to save HBM -> SRAM bandwidth).
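The idea behind selective dequantization can be sketched in a few lines: only the rows named by `indices` are scaled back to float, so the remaining quantized rows are never materialized. This is an illustration of the access pattern, not vLLM's implementation.

```python
import numpy as np

def selective_dequantize(q: np.ndarray, scales: np.ndarray,
                         indices: np.ndarray) -> np.ndarray:
    """Dequantize only the selected rows of a row-quantized int8 matrix."""
    return q[indices].astype(np.float32) * scales[indices]

q = np.array([[10, -20], [30, 40], [-50, 60]], dtype=np.int8)
scales = np.array([[0.1], [0.2], [0.5]], dtype=np.float32)
rows = selective_dequantize(q, scales, np.array([0, 2]))
# → [[1.0, -2.0], [-25.0, 30.0]]  (rows 0 and 2 only)
```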