vllm.model_executor.layers.quantization.utils.flashinfer_utils
FlashinferMoeBackend ¶
Bases: Enum
apply_flashinfer_per_tensor_scale_fp8 ¶
```python
apply_flashinfer_per_tensor_scale_fp8(
    layer: Module,
    hidden_states: Tensor,
    router_logits: Tensor,
    routing_bias: Optional[Tensor],
    top_k: int,
    num_expert_group: Optional[int],
    topk_group: Optional[int],
    global_num_experts: int,
    apply_router_weight_on_input: bool,
) -> Tensor
```
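A minimal call-site sketch, assuming `moe_layer` is a FusedMoE module whose FP8 weights and per-tensor scales have already been loaded; the tensor shapes and routing parameters below are illustrative only, not prescribed by this API:

```python
import torch

from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    apply_flashinfer_per_tensor_scale_fp8,
)

# Hypothetical inputs: 16 tokens, hidden size 4096, 128 experts, top-k of 8.
hidden_states = torch.randn(16, 4096, dtype=torch.bfloat16, device="cuda")
router_logits = torch.randn(16, 128, dtype=torch.float32, device="cuda")

out = apply_flashinfer_per_tensor_scale_fp8(
    layer=moe_layer,  # assumed: a FusedMoE module with registered FP8 scales
    hidden_states=hidden_states,
    router_logits=router_logits,
    routing_bias=None,
    top_k=8,
    num_expert_group=None,
    topk_group=None,
    global_num_experts=128,
    apply_router_weight_on_input=False,
)
```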
build_flashinfer_fp8_cutlass_moe_prepare_finalize ¶
```python
build_flashinfer_fp8_cutlass_moe_prepare_finalize(
    moe: Optional[FusedMoEConfig], layer: Module
) -> FusedMoEPrepareAndFinalize
```
Create a FlashInfer CUTLASS fused-MoE prepare/finalize kernel.
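A hedged sketch of building the prepare/finalize half of a modular fused-MoE kernel; `moe_config` (a FusedMoEConfig) and `moe_layer` are assumed to come from the surrounding quantization-method setup:

```python
from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    build_flashinfer_fp8_cutlass_moe_prepare_finalize,
)

# Assumed to exist: `moe_config`, describing the layer's MoE/parallel setup,
# and `moe_layer`, the FusedMoE module itself.
prepare_finalize = build_flashinfer_fp8_cutlass_moe_prepare_finalize(
    moe=moe_config,
    layer=moe_layer,
)
```

The matching experts half would typically come from `select_cutlass_fp8_gemm_impl`, sketched at the end of this page.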
calculate_tile_tokens_dim ¶
flashinfer_cutlass_moe_fp8 ¶
```python
flashinfer_cutlass_moe_fp8(
    hidden_states: Tensor,
    layer: Module,
    topk_weights: Tensor,
    topk_ids: Tensor,
    inplace: bool = False,
    activation: str = "silu",
    global_num_experts: int = -1,
    expert_map: Optional[Tensor] = None,
    apply_router_weight_on_input: bool = False,
) -> Tensor
```
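A hedged usage sketch, assuming routing has already been performed and `moe_layer` holds FP8 weights and scales; shapes and expert counts are illustrative:

```python
import torch

from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    flashinfer_cutlass_moe_fp8,
)

# Illustrative routing results: 16 tokens, 128 experts, top-k of 8.
hidden_states = torch.randn(16, 4096, dtype=torch.bfloat16, device="cuda")
topk_weights = torch.rand(16, 8, dtype=torch.float32, device="cuda")
topk_ids = torch.randint(0, 128, (16, 8), dtype=torch.int32, device="cuda")

out = flashinfer_cutlass_moe_fp8(
    hidden_states,
    moe_layer,  # assumed: a FusedMoE module with FP8 weights and scales
    topk_weights,
    topk_ids,
    inplace=False,
    activation="silu",
    global_num_experts=128,
    expert_map=None,
    apply_router_weight_on_input=False,
)
```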
get_flashinfer_moe_backend ¶
```python
get_flashinfer_moe_backend() -> FlashinferMoeBackend
```
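A sketch of branching on the selected backend; the `CUTLASS` member name is an assumption inferred from the CUTLASS-specific helpers on this page, not a documented constant:

```python
from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    FlashinferMoeBackend,
    get_flashinfer_moe_backend,
)

backend = get_flashinfer_moe_backend()

if backend == FlashinferMoeBackend.CUTLASS:  # member name assumed, see lead-in
    # Take the CUTLASS FP8 path (prepare/finalize builder plus GEMM experts impl).
    ...
else:
    # Fall back to the other FlashInfer MoE backend.
    ...
```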
get_moe_scaling_factors ¶
```python
get_moe_scaling_factors(
    input_scale: Tensor,
    gemm1_weights_scale: Tensor,
    activation_scale: Tensor,
    gemm2_weights_scale: Tensor,
) -> tuple[Tensor, Tensor, Tensor]
```
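A minimal sketch with scalar per-tensor scales; the names given to the three returned tensors are illustrative, since their exact semantics are not documented here:

```python
import torch

from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    get_moe_scaling_factors,
)

# Per-tensor FP8 scales (scalar values chosen purely for illustration).
input_scale = torch.tensor(0.02, device="cuda")
gemm1_weights_scale = torch.tensor(0.01, device="cuda")
activation_scale = torch.tensor(0.03, device="cuda")
gemm2_weights_scale = torch.tensor(0.015, device="cuda")

scale_a, scale_b, scale_c = get_moe_scaling_factors(
    input_scale,
    gemm1_weights_scale,
    activation_scale,
    gemm2_weights_scale,
)
```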
register_moe_scaling_factors ¶
```python
register_moe_scaling_factors(layer: Module) -> None
```
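A one-line sketch, assuming `moe_layer` already carries the per-tensor input, weight, and activation scales created during weight loading:

```python
from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    register_moe_scaling_factors,
)

# `moe_layer` is assumed to be a FusedMoE module whose FP8 scales are already
# loaded; this presumably attaches the derived combined scaling factors to the
# module so later FlashInfer MoE calls can read them off the layer.
register_moe_scaling_factors(moe_layer)
```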
rotate_flashinfer_fp8_moe_weights ¶
select_cutlass_fp8_gemm_impl ¶
```python
select_cutlass_fp8_gemm_impl(
    moe: Optional[FusedMoEConfig],
    layer: Module,
    out_dtype: Optional[dtype] = None,
) -> FusedMoEPermuteExpertsUnpermute
```
Return a GEMM experts implementation for fused-MoE layers.
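A hedged sketch pairing this with the prepare/finalize builder above; judging by the return types, the two objects form the expert-GEMM and prepare/finalize halves of a modular fused-MoE kernel (`moe_config` and `moe_layer` are assumed as in the earlier sketches):

```python
import torch

from vllm.model_executor.layers.quantization.utils.flashinfer_utils import (
    build_flashinfer_fp8_cutlass_moe_prepare_finalize,
    select_cutlass_fp8_gemm_impl,
)

# Assumed inputs: `moe_config` (FusedMoEConfig) and `moe_layer` (FusedMoE module).
prepare_finalize = build_flashinfer_fp8_cutlass_moe_prepare_finalize(moe_config, moe_layer)
experts = select_cutlass_fp8_gemm_impl(moe_config, moe_layer, out_dtype=torch.bfloat16)
```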