vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe
Utility helpers for the NVFP4 + FlashInfer fused-MoE path.
__all__ module-attribute ¶
__all__ = [
"is_flashinfer_fp4_cutlass_moe_available",
"reorder_w1w3_to_w3w1",
"build_flashinfer_fp4_cutlass_moe_prepare_finalize",
]
build_flashinfer_fp4_cutlass_moe_prepare_finalize ¶
build_flashinfer_fp4_cutlass_moe_prepare_finalize(
moe: FusedMoEConfig, a1_gscale: Tensor
) -> FusedMoEPrepareAndFinalize
Create a FlashInfer CUTLASS fused-MoE prepare-and-finalize kernel.
Source code in vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
is_flashinfer_fp4_cutlass_moe_available ¶
is_flashinfer_fp4_cutlass_moe_available() -> bool
Return True when FlashInfer CUTLASS NV-FP4 kernels can be used.
Source code in vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
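A caller would typically use this check to gate backend selection. The following is a minimal sketch, not code from vLLM: the wrapper name `flashinfer_fp4_moe_usable` and the backend labels are hypothetical, and the guarded import is an assumption added so the snippet also runs where vLLM is not installed.

```python
def flashinfer_fp4_moe_usable() -> bool:
    """Hypothetical wrapper: report False if vLLM cannot be imported at all."""
    try:
        from vllm.model_executor.layers.quantization.utils.flashinfer_fp4_moe import (
            is_flashinfer_fp4_cutlass_moe_available,
        )
    except ImportError:
        # vLLM (or one of its dependencies) is not installed in this environment.
        return False
    return bool(is_flashinfer_fp4_cutlass_moe_available())


# The backend labels below are illustrative only, not vLLM identifiers.
backend = "flashinfer_cutlass" if flashinfer_fp4_moe_usable() else "fallback"
print(f"NV-FP4 fused-MoE backend: {backend}")
```

Keeping the import inside the function means the probe itself never raises an ImportError at module load time, so the caller only has to handle a boolean.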
reorder_w1w3_to_w3w1 ¶
Re-order the concatenated [w1, w3] tensors to [w3, w1].
Source code in vllm/model_executor/layers/quantization/utils/flashinfer_fp4_moe.py
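The reordering can be pictured as swapping the two halves of the fused tensor along the concatenation dimension. Below is a conceptual sketch that uses plain Python lists in place of tensors. The helper `swap_halves` is hypothetical, and the sketch assumes that w1 and w3 have equal size and are stacked along the first dimension. The real function's signature and choice of dimension are not shown on this page.

```python
def swap_halves(rows):
    """Swap the two equal halves of a sequence: [w1, w3] -> [w3, w1]."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (w1 and w3 of equal size)")
    half = len(rows) // 2
    # The second half (w3) moves to the front, the first half (w1) to the back.
    return rows[half:] + rows[:half]


# One expert's fused weight: two rows of w1 followed by two rows of w3.
fused = [["w1_r0"], ["w1_r1"], ["w3_r0"], ["w3_r1"]]
reordered = swap_halves(fused)
print(reordered)
```

Applying the swap twice restores the original order, which gives a cheap sanity check when validating a weight-loading path.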
select_nvfp4_gemm_impl ¶
select_nvfp4_gemm_impl(
moe: FusedMoEConfig,
g1_alphas: Tensor,
g2_alphas: Tensor,
a1_gscale: Tensor,
a2_gscale: Tensor,
allow_flashinfer: bool,
) -> FusedMoEPermuteExpertsUnpermute
Return a GEMM experts implementation for NV-FP4 fused-MoE layers.