vllm.model_executor.layers.fused_moe.flashinfer_cutlass_prepare_finalize
FlashInferCutlassMoEPrepareAndFinalize
Bases: FusedMoEPrepareAndFinalize
Source code in vllm/model_executor/layers/fused_moe/flashinfer_cutlass_prepare_finalize.py
__init__
finalize
finalize(
    output: Tensor,
    fused_expert_output: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    apply_router_weight_on_input: bool,
    weight_and_reduce_impl: TopKWeightAndReduce,
) -> None
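The parameter names above suggest a top-k weight-and-reduce step: per-expert outputs for each token are scaled by their router weights and summed into `output`. The following is a minimal sketch of that idea in plain Python lists, not the vLLM implementation. The function name `weight_and_reduce_sketch` and the `[num_tokens][topk][hidden]` layout are hypothetical, and the real class may receive outputs that the fused kernel has already reduced.

```python
from typing import List


def weight_and_reduce_sketch(
    fused_expert_output: List[List[List[float]]],  # assumed [num_tokens][topk][hidden]
    topk_weights: List[List[float]],               # [num_tokens][topk]
    apply_router_weight_on_input: bool,
) -> List[List[float]]:
    """Hypothetical illustration of a top-k weighted reduction.

    If the router weights were already applied to the inputs, the expert
    outputs are summed unscaled; otherwise each is scaled by its weight.
    """
    output: List[List[float]] = []
    for token_rows, weights in zip(fused_expert_output, topk_weights):
        hidden = len(token_rows[0])
        acc = [0.0] * hidden
        for row, w in zip(token_rows, weights):
            # Skip the scaling when the weight was applied before the experts ran.
            scale = 1.0 if apply_router_weight_on_input else w
            for j in range(hidden):
                acc[j] += scale * row[j]
        output.append(acc)
    return output


# One token routed to two experts, hidden size 2.
expert_out = [[[1.0, 2.0], [3.0, 4.0]]]
weights = [[0.5, 0.5]]
reduced = weight_and_reduce_sketch(expert_out, weights, False)
unweighted = weight_and_reduce_sketch(expert_out, weights, True)
```

In this sketch `apply_router_weight_on_input` only decides whether the weights are applied during the reduction. How the actual `weight_and_reduce_impl` object is invoked is not shown in this page and is left out here.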
max_num_tokens_per_rank
prepare
prepare(
    a1: Tensor,
    a1_scale: Optional[Tensor],
    a2_scale: Optional[Tensor],
    topk_weights: Tensor,
    topk_ids: Tensor,
    num_experts: int,
    expert_map: Optional[Tensor],
    apply_router_weight_on_input: bool,
    quant_config: FusedMoEQuantConfig,
) -> tuple[
    Tensor,
    Optional[Tensor],
    Optional[ExpertTokensMetadata],
    Optional[Tensor],
    Optional[Tensor],
]