vllm.model_executor.layers.fused_moe.topk_weight_and_reduce
TopKWeightAndReduceContiguous ¶
Bases: TopKWeightAndReduce
TopKWeightAndReduce implementation for a fused_experts output of shape (m, topk, K).
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
__eq__ ¶
apply ¶
apply(
    output: Optional[Tensor],
    fused_expert_output: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    apply_router_weight_on_input: bool,
) -> Tensor
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
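The contiguous case reduces to a broadcast multiply and a sum over the topk dimension. A minimal sketch of that step, assuming a `(m, topk, K)` layout where row `(i, k)` is the output of token `i`'s k-th expert (function name and simplified signature are illustrative, not the actual vLLM implementation):

```python
import torch

def weight_and_reduce_contiguous(
    fused_expert_output: torch.Tensor,  # (m, topk, K)
    topk_weights: torch.Tensor,         # (m, topk)
    apply_router_weight_on_input: bool,
) -> torch.Tensor:
    # If the router weights were already applied to the expert inputs,
    # skip applying them again here.
    if not apply_router_weight_on_input:
        fused_expert_output = fused_expert_output * topk_weights.unsqueeze(-1)
    # Reduce over the topk dimension: one (K,) row per token -> (m, K).
    return fused_expert_output.sum(dim=1)

m, topk, K = 4, 2, 8
out = weight_and_reduce_contiguous(
    torch.ones(m, topk, K), torch.full((m, topk), 0.5), False
)
print(out.shape)  # torch.Size([4, 8])
```

With weights of 0.5 and topk=2, each output element is 0.5 + 0.5 = 1.0, illustrating that the weighted sum collapses the topk dimension.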
TopKWeightAndReduceDelegate ¶
Bases: TopKWeightAndReduce
Useful when a FusedMoEPermuteExpertsUnpermute implementation does not itself perform weight application and reduction, but cannot satisfy every compatible PrepareAndFinalize implementation. For example, BatchedTritonExperts is compatible with both PplxPrepareAndFinalize and BatchedPrepareAndFinalize. PplxPrepareAndFinalize performs weight application and reduction as part of the pplx combine kernel, but BatchedPrepareAndFinalize needs an explicit implementation. To handle this, BatchedTritonExperts can return TopKWeightAndReduceDelegate so that each PrepareAndFinalize implementation can choose how to weight and reduce.
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
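The delegation described above can be pictured as the experts implementation handing back a marker object instead of a reduced tensor, and each finalizer deciding what to do with it. A hypothetical sketch (class names and methods are illustrative only, not vLLM's actual interfaces):

```python
import torch

class WeightAndReduceDelegate:
    """Marker: the experts impl did not weight/reduce; the finalizer must decide."""

class PplxStyleFinalize:
    def finalize(self, out: torch.Tensor, strategy) -> torch.Tensor:
        # A pplx-style combine kernel already fuses weight application and
        # reduction, so a Delegate marker requires no extra work here.
        return out

class BatchedStyleFinalize:
    def finalize(self, out: torch.Tensor, strategy) -> torch.Tensor:
        if isinstance(strategy, WeightAndReduceDelegate):
            # This finalizer must supply its own reduction over the topk dim.
            return out.sum(dim=1)
        return out

out = torch.ones(2, 3, 4)
print(BatchedStyleFinalize().finalize(out, WeightAndReduceDelegate()).shape)
# torch.Size([2, 4])
```

The design keeps the experts implementation agnostic: the same BatchedTritonExperts output works with both finalizers, and only the finalizer that actually needs a reduction performs one.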
TopKWeightAndReduceNaiveBatched ¶
Bases: TopKWeightAndReduce
TopKWeightAndReduce implementation for a fused_experts output of shape (num_experts, batch_size, K).
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
__eq__ ¶
apply ¶
apply(
    output: Optional[Tensor],
    fused_expert_output: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    apply_router_weight_on_input: bool,
) -> Tensor
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
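In the batched layout the expert dimension comes first, so the reduction must gather each token's contributions from the expert batches selected by `topk_ids`. A naive sketch, under the simplifying assumption that token `i`'s activation sits at row `i` of every expert's batch (the real kernel tracks per-expert token placement; the function name is illustrative):

```python
import torch

def naive_batched_weight_and_reduce(
    fused_expert_output: torch.Tensor,  # (num_experts, batch_size, K)
    topk_weights: torch.Tensor,         # (m, topk)
    topk_ids: torch.Tensor,             # (m, topk)
    apply_router_weight_on_input: bool,
) -> torch.Tensor:
    m, topk = topk_ids.shape
    K = fused_expert_output.shape[-1]
    output = torch.zeros(m, K, dtype=fused_expert_output.dtype)
    for i in range(m):
        for k in range(topk):
            e = topk_ids[i, k]
            # Simplifying assumption: token i occupies row i in expert e's batch.
            contrib = fused_expert_output[e, i]
            if not apply_router_weight_on_input:
                contrib = contrib * topk_weights[i, k]
            output[i] += contrib
    return output

res = naive_batched_weight_and_reduce(
    torch.ones(3, 2, 4),                 # 3 experts, batch of 2 tokens, K=4
    torch.full((2, 2), 0.5),             # topk=2 weights per token
    torch.tensor([[0, 1], [1, 2]]),      # each token routed to 2 experts
    False,
)
print(res.shape)  # torch.Size([2, 4])
```

Each token accumulates 0.5 from each of its two experts, so every output element is 1.0; the point is that the batched layout needs per-token gathering by `topk_ids`, whereas the contiguous layout can reduce with a single sum.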
TopKWeightAndReduceNoOP ¶
Bases: TopKWeightAndReduce
The fused_experts outputs have already had the topk weights applied and been reduced, so this implementation is a no-op.
Source code in vllm/model_executor/layers/fused_moe/topk_weight_and_reduce.py
__eq__ ¶
apply ¶
apply(
    output: Optional[Tensor],
    fused_expert_output: Tensor,
    topk_weights: Tensor,
    topk_ids: Tensor,
    apply_router_weight_on_input: bool,
) -> Tensor