vllm.config.parallel
DistributedExecutorBackend module-attribute
¶
DistributedExecutorBackend = Literal[
"ray", "mp", "uni", "external_launcher"
]
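As a minimal sketch, the literal values above can be read back at runtime with `typing.get_args`, for example to validate a user-supplied backend name. The helper `is_known_backend` is hypothetical and not part of vLLM.

```python
from typing import Literal, get_args

# Mirrors the alias documented above.
DistributedExecutorBackend = Literal["ray", "mp", "uni", "external_launcher"]


def is_known_backend(name: str) -> bool:
    """Return True if `name` is one of the literal backend strings.

    Hypothetical helper for illustration only.
    """
    return name in get_args(DistributedExecutorBackend)


known = [is_known_backend(n) for n in ("ray", "mp", "uni", "external_launcher")]
unknown = is_known_backend("threads")
```

Note that `distributed_executor_backend` also accepts an executor class, so a string check like this covers only the named backends.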
EPLBConfig ¶
Configuration for Expert Parallel Load Balancing (EPLB).
Source code in vllm/config/parallel.py
log_balancedness class-attribute
instance-attribute
¶
log_balancedness: bool = False
Log the balancedness at each step of expert parallelism. This is off by default because it adds communication overhead.
num_redundant_experts class-attribute
instance-attribute
¶
num_redundant_experts: int = 0
Number of redundant experts to use for expert parallelism.
ParallelConfig ¶
Configuration for the distributed execution.
Source code in vllm/config/parallel.py
_data_parallel_master_port_list class-attribute
instance-attribute
¶
List of open ports, queried automatically, for data parallel messaging. This attribute is private because it is not intended to be configured by users.
data_parallel_backend class-attribute
instance-attribute
¶
data_parallel_backend: str = 'mp'
Backend to use for data parallel, either "mp" or "ray".
data_parallel_external_lb class-attribute
instance-attribute
¶
data_parallel_external_lb: bool = False
Whether to use "external" DP LB mode. Applies only to online serving and when data_parallel_size > 0. This is useful for a "one-pod-per-rank" wide-EP setup in Kubernetes. Set implicitly when --data-parallel-rank is passed to vllm serve.
data_parallel_hybrid_lb class-attribute
instance-attribute
¶
data_parallel_hybrid_lb: bool = False
Whether to use "hybrid" DP LB mode. Applies only to online serving and when data_parallel_size > 0. Enables running an AsyncLLM and API server on a "per-node" basis where vLLM load balances between local data parallel ranks, but an external LB balances between vLLM nodes/replicas. Set explicitly in conjunction with --data-parallel-start-rank.
data_parallel_master_ip class-attribute
instance-attribute
¶
data_parallel_master_ip: str = '127.0.0.1'
IP of the data parallel master.
data_parallel_master_port class-attribute
instance-attribute
¶
data_parallel_master_port: int = 29500
Port of the data parallel master.
data_parallel_rank class-attribute
instance-attribute
¶
data_parallel_rank: int = 0
Rank of the data parallel group.
data_parallel_rank_local class-attribute
instance-attribute
¶
Local rank of the data parallel group, set only in SPMD mode.
data_parallel_rpc_port class-attribute
instance-attribute
¶
data_parallel_rpc_port: int = 29550
Port for data parallel messaging.
data_parallel_size class-attribute
instance-attribute
¶
data_parallel_size: int = 1
Number of data parallel groups. MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
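The sharding rule above is simple arithmetic. The sketch below is illustrative only; `moe_shard_count` is a hypothetical name, not a vLLM function.

```python
def moe_shard_count(tensor_parallel_size: int, data_parallel_size: int) -> int:
    """Number of ways MoE layers are sharded: the product TP * DP.

    Hypothetical helper that restates the rule in the description above.
    """
    if tensor_parallel_size < 1 or data_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return tensor_parallel_size * data_parallel_size


# With tensor_parallel_size=4 and data_parallel_size=2,
# MoE layers are sharded 8 ways.
shards = moe_shard_count(tensor_parallel_size=4, data_parallel_size=2)
```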
data_parallel_size_local class-attribute
instance-attribute
¶
data_parallel_size_local: int = 1
Number of local data parallel groups.
disable_custom_all_reduce class-attribute
instance-attribute
¶
disable_custom_all_reduce: bool = False
Disable the custom all-reduce kernel and fall back to NCCL.
distributed_executor_backend class-attribute
instance-attribute
¶
distributed_executor_backend: Optional[
Union[
str, DistributedExecutorBackend, type[ExecutorBase]
]
] = None
Backend to use for distributed model workers, either "ray" or "mp" (multiprocessing). If the product of pipeline_parallel_size and tensor_parallel_size is less than or equal to the number of GPUs available, "mp" will be used to keep processing on a single host. Otherwise, this defaults to "ray" if Ray is installed and fails if it is not. Note that TPU supports only Ray for distributed inference.
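The default-selection rule described above can be sketched as a small function. This is a simplification under stated assumptions: `pick_default_backend` and its parameters are hypothetical, and the real logic in vllm/config/parallel.py handles further cases (such as "uni", "external_launcher", and TPU) that are not modelled here.

```python
def pick_default_backend(
    pipeline_parallel_size: int,
    tensor_parallel_size: int,
    gpus_available: int,
    ray_installed: bool,
) -> str:
    """Return "mp" or "ray" following the documented default rule.

    Hypothetical helper; not the vLLM implementation.
    """
    # If all workers fit on the GPUs of this host, stay on a single host.
    if pipeline_parallel_size * tensor_parallel_size <= gpus_available:
        return "mp"
    # Otherwise multiple hosts are needed, which requires Ray.
    if ray_installed:
        return "ray"
    raise RuntimeError("more workers than local GPUs, and Ray is not installed")


single_host = pick_default_backend(1, 4, gpus_available=8, ray_installed=False)
multi_host = pick_default_backend(2, 8, gpus_available=8, ray_installed=True)
```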
enable_eplb class-attribute
instance-attribute
¶
enable_eplb: bool = False
Enable expert parallelism load balancing for MoE layers.
enable_expert_parallel class-attribute
instance-attribute
¶
enable_expert_parallel: bool = False
Use expert parallelism instead of tensor parallelism for MoE layers.
eplb_config class-attribute
instance-attribute
¶
eplb_config: EPLBConfig = field(default_factory=EPLBConfig)
Expert parallelism configuration.
eplb_log_balancedness class-attribute
instance-attribute
¶
eplb_log_balancedness is deprecated and has been replaced with eplb_config.log_balancedness. This will be removed in v0.12.0. Please use eplb_config.log_balancedness instead.
eplb_step_interval class-attribute
instance-attribute
¶
eplb_step_interval is deprecated and has been replaced with eplb_config.step_interval. This will be removed in v0.12.0. Please use eplb_config.step_interval instead.
eplb_window_size class-attribute
instance-attribute
¶
eplb_window_size is deprecated and has been replaced with eplb_config.window_size. This will be removed in v0.12.0. Please use eplb_config.window_size instead.
max_parallel_loading_workers class-attribute
instance-attribute
¶
Maximum number of parallel loading workers when loading the model sequentially in multiple batches. Use this to avoid running out of RAM when using tensor parallelism with large models.
num_redundant_experts class-attribute
instance-attribute
¶
num_redundant_experts is deprecated and has been replaced with eplb_config.num_redundant_experts. This will be removed in v0.12.0. Please use eplb_config.num_redundant_experts instead.
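The four deprecated fields on this class (eplb_log_balancedness, eplb_step_interval, eplb_window_size, and num_redundant_experts) each map to a field on eplb_config. The sketch below shows that mapping on plain dictionaries of keyword arguments. `migrate_eplb_kwargs` is a hypothetical helper, not something vLLM provides, and the choice to let an explicit eplb_config value win over a deprecated one is an assumption made for this example.

```python
# Deprecated ParallelConfig field -> field name on eplb_config,
# taken from the deprecation notices in this section.
DEPRECATED_TO_EPLB_FIELD = {
    "eplb_log_balancedness": "log_balancedness",
    "eplb_step_interval": "step_interval",
    "eplb_window_size": "window_size",
    "num_redundant_experts": "num_redundant_experts",
}


def migrate_eplb_kwargs(kwargs: dict) -> dict:
    """Move deprecated top-level EPLB keys into a nested "eplb_config" dict.

    Hypothetical helper for illustration; the input is not modified.
    """
    migrated = {k: v for k, v in kwargs.items() if k not in DEPRECATED_TO_EPLB_FIELD}
    eplb = dict(migrated.get("eplb_config", {}))
    for old_name, new_name in DEPRECATED_TO_EPLB_FIELD.items():
        if old_name in kwargs:
            # Assumption: a value already set under eplb_config takes precedence.
            eplb.setdefault(new_name, kwargs[old_name])
    if eplb:
        migrated["eplb_config"] = eplb
    return migrated


old_style = {"enable_eplb": True, "eplb_window_size": 1000, "num_redundant_experts": 2}
new_style = migrate_eplb_kwargs(old_style)
```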
pipeline_parallel_size class-attribute
instance-attribute
¶
pipeline_parallel_size: int = 1
Number of pipeline parallel groups.
placement_group class-attribute
instance-attribute
¶
placement_group: Optional[PlacementGroup] = None
Ray placement group for distributed model workers.
ray_runtime_env class-attribute
instance-attribute
¶
ray_runtime_env: Optional[RuntimeEnv] = None
Ray runtime environment to pass to distributed workers.
ray_workers_use_nsight class-attribute
instance-attribute
¶
ray_workers_use_nsight: bool = False
Whether to profile Ray workers with nsight, see https://docs.ray.io/en/latest/ray-observability/user-guides/profiling.html#profiling-nsight-profiler.
sd_worker_cls class-attribute
instance-attribute
¶
sd_worker_cls: str = 'auto'
The full name of the worker class to use for speculative decoding. If "auto", the worker class will be determined based on the platform.
tensor_parallel_size class-attribute
instance-attribute
¶
tensor_parallel_size: int = 1
Number of tensor parallel groups.
worker_cls class-attribute
instance-attribute
¶
worker_cls: str = 'auto'
The full name of the worker class to use. If "auto", the worker class will be determined based on the platform.
worker_extension_cls class-attribute
instance-attribute
¶
worker_extension_cls: str = ''
The full name of the worker extension class to use. The worker extension class is dynamically inherited by the worker class. This is used to inject new attributes and methods to the worker class for use in collective_rpc calls.
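To illustrate what "dynamically inherited" can mean, the sketch below combines a worker class and an extension class at runtime so that the extension's methods become callable on worker instances by name. Every class and function here is a hypothetical stand-in, and building a new subclass with `type()` is just one way to get this effect; vLLM's own mechanism may differ.

```python
class BaseWorker:
    """Hypothetical stand-in for a platform worker class."""

    def __init__(self, rank: int) -> None:
        self.rank = rank


class ReportRankExtension:
    """Hypothetical extension that adds one method to the worker."""

    def report_rank(self) -> str:
        # Uses state owned by the worker class it is mixed into.
        return f"worker rank {self.rank}"


def extend_worker_class(worker_cls: type, extension_cls: type) -> type:
    """Create a class that inherits from both the worker and the extension."""
    return type(worker_cls.__name__, (worker_cls, extension_cls), {})


ExtendedWorker = extend_worker_class(BaseWorker, ReportRankExtension)
worker = ExtendedWorker(rank=3)

# An RPC layer would typically look the method up by name, as done here.
message = getattr(worker, "report_rank")()
```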
world_size class-attribute
instance-attribute
¶
world_size is TP × PP (tensor_parallel_size × pipeline_parallel_size); it affects the number of workers created.
world_size_across_dp property
¶
world_size_across_dp: int
world_size_across_dp is TP × PP × DP; it is the size of the world including data parallelism.
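The two world-size quantities can be worked through with a small stand-in dataclass. `ParallelSizes` is hypothetical and only mirrors the three size fields documented on this page; it is not the real ParallelConfig.

```python
from dataclasses import dataclass


@dataclass
class ParallelSizes:
    """Hypothetical stand-in holding only the three parallel sizes."""

    tensor_parallel_size: int = 1
    pipeline_parallel_size: int = 1
    data_parallel_size: int = 1

    @property
    def world_size(self) -> int:
        # TP x PP
        return self.tensor_parallel_size * self.pipeline_parallel_size

    @property
    def world_size_across_dp(self) -> int:
        # TP x PP x DP
        return self.world_size * self.data_parallel_size


sizes = ParallelSizes(
    tensor_parallel_size=4, pipeline_parallel_size=2, data_parallel_size=2
)
per_replica = sizes.world_size
total = sizes.world_size_across_dp
```

With TP=4, PP=2, and DP=2, world_size is 8 and world_size_across_dp is 16.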
__post_init__ ¶
Source code in vllm/config/parallel.py
_verify_args ¶
_verify_args() -> Self
Source code in vllm/config/parallel.py
compute_hash ¶
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
Source code in vllm/config/parallel.py
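The general idea of hashing a set of configuration values can be sketched as follows. Which fields count as factors, and the use of SHA-256 over the string form of a list, are assumptions made for this example and are not taken from the vLLM source.

```python
import hashlib


def hash_factors(factors: list) -> str:
    """Hash a list of configuration values into a stable hex digest.

    Hypothetical helper; equal factor lists give equal digests.
    """
    return hashlib.sha256(str(factors).encode()).hexdigest()


# Assumed factors for illustration: two parallel sizes and one boolean flag.
hash_a = hash_factors([1, 2, False])
hash_b = hash_factors([1, 2, False])
hash_c = hash_factors([1, 4, False])
```

Identical factor lists produce the same digest, while changing any one factor produces a different one, which is the property a graph-identifying hash needs.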
get_next_dp_init_port ¶
get_next_dp_init_port() -> int
We might need to initialize process groups related to data parallelism in multiple processes, e.g. both in the worker and in the engine, which can live in different processes. To avoid port conflicts, we pop a new port from the prepared port list each time we need to initialize a new process group related to data parallelism.
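The pop-from-a-prepared-list approach can be sketched as below. `DPPortAllocator` is a hypothetical class, and both the order in which ports are handed out and the error raised when the list is exhausted are assumptions of this sketch.

```python
class DPPortAllocator:
    """Hypothetical sketch: hand out each prepared port at most once."""

    def __init__(self, ports: list) -> None:
        # Copy so the caller's list is not mutated.
        self._ports = list(ports)

    def get_next_dp_init_port(self) -> int:
        """Pop and return an unused port for a new process group."""
        if not self._ports:
            raise RuntimeError("no prepared data parallel ports left")
        return self._ports.pop()


allocator = DPPortAllocator([29500, 29501, 29502])
worker_port = allocator.get_next_dp_init_port()
engine_port = allocator.get_next_dp_init_port()
```

Because every call removes the port it returns, two process groups initialized from the same allocator never receive the same port.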