vllm bench serve¶
JSON CLI Arguments¶
When passing JSON CLI arguments, the following sets of arguments are equivalent:
--json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'
--json-arg.key1 value1 --json-arg.key2.key3 value2
Additionally, list elements can be passed individually using +:
--json-arg '{"key4": ["value3", "value4", "value5"]}'
--json-arg.key4+ value3 --json-arg.key4+='value4,value5'
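For instance, using the same placeholder --json-arg as above (not a real flag of this command), the dotted and + forms can be mixed, and the following invocations are equivalent:
--json-arg '{"key1": "value1", "key4": ["value3", "value4"]}'
--json-arg.key1 value1 --json-arg.key4+ value3 --json-arg.key4+ value4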
Options¶
--seed
¶
Default: 0
--num-prompts
¶
Number of prompts to process.
Default: 1000
--dataset-name
¶
Possible choices: sharegpt, burstgpt, sonnet, random, random-mm, hf, custom, prefix_repetition
Name of the dataset to benchmark on.
Default: random
--no-stream
¶
Do not load the dataset in streaming mode.
Default: False
--dataset-path
¶
Path to the sharegpt/sonnet dataset, or the Hugging Face dataset ID if using an HF dataset.
Default: None
--endpoint-type
¶
Possible choices: vllm, openai, openai-chat, openai-audio, openai-embeddings
Default: openai
--label
¶
The label (prefix) of the benchmark results. If not specified, the endpoint type will be used as the label.
Default: None
--backend
¶
Possible choices: vllm, openai, openai-chat, openai-audio, openai-embeddings
Default: vllm
--base-url
¶
Server or API base URL, if not using HTTP host and port.
Default: None
--host
¶
Default: 127.0.0.1
--port
¶
Default: 8000
--endpoint
¶
API endpoint.
Default: /v1/completions
--max-concurrency
¶
Maximum number of concurrent requests. This can be used to help simulate an environment where a higher level component is enforcing a maximum number of concurrent requests. While the --request-rate argument controls the rate at which requests are initiated, this argument will control how many are actually allowed to execute at a time. This means that when used in combination, the actual request rate may be lower than specified with --request-rate, if the server is not processing requests fast enough to keep up.
Default: None
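For illustration, the invocation below (the model name is a placeholder) issues requests at roughly 16 per second while capping in-flight requests at 64, so the effective rate drops below 16 RPS whenever the server cannot keep up:
vllm bench serve --model <your-model> --request-rate 16 --max-concurrency 64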
--model
¶
Name of the model.
Default: None
--tokenizer
¶
Name or path of the tokenizer, if not using the default tokenizer.
Default: None
--use-beam-search
¶
Default: False
--logprobs
¶
Number of logprobs per token to compute and return as part of the request. If unspecified, then either (1) if beam search is disabled, no logprobs are computed and a single dummy logprob is returned for each token; or (2) if beam search is enabled, 1 logprob per token is computed.
Default: None
--request-rate
¶
Number of requests per second. If this is inf, then all the requests are sent at time 0. Otherwise, we use a Poisson process or gamma distribution to synthesize the request arrival times.
Default: inf
--burstiness
¶
Burstiness factor of the request generation. Only takes effect when request_rate is not inf. The default value is 1, which follows a Poisson process. Otherwise, the request intervals follow a gamma distribution. A lower burstiness value (0 < burstiness < 1) results in more bursty requests, while a higher burstiness value (burstiness > 1) results in a more uniform arrival of requests.
Default: 1.0
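As a sketch (the model name is a placeholder), the command below targets about 8 requests per second with burstier-than-Poisson arrivals:
vllm bench serve --model <your-model> --request-rate 8 --burstiness 0.5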
--trust-remote-code
¶
Trust remote code from Hugging Face.
Default: False
--disable-tqdm
¶
Specify to disable the tqdm progress bar.
Default: False
--profile
¶
Use the Torch Profiler. The endpoint must be launched with VLLM_TORCH_PROFILER_DIR to enable the profiler.
Default: False
--save-result
¶
Specify to save benchmark results to a JSON file.
Default: False
--save-detailed
¶
When saving the results, whether to include per-request information such as response, error, ttfts, tpots, etc.
Default: False
--append-result
¶
Append the benchmark result to the existing JSON file.
Default: False
--metadata
¶
Key-value pairs (e.g., --metadata version=0.3.3 tp=1) for metadata of this run to be saved in the result JSON file for record keeping purposes.
Default: None
--result-dir
¶
Specify the directory to save benchmark JSON results. If not specified, results are saved in the current directory.
Default: None
--result-filename
¶
Specify the filename to save benchmark JSON results. If not specified, results will be saved in {label}-{args.request_rate}qps-{base_model_id}-{current_dt}.json format.
Default: None
--ignore-eos
¶
Set the ignore_eos flag when sending the benchmark request. Warning: ignore_eos is not supported in deepspeed_mii and tgi.
Default: False
--percentile-metrics
¶
Comma-separated list of metrics for which to report percentiles. Allowed metric names are "ttft", "tpot", "itl", "e2el".
Default: ttft,tpot,itl
--metric-percentiles
¶
Comma-separated list of percentiles for the selected metrics. To report the 25th, 50th, and 75th percentiles, use "25,50,75". The default value is "99". Use "--percentile-metrics" to select metrics.
Default: 99
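For example, to report the 50th, 90th, and 99th percentiles of TTFT and end-to-end latency only (the model name is a placeholder):
vllm bench serve --model <your-model> --percentile-metrics ttft,e2el --metric-percentiles 50,90,99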
--goodput
¶
Specify service level objectives for goodput as "KEY:VALUE" pairs, where the key is a metric name and the value is in milliseconds. Multiple "KEY:VALUE" pairs can be provided, separated by spaces. Allowed request-level metric names are "ttft", "tpot", "e2el". For more context on the definition of goodput, refer to the DistServe paper: https://arxiv.org/pdf/2401.09670 and the blog: https://hao-ai-lab.github.io/blogs/distserve
Default: None
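For example, to count a request toward goodput only if its TTFT is under 200 ms and its end-to-end latency is under 5000 ms (the model name is a placeholder):
vllm bench serve --model <your-model> --goodput ttft:200 e2el:5000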
--request-id-prefix
¶
Specify the prefix of the request ID.
Default: benchmark-serving
--tokenizer-mode
¶
Possible choices: auto, slow, mistral, custom
The tokenizer mode.
- "auto" will use the fast tokenizer if available.
- "slow" will always use the slow tokenizer.
- "mistral" will always use the mistral_common tokenizer.
- "custom" will use --tokenizer to select the preregistered tokenizer.
Default: auto
--served-model-name
¶
The model name used in the API. If not specified, the model name will be the same as the --model argument.
Default: None
--lora-modules
¶
A subset of LoRA module names passed in when launching the server. For each request, the script chooses a LoRA module at random.
Default: None
--ramp-up-strategy
¶
Possible choices: linear, exponential
The ramp-up strategy, used to ramp the request rate from an initial RPS to a final RPS (specified by --ramp-up-start-rps and --ramp-up-end-rps) over the duration of the benchmark.
Default: None
--ramp-up-start-rps
¶
The starting request rate for ramp-up (RPS). Needs to be specified when --ramp-up-strategy is used.
Default: None
--ramp-up-end-rps
¶
The ending request rate for ramp-up (RPS). Needs to be specified when --ramp-up-strategy is used.
Default: None
--ready-check-timeout-sec
¶
Maximum time to wait for the endpoint to become ready in seconds (default: 600 seconds / 10 minutes).
Default: 600
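For example, to ramp the request rate linearly from 1 RPS to 20 RPS over the benchmark (the model name is a placeholder):
vllm bench serve --model <your-model> --ramp-up-strategy linear --ramp-up-start-rps 1 --ramp-up-end-rps 20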
custom dataset options¶
--custom-output-len
¶
Number of output tokens per request, used only for custom dataset.
Default: 256
--custom-skip-chat-template
¶
Skip applying chat template to prompt, used only for custom dataset.
Default: False
sonnet dataset options¶
--sonnet-input-len
¶
Number of input tokens per request, used only for sonnet dataset.
Default: 550
--sonnet-output-len
¶
Number of output tokens per request, used only for sonnet dataset.
Default: 150
--sonnet-prefix-len
¶
Number of prefix tokens per request, used only for sonnet dataset.
Default: 200
sharegpt dataset options¶
--sharegpt-output-len
¶
Output length for each request. Overrides the output length from the ShareGPT dataset.
Default: None
random dataset options¶
--random-input-len
¶
Number of input tokens per request, used only for random sampling.
Default: 1024
--random-output-len
¶
Number of output tokens per request, used only for random sampling.
Default: 128
--random-range-ratio
¶
Range ratio for sampling input/output length, used only for random sampling. Must be in the range [0, 1) to define a symmetric sampling range [length * (1 - range_ratio), length * (1 + range_ratio)].
Default: 0.0
--random-prefix-len
¶
Number of fixed prefix tokens before the random context in a request. The total input length is the sum of random-prefix-len and a random context length sampled from [input_len * (1 - range_ratio), input_len * (1 + range_ratio)].
Default: 0
--random-batch-size
¶
Batch size for random sampling. Only used for embeddings benchmark.
Default: 1
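Putting the random dataset options together, the sketch below (the model name is a placeholder) samples 500 prompts whose random context lengths fall in [512, 1536] tokens (1024 ± 50%), each preceded by a 100-token fixed prefix:
vllm bench serve --model <your-model> --dataset-name random --num-prompts 500 --random-input-len 1024 --random-output-len 128 --random-range-ratio 0.5 --random-prefix-len 100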
random multimodal dataset options extended from random dataset¶
--random-mm-base-items-per-request
¶
Base number of multimodal items per request for random-mm. Actual per-request count is sampled around this base using --random-mm-num-mm-items-range-ratio.
Default: 1
--random-mm-num-mm-items-range-ratio
¶
Range ratio r in [0, 1] for sampling items per request. We sample uniformly from the closed integer range [floor(n(1-r)), ceil(n(1+r))] where n is the base items per request. r=0 keeps it fixed; r=1 allows 0 items. The maximum is clamped to the sum of per-modality limits from --random-mm-limit-mm-per-prompt. An error is raised if the computed min exceeds the max.
Default: 0.0
--random-mm-limit-mm-per-prompt
¶
Per-modality hard caps for items attached per request, e.g. '{"image": 3, "video": 0}'. The sampled per-request item count is clamped to the sum of these limits. When a modality reaches its cap, its buckets are excluded and probabilities are renormalized. Note: only image sampling is supported for now.
Default: {'image': 255, 'video': 0}
--random-mm-bucket-config
¶
The bucket config is a dictionary mapping a multimodal item sampling configuration to a probability. It currently allows for 2 modalities: images and videos. A bucket key is a tuple of (height, width, num_frames); the value is the probability of sampling that specific item. Example: --random-mm-bucket-config {(256, 256, 1): 0.5, (720, 1280, 1): 0.4, (720, 1280, 16): 0.10} samples images with resolution 256x256 with probability 0.5, images with resolution 720x1280 with probability 0.4, and videos with resolution 720x1280 and 16 frames with probability 0.1. Note: if the probabilities do not sum to 1, they are normalized. Note also: only image sampling is supported for now.
Default: {(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}
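A sketch of the random-mm dataset flags only (the model name is a placeholder; the bucket config is quoted here so the shell passes it through verbatim; your server may need additional multimodal-related settings):
vllm bench serve --model <your-model> --dataset-name random-mm --random-mm-base-items-per-request 2 --random-mm-bucket-config '{(256, 256, 1): 0.5, (720, 1280, 1): 0.5}'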
hf dataset options¶
--hf-subset
¶
Subset of the HF dataset.
Default: None
--hf-split
¶
Split of the HF dataset.
Default: None
--hf-output-len
¶
Output length for each request. Overrides the output lengths from the sampled HF dataset.
Default: None
prefix repetition dataset options¶
--prefix-repetition-prefix-len
¶
Number of prefix tokens per request, used only for prefix repetition dataset.
Default: 256
--prefix-repetition-suffix-len
¶
Number of suffix tokens per request, used only for prefix repetition dataset. Total input length is prefix_len + suffix_len.
Default: 256
--prefix-repetition-num-prefixes
¶
Number of prefixes to generate, used only for prefix repetition dataset. Prompts per prefix is num_requests // num_prefixes.
Default: 10
--prefix-repetition-output-len
¶
Number of output tokens per request, used only for prefix repetition dataset.
Default: 128
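For example, the sketch below (the model name is a placeholder) generates 10 distinct 256-token prefixes; with 1000 prompts, that is about 100 prompts per prefix:
vllm bench serve --model <your-model> --dataset-name prefix_repetition --num-prompts 1000 --prefix-repetition-prefix-len 256 --prefix-repetition-suffix-len 256 --prefix-repetition-num-prefixes 10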
sampling parameters¶
--top-p
¶
Top-p sampling parameter. Only has an effect on OpenAI-compatible backends.
Default: None
--top-k
¶
Top-k sampling parameter. Only has an effect on OpenAI-compatible backends.
Default: None
--min-p
¶
Min-p sampling parameter. Only has an effect on OpenAI-compatible backends.
Default: None
--temperature
¶
Temperature sampling parameter. Only has an effect on OpenAI-compatible backends. If not specified, defaults to greedy decoding (i.e. temperature==0.0).
Default: None
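For example, to benchmark with nucleus sampling instead of the default greedy decoding (the model name is a placeholder):
vllm bench serve --model <your-model> --temperature 0.8 --top-p 0.95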