vllm bench serve

JSON CLI Arguments

When passing JSON CLI arguments, the following sets of arguments are equivalent:

  • --json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'
  • --json-arg.key1 value1 --json-arg.key2.key3 value2

Additionally, list elements can be passed individually using +:

  • --json-arg '{"key4": ["value3", "value4", "value5"]}'
  • --json-arg.key4+ value3 --json-arg.key4+='value4,value5'
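
For example, using the placeholder --json-arg name from above (substitute whichever JSON-typed argument you are actually setting), the following two commands configure identical values:

    # Whole JSON document passed as a single argument (placeholder flag and values):
    vllm bench serve --json-arg '{"key1": "value1", "key2": {"key3": "value2"}}'

    # Equivalent dotted form, appending list elements with "+":
    vllm bench serve --json-arg.key1 value1 --json-arg.key2.key3 value2 --json-arg.key4+ value3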

Options

--seed

Default: 0

--num-prompts

Number of prompts to process.

Default: 1000

--dataset-name

Possible choices: sharegpt, burstgpt, sonnet, random, random-mm, hf, custom, prefix_repetition

Name of the dataset to benchmark on.

Default: random

--no-stream

Do not load the dataset in streaming mode.

Default: False

--dataset-path

Path to the ShareGPT/sonnet dataset, or the Hugging Face dataset ID if using an HF dataset.

Default: None
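
As a sketch (the file path and dataset ID below are placeholders), --dataset-name and --dataset-path are typically combined like this:

    # Local ShareGPT-style JSON file:
    vllm bench serve --dataset-name sharegpt --dataset-path /path/to/sharegpt.json

    # Hugging Face dataset referenced by its ID, combined with the hf dataset options below:
    vllm bench serve --dataset-name hf --dataset-path org/dataset-id --hf-split train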

--endpoint-type

Possible choices: vllm, openai, openai-chat, openai-audio, openai-embeddings

Default: openai

--label

The label (prefix) of the benchmark results. If not specified, the endpoint type will be used as the label.

Default: None

--backend

Possible choices: vllm, openai, openai-chat, openai-audio, openai-embeddings

Default: vllm

--base-url

Server or API base URL, if not using HTTP host and port.

Default: None

--host

Default: 127.0.0.1

--port

Default: 8000

--endpoint

API endpoint.

Default: /v1/completions
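
Putting the connection options together, a minimal run against a locally served OpenAI-compatible endpoint might look like the sketch below (the model name is a placeholder):

    vllm bench serve \
      --model <served-model-name> \
      --host 127.0.0.1 --port 8000 \
      --endpoint /v1/completions --endpoint-type openai \
      --dataset-name random --num-prompts 1000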

--max-concurrency

Maximum number of concurrent requests. This can be used to help simulate an environment where a higher-level component is enforcing a maximum number of concurrent requests. While the --request-rate argument controls the rate at which requests are initiated, this argument controls how many are actually allowed to execute at a time. This means that when used in combination, the actual request rate may be lower than specified with --request-rate if the server is not processing requests fast enough to keep up.

Default: None
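
For example, the sketch below initiates requests at 20 RPS but never allows more than 8 to be in flight at once, so the observed request rate can fall below 20 if the server cannot keep up (the model name is a placeholder):

    vllm bench serve --model <model> \
      --request-rate 20 \
      --max-concurrency 8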

--model

Name of the model.

Default: None

--tokenizer

Name or path of the tokenizer, if not using the default tokenizer.

Default: None

--logprobs

Number of logprobs-per-token to compute & return as part of the request. If unspecified, then either (1) if beam search is disabled, no logprobs are computed and a single dummy logprob is returned for each token; or (2) if beam search is enabled, 1 logprob per token is computed.

Default: None

--request-rate

Number of requests per second. If this is inf, then all the requests are sent at time 0. Otherwise, we use a Poisson process or gamma distribution to synthesize the request arrival times.

Default: inf

--burstiness

Burstiness factor of the request generation. Only takes effect when request_rate is not inf. The default value is 1, which follows a Poisson process. Otherwise, the request intervals follow a gamma distribution. A lower burstiness value (0 < burstiness < 1) results in more bursty requests. A higher burstiness value (burstiness > 1) results in a more uniform arrival of requests.

Default: 1.0
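
The sketch below contrasts the two regimes; the numbers are illustrative and the model name is a placeholder:

    # burstiness = 1.0 (default): Poisson arrivals averaging 5 requests/s
    vllm bench serve --model <model> --request-rate 5 --burstiness 1.0

    # burstiness = 0.5: same average rate, but arrivals come in bursts
    vllm bench serve --model <model> --request-rate 5 --burstiness 0.5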

--trust-remote-code

Trust remote code from Hugging Face.

Default: False

--disable-tqdm

Specify to disable tqdm progress bar.

Default: False

--profile

Use Torch Profiler. The endpoint must be launched with VLLM_TORCH_PROFILER_DIR set to enable the profiler.

Default: False
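
A sketch of the expected workflow, assuming the server is launched separately with vllm serve; the profiler directory and model name are placeholders:

    # Start the server with the profiler directory set:
    VLLM_TORCH_PROFILER_DIR=/tmp/vllm_profile vllm serve <model>

    # Then run the benchmark with profiling enabled:
    vllm bench serve --model <model> --profile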

--save-result

Specify to save benchmark results to a JSON file.

Default: False

--save-detailed

When saving the results, whether to include per-request information such as response, error, ttfts, tpots, etc.

Default: False

--append-result

Append the benchmark result to the existing JSON file.

Default: False

--metadata

Key-value pairs (e.g., --metadata version=0.3.3 tp=1) for metadata of this run to be saved in the result JSON file for record keeping purposes.

Default: None

--result-dir

Specify the directory to save benchmark JSON results. If not specified, results are saved in the current directory.

Default: None

--result-filename

Specify the filename to save benchmark JSON results. If not specified, results will be saved in {label}-{args.request_rate}qps-{base_model_id}-{current_dt}.json format.

Default: None
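
For example, to save detailed results to a specific file (the directory, filename, and model name below are placeholders):

    vllm bench serve --model <model> \
      --save-result --save-detailed \
      --result-dir ./benchmark_results \
      --result-filename my_run.json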

--ignore-eos

Set the ignore_eos flag when sending the benchmark request. Warning: ignore_eos is not supported in deepspeed_mii and tgi.

Default: False

--percentile-metrics

Comma-separated list of metrics to report percentiles for. Allowed metric names are "ttft", "tpot", "itl", and "e2el".

Default: ttft,tpot,itl

--metric-percentiles

Comma-separated list of percentiles for selected metrics. To report the 25th, 50th, and 75th percentiles, use "25,50,75". The default value is "99". Use "--percentile-metrics" to select metrics.

Default: 99
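
For instance, to report the 25th, 50th, 75th, and 99th percentiles for TTFT, TPOT, ITL, and end-to-end latency (the model name is a placeholder):

    vllm bench serve --model <model> \
      --percentile-metrics ttft,tpot,itl,e2el \
      --metric-percentiles 25,50,75,99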

--goodput

Specify service level objectives for goodput as "KEY:VALUE" pairs, where the key is a metric name, and the value is in milliseconds. Multiple "KEY:VALUE" pairs can be provided, separated by spaces. Allowed request level metric names are "ttft", "tpot", "e2el". For more context on the definition of goodput, refer to DistServe paper: https://arxiv.org/pdf/2401.09670 and the blog: https://hao-ai-lab.github.io/blogs/distserve

Default: None
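
As a sketch, the following counts a request toward goodput only if its TTFT is at most 500 ms and its TPOT at most 50 ms (the thresholds are illustrative, the model name a placeholder):

    vllm bench serve --model <model> \
      --goodput ttft:500 tpot:50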

--request-id-prefix

Specify the prefix of the request ID.

Default: benchmark-serving

--tokenizer-mode

Possible choices: auto, slow, mistral, custom

The tokenizer mode.

  • "auto" will use the fast tokenizer if available.
  • "slow" will always use the slow tokenizer.
  • "mistral" will always use the mistral_common tokenizer. *"custom" will use --tokenizer to select the preregistered tokenizer.

Default: auto

--served-model-name

The model name used in the API. If not specified, the model name will be the same as the --model argument.

Default: None

--lora-modules

A subset of LoRA module names passed in when launching the server. For each request, the script chooses a LoRA module at random.

Default: None

--ramp-up-strategy

Possible choices: linear, exponential

The ramp-up strategy. This is used to ramp up the request rate from an initial RPS to a final RPS (specified by --ramp-up-start-rps and --ramp-up-end-rps) over the duration of the benchmark.

Default: None

--ramp-up-start-rps

The starting request rate for ramp-up (RPS). Needs to be specified when --ramp-up-strategy is used.

Default: None

--ramp-up-end-rps

The ending request rate for ramp-up (RPS). Needs to be specified when --ramp-up-strategy is used.

Default: None
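
For example, to ramp the request rate linearly from 1 RPS to 20 RPS over the course of the run (the model name is a placeholder):

    vllm bench serve --model <model> \
      --ramp-up-strategy linear \
      --ramp-up-start-rps 1 \
      --ramp-up-end-rps 20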

--ready-check-timeout-sec

Maximum time to wait for the endpoint to become ready in seconds (default: 600 seconds / 10 minutes).

Default: 600

custom dataset options

--custom-output-len

Number of output tokens per request, used only for custom dataset.

Default: 256

--custom-skip-chat-template

Skip applying chat template to prompt, used only for custom dataset.

Default: False

sonnet dataset options

--sonnet-input-len

Number of input tokens per request, used only for sonnet dataset.

Default: 550

--sonnet-output-len

Number of output tokens per request, used only for sonnet dataset.

Default: 150

--sonnet-prefix-len

Number of prefix tokens per request, used only for sonnet dataset.

Default: 200

sharegpt dataset options

--sharegpt-output-len

Output length for each request. Overrides the output length from the ShareGPT dataset.

Default: None

random dataset options

--random-input-len

Number of input tokens per request, used only for random sampling.

Default: 1024

--random-output-len

Number of output tokens per request, used only for random sampling.

Default: 128

--random-range-ratio

Range ratio for sampling input/output length, used only for random sampling. Must be in the range [0, 1) to define a symmetric sampling range [length * (1 - range_ratio), length * (1 + range_ratio)].

Default: 0.0

--random-prefix-len

Number of fixed prefix tokens before the random context in a request. The total input length is the sum of random-prefix-len and a random context length sampled from [input_len * (1 - range_ratio), input_len * (1 + range_ratio)].

Default: 0
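
As an illustration of the sampling rule above, with the settings below each prompt consists of 100 fixed prefix tokens plus a random context whose length is sampled uniformly from [512, 1536] tokens (1024 * (1 ± 0.5)); the model name is a placeholder:

    vllm bench serve --model <model> --dataset-name random \
      --random-input-len 1024 --random-output-len 128 \
      --random-range-ratio 0.5 --random-prefix-len 100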

--random-batch-size

Batch size for random sampling. Only used for embeddings benchmark.

Default: 1

random multimodal dataset options extended from random dataset

--random-mm-base-items-per-request

Base number of multimodal items per request for random-mm. Actual per-request count is sampled around this base using --random-mm-num-mm-items-range-ratio.

Default: 1

--random-mm-num-mm-items-range-ratio

Range ratio r in [0, 1] for sampling items per request. We sample uniformly from the closed integer range [floor(n(1-r)), ceil(n(1+r))] where n is the base items per request. r=0 keeps it fixed; r=1 allows 0 items. The maximum is clamped to the sum of per-modality limits from --random-mm-limit-mm-per-prompt. An error is raised if the computed min exceeds the max.

Default: 0.0

--random-mm-limit-mm-per-prompt

Per-modality hard caps for items attached per request, e.g. '{"image": 3, "video": 0}'. The sampled per-request item count is clamped to the sum of these limits. When a modality reaches its cap, its buckets are excluded and probabilities are renormalized. Note: only image sampling is supported for now.

Default: {'image': 255, 'video': 0}

--random-mm-bucket-config

The bucket config is a dictionary mapping a multimodal item sampling configuration to a probability. Currently allows for 2 modalities: images and videos. A bucket key is a tuple of (height, width, num_frames); the value is the probability of sampling that specific item. Example: --random-mm-bucket-config {(256, 256, 1): 0.5, (720, 1280, 1): 0.4, (720, 1280, 16): 0.10} samples images with resolution 256x256 with probability 0.5, images with resolution 720x1280 with probability 0.4, and videos with resolution 720x1280 and 16 frames with probability 0.1. Note: if the probabilities do not sum to 1, they are normalized. Note also: only image sampling is supported for now.

Default: {(256, 256, 1): 0.5, (720, 1280, 1): 0.5, (720, 1280, 16): 0.0}
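
A sketch of a random-mm run; the model name is a placeholder, the shell quoting of the dictionaries is an assumption, and routing multimodal items through the chat endpoint is also an assumption:

    vllm bench serve --model <model> --dataset-name random-mm \
      --endpoint /v1/chat/completions --endpoint-type openai-chat \
      --random-mm-base-items-per-request 2 \
      --random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
      --random-mm-bucket-config '{(256, 256, 1): 0.5, (720, 1280, 1): 0.5}'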

hf dataset options

--hf-subset

Subset of the HF dataset.

Default: None

--hf-split

Split of the HF dataset.

Default: None

--hf-output-len

Output length for each request. Overrides the output lengths from the sampled HF dataset.

Default: None

prefix repetition dataset options

--prefix-repetition-prefix-len

Number of prefix tokens per request, used only for prefix repetition dataset.

Default: 256

--prefix-repetition-suffix-len

Number of suffix tokens per request, used only for prefix repetition dataset. Total input length is prefix_len + suffix_len.

Default: 256

--prefix-repetition-num-prefixes

Number of prefixes to generate, used only for prefix repetition dataset. Prompts per prefix is num_requests // num_prefixes.

Default: 10

--prefix-repetition-output-len

Number of output tokens per request, used only for prefix repetition dataset.

Default: 128
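
For example, with the settings below each prompt is 256 + 256 = 512 input tokens long, and the default 1000 prompts are spread across 10 shared prefixes, i.e. roughly 100 prompts per prefix (the model name is a placeholder):

    vllm bench serve --model <model> --dataset-name prefix_repetition \
      --prefix-repetition-prefix-len 256 \
      --prefix-repetition-suffix-len 256 \
      --prefix-repetition-num-prefixes 10 \
      --prefix-repetition-output-len 128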

sampling parameters

--top-p

Top-p sampling parameter. Only has effect on openai-compatible backends.

Default: None

--top-k

Top-k sampling parameter. Only has effect on openai-compatible backends.

Default: None

--min-p

Min-p sampling parameter. Only has effect on openai-compatible backends.

Default: None

--temperature

Temperature sampling parameter. Only has effect on openai-compatible backends. If not specified, defaults to greedy decoding (i.e. temperature==0.0).

Default: None
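
For example, to benchmark with non-greedy sampling on an OpenAI-compatible backend (the values are illustrative, the model name a placeholder):

    vllm bench serve --model <model> \
      --temperature 0.8 --top-p 0.95 --top-k 20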