# Supported Model Servers
Any model server that conforms to the model server protocol is supported by the inference extension.
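As a quick sanity check, you can scrape a server's Prometheus endpoint and confirm that the metrics the endpoint picker consumes are present. The port and metric names below are illustrative assumptions (they match vLLM's defaults):

```bash
# Quick conformance check: scrape the model server's Prometheus endpoint
# and look for the queue-depth and KV-cache utilization metrics that the
# endpoint picker consumes. Port 8000 and the metric names (vLLM's
# defaults) are illustrative assumptions.
curl -s http://localhost:8000/metrics | grep -E 'num_requests_waiting|gpu_cache_usage_perc'
```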
## Compatible Model Server Versions

| Model Server | Version | Commit | Notes |
| --- | --- | --- | --- |
| vLLM V0 | v0.6.4 and above | commit 0ad216f | |
| vLLM V1 | v0.8.0 and above | commit bc32bc7 | |
| Triton (TensorRT-LLM) | 25.03 and above | commit 15cb989 | LoRA affinity is not available because the required LoRA metrics haven't been implemented in Triton yet. See the open feature request. |
| SGLang | v0.4.0 and above | commit 1929c06 | Set `--enable-metrics` on the model server. LoRA affinity is not available because the required LoRA metrics haven't been implemented in SGLang yet. |
## vLLM
vLLM is configured as the default in the endpoint picker extension. No further configuration is required.
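For example, a default install might look like the following sketch, based on the `inferencepool` helm guide; the release name and label selector are placeholders for your deployment:

```bash
# Sketch: install an InferencePool for a vLLM deployment. vLLM is the
# default model server type, so no modelServerType override is needed.
# The release name and matchLabels value are placeholders.
helm install vllm-llama3-8b-instruct \
  --set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
```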
## Triton with TensorRT-LLM Backend
Triton-specific metric names need to be specified when starting the EPP. Use `--set inferencePool.modelServerType=triton-tensorrt-llm` to install the `inferencepool` via helm; see the `inferencepool` helm guide for more details. Then add the following entries to the EPP flags in the helm chart:
```yaml
- name: total-queued-requests-metric
  value: "nv_trt_llm_request_metrics{request_type=waiting}"
- name: kv-cache-usage-percentage-metric
  value: "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
- name: lora-info-metric
  value: "" # An empty value disables LoRA metric scraping, as LoRA metrics are not supported by Triton yet.
```
## SGLang
Set `--enable-metrics` on the model server, then add the following flags to the EPP deployment when installing via the helm charts:
```yaml
- name: total-queued-requests-metric
  value: "sglang:num_queue_reqs"
- name: kv-cache-usage-percentage-metric
  value: "sglang:token_usage"
- name: lora-info-metric
  value: "" # An empty value disables LoRA metric scraping, as LoRA metrics are not supported by SGLang yet.
```