Supported Model Servers¶
Any model server that conforms to the model server protocol is supported by the inference extension.
Compatible Model Server Versions¶
| Model Server | Version | Commit | Notes |
|---|---|---|---|
| vLLM V0 | v0.6.4 and above | commit 0ad216f | |
| vLLM V1 | v0.8.0 and above | commit bc32bc7 | |
| Triton (TensorRT-LLM) | 25.03 and above | commit 15cb989 | LoRA affinity feature is not available, as the required LoRA metrics haven't been implemented in Triton yet. Feature request |
| SGLang | v0.4.0 and above | commit 1929c06 | Set --enable-metrics on the model server. LoRA affinity feature is not available as the required LoRA metrics haven't been implemented in SGLang yet. |
vLLM¶
vLLM is configured as the default in the endpoint picker extension. No further configuration is required.
Triton with TensorRT-LLM Backend¶
Triton-specific metric names need to be specified when starting the EPP.
Use --set inferencePool.modelServerType=triton-tensorrt-llm to install the inferencepool via helm. See the inferencepool helm guide for more details.
Additionally, add the following flags to the EPP deployment in the helm chart:
```yaml
- name: total-queued-requests-metric
  value: "nv_trt_llm_request_metrics{request_type=waiting}"
- name: kv-cache-usage-percentage-metric
  value: "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
- name: lora-info-metric
  value: "" # Set an empty metric to disable LoRA metric scraping, as it is not supported by Triton yet.
```
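The metric flags above use Prometheus-style specs: a metric family name, optionally narrowed by a single label match (e.g. `request_type=waiting`). The matching can be sketched in Python as follows; this is a hypothetical simplification for illustration, not the EPP's actual parser:

```python
import re

def matches(spec, sample_name, sample_labels):
    """Check whether a scraped sample matches a metric spec of the form
    name or name{label=value}. Simplified: supports at most one label filter."""
    m = re.fullmatch(r"([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(\w+)=(\w+)\})?", spec)
    if m is None:
        return False  # malformed spec
    name, lkey, lval = m.group(1), m.group(2), m.group(3)
    if name != sample_name:
        return False
    # No label filter means any sample of this family matches.
    return lkey is None or sample_labels.get(lkey) == lval
```

Under this reading, `nv_trt_llm_request_metrics{request_type=waiting}` selects only the `waiting` series of the request-metrics family, which is why the queued-requests flag pins that label.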
SGLang¶
Add the following flags to the EPP deployment when deploying via the helm charts:
```yaml
- name: total-queued-requests-metric
  value: "sglang:num_queue_reqs"
- name: kv-cache-usage-percentage-metric
  value: "sglang:token_usage"
- name: lora-info-metric
  value: "" # Set an empty metric to disable LoRA metric scraping, as it is not supported by SGLang yet.
```
Multi-Engine Support¶
The Inference Extension supports collecting metrics from multiple inference engines simultaneously within the same InferencePool. This is useful for A/B testing or mixed-engine deployments.
By default, EPP includes pre-configured metric mappings for vLLM (default) and SGLang. You only need to label your Pods with the engine type.
1. Label your Pods¶
Label each deployment with the engine type:
```yaml
# vLLM Deployment
metadata:
  labels:
    inference.networking.k8s.io/engine-type: vllm
---
# SGLang Deployment
metadata:
  labels:
    inference.networking.k8s.io/engine-type: sglang
```
Pods without the engine label will use the default engine configuration (vLLM).
2. Change Default Engine (Optional)¶
To use SGLang as the default engine instead of vLLM, set the defaultEngine parameter:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
featureGates:
- dataLayer
plugins:
- name: core-metrics-extractor
  type: core-metrics-extractor
  parameters:
    defaultEngine: "sglang" # Pods without engine label will use SGLang metrics
```
3. Custom Engine Configuration (Optional)¶
If you need to customize the metric mappings or add support for other engines (e.g., Triton), provide engine-specific configurations in your EndpointPickerConfig. Note that built-in vLLM and SGLang configs are automatically included, so you only need to define them if you want to override the defaults:
```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
featureGates:
- dataLayer
plugins:
- name: core-metrics-extractor
  type: core-metrics-extractor
  parameters:
    engineLabelKey: "inference.networking.k8s.io/engine-type" # Pod label key (optional; this is the default)
    defaultEngine: "vllm" # Which engine to use for Pods without an engine label
    engineConfigs:
    # vllm and sglang are optional - only define them to override defaults
    - name: vllm
      queuedRequestsSpec: "vllm:num_requests_waiting"
      runningRequestsSpec: "vllm:num_requests_running"
      kvUsageSpec: "vllm:kv_cache_usage_perc"
      loraSpec: "vllm:lora_requests_info"
      cacheInfoSpec: "vllm:cache_config_info"
    - name: sglang
      queuedRequestsSpec: "sglang:num_queue_reqs"
      runningRequestsSpec: "sglang:num_running_reqs"
      kvUsageSpec: "sglang:token_usage"
    - name: triton
      queuedRequestsSpec: "nv_trt_llm_request_metrics{request_type=waiting}"
      kvUsageSpec: "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
```
Key points:
- Use engineLabelKey to customize the Pod label key for engine identification (defaults to inference.networking.k8s.io/engine-type)
- Use defaultEngine to specify which engine is used for Pods without an engine label (defaults to "vllm")
- Built-in vLLM and SGLang configs are automatically included, even when adding custom engines
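The label-to-config resolution described above (read the engine label, fall back to the default engine when the label is missing or unrecognized) can be sketched in Python. This is an illustrative model of the behavior, not the EPP's actual implementation; the mappings mirror the engineConfigs shown earlier:

```python
# Hypothetical sketch of per-Pod engine-config resolution.
ENGINE_LABEL_KEY = "inference.networking.k8s.io/engine-type"

ENGINE_CONFIGS = {
    "vllm": {
        "queuedRequestsSpec": "vllm:num_requests_waiting",
        "kvUsageSpec": "vllm:kv_cache_usage_perc",
    },
    "sglang": {
        "queuedRequestsSpec": "sglang:num_queue_reqs",
        "kvUsageSpec": "sglang:token_usage",
    },
}

def config_for_pod(pod_labels, default_engine="vllm"):
    """Return the metric mapping for a Pod based on its engine label.

    Pods without the label, or with an unknown engine value, fall back
    to the default engine's mapping.
    """
    engine = pod_labels.get(ENGINE_LABEL_KEY, default_engine)
    return ENGINE_CONFIGS.get(engine, ENGINE_CONFIGS[default_engine])
```

Setting defaultEngine in the EndpointPickerConfig corresponds to changing the default_engine argument here: unlabeled Pods then pick up the SGLang mapping instead of vLLM's.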
Active Port Declaration via Pod Annotations¶
The EPP supports specifying, via pod annotations, which ports on a pod should be considered active for inference traffic. This allows fine-grained control over which ports the EPP will use when routing inference requests to model server pods.
To specify active ports on a pod, add the annotation inference.networking.k8s.io/active-ports with a comma-separated list of port numbers.
Relationship to InferencePool CR TargetPorts¶
The ports specified in the inference.networking.k8s.io/active-ports annotation are subject to the InferencePool CR's TargetPorts configuration. Only ports that are within the TargetPorts range defined in the InferencePool CR are considered valid and will be used for inference traffic. This annotation is optional - if it is not present, all ports in the InferencePool's TargetPorts range will be considered active for that pod.
Example¶
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-server-pod
  annotations:
    inference.networking.k8s.io/active-ports: "8000,8002"
spec:
  containers:
  - name: model-server
    image: your-model-server:latest
    ports:
    - containerPort: 8000
      name: http
    - containerPort: 8001
      name: metrics
    - containerPort: 8002
      name: health
```
In this example, assuming ports 8000-8002 are defined in the InferencePool's TargetPorts, the EPP will only consider ports 8000 and 8002 as active ports for inference traffic on this pod (since 8001 is not specified in the annotation). Any other ports exposed by the pod that are not within the TargetPorts range will not be used for inference requests.
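The filtering rule (annotated ports intersected with TargetPorts; all TargetPorts active when the annotation is absent) can be sketched in Python. This is a minimal illustrative model, assuming TargetPorts is a plain set of port numbers, and is not the EPP's actual code:

```python
def active_ports(target_ports, annotation=None):
    """Compute the ports treated as active for inference traffic on a pod.

    target_ports: port numbers from the InferencePool's TargetPorts.
    annotation: value of inference.networking.k8s.io/active-ports, or None.
    """
    if annotation is None:
        # No annotation: every TargetPort is considered active.
        return sorted(target_ports)
    declared = {int(p.strip()) for p in annotation.split(",")}
    # Only annotated ports that also appear in TargetPorts are valid.
    return sorted(declared & set(target_ports))
```

For the example pod above, active_ports({8000, 8001, 8002}, "8000,8002") yields [8000, 8002]: port 8001 is excluded because it is not in the annotation, even though it is a TargetPort.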
This feature is particularly useful when your model server pods expose multiple ports in the TargetPorts range and you want to explicitly control which ones are used for inference traffic routing.