EPP Resource Tuning¶

This page documents the default resource requests and limits for the Endpoint Picker (EPP) deployment and guidance on when to adjust them.

Default Resources¶

The Helm chart sets the following defaults for the EPP container:

Setting	Value	Rationale
CPU request	4 cores	Sufficient for built-in scheduling plugins under sustained load
Memory request	8 Gi	Baseline working set for EPP with default plugins
CPU limit	unset	Allows bursting to all available node CPUs during scheduling spikes
Memory limit	16 Gi	Prevents the node from entering memory pressure in case of a memory leak

The terminationGracePeriodSeconds is set to 130 seconds to match the grace period used in the vLLM example deployment, ensuring in-flight requests can drain before the pod is killed.

Validation¶

These defaults were validated using the Inference Perf tool with the following setup:

Model: Qwen/Qwen3-32B served by vLLM
Load: Poisson arrival, staged ramp from 1 to 5000 QPS in steps of 500 (100s per stage), 88 workers, max concurrency 100, max TCP connections 2500
Data: Shared-prefix workload — 150 groups, 5 prompts/group, system prompt 60 tokens, question 12 tokens, output 10 tokens
API: Streaming completions

Result: p90 scheduler latency remained within 100 ms across all stages.

QPS over time

Scheduler latency over time

To reproduce, see the Benchmark guide for instructions on deploying Inference Perf with custom configurations.

When to Increase Resources¶

You should consider increasing the resource requests if:

You add computationally intensive plugins (e.g., latency prediction scorers, custom scorers with ML inference).
You deploy sidecars alongside the EPP container that share the pod's resource budget.
You observe CPU throttling or increased scheduling latency under your production traffic patterns.