Autoscaling and Concurrency
Scale to/from zero with different types of scaling factor.
Autoscaling lets you handle highly variable traffic while minimizing spend on idle compute resources.
Configuration
Autoscaling settings are configurable for each deployment of a model. Autoscaling configs are global settings so each version of your model use the same autoscaling config. You can update via magicpoint.yaml
or console.
You can set following configs for each type of scaling:
- Autoscaling Window
window
- Scale Down Delay
scale_down_period
Autoscaling Window: Tracks the scaling metric and wait X seconds to upscale your deployment.
Scale Down Delay: If your replica has no request X seconds to downscale your deployment.
Scale by queue depth
The basic scaling option is by queue depth. Every application has their own queue and you can configure scaling by queue length. And also you can set the threshold in this option.
Scale by latency
Response time is the most important topic for the end users. We integrated scale by response time to overcome latency issue. It’s a bit complicated solution because we have three option for response time:
- Average Response Time
- p95 Response Time
- p99 Response Time
You can use latency scaler with following yaml: