Batch Jobs
What are batch jobs?
Batch jobs are an autoscaling Containers feature for long-running, one-off work.
Each job gets a dedicated replica. That replica is destroyed as soon as the job finishes.
Why use batch jobs instead of continuous deployments?
When inference runs long (typically > 3 minutes), downscaling is tricky:
A high Scale-down delay prevents killing in-flight requests. It also leaves replicas idle and wastes money.
A low Scale-down delay can terminate a replica mid-request.
Batch jobs avoid this. They tie replica lifetime to the job lifecycle.
Your app must be able to exit the process to signal completion. Use exit code 0 for success. Use a non-zero code for failure.
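As a rough illustration of this exit-code contract, here is a minimal Python sketch. The names run_job, exit_code_for, and run_and_exit are hypothetical and not part of the platform; run_job stands in for your actual workload.

```python
import sys
from typing import Callable


def run_job() -> None:
    """Hypothetical placeholder for the real work, such as a long inference call."""


def exit_code_for(job: Callable[[], None]) -> int:
    """Run the job once and map its outcome to a process exit code."""
    try:
        job()
    except Exception as exc:
        # Any unhandled failure becomes a non-zero exit code.
        print(f"job failed: {exc!r}", file=sys.stderr)
        return 1
    return 0  # success


def run_and_exit(job: Callable[[], None] = run_job) -> None:
    """Entrypoint: exiting the process is what signals that the job is complete."""
    sys.exit(exit_code_for(job))
```

Calling run_and_exit() as the last statement of the container entrypoint would end the process with 0 on success and 1 on any unhandled exception.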
Key differences vs continuous deployments
Batch jobs are always async. See Async Inference.
Each job has a deadline. When it’s reached, the replica is killed even if still running.
A job is considered “done” only when your process exits.
Usage and example
This example uses:
Source: verda-cloud/batch-jobs-example
Exposed port: 8000
Health check path: /health
When creating the deployment, the batch-job specific settings are:
Max concurrent jobs: maximum replicas. Scales to 0 when the queue is empty.
Deadline: maximum time a replica can stay up for a job.

Best practices
Use batch jobs for workloads that usually run longer than ~3 minutes.
Exit the process when the job is done (success or failure).
If you return an HTTP response, exit after the response is sent.
Log heavily. Use DEBUG during development. Use INFO/WARNING in production.
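One way to switch log levels between development and production without changing code is to read the level from the environment. The sketch below assumes a LOG_LEVEL environment variable, which is a hypothetical name and not a platform setting; configure_logging is likewise illustrative.

```python
import logging
import os

# Levels this sketch accepts; anything else falls back to INFO.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(default_level: str = "INFO") -> int:
    """Set the root log level from the (hypothetical) LOG_LEVEL variable."""
    name = os.environ.get("LOG_LEVEL", default_level).upper()
    level = _LEVELS.get(name, logging.INFO)
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s %(message)s",
        force=True,  # replace any handlers configured earlier
    )
    return level
```

With this in place, setting LOG_LEVEL=DEBUG during development and LOG_LEVEL=INFO or LOG_LEVEL=WARNING in production changes verbosity without a rebuild.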
Troubleshooting
Replica keeps running after the job is done
Make sure you actually exit the process.
Make sure you exit with the right status code: 0 for success, non-zero for failure.
Unhandled exceptions may return an HTTP error but keep the process alive.
Replica was killed before the job finished
Set Deadline higher than your expected job duration.
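If you have timings from earlier runs, one simple way to pick a value is to take the slowest observed run and add headroom. The helper below is only a sketch: the function name and the 1.5 safety margin are assumptions, not platform guidance, and the result must be converted to whatever unit the Deadline field expects.

```python
import math
from typing import Sequence


def suggest_deadline_seconds(observed_seconds: Sequence[float], margin: float = 1.5) -> int:
    """Return the slowest observed run scaled by a safety margin, rounded up."""
    if not observed_seconds:
        raise ValueError("need at least one observed duration")
    if margin < 1.0:
        raise ValueError("margin must be >= 1.0 so the deadline covers the slowest run")
    return math.ceil(max(observed_seconds) * margin)
```

For illustrative runs of 200, 310, and 280 seconds, suggest_deadline_seconds([200, 310, 280]) returns 465.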
No response is returned
Make sure the process doesn’t exit before sending the response.
In FastAPI, exit from a BackgroundTasks task after returning.
In Node.js, exit via setImmediate() after writing the response.
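The same ordering (respond first, exit second) can be sketched with only the Python standard library. This is not the FastAPI or Node.js approach named above; it swaps in http.server to stay dependency-free. JobHandler, make_server, and the response payload are hypothetical.

```python
import sys
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class JobHandler(BaseHTTPRequestHandler):
    """Handles one job request, replies, then asks the server to stop."""

    def do_POST(self) -> None:
        # Drain the request body so the client sees a clean response.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)

        body = b'{"status": "done"}'  # hypothetical result payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
        self.wfile.flush()  # hand the response to the socket before stopping

        # shutdown() blocks until serve_forever() returns, so it must run on
        # another thread; calling it directly here would deadlock.
        threading.Thread(target=self.server.shutdown, daemon=True).start()


def make_server(host: str = "0.0.0.0", port: int = 8000) -> HTTPServer:
    """Build the server; port 8000 matches the exposed port in the example."""
    return HTTPServer((host, port), JobHandler)


def main() -> None:
    server = make_server()
    server.serve_forever()  # returns after the handler triggers shutdown()
    server.server_close()
    sys.exit(0)  # the response has been written; now signal completion
```

The design choice here is that the handler never exits the process itself. It only requests shutdown, and main() exits after serve_forever() returns, so the response is always written before the process ends.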
Replica isn’t accepting jobs
Make sure you implement a GET /health endpoint.
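A health endpoint can be very small. The sketch below uses only the Python standard library and assumes that a 200 response is what the health check looks for, which this page does not state. route_get, HealthHandler, and serve are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def route_get(path: str) -> Tuple[int, bytes]:
    """Return (status, body) for a GET request path, ignoring any query string."""
    if path.split("?", 1)[0] == "/health":
        return 200, b"ok"
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = route_get(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Listen on the exposed port used in the example (8000)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

In a real app the same handler class would also serve the job endpoint, so the health check and the job share one port.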