K8s-native CronJobs are quite convenient for running regularly scheduled tasks. But the K8s CronJob and Job specs do not provide a straightforward way (at least not one that I could find) to specify an execution timeout. So when execution hangs, for whatever reason, the container keeps running. Best case, it runs until the next execution, if concurrencyPolicy: Replace is used.
If your task's code has its own timeout capability, life is good. When it does not, here's what you can do.
When a hung run shouldn't be left waiting until the next scheduled try, and/or job history needs to be retained via concurrencyPolicy: Forbid, a livenessProbe can be used to compare the time elapsed since the start of the task against a timeout value. When that probe fails, the container is restarted thanks to restartPolicy: OnFailure.
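Here's a sketch of just the probe logic, assuming the task's entrypoint records its start time in a marker file. The 300-second timeout and the /tmp/started path are placeholders of mine:

```yaml
# The container's command records the start time before running the real task, e.g.:
#   date +%s > /tmp/started && ./run-task.sh
livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    # Fail once more than 300 seconds have elapsed since the recorded start.
    - test $(( $(date +%s) - $(cat /tmp/started) )) -lt 300
  periodSeconds: 30
```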
If job history does not need to be retained, one could use concurrencyPolicy: Replace. However, that makes successfulJobsHistoryLimit and failedJobsHistoryLimit meaningless, as jobs will be replaced each time the CronJob schedule kicks off another one.
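For comparison, a sketch of the Replace-based variant; the name, schedule, image, and run-task.sh script are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: task-replace
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Replace      # a still-running job is killed and replaced at the next tick
  successfulJobsHistoryLimit: 3   # see the caveat above about history once runs get replaced
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: task
            image: busybox        # placeholder image
            command: ["/bin/sh", "-c", "./run-task.sh"]   # stand-in for the real task
```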
Perhaps the Downward API could be used to get the container start time, but I haven't found the right reference for that yet.
I like to be able to see what went wrong in failed job runs. Counterintuitively, using restartPolicy: Never keeps failed pods around, available to examine.
CronJob with timeout via livenessProbe example
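Below is a minimal sketch of such a CronJob, assuming Kubernetes 1.21+ (batch/v1). The schedule, image, 300-second timeout, /tmp/started marker file, and run-task.sh script are placeholder values of mine; substitute your own.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: task-with-timeout
spec:
  schedule: "*/15 * * * *"           # placeholder schedule
  concurrencyPolicy: Forbid          # retain job history; don't start overlapping runs
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # restart the container when the liveness probe fails
          containers:
          - name: task
            image: busybox           # placeholder image
            command:
            - /bin/sh
            - -c
            # Record the start time, then run the actual task (run-task.sh is a stand-in).
            - date +%s > /tmp/started && ./run-task.sh
            livenessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                # Fail the probe once more than 300 seconds have elapsed since start.
                - test $(( $(date +%s) - $(cat /tmp/started) )) -lt 300
              initialDelaySeconds: 10
              periodSeconds: 30
```

As noted above, swapping restartPolicy: OnFailure for Never would keep failed pods around for inspection instead of restarting the container in place.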