K8s-native CronJobs are quite convenient for running regularly scheduled tasks. But the K8s CronJob and Job specs do not provide a straightforward way (at least not one that I could find) to specify an execution timeout. So when execution hangs, for whatever reason, the container keeps running. Best case, it runs until the next execution, if concurrencyPolicy: Replace is used.
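For reference, a minimal CronJob sketch using that policy; the name, schedule, and image below are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: my-task                    # placeholder name
spec:
  schedule: "*/5 * * * *"          # placeholder schedule
  concurrencyPolicy: Replace       # a new run replaces a still-running previous one
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: task
              image: my-task:latest   # placeholder image
```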
If your task's code has its own timeout capability, life is good. When it does not, here's what you can do.
When running a task you'd rather not leave hanging until the next try, and/or when job history needs to be retained via failedJobsHistoryLimit, a livenessProbe can be used to compare the time elapsed since the start of the task against the timeout value. When that probe fails, the kubelet kills the container, and what happens next is governed by the pod's restartPolicy.
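A minimal sketch of such a probe, in which the task touches a marker file on startup and the probe fails once that file is older than the timeout. The timeout value (300 seconds), image, and marker-file path are all placeholders:

```yaml
containers:
  - name: task
    image: my-task:latest          # placeholder image
    # Record the start time as the marker file's mtime, then run the task.
    command: ["/bin/sh", "-c", "touch /tmp/started; exec /opt/task.sh"]
    livenessProbe:
      periodSeconds: 30
      exec:
        command:
          - /bin/sh
          - -c
          # Fails once /tmp/started is older than 300 seconds.
          - "test $(( $(date +%s) - $(stat -c %Y /tmp/started) )) -lt 300"
```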
If job history does not need to be retained, one could simply use concurrencyPolicy: Replace. However, that will make failedJobsHistoryLimit meaningless, as still-running jobs are replaced each time the CronJob schedule kicks off another one.
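To illustrate that tradeoff in spec form (schedule and limit values are placeholders):

```yaml
spec:
  schedule: "*/10 * * * *"
  concurrencyPolicy: Replace
  # A hung run is replaced rather than failed, so it never lands in the
  # failed-jobs history this limit is supposed to retain.
  failedJobsHistoryLimit: 3
```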
Perhaps the Downward API can be used to get the container start time, but I haven't found the right reference for that yet.
I like being able to see what went wrong in failed job runs. Counterintuitively, using restartPolicy: Never will keep failed pods around, available to examine.
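A sketch of the relevant Job template fields; backoffLimit: 0 is my own addition here, assumed in order to keep retries from piling up replacement pods:

```yaml
jobTemplate:
  spec:
    backoffLimit: 0            # assumption: no retries, so a single failed pod remains
    template:
      spec:
        restartPolicy: Never   # failed pods stay in Failed state for inspection
```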
CronJob with timeout via livenessProbe example