Split timed out batched background jobs and retry
Context:
Currently, when a BatchedJob fails, we retry the job until the max number of retries is reached. During these retries, we don't change any parameters of the job.
Goal:
To make our retry system smarter, when the max number of retries is reached, we split the job into two smaller jobs. These two jobs should have a new batch_size and max_value.
Then, we run the two new jobs. If, for some reason, one of them reaches the max number of retries again, we should split that job again and rerun it.
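
To make the proposed flow concrete, here is a minimal, language-agnostic sketch in Python. It is not the existing implementation: the `Job` class, the `MAX_ATTEMPTS` value, the `min_value` field, the arithmetic midpoint, and halving `batch_size` are all assumptions made for illustration. Only `batch_size` and `max_value` come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ATTEMPTS = 3  # hypothetical retry limit, chosen for the example


@dataclass
class Job:
    # Hypothetical stand-in for a batched job covering an inclusive id range.
    min_value: int
    max_value: int
    batch_size: int
    attempts: int = 0


def split(job: Job) -> List[Job]:
    """Split a job's id range into two halves, each with a smaller batch_size."""
    if job.min_value >= job.max_value:
        raise ValueError("job covers a single value and cannot be split")
    # Assumption: ids are integers, so the arithmetic midpoint is a valid boundary.
    midpoint = (job.min_value + job.max_value) // 2
    # Assumption: the new batch_size is half the old one, never below 1.
    new_batch_size = max(1, job.batch_size // 2)
    return [
        Job(job.min_value, midpoint, new_batch_size),
        Job(midpoint + 1, job.max_value, new_batch_size),
    ]


def run_with_split(job: Job, run: Callable[[Job], bool]) -> List[Job]:
    """Retry each job up to MAX_ATTEMPTS; once exhausted, split it and run the halves.

    Returns the jobs that eventually succeeded, in range order.
    Raises ValueError if a failing job can no longer be split.
    """
    pending = [job]
    succeeded = []
    while pending:
        current = pending.pop(0)
        while current.attempts < MAX_ATTEMPTS:
            current.attempts += 1
            if run(current):
                succeeded.append(current)
                break
        else:
            # All attempts failed: replace the job with its two halves.
            pending = split(current) + pending
    return succeeded
```

For example, if `run` only succeeds for ranges of at most 25 ids, calling `run_with_split(Job(1, 100, 50), run)` splits the original job twice and ends with four successful jobs covering the ids 1-25, 26-50, 51-75, and 76-100. A real implementation would also need a stopping rule for jobs that keep failing at the smallest possible size; the sketch simply raises in that case.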
Note: We already have a mechanism to split the job and retry it, but it requires human interaction (only available via the admin panel). We want to automate this process.
Edited by Yannis Roussos