Re-think operations for scalability
Context
Currently, every time a client submits a job, we create a new operation for that client and associate it with the relevant job. Many operations map to a single job when they all request execution of the same ActionDigest ("deduplication").
One reason for keeping many operations is that we currently "cancel" a job once no operations point to it any more (for the in-memory scheduler, this also frees memory we considered "unused"). This behaviour breaks clients that submit jobs asynchronously and come back for the results later.
Moreover, splitting jobs and operations this way has previously caused race conditions, e.g. when determining whether a job was done.
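For reference, the current model can be sketched roughly as follows. This is an illustrative simplification, not the actual scheduler code; the class and method names are made up for the sketch.

```python
import uuid


class Job:
    """One unit of execution, keyed by the ActionDigest it runs."""

    def __init__(self, action_digest):
        self.action_digest = action_digest
        self.cancelled = False


class Scheduler:
    """Sketch of the current model: many operations map to one job."""

    def __init__(self):
        self.jobs = {}        # action_digest -> Job
        self.operations = {}  # operation_id -> Job

    def submit(self, action_digest):
        # Deduplication: reuse the existing job for the same ActionDigest.
        job = self.jobs.setdefault(action_digest, Job(action_digest))
        op_id = str(uuid.uuid4())
        self.operations[op_id] = job
        return op_id

    def drop_operation(self, op_id):
        # Current behaviour: cancel the job once no operations point to it.
        # This is what breaks async clients that request results later.
        job = self.operations.pop(op_id)
        if job not in self.operations.values():
            job.cancelled = True
            del self.jobs[job.action_digest]
```

Note how job lifetime is derived from the set of live operations, which is exactly the coupling that causes the race conditions mentioned above.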
Task
Re-think how we create and use operations to make this scale well with the database backend especially.
Possible Approach
- Have only one operation per job (in the same table?)
- Have a "refcount" of clients that asked for execution of this job: add one every time a client requests execution, and subtract one every time a client cancels
  - if the refcount reaches 0, cancel the job.
- Only cancel a job when a client explicitly asks for cancellation (instead of assuming cancellation whenever a client disconnects).
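The proposed approach could look roughly like this. Again a hedged sketch, not an implementation: the names (`Job`, `Scheduler`, `execute`, `cancel`) are placeholders, and a real database-backed version would do the refcount update atomically in the jobs table rather than in process memory.

```python
class Job:
    """Proposed model: a single operation per job, plus a client refcount."""

    def __init__(self, action_digest):
        self.action_digest = action_digest
        self.refcount = 0
        self.cancelled = False


class Scheduler:
    """Sketch of refcount-based job lifetime with explicit cancellation."""

    def __init__(self):
        self.jobs = {}  # action_digest -> Job (one operation per job)

    def execute(self, action_digest):
        # Every execution request for the same ActionDigest bumps the
        # refcount on the one job instead of creating a new operation.
        job = self.jobs.setdefault(action_digest, Job(action_digest))
        job.refcount += 1
        return job

    def cancel(self, action_digest):
        # Only an explicit client cancellation decrements the refcount;
        # a plain disconnect leaves the job (and its result) available.
        job = self.jobs[action_digest]
        job.refcount -= 1
        if job.refcount == 0:
            job.cancelled = True
            del self.jobs[job.action_digest]
```

Because a disconnect no longer touches the refcount, a client that submitted asynchronously can reconnect and fetch results, and the job is only torn down once every interested client has explicitly cancelled.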