WIP: Show all operations in get_all_jobs, regardless of status.
Description
The operations proto makes no mention of only showing none-completed operations in a ListOperations
response.
Since the operation has a done
field, get_all_jobs should return all jobs not just the active ones.
From my search, this method is only used during a ListOperations
request.
TODO
-
Remove filter from sql implementation.
Merge request reports
Activity
As far as I can tell, this (theoretically) isn't a functional change since we delete jobs when they're complete.
But the line you've edited gets operations by way of jobs. So if the job gets deleted in the in-memory implementation, its operation is unreferenced.
A more general question: what is the motivation behind this change besides reinterpreting the API spec? The operations API doesn't mention all running operations, but it doesn't mention that it should return completed operations either. This list could be massive. As a client, if I send a
ListOperations
request against anOperations
service, I would primarily be interested in what's going on, and everything that's completed is just noise.The
ListOperations
call has a filter parameter that we could use for this purpose, but it's open-ended, and we currently ignore it. If this change is made, then I would expect to be able to make use of this parameter as well so that I don't have to trudge through completed operations. Designing the interface for it is going to take some thought and will be a more substantial change.Edited by Rohit KothurUnless you keep track of the operation id's of job's you're interested in, there's no way to know the status of a completed job.
Buildfarm
also returns all operations on this request: https://github.com/bazelbuild/bazel-buildfarm/blob/master/src/main/java/build/buildfarm/instance/AbstractServerInstance.java#L1341The size of the response is handled by the unimplemented
page_token
parameter.While I am a fan of having
ListOperations
return allOperations
by default, I don't think we can realistically do that right now. We currently don't remove old operations, so the list of operations will just keep growing and growing, which will make this query slower and slower to the point of being too expensive. While it looks like thepage_size
+page_token
parameters would help, do we respect those? (using https://cloud.google.com/service-infrastructure/docs/service-management/reference/rpc/google.longrunning#google.longrunning.ListOperationsRequest as a reference fyi)In addition, with tons of operations, think on the order of tens of thousands, the results don't become useful without us implementing the
filter
parameter @rkothur mentioned above.So I think those two things are prerequisites this MR, but I do think it's a place we want to get to.
I'm not in favor of this change in general, but I'd be open to discussing it after we fully implement the parameters.
Additionally--and this should not necessarily fully inform design decisions made here--BuildFarm does not return completed operations. The method you linked calls
createOperationsIterator
, which has two implementations: in-memory, which only looks at outstanding operations, and sharded, which appears to always throw an exception at the linked line because the only backplane implementation doesn't support getOperations.which appears to always throw an exception at the linked line because the only backplane implementation doesn't support getOperations.
The commented out line below the linked code, seems to suggest that completed operations were returned in certain implementations.
Regardless, the scenario still exists, that you will lose operation statuses if they were completed without having known the operation id beforehand, or were completed before a call to ListOperations happened while the job was staged/executing.
I'm fine closing this MR in favor of reporting our findings in an issue. Does that sound reasonable?