Skip to content

Only notify once when a job is completed, refactor update_bot_session

Marios Hadjimichael requested to merge marios/fix-job-completed-notify-race into master

Description

When buildgrid is running in permissive botsession mode, or, multiple instances of buildgrid are running with the same database backend, there are some cases in which clients receive an ActionResult with status completed, but without the actual results. This MR addresses the fact that we had 3 separate "notify" calls when a job was completed, that being one of the races, and now only notify after the result was stored in the action cache.

It also refactors update_bot_session and some helper methods a bit to make it a bit easier to follow what's going on when leases are checked.

However, further work is needed for a few more cases described below, when using something like a load balancer in front of multiple buildgrid instances using the same datastore.

Further work

  • Looks like loading operations from the datastore will not populate the result field, which means that currently the machine that receives the result needs to be the one asked for the operation for the clients to receive the result, otherwise the clients will not receive the result (won't be loaded from db).
  • Looks like we never delete leases from the sql data-store, causing all old leases to remain stored indefinitely: https://gitlab.com/BuildGrid/buildgrid/-/blob/master/buildgrid/server/job.py#L531 (that works, and can be useful for troubleshooting, but we may want to reconsider removing the leases in the sql datastore as well)
Edited by Marios Hadjimichael

Merge request reports