SQL Scheduler: Load job results that were stored by an external BuildGrid; Improve operation handling

Marios Hadjimichael requested to merge marios/load-operation-result into master

Description

This MR addresses a few issues that caused the job result not to be loaded from the datastore in some cases. The changes proposed here make BuildGrid work better for clusters of many BuildGrid instances sharing the same sql-datastore (bots/clients can connect to any BuildGrid and their requests will be fulfilled correctly).

Checklist

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests – existing unit tests are updated for the new assumptions, and a new test verifies that job results are loaded from the db when a job is updated outside of the current BuildGrid
  • Metrics - no metrics updates needed
  • Documentation update(s) - not applicable

Changes proposed in this merge request:

  • Job.py no longer holds individual operation protos and their states (which were redundant and hard to keep in sync between in-memory and db state). It now holds a list of operation names (and cancelled operation names); when asked for an operation via get_operation(), it creates the proto on demand from the Job's response/metadata for the requested name.
    • Scheduler interface: create_operation now accepts the name of the operation to create instead of an Operation proto, since the proto is no longer available and only the name was needed.
  • SQL scheduler: only store the job-name -> execute_response_proto mapping in the response_cache when there is an actual result (i.e. don't populate it with an empty ExecuteResponse on job creation). The old behaviour caused BuildGrid to skip reloading an ExecuteResponse for a job it assumed it had already cached. This happens if a bot starts work assigned by BuildGrid A but reports completion to BuildGrid B; notifying BuildGrid A would reload the job (and its COMPLETED state) but bypass reloading the ExecuteResponse proto, since it was assumed to be cached already.
  • SQL models.py: updated so that converting an operation fetched from the db via to_protobuf() also includes the ExecuteResponse.
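The on-demand operation handling above can be sketched roughly as follows (a hypothetical, simplified illustration; the class and attribute names are stand-ins, not the exact BuildGrid API, and plain dicts stand in for protos):

```python
# Sketch (hypothetical, simplified): a Job keeps only operation *names*;
# get_operation() builds an operation record on demand from the job's
# current response/metadata, so in-memory and db state cannot drift apart.

class Job:
    def __init__(self, name):
        self.name = name
        self.operation_names = set()
        self.cancelled_operations = set()
        self.execute_response = None      # set once a result exists
        self.metadata = {"stage": "QUEUED"}

    def create_operation(self, operation_name):
        # The scheduler now passes just a name, not a full Operation proto.
        self.operation_names.add(operation_name)

    def get_operation(self, operation_name):
        if operation_name not in self.operation_names:
            raise KeyError(operation_name)
        # Build the proto-like record on demand from the current job state.
        return {
            "name": operation_name,
            "metadata": dict(self.metadata),
            "done": self.execute_response is not None,
            "response": self.execute_response,
            "cancelled": operation_name in self.cancelled_operations,
        }

job = Job("jobs/abc")
job.create_operation("operations/op-1")
assert job.get_operation("operations/op-1")["done"] is False

# When a result arrives (possibly written by another BuildGrid instance and
# reloaded from the database), the next get_operation() reflects it.
job.execute_response = {"exit_code": 0}
assert job.get_operation("operations/op-1")["done"] is True
```

Because the operation record is derived from job state at call time, there is no per-operation copy to keep synchronized with the database.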

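The response_cache fix can be sketched in the same spirit (a hypothetical, simplified illustration; dicts stand in for the cache and the shared SQL datastore, and the function names are illustrative):

```python
# Sketch (hypothetical, simplified): populate the response cache only when a
# job actually has a result; on a cache miss, reload from the database so a
# result stored by another BuildGrid instance is picked up.

response_cache = {}   # job_name -> execute_response (no empty placeholders)
database = {}         # stand-in for the shared SQL datastore

def get_execute_response(job_name):
    if job_name in response_cache:
        return response_cache[job_name]
    # Cache miss: another BuildGrid may have completed the job, so check the db.
    result = database.get(job_name)
    if result is not None:
        response_cache[job_name] = result  # cache only real results
    return result

# BuildGrid A creates a job; note that no empty cache entry is created.
assert get_execute_response("jobs/abc") is None
# BuildGrid B finishes the job and writes the result to the shared database.
database["jobs/abc"] = {"exit_code": 0}
# BuildGrid A now reloads the result instead of trusting a stale cache entry.
assert get_execute_response("jobs/abc") == {"exit_code": 0}
```

The old behaviour corresponded to inserting an empty entry at job creation, which made the first branch succeed forever and masked results written by other instances.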
Validation

You can test this using the stress-testing dockerfiles. Here's the script I have been using for reference:

STRESS_TESTING_DISABLE_PLATFORM_PROPERTIES=yes \
STRESS_TESTING_BGD_CONFIG=bgd-pges-permissive-botsession.yml \
STRESS_TESTING_CLIENT_REQUESTS=100 \
 docker-compose -f compose-files/pges-recc-bbrht.yml up \
  --scale buildgrid=1 --scale bots=1 \
  --scale clients=1 --scale database=1 -d &

sleep 60 # allow some time for action to start happening with buildgrid_1

# Now add another buildgrid
STRESS_TESTING_DISABLE_PLATFORM_PROPERTIES=yes \
STRESS_TESTING_BGD_CONFIG=bgd-pges-permissive-botsession.yml \
STRESS_TESTING_CLIENT_REQUESTS=100 \
 docker-compose -f compose-files/pges-recc-bbrht.yml up \
  --scale buildgrid=2 --scale bots=1 \
  --scale clients=1 --scale database=1 -d &

# OPTIONAL: (testing what happens when a buildgrid goes away in addition to the above)
# Wait for the new buildgrid to be up and able to do work
# and then stop the first buildgrid
sleep 20
docker stop compose-files_buildgrid_1

Follow up items

  1. Currently we persist leases to the database only once bots ack them, which means a bot has to talk to the BuildGrid that initially assigned it the lease at least once after receiving it; otherwise other BuildGrids may not know about the lease. It may make sense to store the lease right away to avoid this problem, as part of refactoring lease handling. This is related to: #329
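The follow-up proposal can be sketched as persisting the lease at assignment time rather than on ack (a hypothetical, simplified illustration; all names and the dict-based "database" are stand-ins, not BuildGrid's actual lease handling):

```python
# Sketch (hypothetical, simplified): persist a lease to the shared database
# at assignment time, instead of waiting for the bot's ack, so every
# BuildGrid instance can see it immediately.

leases_db = {}  # stand-in for the shared SQL lease table

def assign_lease(job_name, bot_name, persist_on_assign):
    lease = {"job": job_name, "bot": bot_name, "state": "PENDING"}
    if persist_on_assign:
        leases_db[job_name] = lease  # proposed: visible to all BuildGrids now
    return lease

def ack_lease(lease):
    lease["state"] = "ACTIVE"
    leases_db[lease["job"]] = lease  # current: first persisted only here

# Current behaviour: before the ack, other BuildGrids cannot see the lease.
lease = assign_lease("jobs/abc", "bot-1", persist_on_assign=False)
assert "jobs/abc" not in leases_db
ack_lease(lease)
assert leases_db["jobs/abc"]["state"] == "ACTIVE"

# Proposed behaviour: the lease is visible immediately after assignment.
assign_lease("jobs/def", "bot-2", persist_on_assign=True)
assert leases_db["jobs/def"]["state"] == "PENDING"
```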
Edited by Marios Hadjimichael