SQL Scheduler: Load job results that were stored by an external BuildGrid; Improve operation handling
Description
This MR addresses a few issues that caused the job result not to be loaded from the datastore in some cases. The changes proposed here make BuildGrid work better for clusters of many BuildGrids sharing the same sql-datastore (bots/clients can connect to any BuildGrid and their requests should be correctly fulfilled).
- For bots, enabling the permissive-botsession mode is required for the above to work seamlessly: https://buildgrid.build/user/configuration.html#persisting-internal-state
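For reference, a configuration fragment along these lines enables that mode. This is illustrative only: the key names and surrounding structure here are assumptions, not copied from the project; the linked documentation page is the authoritative source for the exact format.

```yaml
# Illustrative fragment only -- key names are assumptions based on the
# linked docs page; consult it for the authoritative configuration format.
schedulers:
  - !sql-scheduler
    connection-string: postgresql://bgd:insecure@database/bgd
    # Let bots reconnect to any BuildGrid in the cluster, even one that
    # holds no in-memory session state for them:
    permissive-bot-session: yes
```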
Checklist
Before raising this MR, consider whether the following are required, and complete if so:
- Unit tests – unit tests were already there; this MR updates them with the new assumptions and adds a test to make sure that job results are loaded from the db when updated outside of the current BuildGrid
- Metrics – no metrics updates needed
- Documentation update(s) – not applicable
Changes proposed in this merge request:
- `Job.py` no longer holds individual operation protos (and their states), which were hard and redundant to keep synced between in-memory and db state. It now holds a list of operation names (and cancelled operation names); when asked for an operation (with `get_operation()`), it creates the proto on demand, using the Job's response/metadata attached to the requested name.
- Scheduler interface: `create_operation` accepts a name for the operation to create (instead of an Operation proto), since we no longer have the operation proto and only the name was needed.
- SQL scheduler: only store the mapping from `job-name` -> `execute_response_proto` in the `response_cache` when there is a result (e.g. don't populate it with an empty ExecuteResponse on job creation). Populating it eagerly caused BuildGrid to skip reloading an execute response for a job it assumed it had already cached. This happens when a bot starts work assigned by BuildGrid A, then talks to BuildGrid B when done with the job; notifying BuildGrid A would reload the job (and its COMPLETED state), but bypassed reloading the ExecuteResponse proto since it assumed it was already cached.
- SQL `models.py`: updated so that getting an operation from the db and calling `to_protobuf` also includes the ExecuteResponse.
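The first two bullets can be sketched roughly as follows. This is a hypothetical illustration of the shape described above, not the actual BuildGrid code: the `Operation` dataclass stands in for the real `google.longrunning` Operation proto, and the method names mirror the MR description.

```python
# Sketch only: Job tracks operation *names* and materializes an operation
# record on demand from the job's own response/metadata, instead of keeping
# per-operation protos synced between memory and the database.
from dataclasses import dataclass, field


@dataclass
class Operation:
    # Stand-in for the google.longrunning Operation proto.
    name: str
    done: bool
    cancelled: bool
    response: object = None


@dataclass
class Job:
    name: str
    response: object = None  # set once the job has a result
    _operation_names: set = field(default_factory=set)
    _cancelled_operations: set = field(default_factory=set)

    def register_operation(self, operation_name: str) -> None:
        # The scheduler's create_operation() now passes just a name.
        self._operation_names.add(operation_name)

    def cancel_operation(self, operation_name: str) -> None:
        self._cancelled_operations.add(operation_name)

    def get_operation(self, operation_name: str) -> Operation:
        # Build the operation on demand from the Job's current state.
        if operation_name not in self._operation_names:
            raise KeyError(operation_name)
        return Operation(
            name=operation_name,
            done=self.response is not None,
            cancelled=operation_name in self._cancelled_operations,
            response=self.response,
        )
```

Because `get_operation()` derives everything from the Job's current state, an ExecuteResponse reloaded from the shared database is reflected in every operation view without any per-operation bookkeeping.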
Validation
You can test this using the stress-testing dockerfiles. Here's the script I have been using, for reference:

```shell
STRESS_TESTING_DISABLE_PLATFORM_PROPERTIES=yes \
STRESS_TESTING_BGD_CONFIG=bgd-pges-permissive-botsession.yml \
STRESS_TESTING_CLIENT_REQUESTS=100 \
docker-compose -f compose-files/pges-recc-bbrht.yml up \
  --scale buildgrid=1 --scale bots=1 \
  --scale clients=1 --scale database=1 -d &

sleep 60 # allow some time for action to start happening with buildgrid_1

# Now add another buildgrid
STRESS_TESTING_DISABLE_PLATFORM_PROPERTIES=yes \
STRESS_TESTING_BGD_CONFIG=bgd-pges-permissive-botsession.yml \
STRESS_TESTING_CLIENT_REQUESTS=100 \
docker-compose -f compose-files/pges-recc-bbrht.yml up \
  --scale buildgrid=2 --scale bots=1 \
  --scale clients=1 --scale database=1 -d &

# OPTIONAL: (testing what happens when a buildgrid goes away in addition to the above)
# Wait for the new buildgrid to be up and able to do work,
# and then stop the first buildgrid
sleep 20
docker stop compose-files_buildgrid_1
```
Follow up items
- Currently we only persist leases to the database once the bots ack them, which means that a bot has to talk to the BuildGrid that initially assigned it the lease at least once after receiving it; otherwise other BuildGrids may not know about the lease. It may make sense to store the lease right away to avoid this problem, and to refactor lease handling more generally. This is related to: #329
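The timing issue behind this follow-up can be illustrated with a hypothetical sketch (all names invented; the dict stands in for the shared SQL datastore). Persisting the lease at assignment time means any BuildGrid sharing the datastore can already see it when the bot's next update arrives, even if that update goes to a different BuildGrid.

```python
# Sketch of the proposed write-through-on-assignment behaviour.
# `leases_table` stands in for the shared SQL datastore.
leases_table = {}


def assign_lease(job_name: str, bot_name: str) -> None:
    # Proposed: persist the lease immediately on assignment, rather than
    # waiting for the bot's first ack.
    leases_table[job_name] = {"bot": bot_name, "state": "PENDING"}


def ack_lease(job_name: str, bot_name: str) -> None:
    # Any BuildGrid sharing the datastore can handle the ack, because the
    # lease row already exists; today it would only exist after the bot
    # acked to the assigning BuildGrid.
    lease = leases_table[job_name]
    assert lease["bot"] == bot_name
    lease["state"] = "ACTIVE"
```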