Add support for "watching" job state
Description
Currently, there are two sides of BuildGrid which communicate with each other via the shared Scheduler. These sides are the ExecutionService and the BotsService. A peer sends an Execute request to the ExecutionService, which creates a Job to hand out to a bot on demand. It also creates a message queue related to that job for sending progress updates back to the peer. Meanwhile, bots request jobs from the BotsService. This hands out the job that the peer requested and, when the bot reports a state change, sends a message to the relevant peers via the message queues tracked by the scheduler.
This works nicely for a single-scheduler BuildGrid, but it limits horizontal scaling: the bot must be connected to the same instance as every peer for a given job, otherwise this communication breaks down.
This MR removes all of that communication infrastructure, to break the dependency on sharing a Scheduler between bots and peers for a given job. This will allow us to horizontally scale the core ExecutionService and BotsService components without adversely affecting communication.
It works by replacing the message queue created by a peer with a thread which waits for changes to the job in the shared data store (i.e. the database used by all the instances). If using a PostgreSQL database, this waiting is implemented using LISTEN and NOTIFY, so the thread reacts as soon as changes are made. Otherwise, it relies on polling the data store at a regular interval.
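As a rough illustration of the polling fallback described above, here is a minimal sketch. The `PollingJobWatcher` class, its `get_state` and `on_change` callables, and the `interval` parameter are all hypothetical names chosen for this example; they are not taken from the BuildGrid code, and a plain dict stands in for the shared data store.

```python
import threading


class PollingJobWatcher:
    """Hypothetical watcher: polls a data store and reports job state changes."""

    def __init__(self, get_state, on_change, interval=1.0):
        self._get_state = get_state    # callable returning the job's current state
        self._on_change = on_change    # called with the new state on every change
        self._interval = interval      # seconds between polls (assumed value)
        self._last_seen = get_state()  # baseline to compare later polls against
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        # Wake the polling loop and wait for the thread to exit.
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait doubles as an interruptible sleep between polls;
        # it returns True (ending the loop) once stop() has been called.
        while not self._stop.wait(self._interval):
            current = self._get_state()
            if current != self._last_seen:
                self._last_seen = current
                self._on_change(current)
```

With PostgreSQL, the loop body would instead block on a notification from the database rather than sleeping, which removes the latency of up to one polling interval.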
The Scheduler keeps track of this thread, and also creates a threading.Event which is set and cleared when the thread detects a change in the job. Once the thread is created, the peer repeatedly waits for this event until the job is completed or cancelled, or the connection is closed. If a new peer makes a WaitExecution request for the same job, it simply waits for the same event, rather than creating a new watcher thread.
Once all connections are closed, the watcher thread stops watching for changes.
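The bookkeeping described above (one shared event per job, reused by later peers, and dropped once the last peer disconnects) could look roughly like the sketch below. `WatchSpec`, `WatchRegistry`, and their methods are hypothetical names for illustration only, and the watcher thread itself is left out so that only the sharing logic is shown.

```python
import threading


class WatchSpec:
    """Hypothetical record of one watched job and the peers waiting on it."""

    def __init__(self, job_name):
        self.job_name = job_name
        self.event = threading.Event()  # shared by every peer of this job
        self.peers = set()


class WatchRegistry:
    """Hypothetical scheduler-side registry of watched jobs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._specs = {}

    def add_peer(self, job_name, peer_id):
        # Reuse the existing spec (and its event) if the job is already watched.
        with self._lock:
            spec = self._specs.get(job_name)
            if spec is None:
                spec = WatchSpec(job_name)
                self._specs[job_name] = spec
            spec.peers.add(peer_id)
            return spec

    def remove_peer(self, job_name, peer_id):
        # Stop watching the job once its last peer has gone away.
        with self._lock:
            spec = self._specs.get(job_name)
            if spec is None:
                return
            spec.peers.discard(peer_id)
            if not spec.peers:
                del self._specs[job_name]

    def is_watched(self, job_name):
        with self._lock:
            return job_name in self._specs

    def notify_change(self, job_name):
        # Called by the watcher: wake the peers currently waiting, then reset.
        with self._lock:
            spec = self._specs.get(job_name)
        if spec is not None:
            spec.event.set()
            spec.event.clear()
```

A peer handling Execute or WaitExecution would call `add_peer`, loop on `spec.event.wait(...)` and re-read the job after each wake-up, then call `remove_peer` when the job finishes or the connection closes.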
Changes proposed in this merge request:
- Create threads to watch job state for changes
- Remove the existing message queue based communication system
- Wait for events emitted from the watcher threads to handle communicating back to peers
- Modify the test suite to account for these changes
Todo
- Update tests that are now broken
- Use PostgreSQL LISTEN/NOTIFY to wait for changes rather than polling
- Add some tests for parallel instantiation of the ExecutionService (this might end up in a separate MR)