BuildGrid 2.0
Problem Statement
The current design of BuildGrid relies too heavily on inspecting state stored in a central database to scale as well as we need. This leads to contention for database connections, and increases the time between a client sending an Execute request and the requested Action actually being handed to a worker for execution (assuming the queue time is reducible by simply adding more workers).
All of the state we need to access and keep consistent also makes the individual services more complex than is strictly necessary. Sharing the state between services worsens the contention issue, and makes it harder to ensure that the services are properly independent (for example, so that each one can be deployed and scaled individually).
Using a database to represent the internal queue of work also makes assigning work more complex than it needs to be, since we have to handle race conditions around work being selected and claimed by workers, rather than just popping items off a queue. This also has implications for long-polling bots: they need to regularly check for new work, rather than simply waiting to receive some as they would in a queue-based solution.
Design
For full details, see the mailing list thread about this work. In brief, the idea is to remove all of the shared state that currently lives in the SQL database in favour of communicating changes via RabbitMQ exchanges. This lets us remove or simplify a number of things, in particular:
The Job class
The Scheduler class
The DataStoreInterface class (and its implementations)
Actions will be sent to a RabbitMQ exchange that routes them into a queue based on their platform requirements. Bots services will consume Actions from those queues based on the capabilities of workers that are currently connected.
Updates on Action execution will be broadcast from the Bots services using another exchange in the form of an Operation message. Other services will consume from that exchange to do anything they need to with Operations, such as persisting their state, communicating the update to clients, or caching the result.
Cancellation of running work will be handled by the Operations services broadcasting, on another exchange, the digests of Actions whose execution should be cancelled. Bots services will consume these and check them against the Actions mentioned in the UpdateBotSession requests they receive.
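To make the platform-based routing described above concrete, here is a minimal sketch of how an Action's platform requirements could be turned into a deterministic routing key for the Actions exchange. The function name and key format are illustrative assumptions, not the actual BuildGrid wire format.

```python
# Sketch: deriving a RabbitMQ routing key from an Action's platform
# requirements. The key format (name=value pairs joined with dots) is
# an illustrative assumption, not a decided format.

def platform_routing_key(properties: dict) -> str:
    """Build a deterministic routing key from platform properties.

    Sorting by property name ensures that two Actions with the same
    requirements always map to the same queue, regardless of the
    order the client listed the properties in.
    """
    parts = [f"{name}={value}" for name, value in sorted(properties.items())]
    return ".".join(parts) or "default"

# An Action requiring a Linux x86-64 worker would be routed with:
key = platform_routing_key({"OSFamily": "linux", "ISA": "x86-64"})
```

Because the key is a pure function of the requirements, Bots services can compute the same key from their connected workers' capabilities to know which queues to consume from.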
Notable service-specific points:
Execution service
- Send Action messages to an exchange to queue them for execution
- Consume Operation messages and use them to
  - populate a local cache of operation states to rapidly respond to WaitExecution requests.
  - stream updates on execution back to clients.
- Send Operation messages to inform other services of
  - an Action being requested and a cache lookup being performed
  - a new Action being enqueued
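The Execution service's local cache of operation states could look something like the sketch below. The message shape is an assumption for illustration; the real messages would be serialized Operation protos consumed from the exchange.

```python
# Sketch of the Execution service's in-memory view of operation state,
# populated from consumed Operation messages, so WaitExecution requests
# can be answered without touching persistent storage. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class OperationCache:
    _states: dict = field(default_factory=dict)

    def update(self, operation_name: str, stage: str, done: bool = False):
        # Called for every Operation message consumed from the exchange.
        self._states[operation_name] = {"stage": stage, "done": done}

    def get(self, operation_name: str):
        # Used to answer a WaitExecution request immediately when the
        # operation's latest state is already known locally.
        return self._states.get(operation_name)
```

Since every Execution service instance consumes the full Operation stream, each instance converges on the same view without any shared database.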
Bots service
- Consume Action messages from relevant queues and hand them off to workers
- Keep track of what workers are connected with what capabilities
- Create and broadcast Operation messages to an exchange
- Consume cancellation messages and track which Action digests should have their execution cancelled if they are seen in subsequent UpdateBotSession requests
- Broadcast worker connection messages for a separate service to track worker health
- Allow workers to send an UpdateBotSession request without first sending a CreateBotSession request to the same Bots service
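Keeping track of which workers are connected with which capabilities could be sketched as below; grouping workers by their capability profile tells the service which platform queues it needs an open consumer for. All names here are illustrative assumptions.

```python
# Sketch: tracking connected workers' capabilities so the Bots service
# knows which platform-specific queues to consume from. Illustrative
# data model, not the actual BuildGrid implementation.

class WorkerRegistry:
    def __init__(self):
        # bot_name -> frozenset of (property, value) pairs
        self._workers = {}

    def add_worker(self, bot_name, capabilities):
        # Called when a CreateBotSession (or first UpdateBotSession)
        # arrives from a worker.
        self._workers[bot_name] = frozenset(capabilities.items())

    def remove_worker(self, bot_name):
        self._workers.pop(bot_name, None)

    def active_queues(self):
        """The distinct capability profiles with at least one connected
        worker; the service keeps a queue consumer open per profile."""
        return set(self._workers.values())
```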
Operations service
- Consume Operation messages and use them to populate a persistent store (probably just a key/value store in Redis)
- Handle "reference counting" Operations for a given Action, sending cancellation messages if the last Operation for an Action is cancelled.
- (stretch) Support for filtering/paging in ListOperations
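The "reference counting" point above can be sketched as follows: a cancellation message is only broadcast once the last live Operation for an Action is cancelled. The callback wiring is an assumption; in practice it would publish to the cancellation exchange.

```python
# Sketch of per-Action reference counting in the Operations service:
# broadcast a cancellation only when the last client's Operation for an
# Action is cancelled. The send_cancellation callback stands in for
# publishing to the cancellation exchange.

class ActionRefCounter:
    def __init__(self, send_cancellation):
        self._refs = {}  # action_digest -> count of live Operations
        self._send_cancellation = send_cancellation

    def operation_created(self, action_digest):
        self._refs[action_digest] = self._refs.get(action_digest, 0) + 1

    def operation_cancelled(self, action_digest):
        remaining = self._refs.get(action_digest, 0) - 1
        if remaining <= 0:
            # Last interested client is gone: tell the Bots services
            # to stop executing this Action.
            self._refs.pop(action_digest, None)
            self._send_cancellation(action_digest)
        else:
            self._refs[action_digest] = remaining
```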
ActionCache service
- Consume Operation messages and use them to
  - populate the cache if they have a cacheable result attached.
  - populate a cache of Action digests that it expects to see results for in the future.
- Also support checking the cache of expected Action digests via a custom gRPC request.
- Consume cancellation messages to maintain the cache of expected digests
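The expected-results cache could be as simple as the sketch below: a queued Operation adds its Action digest, while a stored result or a cancellation removes it. Method names are illustrative assumptions.

```python
# Sketch of the ActionCache's auxiliary cache of Action digests it
# expects to have results for in the future, maintained purely from
# consumed Operation and cancellation messages. Illustrative only.

class ExpectedResults:
    def __init__(self):
        self._expected = set()

    def action_queued(self, action_digest):
        # An Operation message says this Action was enqueued.
        self._expected.add(action_digest)

    def result_stored(self, action_digest):
        # The result arrived and was cached; no longer "expected".
        self._expected.discard(action_digest)

    def action_cancelled(self, action_digest):
        # A cancellation message means no result will arrive.
        self._expected.discard(action_digest)

    def is_expected(self, action_digest):
        # Backs the custom gRPC request for checking pending results.
        return action_digest in self._expected
```

Clients (or other services) could use the expected-digest check to avoid requesting execution of an Action whose result is already on the way.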
New stuff
- A service to consume worker connection messages and use them to
  - track the health of workers connected to the grid.
  - requeue Actions that are being handled by workers that have become unhealthy.
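The new health-tracking service could work along these lines: record a timestamp per worker on each connection message, and requeue the Actions held by any worker that goes quiet for too long. The timeout value, callback, and injectable clock are assumptions for illustration.

```python
# Sketch of the new worker-health service: consume worker connection /
# heartbeat messages, and requeue the Actions leased to workers that
# have stopped reporting. The requeue callback stands in for publishing
# the Action back to its platform queue.
import time

class WorkerHealthTracker:
    def __init__(self, requeue, timeout=60.0, clock=time.monotonic):
        self._requeue = requeue      # called with each orphaned Action digest
        self._timeout = timeout
        self._clock = clock
        self._last_seen = {}         # bot_name -> last heartbeat timestamp
        self._leases = {}            # bot_name -> set of Action digests

    def heartbeat(self, bot_name):
        self._last_seen[bot_name] = self._clock()

    def action_assigned(self, bot_name, action_digest):
        self._leases.setdefault(bot_name, set()).add(action_digest)

    def reap_unhealthy(self):
        # Run periodically: any worker silent for longer than the
        # timeout is considered unhealthy and its work is requeued.
        now = self._clock()
        for bot, seen in list(self._last_seen.items()):
            if now - seen > self._timeout:
                for digest in self._leases.pop(bot, set()):
                    self._requeue(digest)
                del self._last_seen[bot]
```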
Open Questions
- A final definition for Operation naming
  - Is <action_digest>/<random_uuid> sufficient?
- How do we want to handle gathering grid-level metrics, rather than metrics for an individual instantiation of a service (where we could have multiple processes for the same instance name sharing load)?
- Are named buckets sufficient for handling inexact matching (see https://github.com/bazelbuild/remote-apis/issues/117 )?
- Should we allow specifying platform properties to not match on?
- For example, if we know our workers can handle all values of chrootDigest, we don't need to match on it (and therefore can reduce the number of queues we need)
- Broadcasting ActionResults on a separate queue was discussed, before we decided to keep deduplication by extending the ActionCache. We should consider whether this is still something we want to implement.
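To make the Operation naming question above concrete, a sketch of the proposed <action_digest>/<random_uuid> scheme is shown below. Whether this becomes the final format is still open; the sketch just illustrates the properties it would have (any service can recover the Action from the name without a lookup, and names stay unique when many clients request the same Action).

```python
# Sketch of the proposed <action_digest>/<random_uuid> naming scheme
# from the open question above; not a decided format.
import uuid

def make_operation_name(action_digest: str) -> str:
    # UUID suffix keeps names unique even when several clients request
    # execution of the same Action.
    return f"{action_digest}/{uuid.uuid4()}"

def action_digest_of(operation_name: str) -> str:
    # rsplit copes with digests that themselves contain a "/"
    # (e.g. the common "<hash>/<size_bytes>" form).
    return operation_name.rsplit("/", 1)[0]
```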
Acceptance Criteria
- The following services are completely independently instantiable and have no shared state:
- Execution service
- Operations service
- Bots service
- ActionCache service
- Actions are queued for execution in platform property specific queues consumed by one or more Bots services
- The aforementioned services communicate updates primarily using RabbitMQ exchanges
- The ActionCache maintains its cache independently, rather than receiving updates from other services
- The ActionCache can report on results it expects to have a cache entry for in the future