This page describes a new distributed architecture for a remote-execution service. This is work-in-progress!
Data Model
BuildGrid's execution service manipulates jobs submitted by peers and processed by bots:
- Peers submit Actions and get streamed Operations back, ultimately carrying an ActionResult.
- Bots receive Leases and produce ActionResults out of them.
- The Job class ties the REAPI and RWAPI messages together.
classDiagram
Job *-- Action: 1:1
Job *-- Operation: 1:n
Job *-- Lease: 1:(0,1)
Job o-- ActionResult: 1:(0,1)
Job: name [str]
Job: priority [int]
Job: n_tries [int]
Job: done [bool]
Operation: name [str]
Lease: name [str]
Peer <--> Operation
Bot <--> Lease
Peer: id [str]
Bot: id [str]
Setting aside any timing considerations:
- An Action maps to one and only one Job.
- The Job holds one Operation for each peer interested in its Action result.
- The Job holds a Lease if at least one of its Operations has reached the EXECUTING stage.
- The Job may hold an ActionResult if either:
  - The ActionCache hits for its Action.
  - Its Lease has reached COMPLETED and hasn't been cancelled.
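The cardinalities above can be made concrete with a small in-memory sketch. This is only an illustration, not BuildGrid's actual Job class: the REAPI and RWAPI messages are reduced to plain strings, and the names `action_digest`, `operations`, `register_peer` and `complete` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Job:
    """Hypothetical in-memory view of a Job; messages are reduced to strings."""

    name: str
    action_digest: str             # exactly one Action per Job (1:1)
    priority: int = 0
    n_tries: int = 0
    done: bool = False
    # One Operation per interested peer (1:n), keyed by peer id.
    operations: Dict[str, str] = field(default_factory=dict)
    lease: Optional[str] = None    # at most one Lease (1:(0,1))
    result: Optional[str] = None   # at most one ActionResult (1:(0,1))

    def register_peer(self, peer_id: str) -> str:
        """Return the Operation name for this peer, creating it on first use."""
        if peer_id not in self.operations:
            index = len(self.operations)
            self.operations[peer_id] = f"{self.name}/operations/{index}"
        return self.operations[peer_id]

    def complete(self, result: str) -> None:
        """Attach the ActionResult and mark the Job as done."""
        self.result = result
        self.done = True
```

Two peers requesting the same Action would share one Job but each get their own Operation, which is why `operations` is keyed by peer id in this sketch.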
Operations life-cycle:
An Operation passes through different stages:
graph LR;
UNKNOWN -- 1 --> CACHE_CHECK;
CACHE_CHECK -- 3 --> QUEUED;
UNKNOWN -- 2 --> QUEUED;
CACHE_CHECK -- 4 --> COMPLETED;
QUEUED -- 5 --> COMPLETED;
QUEUED -- 6 --> EXECUTING;
EXECUTING -- 7 --> QUEUED;
EXECUTING -- 8 --> COMPLETED;
Stage transition details:
1. If ExecuteRequest.skip_cache_lookup is False
2. If ExecuteRequest.skip_cache_lookup is True
3. On ActionCache miss
4. On ActionCache hit or cancellation
5. On cancellation
6. On scheduling
7. On bot failure
8. On success or build failure
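The stage graph above can be encoded as a transition table that rejects any move the diagram does not allow. The following is a minimal sketch under that reading; the helper names `advance` and `initial_stage` are hypothetical and not part of any existing API.

```python
from enum import Enum


class Stage(Enum):
    UNKNOWN = 0
    CACHE_CHECK = 1
    QUEUED = 2
    EXECUTING = 3
    COMPLETED = 4


# Edge number from the diagram -> (source stage, destination stage).
TRANSITIONS = {
    1: (Stage.UNKNOWN, Stage.CACHE_CHECK),
    2: (Stage.UNKNOWN, Stage.QUEUED),
    3: (Stage.CACHE_CHECK, Stage.QUEUED),
    4: (Stage.CACHE_CHECK, Stage.COMPLETED),
    5: (Stage.QUEUED, Stage.COMPLETED),
    6: (Stage.QUEUED, Stage.EXECUTING),
    7: (Stage.EXECUTING, Stage.QUEUED),
    8: (Stage.EXECUTING, Stage.COMPLETED),
}

ALLOWED = set(TRANSITIONS.values())


def advance(current: Stage, target: Stage) -> Stage:
    """Return the target stage, or raise if the diagram has no such edge."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target


def initial_stage(skip_cache_lookup: bool) -> Stage:
    """Pick the first stage after UNKNOWN (edges 1 and 2)."""
    target = Stage.QUEUED if skip_cache_lookup else Stage.CACHE_CHECK
    return advance(Stage.UNKNOWN, target)
```

Note that COMPLETED has no outgoing edge, so any attempt to move a finished Operation raises.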
Leases life-cycle:
A Lease passes through different states:
graph LR;
UNSPECIFIED -- 1 --> PENDING;
PENDING -- 2 --> CANCELLED;
PENDING -- 3 --> ACTIVE;
ACTIVE -- 4 --> PENDING;
ACTIVE -- 5 --> CANCELLED;
ACTIVE -- 6 --> COMPLETED;
State transition details:
1. On initial emission
2. On cancellation (server-side)
3. On acceptance (bot-side)
4. On bot failure
5. On cancellation (server-side)
6. On success or build failure
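The Lease life-cycle can also be read as an event-driven state machine, where each transition detail is an event applied to the current state. The sketch below assumes that reading; the event names (`emitted`, `accepted`, `cancelled`, `bot_failed`, `finished`) are made up for illustration and do not come from the RWAPI.

```python
from enum import Enum, auto
from typing import Iterable


class LeaseState(Enum):
    UNSPECIFIED = auto()
    PENDING = auto()
    ACTIVE = auto()
    COMPLETED = auto()
    CANCELLED = auto()


# (current state, event) -> next state; comments give the diagram edge number.
LEASE_TRANSITIONS = {
    (LeaseState.UNSPECIFIED, "emitted"): LeaseState.PENDING,   # 1
    (LeaseState.PENDING, "cancelled"): LeaseState.CANCELLED,   # 2
    (LeaseState.PENDING, "accepted"): LeaseState.ACTIVE,       # 3
    (LeaseState.ACTIVE, "bot_failed"): LeaseState.PENDING,     # 4
    (LeaseState.ACTIVE, "cancelled"): LeaseState.CANCELLED,    # 5
    (LeaseState.ACTIVE, "finished"): LeaseState.COMPLETED,     # 6
}


def on_event(state: LeaseState, event: str) -> LeaseState:
    """Apply one event, raising if the diagram has no matching edge."""
    try:
        return LEASE_TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in {state.name}") from None


def replay(events: Iterable[str]) -> LeaseState:
    """Fold a sequence of events starting from UNSPECIFIED."""
    state = LeaseState.UNSPECIFIED
    for event in events:
        state = on_event(state, event)
    return state
```

For instance, a Lease whose first bot fails and which is then picked up again would follow `emitted`, `accepted`, `bot_failed`, `accepted`, `finished`.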
High-level architecture
For an execution service to scale with the number of peers and bots
it handles simultaneously, the proposed architecture exposes the
REAPI and the RWAPI through two separate sets of n and m gRPC
front-end nodes, respectively, which can be dynamically spun up and
torn down. The global state is stored in a central database:
graph LR;
DB((DB));
Peer-1 -.-> RE-node-1;
Peer-2 -.-> RE-node-1;
Peer-3 -.-> RE-node-3;
subgraph Service;
RE-node-1 --- DB;
RE-node-2 --- DB;
RE-node-3 --- DB;
RE-node-n -.- DB;
DB --- RW-node-2;
DB --- RW-node-1;
DB -.- RW-node-m;
end;
RW-node-2 -.- Bot-1;
RW-node-2 -.- Bot-2;
RW-node-1 -.- Bot-3;
RW-node-1 -.- Bot-4;
Where:
- Peer: any REAPI client (BuildStream, Bazel, RECC...)
- Bot: any RWAPI bot (buildbox-worker, bgd-bot...)
- RE-node: remote-execution (REAPI) front-end node
- RW-node: remote-worker (RWAPI) front-end node
- DB: central state database
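The key property of this layout is that front-end nodes keep no state of their own: any RE-node can record a job and any RW-node can hand it out, because both only talk to the database. The sketch below illustrates that idea with the standard-library `sqlite3` module. The schema, the class names `RENode` and `RWNode`, and the rule that a lower priority value is served first are all assumptions made for the example. A single in-memory connection is shared here for brevity, whereas real nodes would each hold their own connection to a shared database.

```python
import sqlite3
from typing import Optional


def create_db() -> sqlite3.Connection:
    """Create the (hypothetical) central state database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        " name TEXT PRIMARY KEY,"
        " priority INTEGER NOT NULL,"
        " bot_id TEXT,"
        " done INTEGER NOT NULL DEFAULT 0)"
    )
    return db


class RENode:
    """Stateless REAPI front-end: records jobs submitted by peers."""

    def __init__(self, db: sqlite3.Connection) -> None:
        self.db = db

    def execute(self, job_name: str, priority: int = 0) -> None:
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR IGNORE INTO jobs (name, priority) VALUES (?, ?)",
                (job_name, priority),
            )


class RWNode:
    """Stateless RWAPI front-end: assigns pending jobs to bots."""

    def __init__(self, db: sqlite3.Connection) -> None:
        self.db = db

    def assign(self, bot_id: str) -> Optional[str]:
        with self.db:
            row = self.db.execute(
                "SELECT name FROM jobs WHERE bot_id IS NULL AND done = 0"
                " ORDER BY priority, name LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self.db.execute(
                "UPDATE jobs SET bot_id = ? WHERE name = ?", (bot_id, row[0])
            )
            return row[0]
```

Because the nodes hold nothing but a database handle, adding or removing one does not require migrating any state.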
RE front-end nodes
graph LR;
DB((DB));
CC{Controller};
Peer -.- Capabilities;
Peer -.- ActionCache;
Peer -.- Execution;
Peer -.- Operations;
subgraph RE-node;
Capabilities --- CC;
ActionCache --- CC;
Execution --- CC;
Operations --- CC;
CC --- DB-driver;
CC --- CAS-client;
end;
DB-driver -.- DB;
CAS-client -.- CAS;
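One way to read the RE-node diagram is as dependency injection: each gRPC service is a thin shell that delegates to a single Controller, and the Controller is the only component that talks to the DB driver and the CAS client. The sketch below shows that wiring for the Execution service only. Every method name (`save_job`, `has_blob`, `submit`, `execute`) and the check that the Action exists in CAS before a job is recorded are assumptions for illustration.

```python
from typing import Dict, Iterable, Protocol


class DBDriver(Protocol):
    def save_job(self, job_name: str, action_digest: str) -> None: ...


class CASClient(Protocol):
    def has_blob(self, digest: str) -> bool: ...


class Controller:
    """Single entry point shared by all services of a node."""

    def __init__(self, db: DBDriver, cas: CASClient) -> None:
        self.db = db
        self.cas = cas

    def submit(self, action_digest: str) -> str:
        # Assumed rule: refuse actions whose blob is missing from CAS.
        if not self.cas.has_blob(action_digest):
            raise LookupError(f"action {action_digest!r} not found in CAS")
        job_name = f"jobs/{action_digest}"
        self.db.save_job(job_name, action_digest)
        return job_name


class ExecutionService:
    """Thin service shell: no logic of its own, only delegation."""

    def __init__(self, controller: Controller) -> None:
        self.controller = controller

    def execute(self, action_digest: str) -> str:
        return self.controller.submit(action_digest)


class InMemoryDB:
    """Stand-in for the DB driver, for local experiments only."""

    def __init__(self) -> None:
        self.jobs: Dict[str, str] = {}

    def save_job(self, job_name: str, action_digest: str) -> None:
        self.jobs[job_name] = action_digest


class InMemoryCAS:
    """Stand-in for the CAS client, for local experiments only."""

    def __init__(self, blobs: Iterable[str]) -> None:
        self.blobs = set(blobs)

    def has_blob(self, digest: str) -> bool:
        return digest in self.blobs
```

Keeping the services free of logic means the same Controller could back the Capabilities, ActionCache and Operations services without duplicating database or CAS access code.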
RW front-end nodes
graph LR;
DB((DB));
CC{Controller};
DB -.- DB-driver;
subgraph RW-node;
CC --- Capabilities;
CC --- Bots;
DB-driver --- CC;
end;
Capabilities -.- Bot;
Bots -.- Bot;