Runway (Experimentation Spaces) Architecture

The goal of this issue is to lay out a high-level architecture that divides the Experimentation Spaces product into distinct responsibilities, so that we can work on them independently and, if needed, swap out individual components.

Responsibilities

The responsibilities we have identified thus far are:

Provisioning
Deployment
Reconciliation
Runtime
Observability

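By designing interfaces between them, we can de-risk lock-in to a particular implementation. This needs to be balanced against the risk of a least-common-denominator design, where the interface exposes only what every implementation supports and we end up with the worst of all worlds.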
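As a rough illustration of what an interface between two responsibilities could look like, here is a minimal sketch in Python. The names `Runtime`, `InMemoryRuntime`, `run`, `current_image`, and `deploy` are all hypothetical and not part of any agreed design; the point is only that the deployment side depends on a contract rather than on a concrete runtime.

```python
from typing import Dict, Optional, Protocol


class Runtime(Protocol):
    """Hypothetical contract that the deployment component codes against."""

    def run(self, service: str, image: str) -> None: ...

    def current_image(self, service: str) -> Optional[str]: ...


class InMemoryRuntime:
    """Stand-in implementation; a real one would wrap an actual platform."""

    def __init__(self) -> None:
        self._images: Dict[str, str] = {}

    def run(self, service: str, image: str) -> None:
        self._images[service] = image

    def current_image(self, service: str) -> Optional[str]:
        return self._images.get(service)


def deploy(runtime: Runtime, service: str, image: str) -> None:
    """Deployment only knows the Runtime contract, not the implementation."""
    if runtime.current_image(service) != image:
        runtime.run(service, image)
```

Any other class with the same two methods could be passed to `deploy` without changing it, which is the kind of swap the interfaces are meant to allow.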

Provisioning https://gitlab.com/gitlab-com/gl-infra/platform/experimentation-spaces/-/issues/9

The provisioning process is responsible for taking a request such as "create an experimentation space for me" and stamping out the minimum required infrastructure for that space. It also covers decommissioning when a space is no longer needed.

Deployment https://gitlab.com/gitlab-com/gl-infra/platform/experimentation-spaces/-/issues/10

The deployment process is responsible for taking an artifact (e.g. a Docker image) from a customer and bringing it into a runtime. This includes rollout strategies, rollbacks, canarying, and multi-environment promotion, as well as diagnostic tools for failed deploys. Some of these capabilities may also be delegated to the runtime. There should also be a standard way of connecting an existing code base to a deployment.
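To make multi-environment promotion with rollback a little more concrete, here is a small sketch. It assumes a hypothetical `promote` function, a plain dict mapping environment names to deployed images, and a caller-supplied `healthy` check; none of these come from the actual design, and a real implementation might delegate parts of this to the runtime.

```python
from typing import Callable, Dict, List


def promote(
    image: str,
    environments: List[str],
    deployed: Dict[str, str],
    healthy: Callable[[str, str], bool],
) -> List[str]:
    """Deploy `image` to each environment in order.

    Stops at the first environment where the health check fails, restores
    that environment's previous image, and returns the environments that
    were successfully promoted.
    """
    promoted: List[str] = []
    for env in environments:
        previous = deployed.get(env)
        deployed[env] = image
        if not healthy(env, image):
            # Roll back: restore the previous image, or remove the entry
            # if nothing was deployed there before.
            if previous is None:
                del deployed[env]
            else:
                deployed[env] = previous
            break
        promoted.append(env)
    return promoted
```

Later environments are left untouched once an earlier one fails, so a bad artifact stops at the first stage that detects it.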

Reconciliation https://gitlab.com/gitlab-com/gl-infra/platform/experimentation-spaces/-/issues/11

The Reconciler is the heart of the system. It is responsible for creating a desired view of the world (based on service definition and current version), finding the differences from the actual state, and then applying that diff. It will also require some form of storage.
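The desired-state, diff, and apply steps described above can be sketched as follows. This is only an illustration under stated assumptions: state is modelled as plain dicts keyed by resource name, and `Action`, `build_desired`, `diff`, `apply`, and `reconcile` are hypothetical names, not an agreed interface. Storage is left out entirely.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    kind: str                    # "create", "update", or "delete"
    name: str                    # resource name
    spec: Optional[dict] = None  # target spec; None for deletes


def build_desired(service_definition: Dict[str, dict], version: str) -> Dict[str, dict]:
    """Derive the desired resources from a service definition and a version."""
    return {
        name: {**spec, "version": version}
        for name, spec in service_definition.items()
    }


def diff(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Action]:
    """Return the actions that would turn `actual` into `desired`."""
    actions: List[Action] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(Action("create", name, desired[name]))
        elif actual[name] != desired[name]:
            actions.append(Action("update", name, desired[name]))
    for name in sorted(actual):
        if name not in desired:
            actions.append(Action("delete", name))
    return actions


def apply(actions: List[Action], actual: Dict[str, dict]) -> Dict[str, dict]:
    """Apply actions to a copy of the actual state and return the result."""
    result = dict(actual)
    for action in actions:
        if action.kind == "delete":
            result.pop(action.name, None)
        else:
            result[action.name] = action.spec
    return result


def reconcile(
    service_definition: Dict[str, dict], version: str, actual: Dict[str, dict]
) -> Tuple[Dict[str, dict], List[Action]]:
    """Run one reconciliation pass: desired view, diff, then apply."""
    desired = build_desired(service_definition, version)
    actions = diff(desired, actual)
    return apply(actions, actual), actions
```

Running `reconcile` a second time on its own output yields no actions, which is the property that makes it safe to run the loop repeatedly.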

Runtime https://gitlab.com/gitlab-com/gl-infra/platform/experimentation-spaces/-/issues/2

The runtime is responsible for actually scheduling and running the customer's workloads. Deployment targets a runtime. The runtime will provide autoscaled compute resources with a degree of tenant isolation. It will also optionally expose an endpoint at which the workload can be reached. This endpoint will have a DNS name and be TLS-encrypted.

Observability https://gitlab.com/gitlab-com/gl-infra/platform/experimentation-spaces/-/issues/12

The observability stack serves two purposes. First, it allows customers to operate their applications. Second, it connects to existing monitoring, alerting, and capacity planning processes owned by Infrastructure.

Diagram

runway-arch.key

Note: Service Developers (formerly "Customers") are GitLab team members who develop and deploy the "experiment" and the code within it.

Next steps

Get team sign-off on architecture
Design interfaces
Begin design for each component

Thanks to @andrewn, @ggillies, and @cfeick for early input.

Edited May 16, 2023 by Igor