Initial setup for AMI pipeline (!2) · Merge requests · GitLab.org / Ops Sub-Department / shared-runners / images / AWS / GRIT images

Davis Bickford requested to merge initial-setup-v2 into main Apr 10, 2024

This MR adds the initial setup for a pipeline that generates an AMI used to build ephemeral machines in Linux environments in GRIT. The basic design is as follows:

Design proposal (discussion thread here)

Based on the discussion in my initial draft MR with @josephburnett, I drew up a basic design for our AMI pipeline:

Source-of-truth for AMIs: Maintain a Terraform configuration file (e.g. manifest.json) in the project that serves as the single source of truth for AMIs. This file would list all AMIs to be distributed and their target regions, structured like:

{
  "aws-linux-ephemeral": {
    "us-east-1": {
      "architecture": "x86_64",
      "deprecation_time": "",
      "id": "ami-048b8191da9c5e202",
      "name": "ubuntu",
      "public": true
    },
    "us-east-2": {
      "architecture": "x86_64",
      "deprecation_time": "",
      "id": "ami-00be98396005a8a31",
      "name": "ubuntu",
      "public": true
    },
    "us-west-1": {
      "architecture": "x86_64",
      "deprecation_time": "",
      "id": "ami-0253f2f2179c1b843",
      "name": "ubuntu",
      "public": true
    }
  }
}
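As a minimal sketch of how a consumer might read this manifest, the snippet below parses a trimmed copy of the sample and looks up the AMI ID for a region. It assumes the top-level key is the image use case and the second-level key is the AWS region; the function name `lookup_ami_id` is hypothetical and not part of this MR.

```python
import json

# Trimmed copy of the sample manifest above (one region only).
MANIFEST_JSON = """
{
  "aws-linux-ephemeral": {
    "us-east-1": {
      "architecture": "x86_64",
      "deprecation_time": "",
      "id": "ami-048b8191da9c5e202",
      "name": "ubuntu",
      "public": true
    }
  }
}
"""


def lookup_ami_id(manifest, use_case, region):
    """Return the AMI ID for a use case in a region (hypothetical helper).

    Raises KeyError with a readable message when the use case or region
    is not listed in the manifest.
    """
    try:
        return manifest[use_case][region]["id"]
    except KeyError as exc:
        raise KeyError(
            f"no AMI for use case {use_case!r} in region {region!r}"
        ) from exc


manifest = json.loads(MANIFEST_JSON)
ami_id = lookup_ami_id(manifest, "aws-linux-ephemeral", "us-east-1")
```

Failing loudly on a missing region, rather than falling back to another region's ID, keeps the manifest the only place where an AMI can be selected.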

Two-phase pipeline (not to be confused with stages; each phase can have one or many pipeline stages):
- Build Phase: Triggered by a merge request, this phase builds and tests (leveraging GRIT) a new candidate AMI. This happens in our sandbox AWS account.
- Publish Phase: Manual stages of the pipeline copy the AMI to the distribution account (`runner-grit-images`) and apply the Terraform plan to replicate and publish the AMIs publicly. They then generate a follow-up merge request in GRIT with the latest manifest (e.g. manifest.json) containing the newly created AMI IDs.
AMI Release Cycle:
- Promotion: Promote the "latest" image in the publish pipeline when tagging a release. A "stable" promotion occurs at most once per milestone, ensuring a controlled, predictable release of thoroughly tested AMIs. Ideally, we find a way to automate this publishing MR so that simply approving it publishes and promotes simultaneously.
AMI naming: Introduce a naming convention for candidate images and public images to distinguish them from one another in AWS.
Disaster Recovery and Auditing: The Terraform configuration file can act as a deployment mechanism and also enable disaster recovery by allowing us to roll back to a previous AMI with a git revert. It also provides an audit trail for changes and the current state of AMI distribution.
Drift detection: A scheduled job will run daily to ensure that the AMIs in the manifests of tagged versions of GRIT are all still present and accessible. It will also list AMIs that are present in the project but not needed by any GRIT version, and can therefore be deleted.
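The drift check described above reduces to two set differences. The sketch below is an illustration only: it assumes the manifests of all tagged GRIT versions have already been loaded as dictionaries, and that the set of AMI IDs actually present in the AWS account has been fetched separately (that lookup is not shown). The function names and the AMI IDs are hypothetical placeholders.

```python
def required_ami_ids(manifests):
    """Collect every AMI ID referenced by any tagged version's manifest."""
    return {
        entry["id"]
        for manifest in manifests
        for regions in manifest.values()
        for entry in regions.values()
    }


def detect_drift(required, present):
    """Compare the AMIs the manifests need against the AMIs that exist.

    'missing' are needed by some tagged version but not present;
    'unused' are present but referenced by no tagged version.
    """
    return {
        "missing": sorted(required - present),
        "unused": sorted(present - required),
    }


# Placeholder manifests for two tagged versions (IDs are made up).
tagged_manifests = [
    {"aws-linux-ephemeral": {"us-east-1": {"id": "ami-a"}, "us-east-2": {"id": "ami-b"}}},
    {"aws-linux-ephemeral": {"us-east-1": {"id": "ami-c"}}},
]
present_in_account = {"ami-a", "ami-c", "ami-z"}

report = detect_drift(required_ami_ids(tagged_manifests), present_in_account)
```

Here `report["missing"]` would fail the scheduled job, while `report["unused"]` is only a list of deletion candidates for a human to review.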

How This MR's Pipeline is Structured:

CI/CD Pipeline Scripts and Configurations

Common Variables and Settings (00_common.gitlab-ci.yml): Defines Docker versions, AWS region, image use cases, and sets up a Docker-in-Docker environment for creating the CI image that is used to run the remaining jobs in the pipeline.
Pipeline Rules (00_rules.gitlab-ci.yml): Specifies conditions under which different pipeline jobs should run, including merge request pipelines, default branch updates, stable releases, and cleanup tasks.
Pipeline Stages:
- Prepare (10_prepare.gitlab-ci.yml, 11_build_ci_image.gitlab-ci.yml): Configures the environment and builds a CI image with the necessary dependencies (Packer, Terraform, etc.).
- End-to-End Testing (20_e2e.gitlab-ci.yml): Defines jobs for setting up, executing, and tearing down end-to-end tests with newly created AMIs to ensure they work as expected.
- AMI Building and Replication (30_linux_images.gitlab-ci.yml): Defines jobs for building AWS AMIs (only for Linux for now, but designed to be expanded with other use cases), replicating them across regions, publishing them to be publicly available, and updating a manifest in a GitLab project (GRIT) with the new AMI IDs.

Makefiles

These files include most of the core operations of the jobs: building the CI image (Makefile.ci-image.mk), running end-to-end tests (Makefile.e2e.mk), building AMIs (Makefile.grit-image.mk), and other utilities to support the CI/CD pipeline.

How the Pipeline Works:

Prepare: Initializes the environment and builds a Docker image with the tools and dependencies needed by the CI pipeline. This only runs when changes are made to the Dockerfile or to dependencies specific to the job.
Build: Uses HashiCorp Packer to generate a new AWS AMI. Each use case maps to a config file and a vars file.
End-to-End Testing: Manual job. Deploys the built artifacts in a staging environment to run automated tests. It currently supports running only one use case; we will need to design a follow-up that runs each use case sequentially so we don't spin up multiple infrastructures simultaneously.
Replicate: Manual job. Copies the AWS AMIs to the regions needed for general availability and for dedicated runners.
Publish: Manual job. Makes the replicated AMIs publicly available.
Document: Generates a final manifest and MR within GRIT.
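To illustrate the Document step, the sketch below shows one way the manifest could be updated with newly created AMI IDs before opening the follow-up MR. It is not the implementation in this MR: the helper names `update_manifest` and `render_manifest` are hypothetical, and the replacement AMI ID is a placeholder. It assumes only the `id` field changes and every other field of an existing entry is preserved.

```python
import copy
import json


def update_manifest(manifest, use_case, new_ids):
    """Return a copy of the manifest with new AMI IDs for one use case.

    new_ids maps region -> AMI ID. The input manifest is not mutated,
    and the other fields of existing entries are left untouched.
    """
    updated = copy.deepcopy(manifest)
    regions = updated.setdefault(use_case, {})
    for region, ami_id in new_ids.items():
        regions.setdefault(region, {})["id"] = ami_id
    return updated


def render_manifest(manifest):
    """Serialize with sorted keys so repeated runs produce stable diffs."""
    return json.dumps(manifest, indent=2, sort_keys=True) + "\n"


current = {
    "aws-linux-ephemeral": {
        "us-east-1": {
            "architecture": "x86_64",
            "deprecation_time": "",
            "id": "ami-048b8191da9c5e202",
            "name": "ubuntu",
            "public": True,
        }
    }
}

# Placeholder ID standing in for the output of the Replicate job.
updated = update_manifest(
    current, "aws-linux-ephemeral", {"us-east-1": "ami-00000000000000000"}
)
manifest_text = render_manifest(updated)
```

Rendering with sorted keys keeps the follow-up MR's diff limited to the changed IDs, which also keeps a rollback by git revert easy to review.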

TODO:

Scheduled job: A recurring job that will detect drift and ensure all AMIs that are needed by tagged GRIT versions are available in the AWS project.
Get GRIT test functional
Add all regions to replicate to (list here)
Switch jobs that are still on the staging environment (for testing) over to production

Edited Apr 10, 2024 by Davis Bickford
