Create AMI pipeline
The AMI for ephemeral machines will need to be updated periodically, so we should create a pipeline to build and publish the image.
- build AMI with necessary packages (git, etc.)
- replicate the AMI across all regions we care about
- mark all AMIs public
- clean up old AMIs that become discontinued or unsupported past a certain point
All AMIs made public from our official release account will need to remain available for a specified time, so we need a well-defined deprecation policy: for example, a minimum of 1 year or the last 3 versions, whichever is longer.
Design proposal (discussion thread here)
Based on discussion in my initial draft MR with @josephburnett, I drew up a basic design for our AMI pipeline:
- Source-of-truth for AMIs: Maintain a Terraform configuration file (e.g. `manifest.json`) in the project that serves as the single source of truth for AMIs. This file could list all AMIs to be distributed and their target regions, structured like:

      {
        "aws-linux-ephemeral": {
          "us-east-1": {
            "architecture": "x86_64",
            "deprecation_time": "",
            "id": "ami-048b8191da9c5e202",
            "name": "ubuntu",
            "public": true
          },
          "us-east-2": {
            "architecture": "x86_64",
            "deprecation_time": "",
            "id": "ami-00be98396005a8a31",
            "name": "ubuntu",
            "public": true
          },
          "us-west-1": {
            "architecture": "x86_64",
            "deprecation_time": "",
            "id": "ami-0253f2f2179c1b843",
            "name": "ubuntu",
            "public": true
          }
        }
      }

- Two-phase pipeline (not to be confused with stages; each phase could have one or many pipeline stages):
  - Build Phase: Triggered by a merge request, this phase would build and test (leveraging GRIT) a new candidate AMI. We do this in our sandbox AWS account.
  - Publish Phase: Manual stages of the pipeline replicate and publish the images, then generate a follow-up merge request in GRIT with the latest manifest (e.g. `manifests.json`) containing the newly created AMI IDs. This phase is responsible for copying the AMI to the distribution account (`runner-grit-images`) and applying the Terraform plan to replicate and publish the AMIs publicly.
- AMI Release Cycle:
  - Promotion: promote the "latest" image in the publish pipeline when tagging a release. A "stable" promotion occurs at most once per milestone, ensuring a controlled, predictable release of thoroughly tested AMIs. Ideally, we find a way to automate this publishing MR and simply approve it in order to publish and promote simultaneously.
- AMI naming: Introduce a naming convention for candidate images and public images to distinguish them from one another in AWS.
- Disaster Recovery and Auditing: The Terraform configuration file can act as a deployment mechanism and also enable disaster recovery by allowing us to roll back to a previous AMI with a git revert. It also provides an audit trail for changes and the current state of AMI distribution.
- Drift detection: a scheduled job will run daily to ensure that the AMIs in the manifest of tagged versions of GRIT are all still present and accessible. It will also list AMIs that are present in the project but not needed by any GRIT version and can therefore be deleted.
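To illustrate how the manifest could drive drift detection, here is a rough sketch that reads manifests in the JSON shape shown above and compares them with the AMIs actually present per region. The function names (`load_manifest`, `manifest_ami_ids`, `detect_drift`) are hypothetical, and the `present` mapping is assumed to be gathered separately (for example, from the EC2 API in each region); the sketch only covers the comparison, not the "accessible" (public) check.

```python
import json


def load_manifest(text):
    """Parse one manifest document (JSON shaped like the example above)."""
    return json.loads(text)


def manifest_ami_ids(manifest):
    """Map each region to the set of AMI IDs listed in one manifest."""
    by_region = {}
    for regions in manifest.values():
        for region, entry in regions.items():
            by_region.setdefault(region, set()).add(entry["id"])
    return by_region


def detect_drift(manifests, present):
    """Compare manifests of tagged versions with the AMIs that exist.

    manifests: iterable of parsed manifests, one per tagged GRIT version
    present:   dict of region -> set of AMI IDs found in the account
    Returns (missing, unused): dicts of region -> sorted list of AMI IDs.
      missing: listed in some manifest but not present (needs attention)
      unused:  present but not listed in any manifest (deletion candidates)
    """
    needed = {}
    for manifest in manifests:
        for region, ids in manifest_ami_ids(manifest).items():
            needed.setdefault(region, set()).update(ids)

    missing, unused = {}, {}
    for region in sorted(set(needed) | set(present)):
        want = needed.get(region, set())
        have = present.get(region, set())
        if want - have:
            missing[region] = sorted(want - have)
        if have - want:
            unused[region] = sorted(have - want)
    return missing, unused


# Illustrative run: two regions from the example manifest; "ami-0unused" is made up.
example_manifest = load_manifest("""
{
  "aws-linux-ephemeral": {
    "us-east-1": {"id": "ami-048b8191da9c5e202", "public": true},
    "us-east-2": {"id": "ami-00be98396005a8a31", "public": true}
  }
}
""")
example_present = {
    "us-east-1": {"ami-048b8191da9c5e202", "ami-0unused"},
    "us-east-2": set(),
}
example_missing, example_unused = detect_drift([example_manifest], example_present)
```

In this run `ami-00be98396005a8a31` is reported as missing from `us-east-2` and `ami-0unused` as a deletion candidate in `us-east-1`. Keeping the comparison a pure function over plain dictionaries makes it easy to test without AWS credentials.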
Update as of 2024/2/22:
The draft MR is moving along; however, we are discussing blockers regarding how to handle publicly publishing AMIs for GRIT while following best practices. It appears our runner group's AWS account has public publishing disabled by default. Part of our discussion is whether we should publish from a new or different AWS account or enable public publishing in our current account.
Update as of 2024/3/6:
We have a new AWS account, `runner-grit-images`, to be used for GRIT image distribution. @dbickford, @josephburnett, and @tmaczukin are admins. We are now finalizing the design and fielding feedback on it (discussion thread here).
Update as of 2024/5/7:
Health status: Needs attention
What's left to be done: final testing of the pipeline and final configurations can move forward now that blocking issues have been cleared:
- Include Go components in CI image to reduce pipeline execution time
- GRIT e2e main is broken
- GRIT docker autoscaler example fails to create route table with Error: RouteAlreadyExists
What's blocking: A potential blocker is publishing publicly not behaving as we expect.
Update as of 2024/5/9:
Health status: At Risk
What's left to be done: Follow-up/final-touch MRs are in progress or in review:
- gitlab-org/ci-cd/shared-runners/images/aws/grit-images!3 (merged)
- gitlab-org/ci-cd/shared-runners/images/aws/grit-images!4 (merged)
What's blocking: There is a Docker daemon issue in the pipeline that we are currently troubleshooting. Another potential blocker is publishing publicly not behaving as we expect.
Update as of 2024/5/22:
Health status: On track (moved to %17.1)
What's left to be done: Follow-up/final-touch MRs are awaiting review:
- gitlab-org/ci-cd/shared-runners/images/aws/grit-images!3 (merged)
- gitlab-org/ci-cd/shared-runners/images/aws/grit-images!4 (merged)
What's blocking: The previous Docker daemon issue in the pipeline has been fixed. A potential blocker remains if publishing publicly does not behave as we expect; we won't know until the above MRs are approved and we attempt to run the full pipeline.