RFC: Terraform Merge Request automation with Atlantis

This RFC discusses the implementation of Atlantis as a new workflow in the Terraform projects managing the GitLab.com infrastructure, to address several incidents related to Terraform changes and avoid repeating them in the future.

It focuses primarily on the Config Management project, which will benefit the most from it and is a priority for FY24-Q3, but it can later be implemented in Infrastructure Management, GitLab Services and other projects using Terraform.

Summary

Atlantis is a bot that, when you create an MR against config-mgmt, posts the Terraform plan in the MR comments (instead of in a CI job). A user interacts with the bot via comments, using it to apply the plan while the MR is still open. Once everything is applied, the bot merges the MR. If an apply fails, it fails while the MR is still open, so it can be fixed in the same MR. This represents a significant change to our Terraform workflow, resulting in more transparency and safety when making production changes.

Current workflow

Day to day code changes

The current workflow for new code changes is fairly simple:

a contributor opens an MR with Terraform code changes
a CI pipeline runs a terraform plan job for each environment affected by the MR, and updates the MR check status with the number of resources created, modified or deleted for each environment
this pipeline also runs several linting, policy and security checks
a reviewer verifies the MR changes and the output of each Terraform plan from the CI job output(s)
after reviewer approval, the contributor or another team member with maintainer access merges the MR
a CI pipeline runs terraform plan + terraform apply jobs for every affected environment
the contributor and/or maintainer verifies that all the Terraform apply jobs succeeded

State drift detection

A daily scheduled CI pipeline runs a terraform plan job for each environment and notifies the #infrastructure-lounge Slack channel if there are any state changes detected in a given environment, with links to the pipeline to review and apply those changes.

State refresh

To speed up the CI pipeline and avoid applying unverified state drift changes, the terraform plan jobs run with state refresh disabled, and the state is refreshed only once a day as part of the state drift check CI pipeline.
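As a rough illustration of the split described above, the sketch below builds the two flavours of plan invocation. The flags `-refresh=false`, `-input=false` and `-out` are standard Terraform CLI options, but the function name, the default plan file name and the exact set of flags used by our pipeline are assumptions made for this example.

```python
def plan_command(refresh: bool = False, out: str = "tfplan") -> list[str]:
    """Build a `terraform plan` argv (hypothetical helper, not the real pipeline).

    refresh=False mirrors the MR/merge pipelines, which skip the state refresh.
    refresh=True mirrors the daily drift check, which refreshes the state.
    """
    cmd = ["terraform", "plan", "-input=false", f"-out={out}"]
    if not refresh:
        # Skip querying the providers for the current state of every resource.
        cmd.append("-refresh=false")
    return cmd


if __name__ == "__main__":
    print(" ".join(plan_command()))              # day-to-day plan
    print(" ".join(plan_command(refresh=True)))  # daily drift check plan
```

Skipping the refresh keeps day-to-day plans fast, at the cost of only noticing out-of-band changes once a day.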

What problems are we trying to solve?

The current workflow presents several problems that have led to multiple destructive incidents, caused by unwanted changes being applied without proper verification:

the contributor and reviewers might not always check the Terraform plan in the CI job output, especially for smaller changes that appear harmless; checking the job output for each environment is cumbersome, even more so when all environments are affected
due to limitations of the file change detection in CI, changes to Terraform modules under the modules/ directory trigger a Terraform plan job for every environment, making reviews more difficult (see previous point)
the Terraform plan verified in the MR might not always be the same one applied from the master branch (unverified and unapplied state drift, incompletely applied older changes, etc.), especially when the MR plan output is several days old
the Terraform plan is applied automatically without any verification, and the engineer might not always check that the apply is successful
- in a previous iteration of the CI pipeline, the apply jobs were triggered manually after verification of the plan; however, engineers often forgot to run them after merging the MR, or conversely applied all plans without any verification because checking them was too cumbersome
CI pipelines from MRs merged shortly after one another can sometimes cancel each other (save for the most recent one), resulting in unapplied changes in some environments and confusing plan outputs
when a change affects multiple environments that depend on each other, there is no control over the execution order of the Terraform apply jobs, leading to failed apply jobs that need to be retried and sometimes go unnoticed by the engineer
the plan and apply jobs in the latest CI pipeline can be rerun with very little visibility, making investigation more difficult
while imports and refactoring can now be done declaratively, removing resources from the state (terraform state rm) is still a manual operation that needs to be executed locally, without any review or trace, and which can interfere with ongoing reviews and applies
the only traces of state drifts and their resolution are notifications in a Slack channel (lost in the channel noise and disappearing after 90 days) and the CI pipeline history, making them poorly visible and difficult to investigate after the fact
the policy checks (using Checkov and OPA) in MRs are poorly visible (in the CI job output again) and remain largely unused to this day, in part because of it
the CI pipeline configuration has become quite complicated: it is generated from Jsonnet files to keep it manageable, but making any significant change to it has become increasingly difficult; adding an environment to the pipeline also requires regenerating the configuration and some knowledge of Jsonnet

A new workflow with Atlantis

Atlantis is a Terraform ~~Pull~~ Merge Request Automation tool.

Atlantis is an application for automating Terraform via pull requests. It is deployed as a standalone application into your infrastructure. No third-party has access to your credentials.

Atlantis listens for GitHub, GitLab or Bitbucket webhooks about Terraform pull requests. It then runs terraform plan and comments with the output back on the pull request.

When you want to apply, comment atlantis apply on the pull request and Atlantis will run terraform apply and comment back with the output.

-- https://www.runatlantis.io/guide/#overview-%E2%80%93-what-is-atlantis

Atlantis provides full visibility of the entire life of an MR within the MR itself, from the draft Terraform plan to the Terraform apply results, for better reviews, safer applies and easier investigations in case of trouble.

Atlantis doesn't have a UI (well it kind of does but only for viewing and managing locks) so it is entirely controlled from MR comments.

Working day to day with Atlantis

The new workflow is quite different but still simple:

a contributor opens an MR with Terraform code changes
the CI pipeline runs various static linting and security checks
Atlantis runs a Terraform plan automatically and displays the results in a comment, showing a summary of the changes per environment and the full plan output in a collapsible section
- when a module under the modules/ directory is updated, Atlantis autodetects which environments use this module and will only run a Terraform plan for those
the contributor can comment atlantis plan to generate a new Terraform plan whenever needed; the older plan results are hidden automatically
- a Terraform plan for a single environment can be obtained with atlantis plan -p my-env
if OPA and/or Checkov policies exist, Atlantis will also post their results in the comments
a reviewer verifies the MR changes and the output of each Terraform plan from the comments
when the MR is mergeable (approved, CI passed, policies passed, no merge conflicts), the contributor can comment atlantis apply to apply the changes from the plan previously generated
- a Terraform apply for a single environment can be run with atlantis apply -p my-env
if a Terraform apply fails, the contributor can iterate in the same MR until it succeeds, including reverting partially applied changes
a resource can be removed from the Terraform state by commenting atlantis state_rm my.resource
when the plans for all environments have been applied, Atlantis merges the MR automatically
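To make the module autodetection step in the list above more concrete, here is a minimal sketch of the idea: map the files changed in an MR to the environments that need a plan. It is not how Atlantis implements it internally. The directory layout (`environments/<name>/`, `modules/<name>/`), the environment names and the `ENV_MODULES` mapping are all hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: which modules each environment references.
ENV_MODULES = {
    "env-a": {"modules/network", "modules/database"},
    "env-b": {"modules/network"},
    "env-c": {"modules/monitoring"},
}


def affected_environments(changed_files, env_modules=ENV_MODULES,
                          env_root="environments"):
    """Return the sorted environments that need a plan for these changed files."""
    affected = set()
    for name in changed_files:
        parts = PurePosixPath(name).parts
        if len(parts) < 3:
            continue  # not a file inside an environment or module directory
        if parts[0] == env_root:
            # A file changed directly inside an environment directory.
            affected.add(parts[1])
        elif parts[0] == "modules":
            # A module changed: plan only the environments that use it.
            module = f"modules/{parts[1]}"
            affected.update(env for env, mods in env_modules.items()
                            if module in mods)
    return sorted(affected)


if __name__ == "__main__":
    print(affected_environments(["modules/network/main.tf"]))
```

With this mapping, a change to `modules/network` would trigger plans for `env-a` and `env-b` only, instead of every environment as in the current CI pipeline.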

As a result:

the Terraform plan and the apply results are plainly visible in the comments
the plan that is applied is exactly the one that was reviewed, there are no surprise additional changes
the contributor and reviewer(s) are aware that their changes are not fully applied until the MR is merged, and are always notified of the outcome via the comments
an unprivileged compromised/malicious user has no way to bypass the workflow (for example by modifying the CI configuration) to forcefully apply changes without the approval from a maintainer
the main branch now reflects the configuration currently applied instead of what will be applied (or not) and so is always valid, it doesn't contain any unapplied changes at that point in time (until a resource is modified manually which would result in a state drift)

Architecture

The Atlantis Helm chart is deployed in the ops-gitlab-gke GKE cluster via gitlab-helmfiles. It consists simply of a StatefulSet, a Service and an Ingress. There is one Atlantis deployment per GitLab instance and project (or group of projects), all sharing the same ingress with different dedicated FQDNs.

HA

Atlantis runs as a single pod by default, but can be made highly available by deploying Redis alongside it (to use instead of BoltDB for project locks) and leveraging Filestore for shared storage (to store branch checkouts and Terraform plan files).

Vault secrets and security

To access the Vault secrets for config-mgmt, Atlantis can authenticate to Vault via the Kubernetes authentication method (see the role and policies configuration here). Additionally, it has direct access to the Terraform state GCS buckets via Workload Identity (see configuration here). Because of the permissions thus given to Atlantis, each Terraform project (or group of projects with identical privileges) using Atlantis will have its own dedicated Atlantis deployment.

Project configuration

The plan, apply and policy check workflows are configured and customized on the server side in gitlab-helmfiles. This allows us to configure additional steps to run for each command (e.g. Vault authentication) and to set the conditions for allowing a user to run a Terraform plan or apply (documentation).

The environments (or projects in Atlantis terms) and Atlantis' behaviour can be configured in the file atlantis.yaml in the repository (documentation).

Caveats

Atlantis is still a relatively young project (but gaining momentum!), so some features are still missing or incomplete compared to its competitors.

Environment locking

Atlantis has a locking mechanism (separate from the Terraform state lock) to avoid changes being applied simultaneously from multiple MRs. However, this lock is held from the moment atlantis plan is run until the MR is merged or unlocked with atlantis unlock. This would be problematic with the way we work in our projects (multiple concurrent MRs being worked on, some drafted over multiple days, Renovate opening multiple MRs, ...), especially with an automatic plan upon opening a new MR (though autoplan could be disabled).

For this reason, project (environment) locking will be disabled for the time being. We will implement an alternative mechanism to prevent changes from being applied simultaneously from multiple MRs (💡 Atlantis can only apply if an MR is mergeable, so we only need to make the other MRs not mergeable when an apply is initiated: approval, commit status, ...?)

🚧 There is some work in progress to add non-locking Terraform plans for draft MRs, scheduled for v0.27, see also this issue.
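The alternative to project locking floated above (making other MRs non-mergeable while an apply is in progress) could look roughly like the in-memory model below. This is only a sketch of the bookkeeping: the `ApplyGate` class and its methods are hypothetical, and a real implementation would presumably persist this state and surface it through a GitLab commit status or approval rule rather than a Python object.

```python
class ApplyGate:
    """Tracks which MR currently holds the right to apply in each environment."""

    def __init__(self):
        self._holder = {}  # environment name -> MR id currently applying

    def begin_apply(self, mr, environments):
        """Try to start an apply; return (ok, environments held by other MRs)."""
        blocked = sorted(e for e in environments
                         if self._holder.get(e, mr) != mr)
        if blocked:
            return False, blocked
        for env in environments:
            self._holder[env] = mr
        return True, []

    def is_mergeable(self, mr, environments):
        """An MR is mergeable only if no other MR is applying to its environments."""
        return all(self._holder.get(e, mr) == mr for e in environments)

    def release(self, mr):
        """Called once the MR is merged or its apply is abandoned."""
        self._holder = {e: h for e, h in self._holder.items() if h != mr}


if __name__ == "__main__":
    gate = ApplyGate()
    print(gate.begin_apply(101, ["env-a", "env-b"]))
    print(gate.is_mergeable(102, ["env-b"]))
```

Unlike the built-in Atlantis lock, nothing is held at plan time here: an MR only blocks others between the start of its apply and its merge.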

State drift detection

Atlantis is not able to do state drift detection at this time (see this issue).

One possible idea to implement it ourselves would be:

trigger a Terraform plan for each environment via the API
if changes are detected, open a merge request bumping a timestamp in the affected environment(s) and notify on Slack
review, fix, approve and apply as needed

This method would leave a trace, in the form of a new MR, of every state drift and its resolution, which improves visibility compared to the current workflow.
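The first two steps of this idea can be sketched as follows. The sketch assumes that the result of each plan can be reduced to the exit-code convention of `terraform plan -detailed-exitcode` (0 = no changes, 1 = error, 2 = changes present). How the plan is actually triggered (for example through the Atlantis API) is left behind a `run_plan` callable, which is a placeholder, as are the environment names.

```python
from typing import Callable, Dict, Iterable, List

# Exit codes of `terraform plan -detailed-exitcode`.
NO_CHANGES, PLAN_ERROR, CHANGES_PRESENT = 0, 1, 2


def detect_drift(environments: Iterable[str],
                 run_plan: Callable[[str], int]) -> Dict[str, List[str]]:
    """Plan every environment and sort the environments by outcome.

    `run_plan` is a hypothetical callable returning the plan exit code
    for one environment.
    """
    report: Dict[str, List[str]] = {"drifted": [], "clean": [], "failed": []}
    for env in environments:
        code = run_plan(env)
        if code == CHANGES_PRESENT:
            report["drifted"].append(env)  # open a drift MR, notify Slack
        elif code == NO_CHANGES:
            report["clean"].append(env)
        else:
            report["failed"].append(env)   # the plan itself errored
    return report


if __name__ == "__main__":
    # Fake plan results standing in for real plan runs.
    fake_results = {"env-a": 0, "env-b": 2, "env-c": 1}
    print(detect_drift(fake_results, fake_results.__getitem__))
```

Environments listed under `drifted` would each get a timestamp-bump MR for review, while `failed` ones would need investigation before any drift can be assessed.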

State refresh

Atlantis doesn't provide a way to do daily Terraform state refreshes. However this could be done as part of the daily state drift detection described above (the same way it is done in the current workflow).

Alternatives

env0: paid product, can be self-hosted
Spacelift: paid product, can be self-hosted only in AWS
Terraform Cloud: paid product, no self-hosting

Demo

Demo from the APAC Reliability Discussions meeting can be found in the Infrastructure Demos folder, with detailed notes in the Reliability Discussion agenda document

Edited Feb 13, 2024 by Pierre Guinoiseau
