New Terraform state drift workflow with Atlantis
Current workflow
- a daily scheduled pipeline refreshes the Terraform state and then runs a Terraform plan for each environment
- if the plan contains new changes (state drift), it sends a notification to the #infrastructure-lounge Slack channel, linking back to the pipeline
- an SRE verifies the changes in the job output
- if the changes are OK to apply, the engineer triggers a new Terraform plan and apply in that same pipeline to apply those changes
- if a fix is needed, the engineer opens a new merge request, which is then reviewed, merged and applied
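
The drift check in the scheduled pipeline can be sketched as follows. This is a minimal sketch, assuming the job runs `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The names `PlanStatus`, `classify_plan_exit_code` and `check_drift`, and the per-environment directory layout, are hypothetical and not taken from the existing pipeline.

```python
import subprocess
from enum import Enum


class PlanStatus(Enum):
    NO_CHANGES = "no_changes"
    DRIFT = "drift"
    ERROR = "error"


def classify_plan_exit_code(code: int) -> PlanStatus:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return PlanStatus.NO_CHANGES
    if code == 2:
        return PlanStatus.DRIFT
    # Exit code 1 (or anything unexpected) is treated as a failed plan.
    return PlanStatus.ERROR


def check_drift(env_dir: str) -> PlanStatus:
    """Run a plan in env_dir (assumed: one directory per environment)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=env_dir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Only a `DRIFT` status would lead to a Slack notification; an `ERROR` status should probably fail the job rather than be reported as drift.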
Problems
- There is very little visibility over the state changes: the Slack notifications get lost in the noise of the channel (and notifications sent to another channel would easily be ignored too), and the pipeline is quickly buried under other pipelines, so we soon lose any trace of it
- The state drift changes are not reviewed the same way regular changes are (in merge requests)
- Once we move to Atlantis for regular changes, state drift detection needs to run through Atlantis too, so that project/environment locking keeps working as usual and we only have to maintain permissions for a single tool
Proposal
The state drift detection pipeline is reworked to:
- refresh the state as usual (there is no way to do this in Atlantis at this time, see https://github.com/runatlantis/atlantis/issues/2849)
- trigger a Terraform plan via the Atlantis API (https://www.runatlantis.io/docs/api-endpoints.html#post-api-plan)
- if changes are detected, open a merge request bumping a timestamp in a .tf file for the affected environment (Atlantis autoplan on MR creation will show the diff), and send a Slack notification so that it gets reviewed, amended if needed, approved and applied
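
The last two steps can be sketched as follows. This is a rough sketch under stated assumptions, not the final implementation: the request body fields (`Repository`, `Ref`, `Type`, `Paths`) and the `X-Atlantis-Token` header follow the Atlantis API documentation linked above as we understand it, and should be checked against the deployed Atlantis version. The `drift_detected_at` local, the function names and the example values are hypothetical. Opening the merge request itself is left out.

```python
import json
import re
import urllib.request
from datetime import datetime, timezone


def build_plan_request(atlantis_url, token, repository, ref, directory,
                       workspace="default"):
    """Build (but do not send) a POST /api/plan request for one environment."""
    body = {
        "Repository": repository,
        "Ref": ref,
        "Type": "Gitlab",  # assumption: the repository is hosted on GitLab
        "Paths": [{"Directory": directory, "Workspace": workspace}],
    }
    return urllib.request.Request(
        url=atlantis_url.rstrip("/") + "/api/plan",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "X-Atlantis-Token": token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical marker in the environment's .tf file, for example:
#   locals { drift_detected_at = "2023-01-01T00:00:00Z" }
TIMESTAMP_RE = re.compile(r'(drift_detected_at\s*=\s*")[^"]*(")')


def bump_timestamp(tf_source: str, now: datetime) -> str:
    """Return tf_source with the drift timestamp replaced by `now` in UTC."""
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    new_source, count = TIMESTAMP_RE.subn(
        lambda m: m.group(1) + stamp + m.group(2), tf_source, count=1
    )
    if count == 0:
        raise ValueError("no drift_detected_at value found")
    return new_source
```

The request would be sent with `urllib.request.urlopen(request)`. The file returned by `bump_timestamp` would then be committed to a new branch and a merge request opened from it, so that Atlantis autoplan posts the diff for review.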
Future improvements
Feature request for drift detection: https://github.com/runatlantis/atlantis/issues/3245
Edited by Pierre Guinoiseau