The DRI for the incident review is the issue assignee.
If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
Who was impacted by this incident? (i.e. external customers, internal customers)
All users of gitlab.com, including external and internal customers.
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
GitLab.com was unavailable on 2023-07-07 from 16:25 UTC to 18:42 UTC. During this time the web and API interfaces were not available (503). Customers were able to perform git actions via the command line.
For customers that did not have DNS records cached, Container Registry was unavailable on 2023-07-07 from 16:25 UTC to 19:36 UTC.
A small number of git pushes on 2023-07-07 from 15:55 UTC to 16:17 UTC are not available on GitLab.com until the changes are pushed again from a local copy.
We have restored data to known recovery points, and a small subset of customer projects requires a refresh using their local copy.
The impacted project owners have been notified and were advised to re-push their changes.
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
All customers.
What were the root causes?
An outdated production configuration was applied to our production environment, which caused several GitLab.com production services to be removed and replaced.
The root cause was an out-of-sync infrastructure configuration plan (Terraform) executed against our production environment.
This infrastructure configuration plan had been prepared three weeks earlier, ahead of our production database upgrade.
Environmental drift accumulated during those three weeks, causing the planned configuration and the production environment to fall out of sync.
Executing the out-of-sync plan caused an unintended removal of production services which resulted in the outage.
We typically execute configuration plans shortly after they are prepared. However, the execution of this 3-week-old configuration plan exposed a gap in our process.
Incident Response Analysis
How was the incident detected?
16:10 UTC - A CI job begins applying an old Terraform configuration
16:20 UTC - Based on Slack reports and personal observation of 5xx errors on GitLab.com, the EOC attempted to upgrade the incident to an S1.
It was five minutes from when the job began applying destructive changes until the EOC was notified of an initial problem, and ten minutes until it was clear that this was an S1 site outage incident.
How could detection time be improved?
NA.
How was the root cause diagnosed?
18:25 UTC - Checked Cloudflare status page
18:25 UTC - DBRE brings the call's attention to a running Terraform apply job
18:28 UTC - A first look at the MR attached to the pipeline suggests it is harmless
18:30 UTC - Examining the running Terraform apply job revealed several resources being destroyed. Referencing the plan for the pipeline showed 617 resources to be destroyed.
18:31 UTC - The job was stopped to try and prevent further destruction.
18:34 UTC - A local plan against the production environment was run by the EOC to see what cloud resources were missing.
18:37 UTC - At this point, it appeared likely that the applied plan had caused the outage due to destroyed resources.
How could time to diagnosis be improved?
We had to fall back to Google Docs to manage this incident since GitLab.com was unavailable; a GitLab issue was eventually created once GitLab.com was back online: #15997 (closed). Having both the doc and the issue caused back-and-forth copying and pasting of data, which was inefficient while trying to manage the incident. One way to improve this is to use our ops instance for incident tracking rather than Google Docs. However, that would introduce additional problems and has been discarded for now. Alternative solutions are being discussed, and a follow-up issue has been created to continue exploring them: Improve Incident Management process when gitlab... (gitlab-com/www-gitlab-com#34382 - moved)
How did we reach the point where we knew how to mitigate the impact?
We spent some time trying to "fix" Terraform, or at least get a better handle on how a restore might work without having to slowly apply each difference one at a time.
While that was happening, work was put into trying to assess what specific disks and other systems were missing.
An attempt to remove the dependency on the Redis cache cluster was started to see whether that would bring the web fleet back to an operating state.
The call was broken into two Zoom calls. The main incident room was used to focus on restoring services without using Terraform. The other focused on restoring resources via Terraform.
We considered the incident mitigated once the resources were restored, the Terraform configuration was back in a clean state, and affected customers had been notified.
How could time to mitigation be improved?
Identifying a list of affected customers was delayed due to tooling (rails console and Teleport) being offline. This tooling was dependent on first getting our Terraform configuration into a clean state.
Drafting messaging for affected customers required quite a lot of cross-functional effort and approvals (e.g. engineering, product, and support to assess impact and work with corporate communications to draft the message, marketing ops to send the message, legal / customer success sign-off). This can be improved by having pre-approved messaging templates for these types of outages, allowing us to move more quickly. Follow-up issue: https://gitlab.com/gitlab-com/www-gitlab-com/-/issues/34387+
Post Incident Analysis
Did we have other events in the past with the same root cause?
A lot of people came together to help, even though it was a Friday (and Saturday)!
Our severity 1 response processes were put to the test, with immediately positive feedback from everyone involved; for example, the breakdown of Zoom calls, threads, and documents to tackle different parts of the incident was excellent.
I imagine the desire to complete this earlier comes from the need to communicate in more detail in various forums. I think I have enough information to do those communications without the review completing. Can you point me in the direction of those requests and I'll cover them off ahead of the review and follow up with any additional info from the review.
Given that we had to revert to google docs, should we have a backup project in the ops instance for incident tracking? Asking as a question not a prescription.
Affected customers could not see the tracking ticket provided in status update emails, which wasn't a good experience. In that circumstance, it would be useful to have a backup of some kind. It might make sense for that backup to be a tech stack distinct from GitLab, so that you know it cannot possibly be affected by the same issue.
+1 to this suggestion. Tracking things in a Google doc was a good step, but it introduced overhead work after the dust had settled, and it didn't provide good visibility to our users while the incident was active.
I think once we determine which issue tracker to use in the ops instance, we can automate an incident creation fallback to that tracker via woodhouse if gitlab-com is down.
The mirror page would just be a single HTML file hosted in GCP as a load-balanced static site, mirroring the ops.gitlab.net-hosted page.
The idea here is that it would get updated (ideally via webhook) every time a comment or issue description change was made to a severity 1 incident issue being managed in the ops instance.
All this would avoid the necessity to open ops.gitlab.net to the public, and obviate the need to ensure it (ops.gitlab.net) could scale to meet demands of thousands of page views during a high visibility (gitlab.com-outage) incident.
Thanks @nnelson! I was considering adding a step to the on-call team members' onboarding process to keep a local clone of the runbooks, but having them on ops is great.
We had to fall back to Google Docs to manage this incident since GitLab.com was unavailable; a GitLab issue was eventually created once GitLab.com was back online: #15997 (closed). Having both the doc and the issue caused back-and-forth copying and pasting of data, which was inefficient while trying to manage the incident. One way to improve this is to use our ops instance for incident tracking rather than Google Docs #15999 (comment 1462412414)
@rehab, @jeromezng, @meks - a couple of notes that were also discussed in the meetings:
We had held off on using ops and a public issue/project because we did not want heavy traffic on the instance we were using to recover.
We had also wanted to dogfood a feature the Monitor stage was working on - gitlab-com/www-gitlab-com#7012. I'm not sure that ever came to completion; it's something we could look at.
Thank you for the update on the root cause. I recall we used to ask the 5 whys to understand the underlying cause in depth. Can we also perform this analysis in a blameless way? @andrewn may have historical context.
As part of a change request, an old pipeline was triggered, applying an obsolete Terraform plan to the production environment.
In particular, I am interested in how we triggered an obsolete plan and what we can do to prevent this in the future. The list that @rehab created is a good start. I am looking at both the TF mechanics and also the human usability part.
Plan naming conventions to make it clear (prefix/dates)
I don't think plan names are ever seen by a human eye, so that might not be a valuable change.
And archiving old plans is not really relevant: Terraform has its own methods to prevent stale plans from being applied. But perhaps the intention is to archive old pipelines so they can be viewed but not re-run? If GitLab has a way to expire old pipelines from being re-run, that could be a feature that helps. But it seems redundant in this specific case, since we've added a check that will not let applies proceed if they are applying an older commit than origin/HEAD as a Terraform configuration.
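For reference, that check is essentially a freshness guard at the top of the apply job. Here is a rough sketch of the idea (it assumes the default branch is master and a full clone in the job), not the exact implementation:

```shell
# Hypothetical guard at the start of the apply job: refuse to apply a commit
# that is behind origin/HEAD (assumes the default branch is master).
set -euo pipefail

git fetch origin master

# If origin/master is not an ancestor of the commit being applied, a newer
# commit has been merged since this pipeline was created, so it is stale.
if ! git merge-base --is-ancestor origin/master "${CI_COMMIT_SHA}"; then
  echo "Refusing to apply: ${CI_COMMIT_SHA} is behind origin/master" >&2
  exit 1
fi
```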
While we currently have a workflow for severity 1 complete outages (can't find a link to it at the moment), which worked well during this incident, it doesn't help in historically differentiating between an event such as this one and an outage in only one of the services in production.
We can fork the workflow for complete outages into its own incident severity. Clarifying this in our docs would also help the IMOC and other engineers during their incident response.
During the first 30-45 minutes of the incident, where every minute counts, I remember there were a few questions raised about the impact, which was clear to most of us as a complete outage, but wasn't really clear from: 1. the title of the incident, and 2. the severity assigned to the incident at the time.
We can fork the workflow for complete outages into its own incident severity. Clarifying this in our docs would also help the IMOC and other engineers during their incident response.
Could you provide more detail on your thoughts here? My initial take is that we wouldn't do anything differently but don't want to simply dismiss it.
During the first 30-45 minutes of the incident, where every minute counts, I remember there were a few questions raised about the impact, which was clear to most of us as a complete outage, but wasn't really clear from: 1. the title of the incident, and 2. the severity assigned to the incident at the time.
severity1 was certainly capturing folks' attention and I agree that folks were very aware of the customer impact. I'm not sure how having severity::0 would have helped with this. Can you please expand?
Could you provide more detail on your thoughts here? My initial take is that we wouldn't do anything differently but don't want to simply dismiss it.
I'd say we mostly wouldn't do anything differently for this incident! However, the workflow for handling this type of incident (an outage preventing incident creation/viewing) was different from the workflow for handling other severity 1 incidents with no outage, and as far as I can tell, I can't find any documentation for such a workflow anywhere.
At first I thought we had an outage workflow; perhaps this was a proposal in the past that never came to life? However, I found this, which hints at outages but isn't very clear.
What I'm trying to propose here is a separate workflow for when there's a complete outage, and although it's rare, it would still be nice to formalize our response, document it, and iterate with our learnings from events like this.
During the first 30-45 minutes of the incident, where every minute counts, I remember there were a few questions raised about the impact, which was clear to most of us as a complete outage, but wasn't really clear from: 1. the title of the incident, and 2. the severity assigned to the incident at the time.
That should have been my responsibility. Apologies for having missed that.
I'd say we mostly wouldn't do anything differently for this incident! However, the workflow for handling this type of incident (an outage preventing incident creation/viewing) was different from the workflow for handling other severity 1 incidents with no outage, and as far as I can tell, I can't find any documentation for such a workflow anywhere.
I agree, and particularly as a new IMOC I found myself very inefficient at the beginning. The IMOC process available to me was based on issues, which were unavailable. The IMOC runbook was unreachable too. Maybe a more experienced IMOC or someone more familiar with operational issues would have reacted faster, but we definitely should have this written down somewhere accessible. For instance, starting a Google doc once we realized the issue tracker had become inaccessible (this was asked for by the EOC while I was still figuring out how to proceed).
Additionally, should we consider an outage a circumstance where we want a seasoned IMOC to take over? As said, time is critical in such events and we want to be as efficient as possible. A lot of team members helped during this incident; some asked me directly if I needed help or even proactively took over some tasks. Still, we might have lost some crucial minutes that an experienced IMOC would have saved.
@gonzoyumo I was out on PTO during the incident, so I don't know if about.gitlab.com was also impacted. But here is the handbook page for Incident Management Processes, which is not dependent on issues.
At any time, if an IM feels they need more support because they feel underprepared or unable to be effective in a given incident, regardless of labels or incident severity, they should feel free to escalate to Infrastructure leadership. There is a Pagerduty escalation policy in place for exactly this purpose, and it's always ok to ask for someone from Infra leadership to take over as IM in this case, while the original IM should remain engaged in a shadow capacity to learn. I'll make a note to highlight that more for new Incident Managers in the training.
Thanks @amoter. What I was referring to about "issues" in my comment is that the steps described in the IMOC responsibilities are mainly based on the incident issue being available, which was not the case here due to the outage. This, to me, caused additional burden, as I had to further adapt a process I was not familiar with. It might not look like much, and I probably lacked some judgement at that time (the "deer in the headlights" effect did not help). Still, having a checklist to follow when there is such an outage would maybe help the IMOC face the situation more efficiently.
@alanrichards the escalation was indeed triggered but not by me :/
@gonzoyumo FYI I'll bring up this thread in today's Incident Review, as there were a lot of different points raised around incidents that are also an outage. My hope is that we'll come up with a set of actions from the discussion to improve our workflows and be better prepared for instances like this in the future.
During today's incident review sync meeting it was suggested that rather than trying to create a distinct workflow, we could simply improve the current one.
Pointing out specific situations or needs that can arise and how to best address them. For instance:
what to do when the incident issue can't be created or is inaccessible?
when the scope of an incident is large, emphasize the opportunity to escalate, call additional IMOCs/EOCs, and split the tasks. As necessary, split into separate Zoom rooms. This is mentioned in the multiple incidents section but could be made more generic and benefit from improvements like automation to create the room and record the meeting.
@jarv do you know who would be best positioned to help on automating this via woodhouse like you suggested?
make explicit a process for incident response and communications when customers are impacted (this is different from the Incidents requiring direct customer interaction guideline). For instance, pre-approved messaging could help speed up the response time. @jmalleo you mentioned on Slack (internal link) that you'll be looking into this; could you please share an issue to collaborate on this topic?
Awesome suggestion @jarv! I feel like this is important enough for us to contribute the feature upstream ourselves. Should we create an issue in our issue tracker to contribute the feature?
I feel like this is important enough for us to contribute the feature upstream ourselves. Should we create an issue in our issue tracker to contribute the feature?
@cmcfarland I think that could work for us, as long as there is local permission to run it. Could you open an issue to discuss it and put it in the Foundations team queue?
Igor changed the title from Incident Review for Site-wide Outage for GitLab.com #15997 (closed) to Incident Review for Site-wide Outage for GitLab.com - Stale Terraform Pipeline #15997 (closed)
Incident communication via issue on a separate instance
I've heard from different sources (Hacker News, wider community members, friends, customers, social media, etc.) that hosting the incident issue on GitLab.com SaaS is not great and limits transparency and communication when GitLab.com is down.
Suggestions included:
Hosting a dedicated public GitLab instance for incident handling
A different system for incident updates that is linked from status.gitlab.com (imho contradicts dogfooding the product)
In my mind, we had two incidents that occurred. The first was the outage and the second was the deletion of storage. Should we consider breaking apart the configuration so that these pieces are not handled in the same change?
We rolled back one single commit; @clefelhocz1's point is whether we should break up our TF commits to touch fewer components. If we rolled back a single commit with this much impact, it seems the commit that was rolled back was a large change initially.
@meks actually we rolled back many commits, around 3 weeks' worth, or 24 days to be accurate. The impact was large because it contained many commits.
To clarify how the impact would've been different, let's take the following two scenarios.
A scenario for a better incident
The 3 Gitaly nodes that were recreated were initially added to Terraform in this commit on Jun 17, 2023. This was a normal routine task for infra, to scale up the fleet whenever resources are close to reaching capacity.
The commit that was triggered with an old plan was first pushed on Jun 13, 2023, re-triggered on Jul 07, 2023, the day of the incident.
If the 3 new Gitaly nodes had been added on Jun 12, 2023 instead of Jun 17, 2023, we wouldn't have lost the nodes during this incident! The only reason they were reverted (destroyed) is that the commit adding them falls in the window of Jun 13 - Jul 07. This was not the only commit that was reverted, but it was the most impactful (along with the Redis nodes).
A scenario for a worse incident
Let's take another example: suppose we had needed to recreate our main database nodes in the past 3 weeks for a legitimate reason, for example upgrading the main PG14 fleet.
Because the recreation of these nodes would have fallen in the window of the reverted commits (Jun 13 - Jul 07), we would have lost our main database fleet! This would have been a catastrophic event that would have taken many more hours to recover from.
The bottom line is that the vulnerability (if we can call it that) in our Terraform workflow, which allowed old commits to change the state of current production to a point in time in the past, could have caused a worse incident or a less impactful one. We ended up somewhere in the middle.
Around 2 weeks ago we experienced an event where a handful of terraform resources were removed.
Upon further investigation, we discovered that this happened due to two merges happening around the same time, resulting in competing pipelines. These pipelines compete for the same resource group lock as well as the same terraform state lock.
However, there is no enforced sequencing in the order of merges. This resulted in the following sequence of events:
08:42 - Pipeline P2 starts job J2 which applies terraform for commit C2.
08:53 - Pipeline P1 starts job J1 which applies terraform for commit C1.
Because C1 is an earlier commit than C2, applying C1 results in an unintended rollback of C2.
Only once the next merge happens will we roll forward and re-apply C2.
In this case we only rolled back a single commit. But it highlighted that there is a real risk associated with out-of-order pipelines.
Terraform itself protects against applying a stale plan, but in this case we are creating a fresh plan + apply after each merge. The problem is not a stale plan, but rather performing a plan + apply against a non-current commit.
At this point there was some speculation that a restart of an old pipeline could widen the unintended rollback window. Some googling surfaced this upstream terraform issue talking about the same hazard, specifically in the context of GitLab CI. We also briefly discussed Terraform Cloud and Atlantis as platforms which have stronger ordering guarantees.
It was however not deemed urgent enough to address immediately, as our workflows generally do not require restarting old pipelines. After all, we've been applying terraform via CI for many years without issue.
@igorwwwwwwwwwwwwwwwwwwww do you think setting process mode to oldest_first might have helped that specific situation? Assuming you are auto-applying, I think that this would prevent J2 running before J1 is completed.
@tmeijn I wasn't aware of this feature. Yes, I think it would help in the competing merges case. Thanks!
I suspect there is still a potential window for the pipelines to be created out of order (assuming they are created via Sidekiq), but the window for that is much narrower and effectively requires simultaneous merges.
@igorwwwwwwwwwwwwwwwwwwww Alright, hope it helps! Don't judge my Bash skills but use this little script to set resource_groups to oldest_first by default, since I like them to work like that anyway:
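(The script itself isn't reproduced above. As a rough sketch of the same idea, assuming the GitLab Resource group API and `jq` are available, it could look something like this:)

```shell
#!/usr/bin/env bash
# Hypothetical sketch: switch every resource group in a project to
# oldest_first via the GitLab Resource group API. Requires GITLAB_TOKEN.
set -euo pipefail

GITLAB_URL="${GITLAB_URL:-https://gitlab.com}"   # adjust for the ops instance if needed
PROJECT_ID="$1"                                  # numeric ID or URL-encoded project path

for key in $(curl --silent --fail \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/resource_groups" | jq -r '.[].key'); do
  curl --silent --fail --request PUT \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    --data "process_mode=oldest_first" \
    "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/resource_groups/${key}" > /dev/null
  echo "Set resource group '${key}' to process_mode=oldest_first"
done
```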
If not, did this scenario play a part in this incident?
Yes! This is the same scenario that happened during this incident, but at a different scale.
In Igor's example, the two pipelines ran one after the other: one was applied on Jun 26, 2023 at 09:42 AM, the other minutes later at 09:53 AM, so we reverted only one commit, or in other words, 10 minutes' worth of commits.
In this incident, however, the triggered pipeline was 3 weeks old, so the revert was 3 weeks' worth of commits instead.
Another good comment made by @chill104 via internal channels:
Why isn't oldest_first on by default?
I think this is something we should evaluate. Whenever resource groups are in use, it is likely a deployment related job, and this implies that we may want to enforce sequencing.
@igorwwwwwwwwwwwwwwwwwwww I think the initial topic of this discussion has been addressed while completing https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24080 but your last comment about oldest_first seems to remain. Could you please expand a bit on that proposal? Is there an issue to further explore it? We can also bring that to the next sync session if needed.
It was however not deemed urgent enough to address immediately, as our workflows generally do not require restarting old pipelines.
This is generally true. Our terraform workflow consists of opening MRs, merging those MRs, and having the merge pipeline running a terraform plan + apply. Usually, there is no reason to retry a pipeline.
There are a few cases that are not handled by this workflow however:
Failed terraform apply.
Drift detection and reconciliation.
Re-creating resources.
We'll take them one at a time.
Failed terraform apply
Most of the time, terraform is able to give ample feedback during the plan phase and proactively warn users. However, there may be cases where the plan succeeds but the apply fails.
This could happen for a number of reasons, including:
Conflict, for example if a resource already exists.
Intermittent issue with the pipeline.
Intermittent issue with the cloud provider.
In case of a conflict, the cloud provider errors out. What we do to resolve it will be different on a case-by-case basis:
We may want to manually import the resource and retry.
Or manually delete it and let terraform re-create it.
Or open up a follow-up MR that resolves the conflict.
The fact that we're required to jump through hoops manually and restart pipelines here is a sign that our existing tooling does not support this case very well.
Idea: Making it easier to trigger a new pipeline for an environment could go a long way toward making failed terraform applies safer.
Drift detection and reconciliation
While we are very rarely running terraform locally, or making changes outside of terraform, it does happen sometimes.
Especially during an incident, we may need to act quickly to mitigate. This may involve making a direct change and backporting it to terraform.
This can however result in drift. If things are not backported, or a change is made by accident, it results in terraform no longer matching the state of the world.
The first thing we want to do is detect this drift, and to do so as quickly as possible, to prevent it from accumulating and resulting in a state where we can no longer safely apply terraform.
In order to do that, we have a drift detection pipeline that runs as a scheduled CI job once per day. It runs a plan for every environment, and if we get a dirty plan, we know that we have drift.
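As an illustration (not our actual pipeline code, and the directory layout is an assumption), such a check can lean on terraform plan's -detailed-exitcode flag:

```shell
#!/usr/bin/env bash
# Sketch of a per-environment drift check.
# -detailed-exitcode: 0 = clean plan, 1 = error, 2 = pending changes (drift).
set -u

for env in environments/*; do
  (
    cd "$env" || exit 1
    terraform init -input=false >/dev/null
    terraform plan -input=false -detailed-exitcode >/dev/null
    case "$?" in
      0) echo "$env: no drift" ;;
      2) echo "$env: drift detected, see reconciliation pipeline" ;;
      *) echo "$env: plan failed" ;;
    esac
  )
done
```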
If drift is detected, a message is posted to Slack. It includes a link to the reconciliation pipeline, which includes manual jobs to plan and apply. Running those plan + apply jobs effectively reverts any manual changes and restores the state to what is specified in terraform.
Doing so may not always be safe. So some judgement is needed when deciding whether to squash the delta or whether to backport the changes into terraform by opening a new MR.
However, one previously unknown problem with this approach is the following case:
Some drift is created via manual change in GCP.
Daily drift detector runs, alerts via Slack. This pipeline runs on commit C1.
An unrelated terraform change MR1 is merged, this creates commit C2. Merge pipeline runs on C2, applies the change from MR1, but also squashes the drift as a side-effect.
Unaware that the drift has been dealt with, someone decides to run a plan on the drift detector pipeline. This plan no longer shows the original drift. But because this whole pipeline runs on commit C1 it now shows a revert of the changes from MR1.
The person is unaware of MR1 (and that it has been merged), incorrectly believes that these changes are drift, and runs the apply job.
We have now accidentally rolled back MR1. It will remain in this state until the next merge occurs.
Idea: Changing this workflow to trigger a new pipeline for reconciliation could go a long way toward making the drift reconciler safer.
Re-creating resources
Now we get to the main event, the actual trigger of this incident.
Deleting resources in terraform
Terraform creates resources and tries very hard not to delete or re-create them, unless it absolutely needs to. Whenever possible, it will try to make changes in-place. For example, if we add some new metadata to a GCE instance, it will not destroy that instance.
However, there are sometimes cases where we do want to destroy and re-create them from scratch.
Our terraform pipelines do not have a way of passing such arguments (for example, a targeted destroy or -replace). Thus we are forced to fall back to either making changes manually or running terraform locally. This is already a bit of a safety hazard, so in order to contain this risk, we split it into two steps:
We perform a targeted terraform destroy locally to remove the resources in question.
We then restart the most recent pipeline for that environment to re-create those now-deleted resources.
This eliminates the need for terraform apply running locally.
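A sketch of what those two break-glass steps look like in practice; the resource addresses below are made up for illustration:

```shell
# Step 1 (local break-glass): destroy only the resources we intend to re-create.
# The target addresses are illustrative, not the real ones.
terraform destroy \
  -target='module.patroni-ci.google_compute_instance.instance[0]' \
  -target='module.patroni-ci.google_compute_instance.instance[1]'

# Step 2: restart the apply job of the most recent pipeline for this
# environment (via the CI UI) so terraform re-creates the deleted resources.
```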
Preparing for maintenance
The change that was being worked on aimed to re-create the patroni-ci-v14 fleet. The original MR had been prepared ahead of time, and merged on 2023-06-14, roughly 3 weeks earlier.
Fast-forward to 2023-07-07. We are now executing the production change. We've successfully run the targeted terraform destroy. Everything is going according to plan. We're ready to re-create those instances.
In order to know which pipeline to restart, we'd need to manually scan through the pipelines page and figure out which one contains the environment we're seeking to re-apply. We already have the successfully restarted pipeline from 3 weeks ago.
We restart that pipeline. The rest is history.
The underlying problem here is that our tooling did not support the use case of re-creating resources. This required us to fall back to local terraform invocations. Our tooling also did not support creating a new pipeline for a fresh plan after the local changes had been made. This required us to fall back to restarting an existing pipeline.
Idea: Making it easier to trigger a new pipeline for an environment could go a long way toward making re-creation of deleted resources safer. The ability to trigger a targeted terraform apply -replace would be even better.
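For comparison, a targeted re-create with -replace collapses the two break-glass steps into a single operation (the resource address is again illustrative):

```shell
# Hypothetical targeted re-create: terraform plans a destroy-and-create for
# just this resource and nothing else, so there is no separate local destroy.
terraform apply -replace='module.patroni-ci.google_compute_instance.instance[0]'
```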
Conclusion
The general rule that terraform pipelines do not need to be restarted has some very notable exceptions, and these exceptions create significant risk.
We can introduce safeguards such as those being explored in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24080. However, this does not actually solve the unsupported cases. If left unaddressed, it may in fact result in more of these operations bypassing CI and being executed locally, which creates a whole host of new risks.
Thus it is crucial that we understand where the gaps in the automation are, and fix them so that the golden path is not only the safest, but also the easiest and most user-friendly option.
Should we capture the ideas of leveraging a new pipeline instead of restarting an older one, and of using terraform apply -replace, as corrective actions?
Also, I'm wondering if the tasks you've mentioned that require manually restarting an old pipeline are documented somewhere? For instance, when performing maintenance, is this simply the obvious way of doing things, or do team members refer to a documented process?
Is there an opportunity to do a specific training on our Terraform workflows?
Regarding the Terraform workflows, I think this is entirely a tooling/process gap, not a training gap. Our workflow lacks some important concurrency safety, and that is what we need to fix.
Our Terraform workflow has a long-standing gap that allows for two sets of changes to race with each other. The "Drift detection and reconciliation" section above walks through an example of this.
I figured a sketch might help visualize this, so here goes. Corrections welcome, as always!
Terraform natively guarantees that a plan was generated from the latest state file. (The plan and state file both have serial numbers that terraform compares for consistency.) That eliminates a broad class of risks.
However, Terraform has no way to natively check that a freshly generated plan was derived from the appropriate git commit. The class of risk that we aim to mitigate is where we (via either automation or manual action) tell Terraform to plan and apply an out-of-date git commit.
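To make the first point concrete, the native protection applies to the saved-plan workflow, roughly as follows (a sketch, not our pipeline code):

```shell
# Saved-plan workflow: the plan file records the state it was generated from,
# and apply refuses it if the state has changed since the plan was created.
terraform plan -out=production.tfplan
terraform apply production.tfplan
```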
The following diagram shows 2 terraform branches being merged serially:
When branch add-resource-foo is merged, its CI pipeline will run a terraform plan and terraform apply for its merge-commit: commit 3.
Later, when the second branch add-resource-bar is merged, its CI pipeline runs for commit 5. So far, all is well.
But if we rerun the CI pipeline for the first branch, it will still use commit 3 as its reference point. Generating a fresh terraform plan from that old commit will cause terraform to undo all subsequent changes, such as those merged by the second branch add-resource-bar.
Adding a guard against this (such as the check that refuses to apply a commit older than origin/HEAD) will prevent accidentally reverting a more recently merged set of changes.
Add support to easily generate a fresh CI pipeline for the given environments associated with a merge request.
Rerunning an old pipeline from an already merged MR is dangerous because it reuses a potentially stale git commit as the basis for generating its terraform plan.
As a safety feature, we can avoid that risk by spawning a fresh new CI pipeline for the same environments as the merge request -- but using the latest commit on the master branch (rather than the merged MR's own merge commit).
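As a sketch of what that could look like (this is not existing tooling; the project ID placeholder and the environment variable name are illustrative), a fresh pipeline on the latest default-branch commit can be created through the pipeline API:

```shell
# Create a new pipeline on the latest master commit for a given environment,
# instead of retrying the merged MR's old pipeline.
curl --silent --fail --request POST \
  --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{ "ref": "master", "variables": [ { "key": "TF_ENVIRONMENT", "value": "production" } ] }' \
  "https://ops.gitlab.net/api/v4/projects/<config-mgmt-project-id>/pipeline"
```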
Thus it is crucial that we understand where the gaps in the automation are, and fix them so that the golden path is not only the safest, but also the easiest and most user-friendly option.
For example, improving existing policies (OPA) could help us bridge some gaps; we currently execute verification during MRs, but not on the main pipeline/apply.
Adding policy checks during the apply step could provide extra safety around out-of-order execution.
We currently group resources in Terraform per environment.
Gitaly nodes are defined in Terraform as part of the Production resources, sharing the same environment as the CI Patroni nodes, which were the target of the change that triggered the incident.
The Terraform pipeline runs 1 plan job and 1 apply job per environment, so whenever a change happens to any of the resources in the production environment, the apply job will attempt to reconcile all the resources (in GCP, for example) to match the definitions in Terraform.
The 3 deleted Gitaly nodes were the only ones deleted because they were the nodes added within the past three weeks. If we had recreated half of the fleet within the last 3 weeks for any legitimate reason, this incident would have deleted half of the fleet.
Why are Gitaly nodes deletable
This is a good question, I might not be the best person to answer this, but I can think of a scenario where we'd need to upgrade the VMs to a different image or resources/CPU/memory specs, which would then require the recreation (destroy/create) of the VMs.
In theory, we should be able to delete nodes without losing data, that's why Jarv raised an upstream issue #15999 (comment 1463172939) to make this as safe as possible.
Does this answer your question? If not, please go ahead and ask more questions.
I am aware we postponed the PG upgrade earlier due to a business decision; did this also play into the 3 weeks' worth of commits?
This was indeed a factor. However, one could argue that we got lucky that it was "only" 3 weeks worth of changes. The underlying issue could have been triggered for a pipeline much older, in which case the impact would have been much larger.
Does this mean we usually create, merge, run terraform plans within a short time window? Hence why we have not run into this before?
We generally do. And in fact we also did in this case. The MR whose pipeline was restarted was merged and applied the same day that it was created.
It's worth noting that the time delta between an MR being created and it being merged + applied did not play a role in this incident. There are some risks involved with that scenario, and we're exploring merge trains as a solution.
Add support to easily generate a fresh CI pipeline for the given environments associated with a merge request.
Or, in addition, an option to include -replace or -destroy, as this would also remove the currently necessary manual destroy. This would make the process even safer and more efficient.
In order to know which pipeline to restart, we'd need to manually scan through the pipelines page and figure out which one contains the environment we're seeking to re-apply. We already have the successfully restarted pipeline from 3 weeks ago.
@chill104 made a good point via internal channels that I wanted to share here:
Why is the instinct to look through pipeline pages? IMHO the environments page was supposed to be built for this, and the "latest" deploy job associated with production would have been C1… It could have changed the trajectory of next steps if this had been spotted.
That page certainly makes it easier to find the latest commit. It's good to raise awareness that this exists.
Why is the instinct to look through pipeline pages? IMHO the environments page was supposed to be built for this, and the "latest" deploy job associated with production would have been C1… It could have changed the trajectory of next steps if this had been spotted.
That page certainly makes it easier to find the latest commit. It's good to raise awareness that this exists.
@igorwwwwwwwwwwwwwwwwwwww based on your last comment here, should we make a quick change in the process (and an announcement) to incentivize team members to use a different workflow when looking for a specific pipeline in this deployment context?
Ah thanks @igorwwwwwwwwwwwwwwwwwwww! That change to the process means we no longer rely on having to look for a pipeline to re-run, so this is no longer relevant.
@f_santos @jarv @igorwwwwwwwwwwwwwwwwwwww @rehab I think one important education bit that we are missing here (unless I missed it already) is how we actually managed to recover the system. The reason I want to focus a bit on recovery, and see if there is something we can improve or at least educate the rest of the infra team on, is that mistakes are always going to happen. Preventing this specific type of issue is important, but recovery is equally important.
From what I can gather so far:
We had a separate call to recover services created via Terraform
@sxuereb great point to bring up, as I think there are some learnings here, but I will only respond to (1)-(3) since I was mostly present for that part of the recovery.
How did we recover the deleted Gitaly nodes via snapshots?
Was it a matter of checking out the master branch and hitting apply?
There was a bit of confusion initially, as we had a link to the CI job (https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/jobs/10538991) which showed a recent runtime, and it wasn't immediately obvious that this had run on an old commit. Due to that confusion there were a lot of discussions around the state file being corrupted in some way or a change being made outside of CI. We even went as far as trying to restore an older version of the terraform state file. It wasn't until we realized that the commit was 3 weeks old that it was clearer what had happened.
Most of the initial confusion stemmed from the fact that when the plan was run again against master there were also a lot of deletions. We were very worried at that time that we didn't fully understand what had happened and that running the pipeline against master would make things worse. @cmcfarland can correct me if I am off here.
What services did we have to restore first?
The redis-cluster-cache was deleted, which was causing problems. There was one team focusing on that and another team focusing mostly on the Gitaly restore and on bringing back all the resources that were deleted.
Can we improve any of this?
Did we face any difficulties?
Did we have a hard time understanding which services we needed to recover first?
Yes, in hindsight we didn't have at our fingertips:
whether a change was made outside of CI
the ability to restore the state file easily from object storage (though it ended up not being required)
a concise list of resources that were deleted, from the TF output (see the sketch below)
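For future reference, a sketch of how such a list can be pulled out of a saved plan using terraform's JSON output and jq (the plan file name is illustrative):

```shell
# List every resource address the saved plan would delete.
terraform show -json production.tfplan \
  | jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address'
```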
Also in hindsight, we should probably have focused more on the recovery of service and less on understanding what happened. I think due to the data loss we were extra cautious.
In hindsight we were slow to restore resources that were deleted, because we were doing selective applies to filter out deletions (not being confident about how exactly we got into that state).
Was this done manually on an SRE's laptop or somewhere else?
@jarv would it make sense to have someone less familiar with the runbook take a look at it and verify they would have followed the same path to recover the service?
Also in hindsight, we should probably have focused more on the recovery of service and less on understanding what happened. I think due to the data loss we were extra cautious.
In retrospect, this looks like something the IMOC (me) should have advocated for. @jarv do you have the same understanding, or do you think this is something EOCs should also try to keep in mind? I agree the potential data loss is a justification for requiring more confidence in understanding the problem before taking action, but is there maybe an opportunity to highlight that priority in our process?
@sxuereb have you managed to get more details on the questions you've raised? If yes, is there any follow-up that still needs to happen or be deferred to a follow-up issue? If not, could you please bring the remaining open questions to the next sync meeting (internal link)? Thanks!
would it make sense to have someone less familiar with the runbook take a look at it and verify they would have followed the same path to recover the service?
Yes this is something we plan to do in Q3 for DR, the OKR will be around scheduling regular gamedays for DR scenarios that would cover a similar type of restore.
In retrospect, this looks like something the IMOC (me) should have advocated for. @jarv do you have the same understanding, or do you think this is something EOCs should also try to keep in mind?
I think this was a group thing that we all should have done better on.
have you managed to get more details on the questions you've raised? If yes, is there any follow-up that still needs to happen or be deferred to a follow-up issue? If not, could you please bring the remaining open questions to the next sync meeting (internal link)? Thanks!
In addition to @sxuereb's question above about how we recovered the terraform state as well as the data, I also wanted to dive into how we recovered the availability of the site.
The primary availability impact during the incident was 503 errors being served for gitlab.com.
Webservice pods are crash looping
During the incident, a majority of webservice pods for the api and web services were crash looping.
During pod boot, we check database connections for both Postgres and Redis. If a database is not available, we fail to boot the pod. The intention is that if we were to roll out a change to the database connection config and that config were invalid, we would avoid propagating the deployment further.
However, this ended up biting us here, because the terraform pipeline deleted some of our redis nodes. Specifically the redis-cluster-cache fleet.
What is redis-cluster-cache?
The Scalability team is working on rolling out a new Redis deployment called redis-cluster-cache. It aims to migrate a subset of the redis-cache workload to Redis Cluster. See this epic for context: &878 (closed).
As part of this migration, we provision redis-cluster-cache alongside the existing redis-cache. We then perform dual-writes to both the old and the new deployments. We are able to switch reads from one to the other, and once we're confident, we can stop dual writes, and if the entire workload has been migrated, decommission the old deployment.
At the time of the outage, we were still in this dual write phase. It is possible to disable dual writes by toggling a feature flag.
Optional except not quite
In theory the new Redis should be optional. It should be possible to fall back at this stage. In practice that isn't the case.
All Redis connections that have been specified will be checked during the boot phase. Even without this check, we proactively connect to Redis Cluster in order to fetch slot mappings.
As a result, unavailability of the new redis deployment will impact any newly booted pod.
Removing the Redis config
Luckily the code is able to fall back to redis-cache if redis-cluster-cache is not configured. As such, removing the connection for redis-cluster-cache should be sufficient to allow pods to boot again.
We ran into challenges here though. We tried to remove that config by reverting the MR that introduced it: Revert MR. Unfortunately the resulting pipeline failed because it was trying to pull from registry.gitlab.com, which was unavailable.
This required us to break glass and attempt to run that helm command locally. Unfortunately the tooling is not streamlined for local use, and we were not able to get the correct incantation working quickly.
We decided to break glass again and drop down to modifying the ConfigMap / Secret objects in Kubernetes directly. That was also challenging though, because of the double YAML encoding. We were manually editing a YAML config that is a string within another YAML config.
This was the process used:
```shell
k -n gitlab get configmap gitlab-webservice -o yaml > gitlab-webservice.yaml
vim gitlab-webservice.yaml
colordiff -u <(k -n gitlab get configmap gitlab-webservice -o yaml) gitlab-webservice.yaml
k -n gitlab apply -f gitlab-webservice.yaml
k -n gitlab rollout restart deployments/gitlab-webservice-web
```
But with many eyes to review the changes, we eventually were able to get a diff that looked safe. We then applied this to gitlab-cny via kubectl. The rolling restart showed pods starting to recover. Because only the canary pods were now taking traffic, HPA scaled out a lot, and canary was serving the whole site.
At this point we started to see GitLab.com recovering.
Since things were looking good, we repeated the same process for the main stage zonal clusters as well as the regional cluster, and this allowed the load to re-balance across the main stage.
Why was git still available?
One side note (which came up during the incident review) is that the git service was still available during this time. When looking at the webservice pods, most of the api and web ones were crash looping, but for some reason this was not the case with the git pods.
Registry is broken, ... is it DNS?
So the site was back, but we started to get reports of registry being broken. This was intermittent, only some people were experiencing it.
We determined relatively quickly that this was in fact a DNS problem. The terraform plan had deleted some DNS records, including the one for cdn.registry.gitlab-static.net.
Some folks still had a cached version of the record. But asking Cloudflare's resolver gave us an NXDOMAIN.
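The check itself is a one-liner against Cloudflare's resolver (a sketch of the kind of query used):

```shell
# Query Cloudflare's resolver directly, bypassing any locally cached record.
# The status was NXDOMAIN while the record was deleted, NOERROR once restored.
dig @1.1.1.1 cdn.registry.gitlab-static.net +noall +comments | grep 'status:'
```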
We were about to manually create the record, but decided to hold off, as the parallel effort of restoring terraform was almost ready to start restoring resources. We brought that effort back into the main zoom call and performed a targeted apply to bring back the DNS records.
At this point we started seeing successful DNS resolution:
As NXDOMAIN may be cached for a short time, it would take a few more minutes to fully recover, but we did see registry recovering now.
Conclusion
In many ways we got lucky here.
The Redis nodes we lost were still in a state where we could easily lose them without having to go through data recovery. The Redis Cache state is also ephemeral and (as far as we know) can be safely dropped entirely if needed.
The recovery highlighted several gaps in our resiliency to GitLab.com being unavailable. We run pipelines on the separate ops instance for reasons of resiliency, but if those pipelines then pull from GitLab.com, it won't work.
We've seen similar cases with node bootstrapping. We may want to investigate mechanisms for detecting these implicit dependencies.
What went right?
It's worth noting that many design decisions were also really useful in helping us recover.
The Redis migration strategy made losing the new Redis cluster something we can tolerate.
The layered automation of Pipelines => Helm => Kubernetes enabled us to drop down to the lower layer when the higher-level one was not working as expected.
The separation of work streams for data recovery, terraform recovery, and site availability recovery allowed us to tackle different aspects of the incident concurrently and maintain focus.
The outage was pretty bad, but it could have been a lot worse.
At the time of the outage, we were still in this dual write phase. It is possible to disable dual writes by toggling a feature flag.
Optional except not quite
Do we need to update our application code to take into consideration the dual write feature and ignore the secondary Redis connection if it's failing?
The recovery highlighted several gaps in our resiliency to GitLab.com being unavailable. We run pipelines on the separate ops instance for reasons of resiliency, but if those pipelines then pull from GitLab.com, it won't work.
At the time of the outage, we were still in this dual write phase. It is possible to disable dual writes by toggling a feature flag. Optional except not quite
Do we need to update our application code to take into consideration the dual write feature and ignore the secondary Redis connection if it's failing? This was discussed in the sync review (internal link).
A follow-up issue has been created to improve knowledge sharing: scalability#2432
@igorwwwwwwwwwwwwwwwwwwww do we have an issue to improve reliability during rollout phases, like disabling dual write if a target is unavailable instead of preventing the pod from booting (hopefully I got that right)
do we have an issue to improve reliability during rollout phases, like disabling dual write if a target is unavailable instead of preventing the pod from booting (hopefully I got that right)
We discussed this some more in the last EMEA incident review, and I'm a bit skeptical. I'll re-iterate my thoughts on the matter here:
There is only a certain configuration in which such a fallback is safe, and it's hard to guarantee that this will always be safe.
Also, once a fallback has occurred, the datasets are no longer consistent. So unless we have strong alerting or validation mechanisms, we won't know that it happened.
All our redis deployments are highly available so that loss of a single node can be tolerated. Trying to optimize for the specific scenario of losing all nodes during a migration may not be the best use of our time.
is there any corrective action we can take to prevent this from happening in the future?
(re: DNS)
Good question. I think applying prevent_destroy to DNS records could help a bit.
do we have an issue to improve reliability during rollout phases, like disabling dual write if a target is unavailable instead of preventing the pod from booting (hopefully I got that right)
We discussed this some more in the last EMEA incident review, and I'm a bit skeptical. I'll re-iterate my thoughts on the matter here:
@jeromezng my understanding of Igor's comment is that they're not confident about an automated fallback due to data-consistency concerns: if we configure our systems to automatically fall back, we won't get alerted when this happens and we won't get the chance to intervene.
Based on this, unless @sxuereb has a different opinion, I think we can resolve this thread.
@jeromezng @rehab I agree with resolving this thread; it seems like a micro-optimization, which would end up being more confusing because some Redis connections are optional and some aren't, so it would be harder to debug issues and understand the dependencies.