The DRI for the incident review is the issue assignee.
If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
Who was impacted by this incident? (i.e. external customers, internal customers)
All users of gitlab.com, including external and internal customers.
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
GitLab.com was unavailable on 2023-07-07 from 16:25 UTC to 18:42 UTC. During this time the web and API interfaces were not available (503). Customers were able to perform git actions via the command line.
For customers that did not have DNS records cached, Container Registry was unavailable on 2023-07-07 from 16:25 UTC to 19:36 UTC.
A small number of git pushes on 2023-07-07 from 15:55 UTC to 16:17 UTC are not available on GitLab.com until the changes are pushed again from a local copy.
We have restored data to known recovery points, and a small subset of customer projects requires a refresh using their local copy.
The impacted project owners have been notified and were advised to re-push their changes.
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
All customers.
What were the root causes?
An outdated production configuration was applied to our production environment, which caused several GitLab.com production services to be removed and replaced.
The root cause was an out-of-sync infrastructure configuration plan (Terraform) executed against our production environment.
This infrastructure configuration plan had been prepared three weeks earlier, ahead of our production database upgrade.
Environmental drift accumulated during those three weeks, causing the planned configuration and the production environment to fall out of sync.
Executing the out-of-sync plan caused an unintended removal of production services which resulted in the outage.
We typically execute configuration plans shortly after they are prepared. However, the execution of this 3-week-old configuration plan exposed a gap in our process.
Incident Response Analysis
How was the incident detected?
16:10 UTC - A CI job begins applying an old Terraform configuration
16:20 UTC - Based on Slack reports and personal observation of 5xx errors on GitLab.com, the EOC attempted to upgrade the incident to an S1.
It was five minutes from when the job began applying destructive changes until the EOC was notified of an initial problem, and ten minutes until it was clear that this was an S1 site outage incident.
How could detection time be improved?
NA.
How was the root cause diagnosed?
18:25 UTC - Checked Cloudflare status page
18:25 UTC - DBRE brings the call's attention to a running Terraform apply job
18:28 UTC - A first look at the MR attached to the pipeline suggests it is harmless
18:30 UTC - Examining the running Terraform apply job revealed several resources being destroyed. Referencing the plan for the pipeline showed 617 resources to be destroyed.
18:31 UTC - The job was stopped to try and prevent further destruction.
18:34 UTC - A local plan against the production environment was run by the EOC to see what cloud resources were missing.
18:37 UTC - At this point, it appeared likely that the applied plan had caused the outage due to destroyed resources.
How could time to diagnosis be improved?
We had to fall back to Google Docs to manage this incident since GitLab.com was unavailable; a GitLab issue was eventually created once GitLab.com was back online: #15997 (closed). Having both the doc and the issue caused back-and-forth copying and pasting of data, which was inefficient while trying to manage the incident. One way to improve this is to use our ops instance for incident tracking rather than Google Docs. However, that would introduce additional problems and has been discarded for now. Alternative solutions are being discussed, and a follow-up issue has been created to continue exploring them: Improve Incident Management process when gitlab... (gitlab-com/www-gitlab-com#34382 - moved)
How did we reach the point where we knew how to mitigate the impact?
We spent some time trying to "fix" Terraform, or at least get a better handle on how a restore might work without having to slowly apply each difference one at a time.
While that was happening, work was put into trying to assess what specific disks and other systems were missing.
An attempt to remove the dependency on the Redis cache cluster was started to see whether that would bring the web fleet back to an operating state.
The call was broken into two Zoom calls. The main incident room was used to focus on restoring services without using Terraform. The other focused on restoring resources via Terraform.
We considered the incident mitigated once the resources were restored, the Terraform configuration was back in a clean state, and affected customers had been notified.
How could time to mitigation be improved?
Identifying a list of affected customers was delayed due to tooling (rails console and Teleport) being offline. This tooling was dependent on first getting our Terraform configuration into a clean state.
Drafting messaging for affected customers required quite a lot of cross-functional effort and approvals (e.g. engineering, product, and support to assess impact and work with corporate communications to draft the message, marketing ops to send the message, legal / customer success sign-off). This can be improved by having pre-approved messaging templates for these types of outages, allowing us to move more quickly. Follow-up issue: https://gitlab.com/gitlab-com/www-gitlab-com/-/issues/34387+
Post Incident Analysis
Did we have other events in the past with the same root cause?
A lot of people came together to help, even though it was a Friday (and Saturday)!
Our severity 1 response processes were put to the test, with immediately positive feedback from everyone involved; for example, the breakdown of Zoom calls, threads, and documents to tackle different parts of the incident was excellent.
I imagine the desire to complete this earlier comes from the need to communicate in more detail in various forums. I think I have enough information to do those communications without the review completing. Can you point me in the direction of those requests and I'll cover them off ahead of the review and follow up with any additional info from the review.
Given that we had to revert to google docs, should we have a backup project in the ops instance for incident tracking? Asking as a question not a prescription.
Affected customers could not see the tracking ticket provided in status update emails, which wasn't a good experience. In that circumstance, it would be useful to have a backup of some kind. It might make sense for that backup to be a tech stack distinct from GitLab, so that you know it cannot possibly be affected by the same issue.
+1 to this suggestion. Tracking things in a Google doc was a good step, but it introduced overhead work after the dust had settled, and it didn't provide good visibility to our users while the incident was active.
I think once we determine which issue tracker to use in the ops instance, we can automate an incident creation fallback to that tracker via woodhouse if gitlab-com is down.
The mirror page would just be a single HTML file hosted in GCP as a load-balanced static site, mirroring the ops.gitlab.net-hosted page.
The idea here is that it would get updated (ideally via webhook) every time a comment or issue description change was made to a severity 1 incident issue being managed in the ops instance.
All this would avoid the necessity to open ops.gitlab.net to the public, and obviate the need to ensure it (ops.gitlab.net) could scale to meet demands of thousands of page views during a high visibility (gitlab.com-outage) incident.
Thanks @nnelson! I was considering adding a step to the on-call team members' onboarding process to keep a local clone of the runbooks, but having them on ops is great.
We had to fall back to Google Docs to manage this incident since GitLab.com was unavailable; a GitLab issue was eventually created once GitLab.com was back online: #15997 (closed). Having both the doc and the issue caused back-and-forth copying and pasting of data, which was inefficient while trying to manage the incident. One way to improve this is to use our ops instance for incident tracking rather than Google Docs #15999 (comment 1462412414)
@rehab, @jeromezng, @meks - a couple of notes that were also discussed in the meetings:
We had held off on using ops and a public issue/project because we did not want heavy traffic on the instance we were using to recover.
We had also wanted to dogfood a feature the Monitor stage was working on - gitlab-com/www-gitlab-com#7012. I'm not sure that ever came to completion; it's something we could look at.
Thank you for the update on the root cause. I recall we used to ask the 5 whys to understand the underlying cause in depth. Can we also perform this analysis in a blameless way? @andrewn may have historical context.
As part of a change request, an old pipeline was triggered, applying an obsolete Terraform plan to the production environment.
In particular, I am interested in how we triggered an obsolete plan and what we can do to prevent this in the future. The list that @rehab created is a good start. I am looking at both the TF mechanics and also the human usability part.
Plan naming conventions to make it clear (prefix/dates)
I don't think plan names are ever seen by a human eye, so that might not be a valuable change.
And archiving old plans is not really relevant: Terraform has its own methods to prevent stale plans from being applied. But perhaps the intention is to archive old pipelines so they can be viewed but not re-run? If GitLab has a way to expire old pipelines from being re-run, that could be a feature that helps. But it seems redundant in this specific case, since we've added a check that will not let applies proceed if they are applying an older commit than origin/HEAD as a Terraform configuration.
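For reference, that check is essentially a freshness guard at the top of the apply job. Here is a rough sketch of the idea (it assumes the default branch is master and a full clone in the job), not the exact implementation:

```shell
# Hypothetical guard at the start of the apply job: refuse to apply a commit
# that is behind origin/HEAD (assumes the default branch is master).
set -euo pipefail

git fetch origin master

# If origin/master is not an ancestor of the commit being applied, a newer
# commit has been merged since this pipeline was created, so it is stale.
if ! git merge-base --is-ancestor origin/master "${CI_COMMIT_SHA}"; then
  echo "Refusing to apply: ${CI_COMMIT_SHA} is behind origin/master" >&2
  exit 1
fi
```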
While we currently have a workflow for severity 1 complete outages (can't find a link to it at the moment), which worked well during this incident, it doesn't help in historically differentiating between an event such as this one and an outage in only one of the services in production.
We can fork the workflow for complete outages into its own incident severity. Clarifying this in our docs would also help the IMOC and other engineers during their incident response.
During the first 30-45 minutes of the incident, where every minute counts, I remember there were a few questions raised about the impact, which was clear to most of us as a complete outage, but wasn't really clear from: 1. the title of the incident, and 2. the severity assigned to the incident at the time.
We can fork the workflow for complete outages into its own incident severity. Clarifying this in our docs would also help the IMOC and other engineers during their incident response.
Could you provide more detail on your thoughts here? My initial take is that we wouldn't do anything differently but don't want to simply dismiss it.
During the first 30-45 minutes of the incident, where every minute counts, I remember there were a few questions raised about the impact, which was clear to most of us as a complete outage, but wasn't really clear from: 1. the title of the incident, and 2. the severity assigned to the incident at the time.
severity1 was certainly capturing folks' attention and I agree that folks were very aware of the customer impact. I'm not sure how having severity::0 would have helped with this. Can you please expand?
Could you provide more detail on your thoughts here? My initial take is that we wouldn't do anything differently but don't want to simply dismiss it.
I'd say we mostly wouldn't do anything differently for this incident! However, the workflow for handling this type of incident (an outage preventing incident creation/viewing) was different from the workflow for handling other severity 1 incidents with no outage, and as far as I can tell, I can't find any documentation for such a workflow anywhere.
At first I thought we had an outage workflow; perhaps this was a proposal in the past that never came to life? However, I found this, which hints at outages but isn't very clear.
What I'm trying to propose here is a separate workflow for when there's a complete outage, and although it's rare, it would still be nice to formalize our response, document it, and iterate with our learnings from events like this.
During the first 30-45 minutes of the incident, where every minute counts, I remember there were a few questions raised about the impact, which was clear to most of us as a complete outage, but wasn't really clear from: 1. the title of the incident, and 2. the severity assigned to the incident at the time.
That should have been my responsibility. Apologies for having missed that.
I'd say we mostly wouldn't do anything differently for this incident! However, the workflow for handling this type of incident (an outage preventing incident creation/viewing) was different from the workflow for handling other severity 1 incidents with no outage, and as far as I can tell, I can't find any documentation for such a workflow anywhere.
I agree, and particularly as a new IMOC I found myself very inefficient at the beginning. The IMOC process available to me was based on issues, which were unavailable. The IMOC runbook was unreachable too. Maybe a more experienced IMOC or someone more familiar with operational issues would have reacted faster, but we definitely should have this written down somewhere accessible. For instance, starting a Google doc once we realized the issue tracker had become inaccessible (this was asked for by the EOC while I was still figuring out how to proceed).
Additionally, should we consider an outage a circumstance where we want a seasoned IMOC to take over? As said, time is critical in such events and we want to be as efficient as possible. A lot of team members helped during this incident; some asked me directly if I needed help or even proactively took over some tasks. Still, we might have lost some crucial minutes that an experienced IMOC would have saved.
@gonzoyumo I was out on PTO during the incident, so I don't know if about.gitlab.com was also impacted. But here is the handbook page for Incident Management Processes, which is not dependent on issues.
At any time, if an IM feels they need more support because they feel underprepared or unable to be effective in a given incident, regardless of labels or incident severity, they should feel free to escalate to Infrastructure leadership. There is a Pagerduty escalation policy in place for exactly this purpose, and it's always ok to ask for someone from Infra leadership to take over as IM in this case, while the original IM should remain engaged in a shadow capacity to learn. I'll make a note to highlight that more for new Incident Managers in the training.
Thanks @amoter. What I was referring to about "issues" in my comment is that the steps described in the IMOC responsibilities are mainly based on the incident issue being available, which was not the case here due to the outage. This, to me, caused additional burden, as I had to further adapt a process I was not familiar with. It might not look like much, and I probably lacked some judgement at that time (the "deer in the headlights" effect did not help). Still, having a checklist to follow when there is such an outage would maybe help the IMOC face the situation more efficiently.
@alanrichards the escalation was indeed triggered but not by me :/
@gonzoyumo FYI I'll bring up this thread in today's Incident Review, as there were a lot of different points raised around incidents that are also an outage. My hope is that we'll come up with a set of actions from the discussion to improve our workflows and be better prepared for instances like this in the future.
During today's incident review sync meeting it was suggested that rather than trying to create a distinct workflow, we could simply improve the current one.
Pointing out specific situations or needs that can arise and how to best address them. For instance:
what to do when the incident issue can't be created or is inaccessible?
when the scope of an incident is large, emphasize the opportunity to escalate, call additional IMOCs/EOCs, and split the tasks. As necessary, split into separate Zoom rooms. This is mentioned in the multiple incidents section but could be made more generic and benefit from improvements like automation to create the room and record the meeting.
@jarv do you know who would be best positioned to help on automating this via woodhouse like you suggested?
make explicit a process for incident response and communications when customers are impacted (this is different from the Incidents requiring direct customer interaction guideline). For instance, pre-approved messaging could help speed up the response time. @jmalleo you mentioned on Slack (internal link) that you'll be looking into this; could you please share an issue to collaborate on this topic?
Awesome suggestion @jarv! I feel like this is important enough for us to contribute the feature upstream ourselves. Should we create an issue in our issue tracker to contribute the feature?
I feel like this is important enough for us to contribute the feature upstream ourselves. Should we create an issue in our issue tracker to contribute the feature?
@cmcfarland I think that could work for us, as long as there is local permission to run it. Could you open an issue to discuss it and put it in the Foundations team queue?
Igor changed the title from Incident Review for Site-wide Outage for GitLab.com #15997 (closed) to Incident Review for Site-wide Outage for GitLab.com - Stale Terraform Pipeline #15997 (closed)
Incident communication via issue on a separate instance
I've heard from different sources (Hacker News, wider community members, friends, customers, social media, etc.) that hosting the incident issue on GitLab.com SaaS is not great and limits transparency and communication when GitLab.com is down.
Suggestions included:
Hosting a dedicated public GitLab instance for incident handling
A different system for incident updates that is linked from status.gitlab.com (imho contradicts dogfooding the product)
In my mind, we had two incidents that occurred. The first was the outage and the second was the deletion of storage. Should we consider breaking apart the configuration so that these pieces are not handled in the same change?
We rolled back one single commit; @clefelhocz1's point is whether we should break up our TF commits to touch fewer components. If we rolled back a single commit with this much impact, it seems the commit that was rolled back was a large change initially.
@meks actually we rolled back many commits, around 3 weeks' worth, or 24 days to be accurate. The impact was large because it contained many commits.
To clarify how the impact would've been different, let's take the following two scenarios.
A scenario for a better incident
The 3 Gitaly nodes that were recreated were initially added to Terraform in this commit on Jun 17, 2023. This was a normal routine task for infra, to scale up the fleet whenever resources are close to reaching capacity.
The commit that was triggered with an old plan was first pushed on Jun 13, 2023, re-triggered on Jul 07, 2023, the day of the incident.
If the 3 new Gitaly nodes had been added on Jun 12, 2023 instead of Jun 17, 2023, we wouldn't have lost the nodes during this incident! The only reason they were reverted (destroyed) is that the commit adding them falls in the window of Jun 13 - Jul 07. This was not the only commit that was reverted, but it was the most impactful (along with the Redis nodes).
A scenario for a worse incident
Let's take another example: suppose we had needed to recreate our main database nodes in the past 3 weeks for a legitimate reason, for example upgrading the main PG14 fleet.
Because the recreation of these nodes would have fallen in the window of the reverted commits (Jun 13 - Jul 07), we would have lost our main database fleet! This would have been a catastrophic event that would have taken many more hours to recover from.
The bottom line is that the vulnerability (if we can call it that) in our Terraform workflow, which allowed old commits to change the state of current production to a point in time in the past, could have caused a worse incident or a less impactful one. We ended up somewhere in the middle.
Around 2 weeks ago we experienced an event where a handful of terraform resources were removed.
Upon further investigation, we discovered that this happened due to two merges happening around the same time, resulting in competing pipelines. These pipelines compete for the same resource group lock as well as the same terraform state lock.
However, there is no enforced sequencing in the order of merges. This resulted in the following sequence of events:
08:42 - Pipeline P2 starts job J2 which applies terraform for commit C2.
08:53 - Pipeline P1 starts job J1 which applies terraform for commit C1.
Because C1 is an earlier commit than C2, applying C1 results in an unintended rollback of C2.
Only once the next merge happens will we roll forward and re-apply C2.
In this case we only rolled back a single commit. But it highlighted that there is a real risk associated with out-of-order pipelines.
Terraform itself protects against applying a stale plan, but in this case we are creating a fresh plan + apply after each merge. The problem is not a stale plan, but rather performing a plan + apply against a non-current commit.
At this point there was some speculation that a restart of an old pipeline could widen the unintended rollback window. Some googling surfaced this upstream terraform issue talking about the same hazard, specifically in the context of GitLab CI. We also briefly discussed Terraform Cloud and Atlantis as platforms which have stronger ordering guarantees.
It was however not deemed urgent enough to address immediately, as our workflows generally do not require restarting old pipelines. After all, we've been applying terraform via CI for many years without issue.
@igorwwwwwwwwwwwwwwwwwwww do you think setting process mode to oldest_first might have helped that specific situation? Assuming you are auto-applying, I think that this would prevent J2 running before J1 is completed.
@tmeijn I wasn't aware of this feature. Yes, I think it would help in the competing merges case. Thanks!
I suspect there is still a potential window for the pipelines to be created out of order (assuming they are created via Sidekiq), but the window for that is much narrower and effectively requires simultaneous merges.
@igorwwwwwwwwwwwwwwwwwwww Alright, hope it helps! Don't judge my Bash skills but use this little script to set resource_groups to oldest_first by default, since I like them to work like that anyway:
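(The script itself isn't reproduced above. As a rough sketch of the same idea, assuming the GitLab Resource group API and `jq` are available, it could look something like this:)

```shell
#!/usr/bin/env bash
# Hypothetical sketch: switch every resource group in a project to
# oldest_first via the GitLab Resource group API. Requires GITLAB_TOKEN.
set -euo pipefail

GITLAB_URL="${GITLAB_URL:-https://gitlab.com}"   # adjust for the ops instance if needed
PROJECT_ID="$1"                                  # numeric ID or URL-encoded project path

for key in $(curl --silent --fail \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/resource_groups" | jq -r '.[].key'); do
  curl --silent --fail --request PUT \
    --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    --data "process_mode=oldest_first" \
    "${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/resource_groups/${key}" > /dev/null
  echo "Set resource group '${key}' to process_mode=oldest_first"
done
```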
If not, did this scenario play a part in this incident?
Yes! This is the same scenario that happened during this incident, but at a different scale.
In Igor's example, the two pipelines ran one after the other: one was applied on Jun 26, 2023 at 09:42 AM, the other minutes later at 09:53 AM, so we reverted only one commit, or in other words, 10 minutes' worth of commits.
In this incident, however, the triggered pipeline was 3 weeks old, so the revert was 3 weeks' worth of commits instead.
Another good comment made by @chill104 via internal channels:
Why isn't oldest_first on by default?
I think this is something we should evaluate. Whenever resource groups are in use, it is likely a deployment related job, and this implies that we may want to enforce sequencing.
@igorwwwwwwwwwwwwwwwwwwww I think the initial topic of this discussion has been addressed while completing https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24080 but your last comment about oldest_first seems to remain. Could you please expand a bit on that proposal? Is there an issue to further explore it? We can also bring that to the next sync session if needed.
It was however not deemed urgent enough to address immediately, as our workflows generally do not require restarting old pipelines.
This is generally true. Our terraform workflow consists of opening MRs, merging those MRs, and having the merge pipeline running a terraform plan + apply. Usually, there is no reason to retry a pipeline.
There are a few cases that are not handled by this workflow however:
Failed terraform apply.
Drift detection and reconciliation.
Re-creating resources.
We'll take them one at a time.
Failed terraform apply
Most of the time, terraform is able to give ample feedback during the plan phase and proactively warn users. However, there may be cases where the plan succeeds but the apply fails.
This could happen for a number of reasons, including:
Conflict, for example if a resource already exists.
Intermittent issue with the pipeline.
Intermittent issue with the cloud provider.
In case of a conflict, the cloud provider errors out. What we do to resolve it will be different on a case-by-case basis:
We may want to manually import the resource and retry.
Or manually delete it and let terraform re-create it.
Or open up a follow-up MR that resolves the conflict.
The fact that we're required to jump through hoops manually and restart pipelines here is a sign that our existing tooling does not support this case very well.
Idea: Making it easier to trigger a new pipeline for an environment could go a long way toward making failed terraform applies safer.
Drift detection and reconciliation
While we are very rarely running terraform locally, or making changes outside of terraform, it does happen sometimes.
Especially during an incident, we may need to act quickly to mitigate. This may involve making a direct change and backporting it to terraform.
This can however result in drift. If things are not backported, or a change is made by accident, it results in terraform no longer matching the state of the world.
The first thing we want to do is detect this drift, and to do so as quickly as possible, to prevent it from accumulating and resulting in a state where we can no longer safely apply terraform.
In order to do that, we have a drift detection pipeline that runs as a scheduled CI job once per day. It runs a plan for every environment, and if we get a dirty plan, we know that we have drift.
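As an illustration (not our actual pipeline code, and the directory layout is an assumption), such a check can lean on terraform plan's -detailed-exitcode flag:

```shell
#!/usr/bin/env bash
# Sketch of a per-environment drift check.
# -detailed-exitcode: 0 = clean plan, 1 = error, 2 = pending changes (drift).
set -u

for env in environments/*; do
  (
    cd "$env" || exit 1
    terraform init -input=false >/dev/null
    terraform plan -input=false -detailed-exitcode >/dev/null
    case "$?" in
      0) echo "$env: no drift" ;;
      2) echo "$env: drift detected, see reconciliation pipeline" ;;
      *) echo "$env: plan failed" ;;
    esac
  )
done
```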
If drift is detected, a message is posted to Slack. It includes a link to the reconciliation pipeline, which includes manual jobs to plan and apply. Running those plan + apply jobs effectively reverts any manual changes and restores the state to what is specified in terraform.
Doing so may not always be safe. So some judgement is needed when deciding whether to squash the delta or whether to backport the changes into terraform by opening a new MR.
However, one previously unknown problem with this approach is the following case:
Some drift is created via manual change in GCP.
Daily drift detector runs, alerts via Slack. This pipeline runs on commit C1.
An unrelated terraform change MR1 is merged, this creates commit C2. Merge pipeline runs on C2, applies the change from MR1, but also squashes the drift as a side-effect.
Unaware that the drift has been dealt with, someone decides to run a plan on the drift detector pipeline. This plan no longer shows the original drift. But because this whole pipeline runs on commit C1 it now shows a revert of the changes from MR1.
The person is unaware of MR1 (and that it has been merged), incorrectly believes that these changes are drift, and runs the apply job.
We have now accidentally rolled back MR1. It will remain in this state until the next merge occurs.
Idea: Changing this workflow to trigger a new pipeline for reconciliation could go a long way toward making the drift reconciler safer.
Re-creating resources
Now we get to the main event, the actual trigger of this incident.
Deleting resources in terraform
Terraform creates resources and tries very hard not to delete or re-create them, unless it absolutely needs to. Whenever possible, it will try to make changes in-place. For example, if we add some new metadata to a GCE instance, it will not destroy that instance.
However, there are sometimes cases where we do want to destroy and re-create them from scratch.
Our terraform pipelines do not have a way of passing such arguments (for example, a targeted destroy or -replace). Thus we are forced to fall back to either making changes manually or running terraform locally. This is already a bit of a safety hazard, so in order to contain this risk, we split it into two steps:
We perform a targeted terraform destroy locally to remove the resources in question.
We then restart the most recent pipeline for that environment to re-create those now-deleted resources.
This eliminates the need for terraform apply running locally.
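A sketch of what those two break-glass steps look like in practice; the resource addresses below are made up for illustration:

```shell
# Step 1 (local break-glass): destroy only the resources we intend to re-create.
# The target addresses are illustrative, not the real ones.
terraform destroy \
  -target='module.patroni-ci.google_compute_instance.instance[0]' \
  -target='module.patroni-ci.google_compute_instance.instance[1]'

# Step 2: restart the apply job of the most recent pipeline for this
# environment (via the CI UI) so terraform re-creates the deleted resources.
```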
Preparing for maintenance
The change that was being worked on aimed to re-create the patroni-ci-v14 fleet. The original MR had been prepared ahead of time, and merged on 2023-06-14, roughly 3 weeks earlier.
Fast-forward to 2023-07-07. We are now executing the production change. We've successfully run the targeted terraform destroy. Everything is going according to plan. We're ready to re-create those instances.
In order to know which pipeline to restart, we'd need to manually scan through the pipelines page and figure out which one contains the environment we're seeking to re-apply. We already have the successfully restarted pipeline from 3 weeks ago.
We restart that pipeline. The rest is history.
The underlying problem here is that our tooling did not support the use case of re-creating resources. This required us to fall back to local terraform invocations. Our tooling also did not support creating a new pipeline for a fresh plan after the local changes had been made. This required us to fall back to restarting an existing pipeline.
Idea: Making it easier to trigger a new pipeline for an environment could go a long way toward making re-creation of deleted resources safer. The ability to trigger a targeted terraform apply -replace would be even better.
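For comparison, a targeted re-create with -replace collapses the two break-glass steps into a single operation (the resource address is again illustrative):

```shell
# Hypothetical targeted re-create: terraform plans a destroy-and-create for
# just this resource and nothing else, so there is no separate local destroy.
terraform apply -replace='module.patroni-ci.google_compute_instance.instance[0]'
```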
Conclusion
The general rule that terraform pipelines do not need to be restarted has some very notable exceptions, and these exceptions create significant risk.
We can introduce safeguards such as those being explored in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24080. However, this does not actually solve the unsupported cases. If left unaddressed, it may in fact result in more of these operations bypassing CI and being executed locally, which creates a whole host of new risks.
Thus it is crucial that we understand where the gaps in the automation are, and fix them so that the golden path is not only the safest, but also the easiest and most user-friendly option.
Should we capture the ideas of leveraging a new pipeline instead of restarting an older one, and of using terraform apply -replace, as corrective actions?
Also, I'm wondering if the tasks you've mentioned that require manually restarting an old pipeline are documented somewhere? For instance, when performing maintenance, is this simply the obvious way of doing things, or do team members refer to a documented process?
Is there an opportunity to do a specific training on our Terraform workflows?
Regarding the Terraform workflows, I think this is entirely a tooling/process gap, not a training gap. Our workflow lacks some important concurrency safety, and that is what we need to fix.
Our Terraform workflow has a long-standing gap that allows for two sets of changes to race with each other. The "Drift detection and reconciliation" section above walks through an example of this.
I figured a sketch might help visualize this, so here goes. Corrections welcome, as always!
Terraform natively guarantees that a plan was generated from the latest state file. (The plan and state file both have serial numbers that terraform compares for consistency.) That eliminates a broad class of risks.
However, Terraform has no way to natively check that a freshly generated plan was derived from the appropriate git commit. The class of risk that we aim to mitigate is where we (via either automation or manual action) tell Terraform to plan and apply an out-of-date git commit.
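To make the first point concrete, the native protection applies to the saved-plan workflow, roughly as follows (a sketch, not our pipeline code):

```shell
# Saved-plan workflow: the plan file records the state it was generated from,
# and apply refuses it if the state has changed since the plan was created.
terraform plan -out=production.tfplan
terraform apply production.tfplan
```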
The following diagram shows 2 terraform branches being merged serially:
When branch add-resource-foo is merged, its CI pipeline will run a terraform plan and terraform apply for its merge-commit: commit 3.
Later, when the second branch add-resource-bar is merged, its CI pipeline runs for commit 5. So far, all is well.
But if we rerun the CI pipeline for the first branch, it will still use commit 3 as its reference point. Generating a fresh terraform plan from that old commit will cause terraform to undo all subsequent changes, such as those merged by the second branch add-resource-bar.
Adding a guard against this (such as the check that refuses to apply a commit older than origin/HEAD) will prevent accidentally reverting a more recently merged set of changes.
Add support to easily generate a fresh CI pipeline for the given environments associated with a merge request.
Rerunning an old pipeline from an already merged MR is dangerous because it reuses a potentially stale git commit as the basis for generating its terraform plan.
As a safety feature, we can avoid that risk by spawning a fresh new CI pipeline for the same environments as the merge request -- but using the latest commit on the master branch (rather than the merged MR's own merge commit).
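As a sketch of what that could look like (this is not existing tooling; the project ID placeholder and the environment variable name are illustrative), a fresh pipeline on the latest default-branch commit can be created through the pipeline API:

```shell
# Create a new pipeline on the latest master commit for a given environment,
# instead of retrying the merged MR's old pipeline.
curl --silent --fail --request POST \
  --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{ "ref": "master", "variables": [ { "key": "TF_ENVIRONMENT", "value": "production" } ] }' \
  "https://ops.gitlab.net/api/v4/projects/<config-mgmt-project-id>/pipeline"
```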
Thus it is crucial that we understand where the gaps in the automation are, and fix them so that the golden path is not only the safest, but also the easiest and most user-friendly option.
For example, improving existing policies (OPA) could help us bridge some gaps; we currently execute verification during MRs, but not on the main pipeline/apply.
Adding policy checks during the apply step could provide extra safety around out-of-order execution.
We currently group resources in Terraform per environment.
Gitaly nodes are defined in Terraform as part of the Production resources, sharing the same environment as the CI Patroni nodes, which were the target of the change that triggered the incident.
The Terraform pipeline runs 1 plan job and 1 apply job per environment, so whenever a change happens to any of the resources in the production environment, the apply job will attempt to reconcile all the resources (in GCP, for example) to match the definitions in Terraform.
The 3 deleted Gitaly nodes were the only ones deleted because they were the nodes added within the past three weeks. If we had recreated half of the fleet within the last 3 weeks for any legitimate reason, this incident would have deleted half of the fleet.
Why are Gitaly nodes deletable
This is a good question, I might not be the best person to answer this, but I can think of a scenario where we'd need to upgrade the VMs to a different image or resources/CPU/memory specs, which would then require the recreation (destroy/create) of the VMs.
In theory, we should be able to delete nodes without losing data, that's why Jarv raised an upstream issue #15999 (comment 1463172939) to make this as safe as possible.
Does this answer your question? If not, please go ahead and ask more questions.
I am aware we postponed the PG upgrade earlier due to a business decision; did this also play into the 3 weeks' worth of commits?
This was indeed a factor. However, one could argue that we got lucky that it was "only" 3 weeks worth of changes. The underlying issue could have been triggered for a pipeline much older, in which case the impact would have been much larger.
Does this mean we usually create, merge, run terraform plans within a short time window? Hence why we have not run into this before?
We generally do. And in fact we also did in this case. The MR whose pipeline was restarted was merged and applied the same day that it was created.
It's worth noting that the time delta between an MR being created and it being merged + applied did not play a role in this incident. There are some risks involved with that scenario, and we're exploring merge trains as a solution.
Add support to easily generate a fresh CI pipeline for the given environments associated with a merge request.
Or, in addition, an option to include -replace or -destroy, as this would also remove the currently necessary manual destroy. This would make the process even safer and more efficient.
In order to know which pipeline to restart, we'd need to manually scan through the pipelines page and figure out which one contains the environment we're seeking to re-apply. We already have the successfully restarted pipeline from 3 weeks ago.
@chill104 made a good point via internal channels that I wanted to share here:
Why is the instinct to look through pipeline pages? IMHO the environments page was supposed to be built for this, and the "latest" deploy job associated with production would have been C1… It could have changed the trajectory of next steps if this had been spotted.
That page certainly makes it easier to find the latest commit. It's good to raise awareness that this exists.
Why is the instinct to look through pipeline pages? IMHO the environments page was supposed to be built for this, and the "latest" deploy job associated with production would have been C1… It could have changed the trajectory of next steps if this had been spotted.
That page certainly makes it easier to find the latest commit. It's good to raise awareness that this exists.
@igorwwwwwwwwwwwwwwwwwwww based on your last comment here, should we make a quick change in the process (and an announcement) to incentivize team members to use a different workflow when looking for a specific pipeline in this deployment context?
Ah thanks @igorwwwwwwwwwwwwwwwwwwww! That change to the process means we no longer rely on having to look for a pipeline to re-run, so this is no longer relevant.
@f_santos @jarv @igorwwwwwwwwwwwwwwwwwwww @rehab I think one important education bit that we are missing here (unless I missed it already) is how we actually managed to recover the system. The reason I want to focus a bit on recovery, and see if there is something we can improve or at least educate the rest of the infra team on, is that mistakes are always going to happen. Preventing this specific type of issue is important, but recovery is equally important.
From what I can gather so far:
We had a separate call to recover services created via Terraform
@sxuereb great point to bring up, as I think there are some learnings here, but I will only respond to (1)-(3) since I was mostly present for that part of the recovery.
How did we recover the deleted Gitaly nodes via snapshots?
Was it a matter of checking out the master branch and hitting apply?
There was a bit of confusion initially, as we had a link to the CI job (https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/jobs/10538991) which showed a recent runtime, and it wasn't immediately obvious that this had run on an old commit. Due to that confusion there were a lot of discussions around the state file being corrupted in some way or a change being made outside of CI. We even went as far as trying to restore an older version of the terraform state file. It wasn't until we realized that the commit was 3 weeks old that it was clearer what had happened.
Most of the initial confusion stemmed from the fact that when the plan was run again against master there were also a lot of deletions. We were very worried at that time that we didn't fully understand what had happened and that running the pipeline against master would make things worse. @cmcfarland can correct me if I am off here.
What services did we have to restore first?
The redis-cluster-cache was deleted, which was causing problems. There was one team focusing on that and another team focusing mostly on the Gitaly restore and on bringing back all the resources that were deleted.
Can we improve any of this?
Did we face any difficulties?
Did we have a hard time understanding which services we needed to recover first?
Yes, in hindsight we didn't have at our fingertips:
whether a change was made outside of CI
the ability to restore the state file easily from object storage (though it ended up not being required)
a concise list of resources that were deleted, from the TF output (see the sketch below)
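For future reference, a sketch of how such a list can be pulled out of a saved plan using terraform's JSON output and jq (the plan file name is illustrative):

```shell
# List every resource address the saved plan would delete.
terraform show -json production.tfplan \
  | jq -r '.resource_changes[] | select(.change.actions | index("delete")) | .address'
```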
Also in hindsight, we should probably have focused more on the recovery of service and less on understanding what happened. I think due to the data loss we were extra cautious.
In hindsight we were slow to restore resources that were deleted, because we were doing selective applies to filter out deletions (not being confident about how exactly we got into that state).
Was this done manually on an SRE's laptop or somewhere else?
@jarv would it make sense to have someone less familiar with the runbook take a look at it and verify they would have followed the same path to recover the service?
Also in hindsight, we should probably have focused more on the recovery of service and less on understanding what happened. I think due to the data loss we were extra cautious.
In retrospect, this looks like something the IMOC (me) should have advocated for. @jarv do you have the same understanding, or do you think this is something EOCs should also try to keep in mind? I agree the potential data loss is a justification for requiring more confidence in understanding the problem before taking action, but is there maybe an opportunity to highlight that priority in our process?
@sxuereb have you managed to get more details on the questions you've raised? If yes, is there any follow-up that still needs to happen or be deferred to a follow-up issue? If not, could you please bring the remaining open questions to the next sync meeting (internal link)? Thanks!
would it make sense to have someone less familiar with the runbook take a look at it and verify they would have followed the same path to recover the service?
Yes this is something we plan to do in Q3 for DR, the OKR will be around scheduling regular gamedays for DR scenarios that would cover a similar type of restore.
In retrospect, this looks like something the IMOC (me) should have advocated for. @jarv do you have the same understanding, or do you think this is something EOCs should also try to keep in mind?
I think this was a group thing that we all should have done better on.
have you managed to get more details on the questions you've raised? If yes, is there any follow-up that still needs to happen or be deferred to a follow-up issue? If not, could you please bring the remaining open questions to the next sync meeting (internal link)? Thanks!
In addition to @sxuereb's question above about how we recovered the terraform state as well as the data, I also wanted to dive into how we recovered the availability of the site.
The primary availability impact during the incident was 503 errors being served for gitlab.com.
Webservice pods are crash looping
During the incident, a majority of webservice pods for the api and web services were crash looping.
During pod boot, we check database connections for both Postgres and Redis. If a database is not available, we fail to boot the pod. The intention is that if we were to roll out a change to the database connection config and that config were invalid, we would avoid propagating the deployment further.
However, this ended up biting us here, because the terraform pipeline deleted some of our redis nodes. Specifically the redis-cluster-cache fleet.
What is redis-cluster-cache?
The Scalability team is working on rolling out a new Redis deployment called redis-cluster-cache. It aims to migrate a subset of the redis-cache workload to Redis Cluster. See this epic for context: &878 (closed).
As part of this migration, we provision redis-cluster-cache alongside the existing redis-cache. We then perform dual-writes to both the old and the new deployments. We are able to switch reads from one to the other, and once we're confident, we can stop dual writes, and if the entire workload has been migrated, decommission the old deployment.
At the time of the outage, we were still in this dual write phase. It is possible to disable dual writes by toggling a feature flag.
Optional except not quite
In theory the new Redis should be optional. It should be possible to fall back at this stage. In practice that isn't the case.
All Redis connections that have been specified will be checked during the boot phase. Even without this check, we proactively connect to Redis Cluster in order to fetch slot mappings.
As a result, unavailability of the new redis deployment will impact any newly booted pod.
Removing the Redis config
Luckily the code is able to fall back to redis-cache if redis-cluster-cache is not configured. As such, removing the connection for redis-cluster-cache should be sufficient to allow pods to boot again.
We ran into challenges here though. We tried to remove that config by reverting the MR that introduced it: Revert MR. Unfortunately the resulting pipeline failed because it was trying to pull from registry.gitlab.com, which was unavailable.
This required us to break glass and attempt to run that helm command locally. Unfortunately the tooling is not streamlined for local use, and we were not able to get the correct incantation working quickly.
We decided to break glass again and drop down to modifying the ConfigMap / Secret objects in Kubernetes directly. That was also challenging though, because of the double YAML encoding. We were manually editing a YAML config that is a string within another YAML config.
This was the process used:
```shell
k -n gitlab get configmap gitlab-webservice -o yaml > gitlab-webservice.yaml
vim gitlab-webservice.yaml
colordiff -u <(k -n gitlab get configmap gitlab-webservice -o yaml) gitlab-webservice.yaml
k -n gitlab apply -f gitlab-webservice.yaml
k -n gitlab rollout restart deployments/gitlab-webservice-web
```
But with many eyes to review the changes, we eventually were able to get a diff that looked safe. We then applied this to gitlab-cny via kubectl. The rolling restart showed pods starting to recover. Because only the canary pods were now taking traffic, HPA scaled out a lot, and canary was serving the whole site.
At this point we started to see GitLab.com recovering.
Since things were looking good, we repeated the same process for the main stage zonal clusters as well as the regional cluster, and this allowed the load to re-balance across the main stage.
Why was git still available?
One side note (which came up during the incident review) is that the git service was still available during this time. When looking at the webservice pods, most of the api and web ones were crash looping, but for some reason this was not the case with the git pods.
Registry is broken, ... is it DNS?
So the site was back, but we started to get reports of registry being broken. This was intermittent, only some people were experiencing it.
We determined relatively quickly that this was in fact a DNS problem. The terraform plan had deleted some DNS records, including the one for cdn.registry.gitlab-static.net.
Some folks still had a cached version of the record. But asking Cloudflare's resolver gave us an NXDOMAIN.
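The check itself is a one-liner against Cloudflare's resolver (a sketch of the kind of query used):

```shell
# Query Cloudflare's resolver directly, bypassing any locally cached record.
# The status was NXDOMAIN while the record was deleted, NOERROR once restored.
dig @1.1.1.1 cdn.registry.gitlab-static.net +noall +comments | grep 'status:'
```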
We were about to manually create the record, but decided to hold off, as the parallel effort of restoring terraform was almost ready to start restoring resources. We brought that effort back into the main zoom call and performed a targeted apply to bring back the DNS records.
At this point we started seeing successful DNS resolution:
As NXDOMAIN may be cached for a short time, it would take a few more minutes to fully recover, but we did see registry recovering now.
Conclusion
In many ways we got lucky here.
The Redis nodes we lost were still in a state where we could easily lose them without having to go through data recovery. The Redis Cache state is also ephemeral and (as far as we know) can be safely dropped entirely if needed.
The recovery highlighted several gaps in our resiliency to GitLab.com being unavailable. We run pipelines on the separate ops instance for reasons of resiliency, but if those pipelines then pull from GitLab.com, it won't work.
We've seen similar cases with node bootstrapping. We may want to investigate mechanisms for detecting these implicit dependencies.
What went right?
It's worth noting that many design decisions were also really useful in helping us recover.
The Redis migration strategy made losing the new Redis cluster something we can tolerate.
The layered automation of Pipelines => Helm => Kubernetes enabled us to drop down to the lower layer when the higher-level one was not working as expected.
The separation of work streams for data recovery, terraform recovery, and site availability recovery allowed us to tackle different aspects of the incident concurrently and maintain focus.
The outage was pretty bad, but it could have been a lot worse.
At the time of the outage, we were still in this dual write phase. It is possible to disable dual writes by toggling a feature flag.
Optional except not quite
Do we need to update our application code to take into consideration the dual write feature and ignore the secondary Redis connection if it's failing?
The recovery highlighted several gaps in our resiliency to GitLab.com being unavailable. We run pipelines on the separate ops instance for reasons of resiliency, but if those pipelines then pull from GitLab.com, it won't work.
At the time of the outage, we were still in this dual write phase. It is possible to disable dual writes by toggling a feature flag. Optional except not quite
Do we need to update our application code to take into consideration the dual write feature and ignore the secondary Redis connection if it's failing? This was discussed in the sync review (internal link).
A follow-up issue has been created to improve knowledge sharing: scalability#2432
@igorwwwwwwwwwwwwwwwwwwww do we have an issue to improve reliability during rollout phases, like disabling dual write if a target is unavailable instead of preventing the pod from booting (hopefully I got that right)
do we have an issue to improve reliability during rollout phases, like disabling dual write if a target is unavailable instead of preventing the pod from booting (hopefully I got that right)
We discussed this some more in the last EMEA incident review, and I'm a bit skeptical. I'll re-iterate my thoughts on the matter here:
There is only a certain configuration in which such a fallback is safe, and it's hard to guarantee that this will always be safe.
Also, once a fallback has occurred, the datasets are no longer consistent. So unless we have strong alerting or validation mechanisms, we won't know that it happened.
All our redis deployments are highly available so that loss of a single node can be tolerated. Trying to optimize for the specific scenario of losing all nodes during a migration may not be the best use of our time.
is there any corrective action we can take to prevent this from happening in the future?
(re: DNS)
Good question. I think applying prevent_destroy to DNS records could help a bit.
do we have an issue to improve reliability during rollout phases, like disabling dual write if a target is unavailable instead of preventing the pod from booting (hopefully I got that right)
We discussed this some more in the last EMEA incident review, and I'm a bit skeptical. I'll re-iterate my thoughts on the matter here:
@jeromezng my understanding of Igor's comment is that they're not confident about an automated fallback due to data-consistency concerns: if we configure our systems to automatically fall back, we won't get alerted when this happens and we won't get the chance to intervene.
Based on this, unless @sxuereb has a different opinion, I think we can resolve this thread.
@jeromezng @rehab I agree with resolving this thread; it seems like a micro-optimization, which would end up being more confusing because some Redis connections are optional and some aren't, so it would be harder to debug issues and understand the dependencies.