2021-01-21: license prod down
Summary
customers.gitlab.com and license.gitlab.com are now available. Both sites experienced an outage due to a change related to the license app; because the customers app makes API calls to the license app, the failure of license.gitlab.com also caused problems on the customers site.
Timeline
All times UTC.
2021-01-21
- 21:14 - EOC receives notification from PagerDuty
- 21:14 - @cmcfarland ran this job on the license-prd branch: https://ops.gitlab.net/gitlab-com/services-base/-/jobs/2824663
- 21:17 - @cmcfarland stopped the job
- 21:17 - @cmcfarland declares incident in Slack
- 21:49 - EOC receives resolved notification from PagerDuty
Corrective Actions
- https://ops.gitlab.net/gitlab-com/services-base/-/merge_requests/157 - Remove license prod and staging environments/branches from creating a stop_review job
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12670 - Review and update the Services-Base README
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12671 - Update runbooks for Customers/License/Version to make sure SREs can find and change those services
Incident Review
Summary
After a successful deploy to the license prod environment in services-base, the stop_review job was manually run in error.
- Service(s) affected: license.gitlab.com, customers.gitlab.com
- Team attribution: ~"team::Core-Infra"
- Time to detection: 3 minutes
- Minutes downtime or degradation: 35 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Internal and external customers trying to use license.gitlab.com or customers.gitlab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - License updates, the issuing of new licenses, and customer account management were unavailable.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - According to the project metrics, the same hour on the previous day saw about 46 total requests. If traffic during this incident was similar, roughly 46 requests were not served properly.
What were the root causes?
The services-base project manages the Auto DevOps environments for version and license. Each branch is an environment, and special branches (such as license-prd) correspond directly to production environments. Other branches are considered ephemeral, and CI jobs exist to clean up and remove those environments.
The clean-up (i.e., delete) CI job is normally excluded from the pipelines of production and staging environments. The job is manual, but running it is generally considered a safe action because it is not part of the pipeline for these protected branches.
The license-prd and license-stg branches, however, were not prevented from including the clean-up job in their pipelines. The site reliability engineer making changes to the license-prd branch did not fully understand the underlying CI system and thought (incorrectly) that the clean-up job was safe to run.
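The corrective merge request linked above removes the stop_review job from the license production and staging branches. A minimal sketch of how such an exclusion can be expressed with GitLab CI rules is shown below; the job names, script paths, and branch regex are illustrative assumptions, not the actual services-base configuration.

```yaml
# Hypothetical review-app style jobs; names and scripts are illustrative only.
deploy_review:
  stage: deploy
  script:
    - ./deploy.sh "$CI_COMMIT_REF_SLUG"        # deploy the branch's environment
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    on_stop: stop_review

stop_review:
  stage: deploy
  script:
    - ./teardown.sh "$CI_COMMIT_REF_SLUG"      # deletes the environment's resources
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    action: stop
  rules:
    # Assumed shape of the fix: never create the teardown job on the
    # protected production/staging branches, so it cannot be run by mistake.
    - if: '$CI_COMMIT_BRANCH =~ /^(license|version)-(prd|stg)$/'
      when: never
    - when: manual
```

With rules along these lines, the teardown job does not appear at all in pipelines for the protected branches, so it cannot be triggered by mistake the way it was in this incident.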
Incident Response Analysis
- How was the incident detected?
  - The incident was manually raised when the job was observed deleting production resources.
- How could detection time be improved?
  - Pages were sent to the EOC within a minute of the deletions, so it may not be possible to improve detection time further.
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - This was known right away, since the actions that caused the incident were immediately apparent.
- How could time to mitigation be improved?
  - N/A
- What went well?
  - A site reliability engineer with a good understanding of the project was available to help fix the issue as quickly as possible.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)