2021-05-25 Customers.gitlab.com down (filled filesystem)

Current Status

The filesystem on the customers.gitlab.com VM filled up, causing the application to stop functioning. Manual action was taken to clean up some unnecessary disk usage.

Summary for CMOC notice / Exec summary:

Customer Impact: https://customers.gitlab.com
Customer Impact Duration: 06:26 - 06:34 (8 minutes)
Current state: See Incident::<state> label
Root cause: RootCauseSPoF with a full disk on the node serving the VM

Timeline

View recent production deployment and configuration events (internal only)

All times UTC.

2021-05-25

06:25:11 - First signs of errors: log writing failed. No space left on device @ io_write - /home/gitlab-customers/customers-gitlab-com/log/production.log
06:26:40 - First blackbox check fails
06:27:29 - @vitallium logs in via SSH
06:28 - Blackbox alert fires: blackbox probe availability https://customers.gitlab.com is less than 70.00% for the last 5 minutes.
06:29:45 - @vitallium begins manual actions to fix the problem
06:34:39 - Recovery, with logs: Redis is online, 498.595706715 sec downtime
06:34:43 - First blackbox check succeeds
06:37 - @cmiskell declares incident in Slack.
06:38 - Alert clears
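
As a rough cross-check of the durations quoted in this report, the gaps between timeline entries can be computed directly. This is only a sketch: which timestamps bound the outage is an assumption (first logged error to recovery, and first failed to first successful blackbox check), and the helper name `seconds_between` is made up for illustration.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"  # all timeline entries are UTC on the same day


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds from `start` to `end` (both HH:MM:SS)."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


# First "No space left on device" error until the recovery log line.
first_error_to_recovery = seconds_between("06:25:11", "06:34:39")

# First failed blackbox check until the first successful one.
blackbox_failure_window = seconds_between("06:26:40", "06:34:43")

print(f"first error to recovery: {first_error_to_recovery / 60:.1f} min")
print(f"blackbox failure window: {blackbox_failure_window / 60:.1f} min")
```

Both figures sit in the 8 to 9.5 minute range, which is consistent with the 498.6 seconds of downtime reported in the recovery log line.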

Corrective Actions

Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.

https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/3173
&390 (closed) - Moving this to GCP, which, along with a cleanup/rebuild of the VM, would also bring more detailed monitoring.

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.

Incident Review

Summary

Service(s) affected: ~"Service::Customers"
Team attribution: ~"group::fulfillment" StagePurchase
Time to detection: 3 minutes
Minutes downtime or degradation: 9 minutes

Metrics

Source

Customer Impact

Who was impacted by this incident? (i.e. external customers, internal customers)
1. External customers wanting to self-service on customers.gitlab.com
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
1. Complete inability to access the application
How many customers were affected?
1. Unknown; no logs were able to be kept because the filesystem was full.
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
1. 9 minutes of complete outage. Going by the same period the day before, around 200 requests would have been received, ~60 of which were seat_links API calls (automated, retryable, not user-affecting) and some of which were healthchecks and other automated traffic, leaving around 40 requests of 'user at a computer' traffic.

What were the root causes?

Disk filled up; timing suggests logrotate copying/compressing old files was the tipping point. Once that had failed, it would have left the disk full, requiring manual effort to resolve.
This was because we run hot and have a number of old/unnecessary files on disk, including obsolete logs, kernel packages, and obsolete release files
This wasn't picked up before it became a problem because we have no detailed monitoring, just blackbox "is the site up" checks.
Contributions to the disk saturation included:
- Unnecessary apt packages (kernels et al)
- Old/obsolete log files
- Deployments leaving old/obsolete front-end files on disk.
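
To see which of these contributors dominates on a given host, the directories under a path can be ranked by size. The following is a minimal sketch, not the tooling used during the incident; the function names are hypothetical, and it sums apparent file sizes (`st_size`), which can differ from the on-disk block usage that `du` reports.

```python
import os


def tree_size(path: str) -> int:
    """Sum the apparent sizes of all files below `path` (symlinks not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                pass
    return total


def largest_directories(root: str, limit: int = 10) -> list[tuple[str, int]]:
    """Return up to `limit` immediate subdirectories of `root`, largest first."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.path, tree_size(entry.path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:limit]
```

For example, `largest_directories("/home/gitlab-customers")` would list the biggest directories under the application user's home, which is where the failing log path in the timeline lives.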

Incident Response Analysis

How was the incident detected?
1. Blackbox monitoring for up-ness of the site
How could detection time be improved?
1. More detailed monitoring of basic system metrics (filesystem utilization)
How was the root cause diagnosed?
1. Basic system tools (du, probably)
How could time to diagnosis be improved?
1. More detailed monitoring
How did we reach the point where we knew how to mitigate the impact?
1. Not sure; @vitallium can you expand on how you figured out the cause?
How could time to mitigation be improved?
1. Of this particular cause: not clear; it was pretty quickly handled, and in general a full disk requires finding something to delete or expanding the disk.
What went well?
1. @vitallium was incredibly onto it and had found the problem before I even logged in. This was most delightful for an SRE, and I'm very thankful.

Post Incident Analysis

Did we have other events in the past with the same root cause?
1. #2342 (closed)
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
1. Yes, specifically &390 (closed)
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
1. Yes, and no. One of the contributing factors was front-end files from old deployments, but this was not the sole cause and it wasn't the act of deployment that triggered the incident.

Lessons Learned

We really need better monitoring to get ahead of these things (alert at 90%, then either expand the disk or remove old cruft). This is best achieved by migrating to GCP and getting all our baseline goodness.
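
As an illustration of the kind of check meant here, the sketch below reads filesystem utilization and compares it with a 90% threshold. It is an assumption-laden example, not the monitoring that the GCP migration would provide; the function names are hypothetical, and the percentage is computed as used/total, which can differ slightly from what `df` shows because of reserved blocks.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def should_alert(percent_used: float, threshold: float = 90.0) -> bool:
    """Return True when utilization is at or above the alert threshold."""
    return percent_used >= threshold


def check_filesystem(path: str = "/", threshold: float = 90.0) -> str:
    """Build a one-line status message for the filesystem containing `path`."""
    percent = disk_usage_percent(path)
    state = "ALERT" if should_alert(percent, threshold) else "ok"
    return f"{state}: {path} is {percent:.1f}% full (threshold {threshold:.0f}%)"
```

Run periodically, `check_filesystem("/")` would flag a disk approaching saturation well before writes start failing, leaving time to expand the disk or remove old files.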

Guidelines

Blameless RCA Guideline

Resources

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)

Edited May 26, 2021 by Brent Newton