2021-04-11: Pages NFS server is getting low on free inodes
Current Status
Inode usage is back below the alerting threshold after we manually deleted 94 million residual temp files.
We empirically confirmed the expectation that the legacy NFS volume is not being actively used by the Pages nodes. This supports the idea that we can also stop writing to the NFS volume on the Sidekiq nodes and fully deprecate that storage system (just for our GitLab.com environment, not as a product feature removal).
The cause of the leaked residual temp files may be related to Sidekiq jobs being killed before clean-up. A promising lead suggests that unzipping large zip archives is slow and memory-intensive, possibly leading to timeouts or OOM kills. Unzip performance is being examined in https://gitlab.com/gitlab-org/gitlab/-/issues/327581.
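For reference, the manual clean-up was of roughly the following shape. This is a minimal sketch only, assuming an age-based cutoff and a dry-run mode; it is not the exact procedure used during the incident:

```python
#!/usr/bin/env python3
# Sketch: enumerate and remove residual Pages temp entries older than a cutoff.
# Illustrative only; the age threshold and dry-run behaviour are assumptions.
import os
import shutil
import time

PAGES_TMP = "/var/opt/gitlab/gitlab-rails/shared/pages/@pages.tmp"
MAX_AGE_DAYS = 7   # assumption: entries older than a week are residual
DRY_RUN = True     # flip to False to actually delete

cutoff = time.time() - MAX_AGE_DAYS * 86400

for entry in os.scandir(PAGES_TMP):
    try:
        mtime = entry.stat(follow_symlinks=False).st_mtime
    except FileNotFoundError:
        continue  # entry disappeared underneath us; skip it
    if mtime >= cutoff:
        continue  # recent enough that it may still be in use
    if DRY_RUN:
        print(f"would remove {entry.path}")
    elif entry.is_dir(follow_symlinks=False):
        shutil.rmtree(entry.path, ignore_errors=True)
    else:
        os.unlink(entry.path)
```

Removing tens of millions of inodes over NFS is slow, so in practice a clean-up like this would be run in batches and tracked against the inode usage alert.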
Summary for CMOC notice / Exec summary:
- Customer Impact: Service::Pages
- Customer Impact Duration: No direct customer impact. This incident was proactive mitigation of an impending failure mode.
- Current state: Incident::Mitigated
- Root cause: Saturation of inodes due to temp files and directories not being reliably deleted (RootCause::Saturation). The cause of the missing clean-up is being investigated.
Background context
The NFS server pages-01-stor-gprd was historically the canonical backing storage for GitLab Pages files. That Pages data was migrated to object storage several months ago. To support the transition from NFS to object storage, Pages has been writing files to both storage locations. Now that object storage is the canonical storage target for Pages data, the NFS volume should no longer be needed.
Most of the inode consumption for the legacy NFS server's filesystem is driven by Pages-related CI jobs writing temp files but not reliably removing them under some circumstances. These residual temp files and temp directories are stored on the NFS server under the following path:
/var/opt/gitlab/gitlab-rails/shared/pages/@pages.tmp
Those residual temp files have been accumulating continuously for a few years, so this is not a new problem. However, the rate of accumulation increased abruptly in January. This increased leak rate may correlate with the start of the object storage migration, but we have not examined that carefully yet.
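For context on the alerting signal, inode saturation of a volume can be read directly via statvfs. A minimal sketch, assuming the volume is mounted under the Pages shared path and an 80% alerting threshold (both are assumptions, not the actual alert definition):

```python
import os

MOUNT_POINT = "/var/opt/gitlab/gitlab-rails/shared/pages"  # assumed mount point
THRESHOLD = 0.80                                            # assumed alert threshold

st = os.statvfs(MOUNT_POINT)
used_inodes = st.f_files - st.f_ffree
used_fraction = used_inodes / st.f_files

print(f"inodes used: {used_inodes} of {st.f_files} ({used_fraction:.1%})")
if used_fraction > THRESHOLD:
    print("WARNING: inode usage is above the alerting threshold")
```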
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-04-11
- 16:54 - @msmiley declares incident in Slack.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Deprecate Pages NFS: gitlab-com&1078 (closed)
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, as laid out in our handbook page. This might include the summary, timeline, or any other bits of information. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Time to detection:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
- ...
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)