2020-07-07: GitLab Pages increased error rates
Summary
A recent version of omnibus introduced a change in SSL certificate handling for Pages that did not work as expected. It left Pages unable to communicate with the GitLab.com API, eventually resulting in 502 errors for Pages users.
Timeline
All times UTC.
2020-07-07
- 14:21 - Error ratios start rising for web-pages service
- 14:30 - jarv declares incident in Slack using the /incident declare command
- 15:04 - Root cause is diagnosed
- 15:10 - Chef-client is stopped on the nodes, omnibus is rolled back, and gitlab-ctl reconfigure is run (see the sketch below)
- 15:39 - web-pages service error ratio returns to normal
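For reference, a minimal sketch of the rollback performed at 15:10, assuming an apt-managed omnibus installation; the package version shown is a placeholder, not the exact version rolled back to.

```shell
# Stop chef-client so it doesn't re-converge the node onto the newer omnibus version
sudo systemctl stop chef-client

# Downgrade the omnibus package to the previous known-good version
# (the version string here is a placeholder)
sudo apt-get install --allow-downgrades gitlab-ee=13.1.4-ee.0

# Re-run configuration so gitlab-pages picks up the rolled-back settings
sudo gitlab-ctl reconfigure
```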
Incident Review
Summary
A recent version of omnibus introduced a change in SSL certificate handling for Pages that did not work as expected. It left Pages unable to communicate with the GitLab.com API, eventually resulting in 502 errors for Pages users.
- Service(s) affected: Pages
- Team attribution: Infra
- Minutes downtime or degradation: 72 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- External customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- First they would get a certificate error from their browser, then a 502 error if they chose to proceed
- How many customers were affected?
- 224 (number of unique IPs that received an error response; see the sketch after this list)
- If a precise customer impact number is unknown, what is the estimated potential impact?
- N/A
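For context, a rough sketch of how a unique-IP count like the one above can be derived from the Pages NGINX access logs; the log path and field positions are assumptions based on the default omnibus layout and combined log format.

```shell
# Count unique client IPs that received 5xx responses from Pages.
# Assumes combined log format: $1 = client IP, $9 = HTTP status.
sudo awk '$9 ~ /^5/ {print $1}' /var/log/gitlab/nginx/gitlab_pages_access.log \
  | sort -u | wc -l
```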
Incident Response Analysis
- How was the event detected?
- PagerDuty alerts and a couple of reports from GitLab team members
- How could detection time be improved?
- It was detected fairly quickly. It was also suggested to bump the error ratio SLA for Pages to four nines (gitlab-com/runbooks!2490 (merged)), which is expected to improve detection time from 10 minutes to 3 minutes.
- How did we reach the point where we knew how to mitigate the impact?
- Through collective efforts, Infra and Distribution team members identified a potential root cause and reached an agreement on how to mitigate the problem.
- How could time to mitigation be improved?
- Mitigation was fairly quick once we settled on the root cause. It took gitlab-pages around 5 minutes to serve successful responses again, but that is a known behavior (slow start) of gitlab-pages (see the sketch after this list).
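A short sketch of how recovery can be watched on a Pages node while gitlab-pages warms up; the Pages domain used for the spot check is a placeholder.

```shell
# Check that the gitlab-pages service is running and follow its logs
sudo gitlab-ctl status gitlab-pages
sudo gitlab-ctl tail gitlab-pages

# Spot-check a Pages site once warm-up completes (domain is a placeholder)
curl -s -o /dev/null -w '%{http_code}\n' https://example-group.gitlab.io/
```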
Post Incident Analysis
- How was the root cause diagnosed?
- It was noticed that gitlab-pages was restarted (in addition to SSL errors present in the logs), which prompted (among other actions) a look into the most recent omnibus reconfigure output. There were some changes related to SSL certificates, so engineers from the Distribution team were asked to clarify the change and whether it was related to the incident. It was, and so it was decided to roll back to the previous version of omnibus (see the TLS check sketch after this list).
- How could time to diagnosis be improved?
- The SSL error in gitlab-pages didn't make it clear whether it was the result of attempting to hit the GitLab.com API or of fetching a certificate from the API. It could use some re-wording.
- There was a little confusion about who owns the development of gitlab-pages and whom to page for help, but this was sorted out quickly. An outline of who owns what would be helpful.
- There was uncertainty about which change(s) caused the problem (was it Rails, gitlab-pages, omnibus, ...). It's usually quick to find out what Rails changes are included in a certain deployment (they're usually included in the Slack message about the deployment), but not so much for other components (e.g. omnibus), which requires manual digging. A list of all component changes for a certain deployment would be very beneficial and wouldn't be hard to implement/automate.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- N/A
- Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)?
- A deployment of v13.2.202007070840+2ee23c9f2a0.6fd870136b6
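To make the diagnosis path above more concrete, here is a hedged sketch of checking TLS connectivity from a Pages node to the GitLab.com API using the CA bundle omnibus typically ships; the certificate path is the usual omnibus location, and the endpoint is only used to exercise the handshake (an authentication error after a successful handshake is fine for this check).

```shell
# Verify the TLS handshake to the GitLab.com API using the omnibus CA bundle.
# A certificate verification failure here points at the API communication path.
curl -v -o /dev/null \
  --cacert /opt/gitlab/embedded/ssl/certs/cacert.pem \
  https://gitlab.com/api/v4/version
```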
5 Whys
- Users were unable to load Pages sites, why?
- A new omnibus release was deployed that changed the handling of SSL certificates for pages which didn't work as expected.
- Why wasn't this issue caught in staging?
- No alerts were triggered for Pages in staging, nor was it caught by QA.
- Why were no alerts triggered in staging?
- At least one alert would've been triggered, GitLabPagesStagingFeIpPossibleChange, but it wasn't. The reason is ~~still TBD~~ we had a global silence on alerts carrying the env=gstg label, which is now removed (see the sketch after this list).
- Why wasn't it caught by QA?
- Apparently Pages isn't covered by the QA test suite.
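A small sketch of how a silence like this could have been spotted, assuming amtool access to the relevant Alertmanager; the Alertmanager URL is a placeholder.

```shell
# List active silences that match the staging environment label
amtool silence query --alertmanager.url=https://alertmanager.example.com env="gstg"
```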
Lessons Learned
- QA should include tests for the integrity of the pages service.
- Perhaps we should have a canary node for web-pages.
- An accessible, compiled list of all commits included in a certain deploy, for all components inside and including omnibus, would be beneficial for quickly ruling out potential suspects (a per-component sketch is shown below).
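As an illustration of what that could look like for a single component, a minimal sketch assuming a local clone of the omnibus-gitlab repository and two release tags; the tag names are placeholders.

```shell
# List commits that landed in omnibus-gitlab between two release tags
# (tag names are placeholders)
git -C omnibus-gitlab fetch --tags
git -C omnibus-gitlab log --oneline 13.2.0+rc41.ee.0..13.2.0+rc42.ee.0
```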
Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10771
- gitlab-com/runbooks!2490 (merged)
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10792
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10794