2020-07-07: GitLab Pages increased error rates
Summary
A recent version of omnibus introduced a change in SSL certificate handling for Pages that did not work as expected. It left Pages unable to communicate with the GitLab.com API, eventually resulting in 502 errors for Pages users.
Timeline
All times UTC.
2020-07-07
- 14:21 - Error ratios start rising for web-pages service
- 14:30 - jarv declares incident in Slack using the /incident declare command
- 15:04 - Root cause is diagnosed
- 15:10 - Chef-client is stopped on the nodes, omnibus is rolled back, and gitlab-ctl reconfigure is run (see the sketch below)
- 15:39 - web-pages service error ratio returns to normal
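For reference, a minimal sketch of the rollback performed at 15:10, assuming an apt-managed omnibus installation; the package version shown is a placeholder, not the exact version rolled back to.

```shell
# Stop chef-client so it doesn't re-converge the node onto the newer omnibus version
sudo systemctl stop chef-client

# Downgrade the omnibus package to the previous known-good version
# (the version string here is a placeholder)
sudo apt-get install --allow-downgrades gitlab-ee=13.1.4-ee.0

# Re-run configuration so gitlab-pages picks up the rolled-back settings
sudo gitlab-ctl reconfigure
```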
Incident Review
Summary
A recent version of omnibus introduced a change in SSL certificate handling for Pages that did not work as expected. It left Pages unable to communicate with the GitLab.com API, eventually resulting in 502 errors for Pages users.
- Service(s) affected: Pages
- Team attribution: Infra
- Minutes downtime or degradation: 72 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- External customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- First they would get a certificate error from their browser, then a 502 error if they chose to proceed
- How many customers were affected?
- 224 (number of unique IPs that received an error response; see the sketch after this list)
- If a precise customer impact number is unknown, what is the estimated potential impact?
- N/A
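For context, a rough sketch of how a unique-IP count like the one above can be derived from the Pages NGINX access logs; the log path and field positions are assumptions based on the default omnibus layout and combined log format.

```shell
# Count unique client IPs that received 5xx responses from Pages.
# Assumes combined log format: $1 = client IP, $9 = HTTP status.
sudo awk '$9 ~ /^5/ {print $1}' /var/log/gitlab/nginx/gitlab_pages_access.log \
  | sort -u | wc -l
```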
Incident Response Analysis
- How was the event detected?
- PagerDuty alerts and a couple of reports from GitLab team members
- How could detection time be improved?
- It was detected fairly quickly. It was also suggested to bump the error ratio SLA for Pages to four nines (gitlab-com/runbooks!2490 (merged)), which is expected to improve detection time from 10 minutes to 3 minutes.
- How did we reach the point where we knew how to mitigate the impact?
- Through collective efforts, Infra and Distribution team members identified a potential root cause and reached an agreement on how to mitigate the problem.
- How could time to mitigation be improved?
- Mitigation was fairly quick once we settled on the root cause. It took gitlab-pages around 5 minutes to serve successful responses again, but that is a known behavior (slow start) of gitlab-pages (see the sketch after this list).
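A short sketch of how recovery can be watched on a Pages node while gitlab-pages warms up; the Pages domain used for the spot check is a placeholder.

```shell
# Check that the gitlab-pages service is running and follow its logs
sudo gitlab-ctl status gitlab-pages
sudo gitlab-ctl tail gitlab-pages

# Spot-check a Pages site once warm-up completes (domain is a placeholder)
curl -s -o /dev/null -w '%{http_code}\n' https://example-group.gitlab.io/
```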
Post Incident Analysis
- How was the root cause diagnosed?
- It was noticed that gitlab-pages was restarted (in addition to SSL errors present in the logs), which prompted (among other actions) a look into the most recent omnibus reconfigure output. There were some changes related to SSL certificates, so engineers from the Distribution team were asked to clarify the change and whether it was related to the incident. It was, and so it was decided to roll back to the previous version of omnibus (see the TLS check sketch after this list).
- How could time to diagnosis be improved?
- The SSL error in gitlab-pages didn't make it clear whether it was the result of attempting to hit the GitLab.com API or of fetching a certificate from the API. It could use some re-wording.
- There was a little confusion about who owns the development of gitlab-pages and whom to page for help, but this was sorted out quickly. An outline of who owns what would be helpful.
- There was uncertainty about which change(s) caused the problem (was it Rails, gitlab-pages, omnibus, ...). It's usually quick to find out what Rails changes are included in a certain deployment (they're usually included in the Slack message about the deployment), but not so much for other components (e.g. omnibus), which requires manual digging. A list of all component changes for a certain deployment would be very beneficial and wouldn't be hard to implement/automate.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- N/A
- Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)?
- A deployment of v13.2.202007070840+2ee23c9f2a0.6fd870136b6
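To make the diagnosis path above more concrete, here is a hedged sketch of checking TLS connectivity from a Pages node to the GitLab.com API using the CA bundle omnibus typically ships; the certificate path is the usual omnibus location, and the endpoint is only used to exercise the handshake (an authentication error after a successful handshake is fine for this check).

```shell
# Verify the TLS handshake to the GitLab.com API using the omnibus CA bundle.
# A certificate verification failure here points at the API communication path.
curl -v -o /dev/null \
  --cacert /opt/gitlab/embedded/ssl/certs/cacert.pem \
  https://gitlab.com/api/v4/version
```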
5 Whys
- Users were unable to load Pages sites, why?
- A new omnibus release was deployed that changed the handling of SSL certificates for pages which didn't work as expected.
- Why wasn't this issue caught in staging?
- No alerts were triggered for Pages in staging, nor was it caught by QA.
- Why were no alerts triggered in staging?
- At least one alert would've been triggered, GitLabPagesStagingFeIpPossibleChange, but it wasn't. The reason is ~~still TBD~~ we had a global silence on alerts carrying the env=gstg label, which is now removed (see the sketch after this list).
- Why wasn't it caught by QA?
- Apparently Pages isn't covered by the QA test suite.
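A small sketch of how a silence like this could have been spotted, assuming amtool access to the relevant Alertmanager; the Alertmanager URL is a placeholder.

```shell
# List active silences that match the staging environment label
amtool silence query --alertmanager.url=https://alertmanager.example.com env="gstg"
```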
Lessons Learned
- QA should include tests for the integrity of the pages service.
- Perhaps we should have a canary node for web-pages.
- An accessible, compiled list of all commits included in a certain deploy, for all components inside and including omnibus, would be beneficial for quickly ruling out potential suspects (a per-component sketch is shown below).
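As an illustration of what that could look like for a single component, a minimal sketch assuming a local clone of the omnibus-gitlab repository and two release tags; the tag names are placeholders.

```shell
# List commits that landed in omnibus-gitlab between two release tags
# (tag names are placeholders)
git -C omnibus-gitlab fetch --tags
git -C omnibus-gitlab log --oneline 13.2.0+rc41.ee.0..13.2.0+rc42.ee.0
```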
Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10771
- gitlab-com/runbooks!2490 (merged)
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10792
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10794