2021-07-24: Pages Degraded due to High Traffic
Current Status
GitLab Pages sites were degraded due to a significant increase in requests to one site. All sites are now functioning normally.
Summary for CMOC notice / Exec summary:
- Customer Impact: Service::Pages
- Customer Impact Duration: 21:38 UTC - 21:53 UTC (15 minutes)
- Current state: Incident::Resolved
- Root cause: RootCause::Saturation
Timeline
View recent production deployment and configuration events / gcp events (internal only)
All times UTC.
2021-07-24
- 21:42 - @cmcfarland declares incident in Slack.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.
These two issues would help; they are not corrective actions but entirely new features:
- https://gitlab.com/gitlab-org/gitlab/-/issues/224504 - Consider adding usage limits for GitLab Pages
- gitlab-org/gitlab#287700 - Enable Pages to deploy to a CDN
- &273 (closed) - GitLab Pages traffic on Kubernetes
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9810 - Move Pages behind Cloudflare
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13849 - Short term fleet growth to handle more traffic
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected : GitLab Pages
- Team attribution : group::release
- Minutes downtime or degradation : 21:38 UTC - 21:53 UTC (15 minutes)
Impact & Metrics
Question | Answer |
---|---|
What was the impact | Degraded service to websites hosted on GitLab Pages |
Who was impacted | 213 unique hostnames received a 500 error |
How did this impact customers | Some users received 500 errors while browsing some websites |
How many attempts were made to access | Single attempt to load 36 assets from the same website |
How many customers affected | ~250 unique IP addresses |
How many customers tried to access | ~128,000 unique IP addresses |
Hostnames:
Remote IP addresses:
Detection & Response
Question | Answer |
---|---|
When was the incident detected? | 2021-07-24 21:38 UTC |
How was the incident detected? | AlertManager on Slack |
Did alarming work as expected? | Yes? |
How long did it take from the start of the incident to its detection? | ~4 minutes |
How long did it take from detection to remediation? | Self-healed once the burst in traffic ended |
What steps were taken to remediate? | None |
Were there any issues with the response? | N/A |
Timeline
2021-07-24
- 21:38 UTC - high traffic for a single website hosted on Pages
- 21:40 UTC - alert reported on Slack
- 21:42 UTC - incident reported
- 21:53 UTC - traffic stabilized and error rate dropped.
Root Cause Analysis
GitLab Pages returned 500 errors for some websites during a 15-minute period, after a significant increase in requests to a single site saturated the service.
What went well
- Monitoring detected the problem and the SRE quickly declared the incident.
- The problem quickly resolved itself.
What can be improved
- Using the root cause analysis, explain what can be improved to prevent this from happening again. - Put Pages behind a CDN / Add some sort of rate limiting.
- Is there anything that could have been done to improve the detection or time to detection? - N/A
- Is there anything that could have been done to improve the response or time to response? - N/A
- Is there an existing issue that would have either prevented this incident or reduced the impact? - Yes:
  - https://gitlab.com/gitlab-org/gitlab/-/issues/224504 - Consider adding usage limits for GitLab Pages
  - gitlab-org/gitlab#287700 - Enable Pages to deploy to a CDN
- Did we have any indication or beforehand knowledge that this incident might take place? - Yes, see related issues in gitlab-org/gitlab#287700
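
To make the "some sort of rate limiting" suggestion above more concrete, here is a minimal sketch of one possible approach: a token bucket kept per hostname, so that a burst aimed at one site is throttled without using up the budget of other sites. This is an illustration only, not how GitLab Pages implements or plans to implement limits. The class names, the choice to key on hostname rather than client IP, and the `rate` and `burst` parameters are all assumptions made for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""
    rate: float
    capacity: float
    tokens: float
    updated: float

    def allow(self, now: float) -> bool:
        # Refill in proportion to the elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        # Each admitted request costs one token.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerHostLimiter:
    """Hypothetical limiter: one independent bucket per hostname."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate      # sustained requests per second per host (assumed knob)
        self.burst = burst    # maximum burst size per host (assumed knob)
        self.clock = clock    # injectable so the behaviour can be tested
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, host: str) -> bool:
        now = self.clock()
        bucket = self.buckets.get(host)
        if bucket is None:
            # New hosts start with a full bucket.
            bucket = TokenBucket(self.rate, self.burst, self.burst, now)
            self.buckets[host] = bucket
        return bucket.allow(now)
```

A caller would invoke `limiter.allow(hostname)` for each incoming request and reject the request (for example with a 429 response) when it returns `False`. The sketch never evicts idle buckets, so a real deployment would also need to bound the size of the map.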
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)