2019-10-23: Short outage of gitlab.com
Summary
We had a short outage of gitlab.com, which recovered on its own. We are still investigating the root cause.
More information will be added as we investigate the issue.
RCA issue for further analysis: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8247
Configuration overview
- primary redis: redis-01-db-gprd.c.gitlab-production.internal
- primary postgres: patroni-02-db-gprd.c.gitlab-production.internal
Timeline
All times UTC.
2019-10-23
- 11:28 - Pingdom alerts start (https://gitlab.pagerduty.com/incidents/P0P8KMK?utm_source=slack&utm_campaign=channel)
- 11:32 - high Rails error rate alert (https://gitlab.pagerduty.com/incidents/PJP6CKB?utm_source=slack&utm_campaign=channel)
- 11:35 - alerts resolve
- 11:55 - site is down again
- 12:03 - site is up again
- 12:06 - we turn off save on redis-01 to reduce memory pressure leading to OOM
- 12:25 - blocking endpoint abused by spammers (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2027/diffs)
- 13:04 - /chatops run feature delete ci_enable_live_trace to reduce memory usage
- 13:23 - bgsave successfully finished without OOM
- 13:43 - running the clear_anonymous_sessions runbook (https://ops.gitlab.net/gitlab-com/runbooks/blob/master/howto/clear_anonymous_sessions.md) to clear out anonymous sessions
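
The Redis steps in the timeline (turning off save at 12:06 and the bgsave that finished at 13:23) might look roughly like the following in Python. This is a minimal sketch, not the commands that were actually run during the incident. It assumes a redis-py style client exposing config_get, config_set, bgsave and info; the helper names disable_periodic_snapshots and bgsave_and_wait are hypothetical.

```python
import time


def disable_periodic_snapshots(client):
    """Turn off automatic RDB snapshots and return the previous 'save' value.

    Assumption: `client` behaves like redis.Redis from redis-py.
    """
    previous = client.config_get("save").get("save", "")
    # An empty 'save' value removes all automatic snapshot triggers,
    # so Redis no longer forks on its own to write an RDB file.
    client.config_set("save", "")
    return previous


def bgsave_and_wait(client, poll_seconds=1.0, timeout_seconds=3600.0):
    """Start a manual BGSAVE and poll until it ends.

    Returns True if Redis reports the last background save as 'ok'.
    """
    client.bgsave()
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        persistence = client.info("persistence")
        if not persistence["rdb_bgsave_in_progress"]:
            return persistence["rdb_last_bgsave_status"] == "ok"
        time.sleep(poll_seconds)
    raise TimeoutError("BGSAVE did not finish within the timeout")
```

With redis-py installed, a client could be created with redis.Redis(host=...) pointing at the primary and passed to both helpers. Keeping the previous save value makes it possible to restore the original snapshot schedule with config_set once memory pressure is under control.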
Edited by Henri Philipps