2019-10-23: Short outage of gitlab.com

Summary

We had a short outage of gitlab.com, which recovered on its own. We are still investigating the root cause.

More information will be added as we investigate the issue.

RCA issue for further analysis: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8247

Configuration overview

  • primary Redis: redis-01-db-gprd.c.gitlab-production.internal
  • primary Postgres: patroni-02-db-gprd.c.gitlab-production.internal

Timeline

All times UTC.

2019-10-23

  • 11:28 - Pingdom alerts start (https://gitlab.pagerduty.com/incidents/P0P8KMK?utm_source=slack&utm_campaign=channel)
  • 11:32 - high Rails error rate alert (https://gitlab.pagerduty.com/incidents/PJP6CKB?utm_source=slack&utm_campaign=channel)
  • 11:35 - alerts resolve
  • 11:55 - site is down again
  • 12:03 - site is up again
  • 12:06 - we turn off save on redis-01 to reduce the memory pressure that was leading to OOM
  • 12:25 - we block an endpoint abused by spammers: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/2027/diffs
  • 13:04 - /chatops run feature delete ci_enable_live_trace to reduce memory usage
  • 13:23 - bgsave finishes successfully without OOM
  • 13:43 - we run https://ops.gitlab.net/gitlab-com/runbooks/blob/master/howto/clear_anonymous_sessions.md to clear anonymous sessions
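
As a rough check on how short the outage was, the two down/up pairs in the timeline above can be turned into durations. This is only a sketch: it assumes the first window runs from the first Pingdom alert (11:28) to the alerts resolving (11:35), and the second from "site is down again" (11:55) to "site is up again" (12:03). Alert timestamps only approximate the real user-facing downtime, and the names `at`, `OUTAGE_WINDOWS`, and `window_minutes` are made up for this illustration.

```python
from datetime import datetime

DAY = "2019-10-23"  # all timeline entries are on this day, in UTC


def at(hhmm: str) -> datetime:
    """Parse an HH:MM timeline entry on the incident day."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


# Assumed outage windows, read off the timeline (start, end).
OUTAGE_WINDOWS = [
    (at("11:28"), at("11:35")),  # first alerts until alerts resolve
    (at("11:55"), at("12:03")),  # site down again until site up again
]


def window_minutes(windows):
    """Return the length of each (start, end) window in whole minutes."""
    return [int((end - start).total_seconds() // 60) for start, end in windows]


durations = window_minutes(OUTAGE_WINDOWS)
total = sum(durations)
print(durations, total)  # [7, 8] 15
```

Under these assumptions the two windows come to about 7 and 8 minutes, roughly 15 minutes in total. The true impact may differ once the RCA issue establishes exact start and end times.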
Edited Oct 23, 2019 by Henri Philipps