2020-10-24: Increased backend errors - GitLab.com Down
Summary
Current State: GitLab.com is up.
Database load spiked, which dramatically increased error rates and degraded response times. Automatic re-indexing had been enabled over the weekend, and re-indexing a functional index on the routes table left that index without statistics, which resulted in much less efficient queries against the table. A manual re-analyze of the table fixed the issue, and automatic re-indexing has been disabled.
Outage from 2020-10-24 09:16 to 10:25 UTC.
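For context, the re-indexing approach rebuilds the index concurrently and then swaps names, the step the 09:14 timeline entry refers to. The sketch below is a minimal illustration of that pattern and the `ANALYZE` step it was missing; the exact index definition (`LOWER(path)` on `routes`) and statement sequence are assumptions, not the cron job's actual implementation:

```sql
-- Minimal sketch of the build-and-swap re-indexing pattern (assumed index
-- definition and statement order).
-- 1. Build a replacement index without blocking writes.
CREATE INDEX CONCURRENTLY index_on_routes_lower_path_new
  ON routes (LOWER(path));

-- 2. Drop the old index and take over its name.
DROP INDEX CONCURRENTLY index_on_routes_lower_path;
ALTER INDEX index_on_routes_lower_path_new
  RENAME TO index_on_routes_lower_path;

-- 3. For an expression ("functional") index, planner statistics are gathered
--    on the index itself, so the freshly built index has none until the table
--    is analyzed. Skipping this step is what degraded the query plans.
ANALYZE routes;
```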
Timeline
All times UTC.
2020-10-24
- 09:12 - starting to re-index `index_on_routes_lower_path`
- 09:14 - swapped the old and new index by name and deleted the old index. This left the statistics for the index missing and made queries slow.
- 09:15 - starting to get alerts and pages all over the place
- 09:19 - hphilipps declares incident in Slack using the `/incident declare` command, paging IMOC and CMOC.
- 09:32 - Escalated to IMOC again.
- 09:32 - graphs https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1
- 09:46 - looking at https://log.gprd.gitlab.net/goto/50cd8288ecda46c79ddf7b605795cbeb - looks like a hug of death, not an attack: a user is deleting a large number of tags
- 09:51 - started by trying to block the particular URL deleting the tags, but now blocking the entire identified network that is sending the troublesome requests. (This later turned out to be unrelated.)
- 10:05 - DB load is low, but pg_bouncer_sync_replica_pool saturation is 100%
  - Looking into pg_bouncer: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1
  - jwtController on the main stage has a lot of timeouts - looking at blocking it (this will affect the registry)
- 10:19 - the block on jwt_auth appears to have made a difference.
  - Things in pg_bouncer/patroni appear to be recovering.
- 10:21 - removing the initial project and IP blocks as they did not appear to make a difference. Leaving the block on jwt_auth in place.
- 10:54 - suspecting automatic re-indexing to have caused the issue, disabling the feature flag: `/chatops run feature set database_reindexing false`
- 11:08 - found that re-collecting statistics on the routes table (a manual `ANALYZE`) changed the query plan - see the sketch after this timeline
- 11:12 - removed the temporary blocks of request paths that were related to the issues.
- 11:20 - all traffic is operating normally
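To make the 11:08 finding concrete, a plan check before and after the re-analyze shows the difference. The query below is only illustrative (the actual slow query was identified by OnGres and is not reproduced here), and it assumes the index covers `LOWER(path)`:

```sql
-- Illustrative plan check: before the manual ANALYZE the planner had no
-- statistics for the lower(path) expression and chose a much less efficient
-- plan; afterwards it went back to using index_on_routes_lower_path.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM routes
WHERE LOWER(path) = LOWER('gitlab-org/gitlab');
```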
Incident Review
Summary
An automatic DB re-indexing cron job, which had been enabled that weekend, led to missing table statistics for an important index, which made all queries related to that index very inefficient and slowed down the whole database. This caused an apdex drop to around 50% for most services between 09:16 and 10:25 UTC (69 minutes), and for the container registry service until 11:20 (124 minutes).
- Service(s) affected: All services depending on the DB
- Team attribution: @gitlab-org/database-team
- Minutes downtime or degradation: 69 minutes (124 minutes for container registry)
Metrics
For Container Registry: (graphs omitted)
Customer Impact
- Who was impacted by this incident?
- All users
- What was the customer experience during the incident?
- Very slow responses or 500 errors
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
- API requests dropped to 50% (Grafana), web requests dropped by 75% (Grafana), CI job requests dropped by 50%, and job durations increased significantly.
Incident Response Analysis
- How was the event detected?
- The EOC got paged for increased backend error rates and a failing Pingdom check.
- How could detection time be improved?
- Detection worked well.
- How did we reach the point where we knew how to mitigate the impact?
- After checking CloudFlare traffic and whether we had had a DB failover, we spent time looking for abusive requests and identifying the responsible controllers. The pgbouncer dashboard's controller panel and OnGres' analysis of the dominating slow queries helped us identify and block the `/jwt/auth` endpoint as the main producer of DB load as a first mitigation. As @abrandl had informed the EOC about enabling the re-indexing cron job the prior day (and it was mentioned in the on-call handover), we started to suspect a correlation and looked into it. Checking the re-indexing cron job logs, @ahegyi and @jacobvosmaer-gitlab found that the index `index_on_routes_lower_path` had just been re-indexed when the incident started, and OnGres confirmed that the query plan for the slow query stopped using the index because of missing statistics, which made those queries very inefficient (see the sketch after this list). OnGres finally fixed that by running `ANALYZE` on the routes table, and we disabled the re-index feature flag to prevent further events.
- How could time to mitigation be improved?
- Better documentation on how to page OnGres.
- Improvements and training on how to interpret slow queries and connect them to controllers and endpoints.
- Alert if only a few queries are dominating the DB request times.
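The "missing statistics" diagnosis can also be cross-checked from the database side. This is a sketch of one possible check, not a record of the exact commands OnGres ran: statistics for an expression index are exposed in `pg_stats` under the index's own name, so an empty result immediately after the swap means the planner has no estimates for the expression.

```sql
-- Statistics for an expression ("functional") index are keyed by the index
-- name in pg_stats, not by the table name. No rows here right after the
-- index swap means the planner has no estimates for lower(path).
SELECT tablename, attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'index_on_routes_lower_path';
```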
Post Incident Analysis
- How was the root cause diagnosed?
- See above.
- How could time to diagnosis be improved?
- See above.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- Yes, #2849 (closed)
5 Whys
- GitLab.com was down, why?
- DB requests were very slow.
- Why were DB requests so slow?
- Because a functional index was missing statistics, leading to a bad query plan.
- Why was the index missing statistics?
- Because an automatic re-indexing job re-created the index and swapped the new index with the old one by name, and for functional indices this leads to missing statistics unless `ANALYZE` is run immediately afterwards.
- Why didn't we run `ANALYZE` during the automated re-indexing?
- Because for normal indices (the majority) this isn't needed - it is only needed for functional indices, of which we have very few - and this isn't documented in the Postgres documentation, so it was overlooked by all reviewers (a minimal illustration follows this section).
- Why didn't we catch this in tests?
- The re-indexing job had been running in staging without issues for a few days already, but as indices are selected randomly, there is a high chance it had not yet hit a functional index, and going through all indices would have taken many weeks. We all assumed that if it worked for some indices, it would also work for the rest.
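As a minimal, self-contained illustration of the normal vs. functional index distinction discussed above (hypothetical table, not GitLab's schema): a plain column index relies on the table's column statistics, which survive any rebuild, while an expression index needs statistics gathered on the expression itself, which are empty after a rebuild until the next `ANALYZE`.

```sql
-- Hypothetical demonstration table; not part of the GitLab schema.
CREATE TABLE paths_demo (path text);

-- Plain column index: the planner uses the statistics on paths_demo.path,
-- which are unaffected by rebuilding or swapping the index.
CREATE INDEX paths_demo_path_idx ON paths_demo (path);

-- Expression index: the planner needs statistics on lower(path), which are
-- stored against the index itself and start out empty after (re)creation
-- until the table is analyzed again.
CREATE INDEX paths_demo_lower_path_idx ON paths_demo (LOWER(path));
ANALYZE paths_demo;
```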
Lessons Learned
- Technical: Functional ("expression") indexes require explicitly building statistics, which is a relevant learning for both reindexing gitlab-org/gitlab#272997 (closed) and regular database migrations gitlab-org/gitlab#272992.
Corrective Actions
- Update change management to increase the criticality of SQL scripts (via cron or manually) from C4 to C2: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11723
- Make it easier to find documentation on how to page OnGres: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11770
- Improvements and training on how to interpret slow queries and connect them to controllers and endpoints: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11787
- Have the DB re-index cron job add annotations to Grafana when it has finished a re-index on a table: gitlab-org/gitlab#273198 (closed)
- Alert if a small set of queries dominates Postgres: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11794
- Include statements in the Elastic Postgres logs for better debugging: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11795
- Production-like test environment for DB performance regressions: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11797
- For discussion: when enabling a new higher-risk change (DB or new feature), set up a shadow rotation: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11802
- Remove extra and temporary index in staging to better align environments: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11868