GitLab Week of 6/3
Context
On 2018-06-03, GitLab.com started experiencing increased load. This issue tracks the actions and events for the week.
Summary Post-Mortem Document: https://docs.google.com/document/d/1Zd-SNXEcpVtxT7t0N7Ge9k11kZ3UIbib26um38S4JkQ/edit
Timeline
- 2018-06-03 ~15:00 UTC - News broke of Microsoft acquiring GitHub; posts started trending and load on GitLab.com began to climb.
- 2018-06-03 18:00-24:00 UTC - Scaled up API nodes from 8 to 16 cores and 112GB RAM (to DS14). This brought a noticeable decrease in API node transaction timings, from 3 seconds to around 1 second or less.
- 2018-06-03 24:00 UTC - Sidekiq backlogs in the pages_verification and project_cache queues had grown past 100K jobs. Re-tuned the Sidekiq workers so those queues are processed at higher priority (see the tuning sketch after this timeline).
- 2018-06-04 02:00-06:00 UTC - Repurposed 5 API servers we had been prepping into Sidekiq workers to handle GitHub imports.
- 2018-06-04 AM UTC - Noticed that we seem to be bumping against pgbouncer connection limits. Many other Sidekiq job queues are backing up; investigating the causes.
- 2018-06-04 11:45 UTC - Tested increasing the Sidekiq DB pool size from 150 to 170 to see if it would speed Sidekiq up (a configuration sketch follows the timeline). While running Chef to apply the change, GitLab.com was unavailable for a short period.
- 2018-06-04 12:10 UTC - Reverted the 11:45 changes after they showed no positive impact.
- 2018-06-04 12:46 UTC - Learnings: the 11:45 changes were reverted host by host with time in between. Running a reconfigure on pgbouncer.db.prd.gitlab.com appeared to cause downtime even though pgbouncer itself was not running.
- 2018-06-04 12:53 UTC - Changed the tuning for best-effort workers back from 4 threads to 2.
- 2018-06-04 14:30 UTC - Further tuned how many GitHub importer workers are in use. Decreasing the number of Sidekiq worker threads from 4 to 2 on the new nodes appears to be easing the Postgres connection contention.
- 2018-06-04 14:30 UTC - Note that the Prometheus host keeps dropping in and out; we may need to increase memory on that host.
- 2018-06-04 ~15:30 UTC - Changed the FetchRemote limit to 5 because of a specific set of repository clones.
- 2018-06-04 18:30 UTC - Hotpatched a 10GB repository size limit onto all GitHub imports (a sketch of such a guard follows the timeline).
- 2018-06-04 19:10 UTC - Disabled all Sidekiq workers as a test; GitLab.com was much faster. Something in the Sidekiq workers is contributing to the issues; investigating.
- 2018-06-04 19:15 UTC - The 10GB limit patch has made a big difference in load.
- 2018-06-04 19:17 UTC - Re-enabled all queues; performance degraded again. Continuing the investigation.
- 2018-06-04 20:40 UTC - Tested moving the Sidekiq workers to connect to pgbouncer-02, leaving other web traffic on pgbouncer-01. It quickly became apparent that this was helping. Tweeted the update and current findings. Further investigation of pgbouncer resources will be done, but we are letting the system stabilize for now.
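The queue re-prioritization at 24:00 and the later thread-count changes were both Sidekiq tuning. As a hedged illustration (the exact mechanism and configuration file used are not recorded in this issue), queue weights and per-process concurrency in the Sidekiq 5-era API can be expressed roughly like this; pages_verification and project_cache come from the timeline, while the weights and the default queue are invented for the example:

```ruby
# Illustrative only (Sidekiq 5-era API); not necessarily how the change was
# applied. Queue weights are normally set in sidekiq.yml or via -q flags;
# repeating a queue name in the list raises its relative drain priority.
Sidekiq.configure_server do |config|
  weights = { "pages_verification" => 5, "project_cache" => 5, "default" => 1 }
  config.options[:queues] = weights.flat_map { |queue, weight| [queue] * weight }

  # Fewer threads per worker process means fewer simultaneous DB connections,
  # the lever used on 06-04 to ease Postgres/pgbouncer contention (4 -> 2).
  config.options[:concurrency] = 2
end
```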
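The 11:45 experiment was a one-line connection pool change plus a Chef run. A minimal sketch in Omnibus GitLab terms, assuming the standard `gitlab_rails['db_pool']` attribute was the knob involved:

```ruby
# /etc/gitlab/gitlab.rb (sketch): raise the Rails DB connection pool from
# 150 to 170, then apply with `gitlab-ctl reconfigure` (the Chef run
# mentioned in the 11:45 entry).
gitlab_rails['db_pool'] = 170
```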
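The 18:30 hotpatch is not reproduced here; the following is a hypothetical sketch of what a 10GB import guard could look like, with all names invented for illustration:

```ruby
# Hypothetical sketch, NOT the actual hotpatch: reject a GitHub import whose
# reported repository size exceeds 10GB before any clone work starts.
class TooLargeToImportError < StandardError; end

MAX_IMPORT_SIZE = 10 * 1024 * 1024 * 1024 # 10GB in bytes

# `repo_size` would come from the repository metadata returned by the
# GitHub API (the exact field is illustrative here).
def check_import_size!(repo_size)
  return if repo_size.nil? || repo_size <= MAX_IMPORT_SIZE

  raise TooLargeToImportError, "repository exceeds the 10GB import limit"
end
```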
Initial Actions Taken
Overall actions from the evening of 6/3 through early 6/5 UTC
- Placed about.gitlab.com behind the Fastly CDN - https://gitlab.com/gitlab-com/infrastructure/issues/4309
- The API workers had been running hot for a while. We added more cores by resizing the machines to significantly larger ones and saw API response timings drop dramatically.
- On 6/4 we discovered that pgbouncer was the bottleneck: the machine has only 1 core, and its load hovered around 1.0. Response timings dropped dramatically as soon as we moved all Sidekiq nodes to connect to a second pgbouncer node, pgbouncer-02 (see the topology sketch after this list).
- Other Sidekiq jobs were blocked because the GitHub importer was holding DB connections in the "idle in transaction" state while generating diffs in the save hook. A patch was made to free the connection by deferring diff generation until later (see the transaction sketch after this list).
- The NFS nodes were loaded by repeated fetches of large repositories (e.g. those with large pack files). We killed off many fetches of these repositories, but we still need to figure out how to solve this properly. The Gitaly team added a FetchRemote rate limit to mitigate the problem (a generic concurrency-limit sketch follows this list).
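The "idle in transaction" problem above is a general pattern: slow work performed while a database transaction is open pins a pooled connection for the whole duration. A minimal before/after sketch of the shape of the fix, with invented model and method names rather than the actual GitLab code:

```ruby
# Sketch only; class and method names are invented, not the actual patch.
class ImporterSaveHook
  # Before: slow diff generation ran inside the save transaction, so the
  # pgbouncer-pooled connection sat in "idle in transaction" whenever the
  # diff work stalled, starving other Sidekiq jobs of connections.
  def save_blocking(merge_request)
    ApplicationRecord.transaction do
      merge_request.save!
      merge_request.create_diffs! # slow work while holding the connection
    end
  end

  # After: commit first, then generate diffs outside the transaction (or
  # defer them to a separate job), releasing the connection promptly.
  def save_then_diff(merge_request)
    ApplicationRecord.transaction { merge_request.save! }
    merge_request.create_diffs!
  end
end
```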
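Splitting Sidekiq onto its own pgbouncer is a topology change rather than a code change: the Sidekiq nodes' database host points at the second pooler while web and API nodes keep the first. A sketch in Omnibus terms, assuming the standard `gitlab_rails['db_host']` attribute:

```ruby
# /etc/gitlab/gitlab.rb on the Sidekiq nodes (sketch): point the Rails DB
# connection at the second pooler, leaving web/API nodes on pgbouncer-01.
# Host names are from the 20:40 timeline entry.
gitlab_rails['db_host'] = 'pgbouncer-02'
```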
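The FetchRemote limit caps how many expensive fetches can run concurrently. Gitaly's actual implementation is not shown here; below is a generic Ruby sketch of the idea, a counting semaphore with the limit of 5 mentioned in the timeline:

```ruby
require "monitor"

# Generic concept sketch, not Gitaly's implementation: a counting semaphore
# that caps concurrent FetchRemote-style operations at a fixed limit.
class ConcurrencyLimiter
  def initialize(limit)
    @limit = limit
    @running = 0
    @lock = Monitor.new
    @cond = @lock.new_cond
  end

  # Runs the block once a slot is free; callers past the limit block here.
  def with_slot
    @lock.synchronize do
      @cond.wait_while { @running >= @limit }
      @running += 1
    end
    begin
      yield
    ensure
      @lock.synchronize do
        @running -= 1
        @cond.signal
      end
    end
  end
end

# Limit of 5 matches the ~15:30 timeline entry; fetch_remote_for is illustrative.
FETCH_LIMITER = ConcurrencyLimiter.new(5)
FETCH_LIMITER.with_slot { fetch_remote_for(repo) }
```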
Incident Analysis
- How was the incident detected?
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered?
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.
Start from the production incident and ask why it happened; once there is an explanation, keep asking why of each answer until you reach five whys.
Five is not a hard rule, but continuing to question helps dig deeper toward the actual root cause. A single why may also produce more than one answer; consider following the different branches.
A root cause can never be a person; the write-up must refer to the system and the context rather than to specific actors.
For example:
At 00:00 UTC something happened that led to downtime
- Why did X cause downtime?
...
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
Corrective actions
- Issue labeled as infrastructure~2132984