GitLab Week of 6/3
Context
On 2018-06-03, GitLab.com started experiencing increased load. This issue tracks the actions and events for the week.
Summary Post-Mortem Document: https://docs.google.com/document/d/1Zd-SNXEcpVtxT7t0N7Ge9k11kZ3UIbib26um38S4JkQ/edit
Timeline
- 2018-06-03 ~15:00 UTC - News broke of Microsoft acquiring GitHub; posts started trending and load on GitLab.com began to climb.
- 2018-06-03 18:00-24:00 UTC - Scaled up API nodes from 8 to 16 cores and 112GB RAM (to DS14). This brought a noticeable decrease in API node transaction timings, from 3 seconds to around 1 second or less.
- 2018-06-03 24:00 UTC - Sidekiq backlogs in the pages_verification and project_cache queues had grown past 100K jobs. Re-tuned the Sidekiq workers so those queues are processed at higher priority (see the tuning sketch after this timeline).
- 2018-06-04 02:00-06:00 UTC - Repurposed 5 API servers we had been prepping into Sidekiq workers to handle GitHub imports.
- 2018-06-04 AM UTC - Noticed that we seem to be bumping against pgbouncer connection limits. Many other Sidekiq job queues are backing up; investigating the causes.
- 2018-06-04 11:45 UTC - Tested increasing the Sidekiq DB pool size from 150 to 170 to see if it would speed Sidekiq up (a configuration sketch follows the timeline). While running Chef to apply the change, GitLab.com was unavailable for a short period.
- 2018-06-04 12:10 UTC - Reverted the 11:45 changes after they showed no positive impact.
- 2018-06-04 12:46 UTC - Learnings: the 11:45 changes were reverted host by host with time in between. Running a reconfigure on pgbouncer.db.prd.gitlab.com appeared to cause downtime even though pgbouncer itself was not running.
- 2018-06-04 12:53 UTC - Changed the tuning for best-effort workers back from 4 threads to 2.
- 2018-06-04 14:30 UTC - Further tuned how many GitHub importer workers are in use. Decreasing the number of Sidekiq worker threads from 4 to 2 on the new nodes appears to be easing the Postgres connection contention.
- 2018-06-04 14:30 UTC - Note that the Prometheus host keeps dropping in and out; we may need to increase memory on that host.
- 2018-06-04 ~15:30 UTC - Changed the FetchRemote limit to 5 because of a specific set of repository clones.
- 2018-06-04 18:30 UTC - Hotpatched a 10GB repository size limit onto all GitHub imports (a sketch of such a guard follows the timeline).
- 2018-06-04 19:10 UTC - Disabled all Sidekiq workers as a test; GitLab.com was much faster. Something in the Sidekiq workers is contributing to the issues; investigating.
- 2018-06-04 19:15 UTC - The 10GB limit patch has made a big difference in load.
- 2018-06-04 19:17 UTC - Re-enabled all queues; performance degraded again. Continuing the investigation.
- 2018-06-04 20:40 UTC - Tested moving the Sidekiq workers to connect to pgbouncer-02, leaving other web traffic on pgbouncer-01. It quickly became apparent that this was helping. Tweeted the update and current findings. Further investigation of pgbouncer resources will be done, but we are letting the system stabilize for now.
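The queue re-prioritization at 24:00 and the later thread-count changes were both Sidekiq tuning. As a hedged illustration (the exact mechanism and configuration file used are not recorded in this issue), queue weights and per-process concurrency in the Sidekiq 5-era API can be expressed roughly like this; pages_verification and project_cache come from the timeline, while the weights and the default queue are invented for the example:

```ruby
# Illustrative only (Sidekiq 5-era API); not necessarily how the change was
# applied. Queue weights are normally set in sidekiq.yml or via -q flags;
# repeating a queue name in the list raises its relative drain priority.
Sidekiq.configure_server do |config|
  weights = { "pages_verification" => 5, "project_cache" => 5, "default" => 1 }
  config.options[:queues] = weights.flat_map { |queue, weight| [queue] * weight }

  # Fewer threads per worker process means fewer simultaneous DB connections,
  # the lever used on 06-04 to ease Postgres/pgbouncer contention (4 -> 2).
  config.options[:concurrency] = 2
end
```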
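The 11:45 experiment was a one-line connection pool change plus a Chef run. A minimal sketch in Omnibus GitLab terms, assuming the standard `gitlab_rails['db_pool']` attribute was the knob involved:

```ruby
# /etc/gitlab/gitlab.rb (sketch): raise the Rails DB connection pool from
# 150 to 170, then apply with `gitlab-ctl reconfigure` (the Chef run
# mentioned in the 11:45 entry).
gitlab_rails['db_pool'] = 170
```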
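The 18:30 hotpatch is not reproduced here; the following is a hypothetical sketch of what a 10GB import guard could look like, with all names invented for illustration:

```ruby
# Hypothetical sketch, NOT the actual hotpatch: reject a GitHub import whose
# reported repository size exceeds 10GB before any clone work starts.
class TooLargeToImportError < StandardError; end

MAX_IMPORT_SIZE = 10 * 1024 * 1024 * 1024 # 10GB in bytes

# `repo_size` would come from the repository metadata returned by the
# GitHub API (the exact field is illustrative here).
def check_import_size!(repo_size)
  return if repo_size.nil? || repo_size <= MAX_IMPORT_SIZE

  raise TooLargeToImportError, "repository exceeds the 10GB import limit"
end
```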
Initial Actions Taken
Overall actions from the evening of 6/3 through early 6/5 UTC
- Placed about.gitlab.com behind the Fastly CDN - https://gitlab.com/gitlab-com/infrastructure/issues/4309
- The API workers had been running hot for a while. We added more cores by resizing the machines to significantly larger ones and saw API response timings drop dramatically.
- On 6/4 we discovered that pgbouncer was the bottleneck: the machine has only 1 core, and its load hovered around 1.0. Response timings dropped dramatically as soon as we moved all Sidekiq nodes to connect to a second pgbouncer node, pgbouncer-02 (see the topology sketch after this list).
- Other Sidekiq jobs were blocked because the GitHub importer was holding DB connections in the "idle in transaction" state while generating diffs in the save hook. A patch was made to free the connection by deferring diff generation until later (see the transaction sketch after this list).
- The NFS nodes were loaded by repeated fetches of large repositories (e.g. those with large pack files). We killed off many fetches of these repositories, but we still need to figure out how to solve this properly. The Gitaly team added a FetchRemote rate limit to mitigate the problem (a generic concurrency-limit sketch follows this list).
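The "idle in transaction" problem above is a general pattern: slow work performed while a database transaction is open pins a pooled connection for the whole duration. A minimal before/after sketch of the shape of the fix, with invented model and method names rather than the actual GitLab code:

```ruby
# Sketch only; class and method names are invented, not the actual patch.
class ImporterSaveHook
  # Before: slow diff generation ran inside the save transaction, so the
  # pgbouncer-pooled connection sat in "idle in transaction" whenever the
  # diff work stalled, starving other Sidekiq jobs of connections.
  def save_blocking(merge_request)
    ApplicationRecord.transaction do
      merge_request.save!
      merge_request.create_diffs! # slow work while holding the connection
    end
  end

  # After: commit first, then generate diffs outside the transaction (or
  # defer them to a separate job), releasing the connection promptly.
  def save_then_diff(merge_request)
    ApplicationRecord.transaction { merge_request.save! }
    merge_request.create_diffs!
  end
end
```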
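Splitting Sidekiq onto its own pgbouncer is a topology change rather than a code change: the Sidekiq nodes' database host points at the second pooler while web and API nodes keep the first. A sketch in Omnibus terms, assuming the standard `gitlab_rails['db_host']` attribute:

```ruby
# /etc/gitlab/gitlab.rb on the Sidekiq nodes (sketch): point the Rails DB
# connection at the second pooler, leaving web/API nodes on pgbouncer-01.
# Host names are from the 20:40 timeline entry.
gitlab_rails['db_host'] = 'pgbouncer-02'
```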
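The FetchRemote limit caps how many expensive fetches can run concurrently. Gitaly's actual implementation is not shown here; below is a generic Ruby sketch of the idea, a counting semaphore with the limit of 5 mentioned in the timeline:

```ruby
require "monitor"

# Generic concept sketch, not Gitaly's implementation: a counting semaphore
# that caps concurrent FetchRemote-style operations at a fixed limit.
class ConcurrencyLimiter
  def initialize(limit)
    @limit = limit
    @running = 0
    @lock = Monitor.new
    @cond = @lock.new_cond
  end

  # Runs the block once a slot is free; callers past the limit block here.
  def with_slot
    @lock.synchronize do
      @cond.wait_while { @running >= @limit }
      @running += 1
    end
    begin
      yield
    ensure
      @lock.synchronize do
        @running -= 1
        @cond.signal
      end
    end
  end
end

# Limit of 5 matches the ~15:30 timeline entry; fetch_remote_for is illustrative.
FETCH_LIMITER = ConcurrencyLimiter.new(5)
FETCH_LIMITER.with_slot { fetch_remote_for(repo) }
```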
Incident Analysis
- How was the incident detected?
- Is there anything that could have been done to improve the time to detection?
- How was the root cause discovered?
- Was this incident triggered by a change?
- Was there an existing issue that would have either prevented this incident or reduced the impact?
Root Cause Analysis
Follow the 5 whys in a blameless manner as the core of the post-mortem.
Start from the production incident and ask why it happened; once there is an explanation, keep asking why of each answer until you reach five whys.
Five is not a hard rule, but continuing to question helps dig deeper toward the actual root cause. A single why may also produce more than one answer; consider following the different branches.
A root cause can never be a person; the write-up must refer to the system and the context rather than to specific actors.
For example:
At 00:00 UTC something happened that led to downtime
- Why did X cause downtime?
...
What went well
- Identify the things that worked well
What can be improved
- Using the root cause analysis, explain what things can be improved.
Corrective actions
- Issue labeled as infrastructure~2132984