Longer-than-usual response times from Rails
Summary
A number of things happened starting at 08:20 UTC which generated a higher-than-usual load on the Sidekiq and Gitaly fleets. This led to an increase in queue sizes and timings, which resulted in Rails responding more slowly than usual.
Service(s) affected : Rails
Team attribution :
Minutes downtime or degradation : 1h45m (08:30 UTC - 10:15 UTC)
Timeline
2019-06-05
- 08:20 UTC - increase in Sidekiq method call count and CPU time; Git timings on Sidekiq nodes many times higher than usual; increased number of SQL queries, with max SQL timings on the order of minutes; slight increase in pull_mirror queues; increase in memory usage on Sidekiq asap nodes
- 08:32 UTC - increase in response times from Rails
- 08:40 UTC - spike in the number of GitHub import jobs
- 08:54 UTC - on-call gets paged about Rails latency; significant increase in CPU load on besteffort Sidekiq nodes
- 09:05 UTC - significant increase in pull_mirror queues
- 09:20 UTC - decrease in method call count and CPU time; decrease in memory usage on Sidekiq asap nodes; spike in the number of GitHub import jobs (1.2k at peak)
- 09:49 UTC - on-call gets paged about increase in pull_mirror queues
- 10:00 UTC - pull_mirror alert clears
- 10:07 UTC - Rails response times alert clears
- 10:15 UTC - all queues are back to normal
Edited by Michal Wasilewski