On-Call Handover 2019-12-22 15:00 UTC
- EOC egress: @mwasilewski-gitlab
- EOC ingress: @alejandro
Summary:
Ongoing alerts/incidents:
Resolved actionable alerts:
- Alertmanager got stuck again at triggering a cloud function: production#1522 (closed). The function was patched (slackline!7 (closed)) and deployed manually to production. The next time it fires, Alertmanager should successfully trigger the function. You can see its logs here: https://console.cloud.google.com/logs/viewer?project=gitlab-infra-automation&minLogLevel=0&expandAll=false&timestamp=2019-12-22T11:06:08.854000000Z&customFacets=&limitCustomFacetWidth=true&dateRangeStart=2019-12-22T09:54:56.376Z&interval=PT1H&resource=cloud_function%2Ffunction_name%2FalertManagerBridge%2Fregion%2Fus-central1&scrollTimestamp=2019-12-22T10:49:02.421736993Z&dateRangeUnbound=forwardInTime
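  If you'd rather tail those logs from a shell than the console, something like this should work; the function name (alertManagerBridge), region (us-central1) and project (gitlab-infra-automation) are taken from the URL above:
  ```shell
  # Read recent log entries for the alertmanager bridge cloud function
  # (function name, region, and project taken from the console URL above).
  gcloud functions logs read alertManagerBridge \
    --region=us-central1 \
    --project=gitlab-infra-automation \
    --limit=50
  ```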
- High CPU usage on sidekiq machines: the only significant CPU activity is on the pullmirror sidekiq machines, and it's not out of the ordinary (looking at the last 4 days it's pretty much the same). The next time the alert fires, we should take a closer look at why it fired. It might be that, in the chaos around the sidekiq errors, I saw an alert from another environment.
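  A quick way to repeat the "last 4 days" comparison next time is to query Prometheus directly. This is only a sketch: the Prometheus URL and the environment/type/fqdn labels are assumptions, not verified against our setup:
  ```shell
  # Sketch: average non-idle CPU per sidekiq host over the last 4 days.
  # The Prometheus URL and the environment/type/fqdn labels are assumptions.
  curl -sG 'https://prometheus.example.gitlab.net/api/v1/query' \
    --data-urlencode 'query=avg by (fqdn) (rate(node_cpu_seconds_total{environment="gprd",type="sidekiq",mode!="idle"}[4d]))' \
    | jq '.data.result[] | {host: .metric.fqdn, cpu: .value[1]}'
  ```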
- High disk usage on sidekiq machines, caused by deployments to cny. /var/log/ and /tmp/ are both <1G. Disk usage went up by 3-4% each time there was a deployment. This is visible in Grafana, as well as in journald on the relevant machines (there's a session open for the takeoff ssh user and ansible running around the time the disk usage increase happened; see the sketch below). Similar behaviour can be observed in historical data.
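  A minimal sketch for confirming this on one of the machines (the time window is illustrative; match it to the spike in the Grafana graph):
  ```shell
  # Look for the takeoff ssh session and ansible activity around the spike
  # (time window is illustrative; adjust it to the Grafana graph).
  sudo journalctl --since "2019-12-22 09:00" --until "2019-12-22 12:00" | grep -iE 'takeoff|ansible'

  # See which directories actually grew (top consumers on the root filesystem).
  sudo du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20
  ```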
  Apt cache is only 2.3G:
  ```
  # du -hs /var/cache/apt/archives/
  2.3G    /var/cache/apt/archives/
  ```
  but there were two files created in the cache today around the time when ansible was executed, both of which are 0.8G in size. 0.8G is 4% of disk space, which is exactly the disk usage increase that was observed.
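  To see which packages were pulled and when (nothing assumed beyond the cache path already shown above):
  ```shell
  # List the newest packages in the apt cache with sizes and timestamps,
  # to match them against the deployment window.
  ls -lht /var/cache/apt/archives/*.deb | head
  ```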
  I cleared the cache on the two sidekiq machines that were alerting with sudo apt-get clean (which released ~9% of disk).
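  For the record, the same cleanup on any other affected machine would look roughly like this (standard commands, nothing environment-specific):
  ```shell
  # Record free space, clear the apt package cache, and confirm what was freed.
  df -h /
  sudo apt-get clean
  df -h /
  ```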
In summary: two deployments to gprd-cny were started today (but failed). These deployments run ansible (which connected using the takeoff ssh user) and pulled two gitlab deb packages. On sidekiq machines, each of those packages used 4% of disk and resulted in alerts firing.
Questions to the delivery team (@jarv do you have a gitlab handle for the delivery team?):
- why was there any ansible activity on prod machines if the deployment was for gprd-cny? (my guess would be that this was accidental and would require further investigation)
- why were the packages left behind after failures? (my guess would be it's probably a matter of handling errors differently, i.e. cleaning up even in case of failed ansible runs)