Gitter MongoDB high CPU usage

Problem

We are getting PagerDuty alerts like this:

mongo-replica-01: loadavg(5min) of 2.3 matches resource limit [loadavg(5min)<2.0]

https://gitter.pagerduty.com/incidents/PEPXZE7

At first these warnings were rare (once every few days), but over the last few weeks they have become more frequent (roughly a dozen a day).

Last Friday there was a hiccup when the site stopped responding for a minute (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10181), which indicates that the load behind these alerts can result in production outages.

Analysis

The CPU usage grew slightly around the 22nd of April, and it seems to have increased even further today:

(light purple is CPU cycles spent on iowait)

Last 3 months

All CPU stats

Screenshot_2020-05-19_at_12.18.41_PM

IOWait

https://app.datadoghq.com/dash/host/52505732?from_ts=1591155757659&to_ts=1591159357659&live=true

Screenshot_2020-05-19_at_12.53.05_PM

IOWait appears to be the main contributor to the increased CPU usage, and therefore to the increased load, over the last month.
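
To cross-check the dashboard directly on the host, the IOWait share can be estimated from two samples of the aggregate `cpu` line in `/proc/stat`. The following is a minimal sketch, assuming a Linux host; the function names and the sample numbers are made up for illustration and are not taken from mongo-replica-01. On Linux, tasks blocked on I/O also count towards the load average, which is why high IOWait can trip the loadavg alert.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def cpu_times(stat_line: str) -> dict:
    """Parse a 'cpu ...' line from /proc/stat into named tick counters."""
    parts = stat_line.split()
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def iowait_share(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two samples."""
    a, b = cpu_times(before), cpu_times(after)
    deltas = {name: b[name] - a[name] for name in FIELDS}
    total = sum(deltas.values())
    return deltas["iowait"] / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical samples; on a real host, read the first line of
    # /proc/stat twice, a few seconds apart.
    sample_1 = "cpu  100 0 50 800 50 0 0 0 0 0"
    sample_2 = "cpu  130 0 70 900 150 0 0 0 0 0"
    print(f"iowait share: {iowait_share(sample_1, sample_2):.0%}")
```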

Today 2020-05-19

Screenshot_2020-05-19_at_12.18.13_PM

Next steps

  • Investigate whether there are deployments that correlate with the increase in CPU usage
  • Profile the DB to find out which queries are causing the increased load
  • Could we send some read queries to the replica, which is using almost no CPU?
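
For the profiling step, one possible approach is to enable the MongoDB profiler for slow operations (for example `db.setProfilingLevel(1, 100)` in the mongo shell, which records operations slower than 100 ms) and then group the documents from `system.profile` by namespace and operation type. The sketch below only does the grouping; it assumes the entries have already been fetched as dictionaries, and the collection names and numbers in the sample are hypothetical. Profiling adds some overhead, so it is probably best enabled only for a short window.

```python
from collections import defaultdict


def summarize_profile(entries):
    """Group profiler entries by (namespace, operation) and rank by total time.

    Each entry is expected to look like a document from system.profile,
    with the fields "ns", "op", "millis" and optionally "docsExamined".
    """
    totals = defaultdict(lambda: {"count": 0, "millis": 0, "docsExamined": 0})
    for entry in entries:
        key = (entry.get("ns", "?"), entry.get("op", "?"))
        bucket = totals[key]
        bucket["count"] += 1
        bucket["millis"] += entry.get("millis", 0)
        bucket["docsExamined"] += entry.get("docsExamined", 0)

    rows = [{"ns": ns, "op": op, **stats} for (ns, op), stats in totals.items()]
    # Most expensive groups first.
    return sorted(rows, key=lambda row: row["millis"], reverse=True)


if __name__ == "__main__":
    # Hypothetical entries; real ones would come from db.system.profile.find().
    sample = [
        {"ns": "gitter.chatmessages", "op": "query", "millis": 420, "docsExamined": 90000},
        {"ns": "gitter.users", "op": "query", "millis": 110, "docsExamined": 10},
        {"ns": "gitter.chatmessages", "op": "query", "millis": 380, "docsExamined": 85000},
    ]
    for row in summarize_profile(sample):
        print(row)
```

A group with a high `docsExamined` total relative to its `count` is often a sign of a missing index, which would be consistent with disk-bound (IOWait-heavy) load.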
Edited by Eric Eastwood