[RCA] osqueryd consuming too many resources in production
Summary
Service(s) affected: N/A
Team attribution: infrastructure/security
Minutes downtime or degradation: N/A
Impact & Metrics
- What was the impact of the incident?
- CPU load and disk I/O went up on most machines, consuming the equivalent of approximately 60 CPU cores across the production fleet.
- Who was impacted by this incident?
- Mostly the security and infrastructure teams, who had to spend time finding and fixing the root cause and preventing further impact.
- How did the incident impact customers? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Since osqueryd's CPU usage was isolated to one core on each server, there was no measurable slowdown of any service, but it may have contributed to slower responses during times of higher load on the site.
- How many attempts were made to access the impacted service/feature?
- N/A
- How many customers were affected?
- N/A
- How many customers tried to access the impacted service/feature?
- N/A
Overall osqueryd CPU consumption by environment (data for the week before is missing, as we switched to Thanos long-term storage last week):
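For reference, one way to chart per-environment osqueryd CPU consumption with PromQL could look like the sketch below; the metric name (process-exporter's `namedprocess_namegroup_cpu_seconds_total`) and the `environment` label are assumptions and may not match the query actually used for the chart above.

```promql
# Sketch only: assumes process-exporter metrics and an "environment" label;
# adjust metric and label names to the actual monitoring setup.
sum by (environment) (
  rate(namedprocess_namegroup_cpu_seconds_total{groupname="osqueryd"}[5m])
)
```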
Detection & Response
- How was the incident detected?
- A few hours after the production deployment, SRE on-call was paged for the root disk filling up on git servers (because of the growing osqueryd data dir). Investigation showed issues with the embedded RocksDB and high CPU usage.
- Did alerting work as expected?
- The existing alerting worked to notify us about systems reaching a critical state, but we did not have specific alerts or monitoring for osqueryd misbehaving. Also, FilesystemFullSoon alerts do not go to PagerDuty, and the disks filled up so fast that we were at risk of them filling completely before SRE on-call took notice (see the alert rule sketch after this list).
- How long did it take from the start of the incident to its detection?
- The production deployment started at 01:00 UTC, filesystem-full alerts came in at 05:49 UTC (in a Slack channel, not via PagerDuty), and SRE on-call started taking action at 12:15 UTC.
- How long did it take from detection to remediation?
- Immediate remediation for the FilesystemFull alerts happened at 11:15 UTC. Trimming down the osqueryd profile took place over several days, and osqueryd was finally stopped in production after 7 days because the RocksDB corruption issues did not stop.
- Were there any issues with the response to the incident? (i.e. bastion host used to access the service was not available, relevant team member wasn't page-able, ...)
- The security team having no insight into osqueryd's metrics or logs slowed down the process of analysing and fixing the issue.
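As a possible way to close the paging gap mentioned above, here is a minimal sketch of a Prometheus rule that predicts imminent filesystem exhaustion; it assumes node_exporter metric names and an Alertmanager route that sends `severity: critical` alerts to PagerDuty, both of which may differ from our actual configuration.

```yaml
# Sketch only: metric names assume node_exporter; the severity label is
# assumed to be routed to PagerDuty by Alertmanager.
groups:
  - name: filesystem.rules
    rules:
      - alert: FilesystemFillingUpFast
        # Predict free space 4 hours ahead from the last hour of samples.
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 4 * 3600) < 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} {{ $labels.mountpoint }} is predicted to fill up within 4 hours"
```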
Timeline
2019-03-21
- 01:00 UTC - starting to deploy osqueryd to production
- 05:49 UTC - FilesystemFullSoon alerts starting for git-* nodes
- 12:20 UTC - SRE On-call stopping osqueryd on git nodes and cleaning up data dirs
- 15:00 UTC - SRE On-call noticing that uptycs cookbook is re-installing and restarting osqueryd on each chef-client run (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6440)
- 17:30 UTC - Security moved git hosts to a new profile that collects less data
- 18:34 UTC - SRE On-call deploying MR to fix uptycs cookbook reinstall issue (https://gitlab.com/gitlab-cookbooks/gitlab-uptycs/merge_requests/9)
2019-03-22
- 07:10 UTC - SRE On-call noticing that not all git nodes got a new profile (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6434#note_153156697)
- 08:21 UTC - security applying the new profile to missing git nodes
- 11:04 UTC - SRE On-call providing list with other hosts having osqueryd issues (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6434#note_153231389)
- 21:01 UTC - new profile without local caching applied to all hosts
2019-03-25
- 11:22 UTC - SRE On-call still seeing hosts with issues, pinging security via Slack
2019-03-28
- 15:50 UTC - process monitoring for osqueryd deployed, showing high CPU usage across the whole fleet (https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6434#note_155174540)
2019-03-29
- 13:15 UTC - osqueryd stopped in production to stop wasting CPU resources until a solution is found.
2019-04-03
- 02:15 UTC - deploying Uptycs 3.2.6.46 (which contains a fix for RocksDB) to staging solves the CPU utilization issues (https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/945)
Root Cause Analysis
osqueryd showed anomalous behavior in production. This was caused by a profile that collected too much data at too high a frequency, combined with a RocksDB-specific bug in Uptycs 3.2.6.40 that led to database corruption and the process spinning at 100% usage of one core.

We collected too many different metrics because it was assumed that the default Uptycs profile would have no negative impact on our hosts and would not do local caching. That assumption was not proven wrong in tests beforehand, presumably because we did not test on hosts with the same workload as production and because we did not have enough visibility into the behavior of osqueryd. We lacked that visibility because we did not implement monitoring for the osqueryd process before going into production, and perhaps also because we did not examine the osqueryd logs closely enough in tests to see what it was collecting.

Solving the issue in production took a while because the security team did not have direct access to the staging and production systems, so they could not observe logs or system resource usage, while the infra team did not have insight into the osqueryd configuration profile and did not have expertise with the product.
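To make the "collecting too much, too often" point concrete, below is a minimal sketch of what a leaner osquery configuration could look like. The query names, intervals, and watchdog values are illustrative assumptions, not the actual Uptycs profile (which is managed centrally by Uptycs), and the exact option names and semantics should be checked against the osquery documentation for the deployed version.

```json
{
  "options": {
    "watchdog_level": 1,
    "watchdog_memory_limit": 250,
    "watchdog_utilization_limit": 20
  },
  "schedule": {
    "listening_ports": {
      "query": "SELECT pid, port, protocol, address FROM listening_ports;",
      "interval": 3600
    },
    "local_users": {
      "query": "SELECT uid, username, shell, directory FROM users;",
      "interval": 86400,
      "snapshot": true
    }
  }
}
```

The watchdog options constrain the worker's memory and CPU before osqueryd restarts it, which is consistent with the behavior we saw where usage stayed confined to one core; the values above are placeholders rather than recommendations.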
What went well
- Our existing monitoring and alerting made us aware of the issues before they could cause bigger issues in production.
- The first response of the infra and security teams was fast, and we worked closely together to find and fix the root cause.
- osqueryd was configured with resource usage limits, so it never exhausted a whole node, but at most one core per node.
What can be improved
- We should do proper testing before bringing a new service into production.
- SRE and security should work more closely together when planning and developing a new service.
- SREs should take a stronger lead on ensuring production readiness for a new service before going live (monitoring, alerting, runbook, ...).
- We should have solid cookbooks with proper tests and ways to enable or disable the service. Initially, the cookbook was always re-installing osqueryd on each chef-run and there was no way to control the service.
- We should only collect the necessary metrics and avoid collecting metrics that are already available through our existing monitoring systems.
- We could have stopped osqueryd in production earlier instead of trying to fix it there and discovering more and more issues in the process.
Corrective actions
- formalize and document the production readiness process: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6523
- add alerts for osqueryd CPU and disk usage (see the sketch below): https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6524
- review the osqueryd profile (which metrics it collects, how often, and which of them we really need): https://gitlab.com/gitlab-com/gl-security/operations/issues/180
- deploy the fixed Uptycs version 3.2.6.46
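As a possible starting point for the alerting corrective action above, here is a minimal sketch of a Prometheus rule for sustained osqueryd CPU usage, building on the process monitoring added on 2019-03-28; the metric name assumes the process-exporter convention (`namedprocess_namegroup_cpu_seconds_total`), and the threshold and duration are placeholders, not agreed values.

```yaml
# Sketch only: assumes osqueryd is tracked by process-exporter under
# groupname="osqueryd"; metric name, threshold, and duration are placeholders.
groups:
  - name: osqueryd.rules
    rules:
      - alert: OsquerydHighCPU
        # Fire when osqueryd sustains more than half a core for 30 minutes.
        expr: sum by (instance) (rate(namedprocess_namegroup_cpu_seconds_total{groupname="osqueryd"}[5m])) > 0.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "osqueryd on {{ $labels.instance }} has been using more than 0.5 cores for 30 minutes"
```

A matching rule for the osqueryd data directory size would need an additional exporter or a textfile-collector script, since node_exporter does not report per-directory disk usage by default.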