osqueryd is leading to performance issues on small instances
- Spun out of the investigation in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14440
- Related to scalability#1551
- https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/27
- Ongoing work: &708
As part of our capacity planning process using Tamland, we have been monitoring with concern the increase in the `node_schedstat_waiting` resource metric over the past few months.
Brian Brazil has a good writeup on what this metric means: https://www.robustperception.io/cpu-scheduling-metrics-from-the-node-exporter
A very brief summary of that article: `node_schedstat_waiting_seconds_total` can be used to spot whether you have more processes wanting to run than CPU time available to handle them.
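As a rough illustration of how we read this metric (a sketch only; the Prometheus URL below is a placeholder, and averaging by `instance` is an assumption about how we would want to slice the data), the per-node waiting rate can be pulled from node_exporter data like so:

```python
# Sketch: pull the average per-CPU "waiting" rate for each node from Prometheus.
# Assumes a reachable Prometheus server scraping node_exporter; the URL is a placeholder.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder

# Rate of seconds that runnable tasks spent waiting for a CPU, averaged across
# each instance's CPUs over the last 5 minutes. Values creeping well above a
# few percent suggest CPU contention.
QUERY = 'avg by (instance) (rate(node_schedstat_waiting_seconds_total[5m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    instance = result["metric"]["instance"]
    value = float(result["value"][1])
    print(f"{instance}: {value:.2%} of CPU time spent with tasks waiting")
```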
In https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14440, @alexander-sosna isolated the problem to the `osqueryd` service running on the `pgbouncer` instances. Shutting down `osqueryd` quickly resulted in `node_schedstat_waiting` dropping back to a normal level.
Next Steps
This issue is to discuss what we should do about `osqueryd` performance.
Possible Solutions
- Scale the instances up: it is worth noting that nearly all of the affected instances (except for the `haproxy` service nodes) are small 1- or 2-core machines. A quick fix would be to scale these instances up to 4-core machines and see if the problem goes away. This could be a relatively simple solution. @jarv proposed that perhaps we set a minimum instance size across our fleet.
- Review the `osqueryd` configuration. Is `osqueryd` logging too much, or configured in some other inefficient way? Some metrics from the `osqueryd` process are unusually high; we see extremely high context switch rates from `osqueryd` on some nodes, for example (a quick way to check this is sketched after this list).
- Something else; ideas are welcome.
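As a quick way to sanity-check the context-switch observation from the second option above (a sketch only, using the standard `/proc/<pid>/status` counters, not any agreed tooling), something like this run on an affected node shows how hard `osqueryd` is switching:

```python
# Sketch: report context-switch counters for any running osqueryd processes,
# read from /proc/<pid>/status (voluntary_ctxt_switches / nonvoluntary_ctxt_switches).
import os

def context_switches(pid: int) -> dict:
    """Return the context-switch counters for a single PID."""
    counters = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
                counters[key] = int(value.strip())
    return counters

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    try:
        with open(f"/proc/{entry}/comm") as comm:
            name = comm.read().strip()
        if name == "osqueryd":
            print(entry, context_switches(int(entry)))
    except FileNotFoundError:
        # Process exited between listing /proc and reading its files; skip it.
        continue
```

These are cumulative counters, so sampling them twice a few seconds apart is what actually gives a rate.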
cc @joe-dub @mlancini @pmartinsgl @jarv @alejandro
Impacted Services
Here are some of the services that have been impacted by this issue, showing the value of the `node_schedstat_waiting` metric over a 6-month period. This seems to tie in with the `osqueryd` rollout.
`pgbouncer`
`pgbouncer` is the worst case, since it had other OS upgrade issues, but prior to this we were already seeing problems.
`frontend` service (`haproxy`)
`registry` service
`monitoring` service
`camoproxy` service
`redis-cache` service
`consul` service as an example of what this metric should look like
I include `consul` as an example of how this metric should look in a healthy system. 2%-5% is probably reasonable.
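For anyone who wants to reproduce these trends, a rough sketch of pulling the same 6-month view from the Prometheus range-query API follows; the URL, the `type="pgbouncer"` label selector, and the step are assumptions about our setup, not confirmed details:

```python
# Sketch: fetch a 6-month trend of the waiting metric for one service's nodes
# via the Prometheus range-query API, then compare the first and last fortnight.
import datetime
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
# The `type` label used to select one service's nodes is an assumption.
QUERY = 'avg by (instance) (rate(node_schedstat_waiting_seconds_total{type="pgbouncer"}[1h]))'

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(days=180)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "6h",  # one sample every 6 hours keeps the series small
    },
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    points = series["values"]
    # 56 samples at a 6h step is roughly two weeks of data.
    first = sum(float(v) for _, v in points[:56]) / max(len(points[:56]), 1)
    last = sum(float(v) for _, v in points[-56:]) / max(len(points[-56:]), 1)
    print(f'{series["metric"]["instance"]}: {first:.2%} -> {last:.2%}')
```

Anything drifting well past the 2%-5% range above is a node worth looking at.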
Selected first iteration solution
In response to this comment, @nnelson writes:
Of the proposed possible solutions, the first option has been selected: to scale up the number of CPU cores available to some service instances and subsequently acquire additional data on the `node_schedstat_waiting` resource metric. Ideally, expanding the number of available CPU cores would eliminate much of the elevated context switching that is suspected to underlie this metric.
Since `pgbouncer` is the worst case, it would be nice to make these nodes the initial subject of this re-provisioning. However, because these systems are responsible for load-balancing large numbers of database client connections from the production `gitlab-rails` web service, there are risks associated with any operation that could even temporarily reduce the available pooled database connections. Such a change plan may involve expanding the existing `pgbouncer` cluster by adding nodes, but with caution not to exceed the maximum number of available connections to the database server.
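To make the connection-count risk concrete, here is a back-of-the-envelope sketch; every number below is a placeholder rather than our actual `pgbouncer` or Postgres settings, and it ignores `reserve_pool_size` and per-database overrides:

```python
# Back-of-the-envelope check: pooled server connections opened by a pgbouncer
# fleet must stay below the Postgres server's max_connections (minus headroom
# for superuser/maintenance sessions). All numbers below are placeholders.
def max_pooled_connections(nodes: int, default_pool_size: int, pools_per_node: int) -> int:
    """Worst-case server connections the fleet can open if every pool fills up."""
    return nodes * default_pool_size * pools_per_node

POSTGRES_MAX_CONNECTIONS = 500   # placeholder
RESERVED_CONNECTIONS = 20        # placeholder: superuser/maintenance headroom
DEFAULT_POOL_SIZE = 100          # placeholder pgbouncer default_pool_size
POOLS_PER_NODE = 2               # placeholder: distinct (database, user) pairs

for nodes in (2, 3, 4):
    worst_case = max_pooled_connections(nodes, DEFAULT_POOL_SIZE, POOLS_PER_NODE)
    budget = POSTGRES_MAX_CONNECTIONS - RESERVED_CONNECTIONS
    status = "ok" if worst_case <= budget else "EXCEEDS max_connections budget"
    print(f"{nodes} pgbouncer nodes -> up to {worst_case} server connections ({status})")
```

The takeaway is that if nodes are added, the pool sizes would likely need to shrink so the product stays within that budget.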
However, because of the way our Terraform modules are configured to keep all nodes of a given type similar to each other, I have doubts that it will be possible to add a single up-sized instance at a time to the same cluster. It may be necessary to create an entirely new `pgbouncer` cluster with more powerful instances and somehow manage a migration of new connections from the old cluster to the new one. I will have to discuss this with our SMEs in the database squad.
I will also take a look at some of the other systems affected by this issue, to see whether making a different set of service nodes the subject of this re-provisioning would reduce our exposure to risk while still achieving our goal of acquiring adequate data to assess the effectiveness of the selected solution.
I will add more details about the overall change plan approach after conferring with my colleagues, and then assemble the change plan.