osqueryd is leading to performance issues on small instances
- Spun out of the investigation in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14440
- Related to scalability#1551
- https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/27
- Ongoing work: &708
As part of our capacity planning process using Tamland, we have been monitoring with concern the increase in the `node_schedstat_waiting` resource metric over the past few months.
Brian Brazil has a good writeup on what this metric means: https://www.robustperception.io/cpu-scheduling-metrics-from-the-node-exporter
A very brief summary of that article: `node_schedstat_waiting_seconds_total` can be used to spot whether you have more processes wanting to run than CPU time available to handle them.
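As a rough illustration of how we read this metric (a sketch only; the Prometheus URL below is a placeholder, and averaging by `instance` is an assumption about how we would want to slice the data), the per-node waiting rate can be pulled from node_exporter data like so:

```python
# Sketch: pull the average per-CPU "waiting" rate for each node from Prometheus.
# Assumes a reachable Prometheus server scraping node_exporter; the URL is a placeholder.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder

# Rate of seconds that runnable tasks spent waiting for a CPU, averaged across
# each instance's CPUs over the last 5 minutes. Values creeping well above a
# few percent suggest CPU contention.
QUERY = 'avg by (instance) (rate(node_schedstat_waiting_seconds_total[5m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    instance = result["metric"]["instance"]
    value = float(result["value"][1])
    print(f"{instance}: {value:.2%} of CPU time spent with tasks waiting")
```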
In https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14440, @alexander-sosna isolated the problem to the `osqueryd` service running on the `pgbouncer` instances. Shutting down `osqueryd` quickly resulted in `node_schedstat_waiting` dropping back to a normal level.
Next Steps
This issue is to discuss what we should do about `osqueryd` performance.
Possible Solutions
- Scale the instances up: it is worth noting that nearly all of the affected instances (except for the `haproxy` service nodes) are small 1- or 2-core machines. A quick fix would be to scale these instances up to 4-core machines and see if the problem goes away. This could be a relatively simple solution. @jarv proposed that perhaps we set a minimum instance size across our fleet.
- Review the `osqueryd` configuration. Is `osqueryd` logging too much, or configured in some other inefficient way? Some metrics from the `osqueryd` process are unusually high; we see extremely high context switch rates from `osqueryd` on some nodes, for example (a quick way to check this is sketched after this list).
- Something else; ideas are welcome.
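As a quick way to sanity-check the context-switch observation from the second option above (a sketch only, using the standard `/proc/<pid>/status` counters, not any agreed tooling), something like this run on an affected node shows how hard `osqueryd` is switching:

```python
# Sketch: report context-switch counters for any running osqueryd processes,
# read from /proc/<pid>/status (voluntary_ctxt_switches / nonvoluntary_ctxt_switches).
import os

def context_switches(pid: int) -> dict:
    """Return the context-switch counters for a single PID."""
    counters = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
                counters[key] = int(value.strip())
    return counters

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    try:
        with open(f"/proc/{entry}/comm") as comm:
            name = comm.read().strip()
        if name == "osqueryd":
            print(entry, context_switches(int(entry)))
    except FileNotFoundError:
        # Process exited between listing /proc and reading its files; skip it.
        continue
```

These are cumulative counters, so sampling them twice a few seconds apart is what actually gives a rate.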
cc @joe-dub @mlancini @pmartinsgl @jarv @alejandro
Impacted Services
Here are some of the services that have been impacted by this issue, showing the value of the `node_schedstat_waiting` metric over a 6-month period. This seems to tie in with the `osqueryd` rollout.
`pgbouncer`
`pgbouncer` is the worst case, since it had other OS upgrade issues, but prior to this we were already seeing problems.
`frontend` service (`haproxy`)
`registry` service
`monitoring` service
`camoproxy` service
`redis-cache` service
`consul` service as an example of what this metric should look like
I include `consul` as an example of how this metric should look in a healthy system. 2%-5% is probably reasonable.
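For anyone who wants to reproduce these trends, a rough sketch of pulling the same 6-month view from the Prometheus range-query API follows; the URL, the `type="pgbouncer"` label selector, and the step are assumptions about our setup, not confirmed details:

```python
# Sketch: fetch a 6-month trend of the waiting metric for one service's nodes
# via the Prometheus range-query API, then compare the first and last fortnight.
import datetime
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
# The `type` label used to select one service's nodes is an assumption.
QUERY = 'avg by (instance) (rate(node_schedstat_waiting_seconds_total{type="pgbouncer"}[1h]))'

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(days=180)

resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": start.timestamp(),
        "end": end.timestamp(),
        "step": "6h",  # one sample every 6 hours keeps the series small
    },
    timeout=30,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    points = series["values"]
    # 56 samples at a 6h step is roughly two weeks of data.
    first = sum(float(v) for _, v in points[:56]) / max(len(points[:56]), 1)
    last = sum(float(v) for _, v in points[-56:]) / max(len(points[-56:]), 1)
    print(f'{series["metric"]["instance"]}: {first:.2%} -> {last:.2%}')
```

Anything drifting well past the 2%-5% range above is a node worth looking at.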
Selected first iteration solution
In response to this comment, @nnelson writes:
Of the proposed possible solutions, the first option has been selected: to scale up the number of CPU cores available to some service instances and subsequently acquire additional data on the `node_schedstat_waiting` resource metric. Ideally, expanding the number of available CPU cores would eliminate much of the elevated context switching that is suspected to underlie this metric.
Since `pgbouncer` is the worst case, it would be nice to make these nodes the initial subject of this re-provisioning. However, because these systems are responsible for load-balancing large numbers of database client connections from the production `gitlab-rails` web service, there are risks associated with any operation that could even temporarily reduce the available pooled database connections. Such a change plan may involve expanding the existing `pgbouncer` cluster by adding nodes, but with caution not to exceed the maximum number of available connections to the database server.
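To make the connection-count risk concrete, here is a back-of-the-envelope sketch; every number below is a placeholder rather than our actual `pgbouncer` or Postgres settings, and it ignores `reserve_pool_size` and per-database overrides:

```python
# Back-of-the-envelope check: pooled server connections opened by a pgbouncer
# fleet must stay below the Postgres server's max_connections (minus headroom
# for superuser/maintenance sessions). All numbers below are placeholders.
def max_pooled_connections(nodes: int, default_pool_size: int, pools_per_node: int) -> int:
    """Worst-case server connections the fleet can open if every pool fills up."""
    return nodes * default_pool_size * pools_per_node

POSTGRES_MAX_CONNECTIONS = 500   # placeholder
RESERVED_CONNECTIONS = 20        # placeholder: superuser/maintenance headroom
DEFAULT_POOL_SIZE = 100          # placeholder pgbouncer default_pool_size
POOLS_PER_NODE = 2               # placeholder: distinct (database, user) pairs

for nodes in (2, 3, 4):
    worst_case = max_pooled_connections(nodes, DEFAULT_POOL_SIZE, POOLS_PER_NODE)
    budget = POSTGRES_MAX_CONNECTIONS - RESERVED_CONNECTIONS
    status = "ok" if worst_case <= budget else "EXCEEDS max_connections budget"
    print(f"{nodes} pgbouncer nodes -> up to {worst_case} server connections ({status})")
```

The takeaway is that if nodes are added, the pool sizes would likely need to shrink so the product stays within that budget.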
However, because of the way our Terraform modules are configured to keep all nodes of a given type similar to each other, I have doubts that it will be possible to add a single up-sized instance at a time to the same cluster. It may be necessary to create an entirely new `pgbouncer` cluster with more powerful instances and somehow manage a migration of new connections from the old cluster to the new one. I will have to discuss this with our SMEs in the database squad.
I will also take a look at some of the other systems affected by this issue, to see whether making a different set of service nodes the subject of this re-provisioning would reduce our exposure to risk while still achieving our goal of acquiring adequate data to assess the effectiveness of the selected solution.
I will add more details about the overall change plan approach after conferring with my colleagues, and then assemble the change plan.