
Move Rails SLI initialization to boot path

Matthias Käppler requested to merge mk-front-load-rails-slis into master

What does this MR do and why?

This change moves our Rails SLI metrics initialization logic to the boot path.

Previously this was done lazily on the first metrics scrape, out of performance concerns: initialization takes several seconds, since we need to traverse all web and API endpoints.

However, the downside is that it creates strong coupling between the WebExporter and Rails, and we are looking to first extract and then retire the former. It also means we perform a sizeable amount of work after forking into workers, which is less memory efficient, and that this work does not factor into /readiness, which is what Kubernetes uses to determine whether the application is ready to serve requests.

Rails boot performance is a known issue, but in local measurements the added work did not dramatically increase existing boot times, and the infrastructure team confirmed that doing this during application boot is preferable.
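
As a rough illustration of what boot-path initialization looks like (a minimal sketch only; the initializer path and helper names below stand in for the internal SLI API and are not the literal diff):

```ruby
# config/initializers/rails_slis.rb -- sketch only; the real initializer,
# module names, and helpers in this MR may differ.
Rails.application.config.after_initialize do
  # Traversing all Rails and Grape (API) routes is what costs a few seconds;
  # doing it here moves that cost from the first /-/metrics scrape to
  # application boot, in the Puma primary before workers fork.
  request_labels = Gitlab::Metrics::RailsSlis.possible_request_labels # hypothetical helper

  # Register gitlab_sli:rails_request_apdex:{success_total,total} with zero
  # values for every known label set, so all series exist from the first scrape.
  Gitlab::Metrics::Sli.initialize_sli(:rails_request_apdex, request_labels) # illustrative call
end
```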

A pending optimization would be to compile Grape routes only in the api fleet, but Rails currently has no mechanism to know which workload it serves. This optimization would also only apply to SaaS, where we know how to identify the application type.
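
Purely as a sketch of that hypothetical optimization, assuming some way for the process to know its fleet (the GITLAB_FLEET variable and both helpers below are made up; no such mechanism exists today):

```ruby
# Hypothetical only: skip the expensive Grape traversal on nodes that never
# serve API traffic. GITLAB_FLEET and both helpers are made up for illustration.
def possible_request_labels
  labels = rails_route_labels # hypothetical: labels derived from Rails routes

  # Only pay the multi-second Grape route compilation cost on the api fleet.
  labels += grape_route_labels if ENV['GITLAB_FLEET'] == 'api' # hypothetical

  labels
end
```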

Impact

We were only able to measure this on a development machine so far.

CPU

This added between 3 and 4 seconds to application boot on my machine, mostly due to Grape route compilation; more exact numbers can be found here: &7304 (comment 816702386)

We should note that this is unlikely to be representative of production CPU load, since pod specs will differ. However, in #321973 (closed) we added a metric that tracks application boot time, so at least we have a baseline to compare against.

Memory

The memory impact of this change will likely be net positive. The eager initialization happens in the Puma primary, before forking. This is beneficial because if it happened after workers fork from the primary, the memory pages written to could no longer be shared between these processes: due to copy-on-write, the Linux kernel copies each written page to a new memory address, increasing the total number of resident pages needed across all workers.

In &7304 (comment 819515401) I found that each worker used 80MB less USS (pages unique to that process) when initialization happens during on_master_start instead of lazily on the first scrape. This effect is cumulative: the more workers, the more memory we save this way on the node.
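
For reference, per-worker USS can be approximated on Linux by summing each worker's private pages; a rough sketch of that kind of measurement (not the exact script behind the numbers above):

```ruby
#!/usr/bin/env ruby
# Approximate USS per process by summing private (unshared) pages from
# /proc/<pid>/smaps_rollup. Linux-only; pass worker PIDs as arguments,
# e.g.: ruby uss.rb $(pgrep -f 'puma.*worker')

ARGV.each do |pid|
  rollup = File.read("/proc/#{pid}/smaps_rollup")

  # USS ~= Private_Clean + Private_Dirty (pages unique to the process).
  uss_kb = rollup.scan(/^Private_(?:Clean|Dirty):\s+(\d+) kB/).flatten.map(&:to_i).sum

  puts "PID #{pid}: USS #{uss_kb / 1024} MB"
end
```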

Since these pages should never change again over the lifetime of the application, this effect will likely be lasting.

Screenshots or screen recordings

I looked at a local Prometheus instance to verify that metrics still work:

Screenshot_from_2022-02-07_10-34-32

How to set up and validate locally

Boot the application, curl the /-/metrics endpoint (or point a local Prometheus instance at it), and verify that the gitlab_sli:rails_request_apdex:* metrics are present and working.
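
For example, with a short Ruby snippet (assuming GDK serves GitLab on localhost:3000; plain curl piped through grep works just as well):

```ruby
# Fetch the metrics endpoint and print the Rails request apdex series.
# Adjust host/port for your setup; the endpoint may be restricted by the
# monitoring IP allowlist outside of a local GDK.
require 'net/http'

body = Net::HTTP.get(URI('http://localhost:3000/-/metrics'))

puts body.each_line.grep(/\Agitlab_sli:rails_request_apdex:/)
```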

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

