Record and track application boot time (!79139) · Merge requests · GitLab.org / GitLab

Matthias Käppler requested to merge 321973-track-application-boot-time into master Jan 25, 2022

What does this MR do and why?

We are looking to improve Rails startup time. In order to measure progress, we need a baseline for comparison. To that end, this MR adds a mechanism to track the time it takes for the primary Rails process to start up.

To understand what application boot time even means, I summarized below how I approached this.

Implementation

What hook can we use to decide that the app is fully booted?

In a pre-fork server like Puma, the primary process runs through rackup, boots Rails, pre-loads classes, then forks into worker processes. In this case, we are only interested in the boot time for the primary process since forking is comparatively cheap and will not yield useful results.

In single-process environments like Sidekiq, but also the Rails console, for instance, we are largely interested in the point in time when the last initializer ran.

For both we can leverage the on_master_start life-cycle hook. In a single-proc environment, this will run immediately when called, i.e. during the last initializer. In a pre-fork env, it will run when the primary has fully started.

What point in time can we use as t0?

We need to decide what is the "time zero" event from which onward we track the time elapsed until the "boot finished" event occurs. This is not something we can easily do in the app itself. Instead, we can leverage the Linux kernel stats to tell us how much time elapsed since the process started. Linux maintains system and process statistics in procfs; particularly the /proc/uptime and /proc/self/stat files. The former is simply the system uptime in seconds. The latter is trickier; this file contains an array that in turn contains the starttime field for this process:

(22) starttime  %llu
                     The time the process started after system boot.  In kernels before Linux 2.6, this value was expressed in jiffies.  Since Linux 2.6, the value is expressed in clock ticks (divide by sysconf(_SC_CLK_TCK)).

The tricky bit here is that:

This is a timestamp, i.e. a fixed value that does not move every time we look at it.
The unit is clock ticks, not wall time.

So what we need to do is:

Take this value and divide it by the system's clock ticks (which is measured in ticks per second), so we have elapsed_ticks / clock_ticks / second which yields a value measured in seconds.
Subtract it from uptime, which is already in seconds.

I found an implementation of this in the linux_stat gem, but it uses a C extension that does not compile on macOS (procfs does not exist on Darwin), which many GitLab contributors use. This is unfortunate, since there is a lot of utility in this gem.

Where is it exposed?

I expose the boot time in 3 places for maximum visibility:

In the Rails console
In the application logs (structured JSON)
In prometheus via the gitlab_rails_boot_time_seconds gauge

Feature flag

We ended up putting this behind the track_application_boot_time feature flag, which is on by default. The reason being that this code path is only evaluated during startup of the primary process, so any change in the FF state would require a restart, i.e. turning it on is not equivalent to it activating.

Since we have confidence in the change from testing, I decided to make it default ON, so that we can turn it off in case a container crash-loops, unblocking a deployment.

This would never reach users in a broken state, since it would prevent the Rails process from starting.

#351769 (closed)

Test coverage

Because I touched console_message, undercoverage started to complain about a lack of test coverage. So I took the opportunity to split this into two pieces:

An initializer that merely patches IRB as it did before
A new module Gitlab::Console that contains the actual logic

The latter is now covered with tests for CE and EE.

Screenshots or screen recordings

Console boot time message

Note: This will only be displayed on Linux. 🐧

Prometheus metrics

You will notice that the time for workers is always 0; this is by design, since workers reset their metrics after forking, and this metrics wouldn't make a lot of sense in a pre-fork env anyway (see above.)

How to set up and validate locally

NOTE: The current implementation will not work on macOS, since it requires access to procfs and Ruby constants that are only present on supported GitLab platforms (that means Linux.) On macOS, application boot time will be 0.

You can run this in a Docker container instead if you're on Darwin. Just run the Rails console, boot Puma or Sidekiq and inspect the application logs or Prometheus.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Related to #321973 (closed)

Edited Feb 02, 2022 by Matthias Käppler

Record and track application boot time