Instrument GitLab using the Prometheus API to make it easier to track, detect, and manage issues with GitLab.
Once instrumentation is added, it becomes possible to scrape GitLab in order to detect performance and reliability issues, notice changes in performance, debug problems, and spot performance trends across versions or over uptime.
Proposal
Steps to integrate this:
Create a multi-process mode for the existing Ruby Prometheus library.
Update the existing Ruby Prometheus library to current standards. Being tracked in: client_ruby issue 9.
Have a way to set the prometheus_multiproc_dir environment variable within Unicorn.
Document 'prometheus_multiproc_dir' usage.
Link 'prometheus_multiproc_dir' in config.ru and Application Settings in the Rails app.
Metrics endpoint name.
There's some existing InfluxDB instrumentation; make use of that.
Work with gitlab.com production engineering to find the important metrics that are required.
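As a rough illustration of what a multi-process mode has to do: each worker writes its own values into a shared directory (the role `prometheus_multiproc_dir` plays), and the scrape sums across workers. The sketch below is a toy under stated assumptions, not the client_ruby implementation; `FileBackedCounter`, `aggregate_counters`, the JSON file layout, and the metric name are all hypothetical.

```ruby
require "json"
require "tmpdir"

# Hypothetical helper: one JSON file per worker PID inside the shared directory.
# Not safe against concurrent writers within one process; a sketch only.
class FileBackedCounter
  def initialize(dir, name, pid: Process.pid)
    @name = name
    @path = File.join(dir, "counter_#{pid}.json")
  end

  def increment(by = 1)
    values = File.exist?(@path) ? JSON.parse(File.read(@path)) : {}
    values[@name] = values.fetch(@name, 0) + by
    File.write(@path, JSON.generate(values))
  end
end

# Sum every worker's file, as a scrape handler would on each request.
def aggregate_counters(dir)
  totals = Hash.new(0)
  Dir.glob(File.join(dir, "counter_*.json")).each do |file|
    JSON.parse(File.read(file)).each { |name, value| totals[name] += value }
  end
  totals
end

# Simulate two Unicorn workers by passing explicit PIDs.
shared_dir = Dir.mktmpdir("prometheus_multiproc")
FileBackedCounter.new(shared_dir, "http_requests_total", pid: 101).increment
FileBackedCounter.new(shared_dir, "http_requests_total", pid: 102).increment(2)
aggregate_counters(shared_dir).each { |name, value| puts "#{name} #{value}" }
```

Counters aggregate cleanly this way because they are additive; other metric types need their own merge rule.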
    gdk install gitlab_repo=https://gitlab.com/lyda/gitlab-ce
    cd gitlab-development-kit/gitlab
    git checkout instrument-infra
    bundle install --without mysql production --jobs 4
To run the db (needed to run gdk update or gdk run app)
    cd gitlab-development-kit
    gdk run db
To run the app
    cd gitlab-development-kit
    gdk run app
To see what metrics are exposed: curl http://localhost:3000/admin/metrics
To update
Note this will stash your changes and move you back to master.
Also note that the db must be running here.
    cd gitlab-development-kit
    gdk update
    cd gitlab
    git checkout instrument-infra
    bundle install --without mysql production --jobs 4
    cd ..
    gdk run app
Issues
Collector Fail
Can't use the Collector as the summary type isn't supported. Either
we need to support the type or we need to create a multiprocess-friendly
Collector.
[2017-03-23T23:46:28.167148 #80944] ERROR -- : Summary metric type does not have multiprocess support (ArgumentError)
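One reason summaries are harder than counters in a multi-process setup: the sum and the observation count from each process can simply be added together, but per-process quantiles cannot be combined into a correct overall quantile. A minimal sketch of the additive part, assuming a hypothetical `ProcessSummary` record per worker (this is not how client_ruby stores summaries):

```ruby
# Hypothetical per-process summary state: only the additive parts are kept.
ProcessSummary = Struct.new(:total, :observations)

# Merge the per-process states into one summary for the scrape.
# Quantiles are deliberately absent: they are not additive across processes.
def merge_summaries(summaries)
  ProcessSummary.new(
    summaries.sum(&:total),
    summaries.sum(&:observations)
  )
end

merged = merge_summaries([ProcessSummary.new(1.5, 3), ProcessSummary.new(2.5, 1)])
puts "sum=#{merged.total} count=#{merged.observations}"
```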
How to add authentication
The health_check urls have a token parameter to limit access.
Currently /admin/metrics is missing one and I'm still not
clear on how to do that. Other admin routes are defined in
config/routes/admin.rb. It might be that we can't use
Prometheus::Client::Rack::Exporter to set up the endpoint. Or we
need to add an option to it. Or maybe I need to set it up within
the routes file?
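One option, if the exporter stays a Rack app, would be to wrap it in a small middleware that checks the token parameter before passing the request on. The sketch below uses only the plain Rack calling convention; `MetricsTokenGuard`, the parameter name `token`, and the stand-in exporter are assumptions for illustration, not existing GitLab or client_ruby code.

```ruby
require "uri"

# Hypothetical Rack middleware: rejects requests whose `token` query
# parameter does not match the configured token.
class MetricsTokenGuard
  def initialize(app, token:)
    @app = app
    @token = token.to_s
  end

  def call(env)
    params = URI.decode_www_form(env.fetch("QUERY_STRING", "")).to_h
    # A real deployment should compare in constant time
    # (for example with Rack::Utils.secure_compare).
    authorized = !@token.empty? && params["token"] == @token
    return @app.call(env) if authorized

    [401, { "content-type" => "text/plain" }, ["unauthorized\n"]]
  end
end

# Stand-in for the real exporter app.
exporter = ->(_env) { [200, { "content-type" => "text/plain" }, ["up 1\n"]] }
guarded = MetricsTokenGuard.new(exporter, token: "s3cret")

status, = guarded.call({ "QUERY_STRING" => "token=s3cret" })
puts status
```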
Where to put instrumentation
I've currently put metrics into a service (PromService). However,
there's already a lib/gitlab/metrics, which is for other metrics-gathering
systems. I don't think we can create a bunch of registries,
so we need a singleton or something similar to manage that. After
that it doesn't have to be a single class where metrics are defined,
but they do need to talk to that single registry.
Unfortunately I think lib/gitlab is used outside of rails (sidekiq,
etc) so we can't have it depend on a service. Need to pass in a
registry?
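A sketch of the "pass in a registry" idea: one shared default registry lives in a plain module with no Rails dependency, and instrumented classes accept a registry argument that defaults to it. `MetricsRegistry`, `Instrumentation`, and `RepositoryFetcher` are hypothetical names; the registry here is a stand-in, not the Prometheus client's registry class.

```ruby
# Hypothetical stand-in for the Prometheus client registry.
class MetricsRegistry
  def initialize
    @counters = Hash.new(0)
    @lock = Mutex.new
  end

  def increment(name, by = 1)
    @lock.synchronize { @counters[name] += by }
  end

  def value(name)
    @lock.synchronize { @counters[name] }
  end
end

# Holds the single default registry; usable from Rails, Sidekiq, or scripts.
module Instrumentation
  class << self
    attr_writer :registry

    def registry
      @registry ||= MetricsRegistry.new
    end
  end
end

# Metrics can be defined anywhere, as long as they talk to the same registry.
class RepositoryFetcher
  def initialize(registry: Instrumentation.registry)
    @registry = registry
  end

  def fetch
    @registry.increment(:repository_fetches_total)
  end
end

RepositoryFetcher.new.fetch
puts Instrumentation.registry.value(:repository_fetches_total)
```

Passing the registry as an argument also makes it easy to hand in a fresh one in tests.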
@lyda `include RequiresHealthToken` in the controller is enough to make the endpoint protected by the same token that the health checks use.
Also, I recently added a /-/metrics endpoint that currently returns Prometheus text-formatted metrics collected from checks that are performed on each call to that endpoint.
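For reference, the Prometheus text format is a `# HELP` line, a `# TYPE` line, and then one `name{labels} value` line per sample. A minimal rendering sketch follows; `render_metric`, `escape_label_value`, and the metric name are hypothetical and not taken from the endpoint's actual code.

```ruby
# Escape backslash, double quote, and newline inside a label value.
def escape_label_value(value)
  value.to_s.gsub(/[\\"\n]/) { |char| char == "\n" ? "\\n" : "\\#{char}" }
end

# Render one metric family; `samples` maps a label hash to a numeric value.
def render_metric(name, type, help, samples)
  lines = ["# HELP #{name} #{help}", "# TYPE #{name} #{type}"]
  samples.each do |labels, value|
    label_text = labels.map { |key, val| "#{key}=\"#{escape_label_value(val)}\"" }.join(",")
    series = label_text.empty? ? name : "#{name}{#{label_text}}"
    lines << "#{series} #{value}"
  end
  lines.join("\n") + "\n"
end

puts render_metric("db_ping_success", "gauge", "Whether the database check passed",
                   { { check: "db" } => 1 })
```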
@lyda I see you are using a Git-sourced gem. I think it would be best to use a published gem if possible: we don't have any other Git-sourced gems, and I guess it might be difficult to pass review without publishing the gem first.
Will it be possible to publish this gem before this issue is merged?
I've updated the client_ruby lib in the exchangeable-value-types.summary branch. It fixes a few issues. Summaries are supported, counters show labels and a number of other fixes.
Note that quantiles for summaries are broken at the moment, but values are there.
We discussed status in detail on this week's Prometheus team call on Tuesday. Unfortunately we ran into some hurdles with 9.2, and are now targeting 9.3. The goal is to get it integrated with a handful of metrics, to start getting a handle on the performance impact of this module. We are targeting counter and gauge metrics, as they are more straightforward and further along. (We are still working on full support for summaries and histograms.)
@andrewn can you comment on exactly what metrics you need for the legacy code path?
> can you comment on exactly what metrics you need for the legacy code path?
@joshlambert, we'll start with two simple metrics:
1. At each migration site: operation duration for migrated code-paths (running in Gitaly vs git spawn, Rugged, etc)
Recorded as a latency histogram. These metric events would look something like this:
    gitaly_migration{feature="commitdiff", migrated=1, job="...", instance="..."} 100 # The migrated version of the `commitdiff` migration-site took 100ms
    gitaly_migration{feature="commitdiff", migrated=0, job="...", instance="..."} 500 # The non-migrated version of the `commitdiff` migration-site took 500ms
    gitaly_migration{feature="defaultbranchname", migrated=1, job="...", instance="..."} 20 # The migrated version of the `defaultbranchname` migration-site took 20ms
    # etc...
2. Total server response time: probably adapted from an existing Rails middleware. What we're looking for is how long a route is taking on a red (migrated) host, vs. on a black (non-migrated) host.
Recorded as a latency histogram. These metric events would look something like this:
    transaction_timing{feature="Admin::DashboardController#index", job="...", instance="..."} 200 # Admin::DashboardController#index took 200ms
    transaction_timing{feature="Admin::GroupsController#show", job="...", instance="..."} 300 # Admin::GroupsController#show took 300ms
    transaction_timing{feature="Admin::BackgroundJobsController#show", job="...", instance="..."} 330 # Admin::BackgroundJobsController#show took 330ms
    transaction_timing{feature="Admin::DashboardController#index", job="...", instance="..."} 292 # Admin::DashboardController#index took 292ms
    # etc...
Then, using what we know about the instance, we can ascertain whether the metric came from a red (migrated) or black (not migrated) host.
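To make the two metrics above concrete, here is a sketch of recording durations into a latency histogram keyed by labels such as `feature` and `migrated`. `LatencyHistogram` and its bucket boundaries are assumptions for illustration; this is not the client_ruby histogram API.

```ruby
# Hypothetical latency histogram with cumulative buckets, in milliseconds.
class LatencyHistogram
  DEFAULT_BUCKETS_MS = [10, 50, 100, 250, 500, 1000].freeze

  def initialize(buckets = DEFAULT_BUCKETS_MS)
    @buckets = buckets.sort
    @series = Hash.new do |hash, labels|
      hash[labels] = { buckets: Hash.new(0), sum: 0.0, count: 0 }
    end
  end

  # Record one duration; every bucket whose upper bound covers it is bumped.
  def observe(labels, duration_ms)
    series = @series[labels]
    @buckets.each { |upper| series[:buckets][upper] += 1 if duration_ms <= upper }
    series[:sum] += duration_ms
    series[:count] += 1
  end

  # Time a block with the monotonic clock and record the elapsed milliseconds.
  def time(labels)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_millisecond)
    yield
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_millisecond) - started
    observe(labels, elapsed)
  end

  def snapshot(labels)
    @series[labels]
  end
end

gitaly_migration = LatencyHistogram.new
gitaly_migration.observe({ feature: "commitdiff", migrated: 1 }, 100)
gitaly_migration.observe({ feature: "commitdiff", migrated: 0 }, 500)
gitaly_migration.time({ feature: "defaultbranchname", migrated: 1 }) { :work_goes_here }
puts gitaly_migration.snapshot({ feature: "commitdiff", migrated: 1 })[:count]
```

The implicit `+Inf` bucket equals the series count, so it is not stored separately here.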