Skip to content

Rescue from and log errors in metrics exporters

Matthias Käppler requested to merge 349592-log-port-conflicts into master

What does this MR do and why?

In #349592 (closed) we discussed that the removal of rescuing from and logging port allocation conflicts when starting an in-process metrics server in Sidekiq makes it difficult to understand what happened, should it occur.

We eventually decided that the problem is two-fold:

  • The metrics server should not crash sidekiq when coming up
  • Logging is handled in a somewhat confusing fashion

Ultimately, we are looking to completely isolate these failures in &6409 (closed) and &7304 (closed). In this MR I am trying to make some progress in the right direction by:

  1. Rescuing from server start exceptions and logging them to the metrics-server log
  2. Refactor the handling of logs by moving its responsibility into the BaseExporter superclass. Interestingly this reduces SidekiqExporter to merely an initializer, and I therefore removed its spec.

There are still some odd inconsistencies, for "historic reasons". For instance, due to an incident in the past, we added the ability to disable sidekiq exporter logs, and default them to off, but we did not do this for web_exporter (running in Puma), which means we need to jump through a few hoops to streamline logging logic.

Screenshots or screen recordings

Errors during start-up such as port conflicts are properly logged again:

$ docker exec -it gl-gck_sidekiq_1 bash
$ git@04470835775b:~/gitlab$ tail -fn 100 log/sidekiq_exporter.log
...
[2022-01-06T14:02:34.643+0000] INFO  WEBrick 1.6.1
[2022-01-06T14:02:34.643+0000] INFO  ruby 2.7.4 (2021-07-07) [x86_64-linux]
[2022-01-06T14:02:34.643+0000] ERROR Errno::EADDRINUSE: Address already in use - bind(2) for 0.0.0.0:3807
	/usr/local/lib/ruby/2.7.0/socket.rb:201:in `bind'
	/usr/local/lib/ruby/2.7.0/socket.rb:201:in `listen'
	/usr/local/lib/ruby/2.7.0/socket.rb:765:in `block in tcp_server_sockets'
	/usr/local/lib/ruby/2.7.0/socket.rb:227:in `each'
	/usr/local/lib/ruby/2.7.0/socket.rb:227:in `foreach'
	/usr/local/lib/ruby/2.7.0/socket.rb:763:in `tcp_server_sockets'
	/usr/local/lib/ruby/2.7.0/webrick/utils.rb:65:in `create_listeners'
	/usr/local/lib/ruby/2.7.0/webrick/server.rb:127:in `listen'
	/usr/local/lib/ruby/2.7.0/webrick/server.rb:108:in `initialize'
	/usr/local/lib/ruby/2.7.0/webrick/httpserver.rb:47:in `initialize'
	/home/git/gitlab/lib/gitlab/metrics/exporter/base_exporter.rb:38:in `new'
	/home/git/gitlab/lib/gitlab/metrics/exporter/base_exporter.rb:38:in `start_working'
	/home/git/gitlab/lib/gitlab/daemon.rb:42:in `block in start'
	/home/git/gitlab/lib/gitlab/daemon.rb:39:in `synchronize'
	/home/git/gitlab/lib/gitlab/daemon.rb:39:in `start'
	/home/git/gitlab/metrics_server/metrics_server.rb:59:in `start'
	/home/git/gitlab/metrics_server/metrics_server.rb:25:in `spawn'
	bin/metrics-server:12:in `<main>'

Plus, this doesn't crash the host process (sidekiq/puma worker) anymore 😬

How to set up and validate locally

  1. Make something bind to the metrics server port, or find a different way to crash it
  2. Make sure the sidekiq metrics server log is enabled: https://docs.gitlab.com/ee/administration/logs.html#sidekiq_exporterlog-and-web_exporterlog
  3. Run sidekiq
  4. Sidekiq shouldn't crash, but log/sidekiq_exporter.log should exist and contain the stack trace

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #349592 (closed), #347205

Edited by Matthias Käppler

Merge request reports