Add support for distributed tracing to GitLab

What does this MR do?

Requires https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/24186

Initial support for distributed tracing in the GitLab Rails application.

Dependencies

What are the relevant issue numbers?

Documentation for Reviewers

Challenges

  1. The Jaeger client for Ruby is immature: it is good enough for development purposes, but I would not be willing to enable it in production.

    1. For this reason, I've worked to make the architecture fairly pluggable. You can use Jaeger, LightStep, Datadog or (importantly) no trace instrumentation whatsoever.
  2. Threading: The Jaeger client spawns threads. For this reason, it needs to be instantiated in the unicorn workers, and not in the unicorn master. Otherwise, threads created in the master are not carried over into the workers when they fork, and Jaeger silently fails.

  3. UDP: by default, jaeger will send UDP packets to a local agent which will forward those packets on to the jaeger server. Unfortunately these packets get quite large.

    • On macOS, the UDP send will nearly always fail because net.inet.udp.maxdgram is set very small by default.
    • This can be adjusted with sudo sysctl net.inet.udp.maxdgram=65536
    • Unfortunately the Jaeger client does not do overflow control and will attempt to send UDP datagrams greater than 64k.
    • By reducing the flush interval from the default 10s to 5s, this can be mostly overcome, but it's not totally reliable.
    • For all these reasons, I would not recommend the UDP endpoint and would suggest sending over HTTP, directly to Jaeger.
    • Since this is for development purposes, sending directly over HTTP is safe, but in a production environment it would overwhelm Jaeger.
  4. Instrumentation Libraries: several instrumentation libraries exist for various components, including Rack, Rails, Sidekiq, etc.

    • I have tried all of these and found them not to be mature enough.
    • They generally have conflicting dependency versions.
    • I've found them not to be extensible, and they lack some of the information that we require.
    • The rack tracer doesn't play nicely with our requirement of having to instantiate the Jaeger tracer in the client.
    • For these reasons, I've chosen to write my own. They are all trivial to write.
    • This has the added advantage of allowing us to keep consistency between instrumentation - for example, all exceptions are traced in the same way, with the same information. This is not the case if we use third party libraries.
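
The threading problem described in item 2 can be reproduced in plain Ruby, without unicorn or Jaeger. The following is a minimal sketch, not code from this MR: the `reporter` thread is a hypothetical stand-in for the background thread a tracing client might start, and `reporter_alive_in_child?` is an illustrative name. It assumes a platform where `fork` is available (Linux or macOS).

```ruby
# Sketch: a thread started before fork is not running in the forked child.
# This mirrors a tracer created in the unicorn master and then "inherited"
# by a worker. All names here are illustrative, not part of the MR.

def reporter_alive_in_child?
  # Hypothetical stand-in for a tracer's background reporter thread.
  reporter = Thread.new { loop { sleep 0.05 } }
  sleep 0.1 # give the thread time to start

  reader, writer = IO.pipe
  pid = fork do
    reader.close
    # In the child, only the thread that called fork keeps running.
    writer.write(reporter.alive? ? "alive" : "dead")
    writer.close
    exit!(0)
  end

  writer.close
  child_view = reader.read
  reader.close
  Process.wait(pid)

  parent_view = reporter.alive?
  reporter.kill
  { parent: parent_view, child: child_view == "alive" }
end

result = reporter_alive_in_child?
puts "parent sees reporter alive: #{result[:parent]}"
puts "child sees reporter alive:  #{result[:child]}"
```

The parent still sees a live thread while the child does not, which is why the tracer has to be created after the fork, inside each worker.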

How to test

  1. Make sure that your GDK Workhorse and Gitaly are instrumented with the corresponding branches:
    1. Check out this branch (an/dtrace-opentracing-jaeger) in your GDK/gitlab directory
    2. While still in your GDK directory, run the following:
# Make sure you're on the `an/dtrace-opentracing-jaeger` branch of GitLab-CE...
# Then, execute the following commands:
GDK_ROOT=$(gdk help 2>&1 |head -1|cut -d\  -f2 |cut -d\) -f1)
cd $GDK_ROOT
rm -f gitaly/bin/gitaly && make gitaly/.git/pull gitaly/bin/gitaly BUILD_TAGS="tracer_static tracer_static_jaeger" 
rm -f gitlab-workhorse/bin/gitlab-workhorse && make gitlab-workhorse/.git/pull gitlab-workhorse/bin/gitlab-workhorse
# The following steps are then required because GDK doesn't yet use the workhorse make scripts. This will change soon.
cd $GDK_ROOT/gitlab-workhorse/src/gitlab.com/gitlab-org/gitlab-workhorse
make BUILD_TAGS="tracer_static tracer_static_jaeger" 
cp gitlab-workhorse $GDK_ROOT/gitlab-workhorse/bin/gitlab-workhorse
  2. Run Jaeger in Docker: docker run -e COLLECTOR_ZIPKIN_HTTP_PORT=9411 -p 5775:5775/udp -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778 -p 16686:16686 -p 14268:14268 -p 9411:9411 jaegertracing/all-in-one:latest

  3. In another window, configure opentracing with a Jaeger connection: export GITLAB_TRACING="opentracing://jaeger?http_endpoint=http%3A%2F%2Flocalhost%3A14268%2Fapi%2Ftraces&sampler=const&sampler_param=1"

  4. Start GitLab with gdk run

  5. Open http://localhost:3000 and browse around to generate some traces

  6. Open http://localhost:16686/search and search for traces.
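
The GITLAB_TRACING value used above is a URL whose query parameters are percent-encoded, which makes it easy to mistype by hand. The sketch below shows one way to build and inspect such a string with Ruby's standard `uri` library. It is an illustration only: `parse_tracing_url` and `build_tracing_url` are hypothetical helper names, and treating the host part (`jaeger`) as the driver name is an assumption based on the example string, not a description of how this MR parses the variable.

```ruby
require "uri"

# Hypothetical helper: split a connection string into a driver name and options.
# Assumes the form opentracing://<driver>?<key>=<value>&...
def parse_tracing_url(connection_string)
  uri = URI.parse(connection_string)
  raise ArgumentError, "expected an opentracing:// URL" unless uri.scheme == "opentracing"

  # decode_www_form undoes the percent-encoding of each query value.
  options = URI.decode_www_form(uri.query.to_s).to_h
  { driver: uri.host, options: options }
end

# Hypothetical helper: build a connection string, percent-encoding each value.
def build_tracing_url(driver, options)
  "opentracing://#{driver}?#{URI.encode_www_form(options)}"
end

url = build_tracing_url(
  "jaeger",
  "http_endpoint" => "http://localhost:14268/api/traces",
  "sampler" => "const",
  "sampler_param" => "1"
)
puts url
puts parse_tracing_url(url).inspect
```

Building the string this way avoids manually encoding the `http_endpoint` value; the result can then be exported as GITLAB_TRACING before running gdk run.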

Does this MR meet the acceptance criteria?

Edited by Andrew Newdigate
