Production Jaeger tracing instance
Now that GitLab is well instrumented with distributed tracing, we should come up with a plan for rolling it out in production.
Proposal
- Consider adding a Jaeger agent as a "sidecar" for each instrumented GitLab process
- Enable easy configuration of Jaeger via Omnibus (when enabled, the Jaeger agent runs on the host; when disabled, it is turned off): gitlab-org/omnibus-gitlab#4104 (closed)
- Configuration of the GITLAB_TRACING env var for GitLab processes
- Configuration of a Jaeger service endpoint for the agent to forward traces on to
- Let's keep standing up the Jaeger service itself out of scope for now (@joshlambert do you agree?)
- Use GKE and the Jaeger operator: https://github.com/jaegertracing/jaeger-operator
- Use ElasticCloud as the backend
- Set up a Chef recipe for the Jaeger agent
- Roll out Jaeger agent to the web/api/git/gitaly fleet in staging
- Start with a very low sample rate, using head-based sampling, and enable tracing on only a subset of staging
- Monitor, test, validate, ramp up
- Do the same in canary
- Do the same in production
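To make the GITLAB_TRACING configuration and the "very low sample rate" idea concrete, here is a minimal Python sketch that builds a connection string pointing at a local Jaeger agent. It assumes the `opentracing://jaeger?...` connection-string form and the `udp_endpoint`, `sampler` and `sampler_param` parameter names; those, the helper name `build_gitlab_tracing`, and the 0.001 rate are illustrative assumptions, not values decided in this issue. Port 6831 is the Jaeger agent's default UDP port.

```python
from urllib.parse import urlencode


def build_gitlab_tracing(udp_endpoint: str, sample_rate: float) -> str:
    """Build a GITLAB_TRACING value for a local Jaeger agent (hypothetical helper)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    params = {
        # Assumed parameter names; check them against the tracing docs before use.
        "udp_endpoint": udp_endpoint,
        "sampler": "probabilistic",
        "sampler_param": str(sample_rate),
    }
    # urlencode percent-encodes the ':' in the endpoint.
    return "opentracing://jaeger?" + urlencode(params)


# Example: sample roughly 1 in 1000 traces, sending spans to the agent on localhost.
value = build_gitlab_tracing("localhost:6831", 0.001)
print(f"GITLAB_TRACING={value}")
```

Ramping up would then only mean changing the sample rate passed to the helper, which keeps the agent endpoint and the sampler type constant across staging, canary and production.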
Roll out plan
- deploy Jaeger to production
- enable tracing in canary
- measure performance impact on the application from enabling tracing
- measure Jaeger performance
- gradually roll out tracing to the rest of the fleet
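The gradual rollout above relies on head sampling, where the keep/drop decision is made once, when a trace starts. The following Python sketch illustrates the idea under stated assumptions: it compares a 64-bit trace ID against a threshold derived from the sample rate. This is an illustration of the technique, not Jaeger's actual sampler code, and the ramp stages are hypothetical values.

```python
import random

MAX_TRACE_ID = 2**64  # assumes 64-bit trace IDs


def should_sample(trace_id: int, sample_rate: float) -> bool:
    """Head-based decision: made once from the trace ID when the trace starts."""
    return trace_id < int(sample_rate * MAX_TRACE_ID)


# Hypothetical ramp-up stages; the real values would come from monitoring.
RAMP_STAGES = [0.001, 0.01, 0.1]

# Simulate trace IDs with a fixed seed so runs are repeatable.
rng = random.Random(42)
trace_ids = [rng.getrandbits(64) for _ in range(100_000)]

for rate in RAMP_STAGES:
    sampled = sum(should_sample(t, rate) for t in trace_ids)
    print(f"rate={rate}: sampled {sampled} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID and the rate, any trace sampled at a low rate stays sampled after the rate is raised, which makes before/after comparisons during the ramp-up easier to reason about.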
cc @joshlambert
Edited by Michal Wasilewski