Tracing with Jaeger
Description
Every application needs performance monitoring, so we should make it part of Auto DevOps.
Proposal
Architecture:
- APM agent: Datadog https://docs.datadoghq.com/libraries/#apm-tracing-client-libraries (open source, uses opentracing https://www.datadoghq.com/blog/opentracing-datadog-cncf/)
- Collector: Jaeger collector https://github.com/jaegertracing/jaeger/tree/master/cmd/collector (we'll be able to reuse that when we start to do tracing https://gitlab.com/gitlab-org/gitlab-ee/issues/3458 )
- Storage: Elasticsearch (Jaeger only supports Cassandra and ES, we already use ES for search)
- Backend: GitLab Rails (already interfaces with ES for search, we want one application)
- UI: GitLab Vue (show the graphs alongside the Prometheus metrics)
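The data flow in the list above can be sketched as a write path (agent to collector to storage) and a separate read path (backend querying storage for the UI). This is a toy in-memory model under stated assumptions, not Jaeger's or GitLab's API; every class and function name in it is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    trace_id: str
    service: str
    operation: str
    duration_ms: float


@dataclass
class Storage:
    """Stand-in for Elasticsearch: keeps spans in a list (hypothetical)."""
    spans: List[Span] = field(default_factory=list)

    def write(self, span: Span) -> None:
        self.spans.append(span)

    def find_by_service(self, service: str) -> List[Span]:
        return [s for s in self.spans if s.service == service]


class Collector:
    """Stand-in for the collector: write-only, it offers no query method."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def submit(self, span: Span) -> None:
        self._storage.write(span)


def backend_average_latency(storage: Storage, service: str) -> float:
    """Stand-in for the backend: reads from storage, never from the collector."""
    spans = storage.find_by_service(service)
    if not spans:
        return 0.0
    return sum(s.duration_ms for s in spans) / len(spans)


storage = Storage()
collector = Collector(storage)
# The APM agent inside the application would submit spans like these.
collector.submit(Span("t1", "web", "GET /projects", 120.0))
collector.submit(Span("t2", "web", "GET /projects", 80.0))
collector.submit(Span("t3", "sidekiq", "PostReceive", 300.0))
web_latency = backend_average_latency(storage, "web")
```

The point of the split is that anything the UI shows has to come from a query against storage, so the backend needs its own read path.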
/cc @markpundsack @joshlambert @bjk-gitlab
Notes
Need a collector and a query service
Zipkin or Jaeger
Jaeger is popular, but are the contributors Uber employees only? Also, of the two, only Jaeger is a CNCF project
Would need to change the wire protocol for Jaeger OpenTracing to Datadog
Nice if we want to use Jaeger for non-APM tracer later
Uber had a problem with multiple languages
https://eng.uber.com/wp-content/uploads/2017/02/4-5-EngBlog-Distributed-Tracing-at-Uber-768x432.png looks like how I would make it, Zipkin for UI and query
Zipkin uses thrift?
Even though the Zipkin backend was fairly well known and popular, it lacked a good story on the instrumentation side, especially outside of the Java/Scala ecosystem. All our client libraries have been built to support the OpenTracing API from inception.
Jaeger does sampling via the backend, Zipkin on initialization
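The difference between the two sampling approaches can be sketched as follows. This is a minimal illustration, not either project's client code: both class names, the `fetch_rate` callback and the manual `refresh` call are hypothetical stand-ins for whatever the real client libraries do.

```python
import random
from typing import Callable


class FixedRateSampler:
    """Rate chosen once at tracer initialization (the Zipkin-style case)."""

    def __init__(self, rate: float,
                 rng: Callable[[], float] = random.random) -> None:
        self._rate = rate
        self._rng = rng

    def is_sampled(self) -> bool:
        # Keep the trace when a draw from [0, 1) falls below the rate.
        return self._rng() < self._rate


class RemoteRateSampler(FixedRateSampler):
    """Rate refreshed from the backend at runtime (the Jaeger-style case)."""

    def __init__(self, fetch_rate: Callable[[], float],
                 rng: Callable[[], float] = random.random) -> None:
        super().__init__(fetch_rate(), rng)
        self._fetch_rate = fetch_rate

    def refresh(self) -> None:
        # A real client would poll periodically; here the caller triggers it.
        self._rate = self._fetch_rate()


# Hypothetical backend setting, changed while the application keeps running.
backend_rate = {"value": 1.0}
sampler = RemoteRateSampler(lambda: backend_rate["value"], rng=lambda: 0.5)
sampled_before = sampler.is_sampled()
backend_rate["value"] = 0.1
sampler.refresh()
sampled_after = sampler.is_sampled()
```

With backend-driven sampling the rate can be lowered without redeploying the application, which matters if tracing is switched on by default for every Auto DevOps project.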
We probably can't use Zipkin's UI since it is not made for APM and we want something in GitLab.
Zipkin has one prolific author (Yuri, who works at Uber!), Jaeger has more, they are complementary!
Jaeger has well defined client libraries
Zipkin is not a CNCF project
You use it with Prometheus for metrics and Jaeger for tracing https://www.youtube.com/watch?time_continue=6&v=fjYAU3jayVo at 41:00 plus https://github.com/objectiser/opentracing-prometheus-example
also https://github.com/jaegertracing/jaeger/commit/9988a3d3d0be3d832c1151f17f5d6eac59a0c052 to support Prometheus metrics as default for all components
standalone https://github.com/jaegertracing/jaeger/blob/master/cmd/standalone/standalone_test.go is the part we need?
jaegertracing/all-in-one: the container runs the Jaeger backend with an in-memory store
There is also https://github.com/lightstep/lightstep-tracer-go (only one author) and https://github.com/sourcegraph/appdash (not active)
Logs are sampled but included. Still need full logs for compliance and diagnosing an error of a single user, but not for finding out what is happening to most users.
Conventions are in https://github.com/opentracing/specification/blob/master/semantic_conventions.md
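As a rough illustration of what following those conventions looks like, the sketch below builds the tag set for an HTTP server span. The tag names (`span.kind`, `http.method`, `http.url`, `http.status_code`, `error`) come from the conventions document; the helper function and the rule that a 5xx status counts as an error are assumptions made for this example.

```python
from typing import Dict, Union

TagValue = Union[str, int, bool]


def http_server_tags(method: str, url: str,
                     status_code: int) -> Dict[str, TagValue]:
    """Build span tags for an inbound HTTP request (hypothetical helper)."""
    tags: Dict[str, TagValue] = {
        "span.kind": "server",
        "http.method": method,
        "http.url": url,
        "http.status_code": status_code,
    }
    # Treating 5xx as a failure is this sketch's choice, not a rule
    # taken from the conventions.
    if status_code >= 500:
        tags["error"] = True
    return tags


ok_tags = http_server_tags("GET", "https://example.com/projects", 200)
failed_tags = http_server_tags("POST", "https://example.com/issues", 503)
```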
Do we add APM to the Jaeger UI? No, we want everything in GitLab.
https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941 The most common measurements for RPC are request counts, error counts, and distribution of request latencies. OpenTracing instrumentation already captures all these signals and can emit metrics without additional instrumentation.
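A minimal sketch of how those three signals could be derived from finished spans, assuming each span is a plain dict with `operation`, `duration_ms` and `tags` keys. That shape is invented for this example and is not the OpenTracing or Jaeger data model.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List


def rpc_metrics(spans: List[dict]) -> Dict[str, dict]:
    """Aggregate request count, error count and latency per operation."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["operation"]].append(span)

    metrics = {}
    for operation, group in grouped.items():
        durations = sorted(s["duration_ms"] for s in group)
        metrics[operation] = {
            "requests": len(group),
            "errors": sum(1 for s in group if s["tags"].get("error") is True),
            "latency_p50_ms": median(durations),
            "latency_max_ms": durations[-1],
        }
    return metrics


spans = [
    {"operation": "GET /projects", "duration_ms": 100.0, "tags": {}},
    {"operation": "GET /projects", "duration_ms": 300.0, "tags": {"error": True}},
    {"operation": "GET /projects", "duration_ms": 200.0, "tags": {}},
    {"operation": "POST /issues", "duration_ms": 50.0, "tags": {}},
]
metrics = rpc_metrics(spans)
```

Because sampled spans are only a subset of requests, counts computed this way would need to be scaled or computed before sampling to match real traffic.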
https://github.com/jaegertracing has no Ruby client
Jaeger cares about accepting Zipkin https://github.com/jaegertracing/xdock-zipkin-brave
Collector is https://github.com/jaegertracing/jaeger/tree/master/cmd/collector
One problem is that the collector doesn't allow queries.
The collector normally sends those to Cassandra; let's use the same? They also have memory and Elasticsearch support, with MySQL coming soon. Let's use Elasticsearch, since we use that for advanced search and will use it for logs.
You can measure Envoy with Jaeger https://www.envoyproxy.io/docs/envoy/latest/install/sandboxes/jaeger_tracing
Need to wait for MySQL support in Jaeger and then add PostgreSQL support. https://github.com/jaegertracing/jaeger/tree/master/plugin/storage doesn't have MySQL. http://jaeger.readthedocs.io/en/latest/deployment/#storage-backend says "There is ongoing work to add support for MySQL and ScyllaDB.", but the main author wants to remove that statement (https://github.com/jaegertracing/jaeger/pull/473/commits/d69b82dc8b19099c2044448512f2c0708efeb5cc#r144877703) and is considering plugins as an alternative (https://github.com/jaegertracing/jaeger/issues/422).
Zipkin Jaeger drama https://github.com/jaegertracing/jaeger/issues/272#issuecomment-323922392 and https://github.com/jaegertracing/jaeger/issues/140#issuecomment-299073230
Ticketmaster is experimenting https://github.com/jaegertracing/jaeger/issues/207#issuecomment-344788188