Built-in primitive logging distributed tracer
Preface
I was asked to join a client support call yesterday with @stanhu, @lbot and others (see https://gitlab.zendesk.com/agent/tickets/111902 if you have access).
A client was experiencing a sudden and pronounced slow-down in their GitLab instance.
During the call, a lot of time was spent trying to diagnose the source of the slow-down and isolate it to a single component. This was a complex manual task.
Now that the application has been instrumented with gitlab-ce~9331527, a far simpler approach to troubleshooting this class of problem should soon be available, once we enable it in production.
The problem is that it's unlikely (at least until we ship a tracer within the product) that self-managed clients will have tracing set up. This means it won't be of much use to support engineers, since they won't be able to collect traces from self-managed instances.
With this in mind, I started thinking about how we could bring some of the benefits of tracing to on-prem support immediately, and came up with the following proposal.
Built-in logging tracer
The GitLab application can support multiple tracing implementations, e.g. Jaeger, Lightstep, etc.
This is done by configuring the GITLAB_TRACING environment variable with a tracing "connection string":
For example:
GITLAB_TRACING=opentracing://jaeger?udp_endpoint=localhost:123
GITLAB_TRACING=opentracing://lightstep
I propose we add an additional ultra-simple logging tracer that will ship with the product.
To enable it, the connection string would be set with something like:
GITLAB_TRACING=opentracing://logger?token=secret
Once enabled, a trace could be generated by including a special header in the request:
curl -H "X-Log-Trace-Token: secret" https://gitlab.example.com/api/v4/projects
If the header exists and matches the configured token, the process will emit the tracing information to stdout (or possibly a different log file).
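The token check could be implemented as a small piece of Rack middleware. A minimal sketch, assuming hypothetical names (`LogTraceMiddleware`, the `log_trace.enabled` env key) that are not existing GitLab code:

```ruby
# Hypothetical sketch: Rack middleware that flags a request for trace
# logging only when it carries the configured trace token.
class LogTraceMiddleware
  # Rack exposes the X-Log-Trace-Token request header under this env key
  TRACE_HEADER = 'HTTP_X_LOG_TRACE_TOKEN'.freeze

  def initialize(app, token:)
    @app = app
    @token = token
  end

  def call(env)
    # Only enable trace logging when a token is configured and matches
    env['log_trace.enabled'] = true if @token && env[TRACE_HEADER] == @token
    @app.call(env)
  end
end
```

Downstream instrumentation would then emit span log lines only for requests where the flag is set, so normal traffic pays no logging cost.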
The logs would look something like:
2018-02-01T08:01:00.000Z 0.1s "activerecord:SELECT * FROM USERS" span.kind=client db.type=postgres
2018-02-01T08:01:00.000Z 0.1s "gitaly:FindCommit" span.kind=client grpc.method=FindCommit
2018-02-01T08:01:00.000Z 0.1s "gitaly:FindCommit" span.kind=client grpc.method=FindCommit
2018-02-01T08:01:00.000Z 0.1s "gitaly:FindCommit" span.kind=client grpc.method=FindCommit
2018-02-01T08:01:00.000Z 0.1s "gitaly:FindCommit" span.kind=client grpc.method=FindCommit
2018-02-01T08:01:00.000Z 0.1s "gitaly:FindCommit" span.kind=client grpc.method=FindCommit
2018-02-01T08:00:00.000Z 5s GET /api/v4/projects span.kind=server
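For illustration, a finished span could be serialized into one such line roughly as follows. The `Span` struct and `format_span` helper are assumptions for this sketch, not GitLab internals:

```ruby
require 'time'

# Minimal stand-in for a finished OpenTracing-style span
Span = Struct.new(:operation_name, :start_time, :duration_s, :tags)

# Render one span as a single log line:
#   <ISO 8601 UTC timestamp> <duration>s "<operation>" key=value ...
def format_span(span)
  tags = span.tags.map { |k, v| "#{k}=#{v}" }.join(' ')
  ts = span.start_time.utc.iso8601(3)
  format('%s %ss "%s" %s', ts, span.duration_s, span.operation_name, tags)
end
```

Keeping each span on one self-contained line means the output stays greppable and can be post-processed with ordinary text tools.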
The logs would include details and durations of all instrumented operations (currently SQL queries, template renders, gRPC calls, and Sidekiq jobs).
While this is simple and accessible, it has many downsides compared to a proper tracing system, including:
- traces will not be hierarchical
- spans will not be emitted in order (ordering them would add significant complexity)
- only specifically requested traces will be instrumented (no sampling, aggregation, etc.)
- spans from different processes will end up in different log files
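The ordering limitation can be partially worked around after the fact: since each line begins with a UTC ISO 8601 timestamp, a lexicographic sort on the timestamp prefix yields chronological order. A hedged sketch of such a post-processing step (the helper name is illustrative):

```ruby
# Reorder collected trace lines chronologically. ISO 8601 UTC timestamps
# sort lexicographically in time order, so a plain string sort on the
# first whitespace-delimited field is sufficient.
def sort_trace_lines(lines)
  lines.sort_by { |line| line.split(' ', 2).first }
end
```

A support engineer could concatenate lines gathered from several log files and run them through this before reading the trace.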
Even with these disadvantages, I think this information would be incredibly helpful.