Labkit as the in-application platform toolkit
We want all the applications we deploy to be uniformly observable. To get there, Labkit should:
- Emit logs with the same fields, using the same units and naming conventions.
- Provide an API to record custom metrics in expected formats: Application SLIs that can then be changed in a single place while we iterate.
- Prevent logging in incorrect formats and logging of sensitive information.
- Guide developers to use the right metrics, preventing cardinality explosions in large services.
- Pass along context information for use in metrics, logs and traces, so that it can be reused across multiple services that also implement Labkit.
- Provide continuous profiling.
- Support running the application in FIPS mode.
On top of that, this means that we can add technology-specific information to metrics and logs from a single place. For example, for Ruby we could add information about the GVL to metrics and logs. This would then be uniformly available for all Ruby services.
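To make the logging requirements from the list above concrete, here is a minimal sketch in Python of what "same fields, same units, no sensitive information" could look like when enforced in one place. Everything in it is an assumption for illustration: the names `build_log_entry` and `JsonFormatter`, the set of sensitive field names, the `[FILTERED]` placeholder and the rule that durations must use an `_s` suffix are hypothetical and not part of any existing Labkit library.

```python
import json
import logging

# Hypothetical conventions, chosen only for this sketch.
SENSITIVE_FIELDS = {"password", "token", "authorization"}
REJECTED_DURATION_SUFFIXES = ("_ms", "_us")  # durations must be seconds, suffix "_s"


def build_log_entry(message, fields):
    """Return a dict ready for JSON logging, enforcing the shared conventions."""
    entry = {"message": message}
    for name, value in fields.items():
        if name.endswith(REJECTED_DURATION_SUFFIXES):
            raise ValueError(f"{name}: log durations in seconds with an _s suffix")
        # Replace sensitive values instead of dropping the field entirely,
        # so it stays visible that the field was present.
        entry[name] = "[FILTERED]" if name in SENSITIVE_FIELDS else value
    return entry


class JsonFormatter(logging.Formatter):
    """Formats records as JSON, reading extra fields from record.fields."""

    def format(self, record):
        entry = build_log_entry(record.getMessage(), getattr(record, "fields", {}))
        entry["severity"] = record.levelname
        return json.dumps(entry, sort_keys=True)
```

A service would attach the formatter to its handler and log with `logger.info("request finished", extra={"fields": {"duration_s": 0.12}})`. Because the checks live in the library, tightening a convention later means changing one place rather than every service.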
We currently have 3 Labkit libraries, 2 of which are actively in use, but they don't have feature parity:
- Labkit-ruby: https://gitlab.com/gitlab-org/ruby/gems/labkit-ruby
- Labkit-go: https://gitlab.com/gitlab-org/labkit
- Labkit-python: https://gitlab.com/gitlab-org/labkit-python (not actively used and nothing implemented yet, but also needed)
Labkit-ruby
Already handles contexts and correlation, and makes sure that all fields are included in logs.
Does not handle metrics yet. There's a complication: GitLab-rails is tightly coupled to our own prometheus-client-mmap gem through Gitlab::Metrics::Prometheus. Ideally, we'd remove this coupling so that we can use the community Prometheus client gem. The community gem now supports multiprocess servers like the ones we're using, but the last time I checked, our Rust implementation was more performant at aggregating metrics to be served. We could consider contributing this upstream to unblock us.
The concept of Application SLIs is already implemented in 2 services: GitLab-rails and customersdot.
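As a rough illustration of the Application SLI concept, here is a minimal Python sketch of an apdex-style SLI: one counter for all operations and one for the operations that were fast enough, both keyed by a fixed set of labels. This is not the GitLab-rails or customersdot implementation. The class name `ApdexSli`, its methods and the example label names are hypothetical, and the counters are kept in memory instead of being exported through a Prometheus client.

```python
from collections import Counter


class ApdexSli:
    """Hypothetical apdex SLI: counts operations and how many met their target."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = frozenset(label_names)
        self.total = Counter()
        self.success = Counter()

    def _key(self, labels):
        # Only the labels declared up front are accepted. This is what keeps
        # cardinality bounded: a caller cannot add a per-user label ad hoc.
        if set(labels) != self.label_names:
            raise ValueError(
                f"{self.name} expects labels {sorted(self.label_names)}, "
                f"got {sorted(labels)}"
            )
        return tuple(sorted(labels.items()))

    def increment(self, labels, success):
        key = self._key(labels)
        self.total[key] += 1
        if success:
            self.success[key] += 1

    def ratio(self, labels):
        """Return the success ratio for a label set, or None if nothing was recorded."""
        key = self._key(labels)
        if self.total[key] == 0:
            return None
        return self.success[key] / self.total[key]
```

An application would declare the SLI once, for example `ApdexSli("rails_request", ["endpoint_id", "feature_category"])`, and call `increment` at the end of each operation. Since every service records the same two counters, the way they are named and exported can change inside the library without touching the callers.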
Labkit-ruby has rudimentary support for tracing, but this is not enabled currently.
Has no profiling support yet because, the last time we checked, this was not supported by GCP's Cloud Profiler.
This gem is used in gitlab-rails, and included in customersdot.
Labkit-go
Already handles metrics using Prometheus, but has not implemented the concept of Application SLIs. If it did, we could use them so that we wouldn't have to put the performance characteristics of gRPC calls in our metrics catalog.
Enables profiling, and profiles are available in the Cloud Profiler.
Passes along correlation IDs, but no other context information. Context is a loaded term in Go, but I mean the extra information we provide there (caller_id, root_caller_id, namespace, project, ...).
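To show what "context information" means here, independent of Go's own `context` package, this is a minimal Python sketch of a request-scoped context that carries fields such as `caller_id` and `project` alongside the correlation ID. The names `with_context` and `current_context` are hypothetical. The rule that the first `caller_id` in a chain is kept as `root_caller_id` is an assumption made for this sketch, not a description of the existing libraries.

```python
import contextvars
from contextlib import contextmanager

# Holds the fields for the current request or job. The default dict is never
# mutated: every update below builds a new dict.
_current = contextvars.ContextVar("labkit_context", default={})


@contextmanager
def with_context(**fields):
    """Layer extra fields over the current context for the duration of a block."""
    merged = {**_current.get(), **fields}
    # Assumption for this sketch: the first caller_id seen stays as root_caller_id.
    if "caller_id" in merged:
        merged.setdefault("root_caller_id", merged["caller_id"])
    token = _current.set(merged)
    try:
        yield merged
    finally:
        # Restore whatever context was active before this block.
        _current.reset(token)


def current_context():
    """Return a copy of the active context, for example to merge into a log entry."""
    return dict(_current.get())
```

A logger or metrics helper can then call `current_context()` to enrich every entry without the caller passing the fields around explicitly. Passing the same fields to a downstream service would additionally need an agreed wire format, which this sketch leaves out.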
Allows writing structured logs, but these are not enriched with context information like we do in Labkit-ruby. The fields are not uniformly named and do not use the same units (sometimes _ms, sometimes _s, I've even seen some _us).
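One way to move existing fields towards a single unit is a small normalization step that rewrites duration fields into seconds before they are logged. The Python sketch below assumes that the suffix alone identifies the unit and that seconds with an `_s` suffix is the target convention. The helper name `normalize_durations` and the divisor table are hypothetical.

```python
# Hypothetical mapping from field-name suffix to the divisor that yields seconds.
DIVISORS = {"_ms": 1_000, "_us": 1_000_000}


def normalize_durations(fields):
    """Return a copy of fields with _ms and _us durations converted to _s."""
    normalized = {}
    for name, value in fields.items():
        for suffix, divisor in DIVISORS.items():
            if name.endswith(suffix) and isinstance(value, (int, float)):
                # Rename the field and convert its value in one step.
                normalized[name[: -len(suffix)] + "_s"] = value / divisor
                break
        else:
            # No duration suffix matched: keep the field unchanged.
            normalized[name] = value
    return normalized
```

For example, `normalize_durations({"duration_ms": 250})` gives `{"duration_s": 0.25}`. A shim like this would only be a migration aid. The longer-term fix is for the library to accept durations in one unit in the first place.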
This is used in workhorse, gitaly, gitlab-sshd (aka gitlab-shell) and perhaps other services.
Labkit-python
Nothing implemented yet
Proposal
We should clearly define what we want the Labkit library to do, and then scope out implementing that for the 3 languages that currently have customer-facing services deployed at GitLab.