Geo: Correlate logs across sites
Problem
GitLab Geo is composed of multiple services (at least a primary site and a secondary site). This makes it difficult to find relevant logs quickly when debugging. For example, when a Git push occurs, the correlation ID does not propagate across sites.
Propagating correlation IDs across sites would help solve this problem.
Proposal
These 3 things can be done in parallel or broken out to separate issues.
Thing 1
- When an action on the primary site creates a Geo event, include the current correlation ID in the Geo event payload
- In Geo Log Cursor, when processing a Geo event, if a correlation ID was included, then set the current correlation ID. Maybe:
  Labkit::Correlation::CorrelationId.use_id(geo_event_correlation_id) { do_stuff }
  See use_id.
- Confirm that Geo Log Cursor log output includes that correlation ID
- Confirm that when Geo Log Cursor enqueues a Sidekiq job, the Sidekiq job inherits that correlation ID
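The flow in Thing 1 could be sketched roughly as follows. This is only an illustration, not the actual Geo implementation: `CorrelationSketch` is a hypothetical stand-in that mimics the `use_id` and `current_id` behavior of `Labkit::Correlation::CorrelationId` so the snippet runs without the gem, and the payload shape, the `correlation_id` key, and both helper method names are assumptions.

```ruby
# Hypothetical stand-in for Labkit::Correlation::CorrelationId, so this sketch
# runs without the labkit gem. Only use_id and current_id are mimicked.
module CorrelationSketch
  KEY = :correlation_sketch_id

  def self.current_id
    Thread.current[KEY]
  end

  # Sets the correlation ID for the duration of the block, then restores
  # whatever ID was set before.
  def self.use_id(id)
    previous = Thread.current[KEY]
    Thread.current[KEY] = id
    yield
  ensure
    Thread.current[KEY] = previous
  end
end

# Primary site: build a Geo event payload (assumed shape) that carries the
# correlation ID of the action that created the event.
def build_geo_event_payload(event_name)
  { 'event_name' => event_name, 'correlation_id' => CorrelationSketch.current_id }
end

# Secondary site (Geo Log Cursor): process the event under the correlation ID
# it carries. Events without an ID are processed unchanged.
def process_geo_event(payload)
  id = payload['correlation_id']
  return yield if id.nil? || id.empty?

  CorrelationSketch.use_id(id) { yield }
end
```

With the real library, the body of `process_geo_event` would presumably wrap the existing event handling in `Labkit::Correlation::CorrelationId.use_id`, so that log lines and enqueued Sidekiq jobs pick up the ID from the surrounding context.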
Thing 2
- When a site makes a request with Geo::RequestService to another site, it should include its correlation ID
- When a site receives a request originating from another Geo site, and the request includes a properly formatted correlation ID, then the correlation ID should be used
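As a minimal sketch of Thing 2, the two sides could look something like this. The header name `X-Request-ID`, the format rule, and both method names are assumptions made for illustration. This issue does not specify how the ID travels or what "properly formatted" means.

```ruby
# Assumed header name and format rule; neither is confirmed by this issue.
CORRELATION_HEADER = 'X-Request-ID'
VALID_CORRELATION_ID = /\A[\w\-@]{1,255}\z/

# Sending site: add the current correlation ID to the outgoing request
# headers, leaving the headers untouched when there is no ID to send.
def with_correlation_header(headers, current_id)
  return headers if current_id.nil? || current_id.empty?

  headers.merge(CORRELATION_HEADER => current_id)
end

# Receiving site: return the ID to adopt, or nil when the header is missing
# or malformed, so the caller can fall back to generating a fresh ID.
def incoming_correlation_id(headers)
  id = headers[CORRELATION_HEADER]
  id if id.is_a?(String) && VALID_CORRELATION_ID.match?(id)
end
```

Rejecting malformed values rather than sanitizing them keeps untrusted input out of the logs. Whether to reject or sanitize is a design choice this proposal leaves open.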
Thing 3
- When a secondary site proxies a request to a primary site, it should include its correlation ID
- When a primary site receives a proxied request from a secondary site, it should use the included correlation ID
Note that this won't solve all log correlation in Geo. (E.g. some tasks change DB state and then some other cronjob acts on a batch of records in the DB.) But this proposal provides good value by itself.
Edited by Michael Kozono