Geo metrics: geo_cursor_last_event_timestamp drops to zero on secondary site

Summary

While conducting Geo failover tests on GitLab Dedicated ra5k test tenants (https://gitlab.com/groups/gitlab-com/gl-infra/gitlab-dedicated/-/epics/325, https://gitlab.com/groups/gitlab-com/gl-infra/gitlab-dedicated/-/epics/446), we noticed that the Geo metric geo_cursor_last_event_timestamp, exported on the Geo secondary site, would occasionally drop to 0.

We are using the following equation to calculate Geo Log Cursor replication lag, per https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/3754#geo-processing-lag and #197147 (comment 1733466667):

min(geo_last_event_timestamp - geo_cursor_last_event_timestamp)

When geo_cursor_last_event_timestamp drops to 0, the Geo Log Cursor replication lag query reports a lag of 50+ years (the full epoch time, in seconds, reported by geo_last_event_timestamp), which we know is not the actual lag.
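To make the arithmetic concrete, here is a minimal Python sketch of the subtraction above. The sample timestamps and the helper name cursor_lag_seconds are hypothetical and are not taken from the affected tenants. Treating a zero sample as "no data" is an assumption about how a dashboard might guard against this, not a fix for the underlying bug.

```python
from typing import Optional

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def cursor_lag_seconds(last_event_ts: float, cursor_last_event_ts: float) -> Optional[float]:
    """Return the lag in seconds, or None when the cursor sample is not a usable timestamp."""
    if cursor_last_event_ts <= 0:
        # Assumption: a zero sample means "no data", not the Unix epoch itself.
        return None
    return last_event_ts - cursor_last_event_ts


# Hypothetical sample values for illustration only.
last_event_ts = 1_700_000_000.0      # stands in for geo_last_event_timestamp
healthy_cursor_ts = 1_699_999_970.0  # stands in for geo_cursor_last_event_timestamp
dropped_cursor_ts = 0.0              # the metric after it drops to 0

# Unguarded subtraction: the whole epoch value is reported as lag.
naive_lag_years = (last_event_ts - dropped_cursor_ts) / SECONDS_PER_YEAR

print(cursor_lag_seconds(last_event_ts, healthy_cursor_ts))
print(cursor_lag_seconds(last_event_ts, dropped_cursor_ts))
print(round(naive_lag_years, 1))
```

With these sample values, the unguarded subtraction yields roughly 54 years, which matches the 50+ year lag seen on the dashboard, while the guarded helper returns no value for the zero sample.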

Steps to reproduce

  1. Spin up a Geo-enabled GitLab instance with at minimum two sites.
  2. Generate significant load on the instance (e.g. by creating repos, branches, files, wikis, issues, and snippets).
  3. After load generation has been running for a period of time*, observe the geo_cursor_last_event_timestamp metric exported from the Geo secondary site drop to 0.

*This behavior does not occur during every test, and we have not yet performed any isolated testing to correlate it with other system behaviors or events.

What is the current bug behavior?

geo_cursor_last_event_timestamp occasionally drops to 0. The correlated events and the cause are still unknown.

What is the expected correct behavior?

geo_cursor_last_event_timestamp always reports an accurate (epoch) timestamp, regardless of load on the system.

Relevant logs and/or screenshots

A recent example is linked here.

Dashboard screenshot: [image]