Geo metrics: geo_cursor_last_event_timestamp drops to zero on secondary site
Summary
While conducting Geo failover tests on GitLab Dedicated ra5k test tenants (https://gitlab.com/groups/gitlab-com/gl-infra/gitlab-dedicated/-/epics/325, https://gitlab.com/groups/gitlab-com/gl-infra/gitlab-dedicated/-/epics/446), we noticed that the Geo metric geo_cursor_last_event_timestamp exported on the Geo secondary site would occasionally drop to 0.
We are using the following equation to calculate Geo Log Cursor replication lag, per https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/3754#geo-processing-lag and #197147 (comment 1733466667):
min(geo_last_event_timestamp - geo_cursor_last_event_timestamp)
When geo_cursor_last_event_timestamp drops to 0, the subtraction returns the full epoch value of geo_last_event_timestamp (measured in seconds), so the Geo Log Cursor replication lag query reports a lag of 50+ years, which we know is not the actual lag.
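The arithmetic behind the 50+ year figure can be sketched as follows. This is a minimal illustration in Python, not the production query; the function name and the sample timestamps are hypothetical values chosen for the example.

```python
# Approximate number of seconds in a year, used only to express the lag in years.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def cursor_lag_seconds(last_event_ts: float, cursor_last_event_ts: float) -> float:
    """Mirror the subtraction used in the lag query for a single pair of samples."""
    return last_event_ts - cursor_last_event_ts


# Hypothetical sample values (epoch seconds), not taken from a real tenant.
last_event_ts = 1_700_000_000.0

# Healthy case: the cursor is 30 seconds behind the latest event.
healthy_lag = cursor_lag_seconds(last_event_ts, 1_699_999_970.0)

# Buggy case: the cursor metric reports 0, so the "lag" equals the epoch value.
broken_lag = cursor_lag_seconds(last_event_ts, 0.0)
broken_lag_years = broken_lag / SECONDS_PER_YEAR

print(f"healthy lag: {healthy_lag:.0f} s")
print(f"broken lag:  {broken_lag:.0f} s (~{broken_lag_years:.1f} years)")
```

With these sample values the healthy lag is 30 seconds, while the zero sample yields a lag equal to the entire epoch timestamp.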
Steps to reproduce
- Spin up a Geo-enabled GitLab instance with at minimum two sites.
- Generate significant load on the instance (e.g. creating repos, branches, files, wikis, issues, snippets, etc).
- For Dedicated tenant failover testing, we use drum-machine to generate this load.
- After load generation has been running for some time*, observe the geo_cursor_last_event_timestamp metric exported from the Geo secondary site drop to 0.
*This behavior does not occur during every test; however, we have not performed any isolated testing yet to correlate this behavior to other system behaviors or events.
What is the current bug behavior?
geo_cursor_last_event_timestamp occasionally drops to 0. Correlated events and the cause are still unknown.
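Since the cause is still unknown, one way to start correlating could be to pinpoint exactly when the metric transitions to 0 in exported samples and compare those times against other events. The sketch below assumes the samples have already been exported as (scrape time, value) pairs; the helper name and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple

# One scraped sample: (scrape time in epoch seconds, metric value).
Sample = Tuple[float, float]


def find_zero_drops(samples: Iterable[Sample]) -> List[float]:
    """Return the scrape times at which the value went from non-zero to 0."""
    drops: List[float] = []
    previous_value = None
    for scrape_time, value in samples:
        # Only record the transition, not every consecutive zero sample.
        if value == 0 and previous_value is not None and previous_value != 0:
            drops.append(scrape_time)
        previous_value = value
    return drops


# Hypothetical samples of geo_cursor_last_event_timestamp, in scrape order.
samples = [
    (100.0, 1_700_000_000.0),
    (115.0, 1_700_000_010.0),
    (130.0, 0.0),  # first drop
    (145.0, 0.0),  # still zero, not a new drop
    (160.0, 1_700_000_050.0),
    (175.0, 0.0),  # second drop
]

drop_times = find_zero_drops(samples)
print(drop_times)
```

Recording only the non-zero to zero transitions keeps the output short, so each entry can be checked against logs from the same moment.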
What is the expected correct behavior?
geo_cursor_last_event_timestamp always reports an accurate (epoch) timestamp, regardless of load on the system.
Relevant logs and/or screenshots
A recent example is linked here.
Dashboard screenshot: