Skip to content

Geo secondary proxying: replication lag can introduce read-after-write inconsistencies

When running a full QA pipeline on a GET setup with Geo proxying enabled, we can see a few tests failing as they are relying on read-your-writes consistency:

qa/specs/features/ee/browser_ui/3_create/repository/push_rules_spec.rb:162
qa/specs/features/browser_ui/3_create/repository/protocol_v2_push_ssh_spec.rb:30
qa/specs/features/ee/browser_ui/3_create/repository/file_locking_spec.rb:46
qa/specs/features/ee/browser_ui/3_create/repository/file_locking_spec.rb:35
qa/specs/features/api/3_create/merge_request/push_options_mwps_spec.rb:75
qa/specs/features/browser_ui/1_manage/user/follow_user_activity_spec.rb:64
qa/specs/features/browser_ui/6_release/pages/pages_pipeline_spec.rb:34
qa/specs/features/browser_ui/3_create/merge_request/merge_merge_request_from_fork_spec.rb:20
qa/specs/features/browser_ui/7_configure/auto_devops/create_project_with_auto_devops_spec.rb:116
qa/specs/features/browser_ui/3_create/snippet/clone_push_pull_personal_snippet_spec.rb:73
qa/specs/features/browser_ui/3_create/snippet/clone_push_pull_project_snippet_spec.rb:74
qa/specs/features/browser_ui/3_create/merge_request/rebase_merge_request_spec.rb:12
qa/specs/features/browser_ui/3_create/merge_request/suggestions/batch_suggestion_spec.rb:49
qa/specs/features/api/3_create/repository/push_postreceive_idempotent_spec.rb:20
qa/specs/features/browser_ui/3_create/merge_request/cherry_pick/cherry_pick_a_merge_spec.rb:25

Here's a flow that explains what happens, through the QA test:

D, [2021-11-09T15:33:13.255293 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-tyhpwd], command=[HOME="/tmp/qa-netrc-credentials/2083224" git clone  http://root@unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git ./ 2>&1]
D, [2021-11-09T15:33:13.635379 #2083224] DEBUG -- : Git: output=[Cloning into '.'...]
D, [2021-11-09T15:33:13.656420 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-tyhpwd], command=[HOME="/tmp/qa-netrc-credentials/2083224" git commit -m "This is a test commit" 2>&1]
D, [2021-11-09T15:33:13.667557 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-tyhpwd], command=[HOME="/tmp/qa-netrc-credentials/2083224" git push http://root@unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git main:main 2>&1]
D, [2021-11-09T15:33:15.050080 #2083224] DEBUG -- : Git: output=[warning: redirecting to http://unified.geo-gitlab-test.ml/-/push_from_secondary/2/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git/
remote:
remote: This request to a Geo secondary node will be forwarded to the
remote: Geo primary node:
remote:
remote:   http://unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git
remote:
remote:
To http://unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git
 * [new branch]      main -> main], exitstatus=[0]
D, [2021-11-09T15:33:15.586182 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-oj34bs], command=[HOME="/tmp/qa-netrc-credentials/2083224" git clone  http://root@unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git ./ 2>&1]
D, [2021-11-09T15:33:15.909454 #2083224] DEBUG -- : Git: output=[Cloning into '.'...
warning: You appear to have cloned an empty repository.], exitstatus=[0]
D, [2021-11-09T15:33:15.937943 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-oj34bs], command=[HOME="/tmp/qa-netrc-credentials/2083224" git checkout  "main" 2>&1]
D, [2021-11-09T15:33:15.951788 #2083224] DEBUG -- : Git: output=[error: pathspec 'main' did not match any file(s) known to git], exitstatus=[1]

Thus:

  1. Clone from the secondary => repository wasn't replicated yet, gets redirected to the primary, pulls the empty repo
  2. Push to the secondary => gets proxied to the primary, as secondary isn't writable
  3. Clone again from the secondary => the push wasn't replicated yet, resulting in the clone of the stale, empty repo

This seems to be a problem with even a tiny amount of replication lag, as in - any replication lag greater than the delay between the write -> read, as the secondary isn't aware of this delay in the first place.

Proposal

  • Add project_repository_last_updated_at (Naming is debatable but I think it should be at least different from last_repository_updated_at) to project_repository_states
  • Add wiki_repository_last_updated_at also since we're already doing a migration to this table
  • Add code to touch these when a project/wiki repository is updated
  • Modify ProjectRegistry#repository_out_of_date? to check the new field also, if it is set
  • If DB replication lag < small_db_lag, then sleep DB replication lag seconds in repository_out_of_date? right before checking project_repository_last_updated_at
  • Hard code small_db_lag at least for now, maybe 250ms

Leave last_repository_updated_at alone for now.

Edited by Michael Kozono