Geo secondary proxying: replication lag can introduce read-after-write inconsistencies
When running a full QA pipeline on a GET setup with Geo proxying enabled, we can see a few tests failing as they are relying on read-your-writes consistency:
qa/specs/features/ee/browser_ui/3_create/repository/push_rules_spec.rb:162
qa/specs/features/browser_ui/3_create/repository/protocol_v2_push_ssh_spec.rb:30
qa/specs/features/ee/browser_ui/3_create/repository/file_locking_spec.rb:46
qa/specs/features/ee/browser_ui/3_create/repository/file_locking_spec.rb:35
qa/specs/features/api/3_create/merge_request/push_options_mwps_spec.rb:75
qa/specs/features/browser_ui/1_manage/user/follow_user_activity_spec.rb:64
qa/specs/features/browser_ui/6_release/pages/pages_pipeline_spec.rb:34
qa/specs/features/browser_ui/3_create/merge_request/merge_merge_request_from_fork_spec.rb:20
qa/specs/features/browser_ui/7_configure/auto_devops/create_project_with_auto_devops_spec.rb:116
qa/specs/features/browser_ui/3_create/snippet/clone_push_pull_personal_snippet_spec.rb:73
qa/specs/features/browser_ui/3_create/snippet/clone_push_pull_project_snippet_spec.rb:74
qa/specs/features/browser_ui/3_create/merge_request/rebase_merge_request_spec.rb:12
qa/specs/features/browser_ui/3_create/merge_request/suggestions/batch_suggestion_spec.rb:49
qa/specs/features/api/3_create/repository/push_postreceive_idempotent_spec.rb:20
qa/specs/features/browser_ui/3_create/merge_request/cherry_pick/cherry_pick_a_merge_spec.rb:25
Here's a flow that explains what happens, through the QA test:
D, [2021-11-09T15:33:13.255293 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-tyhpwd], command=[HOME="/tmp/qa-netrc-credentials/2083224" git clone http://root@unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git ./ 2>&1]
D, [2021-11-09T15:33:13.635379 #2083224] DEBUG -- : Git: output=[Cloning into '.'...]
D, [2021-11-09T15:33:13.656420 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-tyhpwd], command=[HOME="/tmp/qa-netrc-credentials/2083224" git commit -m "This is a test commit" 2>&1]
D, [2021-11-09T15:33:13.667557 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-tyhpwd], command=[HOME="/tmp/qa-netrc-credentials/2083224" git push http://root@unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git main:main 2>&1]
D, [2021-11-09T15:33:15.050080 #2083224] DEBUG -- : Git: output=[warning: redirecting to http://unified.geo-gitlab-test.ml/-/push_from_secondary/2/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git/
remote:
remote: This request to a Geo secondary node will be forwarded to the
remote: Geo primary node:
remote:
remote: http://unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git
remote:
remote:
To http://unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git
* [new branch] main -> main], exitstatus=[0]
D, [2021-11-09T15:33:15.586182 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-oj34bs], command=[HOME="/tmp/qa-netrc-credentials/2083224" git clone http://root@unified.geo-gitlab-test.ml/gitlab-qa-sandbox-group/group_for_follow_user_activity_f3d2c6e495c0821d/project-for-tags-92bebaf335c7d849.git ./ 2>&1]
D, [2021-11-09T15:33:15.909454 #2083224] DEBUG -- : Git: output=[Cloning into '.'...
warning: You appear to have cloned an empty repository.], exitstatus=[0]
D, [2021-11-09T15:33:15.937943 #2083224] DEBUG -- : Git: pwd=[/tmp/d20211109-2083224-oj34bs], command=[HOME="/tmp/qa-netrc-credentials/2083224" git checkout "main" 2>&1]
D, [2021-11-09T15:33:15.951788 #2083224] DEBUG -- : Git: output=[error: pathspec 'main' did not match any file(s) known to git], exitstatus=[1]
Thus:
- Clone from the secondary => repository wasn't replicated yet, gets redirected to the primary, pulls the empty repo
- Push to the secondary => gets proxied to the primary, as secondary isn't writable
- Clone again from the secondary => the push wasn't replicated yet, resulting in the clone of the stale, empty repo
This seems to be a problem with even a tiny amount of replication lag, as in - any replication lag greater than the delay between the write -> read, as the secondary isn't aware of this delay in the first place.
Proposal
-
Add project_repository_last_updated_at
(Naming is debatable but I think it should be at least different fromlast_repository_updated_at
) toproject_repository_states
-
Add wiki_repository_last_updated_at
also since we're already doing a migration to this table -
Add code to touch these when a project/wiki repository is updated -
Modify ProjectRegistry#repository_out_of_date?
to check the new field also, if it is set -
If DB replication lag < small_db_lag, then sleep DB replication lag seconds in repository_out_of_date?
right before checkingproject_repository_last_updated_at
-
Hard code small_db_lag at least for now, maybe 250ms
Leave last_repository_updated_at
alone for now.
Edited by Michael Kozono