Geo: Replication QA tests are false green or misnamed
Problem
Most or all Geo replication QA tests are structured like this one (https://gitlab.com/gitlab-org/gitlab/-/blob/edf01bb3ad459b8a61c07752b584df1f0a9b46f7/qa/qa/specs/features/ee/browser_ui/12_systems/geo/geo_replication_ci_job_log_artifacts_spec.rb#L61):
- Setup: Create a resource
- Check:
  - Visit the secondary site
  - Observe that the resource exists on the page
This procedure was valid when all secondary sites were read-only, because data did not appear to exist on a secondary until it had been replicated.
Now, web UI and API requests made against a secondary site are forwarded to the primary. Additionally, Git requests are forwarded as soon as the secondary site becomes aware that a repo has changed. So the resource immediately appears to exist, even if it has not been replicated yet.
As a result, the Geo replication QA tests do not fail when replication is broken behind the scenes.
Proposal
Rename "Geo replication" QA tests to "Geo UI proxying".
Geo replication testing can be added to CI with https://gitlab.com/gitlab-org/quality/geo-replication-tester. Follow-up issue for making geo-replication-tester do a proper check: gitlab-org/quality/geo-replication-tester#2
The Check step must be done in such a way that it fails when replication is not complete, and succeeds when replication is complete.
With the Geo Self-Service Framework (SSF), one way to perform the Check step with 99% confidence is to poll a GraphQL query for the registry records, and observe the state field move from pending or started to synced. This query already exists and is in use by the UI for Admin > Geo > Replicable Details.
We can get 99.5% confidence by additionally waiting for the registry record's verification_state field to move to verification_succeeded.
These numbers are made up, but note that confidence can never be 100%. For example, the code that reports these states could itself have a bug and report success incorrectly.
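A minimal sketch of what such a Check step could look like in Ruby. Everything named here is hypothetical (wait_for_replication, ReplicationTimeoutError, the keyword arguments); it is not an existing QA framework helper. The block passed in stands in for the GraphQL query against the registry records, and the sketch assumes the result is mapped to a Hash with :state and :verification_state keys using the values mentioned above. The real GraphQL field names and enum spellings may differ.

```ruby
# Hypothetical error raised when replication does not finish in time.
class ReplicationTimeoutError < StandardError; end

# Hypothetical helper: polls the given block until the registry record
# reports that replication (and optionally verification) is complete.
#
# The block is assumed to return a Hash such as
#   { state: 'synced', verification_state: 'verification_succeeded' }
# In a real test it would wrap the GraphQL query for the registry record.
def wait_for_replication(max_duration: 120, interval: 2,
                         require_verification: false,
                         sleeper: ->(seconds) { sleep(seconds) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + max_duration

  loop do
    record = yield

    # Compare case-insensitively, since the API may return enum-style values.
    synced = record[:state].to_s.downcase == 'synced'
    verified = !require_verification ||
               record[:verification_state].to_s.downcase == 'verification_succeeded'

    return record if synced && verified

    # Fail loudly instead of passing when replication is stuck or broken.
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise ReplicationTimeoutError,
            "registry not replicated within #{max_duration}s, last seen: #{record.inspect}"
    end

    sleeper.call(interval)
  end
end
```

In a spec, the block would run the query against the secondary site for the resource created in Setup, for example `wait_for_replication(require_verification: true) { fetch_registry_record }`, where fetch_registry_record is a placeholder for the actual query code. Because the helper raises on timeout, the test fails when replication is not complete and passes only once the registry reports synced.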
Details
This issue is meant to track fixing one case of this. The same check code can then be reused in one or more easy follow-up MRs for all other data types. Weight 4 because QA tests are a somewhat unique domain of their own, but I expect the change to be small.
We already make a few GraphQL queries elsewhere in QA tests.