Pin main DB replica for merge request pipeline jobs

What does this MR do and why?

When a runner picks up a job, the job is serialized against merge request data read from a main database replica. For a freshly created merge request pipeline, a lagging replica can serialize the job against stale or missing merge request state — effectively going backwards in time relative to the pipeline creation logic.

This MR makes the runner's view of the merge request at least as up-to-date as the pipeline creation logic:

  • Producer (Gitlab::Ci::Pipeline::Chain::Create): records the main database WAL location for the pipeline's merge request at pipeline creation via MergeRequest.sticking.stick(:merge_request, ...).
  • Consumer (Ci::RegisterJobService): before serializing a job, ensures the main replica it reads from is caught up to that WAL location via MergeRequest.sticking.find_caught_up_replica. If the replica is lagging, it returns a conflict so the runner retries.
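
The producer/consumer handshake above can be sketched as a small in-memory simulation. This is not GitLab's implementation: `FakeSticking`, `FakeReplica`, and `register_job` are hypothetical stand-ins, the replica is passed in explicitly, and WAL locations are modeled as plain integers rather than PostgreSQL LSNs. It only illustrates the ordering guarantee the MR is after.

```ruby
# Hypothetical stand-ins for illustration; not GitLab's actual classes.
FakeReplica = Struct.new(:replayed_location) do
  # A replica is caught up once it has replayed at least the pinned location.
  def caught_up?(location)
    replayed_location >= location
  end
end

class FakeSticking
  def initialize
    @locations = {}
  end

  # Producer side: remember the primary's WAL location for a record.
  def stick(namespace, id, primary_location)
    @locations[[namespace, id]] = primary_location
  end

  # Consumer side: true if nothing is pinned, or if the replica has
  # replayed the pinned location.
  def find_caught_up_replica(namespace, id, replica)
    location = @locations[[namespace, id]]
    return true if location.nil?

    replica.caught_up?(location)
  end
end

# Consumer: answer :conflict so the runner retries while the replica lags.
def register_job(sticking, replica, merge_request_id)
  return :conflict unless sticking.find_caught_up_replica(:merge_request, merge_request_id, replica)

  :job_serialized
end

sticking = FakeSticking.new
replica = FakeReplica.new(100)

sticking.stick(:merge_request, 42, 120) # pipeline created at WAL location 120
first_attempt = register_job(sticking, replica, 42)

replica.replayed_location = 125 # replica replays past the pinned location
second_attempt = register_job(sticking, replica, 42)
```

In this model the first attempt is rejected with `:conflict` and the retry succeeds once the replica has replayed past the pinned location.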

This is diagnostic in part: it helps eliminate replication lag as a factor while we confirm the true cause.

Both sides are gated behind the ci_pipeline_mr_main_db_wal_pinning feature flag (gitlab_com_derisk, default disabled).

Notes

  • The caught-up check is best-effort: the stuck location expires after Sticking::EXPIRATION (30s), after which the check is bypassed (the replica has almost certainly caught up by then).
  • A new :queue_merge_request_replication_lag queue metric is added for observability.
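
The best-effort nature of the check can be illustrated with a pinned location that carries a time-to-live. This is a sketch under assumptions: `PinnedLocation` and `check_replica` are hypothetical names, time is passed in as plain seconds to keep the example deterministic, and the constant simply mirrors the 30s quoted above for `Sticking::EXPIRATION`.

```ruby
# Hypothetical model of a pinned WAL location with a time-to-live.
EXPIRATION_SECONDS = 30 # mirrors the 30s quoted for Sticking::EXPIRATION

PinnedLocation = Struct.new(:location, :pinned_at) do
  # The pin stops being enforced once its time-to-live has elapsed.
  def expired?(now)
    now - pinned_at >= EXPIRATION_SECONDS
  end
end

# Returns :bypassed when there is no pin or it has expired;
# otherwise compares the replica's location against the pin.
def check_replica(pin, replica_location, now)
  return :bypassed if pin.nil? || pin.expired?(now)

  replica_location >= pin.location ? :caught_up : :lagging
end

pin = PinnedLocation.new(120, 0) # pinned at t = 0s

early = check_replica(pin, 100, 5)  # replica behind, pin still live
ready = check_replica(pin, 120, 10) # replica has reached the pin
late  = check_replica(pin, 100, 31) # pin expired, check is skipped
```

The third call shows the trade-off: after expiry a still-lagging replica is no longer rejected, which is acceptable only because lag beyond the expiration window is expected to be rare.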

References

Screenshots or screen recordings

Before (bad refs/heads/ prefix): Screenshot_2026-06-18_at_12.57.21_PM
After (valid ref): Screenshot_2026-06-18_at_12.57.44_PM

How to set up and validate locally

1. Add a replica

gdk config set load_balancing.enabled true
gdk config set postgresql.replica.enabled true
gdk reconfigure

2. Make the replica lag deterministically

Append to postgresql-replica/data/postgresql.auto.conf:

recovery_min_apply_delay = '30s'

This keeps the replica online (still receiving WAL) but applies it 30s late.
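
To confirm that the delay is in effect, one option is to compare the replica's received and replayed WAL positions. The sketch below assumes the two LSN strings were obtained separately (for example with `SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();` on the replica); the helper names and the sample values are illustrative only.

```ruby
# Convert a PostgreSQL LSN string such as "0/3000148" into a 64-bit integer.
# An LSN is printed as two hexadecimal numbers separated by a slash.
def lsn_to_i(lsn)
  high, low = lsn.split("/").map { |part| part.to_i(16) }
  (high << 32) | low
end

# Bytes of WAL that have been received but not yet replayed.
def replay_lag_bytes(received_lsn, replayed_lsn)
  lsn_to_i(received_lsn) - lsn_to_i(replayed_lsn)
end

# Illustrative values only; real values come from the replica.
received = "0/3000148"
replayed = "0/3000060"

lag = replay_lag_bytes(received, replayed) # 0x148 - 0x60 = 232 bytes
```

A non-zero lag shortly after a write on the primary, dropping to zero roughly 30s later, would suggest the apply delay is working as intended.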

3. Force reads to the lagging replica only

By default, GDK includes the primary among the replicas, so reads can land on an up-to-date host. This makes it harder to reproduce the problem and to verify the fix.

In gitlab/config/database.yml, edit development.main.load_balancing.hosts to list only the replica (remove the primary line):

    load_balancing:
      hosts:
        - <gdk_root>/postgresql-replica

Then run:

gdk restart

4. Reproduce (flag OFF)

bundle exec rails runner 'Feature.disable(:ci_pipeline_mr_main_db_wal_pinning)'
gdk restart # just to ensure consistent state

Add the following .gitlab-ci.yml:

workflow:
  rules:
    - if: $CI_MERGE_REQUEST_IID
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
    - if: $CI_COMMIT_TAG

job:
  script:
    - exit 0

Add/edit a file and create an MR. This should start a pipeline.

Expected result: The CI job starts almost immediately, but fails during the git fetch.

5. Verify the fix (flag ON)

Enable the flag:

bundle exec rails runner 'Feature.enable(:ci_pipeline_mr_main_db_wal_pinning)'
gdk restart # to ensure the flag has propagated

Expected result: The CI job stays in the pending state for about 30 seconds and does not fail once it starts running.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Hordur Freyr Yngvason
