Add cron job to heal traversal_ids

What does this MR do and why?

We already update the traversal_ids of namespaces when they are moved between groups.

Ideally this job would not be necessary, but we still see cases where the traversal_ids were not updated properly on the issues table.

To heal the system in cases we have not caught yet, this healing service constantly checks the oldest work item of each namespace and verifies that it has the correct traversal_ids. If it does not, we spawn a job to fix the traversal_ids of that namespace's work items.
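The check described above can be sketched in plain Ruby. This is only an illustration and not the code in this MR: the `HealingSketch` module, its structs, and `diverged_namespace_ids` are hypothetical names, the data lives in memory instead of the database, and "oldest" is assumed to mean the work item with the lowest id.

```ruby
# Hypothetical in-memory stand-ins; the real service queries the database.
module HealingSketch
  Namespace = Struct.new(:id, :traversal_ids)
  WorkItem  = Struct.new(:id, :namespace_id, :namespace_traversal_ids)

  # Returns the ids of namespaces whose oldest work item (assumed here to be
  # the one with the lowest id) has traversal_ids that differ from the
  # namespace's own traversal_ids.
  def self.diverged_namespace_ids(namespaces, work_items)
    oldest_by_namespace = work_items
      .group_by(&:namespace_id)
      .transform_values { |items| items.min_by(&:id) }

    namespaces.filter_map do |namespace|
      oldest = oldest_by_namespace[namespace.id]
      next if oldest.nil? # no work items, nothing to check

      namespace.id if oldest.namespace_traversal_ids != namespace.traversal_ids
    end
  end
end

namespaces = [
  HealingSketch::Namespace.new(23, [22, 23]),
  HealingSketch::Namespace.new(30, [22, 30])
]

work_items = [
  HealingSketch::WorkItem.new(1, 23, [-1]),     # diverged, oldest in namespace 23
  HealingSketch::WorkItem.new(2, 23, [22, 23]),
  HealingSketch::WorkItem.new(3, 30, [22, 30])  # correct
]

# Each id in this list would get its own heal job.
to_heal = HealingSketch.diverged_namespace_ids(namespaces, work_items)
puts to_heal.inspect
```

Looking only at the oldest work item keeps the scan cheap: one row per namespace decides whether the more expensive fix job is worth enqueueing.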

References

Queries

Example query for 1000 namespaces: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/51717/commands/152624 or https://console.postgres.ai/gitlab/gitlab-production-main/sessions/51717/commands/152652

Calculations

  • ~1s for 1000 namespaces
  • max 200 seconds per job run => 200 * 1000 = 200,000 namespaces per run
  • Job runs every 5 min, so 12 runs per hour, 288 runs a day
  • => 288 runs * 200,000 namespaces = 57.6M namespaces per day that we can get through
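The arithmetic above can be reproduced with a few lines of Ruby. The variable names are illustrative only, and the inputs are the estimates from the list (about 1s per 1000 namespaces, at most 200 seconds per run, one run every 5 minutes).

```ruby
# Estimates taken from the list above; names are illustrative.
namespaces_per_second = 1_000 # ~1s for 1000 namespaces
max_seconds_per_run   = 200
run_interval_minutes  = 5

namespaces_per_run = namespaces_per_second * max_seconds_per_run
runs_per_day       = (24 * 60) / run_interval_minutes
namespaces_per_day = namespaces_per_run * runs_per_day

puts "namespaces per run: #{namespaces_per_run}"
puts "runs per day:       #{runs_per_day}"
puts "namespaces per day: #{namespaces_per_day}"
```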

For gitlab.com this means we can heal all namespaces within 1-2 days. I played around with the idea of filtering out User namespaces, but this is not efficient.

How to set up and validate locally

Open a Rails console:

bundle exec rails c

Step 1 — Introduce the divergence

project = Project.first
namespace = project.project_namespace

WorkItem.where(namespace_id: namespace.id).update_all(namespace_traversal_ids: [-1])


# Verify the corruption
WorkItem.where(namespace_id: namespace.id).first.namespace_traversal_ids # => [-1]
namespace.traversal_ids                                                   # => [22, 23]

Step 2 — Enable the feature flag

Feature.enable(:work_items_traversal_ids_healing_service)

Step 3 — Run the cron worker

The cron worker scans namespaces for divergence and enqueues a heal job for each one it finds:

WorkItems::TraversalIdsHealingCronWorker.new.perform

Step 4 — Run the heal worker

In a real environment this runs asynchronously via Sidekiq. Locally, trigger it manually:

WorkItems::UpdateNamespaceTraversalIdsWorker.new.perform(namespace.id)

Step 5 — Verify the fix

WorkItem.where(namespace_id: namespace.id).first.namespace_traversal_ids # => [22, 23]

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Nicolas Dular
