Add cron job to heal traversal_ids
What does this MR do and why?
We already update the traversal_ids of namespaces when they are moved between groups.
Ideally this job would not be necessary, but we still see cases where the traversal_ids were not updated properly on the issues table.
To heal the system in cases that we have not caught yet, this healing service constantly checks the oldest work item of each namespace and verifies that it has the correct traversal_ids. If it does not, we spawn a job to fix the traversal_ids of that namespace's work items.
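The check described above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the service's actual code: the `Namespace` and `WorkItem` structs, the `diverged_namespace_ids` helper, and the in-memory sample data are hypothetical stand-ins for the real models and the batched database query, and "oldest" is approximated here by the lowest work item ID.

```ruby
# Hypothetical in-memory stand-ins for the real models (illustration only).
Namespace = Struct.new(:id, :traversal_ids)
WorkItem  = Struct.new(:id, :namespace_id, :namespace_traversal_ids)

# Returns the IDs of namespaces whose oldest work item (lowest ID here)
# carries traversal_ids that differ from the namespace's own traversal_ids.
def diverged_namespace_ids(namespaces, work_items)
  items_by_namespace = work_items.group_by(&:namespace_id)

  namespaces.filter_map do |namespace|
    oldest = items_by_namespace[namespace.id]&.min_by(&:id)
    next if oldest.nil? # no work items, so there is nothing to heal

    namespace.id if oldest.namespace_traversal_ids != namespace.traversal_ids
  end
end

namespaces = [
  Namespace.new(23, [22, 23]),
  Namespace.new(24, [22, 24]),
  Namespace.new(25, [25])
]

work_items = [
  WorkItem.new(1, 23, [-1]),     # stale: should be [22, 23]
  WorkItem.new(2, 23, [-1]),
  WorkItem.new(3, 24, [22, 24])  # already correct
]

# In the real service each returned ID would be handed to a heal job
# rather than printed.
puts diverged_namespace_ids(namespaces, work_items).inspect
```

Checking only one work item per namespace keeps the scan cheap. The trade-off in this sketch is that a namespace whose oldest item is correct but whose newer items are stale would not be detected.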
References
Queries
Example query for 1,000 namespaces: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/51717/commands/152624 or https://console.postgres.ai/gitlab/gitlab-production-main/sessions/51717/commands/152652
Calculations
- ~1 s for 1,000 namespaces
- Max 200 seconds per job run => 200 * 1,000 = 200,000 namespaces per run
- The job runs every 5 min, so 12 runs per hour and 288 runs per day
- 288 runs * 200,000 namespaces = 57.6M namespaces per day that we can get through
For gitlab.com, this means we can heal all namespaces within one to two days.
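The arithmetic behind this estimate can be reproduced with a few lines of Ruby. The constants come from the figures above; the method names and `days_to_cover` are hypothetical helpers added for illustration. The daily product works out to 57,600,000, i.e. roughly 57M.

```ruby
# Figures taken from the estimate above.
SECONDS_PER_BATCH    = 1      # ~1 s per batch of namespaces
NAMESPACES_PER_BATCH = 1_000
MAX_RUNTIME_SECONDS  = 200    # upper bound for a single job run
RUN_INTERVAL_MINUTES = 5      # cron schedule

# Namespaces one job run can check within its runtime budget.
def namespaces_per_run
  (MAX_RUNTIME_SECONDS / SECONDS_PER_BATCH) * NAMESPACES_PER_BATCH
end

# Number of cron runs in 24 hours.
def runs_per_day
  (24 * 60) / RUN_INTERVAL_MINUTES
end

# Namespaces checked per day if every run uses its full budget.
def namespaces_per_day
  namespaces_per_run * runs_per_day
end

# Hypothetical helper: whole days needed to scan a given number of namespaces.
def days_to_cover(total_namespaces)
  (total_namespaces.to_f / namespaces_per_day).ceil
end

puts namespaces_per_run  # 200000
puts runs_per_day        # 288
puts namespaces_per_day  # 57600000
```

`days_to_cover` takes the total namespace count as an input because that number is not stated here. The result is an upper bound on throughput, since it assumes every run uses its full 200 seconds at the measured rate.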
I played around with the idea of filtering out User namespaces, but this is not efficient:
- With user filter: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/51717/commands/152650
- Without user filter: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/51717/commands/152651
How to set up and validate locally
Open a Rails console:
bundle exec rails c
Step 1 — Introduce the divergence
project = Project.first
namespace = project.project_namespace
WorkItem.where(namespace_id: namespace.id).update_all(namespace_traversal_ids: [-1])
# Verify the corruption
WorkItem.where(namespace_id: namespace.id).first.namespace_traversal_ids # => [-1]
namespace.traversal_ids # => [22, 23]
Step 2 — Enable the feature flag
Feature.enable(:work_items_traversal_ids_healing_service)
Step 3 — Run the cron worker
The cron worker scans namespaces for divergence and enqueues a heal job for each one it finds:
WorkItems::TraversalIdsHealingCronWorker.new.perform
Step 4 — Run the heal worker
In a real environment this runs asynchronously via Sidekiq. Locally, trigger it manually:
WorkItems::UpdateNamespaceTraversalIdsWorker.new.perform(namespace.id)
Step 5 — Verify the fix
WorkItem.where(namespace_id: namespace.id).first.namespace_traversal_ids # => [22, 23]
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.