
Elasticsearch index should be periodically repaired if projects are missing

Release notes

Index integrity detects and fixes missing repository data. This feature is automatically used when code searches scoped to a group or project return no results.

Problem to solve

A project's initial indexing may fail for various reasons. Customers have reported projects missing from the index and then having to re-index those projects manually with a Rake task. We don't yet know exactly why or how this happens. If a project is not in the index, none of its child resources are searchable, which can be quite confusing.

We discussed a scheduled cron job, but for large instances with many indexed namespaces (like GitLab.com) a full scan would take too long to run just to find a small amount of missing data. We need a more targeted approach.

Proposal

Every project- or group-scoped query (/search or /count from the web UI) that hits Elasticsearch can record an index discrepancy in Redis, using a key/value structure (or another appropriate data structure that supports de-duplication). An index discrepancy is defined as the blobs scope returning 0 results. The information stored in Redis can include: namespace_id, project_id (project-scoped searches only), and a searched_at timestamp.
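A minimal sketch of the de-duplicated discrepancy store described above. It assumes an in-memory dict standing in for a Redis hash; the names `discrepancy_key` and `record_discrepancy` and the key format are hypothetical, not part of the proposal.

```python
from datetime import datetime, timezone
from typing import Optional


def discrepancy_key(namespace_id: int, project_id: Optional[int]) -> str:
    # Group-scoped searches carry no project_id, so the key falls back to "group".
    suffix = project_id if project_id is not None else "group"
    return f"{namespace_id}:{suffix}"


def record_discrepancy(store: dict, blob_result_count: int, namespace_id: int,
                       project_id: Optional[int] = None,
                       searched_at: Optional[datetime] = None) -> bool:
    """Record a discrepancy when the blobs scope returned 0 results.

    Returns True if an entry was written, False if the search had results.
    """
    if blob_result_count != 0:
        return False
    searched_at = searched_at or datetime.now(timezone.utc)
    # Writing to the same key overwrites the previous entry, which is what
    # de-duplicates repeated searches against the same namespace or project.
    store[discrepancy_key(namespace_id, project_id)] = {
        "namespace_id": namespace_id,
        "project_id": project_id,
        "searched_at": searched_at.isoformat(),
    }
    return True
```

Keying on the namespace and project means repeated empty searches keep only the most recent searched_at instead of growing the queue.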

A new cron worker (proposed name: Search::IndexRepairWorker) will be created to process the discrepancy queue described above. It could run every hour.

  • First iteration: only log when issues are found, add a graph to visualize
  • Second iteration: perform repair work for any missing projects
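The two iterations could be sketched roughly as follows. This is an illustration only: `process_queue`, `find_missing`, and `repair` are hypothetical names, the queue is a plain dict keyed as "namespace_id:project_id", and the logger is named after the proposed worker class so its entries are easy to filter.

```python
import logging

# Logger named after the proposed worker class (an assumption for illustration).
logger = logging.getLogger("Search::IndexRepairWorker")


def process_queue(store: dict, find_missing, repair=None) -> list:
    """Drain the discrepancy queue and check each namespace once.

    find_missing(namespace_id) is a caller-supplied function returning the
    project ids that are missing from the index for that namespace.
    repair(project_id) is optional: leave it as None for the log-only first
    iteration, pass a callable for the second iteration.
    """
    namespace_ids = set()
    for key in list(store):
        entry = store.pop(key)
        namespace_ids.add(entry["namespace_id"])

    reported = []
    for namespace_id in sorted(namespace_ids):
        for project_id in find_missing(namespace_id):
            logger.warning(
                "project missing from index: namespace_id=%s project_id=%s",
                namespace_id, project_id,
            )
            reported.append((namespace_id, project_id))
            if repair is not None:
                repair(project_id)
    return reported
```

Collecting namespace ids into a set before checking means several queued entries for the same namespace trigger only one check per run.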

Technical details

  • The worker should look up the namespace and validate it still exists
  • The worker will perform one ES query, use namespace ancestry (if available from #351381 (closed)) to do a prefix search, and perform an aggregation by project_id to get counts for all blob type documents
  • The worker will compare the aggregated blob counts against the repository_size column of the project_statistics table for each project. A project with repository_size > 0 and a blob_count of 0 is treated as a discrepancy
  • Only log a WARNING if a discrepancy is found. Note: be sure to set the class name for the logger so that it's easier to find in Kibana
  • For the first iteration, the worker will only log a WARNING. We can iterate on repairing the index once the logs are reviewed and the worker scheduling is tuned
  • The worker should be limited so that it runs only once per namespace (the same way this is done for the indexer)
  • Evaluate the deduplication strategies for Sidekiq and determine which works best
  • Any repository which contains only binary data will be flagged (indexing skips binary data) and we can look into how often that happens
  • The worker could be introduced behind a feature flag with a namespace actor
  • A metric or graph should be added to see how often we find missing data
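The query and comparison steps above can be sketched as follows. The field names `type`, `traversal_ids`, and `project_id`, the ancestry prefix format, and all function names are assumptions for illustration; the actual index mapping may differ. The query body uses standard Elasticsearch query DSL (a bool filter with term and prefix clauses, plus a terms aggregation).

```python
def build_blob_count_query(ancestry_prefix: str, max_projects: int = 1000) -> dict:
    """Build one request body that counts blob documents per project."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching documents
        "query": {
            "bool": {
                "filter": [
                    {"term": {"type": "blob"}},
                    {"prefix": {"traversal_ids": ancestry_prefix}},
                ]
            }
        },
        "aggs": {
            "blobs_by_project": {
                "terms": {"field": "project_id", "size": max_projects}
            }
        },
    }


def blob_counts_from_response(response: dict) -> dict:
    """Map project_id to blob document count from a terms aggregation response."""
    buckets = response["aggregations"]["blobs_by_project"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def find_discrepancies(repository_sizes: dict, blob_counts: dict) -> list:
    """Return project ids with repository_size > 0 but no indexed blobs.

    Projects without any blob documents do not appear in the aggregation
    buckets at all, so a missing key is treated as a count of 0.
    """
    return sorted(
        project_id
        for project_id, size in repository_sizes.items()
        if size > 0 and blob_counts.get(project_id, 0) == 0
    )
```

Here `repository_sizes` stands for the repository_size values read from project_statistics for the projects in the namespace. A repository containing only binary data would also be returned by `find_discrepancies`, matching the caveat in the list above.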
Edited by Changzheng Liu