Elasticsearch index should be periodically repaired if projects are missing

Release notes

Index integrity detects and fixes missing repository data. This feature is automatically used when code searches scoped to a group or project return no results.

Problem to solve

For various reasons, a project's initial indexing may fail. Customers have reported projects missing from the index, which they then had to re-index manually with a rake task. We don't yet know exactly why or how this happens. If a project is not in the index, none of its child resources are searchable, which can be quite confusing.

We discussed a scheduled cron job, but for large instances with many indexed namespaces (like GitLab.com), a full scan would take too long to run for the small chance of finding missing data. We need a more targeted approach.

Proposal

Every project- or group-scoped query (/search or /count from the web UI) that hits Elasticsearch can record an index discrepancy in Redis, using a key/value structure (or another appropriate data structure that supports de-duplication). An index discrepancy is defined as the blobs scope returning 0 results. The information stored in Redis can include: namespace_id, project_id (project-scoped searches only), and a searched_at timestamp.
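As a rough illustration of the de-duplicated store described above, here is a minimal Ruby sketch. It uses an in-memory Hash as a stand-in for a Redis hash so that it runs without a Redis server. The class name DiscrepancyStore, the key format, and the helper blobs_discrepancy? are all hypothetical and are not taken from the GitLab codebase.

```ruby
# Hypothetical sketch: an in-memory stand-in for a Redis hash.
# In a real implementation each entry would be written to Redis instead.
class DiscrepancyStore
  def initialize
    @entries = {}
  end

  # Record a search whose blobs scope returned 0 results.
  # project_id is nil for group-scoped searches.
  # Recording the same scope again overwrites the entry (de-duplication).
  def record(namespace_id:, project_id: nil, searched_at: Time.now.utc)
    key = if project_id
            "namespace:#{namespace_id}:project:#{project_id}"
          else
            "namespace:#{namespace_id}"
          end
    @entries[key] = {
      namespace_id: namespace_id,
      project_id: project_id,
      searched_at: searched_at
    }
    key
  end

  def size
    @entries.size
  end

  # Return all pending entries and clear the store,
  # as the cron worker would do on each run.
  def drain
    drained = @entries.values
    @entries = {}
    drained
  end
end

# The discrepancy definition from the proposal: the blobs scope returned 0 results.
def blobs_discrepancy?(blob_result_count)
  blob_result_count.zero?
end
```

Keying entries by scope means repeated empty searches against the same project collapse into one entry, so the worker checks each scope at most once per run.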

A new cron worker (proposed name: Search::IndexRepairWorker) will be created to process the discrepancy queue stored in Redis, as described above. It could run every hour.

  • First iteration: only log when issues are found, add a graph to visualize
  • Second iteration: perform repair work for any missing projects
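The two iterations could be sketched roughly as follows. This is not the proposed Search::IndexRepairWorker itself: the class IndexRepairSketch, its repair_enabled switch, and the missing_check callable are hypothetical, and a real worker would run under Sidekiq and schedule re-indexing rather than return a list. The logger's progname is set to the proposed worker name so that log lines are easy to filter.

```ruby
require 'logger'

# Hypothetical sketch of the worker loop; not GitLab code.
class IndexRepairSketch
  def initialize(logger:, repair_enabled: false)
    @logger = logger
    @repair_enabled = repair_enabled
  end

  # entries:       array of hashes with :namespace_id (and optionally :project_id)
  # missing_check: callable that takes an entry and returns the project ids
  #                that appear to be missing from the index
  # Returns the project ids that would be scheduled for repair.
  def perform(entries, missing_check)
    to_repair = []
    entries.each do |entry|
      missing_check.call(entry).each do |project_id|
        # First iteration: only log a WARNING.
        @logger.warn(
          "index discrepancy namespace_id=#{entry[:namespace_id]} project_id=#{project_id}"
        )
        # Second iteration: collect the project for repair (placeholder for
        # scheduling a real re-index).
        to_repair << project_id if @repair_enabled
      end
    end
    to_repair
  end
end

# Tag every log line with the proposed worker name.
def build_repair_logger(io)
  Logger.new(io, progname: 'Search::IndexRepairWorker')
end
```

With repair_enabled left at false the sketch only logs, which matches the first iteration; flipping it on models the second iteration without changing the detection logic.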

Technical details

  • The worker should look up the namespace and validate it still exists
  • The worker will perform one ES query, use namespace ancestry (if available from #351381 (closed)) to do a prefix search, and perform an aggregation by project_id to get counts for all blob type documents
  • The worker will compare the aggregations to the project_statistics table for each project, using the repository_size column. A project with repository_size > 0 and a blob_count of 0 is treated as a discrepancy
  • Only log a WARNING if a discrepancy is found. Note: be sure to set the class name for the logger so that it's easier to find in Kibana
  • For the first iteration, the worker will only log a WARNING. We can iterate on repairing the index once the logs are reviewed and the worker scheduling is tuned
  • The worker should be limited so that it runs only once per namespace (the same way this is done for the indexer)
  • Evaluate the deduplication strategies for Sidekiq and determine which works best
  • Any repository which contains only binary data will be flagged (indexing skips binary data) and we can look into how often that happens
  • The worker could be introduced behind a feature flag with a namespace actor
  • A metric or graph should be added to see how often we find missing data
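
The query and comparison steps above might look roughly like the following Ruby sketch. The prefix query and terms aggregation are standard Elasticsearch query DSL, but the field names traversal_ids, type, and project_id, the aggregation name blobs_by_project, and all three method names are assumptions made for illustration. The repository sizes are passed in as a plain Hash standing in for rows from project_statistics.

```ruby
# Hypothetical sketch; field and method names are assumptions.

# Build one query body: prefix match on the namespace ancestry, restricted to
# blob documents, with a per-project count aggregation and no hits returned.
def blob_count_query(namespace_ancestry, max_projects: 1000)
  {
    size: 0,
    query: {
      bool: {
        filter: [
          { prefix: { traversal_ids: namespace_ancestry } },
          { term: { type: 'blob' } }
        ]
      }
    },
    aggs: {
      blobs_by_project: {
        terms: { field: 'project_id', size: max_projects }
      }
    }
  }
end

# Turn the aggregation buckets of a parsed response into { project_id => blob_count }.
def blob_counts_from(response)
  buckets = response.dig('aggregations', 'blobs_by_project', 'buckets') || []
  buckets.to_h { |bucket| [bucket['key'], bucket['doc_count']] }
end

# repository_sizes: { project_id => repository_size } (stand-in for project_statistics).
# A project with no bucket at all has a blob count of 0.
def missing_projects(repository_sizes, blob_counts)
  repository_sizes
    .select { |project_id, size| size > 0 && blob_counts.fetch(project_id, 0).zero? }
    .keys
end
```

Projects with no blob documents do not appear in the aggregation buckets at all, so the comparison defaults absent projects to a count of 0 instead of iterating over the buckets.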
Edited by Changzheng Liu