Elasticsearch index should be periodically repaired if projects are missing
### Release notes

Index integrity detects and fixes missing repository data. This feature is automatically used when code searches scoped to a group or project return no results.

### Problem to solve

For various reasons, a project's initial indexing may fail. Customers have reported projects missing from the index, which they then had to re-index manually with a rake task to get them back into the index. We don't yet know exactly why or how this happens. If a project is not in the index, none of its child resources are searchable, which can be quite confusing.

We discussed a scheduled cron job, but for large instances with many indexed namespaces (like GitLab.com) it would take too long to run just to *hopefully* find some missing data. We need a more targeted approach.

### Proposal

Every `project` or `group` scoped query (`/search` or `/count` from the web UI) that hits **Elasticsearch** can store the index discrepancy in Redis as a key/value store (or other appropriate data structure that supports de-duplication). An index discrepancy is defined as the `blobs` scope returning 0 results. The information stored in Redis can include: `namespace_id`, `project_id` (project-scoped searches only), and a `searched_at` timestamp.

A new cron worker (proposed name: `Search::IndexRepairWorker`) will be created to process the discrepancy queue in Redis described above. It could run every hour.
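As a rough illustration of the de-duplicating store described in the proposal, here is a minimal sketch. It uses an in-memory `Hash` as a stand-in for a Redis hash so that it runs on its own; the class name `DiscrepancyStore`, its methods, and the `namespace_id:project_id` field format are all hypothetical and not part of any existing GitLab code.

```ruby
require 'time'

# Hypothetical in-memory stand-in for a Redis hash. In Redis, `track` would
# roughly correspond to HSET on a single key and `drain` to HGETALL + DEL.
class DiscrepancyStore
  def initialize
    @entries = {}
  end

  # Record a group- or project-scoped search whose `blobs` scope returned 0 results.
  # Repeated searches for the same namespace/project overwrite the same field,
  # which is what gives de-duplication.
  def track(namespace_id:, project_id: nil, searched_at: Time.now.utc)
    field = [namespace_id, project_id || 'all'].join(':')
    @entries[field] = {
      namespace_id: namespace_id,
      project_id: project_id, # nil for group-scoped searches
      searched_at: searched_at.iso8601
    }
  end

  # Hand all pending entries to the cron worker and clear the queue.
  def drain
    pending = @entries.values
    @entries = {}
    pending
  end

  def size
    @entries.size
  end
end

store = DiscrepancyStore.new
store.track(namespace_id: 1, project_id: 10)
store.track(namespace_id: 1, project_id: 10) # same project again: de-duplicated
store.track(namespace_id: 1)                 # group-scoped search
puts store.size
```

Keying on the namespace and project means a burst of empty searches against one project produces a single queue entry, with `searched_at` reflecting the most recent search.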
- First iteration: only log when issues are found, add a graph to visualize
- Second iteration: perform repair work for any missing projects

### Technical details

* The worker should look up the namespace and validate that it still exists
* The worker will perform one ES query, using namespace ancestry (if available from https://gitlab.com/gitlab-org/gitlab/-/issues/351381) to do a prefix search, and aggregate by `project_id` to get counts for all `blob` type documents
* The worker will compare the aggregations to the `repository_size` column of the `project_statistics` table for each project. A `repository_size` > 0 with a `blob_count` of 0 is treated as a discrepancy
* Only log a `WARNING` if a discrepancy is found. _Note: be sure to set the class name for the logger so that it's easier to find in Kibana_
* For the first iteration, the worker will only log a `WARNING`. We can iterate on repairing the index once the logs are reviewed and the worker scheduling is tuned
* The worker should be limited so that it runs only once per namespace (the same way the indexer is)
* Evaluate the [deduplication strategies for Sidekiq](https://docs.gitlab.com/ee/development/sidekiq/idempotent_jobs.html#deduplication) and determine which works best
* Any repository that contains only **binary data** will be flagged (indexing skips binary data), and we can look into how often that happens
* The worker could be introduced behind a feature flag with a namespace actor
* A metric or graph should be added to see how often we find missing data
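The query and comparison steps above could be sketched as follows. This is an illustration under stated assumptions, not the worker's actual implementation: the helper names are hypothetical, and the field names `type`, `project_id`, and especially `namespace_ancestry` are assumed for the example and may not match the real index mapping. The response is a plain hash shaped like an Elasticsearch terms aggregation result, so no client library is needed to run it.

```ruby
# Build the search body: filter to blob documents under a namespace prefix and
# count them per project. Field names are assumptions, not the real mapping.
def blob_count_query(ancestry_prefix, max_projects: 1000)
  {
    size: 0, # only the aggregation is needed, not the hits
    query: {
      bool: {
        filter: [
          { term: { type: 'blob' } },
          { prefix: { namespace_ancestry: ancestry_prefix } }
        ]
      }
    },
    aggs: {
      by_project: { terms: { field: 'project_id', size: max_projects } }
    }
  }
end

# Turn a terms-aggregation response into { project_id => blob_count }.
def blob_counts_from(response)
  buckets = response.dig('aggregations', 'by_project', 'buckets') || []
  buckets.to_h { |bucket| [bucket['key'], bucket['doc_count']] }
end

# repository_sizes: { project_id => repository_size } as read from project_statistics.
# Returns the ids of projects that have repository data but no indexed blobs.
def find_index_discrepancies(repository_sizes, blob_counts)
  repository_sizes.select do |project_id, repository_size|
    repository_size.to_i.positive? && blob_counts.fetch(project_id, 0).zero?
  end.keys
end

# Example with made-up data: project 2 has a repository but no blob documents.
response = {
  'aggregations' => {
    'by_project' => { 'buckets' => [{ 'key' => 1, 'doc_count' => 42 }] }
  }
}
sizes = { 1 => 1_000, 2 => 500, 3 => 0 }
missing = find_index_discrepancies(sizes, blob_counts_from(response))
missing.each { |id| warn "WARNING: project #{id} has repository data but no indexed blobs" }
```

Projects absent from the aggregation buckets are treated as having a blob count of 0, so a project that was never indexed is flagged in the same way as one whose blobs were lost. As noted in the list above, a repository containing only binary data would also be flagged by this check.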