Elasticsearch index should be periodically repaired if projects are missing
Release notes
Index integrity detects and fixes missing repository data. This feature is automatically used when code searches scoped to a group or project return no results.
Problem to solve
For various reasons, a project's initial indexing may fail. Customers have reported projects missing from the index, which they then had to re-index manually with a rake task to make sure the project ends up in the index. We don't yet know exactly why or how this happens. If a project is not in the index, none of its child resources are searchable, which can be quite confusing.
We discussed a scheduled cron job, but for large instances with many indexed namespaces (like GitLab.com) a full scan would take too long to run and would find missing data only by chance. We need a more targeted approach.
Proposal
Every project- or group-scoped query (/search or /count from the web UI) that hits Elasticsearch can store the index discrepancy in Redis as a key/value store (or another appropriate data structure that supports de-duplication). An index discrepancy is defined as the blobs scope returning 0 results. The information stored in Redis can include: namespace_id, project_id (only for project-scoped searches), and a searched_at timestamp.
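The de-duplication requirement can be sketched as follows. This is a minimal sketch under stated assumptions, not the proposed implementation: `DiscrepancyStore` and its key format are hypothetical names, and a plain Ruby Hash stands in for a Redis hash (where writing an existing field overwrites it rather than appending).

```ruby
require 'time'

# Hypothetical in-memory stand-in for a Redis hash. Writing the same key
# twice overwrites the earlier entry, which is what de-duplicates searches.
class DiscrepancyStore
  def initialize
    @entries = {}
  end

  # Record a scoped search whose blobs scope returned 0 results.
  # project_id stays nil for group-scoped searches.
  def record(namespace_id:, project_id: nil, searched_at: Time.now.utc)
    # ASSUMPTION: one entry per namespace/project pair is enough.
    key = [namespace_id, project_id || 'group'].join(':')
    @entries[key] = {
      namespace_id: namespace_id,
      project_id: project_id,
      searched_at: searched_at.iso8601
    }
  end

  def size
    @entries.size
  end

  # Return all pending entries and clear the store, as one worker run would.
  def drain
    entries = @entries.values
    @entries = {}
    entries
  end
end
```

Repeated empty searches against the same project collapse into a single entry that keeps the most recent searched_at value, so the queue grows with the number of affected projects rather than the number of searches.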
A new cron worker (proposed name: Search::IndexRepairWorker) will be created to process the discrepancy queue stored in Redis, as described above. It could run every hour.
- First iteration: only log when issues are found, add a graph to visualize
- Second iteration: perform repair work for any missing projects
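The two iterations could share one code path, with repair switched off at first. The sketch below is plain Ruby rather than a real Sidekiq cron worker; `IndexRepairSketch`, the `repair_enabled` toggle, and the shape of the queue entries are all hypothetical and only illustrate the log-first rollout.

```ruby
require 'logger'

# Hypothetical plain-Ruby sketch of the worker's run loop.
class IndexRepairSketch
  attr_reader :repairs

  # queue: Array of { namespace_id:, project_id: } hashes (assumed shape)
  def initialize(queue:, logger:, repair_enabled: false)
    @queue = queue
    @logger = logger
    @repair_enabled = repair_enabled
    @repairs = []
  end

  # Process every queued discrepancy exactly once.
  def perform
    until @queue.empty?
      entry = @queue.shift
      # First iteration: only log a WARNING.
      @logger.warn(
        "index discrepancy namespace_id=#{entry[:namespace_id]} " \
        "project_id=#{entry[:project_id].inspect}"
      )
      # Second iteration: additionally collect the entry for repair.
      @repairs << entry if @repair_enabled
    end
  end
end
```

Passing `progname: 'Search::IndexRepairWorker'` to `Logger.new` tags every line with the worker name, which is one way to make the entries easy to filter later.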
Technical details
- The worker should look up the namespace and validate it still exists
- The worker will perform one ES query, using namespace ancestry (if available from #351381 (closed)) to do a prefix search, and aggregate by project_id to get counts for all blob type documents
- The worker will compare the aggregations to the project_statistics table for each project, using the repository_size column. If a project has a repository_size > 0 and a blob_count of 0, that is a discrepancy
- Only log a WARNING if a discrepancy is found. Note: be sure to set the class name for the logger so that it's easier to find in Kibana
- For the first iteration, the worker will only log a WARNING. We can iterate on repairing the index once the logs are reviewed and the worker scheduling is tuned
- Need to limit it so that the worker will only run once for a namespace (the same way it is done for the indexer)
- Evaluate the deduplication strategies for Sidekiq and determine which works best
- Any repository which contains only binary data will be flagged (indexing skips binary data) and we can look into how often that happens
- The worker could be introduced behind a feature flag with a namespace actor
- A metric or graph should be added to see how often we find missing data
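The single query and the comparison step described above can be sketched as plain Ruby functions. This is a sketch under assumptions, not the final implementation: the field names `type`, `traversal_ids`, and `project_id`, the aggregation name `by_project`, and all method names are illustrative and should be checked against the real index mapping.

```ruby
# Build one Elasticsearch request body: keep only blob documents under the
# namespace (ancestry prefix match), then count them per project.
# ASSUMPTION: `type`, `traversal_ids`, and `project_id` are illustrative
# field names, not confirmed against the actual mapping.
def blob_counts_query(ancestry_prefix, max_projects: 1000)
  {
    size: 0, # only the aggregation is needed, not the hits
    query: {
      bool: {
        filter: [
          { term: { type: 'blob' } },
          { prefix: { traversal_ids: ancestry_prefix } }
        ]
      }
    },
    aggs: {
      by_project: { terms: { field: 'project_id', size: max_projects } }
    }
  }
end

# Turn the terms aggregation buckets into { project_id => blob_count }.
def blob_counts_from(response)
  buckets = response.dig('aggregations', 'by_project', 'buckets') || []
  buckets.to_h { |bucket| [bucket['key'], bucket['doc_count']] }
end

# repository_sizes: { project_id => repository_size }, as read from the
# project_statistics table. Returns the ids that look missing from the index.
def missing_projects(repository_sizes, blob_counts)
  repository_sizes
    .select { |project_id, size| size > 0 && blob_counts.fetch(project_id, 0).zero? }
    .keys
end
```

A project that has no bucket in the aggregation is treated as having a blob count of 0. As noted in the list above, a repository that contains only binary data would also be returned by `missing_projects`, because indexing skips binary data.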