Index integrity detects and fixes missing repository data. This feature is automatically used when code searches scoped to a group or project return no results.
Problem to solve
For various reasons a project's initial indexing may fail. We've had customers report projects not being in the index and then needing to manually re-index that project from a rake task to ensure it ends up in the index. We don't yet know exactly why or how, but it can happen. If the project is not in the index then none of its child resources will be searchable, which can be quite confusing.
We discussed a scheduled cron job, but for large instances with many indexed namespaces (like GitLab.com) it would take too long to run just to hopefully find some missing data. We need a more targeted approach.
Proposal
Every project or group scoped query (/search or /count from the web UI) that hits Elasticsearch can record the index discrepancy in a Redis key/value store (or another appropriate data structure that supports de-duplication). An index discrepancy is defined as the blobs scope returning 0 results. The information stored in Redis can include: namespace_id, project_id (only for project scoped searches), and a searched_at timestamp.
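A minimal sketch of what the recording side could look like, assuming a single Redis hash for de-duplication and that this runs inside the Rails app; the key name, helper method, and field layout below are illustrative assumptions, not a final design:

```ruby
# Hypothetical hook called after a project or group scoped search runs.
# A single Redis hash gives built-in de-duplication: repeated misses for the
# same project simply overwrite the same field.
def record_index_discrepancy(namespace_id:, project_id: nil, blob_count:)
  return unless blob_count.zero?

  Gitlab::Redis::SharedState.with do |redis|
    field = project_id ? "project:#{project_id}" : "namespace:#{namespace_id}"
    value = {
      namespace_id: namespace_id,
      project_id: project_id,
      searched_at: Time.current.iso8601
    }.to_json

    redis.hset('search:index_discrepancies', field, value)
  end
end
```

The cron worker could then read the hash with HGETALL (or HSCAN for large sets), process each entry, and HDEL it once handled.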
A new cron worker (proposed name: Search::IndexRepairWorker) will be created to process the Redis discrepancy queue described above. It could run every hour.
First iteration: only log when issues are found, add a graph to visualize
Second iteration: perform repair work for any missing projects
Technical details
The worker should look up the namespace and validate it still exists
The worker will perform one ES query, use namespace ancestry (if available from #351381 (closed)) to do a prefix search, and perform an aggregation by project_id to get counts for all blob type documents
The worker will compare the aggregations against the project_statistics table for each project, using the repository_size column. If a project has repository_size > 0 but a blob count of 0, it is treated as a discrepancy (see the sketch after this list)
Only log a WARNING if a discrepancy is found. Note: be sure to set the class name for the logger so that it's easier to find in Kibana
For the first iteration, the worker will only log a WARNING. We can iterate on repairing the index once the logs are reviewed and the worker scheduling is tuned
Need to limit it so that the worker will only run once per namespace (the same way the indexer is de-duplicated)
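A rough sketch of the check described above, assuming direct access to an Elasticsearch client from inside the Rails app. The index name, document field names (type, project_id, traversal_ids), and the class/method layout are illustrative assumptions rather than the final worker implementation:

```ruby
require 'elasticsearch'

# Illustrative only -- not the actual Search::IndexRepairWorker. Assumes it runs
# inside the GitLab Rails app (Namespace / ProjectStatistics models available).
class IndexRepairCheck
  INDEX_NAME = 'gitlab-production' # assumed index name

  def initialize(client: Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL']))
    @client = client
  end

  def check(namespace_id)
    namespace = Namespace.find(namespace_id)

    # One ES query: count blob documents per project under the namespace,
    # using a prefix search on the namespace ancestry (#351381).
    response = @client.search(
      index: INDEX_NAME,
      body: {
        size: 0,
        query: {
          bool: {
            filter: [
              { term: { type: 'blob' } },
              { prefix: { traversal_ids: "#{namespace.id}-" } }
            ]
          }
        },
        aggs: { blobs_per_project: { terms: { field: 'project_id', size: 10_000 } } }
      }
    )

    blob_counts = response.dig('aggregations', 'blobs_per_project', 'buckets')
                          .to_h { |bucket| [bucket['key'], bucket['doc_count']] }

    # A repository with data but zero blob documents in the index is a discrepancy.
    namespace.all_projects.find_each do |project|
      next unless project.statistics&.repository_size.to_i > 0
      next unless blob_counts.fetch(project.id, 0).zero?

      Rails.logger.warn({ class: 'Search::IndexRepairWorker',
                          message: 'blob documents missing from index',
                          namespace_id: namespace.id,
                          project_id: project.id }.to_json)
    end
  end
end
```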
Here is a good example of a bug that could have caused the index to be wrong !29751 (comment 325875803), and this kind of repairing could have helped. Moving a project to an indexed namespace wouldn't have triggered an indexing of that project due to the invalid cache. Repairing can help with these kinds of bugs, which may be inevitable given the complexity of our "should this project be in Elasticsearch?" logic.
@dgruzd and I had a pairing session and came up with a proposal for a first iteration on a repair strategy. Check it out and let me know what you think?
I think this makes sense. I like the idea of logging first to see what the state of things is, and then fixing it.
I do have a few questions for future iterations:
What happens for API searches? Are we assuming that if we systematically fix for web searches, APIs will be fine eventually; or maybe that we can roll this out to APIs as later step?
Do we know how often this kind of thing happens and if monitoring gitlab.com would surface enough cases or do they mostly happen for self-managed?
What happens for API searches? Are we assuming that if we systematically fix for web searches, APIs will be fine eventually; or maybe that we can roll this out to APIs as later step?
We definitely should roll this out for API searches as a future iteration. I thought web searches would be a good first step as it's often where users encounter missing data. The support tickets have all had web UI searches as part of them.
Do we know how often this kind of thing happens and if monitoring gitlab.com would surface enough cases or do they mostly happen for self-managed?
I'm not sure how often this happens, but I suspect more than we realize. The support team has steps to trigger reindexing of projects for customers where data is missing and may not always post in our Slack channel. Issues about missing repository/code data have been reported on GitLab.com and self-managed instances, and debugging has proven to be difficult.
I really like the cache miss strategy. Maybe we could look into storing indexing discrepancies in a key/value store in Redis. Something like this: we could store the value as a JSON string and include any metadata that would help in remediation.
This key value strategy has inherent deduplication, and I wonder if it would be beneficial to have a cron worker to avoid bursts from automated searches that fail in quick succession.
An Ultimate customer (internal only) is affected and is interested in this fix. They would like a way to automatically tell when projects are not indexed, and this proposed feature would help.
Please see our support ticket for more details. But essentially we are seeing projects getting created blank initially by users (a common occurrence). The user then later adds commits to the project. But the indexer runs before the commits are added. In the indexing code, it seems like an IndexStatus is created with last_commit set to a dummy value (Gitlab::Git::BLANK_SHA). Once it has this dummy value, the indexer doesn't seem to ever retry the project. So the project silently goes without any search results even after commits have been added.
In our cluster, we detect and resolve this in the following way:
```ruby
candidates = IndexStatus.where('last_commit = ?', Gitlab::Git::BLANK_SHA)

# get projects to reindex
reindex_projects = candidates.map do |status|
  next if status.project.repository.commit('HEAD').nil? # no HEAD, don't index

  status.project
end.compact

# reindex given projects in batches -- this is based on code in the ``index_projects`` Rake task
reindex_projects.each_slice(5) do |slice|
  puts "indexing: #{slice.map { |x| x.id }}"
  ::Elastic::ProcessInitialBookkeepingService.backfill_projects!(*slice)
  sleep(60)
end
```
Now, we don't want to have to do this. It seems like something that should be in place automatically. It also seems like the Rake tasks related to index status are not accurate. So there's no built-in way to either detect it or resolve it except to force reindex all projects (which in a large cluster can be a very expensive operation).
Some things that I'm surprised by that may be worth debugging a little:
the indexer runs before the commits are added: The indexer is meant to be triggered after the push is received and should run after every push
Once it has this dummy value, the indexer doesn't seem to ever retry the project: This is strange. The default is always Gitlab::Git::BLANK_SHA and this should be set for all newly created projects. The Gitlab::Git::BLANK_SHA is a good thing because when the indexer runs next it compares the current SHA to Gitlab::Git::BLANK_SHA and basically says "all commits are new and need to be indexed". I'd be more concerned if index_status was being updated to the latest SHA but the updates were never reflected in Elasticsearch
I don't 100% know what's happening in your case but my thinking would be:
Watch the log files above
Push code and confirm that it triggered an ElasticCommitIndexerWorker job to run (if not, then we need to debug why that's not happening)
If the ElasticCommitIndexerWorker runs, look at the logs to see if the job succeeds. If it does succeed, then it should have updated Elasticsearch and moved the index_status along to the latest commit. If it didn't succeed, then this is the next thing to debug and there will probably be an error in one of the above logs (a quick console check is sketched below)
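For reference, a quick way to compare a project's index status with its repository HEAD from a Rails console; PROJECT_ID is a placeholder and this assumes the EE IndexStatus model discussed above:

```ruby
# Run in `gitlab-rails console`; PROJECT_ID is a placeholder for an affected project.
project = Project.find(PROJECT_ID)

puts project.index_status&.last_commit     # Gitlab::Git::BLANK_SHA means "never indexed"
puts project.repository.commit('HEAD')&.id # current HEAD of the repository
```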
Some things I could imagine accounting for this behaviour:
Problems with the Sidekiq configuration, meaning that we aren't handling pushes to repos correctly or the indexer workers aren't running correctly
Somehow you aren't pushing code to the repository's default branch and so we don't index it. This would be surprising, though, because you said you are able to fix it by running ::Elastic::ProcessInitialBookkeepingService, so that would indicate that it is the default branch
I don't know the end-to-end behavior of indexing, so some of what I said is speculation (for example the behavior of how new commits get indexed). But I have observed the following:
I am not seeing any error logs in Elasticsearch or Sidekiq logs (that seem relevant to indexing).
Not all projects seem to be affected. In one of my recent additions to the support ticket, I list a set of 11 or so project IDs that have the issue. The IDs are not in sequence, so it seems like some projects get indexed just fine. Each of these projects was created within the last few days and all of them have commits, but they still have the 'empty' IndexStatus.
In tests creating new projects and pushing commits, I haven't been able to replicate this issue, unfortunately. I'm only detecting it as I described in my last reply
One thing I haven't tried yet is comparing the commit ID of projects with a 'non-blank' IndexStatus (i.e. "working" projects) to the HEAD of that project, to see if any are behind in their indexing. It's possible the issue is as you described: something is intermittently preventing new commits from getting indexed. We haven't had any user reports of missing search results for projects which are otherwise returning search results (for example, if you add a commit, the search results seem to update relatively fast for projects that are working, whereas projects which aren't working never seem to update; before I fixed them, there were some projects that had existed for months and still had the 'blank' IndexStatus). But I can systematically check this next week
As another side note, we recently (in last few days) upgraded from 15.3.x to 15.4.x (latest). I haven't dived into this since we upgraded, so (optimistically) maybe something improves in the later version too
Upgrading to 15.4.x latest didn't resolve this issue.
I also confirmed that there are projects that have been indexed in the past, but which don't appear to be receiving updates (i.e. IndexStatus shows a commit ID that is not HEAD for that project). So there's a generalized problem with both new projects and existing projects where repo push is not properly triggering the index update
One thing I noticed is that the Sidekiq process sometimes shows OOM killer warnings when it gets above 2G resident memory (the actual hosts where Sidekiq runs are not remotely memory starved, however, so this seems to be capping Sidekiq unnecessarily). I didn't notice this before because it was a WARN, not an ERROR, and it also doesn't mention anything specifically about the project IDs where the indexing is behind. I bumped this up to 4G and it seems to have removed these OOM errors. If these OOMs are negatively affecting indexing, it's possible this fixed the issue for me, but I need to monitor over a longer period of time. In any case, it seems like these OOM events should be errors, not warnings. I understand the primary purpose of this OOM process is to resolve memory leaks, however it also may be preventing actual jobs from running. I guess another approach is to automatically configure the Sidekiq RSS max based on the system's max memory instead of defining it statically at 2GB. For example, it could be set to (host_max_rss - 1GB) / num_sidekiq_process_groups if it's a dedicated Sidekiq node.
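As a worked example of that formula (the numbers and variable names are illustrative only):

```ruby
# Illustrative only: a dedicated 16 GB Sidekiq node running 4 Sidekiq process
# groups, reserving 1 GB for the OS and everything else.
host_max_rss_gb            = 16
num_sidekiq_process_groups = 4

per_group_rss_limit_gb = (host_max_rss_gb - 1) / num_sidekiq_process_groups.to_f
puts per_group_rss_limit_gb # => 3.75
```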
As a side note, it appears there may be some kind of bug in the following specific case:
When I create a brand new project and seed it with a README (using the GitLab wizard), that initial commit does not get indexed until another commit is pushed. My guess is any kind of templated project (where GitLab itself introduces the commit rather than it being pushed) will exhibit this issue.
I didn't notice this before because it was a WARN not an ERROR
You're right that a warning level is not appropriate if the indexer is crashing. Looking at the code, I think it should be returning an error if the process crashes from OOM, but this may not be the log entry you're seeing. Perhaps you could create a new issue with more detail about this problem so we can investigate it.
I guess another approach is to automatically configure Sidekiq RSS max based on the system's max memory instead of defining it statically at 2GB.
I think this might be a reasonable feature request to look into. I'm not familiar enough with where these memory limits are being enforced, and it may depend on the installation method for GitLab, so a separate issue with more detail would probably help figure out what is feasible here.
When I create a brand new project and seed it with a README (using the GitLab wizard), that initial commit does not get indexed until another commit is pushed. My guess is any kind of templated project (where GitLab itself introduces the commit rather than it being pushed) will exhibit this issue.
I wasn't able to reproduce this by selecting the create README option on creation for this project: https://gitlab.com/dylan-silver/newly-indexed-project-with-readme. The code seems to be in the index and searchable, but it did take a minute to show up (as Elasticsearch index refreshing is async). Could you please create a separate issue for this if the problem still persists so we can investigate? This shouldn't happen.
You tested on GitLab SaaS though, not GitLab 15.4 on-prem, correct?
I understand asking to create additional issues for some of these problems, but I'm slightly confused about this process. Premium support keeps directing me to update this issue, but I'm not even certain what the problem is, hence the seemingly unrelated, random details of my issue. Creating additional open source issues seems like it will just lead to more churn. I'm going to escalate through commercial support again instead of providing details here since it doesn't seem helpful.
I will note one last thing: Projects which update their default branch seem to also break the indexing. We have customers who, after they make a release, change the default branch to the new release. I don't really agree with that workflow, but it's not in my control. I've found in these cases, the indexing gets out of sync and doesn't seem to recover.
I'm going to escalate through commercial support again instead of providing details here since it doesn't seem helpful.
@jcmcken - thanks for working through this with us. I agree that we need to spend some time troubleshooting to understand what is actually happening, rather than relying on this issue to fix any indexing issue. We'd still like to fix the underlying root causes of index issues.
An interesting issue was found while debugging a customer with empty project search results for code. When the project document is missing from the index, project-level code searches will come back empty (due to the parent_join used in the Elasticsearch query). I've opened an MR to look for this and log the issue during the index repair.
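For illustration, a rough way to check for that condition using the elasticsearch Ruby client; the index name and field names here are assumptions for the sketch, not necessarily the actual schema or what the MR implements:

```ruby
require 'elasticsearch'

project_id = 1234 # example: a project whose code searches come back empty
client     = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])

# If the parent "project" document is gone, child blob documents joined through
# parent_join can no longer be matched by project-scoped code searches.
resp = client.count(
  index: 'gitlab-production',
  body: { query: { bool: { filter: [{ term: { type: 'project' } }, { term: { id: project_id } }] } } }
)

puts 'project document missing from the index' if resp['count'].zero?
```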
There's a chance to immediately fix over 2000 projects in the index as a part of this work.
Next steps are to get the index integrity worker FF re-enabled, confirm it's not causing any more query timeouts, and implement the reindexing process in the Search::IndexRepairService.
FWIW, I'm not sure the proposed mechanism to detect projects that need repair will actually work 100% of the time if I'm reading the code right (and I know it's not complete yet, but just providing some feedback)
For example, here, it seems like the plan is to repair the project if there are 0 blobs found with the given query. However, in our on-prem GitLab cluster, we have examples of projects that do have some searchable blobs, just not all of them. For example, in our case, there's a particular file which we know exists in a specific repo, but when we search for it, we get empty results. But for other searches in the same project, we do get results. So it's only missing indexed results for a range of commits, not all of them.
So really what you would need to do is find some way of determining a priori how many indexed blobs the project should have (maybe using an equivalent of git ls-tree -r, although this might be slow). Then you can run the Elasticsearch query to get the count that it currently has. Then if they are not equal, you repair the index.
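A sketch of that kind of parity check, shelling out to git for the expected count and using the elasticsearch Ruby client for the indexed count; the repository path, index name, and field names are placeholders/assumptions:

```ruby
require 'elasticsearch'

project_id = 1234                   # example project id
repo_path  = '/path/to/project.git' # placeholder: path to the bare repository

# Expected number of blobs at HEAD, via git ls-tree -r as suggested above.
expected = `git -C #{repo_path} ls-tree -r --name-only HEAD`.lines.count

# Number of blob documents currently indexed for the project.
client  = Elasticsearch::Client.new(url: ENV['ELASTICSEARCH_URL'])
indexed = client.count(
  index: 'gitlab-production',
  body: { query: { bool: { filter: [{ term: { type: 'blob' } }, { term: { project_id: project_id } }] } } }
)['count']

puts "repair needed: expected #{expected} blobs, found #{indexed}" if indexed < expected
```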
Another different way of resolving this is to fix whatever is tracking the success of index operations. So for example, we see projects where project.index_status.last_commit matches the HEAD of the project (i.e. project.index_status.last_commit == project.repository.commit('HEAD').id). Nonetheless, the project is still missing search results. When I find such a project and manually execute backfill_projects! against it, the indexing operation succeeds and the search results then appear normal. Of course fixing this will then still require some mechanism to fix incorrect IndexStatus records after the fact.
In our case, what seems to happen is that Sidekiq randomly hits its RSS limit for the job pool that handles indexing. The Sidekiq process crashes out, but it seems like the IndexStatus.last_commit still gets updated. So it gives the appearance that the indexing completed even though it did not
FWIW, I'm not sure the proposed mechanism to detect projects that need repair will actually work 100% of the time if I'm reading the code right (and I know it's not complete yet, but just providing some feedback)
@jcmcken It has been pretty difficult to track down the root cause of this issue. The details you've provided in the thread are much appreciated! I agree that this won't catch all cases. The first step was to detect and fix the obvious issues where a project is completely missing from the index.
Another different way of resolving this is to fix whatever is tracking the success of index operations. So for example, we see projects where project.index_status.last_commit matches the HEAD of the project (i.e. project.index_status.last_commit == project.repository.commit('HEAD').id). Nonetheless, the project is still missing search results. When I find such a project and manually execute backfill_projects! against it, the indexing operation succeeds and the search results then appear normal. Of course fixing this will then still require some mechanism to fix incorrect IndexStatus records after the fact.
In the case where some commits (but not the latest) are missing, are all of the commits searchable by sha in the index? Commits are stored in a separate index from repository data (depending on the version of GitLab). That could be an avenue to explore for comparing index parity.
In our case, what seems to happen is that Sidekiq randomly hits its RSS limit for the job pool that handles indexing. The Sidekiq process crashes out, but it seems like the IndexStatus.last_commit still gets updated. So it gives the appearance that the indexing completed even though it did not
This is interesting and not something I've encountered. Thank you for providing the details on what you're seeing. I'm wondering if it's possible to simulate an RSS limit being reached locally
Another different way of resolving this is to fix whatever is tracking the success of index operations. So for example, we see projects where project.index_status.last_commit matches the HEAD of the project (i.e. project.index_status.last_commit == project.repository.commit('HEAD').id). Nonetheless, the project is still missing search results.
This issue is designed to help alleviate that problem but if we could actually track down the root cause of this then that would be preferable. We've seen this happen occasionally and we've never had the log history or data we'd need to actually figure out how it happened. Based on our understanding of the code it just doesn't seem possible, but I've seen it happen enough now to believe it is possible.
The Sidekiq process crashes out, but it seems like the IndexStatus.last_commit still gets updated
This is the hardest part for me to understand. Updating the index status is the last step, so I can see how a crash could very reasonably lead to a scenario where we do the indexing but don't get around to updating the index status, but the other way around doesn't make sense. The only way I can see this happening is if the gitlab-elasticsearch-indexer somehow returns a successful exit code after a failure, or if Elasticsearch is just losing some updates. We've never been able to track down this mystery failure, which is why we figured another option would be to implement a periodic repair process instead.
The index repair work has shown a decrease in missing project documents over the last 7 days. I think this means it's working as intended and project documents aren't being dropped from the index.
This has been globally enabled on GitLab.com for a month. In %16.3 we will set the feature flag to be enabled by default, and we plan to fully release the feature in %16.4.