Automatically disable Elasticsearch functionality when the cluster isn't responding
Problem to solve
If we're using Elasticsearch and the cluster is down for any reason, searches will begin to fail. An administrator has to intervene manually, either by fixing the cluster or by going to the admin settings and unchecking the "Elasticsearch search" check box.
Elasticsearch indexing jobs may also exceed their retry time and cause gaps in indexing in this scenario.
The affected users are instance administrators and anyone who needs to search on GitLab.
Elasticsearch is an external system and may also be externally administered. We can expect it to be broken at times when GitLab itself is running fine. Since we have two search code paths in this scenario, it seems it would be relatively easy to fall back to the database search if elasticsearch is enabled, but the cluster isn't contactable for any reason.
For indexing, it would be good if we could avoid exhausting Sidekiq retry limits by uselessly retrying jobs that must fail while Elasticsearch is down.
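One way to picture the indexing side is a worker that parks a job when the cluster is known to be down, so the attempt never counts against the retry limit. This is a minimal sketch only, written in Python for illustration: `run_index_job`, the `Job` shape, and the `cluster_up` check are hypothetical names, not existing GitLab or Sidekiq APIs.

```python
from collections import deque
from typing import Callable, Deque, Dict

Job = Dict[str, int]  # hypothetical job shape: {"id": 1, "retries": 0}


def run_index_job(job: Job, cluster_up: Callable[[], bool],
                  index: Callable[[Job], None], parked: Deque[Job]) -> str:
    """Run one indexing job, parking it if the cluster is known to be down."""
    if not cluster_up():
        # Cluster is down: set the job aside without touching its retry count.
        parked.append(job)
        return "parked"
    try:
        index(job)
    except ConnectionError:
        # An unexpected failure still consumes a retry, as it does today.
        job["retries"] += 1
        return "retry"
    return "done"
```

Parked jobs would be pushed back onto the queue once the cluster is reachable again; how that is done in Sidekiq is left open here.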
Introduce a circuit breaker that monitors the health of the configured Elasticsearch cluster. When the cluster is down, we can fall back to database search and pause processing of ElasticIndexerWorker Sidekiq jobs. This should significantly reduce the severity of an Elasticsearch cluster failure, in terms of both immediate user experience and data integrity.
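To make the proposal concrete, here is a rough sketch of what such a breaker and the search fallback could look like. It is written in Python for illustration and is not GitLab code: `CircuitBreaker`, `search`, the failure threshold, and the reset timeout are all assumptions, and the two search backends are passed in as plain callables.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Trips after a run of consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_timeout = reset_timeout          # hypothetical default, in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def open(self) -> bool:
        """True while the breaker is tripped and the reset timeout has not elapsed."""
        if self._opened_at is None:
            return False
        return self._clock() - self._opened_at < self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            # Trip the breaker, or re-trip it after a failed trial call.
            self._opened_at = self._clock()


def search(query: str, breaker: CircuitBreaker,
           elastic_search: Callable[[str], List[str]],
           database_search: Callable[[str], List[str]]) -> List[str]:
    """Use Elasticsearch when the breaker allows it, otherwise the database."""
    if breaker.open:
        return database_search(query)
    try:
        results = elastic_search(query)
    except ConnectionError:
        breaker.record_failure()
        return database_search(query)
    breaker.record_success()
    return results
```

While the breaker is open, no request reaches the cluster at all, so users get database results immediately instead of waiting for a timeout. The same `open` flag could be what pauses the indexing workers.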
When the elasticsearch cluster returns, we could hold off on re-enabling the elasticsearch functionality until the related sidekiq queues drop below a (high) threshold. We could even make this part of the conditions for the circuit breaker tripping in the first place.
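The recovery condition could be sketched as a small gate with two thresholds, so that the feature does not flap on and off around a single value. This is an illustration in Python under assumed names and numbers: `RecoveryGate`, `resume_below`, and `trip_above` are hypothetical, and how the queue size is obtained from Sidekiq is not shown.

```python
from dataclasses import dataclass


@dataclass
class RecoveryGate:
    """Decides whether Elasticsearch functionality should currently be enabled."""

    resume_below: int = 10_000   # hypothetical: re-enable once the backlog is under this
    trip_above: int = 100_000    # hypothetical: disable if the backlog reaches this
    enabled: bool = True

    def update(self, cluster_healthy: bool, queue_size: int) -> bool:
        if not cluster_healthy or queue_size >= self.trip_above:
            # Either condition trips the gate.
            self.enabled = False
        elif not self.enabled and queue_size < self.resume_below:
            # Cluster is back and the indexing backlog has drained far enough.
            self.enabled = True
        return self.enabled
```

With `resume_below=100` and `trip_above=500`, a healthy cluster with 250 queued jobs stays disabled if it was already disabled, and stays enabled if it was already enabled.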
Permissions and Security
Detailed circuit breaker information should only be accessible to GitLab admins. The fact that the circuit breaker has tripped will be visible as a result of degradation of the search UI, but existing timeouts / 500 errors would also leak that information.
We'd need to update the Elasticsearch integration docs to cover the new functionality.
What does success look like, and how can we measure that?
In the event of an extended elasticsearch cluster outage on GitLab.com, search gracefully downgrades itself by disabling Advanced Search and Global Code Search functionality. When the cluster returns, searches return issues and code that were added to GitLab while the cluster was down.
What is the type of buyer?
Links / references
@DouweM @jramsay @mwasilewski-gitlab this came out of a short call I had earlier with Michal. I don't think it needs to be a blocker, but we might find this functionality very useful for reducing the impact of outages or search performance issues on GitLab.com
@reprazent I know you were involved in the Gitaly / disk access circuit breaker efforts. Do you have any lessons from there that might be useful for us to take into account here?