What are you trying to do? Articulate your objectives using absolutely no jargon.
Add the ability to specify Geo-specific Elasticsearch nodes or clusters.
How is it done today, and what are the limits of current practice?
Currently, GitLab Geo uses the primary's Elasticsearch application settings, and therefore the same Elasticsearch cluster as the primary. When a Geo instance is set up in a different region, the same Elasticsearch cluster is still used, so the setup cannot take advantage of having multiple zones.
What's new in your approach and why do you think it will be successful?
By specifying a Geo-specific cluster and its authentication details, the GitLab instance can serve search from other locations.
Who cares? If you're successful, what difference will it make?
Advanced Search with Elasticsearch is a critical feature for big-code customers. These are often the same customers who need HA and Geo.
What are the risks and the payoffs?
If a customer needs to rebuild their indexes, this can take days for repositories of 2 TB or more. Not only would search features take longer to recover, there would also be a performance strain from the workers building the index.
How much will it cost?
How long will it take?
What are the midterm and final "exams" to check for success?
Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
> Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
This is easy to solve: you can just create a separate ES node (replica shard) that is close to your Geo node, and that's it. The replication process will be taken care of by ES.
@vsizov Is it wise to use ES replication over a WAN? We would have to open some ports between these nodes, and I'm not certain that communication is encrypted. Not to mention I'm not really sure that ES is meant to handle WAN latency for replication purposes.
I wonder if the easiest implementation for Geo would be to allow specifying ES connection details for secondary nodes and let each secondary Geo instance index its own dataset.
@stanhu ES clusters communicate on a proprietary protocol on port 9300 IIRC. Also, clusters assume they are on the same local network. I also found this post from Elastic talking about the topic: "We are frequently asked whether it is advisable to distribute an Elasticsearch cluster across multiple data centers (DCs). The short answer is "no"". See https://www.elastic.co/blog/clustering_across_multiple_data_centers
What problem are we trying to solve? Additional latency on search requests due to slow connections to remote ES servers?
If yes, then I'm a bit confused, because we're talking about one HTTP(S) request (the actual search query). So, theoretically, this should not create any problems. Unfortunately, I don't have any practical information...
For the DR case, it is (sort of) important to have an independent copy of the elasticsearch data in the same physical location as each secondary.
I don't think we should rely on the secondaries to build an independent elasticsearch index - replicating the existing ES data to the secondary in some manner seems like the better option. Ideally, ES would have some kind of standby mode similar to postgresql secondaries. The Geo secondaries could perform queries against that with some "override application_settings" configuration - perhaps that would go on the GeoNode?
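To make the "override application_settings" idea concrete, here is a minimal sketch of how per-GeoNode Elasticsearch settings could fall back to the primary's application settings. The `elasticsearch_url` field on `GeoNode` is a hypothetical override, not an existing column; the class names only loosely mirror GitLab's models.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ApplicationSettings:
    elasticsearch_url: str  # primary's ES cluster, replicated to secondaries

@dataclass
class GeoNode:
    name: str
    elasticsearch_url: Optional[str] = None  # hypothetical per-node override

def effective_es_url(node: GeoNode, app: ApplicationSettings) -> str:
    # A node with its own cluster configured uses it; otherwise it falls
    # back to the primary's application settings (the current behaviour).
    return node.elasticsearch_url or app.elasticsearch_url

app = ApplicationSettings(elasticsearch_url="http://primary-es:9200")
secondary = GeoNode(name="geo-secondary", elasticsearch_url="http://local-es:9200")

print(effective_es_url(GeoNode(name="primary"), app))  # http://primary-es:9200
print(effective_es_url(secondary, app))                # http://local-es:9200
```

The fallback keeps existing installations working unchanged: a secondary with no override behaves exactly as today.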
@dblessing the first barrier is that we currently store index status in the main GitLab database, which is readonly on secondaries. So there's some significant engineering work required upfront.
Then, actually building the index is computationally expensive. Not everyone is GitLab.com scale but we could easily be talking about a week to a month of processing on each secondary - and search would be broken on the secondary until it were complete. I'd much rather have a bandwidth capacity problem than a CPU capacity problem :)
> I'd much rather have a bandwidth capacity problem than a CPU capacity problem
I'm not sure everyone would agree. Especially, let's say for a customer on AWS that is replicating to a secondary in another provider (or even region - though I'm not sure about network costs in this case). It could get really expensive.
I understand the issue with the database state. I'm just not sure it's simple enough to say that replicating is the better way to go from a bandwidth vs. CPU standpoint. The 'hit' for building the index would be a one-time cost on provisioning a secondary. Of course, the same is true for the bandwidth hit, so it's mostly a matter of your first point plus what this might cost from a network vs. CPU perspective in a cloud. Something to think about...
CPU time on AWS isn't cheap either. We could probably compute relative costs for the GitLab.com case, to see if there's a clear winner either way? My intuition is that bandwidth wins, but I could easily be wrong
Once the backfill is done, there's also a (smaller) ongoing cost as new documents show up, in both scenarios.
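A back-of-envelope comparison along these lines is easy to sketch. Every number below is a hypothetical placeholder, not real GitLab.com data or current cloud pricing; the point is only the shape of the calculation.

```python
# All inputs are assumptions for illustration, not measured values.
index_size_gb = 500               # assumed size of the ES index to transfer
egress_cost_per_gb = 0.09         # assumed inter-region egress price, $/GB
rebuild_core_hours = 8 * 24 * 7   # assumed: 8 cores busy for one week
cost_per_core_hour = 0.05         # assumed compute price, $/core-hour

replicate_cost = index_size_gb * egress_cost_per_gb
rebuild_cost = rebuild_core_hours * cost_per_core_hour

print(f"replicate: ${replicate_cost:.2f}")  # replicate: $45.00
print(f"rebuild:   ${rebuild_cost:.2f}")    # rebuild:   $67.20
```

With these (made-up) inputs, replication wins, but the answer flips easily depending on index size and egress pricing, which is why computing it for the real GitLab.com numbers would be worthwhile.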
I do suggest moving the index status to elasticsearch itself in https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 - this would resolve one technical barrier to pursuing independent rebuilds on each secondary.
The time issue remains worrying - as there's no processing to be done, I think it's safe to claim that elasticsearch-level replication will inevitably be faster than independent rebuilds. https://gitlab.com/gitlab-org/gitlab-ee/issues/3492 could help to make this concern manageable though.
For object storage, we're eventually going to support both mechanisms - external replication and gitlab-mediated replication (let's call that internal replication). Perhaps that's an option for elasticsearch as well.
Currently we can't support external replication for elasticsearch because the settings are stored in application_settings, which is read-only and replicated from the primary. We'll need this to be fixed for both internal and external replication, and fixing it means that external replication will start to be an option.
Another use case suggested by @amulvany today is having repository indexing operations happen on the secondary, with an elasticsearch cluster shared between both primary and secondary.
We'd need to resolve https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 first, but this is otherwise quite possible. The secondary gets notifications whenever a repository is created, updated or removed, so all we'd need to do is add settings that delegate indexing actions to a particular secondary.
The motivation is high CPU load on the primary. By moving indexing operations to a (quieter) secondary, resources used for indexing can be partitioned away from other resources.
Having the indexing happen on the secondary can help with internet bandwidth issues for the secondary, but I'm not 100% positive we should build the primary's Elasticsearch index using the secondary's infrastructure.
This heavy load sounds like something we could improve in an HA setup by reading data from a local follower on the primary; otherwise, we may be building a death-star topology.
With Elastic now supporting Cross Cluster Replication would a first iteration be to just allow Geo nodes to point to a different ES cluster from the primary?
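For reference, CCR setup boils down to two API calls on the Geo-local (follower) cluster, shown here as their request payloads. The `primary` alias, hostnames, and index names are made up, and note that CCR is a licensed Elastic feature; endpoints follow Elastic's CCR API (`PUT _cluster/settings` and `PUT <follower_index>/_ccr/follow`).

```python
import json

# 1. Register the primary's cluster as a remote cluster on the follower:
#    PUT _cluster/settings
remote_cluster_settings = {
    "persistent": {
        "cluster": {
            "remote": {
                # "primary" is an arbitrary alias; 9300 is the ES transport port
                "primary": {"seeds": ["primary-es.example.com:9300"]}
            }
        }
    }
}

# 2. Create a follower index that replicates the leader's index:
#    PUT gitlab-production/_ccr/follow
follow_request = {
    "remote_cluster": "primary",
    "leader_index": "gitlab-production",
}

print(json.dumps(follow_request))
```

Note that the remote-cluster seeds use the transport port (9300), so this still means opening inter-cluster transport traffic across the WAN, as discussed above.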
@rnienaber I am contemplating scheduling this to %12.2 - let me know what you think.
@fzimmer I have added ~"product discovery" to this issue. For the first iteration, we should confirm what the work is to be done and how it would be built as there seems to be some concerns above. Once we've figured those out, we can raise tickets for the work and schedule those in. But yes, we could do the discovery portion in 12.2.
Ideally, we shouldn't create this coupling between primary and secondary at the Elasticsearch level.
I'm fast-forwarding/trolling a little bit here, but it's a good way to end up with a "Death Star architecture":
The less we couple primary with secondary, the more robust we can make both of them. Also in a disaster recovery situation, you don't want to have to think about promoting to "primary" every single dependency you have.
So we should think about triggering indexing on both, instead of relying on replication (there is also bandwidth to consider here).
A few other downsides I can think of: because the database and Elasticsearch replicate independently, if one replicates faster than the other, or one replication stops working, you may end up in a situation where your search says one thing and your database says another. Or worse, you take too long to reflect a permission/visibility change and expose something you shouldn't, etc.
I had a couple of discussions with customers in the past weeks on this. Is the current status that the Geo secondary uses the same ELK cluster as the primary? Therefore, failing over to the secondary would entail setting up a new ELK stack and re-indexing?