What are you trying to do? Articulate your objectives using absolutely no jargon.
Add the ability to specify Geo-specific Elasticsearch nodes or clusters.
How is it done today, and what are the limits of current practice?
Currently, GitLab Geo uses the Elasticsearch application settings, and therefore the same Elasticsearch cluster, as the primary. When a Geo instance is set up in a different region, the same Elasticsearch cluster is used, so we do not take advantage of the value of multiple zones.
What's new in your approach and why do you think it will be successful?
By specifying a Geo-specific cluster and authentication, the GitLab instance can be used in other locations.
Who cares? If you're successful, what difference will it make?
Advanced Search with Elasticsearch is a critical feature for Big Code customers. These are often the same customers who need HA and Geo.
What are the risks and the payoffs?
If a customer needs to rebuild their indexes, this can take days for repositories of 2 TB or more. Not only would search features take longer to recover, there would also be a performance strain from the workers that rebuild the index.
How much will it cost?
How long will it take?
What are the midterm and final "exams" to check for success?
Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
> Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
This is easy to solve: you can just create a separate ES node (replica shard) that is close to your Geo node, and that's it. The replication process will be taken care of by ES.
@vsizov Is it wise to use ES replication on a WAN? We would have to open some ports between these nodes, and I'm not certain that communication is encrypted. Not to mention I'm not really sure that ES is meant to handle the latency over a WAN for replication purposes.
I wonder if the easiest implementation for Geo would be to allow specifying ES connection details for secondary nodes and let each secondary Geo instance index its own dataset.
@stanhu ES clusters communicate on a proprietary protocol on port 9300 IIRC. Also, clusters assume they are on the same local network. I also found this post from Elastic talking about the topic: "We are frequently asked whether it is advisable to distribute an Elasticsearch cluster across multiple data centers (DCs). The short answer is "no"". See https://www.elastic.co/blog/clustering_across_multiple_data_centers
What problem are we trying to solve? Additional latency on search requests due to slow connections to remote ES servers?
If yes, then I'm a bit confused because we're talking about one HTTP(s) request (actual search query). So, theoretically, this should not create any problems. Unfortunately, I don't have any practical information...
For the DR case, it is (sort of) important to have an independent copy of the elasticsearch data in the same physical location as each secondary.
I don't think we should rely on the secondaries to build an independent elasticsearch index - replicating the existing ES data to the secondary in some manner seems like the better option. Ideally, ES would have some kind of standby mode similar to postgresql secondaries. The Geo secondaries could perform queries against that with some "override application_settings" configuration - perhaps that would go on the GeoNode?
@dblessing the first barrier is that we currently store index status in the main GitLab database, which is readonly on secondaries. So there's some significant engineering work required upfront.
Then, actually building the index is computationally expensive. Not everyone is GitLab.com scale but we could easily be talking about a week to a month of processing on each secondary - and search would be broken on the secondary until it were complete. I'd much rather have a bandwidth capacity problem than a CPU capacity problem :)
> I'd much rather have a bandwidth capacity problem than a CPU capacity problem
I'm not sure everyone would agree. Especially, let's say for a customer on AWS that is replicating to a secondary in another provider (or even region - though I'm not sure about network costs in this case). It could get really expensive.
I understand the issue with the database state. I'm not sure if it's simple enough to say that replicating is the better way to go from a bandwidth vs. CPU standpoint. The 'hit' for building the index would be a one-time on provisioning a secondary. Of course it's also true for the bandwidth hit so it's mostly a matter of your first point plus what this might cost from a network vs. CPU perspective in a cloud. Something to think about...
CPU time on AWS isn't cheap either. We could probably compute relative costs for the GitLab.com case, to see if there's a clear winner either way? My intuition is that bandwidth wins, but I could easily be wrong
Once the backfill is done, there's also a (smaller) ongoing cost as new documents show up, in both scenarios.
I do suggest moving the index status to elasticsearch itself in https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 - this would resolve one technical barrier to pursuing independent rebuilds on each secondary.
The time issue remains worrying - as there's no processing to be done, I think it's safe to claim that elasticsearch-level replication will inevitably be faster than independent rebuilds. https://gitlab.com/gitlab-org/gitlab-ee/issues/3492 could help to make this concern manageable though.
For object storage, we're eventually going to support both mechanisms - external replication and gitlab-mediated replication (let's call that internal replication). Perhaps that's an option for elasticsearch as well.
Currently we can't support external replication for elasticsearch because the settings are stored in application_settings, which is read-only and replicated from the primary. We'll need this to be fixed for both internal and external replication, and fixing it means that external replication will start to be an option.
Another use case suggested by @amulvany today is having repository indexing operations happen on the secondary, with an elasticsearch cluster shared between both primary and secondary.
We'd need to resolve https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 first, but this is otherwise quite possible. The secondary gets notifications whenever a repository is created, updated or removed, so all we'd need to do is add settings that delegate indexing actions to a particular secondary.
The motivation is high CPU load on the primary. By moving indexing operations to a (quieter) secondary, resources used for indexing can be partitioned away from other resources.
Having the indexing happen on the secondary can help with internet bandwidth issues for the secondary, but I'm not 100% positive we should build the Elasticsearch index for the primary using the secondary's infrastructure.
This heavy load sounds like something we could improve in an HA setup by reading data from a local follower on the primary; otherwise, we may be building a Death Star topology.
With Elastic now supporting Cross Cluster Replication would a first iteration be to just allow Geo nodes to point to a different ES cluster from the primary?
@rnienaber I am contemplating scheduling this to %12.2 - let me know what you think.
@fzimmer I have added ~"product discovery" to this issue. For the first iteration, we should confirm what the work is to be done and how it would be built as there seems to be some concerns above. Once we've figured those out, we can raise tickets for the work and schedule those in. But yes, we could do the discovery portion in 12.2.
Ideally we shouldn't create this coupling between primary and secondary at the ElasticSearch level.
I'm fast-forwarding/trolling a little bit here: but it's a good way to achieve "Death Star architecture":
The less we couple primary with secondary, the more robust we can make both of them. Also in a disaster recovery situation, you don't want to have to think about promoting to "primary" every single dependency you have.
So we should think about triggering indexing on both, instead of relying on replication (there is also bandwidth to consider here).
A few other downsides I can think of: because the database and ElasticSearch are independent, if one replicates faster than the other, or one replication stops working, you may end up in a situation where your search says one thing and your database says another, or worse, you take too much time to reflect a permission/visibility change and expose something you shouldn't, etc.
I had a couple of discussions with customers in the past weeks on this. Is the current status that the Geo secondary uses the same ELK cluster as the primary? Therefore, failing over to the secondary would entail setting up a new ELK stack and re-indexing?
> I had a couple of discussions with customers in the past weeks on this. Is the current status that the Geo secondary uses the same ELK cluster as the primary?
Yes. One note: the term ELK is slightly incorrect here, as it stands for Elasticsearch + Logstash + Kibana, but we only need the Elasticsearch server for search.
> Therefore, failing over to the secondary would entail setting up a new ELK stack and re-indexing?
It depends. Elasticsearch is a separate service and can be installed wherever the customer wants. If we fail over to the secondary, then that node will be responsible for updating the search index on every data change (in the same ES server). Currently, there is a check: "if ES search is enabled & Geo is enabled & this is a secondary node => don't update the ES index, only read it for search".
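For illustration, the check described above could be sketched like this (this is pseudocode with invented names, not actual GitLab source): a Geo secondary reads the search index but never writes to it.

```ruby
# Illustrative sketch of the check: only update the ES index when
# Advanced Search is enabled AND we are not a Geo secondary.
def update_elasticsearch_index?(es_enabled:, geo_enabled:, secondary:)
  return false unless es_enabled

  # On a Geo secondary, only read the index for search; the primary
  # is responsible for keeping it up to date.
  !(geo_enabled && secondary)
end
```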
However, in most cases, the ES node will be close to the primary node (probably in the same region and network). So if you plan to stop using the whole region and fail over to a secondary in some other region, you will probably want to set up a new ES server there. In this case, you will need to re-index the whole content.
A next iteration here could then be to allow for a secondary, independent ELK cluster that would index the information on the secondary?
I haven't done any research, but from my understanding, we need to use ES cross-cluster replication and then make the secondary use the ES replica. As ES configuration data is stored in the database, we probably need to create an alternative way of configuring it for each node separately (in the geo_nodes table).
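For context, an Elasticsearch cross-cluster replication setup boils down to a "follow" request issued against the follower cluster. A hedged sketch of the request body, with invented cluster alias and index name:

```ruby
require 'json'

# Sketch of the CCR "follow" request body a secondary-site cluster would
# send. 'primary-es' and 'gitlab-production' are made-up names.
ccr_follow_body = {
  remote_cluster: 'primary-es',        # assumed alias for the primary's cluster
  leader_index:   'gitlab-production'  # assumed index name on the primary
}

# Sent to the follower cluster as, e.g.:
#   PUT /gitlab-production/_ccr/follow
puts JSON.generate(ccr_follow_body)
```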
We can't rely on Elasticsearch cross-cluster replication feature:
> CCR is a platinum level feature, and is available through 30-day trial license that can be activated through the start trial API or directly from Kibana.
I believe we would need to follow the regular replication path here:
When we process repository events, we should trigger elastic search synchronizations (to get source-code level stuff in sync)
That doesn't help us with issues / merge request metadata... theoretically we should also add events whenever new indexable stuff happens there, but I believe this starts to become very intensive. I don't fully understand what we index today, so it's a little bit harder to make a plan... should we start with a documentation step first?
This may also be a good fit for a verification framework implementation... I expect this to be easy to implement in our yet-to-be-made framework, as the concepts are simple:
Register an event
Read the event on the secondary, and trigger a re-index operation
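The two steps above can be sketched as follows (class and method names here are invented for illustration; the real framework does not exist yet): the primary registers events, and the secondary drains them and triggers re-index operations.

```ruby
# A registered event: which project changed and how.
ReindexEvent = Struct.new(:project_id, :action)

# Step 1: the primary registers events in an append-only log.
class EventLog
  def initialize
    @events = []
  end

  def register(event)
    @events << event
  end

  # Return and remove all pending events.
  def drain
    @events.shift(@events.size)
  end
end

# Step 2: the secondary reads events and triggers a re-index operation
# (simulated here by recording the project IDs it would re-index).
class SecondaryIndexer
  attr_reader :reindexed

  def initialize
    @reindexed = []
  end

  def process(log)
    log.drain.each { |event| @reindexed << event.project_id }
  end
end
```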
This should also serve as an experiment with us having another team try out the replication framework, and seeing how things can be improved or how easy it is to adopt, etc.
> We can't rely on Elasticsearch cross-cluster replication feature:
> CCR is a platinum level feature, and is available through 30-day trial license that can be activated through the start trial API or directly from Kibana.
Good to know, but why can't we rely on it? At least it's one option. I agree that we can't make it the only option, as it's paid.
> I believe we would need to follow the regular replication path here
There is one catch here: we can only index content that has already been synced. For the database it's not a problem, as events are stored there too, but it is a problem for repositories and wikis. So I believe, for repos, we have to choose another strategy and reindex them after every successful sync. Also, we should expect a significant increase in database writes on the primary, as we will have to create a "reindex event" for every new comment and every updated item in the database. So I think this strategy should be just an alternative and not the main recommended approach.
My understanding is that from a business perspective it's not ideal to make a paid solution (like EE) depend on another paid solution (a specific Elasticsearch license/subscription).
There are all sorts of problems: losing control over pricing, having to renegotiate terms/rates, we could be subject to extra restrictions or embargoes... we can basically lose control over our business.
> There are all sorts of problems: losing control over pricing, having to renegotiate terms/rates, we could be subject to extra restrictions or embargoes... we can basically lose control over our business.
We run open source everything, as far as I know... we run the same thing we ship in our packages as a way to also dogfood/drink our own wine.
We don't ship ES with GitLab package.
About the paid cross-cluster replication feature, let me express my opinion a bit better. Whatever approach we take, we need to be able to configure ES separately for secondary nodes, even if we implement our own event/indexing system on the Geo secondary site. That means we don't need to do anything to support paid cross-cluster replication, except documenting this possibility. It comes at almost no cost, right? No reason not to do that.
On the other hand, @brodock 's concern is valid if we say we support ES with Geo and the only mechanism is paid cross-cluster replication. This is something that should not happen as I mentioned above.
I believe this situation is a bit similar to how we used to handle object storage replication. I'd suggest we proceed in the following iterations:
Enable configuring different ES clusters for a secondary Geo node
State that in order to achieve replication one could use cross-cluster replication, and describe how that could look. This would allow customers to decide whether they want to use this feature or not (including ourselves)
Add native ES cluster replication to Geo after evaluating technical complexity involved
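Iteration 1 above could be sketched roughly like this (attribute names are invented; today the Elasticsearch URL lives in application_settings and is identical for every site):

```ruby
# Hypothetical per-site search settings, e.g. as columns on geo_nodes.
GeoNode = Struct.new(:name, :elasticsearch_url, keyword_init: true)

# Each site resolves its own cluster URL, falling back to the single
# global value from application_settings when none is configured.
def elasticsearch_url_for(node, application_settings_url)
  node.elasticsearch_url || application_settings_url
end

primary   = GeoNode.new(name: 'primary',
                        elasticsearch_url: 'https://es.us-east.example.com:9200')
secondary = GeoNode.new(name: 'secondary',
                        elasticsearch_url: 'https://es.eu-west.example.com:9200')
```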
I believe we should be careful suggesting CCR is a valid/supported option, as that also carries support burden on our team.
I still believe that Geo doing it should be the first supported way and we could consider CCR as an optimization later if customers ask for it.
While I understand this is more or less similar to object storage situation, there is only one vendor involved, so I believe we should approach this in a different way.
Also, do you both think the ES Geo integration would be a good next data type for the yet-to-be-conceived verification framework?
I can't justify that but I think it's important, right. The most important thing here is that this made me think of how would it look like in our "Self-Service POC", it's challenging TBH...
Do all Geo sites need to stay active and in sync with each other?
These sound more like product questions, whether it's acceptable to have some sort of delay with Elasticsearch. Perhaps @fzimmer has some thoughts here.
For disaster recovery, it's probably fine to have some sort of delay. If people are using Geo secondaries for improved performance, there might be some complaints if search results don't appear in a timely fashion.
Can there be a sync delay that is configurable by the user? (Like hourly to daily?)
I'm not sure. We don't have a sync delay setting for PostgreSQL or repository updates; why should Elasticsearch be any different?
I'm not saying anything new here, I'm just trying to find a way forward.
Proposal
I propose the MVP of "Geo supports Advanced Search" should be:
Disaster Recovery: A sysadmin is able to recover from a primary region outage with a notably shorter RPO and RTO than backup and restore.
Geo Replication: Users are able to use the Advanced Search feature, even when their requests are routed to a secondary Geo site.
Both criteria are already achievable with testing and documentation.
Details
Importantly, Advanced Search depends on an external Elasticsearch or OpenSearch service. We don't ship one with GitLab. GitLab does not control configuration of the search service.
Disaster Recovery
Test and document how to configure GitLab with Geo with Advanced Search.
Use AWS OpenSearch and Cross-cluster replication.
Point the secondary site at the follower search cluster, to simplify failover by one step.
Test and document how to do a site failover with Advanced Search working before and after.
Document that Geo requires Cross-cluster replication to be configured for a search service.
Recall that the Advanced Search docs also don't tell you how to install and configure a search service; they point you to possible options. Therefore, I suggest Geo does not need to provide all the details of how to configure replication for your particular search service in the MVP. Geo needs to confirm that you can set it up for DR in concert with Geo, and provide general docs.
Geo Replication
Test and document that Advanced Search already works the same on secondary sites as on the primary.
Prior to that, we were backed into a corner with DR for Elasticsearch. See discussions above. All good solutions would take significant work, and the consequences of not prioritizing them were less severe than the consequences of not prioritizing other work. Also, documenting Cross-cluster replication for Elasticsearch was a very incomplete solution given that it was only available on an expensive paid tier.
For the "Geo Replication" use-case, we now have Geo proxying enabled by default. So HTTP requests are handled by the primary site by default. This means this should already work. In a later iteration (not the MVP), we can do the work to "accelerate" specific routes on a case-by-case basis.
An analogy
We ship Postgres in Omnibus, and we have done a lot of work to manage Postgres replication and failover. But we also have a document describing how you can use an external (not managed by GitLab) Postgres database with Geo. We also have a document describing how to promote a secondary site with an external Postgres database.
We can treat this like we treat external Postgres databases, with documentation, and then build that into GET when using Amazon OpenSearch. We don't design our own Postgres replication. If Advanced Search brings in OpenSearch and it gets shipped with GitLab the product, then we can manage it directly.
A large Ultimate customer is interested in this feature. They would need failover for Advanced Search in case the primary site goes down. ZD Ticket (internal)