What are you trying to do? Articulate your objectives using absolutely no jargon.
Add the ability to specify Geo-specific Elasticsearch nodes or clusters.
How is it done today, and what are the limits of current practice?
Currently, GitLab Geo uses the primary's Elasticsearch application settings, and therefore the same Elasticsearch cluster as the primary. When a Geo instance is set up in a different region, the same Elasticsearch cluster is still used, so the setup cannot take advantage of having multiple zones.
What's new in your approach and why do you think it will be successful?
By specifying a Geo-specific cluster and its authentication details, the GitLab instance can serve search from other locations.
Who cares? If you're successful, what difference will it make?
Advanced Search with Elasticsearch is a critical feature for big-code customers. These are often the same customers who need HA and Geo.
What are the risks and the payoffs?
If a customer needs to rebuild their indexes, this can take days for repositories of 2 TB or more. Not only would search features take longer to recover, there would also be a performance strain from the workers building the index.
How much will it cost?
How long will it take?
What are the midterm and final "exams" to check for success?
Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
> Having geo-local nodes to query would be great; is there a mechanism in ElasticSearch to keep these nodes up to date, or would we need each Geo instance to build the ElasticSearch index itself, locally?
This is easy to solve: you can just create a separate ES node (replica shard) that is close to your Geo node, and that's it. The replication process will be taken care of by ES.
@vsizov Is it wise to use ES replication over a WAN? We would have to open some ports between these nodes, and I'm not certain that communication is encrypted. Not to mention I'm not really sure that ES is meant to handle WAN latency for replication purposes.
I wonder if the easiest implementation for Geo would be to allow specifying ES connection details for secondary nodes and let each secondary Geo instance index its own dataset.
@stanhu ES clusters communicate on a proprietary protocol on port 9300 IIRC. Also, clusters assume they are on the same local network. I also found this post from Elastic talking about the topic: "We are frequently asked whether it is advisable to distribute an Elasticsearch cluster across multiple data centers (DCs). The short answer is "no"". See https://www.elastic.co/blog/clustering_across_multiple_data_centers
What problem are we trying to solve? Additional latency on search requests due to slow connections to remote ES servers?
If yes, then I'm a bit confused, because we're talking about one HTTP(S) request (the actual search query). So, theoretically, this should not create any problems. Unfortunately, I don't have any practical information...
For the DR case, it is (sort of) important to have an independent copy of the elasticsearch data in the same physical location as each secondary.
I don't think we should rely on the secondaries to build an independent elasticsearch index - replicating the existing ES data to the secondary in some manner seems like the better option. Ideally, ES would have some kind of standby mode similar to postgresql secondaries. The Geo secondaries could perform queries against that with some "override application_settings" configuration - perhaps that would go on the GeoNode?
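To make the "override application_settings" idea concrete, here is a minimal sketch of how per-GeoNode Elasticsearch settings could fall back to the primary's application settings. The `elasticsearch_url` field on `GeoNode` is a hypothetical override, not an existing column; the class names only loosely mirror GitLab's models.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ApplicationSettings:
    elasticsearch_url: str  # primary's ES cluster, replicated to secondaries

@dataclass
class GeoNode:
    name: str
    elasticsearch_url: Optional[str] = None  # hypothetical per-node override

def effective_es_url(node: GeoNode, app: ApplicationSettings) -> str:
    # A node with its own cluster configured uses it; otherwise it falls
    # back to the primary's application settings (the current behaviour).
    return node.elasticsearch_url or app.elasticsearch_url

app = ApplicationSettings(elasticsearch_url="http://primary-es:9200")
secondary = GeoNode(name="geo-secondary", elasticsearch_url="http://local-es:9200")

print(effective_es_url(GeoNode(name="primary"), app))  # http://primary-es:9200
print(effective_es_url(secondary, app))                # http://local-es:9200
```

The fallback keeps existing installations working unchanged: a secondary with no override behaves exactly as today.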
@dblessing the first barrier is that we currently store index status in the main GitLab database, which is readonly on secondaries. So there's some significant engineering work required upfront.
Then, actually building the index is computationally expensive. Not everyone is GitLab.com scale but we could easily be talking about a week to a month of processing on each secondary - and search would be broken on the secondary until it were complete. I'd much rather have a bandwidth capacity problem than a CPU capacity problem :)
> I'd much rather have a bandwidth capacity problem than a CPU capacity problem
I'm not sure everyone would agree. Especially, let's say for a customer on AWS that is replicating to a secondary in another provider (or even region - though I'm not sure about network costs in this case). It could get really expensive.
I understand the issue with the database state. I'm just not sure it's simple enough to say that replicating is the better way to go from a bandwidth vs. CPU standpoint. The 'hit' for building the index would be a one-time cost on provisioning a secondary. Of course, the same is true for the bandwidth hit, so it's mostly a matter of your first point plus what this might cost from a network vs. CPU perspective in a cloud. Something to think about...
CPU time on AWS isn't cheap either. We could probably compute relative costs for the GitLab.com case, to see if there's a clear winner either way? My intuition is that bandwidth wins, but I could easily be wrong
Once the backfill is done, there's also a (smaller) ongoing cost as new documents show up, in both scenarios.
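A back-of-envelope comparison along these lines is easy to sketch. Every number below is a hypothetical placeholder, not real GitLab.com data or current cloud pricing; the point is only the shape of the calculation.

```python
# All inputs are assumptions for illustration, not measured values.
index_size_gb = 500               # assumed size of the ES index to transfer
egress_cost_per_gb = 0.09         # assumed inter-region egress price, $/GB
rebuild_core_hours = 8 * 24 * 7   # assumed: 8 cores busy for one week
cost_per_core_hour = 0.05         # assumed compute price, $/core-hour

replicate_cost = index_size_gb * egress_cost_per_gb
rebuild_cost = rebuild_core_hours * cost_per_core_hour

print(f"replicate: ${replicate_cost:.2f}")  # replicate: $45.00
print(f"rebuild:   ${rebuild_cost:.2f}")    # rebuild:   $67.20
```

With these (made-up) inputs, replication wins, but the answer flips easily depending on index size and egress pricing, which is why computing it for the real GitLab.com numbers would be worthwhile.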
I do suggest moving the index status to elasticsearch itself in https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 - this would resolve one technical barrier to pursuing independent rebuilds on each secondary.
The time issue remains worrying - as there's no processing to be done, I think it's safe to claim that elasticsearch-level replication will inevitably be faster than independent rebuilds. https://gitlab.com/gitlab-org/gitlab-ee/issues/3492 could help to make this concern manageable though.
For object storage, we're eventually going to support both mechanisms - external replication and gitlab-mediated replication (let's call that internal replication). Perhaps that's an option for elasticsearch as well.
Currently we can't support external replication for elasticsearch because the settings are stored in application_settings, which is read-only and replicated from the primary. We'll need this to be fixed for both internal and external replication, and fixing it means that external replication will start to be an option.
Another use case suggested by @amulvany today is having repository indexing operations happen on the secondary, with an elasticsearch cluster shared between both primary and secondary.
We'd need to resolve https://gitlab.com/gitlab-org/gitlab-ee/issues/2341 first, but this is otherwise quite possible. The secondary gets notifications whenever a repository is created, updated or removed, so all we'd need to do is add settings that delegate indexing actions to a particular secondary.
The motivation is high CPU load on the primary. By moving indexing operations to a (quieter) secondary, resources used for indexing can be partitioned away from other resources.
Having the indexing happen on the secondary can help with internet bandwidth issues for the secondary, but I'm not 100% positive we should build the primary's Elasticsearch index using the secondary's infrastructure.
This heavy load sounds like something we could improve in an HA setup by reading data from a local follower on the primary; otherwise, we may be building a death-star topology.
With Elastic now supporting Cross Cluster Replication would a first iteration be to just allow Geo nodes to point to a different ES cluster from the primary?
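For reference, CCR setup boils down to two API calls on the Geo-local (follower) cluster, shown here as their request payloads. The `primary` alias, hostnames, and index names are made up, and note that CCR is a licensed Elastic feature; endpoints follow Elastic's CCR API (`PUT _cluster/settings` and `PUT <follower_index>/_ccr/follow`).

```python
import json

# 1. Register the primary's cluster as a remote cluster on the follower:
#    PUT _cluster/settings
remote_cluster_settings = {
    "persistent": {
        "cluster": {
            "remote": {
                # "primary" is an arbitrary alias; 9300 is the ES transport port
                "primary": {"seeds": ["primary-es.example.com:9300"]}
            }
        }
    }
}

# 2. Create a follower index that replicates the leader's index:
#    PUT gitlab-production/_ccr/follow
follow_request = {
    "remote_cluster": "primary",
    "leader_index": "gitlab-production",
}

print(json.dumps(follow_request))
```

Note that the remote-cluster seeds use the transport port (9300), so this still means opening inter-cluster transport traffic across the WAN, as discussed above.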
@rnienaber I am contemplating scheduling this to %12.2 - let me know what you think.
@fzimmer I have added ~"product discovery" to this issue. For the first iteration, we should confirm what the work is to be done and how it would be built as there seems to be some concerns above. Once we've figured those out, we can raise tickets for the work and schedule those in. But yes, we could do the discovery portion in 12.2.
Ideally, we shouldn't create this coupling between primary and secondary at the Elasticsearch level.
I'm fast-forwarding/trolling a little bit here, but it's a good way to end up with a "Death Star architecture":
The less we couple primary with secondary, the more robust we can make both of them. Also in a disaster recovery situation, you don't want to have to think about promoting to "primary" every single dependency you have.
So we should think about triggering indexing on both, instead of relying on replication (there is also bandwidth to consider here).
A few other downsides I can think of: because the database and Elasticsearch replicate independently, if one replicates faster than the other, or one replication stops working, you may end up in a situation where your search says one thing and your database says another. Or worse, you take too long to reflect a permission/visibility change and expose something you shouldn't, etc.
I had a couple of discussions with customers in the past weeks on this. Is the current status that the Geo secondary uses the same ELK cluster as the primary? Therefore, failing over to the secondary would entail setting up a new ELK stack and re-indexing?