Elastic Cloud cluster for indexing in gitlab.com (production cluster)
@nick.thomas will we ever remove data from the index? e.g. when a project is removed from the GitLab instance? do we not care about leftovers? how can we monitor how much data is stale (e.g. will it be marked somehow as belonging to a project that was removed)?
which logs hold information about enqueuing sidekiq jobs? on staging there were a few million elastic_commit_indexer jobs enqueued and the number was growing very rapidly; I want to check what caused that. I searched for elastic_commit_indexer in the unicorn and rails logs but didn't find anything
do you think that the number of all database objects that are being indexed is a good basis for estimating the database objects/index size ratio? or perhaps the disk space used by tables (?) for all projects in those namespaces?
@nick.thomas what happens with search queries between the time the elastic integration is enabled and the index is present in elastic cluster (this might be in the order of tens of hours)?
@mwasilewski-gitlab data is removed from the elasticsearch index when it is removed from the GitLab database or filesystem. When a project is removed, we delete the corresponding documents from the index. Similarly, if an issue is removed, then the elasticsearch document for that issue is also removed. The only way to discover if a particular document in elasticsearch is stale compared to the database is to cross-reference between the two. There's nothing automatic for that at present, and it sounds expensive to do.
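A rough sketch of what such a cross-reference could look like, assuming the documents carry a `project_id` field and using the elasticsearch-ruby client; the index name, field name and client variable are assumptions, not an existing task:

```ruby
# Illustrative only: find project ids present in the index but missing from the database.
# Assumes an Elasticsearch::Client `es`, an index named 'gitlab-production', and a
# 'project_id' field on the documents - all of these are assumptions.
resp = es.search(
  index: 'gitlab-production',
  body: { size: 0, aggs: { projects: { terms: { field: 'project_id', size: 10_000 } } } }
)

indexed_ids = resp.dig('aggregations', 'projects', 'buckets').map { |b| b['key'] }
stale_ids   = indexed_ids - Project.where(id: indexed_ids).pluck(:id)
```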
I think we log a line every time a sidekiq job is enqueued, but there's nothing elasticsearch-integration-specific about it. We have hooks that enqueue a sidekiq job every time an issue or comment is made, and every time a git push happens. What questions are you trying to get answers to from the logs? Maybe there's a gap here, or maybe the information is present in some other form. I'd hazard a guess that sidekiq jobs are far too numerous for us to log every time we enqueue one of them.
Repository content absolutely dominates the total index size; database content is almost a rounding error. Maybe a 1:1 correspondence between the size of the relevant tables is to be expected, since we index the full content of, say, issues and notes.
RE: searches, we have two settings available to us in the admin panel - we can enable indexing and searching separately, and independently of each other. I'd recommend we leave the "search" checkbox disabled until backfill is complete and all the sidekiq jobs have been processed. This will mean we keep the index up to date, but don't perform searches against it.
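For reference, a minimal sketch of toggling the two settings from the Rails console, assuming the EE attribute names `elasticsearch_indexing` and `elasticsearch_search` (verify against your version):

```ruby
# Keep the index up to date, but don't serve search results from it yet.
ApplicationSetting.current.update!(
  elasticsearch_indexing: true,
  elasticsearch_search: false
)
```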
yup, found some logs in Kibana, thanks! I'm trying to get more insight into what's happening during indexing. When I triggered indexing via rake tasks (which means no sidekiq was involved), it took ~15h and the index was ~5GB in size. However, when I enable indexing and leave it to the indexer to enqueue sidekiq jobs (as we discussed), after 18h the index is only 50MB in size. There are elastic_commit_indexer jobs enqueued constantly, some of them with retry numbers going up, e.g. retry = 2 and 3. These jobs do not seem to be present on the queue (prometheus graph), but some (all?) of them do run: graph. I also do not see any failures reported by sidekiq: kibana. There are also very few database-related jobs: kibana.
yes, I'm happy to talk to security. I don't think there will be a single MR which they can review, so perhaps it's best if we just get them involved here. /cc @gitlab-com/gl-security/appsec I am setting up an elastic cluster which will hold the search index for the gitlab-org namespace and, in the future, the entire gitlab.com instance. We are running an indexer that processes database content as well as repositories and uploads the index it creates to an elasticsearch cluster (at the moment, this is an Elastic Cloud cluster provided by elastic.co). The index holds potentially sensitive user data (code, comments, notes). Could someone from the security team make sure that's ok? If there's another way you'd prefer me to raise this for review or if you need more details please let me know.
did you try adding a namespace, saving the config, reloading the page, removing the namespace, saving the config and reloading the page (at which point I was expecting the list to be empty, but it kept showing the namespace)? I'll try again and if it still fails I'll open a ticket with detailed steps to reproduce.
@mwasilewski-gitlab and I would be able to answer questions within infra scope
@nick.thomas would be able to answer questions about the historical reasons, the data being processed, and encryption.
I will start off and let the other two fill in.
What problem is this solving?
Enabling the elasticsearch integration helps us solve these problems, for starters.
Who are the intended users?
Intended users would be anyone who has a GitLab account.
How will they interact with the application? Via Kibana?
They will not interact via Kibana, but via the regular search window that GitLab already provides. (Just the search mechanism in the backend would change from a DB search to an ES search.)
Are there any authorization roles?
The security features are enabled (we would only serve results to users who are allowed to see them).
References for the security features are 1, 2 and 3
What data is it processing?
Data that is indexed and made available for search are:
I believe so, but I will let the others confirm. (TBD)
And how is it stored?
Index prep work would happen within our secured infrastructure (GCP VMs via Sidekiq jobs) and would be sent to Elastic Cloud for the actual indexing and storing.
Will it need to be encrypted at rest?
Data is indexed and stored in the clear in Elasticsearch. Communication between clients and nodes can be encrypted via features such as X-Pack. But users would do their search via gitlab.com, which is already on TLS 1.2. The search would query ES on our Elasticsearch cluster, which is also on TLS 1.2. There is an option to not store the entire _source and to choose which fields to index/store in the clear and which ones to either ignore or encrypt. But I will let @nick.thomas confirm here as well. (TBD)
What's the exposure of the application?
Elasticsearch is already used by EE customers. We are trying to enable it on GitLab.com. Thus, the exposure would be public/world-wide. But the above-mentioned security features would make sure people can only search and see the results they are allowed to.
Worst case (security) scenario from your point of view?
As with any cloud provider, not rotating the credentials to Elastic Cloud might become a concern. (But I don't think this is an isolated concern; it applies to all the dependencies/3rd-party/cloud providers we interact with across the company.)
Private repos and issues (and also confidential issues) are indexed as well. We store the confidentiality information and the access levels required for each project in elasticsearch. When composing queries we compose them using the current user's accessible projects to ensure that, for example, we only return confidential issues in the results if the user actually has the right access level to see them.
All queries contain a filter to check for project membership (in case of private projects) or project visibility (so we still return public projects)
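Roughly, that permission filter has a shape like the one below; an illustrative sketch only, with assumed field names and a placeholder `authorized_project_ids`, not the exact query the application builds:

```ruby
# Illustrative shape only: match either a project the user is a member of,
# or a project whose visibility makes it searchable by everyone.
# `authorized_project_ids` is a placeholder for the current user's accessible projects.
permission_filter = {
  bool: {
    should: [
      { terms: { project_id: authorized_project_ids } },
      { terms: { visibility_level: [Gitlab::VisibilityLevel::PUBLIC, Gitlab::VisibilityLevel::INTERNAL] } }
    ]
  }
}
```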
The worst case I can think of is that the Elastic Cloud authentication is configured incorrectly or leaked, and someone manages to dump all private information for the gitlab-org group on GitLab.com (including GitLab RED data) as a single huge data dump ^^
From what I can see of the authorization to elastic cloud, we're using basic auth with a static username and password. So any leak of the application_settings table becomes a leak of the entire group.
The elastic cloud cluster is available from anywhere on the internet at present. We could mitigate this to an extent by introducing IP-based ACLs on the elastic cloud service, if they support that.
We store the confidentiality information and the access levels required for each project in elasticsearch.
@mdelaossa do we update that info when things change? e.g. when a project is moved or permissions are changed do we trigger some "elastic indexer" that updates information stored in ES to match the state in gitlab db?
The worst case I can think of is that the Elastic Cloud authentication is configured incorrectly or leaked, and someone manages to dump all private information for the gitlab-org group on GitLab.com (including GitLab RED data) as a single huge data dump ^^
That's my biggest concern as well. We should also keep in mind that at this point in time we're talking about one namespace, but in the future we'll be potentially indexing users' private projects as well.
We could mitigate this to an extent by introducing IP-based ACLs on the elastic cloud service, if they support that.
There's plenty of documentation and discussions about how to do it in the open source version and Elastic Cloud Enterprise (their on-prem solution), but not Elastic Cloud (SaaS). I tried using Elastic's X-Pack filtering as documented here, but it didn't work. Here's the json I used, feel free to verify yourself:
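(The exact payload isn't reproduced here; as an illustration, an X-Pack IP filtering request per the Elastic docs looks roughly like the following, with placeholder addresses and with no guarantee that the SaaS offering accepts these settings at all:)

```
PUT /_cluster/settings
{
  "persistent": {
    "xpack.security.http.filter.enabled": true,
    "xpack.security.http.filter.allow": ["203.0.113.0/24"],
    "xpack.security.http.filter.deny": "_all"
  }
}
```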
Some X-Pack features are listed on the subscription page, but the list is not very detailed. Someone asked about this on the Elastic.co forum and in July 2018 this was not possible. Some X-Pack config options are listed as supported here, but IP filtering is not on the list. I opened a topic to check if something has changed.
If we were able to use X-Pack features in ES we could probably switch to TLS-based authentication. However, our integration only supports basic auth, so I'm not sure if that's relevant for this release. Shall I open an issue in gitlab-ee with a feature request, as I imagine that some ES integration users have their clusters secured with TLS authentication?
@mdelaossa do we update that info when things change? e.g. when a project is moved or permissions are changed do we trigger some "elastic indexer" that updates information stored in ES to match the state in gitlab db?
@mwasilewski-gitlab this is correct, we run an index update on every single code push, and on every single model update to Project, Issue, MergeRequest, Milestones, etc
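A simplified sketch of the kind of hook involved (the module name and worker arguments here are illustrative; the real EE concern differs in detail):

```ruby
# Simplified sketch: models include a concern that enqueues an indexer job on every
# change, so the elasticsearch documents track the database state.
module ElasticUpdateHooks
  extend ActiveSupport::Concern

  included do
    after_commit on: [:create, :update] do
      ElasticIndexerWorker.perform_async(:index, self.class.name, id)
    end

    after_commit on: :destroy do
      ElasticIndexerWorker.perform_async(:delete, self.class.name, id)
    end
  end
end
```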
@mwasilewski-gitlab the difference between using the rake tasks vs. allowing the "elasticsearch limited projects" thing to work is that the former will index everything, while the latter will only index content in the listed namespaces, so it's expected that the indexes will be of different sizes.
@mdelaossa I wonder if we should consider modifying these rake tasks in the short term so they respect the "limited indexing" settings. I think it's going to be a source of confusion for all our customers - WDYT?
oh, I see, didn't know that, I thought that the rake tasks were also bound to the selected namespaces. I still think there's something wrong going on, or I'm missing something. There were only 5 ElasticIndexerWorker jobs, and all 5 processed personal snippets, src: kibana. This is also confirmed by the number of processed jobs, graph. Besides, the size of the index is way too low. 9GB of repos indexed into 50MB can't be right. As for ElasticCommitIndexerWorker, hundreds of thousands of jobs are being processed and yet the index is not growing.
PS I was pasting the wrong Kibana URLs in my previous comments
I recreated the cluster and the behavior is more or less the same. I can see documents being created in elastic, but these documents are only for 4 commits, even though hundreds of sidekiq jobs were run. I'm looking at the indexer worker code and the indexer itself. I can see that the go indexer will log the range of commits it is processing in debug mode; how do I change the log level? Could one of you help me debug this?
So the problem seems to have been that the mapping was not created, since the gitlab:elastic:create_empty_index rake task was not run; instead the index was created manually, because we don't have a way to specify the number of shards to use.
This is a great find by @mwasilewski-gitlab - we should enhance the rake task to allow selecting how many shards you want. This might be negatively impacting some of our customers who have very big clusters? We'll want to allow configuration of this.
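For context, creating the index manually amounts to something like the request below, which sets the shard count but leaves out the mapping that `gitlab:elastic:create_empty_index` would have installed (the index name and numbers are placeholders):

```
PUT /gitlab-production
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}
```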
is the below expected? It looks almost as if search is using the ES integration even when searching is disabled
in all cases, ES was limited to the gitlab-org namespace
|                 | integration enabled | integration disabled |
|-----------------|---------------------|----------------------|
| search enabled  | 1                   | 3                    |
| search disabled | 2                   | 4                    |

AD1. searching works, UI shows ES enabled
AD2. 500s, UI shows ES enabled (just before switching to any project which throws 500s)
AD3. UI doesn't show ES enabled, different results than with ES integration, but works fine
AD4. same as above
I'll be triggering indexing and then wiping the index for the next couple of hours (testing some things before rolling out to prod), so staging might behave unexpectedly today
I just confirmed on staging that scenario 3 is the behavior we get during initial index creation (ES integration enabled, search disabled, almost no data in the index)
@nick.thomas if you could take a peek I'd appreciate it. The decoupling of the ES schema work is taking much more of my time than anticipated.
This should definitely not 500, we should instead return empty (or whatever we already indexed) results, and it would be improved later by https://gitlab.com/gitlab-org/gitlab-ee/issues/3492
@nick.thomas - I am joining the task force to enable ES on GitLab.com from the Infra team. :) I have been doing a bunch of reading of all the different epics, issues, MRs, comments, etc. to catch up. Michal is attending a conference, so I thought I would just jump in and see if I can ask some questions I am not very clear about.
It looks like we have a prod CR: production#800 (closed), but in that CR this current issue is tagged as something we need to check/verify before we proceed further. Have you found anything further on this issue to tell whether there is a bug or not?
Thanks for the heads-up @aamarsanaa . I've not had time to dig into this particular observed problem yet - I should get to it next week though.
We have a number of other code changes going through %12.0 that will be needed before we enable elasticsearch on GitLab.com anyway, so this isn't the only blocker ^^.
@mwasilewski-gitlab @aamarsanaa since the elastic cloud cluster now exists, I'll close this issue. We have https://gitlab.com/gitlab-org/gitlab-ee/issues/11419 to cover the application setup side of things, which is assigned to me. Feel free to reopen if I've misunderstood and there's more work to track on the elastic cloud side of things!
I'm going through the steps required to stop the initial indexing midway. I disabled the ES integration and removed the ElasticsearchIndexedNamespace objects through the console. However, jobs are still being scheduled and they are writing to the ES cluster. It doesn't look like a sidekiq job:
so I'm guessing there is something else that schedules them, maybe an elastic_namespace_indexer that is still running? But why would it not show up in the console (see above)?
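(For reference, the console steps mentioned above amount to roughly the following; the model and attribute names are taken from gitlab-ee and may differ between versions:)

```ruby
# Roughly the console steps described above (verify names against your gitlab-ee version).
ApplicationSetting.current.update!(elasticsearch_indexing: false)
ElasticsearchIndexedNamespace.destroy_all
```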
@mwasilewski-gitlab with the "elasticsearch indexing" checkbox disabled, no new sidekiq jobs should be enqueued. Already-enqueued ones will be run until they're all gone, though - perhaps that's what you're seeing?
@nick.thomas I did some more investigation and I think I know what is happening. I was wrong in assuming that if w.size doesn't return anything then it means there are no running jobs (or maybe it was incorrectly reporting nothing). The bottom line is, there is still an elastic_indexer_worker job that has been running since Mon Apr 29 17:06:18 UTC 2019 (which is just before I disabled the integration). This is probably the job that is still writing to the ES cluster.
I'm looking now at ways to kill running sidekiq jobs; if you know how to do it, I could use some help. Here's what I've got so far:
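(The original snippet isn't shown; one way to list the currently running elastic jobs with Sidekiq's API would look roughly like this:)

```ruby
# Rough sketch: list currently running jobs whose worker class is one of the elastic workers.
require 'sidekiq/api'

elastic_jobs = Sidekiq::Workers.new.select do |_process_id, _thread_id, work|
  payload = work['payload']
  payload = Sidekiq.load_json(payload) if payload.is_a?(String) # payload format varies by Sidekiq version
  payload['class'].to_s.start_with?('Elastic')
end
```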
I think there is an intermittent problem somewhere; I'm seeing similar behavior to what I saw before, with w.size not reporting anything. I also think this is the reason why include didn't return anything. I'll try to debug it further the next time it happens.
so now I have a list of jobs that I want to kill. `elastic_jobs.each { |job| job.delete }` says I'm giving it the wrong number of arguments, any ideas?
@mdelaossa I think that at the moment there is no way to kill a sidekiq job cleanly (please do correct me if I'm wrong, I don't see anything in gitlab-ee/app/workers/concerns/application_worker.rb).
One option, which is super hacky and dirty, is to recreate the index. I tested it and it causes the running jobs to stop (I didn't see any errors in logs though, similarly to when we hit the problem with missing mappings). I'm not a big fan of this, but it can be our fallback mechanism to revive the entire instance should something go wrong.
Another way out is to kill the thread using another worker, something like this.
Should I open a ticket in gitlab-ee with a suggestion to extend application_worker.rb with regular status checks and handling of job cancellation?
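(An aside on the `wrong number of arguments` error above: if `elastic_jobs` came from `Sidekiq::Workers`, each entry is a `[process_id, thread_id, work]` array, and `Array#delete` expects an argument, which would explain the error. Only queued jobs expose a `delete`; a sketch, with an assumed queue name:)

```ruby
# Queued (not-yet-running) jobs can be deleted via Sidekiq::Queue; running jobs cannot be
# removed this way. The queue name below is an assumption.
require 'sidekiq/api'

Sidekiq::Queue.new('elastic_indexer').each do |job|
  job.delete if job.klass.to_s.start_with?('Elastic')
end
```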
@mwasilewski-gitlab I think it's interesting that ElasticIndexerWorker jobs are hanging for so long. Did you get any indications of why? If it's just a single network call hanging indefinitely, we can probably fix it easily.
We can always interrupt a running job by killing the sidekiq process it's on - crude, but effective!
@nick.thomas I don't think they are hanging, I think they are processing data and writing to ES cluster. So once you kick off the indexing process of a big instance, jobs will potentially run for many hours
That's true, I was wondering though if there was a mechanism in place for killing/canceling running jobs. Didn't want to run around killing processes
Hmm, I wouldn't expect these index jobs to be particularly long-lived. Perhaps this indicates a problem in how we're currently scheduling the work @mdelaossa ?
The currently scheduled work is one job per project. We have seen that very large projects do take a long time to index, but in the end I don't think we have enough visibility into the status to really know if a job is hanging or just performing as expected. It also seems strange that we have a ton of jobs that hang around this long - I'd really only expect three or four (gitlab EE and CE, maybe workhorse and gitaly)
The current architecture assumes that it's possible for the database content of any project to be indexed, in full, in the lifetime of a single sidekiq job. If the job fails for any reason, then indexing of that project must begin again from the start, because we don't have any way of tracking the progress of the job.
The same is true for the repository content, but there we do most of the work in an external subprocess, which fools the sidekiq memory killer.
I think we'll need to look at this again once we've implemented the bulk API operations, to see if larger projects consistently fail to be indexed. If they do, we'll have to introduce some idea of status, or break the work into multiple smaller jobs, to fix it.
Michal Wasilewski changed title from Elastic Cloud cluster for indexing in gitlab.com to Elastic Cloud cluster for indexing in gitlab.com (production cluster)
the cluster is being used for indexing in production, so I'll close this issue. The change issue will be used for the remaining work related to rolling out indexing to prod