Enable Advanced Global Search for Paid Groups on GitLab.com
In discussions about how to continue rolling out Advanced Global Search (Elasticsearch) to *all* of GitLab.com, it was determined that paid groups on GitLab.com represent a [much smaller percentage](https://gitlab.com/gitlab-org/gitlab-ee/issues/11840#note_200518527) of our total storage requirements.
Paid groups on GitLab.com account for 7168941859658 bytes (\~7TB) and can be supported by our existing Elastic as a Service infrastructure.
In light of this information, we should move forward with rolling out Elasticsearch to all paid groups on GitLab.com while continuing to work on long-term plans for the rest of GitLab.com.
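As a quick sanity check on the \~7TB figure, the quoted byte count converts as follows (a minimal sketch; the figure matches decimal terabytes):

```ruby
# Sanity check: convert the quoted byte count for paid groups to TB and TiB.
bytes = 7_168_941_859_658
decimal_tb = bytes / 1e12      # terabytes (10^12 bytes)
binary_tib = bytes / 2.0**40   # tebibytes (2^40 bytes)
puts format("%.2f TB / %.2f TiB", decimal_tb, binary_tib)  # => 7.17 TB / 6.52 TiB
```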
[Initial Thoughts on how to approach](https://gitlab.com/gitlab-org/gitlab-ee/issues/11840#note_200895695) this from Michal Wasilewski:
* Get exact numbers on the maximum size of a single shard (2-10 days). This would require some help from dev. The purpose is to avoid being taken by surprise when a shard grows beyond what a single node can handle. We currently distribute data unevenly across shards, and I don't know if there is any way to get exact numbers (rather than estimates) on how big they can get. What I'm looking for is, for example: "it's predictable because we assign entire namespaces to shards and the mapping is based on a hash rather than on something random, so the max size will be x GB". One solution that comes to mind that would help alleviate this problem is recreating the index with a big number of shards, e.g. 100, and accepting the memory overhead caused by such a high number. It might make individual shards smaller and buy us flexibility in the future (which might actually be more important than max shard size), but again, I don't know if it works that way.
* Estimate the number of search requests per second the resized cluster will have to handle (2-5 days). After initial indexing, the highest load will be generated by search requests, so we need to understand the order of magnitude of the number of those requests. One idea is to estimate this by looking at our single-namespace index, using per-user or per-project figures and extrapolating linearly. For example: `gitlab-com` namespace has 10k users who generate 300 requests/s, all paid groups have 100k users, so we expect there will be 3000 req/s; or `gitlab-com` has 20 projects, all paid groups have 5000 projects, so there will be 75000 req/s.
* Test the resizing procedure without downtime on staging (3-5 days). I've tested live resizing of an ElasticCloud cluster while working on the single namespace (it's documented somewhere), but we would need to take a closer look at the index behavior in such conditions (shard reallocation, uneven distribution of load, latency on requests, any hiccups that would result in a need for reindexing).
* Resize production (1 day)
* Index paid groups (5-10 days)
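The linear extrapolation described in the request-rate bullet above can be sketched as follows (`extrapolate` is a hypothetical helper; the inputs are the illustrative figures from that bullet, not real measurements):

```ruby
# Linearly scale a known request rate from a measured population to a
# target population (users or projects).
def extrapolate(known_count:, known_rate:, target_count:)
  known_rate * (target_count.to_f / known_count)
end

# By users: 10k users generate 300 req/s; paid groups have 100k users.
puts extrapolate(known_count: 10_000, known_rate: 300, target_count: 100_000)  # => 3000.0
# By projects: 20 projects at 300 req/s; paid groups have 5000 projects.
puts extrapolate(known_count: 20, known_rate: 300, target_count: 5_000)        # => 75000.0
```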
### Estimate size of group
<details><summary>**Script for estimating size**</summary>
```ruby
include ActionView::Helpers::NumberHelper

# Suppress return-value output in IRB while the script runs
return_format = conf.return_format
conf.return_format = ""

def counts(namespace_paths, count_comments: false)
  ns = namespace_paths.map { |p| Namespace.find_by_full_path(p) }

  # Always include gitlab-org + gitlab-com as a known reference segment,
  # plus an "all" segment and one segment per requested namespace
  segments = {
    "gitlab-org + gitlab-com" => [Group.find(9970), Group.find(6543)],
    "all" => ns,
  }
  ns.each { |n| segments[n.path] = [n] }

  units = { thousand: "K", million: "M", billion: "B" }

  puts "| | namespaces | projects | repository size | issues | merge requests | comments |"
  puts "| ----- | ----- | ----- | ----- | ----- | ----- | ---- |"
  segments.each do |name, groups|
    namespaces_count = groups.count
    all_projects = groups.flat_map { |g| g.all_projects.includes(:statistics).to_a }
    projects_count = all_projects.count
    repository_size = number_to_human_size(all_projects.sum { |p| p.statistics.repository_size })
    issues_count = all_projects.sum { |p| p.issues.count }
    merge_requests_count = all_projects.sum { |p| p.merge_requests.count }

    # Counting comments is expensive, so it is opt-in; -1 means "not counted"
    comments_count = -1
    if count_comments
      comments_count = all_projects.sum { |p| p.issues.sum { |i| i.notes.count } } +
        all_projects.sum { |p| p.merge_requests.sum { |mr| mr.notes.count } }
    end

    puts "| #{name} | #{namespaces_count} | #{projects_count} | #{repository_size} | #{number_to_human(issues_count, units: units)} | #{number_to_human(merge_requests_count, units: units)} | #{number_to_human(comments_count, units: units)} |"
  end
end

# Restore return-value output in IRB
conf.return_format = return_format
```
</details>
### Beta testing Issue Template
<details><summary>Issue template</summary>
```markdown
# Production Change - Criticality 3 ~"C3"
| Change Objective | Describe the objective of the change |
|:-------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------|
| Change Type | Operation |
| Services Impacted | Advanced Search (ES integration), Sidekiq, Redis, Gitaly, PostgreSQL, Elastic indexing cluster |
| Change Team Members | @DylanGriffith |
| Change Severity | C3 |
| Change Reviewer or tested in staging | This was performed many times in the past in both staging and production, for example see: https://gitlab.com/gitlab-com/gl-infra/production/issues/1608 |
| Dry-run output | - |
| Due Date | 2020-XX-XX XX:XX:XX UTC |
| Time tracking | |
## Detailed steps for the change
### Pre-check
- [ ] Top level group: `CHANGE_ME`
- [ ] Customer communication issue: CHANGE_ME
- [ ] [Estimate size of group](https://gitlab.com/groups/gitlab-org/-/epics/1736#estimate-size-of-group)
### Roll out
- [ ] Notify the customer in the customer communication issue
- [ ] Check with SRE on call in `#production`: `@sre-oncall I would like to roll out this change <LINK>. Please let me know if there are any ongoing incidents or any other reason to hold off for the time being. Please note this will likely trigger a high CPU alert for sidekiq workers but since Elasticsearch has a dedicated sidekiq fleet it should not impact any other workers.`
- [ ] note disk space and number of documents for the `gitlab-production` index
- [ ] login as admin, go to `Admin > Settings > Integrations > Elasticsearch`
- [ ] add the top level group to indexed namespaces
- [ ] note start time: `2020-XX-XX XX:XX:XX UTC` and update Due Date in table
- [ ] Wait for namespace to finish indexing
  - [ ] Look at [Sidekiq Queue Lengths per Queue](https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1)
- [ ] Look for [`ElasticNamespaceIndexerWorker`](https://log.gprd.gitlab.net/goto/3d4511da1ef6f295d8a62775f98d7798) and find the `correlation_id`
- [ ] Look for [`done` jobs with that `correlation_id`](https://log.gprd.gitlab.net/goto/8adc81109977b4a98da66fa7166b1901). When you are finished there should be 3 jobs done (1x`ElasticIndexerWorker`, 2x`ElasticCommitIndexerWorker`) per project in the group.
- [ ] note end time: `2020-XX-XX XX:XX:XX UTC`
- [ ] note time taken: `XXm`
- [ ] note increase in index size
- [ ] Check for [any failed projects for that `correlation_id`](https://log.gprd.gitlab.net/goto/b6f4877e7db12153019662a64e312791) and [manually retry them if necessary](https://gitlab.com/gitlab-com/gl-infra/production/issues/1499#note_268415778)
- [ ] Test out searching
- [ ] Inform customer it is finished
| In [`gitlab-production` index](https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/elasticsearch/indices/gitlab-production?_g=(cluster_uuid:D31oWYIkTUWCDPHigrPwHg)) | Before | After |
| ------ | ------ | ------ |
| [Total](https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/elasticsearch/indices/gitlab-production?_g=(cluster_uuid:D31oWYIkTUWCDPHigrPwHg)) | | |
| [Documents](https://00a4ef3362214c44a044feaa539b4686.us-central1.gcp.cloud.es.io:9243/app/monitoring#/elasticsearch/indices/gitlab-production?_g=(cluster_uuid:D31oWYIkTUWCDPHigrPwHg)) | | |
## Monitoring
### Key metrics to observe
* Gitlab admin panel:
* [queue for `ElasticIndexerWorker`](https://gitlab.com/admin/sidekiq/queues/elastic_indexer)
* [queue for `ElasticCommitIndexerWorker`](https://gitlab.com/admin/sidekiq/queues/elastic_commit_indexer)
* Grafana:
* [Platform triage](https://dashboards.gitlab.net/d/general-triage/general-platform-triage?orgId=1)
* Sidekiq:
* [Sidekiq SLO dashboard overview](https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-12h&to=now)
* `Sidekiq Queue Lengths per Queue`
* Expected to climb during initial indexing and sudden drop-off once initial indexing jobs are finished.
* `Sidekiq Inflight Operations by Queue`
* `Node Maximum Single Core Utilization per Priority`
* expected to be 100% during initial indexing
* Redis-sidekiq:
* [Redis-sidekiq SLO dashboard overview](https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?orgId=1&from=now-6h&to=now)
* `Memory Saturation`
* If memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs).
* If it gets close to SLO levels, the rate of growth should be evaluated and the indexing should potentially be stopped.
### Other metrics to observe
* Grafana:
* Rails:
* [Search controller performance](https://dashboards.gitlab.net/d/rPsQXrImk/rails-controller?orgId=1&refresh=1m&from=now-24h&to=now&var-env=gprd&var-type=web&var-stage=main&var-controller=SearchController&var-action=show)
* Postgres:
* [patroni SLO dashboard overview](https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&from=now-6h&to=now)
* [postgresql overview](https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&from=now-2d&to=now)
* [pgbouncer SLO dashboard overview](https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&from=now-6h&to=now)
* [pgbouncer overview](https://dashboards.gitlab.net/d/PwlB97Jmk/pgbouncer-overview?orgId=1&from=now-2d&to=now&var-prometheus=Global&var-environment=gprd&var-type=patroni)
* ["Waiting Sidekiq pgbouncer Connections"](https://dashboards.gitlab.net/d/PwlB97Jmk/pgbouncer-overview?orgId=1&from=now-2d&to=now&var-prometheus=Global&var-environment=gprd&var-type=patroni&fullscreen&panelId=15)
* If we see this increase to say 500 and stay that way then we should be concerned and disable indexing at that point
* Gitaly:
* [Gitaly SLO dashboard overview](https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&from=now-6h&to=now)
* [Gitaly latency](https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&from=now-6h&to=now&fullscreen&panelId=3)
* [Gitaly saturation overview](https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&from=now-6h&to=now&fullscreen&panelId=6)
* [Gitaly single node saturation](https://dashboards.gitlab.net/d/gitaly-main/gitaly-overview?orgId=1&from=now-6h&to=now&fullscreen&panelId=53)
* If any nodes on this graph are maxed out for a long period of time correlated with enabling this we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that it will help and then stop if it's clearly correlated.
## Rollback steps
- [ ] General "ES integration in Gitlab" runbook: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md
- [ ] remove the two added namespaces from the admin panel
- [ ] clear elasticsearch sidekiq queue: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#dropping-an-entire-queue
- [ ] stop running sidekiq jobs: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md#kill-running-jobs-as-opposed-to-removing-jobs-from-a-queue
- [ ] clean up changes made to the ES index (you actually don't need to do anything, src: https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md#cleaning-up-index )
- [ ] **nuclear option** stop sidekiq-elasticsearch completely (note this might lead to a broken ES index as the worker updates in the database what objects it has processed): `sudo gitlab-ctl stop sidekiq-cluster`
- [ ] **nuclear option** disable anything related to ES in the entire Gitlab instance (note this does not clear sidekiq queues or kill running jobs): https://gitlab.com/gitlab-com/runbooks/blob/master/elastic/doc/elasticsearch-integration-in-gitlab.md#disabling-es-integration
## Changes checklist
- [X] Detailed steps and rollback steps have been filled prior to commencing work
- [X] Person on-call has been informed prior to change being rolled out
/label ~C3
/confidential
/relate https://gitlab.com/gitlab-org/gitlab/issues/33679
```
</details>
### Customer communication template
<details><summary>Customer communication template</summary>
```
@ contact as a licensed member of GitLab.com, we are pleased to add you to our "Advanced Global Search" roll-out on GitLab.com.
A couple of housekeeping notes:
1. This feature is in the early stages of being rolled out to a wider audience on GitLab.com. This means we cannot guarantee access to the feature, and we may have to turn it on and off at times. Turning the feature off will revert you to "Basic search", the limited form of search you had before we enabled this.
1. In order to take advantage of the Advanced Search features, you have to scope /search to your group (example screenshot below) and look for the Advanced search functionality enabled in the top right.
1. Advanced Search brings a couple of large benefits. The first is the option to do cross-project code search across all projects within your group. The second is [Advanced search syntax](https://docs.gitlab.com/ee/user/search/advanced_search_syntax.html).
Let me or @phikai know by responding to this comment if you have feedback or questions.
```
</details>