# Index `gitlab-com` group with Elasticsearch on GitLab.com
## Blockers

- Figure out if we need to deal with https://gitlab.com/gitlab-org/gitlab/issues/103325#note_261461370
- Determine if we need to resize the Elasticsearch cluster
## C3

Production Change - Criticality 3

| Change Objective | Describe the objective of the change |
|---|---|
| Change Type | ConfigurationChange |
| Services Impacted | GitLab.com |
| Change Team Members | Dylan Griffith |
| Change Severity | C3 |
| Change Reviewer or tested in staging | staging |
| Dry-run output | N/A |
| Due Date | 2019-01-07 08:02 UTC |
| Time tracking | |
## Pre-check

- Run the below steps on staging
- Determine there is enough space left in the Elasticsearch cluster ✅ - We've only used 40GB and there is ~~440GB~~ 360GB left (*see comment below*)
- Compare the number of projects in `gitlab-com` to `gitlab-org` to get a rough estimate of how long indexing will take, and also compare repo size across groups.
  - Details at #1499 (comment 262021990)
  - Some evidence in #800 (comment 174903937) suggests this wasn't a large amount of data relative to `gitlab-org` when it was last looked into.
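The project-count comparison above can be turned into a back-of-the-envelope time estimate. The sketch below uses made-up placeholder numbers (none of these figures come from the actual runs referenced in this issue) and assumes indexing time scales roughly linearly with project count, which ignores differences in repository size:

```ruby
# Rough linear scaling estimate: if a reference group of N projects took
# M minutes to index, estimate the time for a target group.
# All inputs are illustrative placeholders, not measured values.
def estimated_indexing_minutes(reference_minutes:, reference_projects:, target_projects:)
  (reference_minutes.to_f / reference_projects) * target_projects
end

# e.g. if a hypothetical 3000-project group took 240 minutes,
# a ~600-project group would take about:
puts estimated_indexing_minutes(
  reference_minutes: 240,
  reference_projects: 3000,
  target_projects: 600
)
# => 48.0
```

Repo size matters at least as much as project count, so this only bounds the order of magnitude; the pre-check above compares repo sizes for that reason.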
## Detailed steps for the change

- Resize cluster
- Create a silence for the component saturation alert for Sidekiq CPU, similar to this one: https://alerts.gprd.gitlab.net/#/silences/5b12f2f5-8998-4373-a823-762f46c1706f
- Go to Admin > Settings > Integrations > Elasticsearch
- Note start time: 2019-01-07 08:02 UTC
- Add `gitlab-com` to indexed namespaces
- Wait for the Elasticsearch reindex to complete. Watch the queue for `ElasticIndexerWorker`: it should trigger for each project in the group (~600), and each will spin out 2 `ElasticCommitIndexerWorker` jobs to index the source code and the wiki. Worth noting that after the queue is approximately drained it may keep requeueing anyway as updates come in, so we only need to see it almost reach zero; it should stabilise at a low number.
  - Number of in-process jobs for `ElasticCommitIndexerWorker` has stabilised to roughly zero: https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-6h&to=now&fullscreen&panelId=22
- Note end time: 2019-01-07 08:55 UTC
- Note the time taken: 53 minutes
- Update the table below for the historical record

| In gitlab-production index | Before | After |
|---|---|---|
| Total | 38.8GB | 55.8GB |
| Documents | 3.3M | 4.4M |
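The "wait for the reindex to complete" step amounts to watching the worker queue until it stabilises near zero rather than hitting exactly zero. A minimal sketch of that check, where `samples` is a hypothetical stand-in for successive queue-length readings (the real values come from the Sidekiq dashboard, not this function):

```ruby
# Return the index of the poll at which the queue is considered drained:
# the size has stayed at or below `threshold` for `stable_polls`
# consecutive readings. Returns nil if it never stabilises.
def drained_at(samples, threshold: 5, stable_polls: 3)
  consecutive = 0
  samples.each_with_index do |size, i|
    consecutive = size <= threshold ? consecutive + 1 : 0
    # The queue may never hit exactly zero, since updates keep
    # requeueing jobs; a few small readings in a row is good enough.
    return i if consecutive >= stable_polls
  end
  nil
end

puts drained_at([1200, 640, 80, 4, 2, 3, 1])  # => 5
```

Requiring several consecutive low readings guards against declaring victory on a momentary dip while a batch of `ElasticCommitIndexerWorker` jobs is still in flight.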
## Rollback steps

- Remove `gitlab-com` from indexed namespaces
## Changes checklist

- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to the change being rolled out
## Monitoring

- Queue for `ElasticIndexerWorker`
- Queue for `ElasticCommitIndexerWorker`
- Overall Sidekiq queues
- Search controller performance
- Check "Waiting Client Connections per Pool"
  - If we see this increase to, say, 500 and stay that way, we should be concerned and disable indexing at that point
- Check Gitaly node saturation
  - If any nodes on this graph are maxed out for a long period of time correlated with enabling this, we should disable it. We should first confirm by shutting down `ElasticCommitIndexerWorker` that it will help, and then stop if it's clearly correlated.
- Check Redis memory
  - If memory usage starts growing rapidly it might get OOM killed (which would be really bad because we would lose all other queued jobs, for example scheduled CI jobs).
  - If it gets close to SLO levels, the rate of growth should be evaluated and indexing should potentially be stopped.
  - Runbook for managing (killing) running/enqueued Sidekiq jobs: https://gitlab.com/gitlab-com/runbooks/blob/master/troubleshooting/large-sidekiq-queue.md
- Other dashboards:
  - CPU utilisation of Sidekiq (filter for `sidekiq-elasticsearch-01-sv-gprd.c.gitlab-production.internal`): https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=now-6h&to=now&fullscreen&panelId=89
  - https://dashboards.gitlab.net/d/gitaly-host-detail/gitaly-host-detail?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-fqdn=file-cny-01-stor-gprd.c.gitlab-production.internal
  - https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&from=now-6h&to=now
  - https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&from=now-6h&to=now
  - https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&from=now-2d&to=now
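The Redis memory check above is about rate of growth, not absolute level: given two readings, we can project when usage would cross the limit and decide whether to stop indexing. A sketch with placeholder values (the real figures come from the dashboards, and the limit here is an assumed example, not the actual SLO):

```ruby
# Given two used-memory samples taken `interval_minutes` apart, project
# how many minutes remain until usage reaches `limit_gb`.
# Returns Float::INFINITY if usage is flat or shrinking.
def minutes_until_limit(mem_before_gb, mem_after_gb, interval_minutes, limit_gb)
  growth_per_min = (mem_after_gb - mem_before_gb) / interval_minutes.to_f
  return Float::INFINITY if growth_per_min <= 0
  (limit_gb - mem_after_gb) / growth_per_min
end

# Hypothetical: 0.5GB of growth over 10 minutes with 6GB of headroom left
puts minutes_until_limit(20.0, 20.5, 10, 26.5)  # => 120.0
```

If the projected time to the limit is short relative to the expected remaining indexing time, that is the signal to pause indexing before Redis risks being OOM killed.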