Enable Elasticsearch integration on `gitlab.com` for the `gitlab-org` namespace
C2
Production Change - Criticality 2

| Change Objective | enable elasticsearch integration on gitlab.com on gitlab-org namespace |
|---|---|
| Change Type | terraform, gitlab config change through the admin panel, new elastic.co clusters |
| Services Impacted | GCP firewall, sidekiq, postgres, gitaly, search feature in gitlab.com |
| Change Team Members | @mwasilewski-gitlab and @nick.thomas |
| Change Severity | C2 |
| Buddy check or tested in staging | the change was tested on staging environment: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6851 |
| Schedule of the change | 2019-06-24 09:10:00 UTC |
| Duration of the change | ~1h (including a possible rollback) |

Detailed steps for the change:
prep:
- open firewall on console (MR): there are no `deny_all` egress rules in gprd, so the MR is no longer relevant
- create 2 elk clusters as described in the blueprint
rollout:
- put in elastic creds
- notify on-call in the production channel on slack that the change is being rolled out
- enable elastic integration in the gitlab admin panel, limit to groups, do not put in any groups (pay attention to this step: if you enable the integration and do not limit it to specific namespaces, the entire instance will be indexed)
- on a machine with access to elastic (console):
  - `gitlab-rake gitlab:elastic:create_empty_index`
  - `gitlab-rake gitlab:elastic:clear_index_status`
- add the `gitlab-org` namespace in the admin panel (this will start scheduling elastic jobs, keep an eye on sidekiq and gitaly)
- confirm sidekiq jobs are processed without any errors
  - grafana, "job failures and errors mtail panel", or directly in prometheus
  - search in kibana with logs from sidekiq jobs; in the past this gave mixed results, because not all types of failures are visible in these logs (e.g. lack of mappings in the index did not result in failures/errors in sidekiq jobs)
- and index is growing
  - go to Kibana on the elastic.co cluster created for monitoring (link) and navigate to Monitoring -> Indices -> `gitlab-production`. Index rate != 0 and document count going up means the index is growing
- ensure personal snippets are indexed
- once the initial indexing is finished (index stopped growing) enable searching with elastic
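
The "index is growing" and "index stopped growing" checks can also be scripted instead of watched in Kibana. The following is a minimal sketch, assuming document counts are sampled periodically from Elasticsearch's `GET /gitlab-production/_count` endpoint. The helper names, the `stable_samples` threshold and the sampling itself are illustrative assumptions, not part of this change plan.

```python
import json
from typing import Sequence


def parse_count(response_body: str) -> int:
    """Extract the document count from a `GET /<index>/_count` response body."""
    return int(json.loads(response_body)["count"])


def index_is_growing(samples: Sequence[int]) -> bool:
    """Return True if the newest sample is larger than the one before it."""
    if len(samples) < 2:
        return False  # not enough data to decide
    return samples[-1] > samples[-2]


def indexing_finished(samples: Sequence[int], stable_samples: int = 3) -> bool:
    """Heuristic (assumption): indexing is treated as finished once the last
    `stable_samples` counts are identical and non-zero."""
    if len(samples) < stable_samples:
        return False
    tail = list(samples[-stable_samples:])
    return tail[0] > 0 and all(count == tail[0] for count in tail)
```

A stable count is only a hint: a stalled sidekiq queue also produces a flat document count, so this should be combined with the sidekiq checks above before enabling search.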
rollback:
- see runbooks for more details on this
- disable elastic integration in the admin panel
- destroy all `ElasticsearchIndexedNamespace` (this will effectively stop creation of new sidekiq jobs)
- recreate index
- clear indexing status
Possible problems/notes:
- blueprint (`source/handbook/engineering/infrastructure/blueprint/201904-indexing-with-elastic-one-namespaces/index.html.md`): gitlab-com/www-gitlab-com!21352 (merged)
- firewall on console/web/sidekiq (all indexers should run on sidekiq and there are no egress deny rules for the sidekiq fleet, so in principle there should be no problem with the firewall)
- not enough db connections -> increase db_pool
- gitaly overloaded -> limit number of concurrent sidekiq elastic jobs
- sidekiq besteffort nodes flooded -> add more nodes to the fleet
- if possible completely avoid rake tasks, do not trigger indexing jobs (let gitlab take care of it)
Edited by Michal Wasilewski