switch logging to the new ES7 clusters
C4
Production Change - Criticality 4

Change Objective | Switch logging to the new ES7 clusters |
---|---|
Change Type | Configuration change |
Services Impacted | logging |
Change Team Members | @mwasilewski-gitlab |
Change Severity | C4 |
Buddy check or tested in staging | A colleague will review the change, and the change was tested in the staging environment |
Schedule of the change | |
Duration of the change | the execution itself lasts ~30 mins, but we will be gradually rolling this out over the span of a few days |
Detailed steps for the change. Each step must include: |
issue boards:
- https://gitlab.com/groups/gitlab-com/gl-infra/-/boards/1324108?milestone_title=Dev%20%26%20Ops%3A%202019-09-23%20-%202019-10-06&scope=all&utf8=%E2%9C%93&state=opened&label_name[]=Service%3AELK
- https://gitlab.com/groups/gitlab-com/gl-infra/-/boards/1222376?milestone_title=Stand%20up%20a%20new%20ES%207%20cluster%20with%20hot-warm%20deployment
Pre:
- 2019-08-28 08:30:00 UTC switch gstg env: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1693/diffs
- 2019-08-29 announce these changes in the following Slack channels: security, infrastructure-lounge, production, backend, development, support_gitlab-com. The purpose of this is to give people an early warning that a change like this is coming, for example:
  > Hello! In an effort to improve scalability of our logging infrastructure, we are moving to a more recent version of ElasticStack and starting to use a number of ES features. By the end of this week logs from gstg, dr, pre and ops environments will all be available at https://nonprod-log.gitlab.net/. During the week of 16th-20th September logs from gprd environment will be gradually switched over to https://prod-log.gitlab.net/. For details on which indices will be switched when see: https://gitlab.com/gitlab-com/gl-infra/production/issues/1098 If you have any concerns or questions please reach out to the infra team in the infrastructure-lounge Slack channel. For more details please see this epic: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6593 and the issue board: https://gitlab.com/groups/gitlab-com/gl-infra/-/boards/1222376?milestone_title=Stand%20up%20a%20new%20ES%207%20cluster%20with%20hot-warm%20deployment
- 2019-08-29 switch dr, ops and pre:
  - create indices for pre env (I only did it for gstg, dr and ops)
  - merge chef MR: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1700 (it should be automatically uploaded to the chef server by a CI job and picked up by scheduled `chef-client` runs)
  - update creds in GKMS for `gitlab-elk` cookbook in all of these envs
- create the new production cluster:
  - 3 hot VMs max spec (1 VM in each zone), 2 warm VMs max spec
  - ILM policy, templates, indices + aliases, `log-proxy` and `pubsub` user
- update credentials in GKMS for the proxy, test https://log.gprd.gitlab.net/ is operational
- send logs from one of the log streams to the new cluster for a couple of days
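
For illustration, the hot-warm ILM policy mentioned in the cluster setup step could look roughly like the sketch below. It only builds the JSON body that would be sent to the Elasticsearch `_ilm/policy/<name>` endpoint. All thresholds, the function name and the node attribute `data: warm` are assumptions made for this example; the actual values used for the production cluster are not stated in this issue.

```python
import json


def build_hot_warm_policy(
    rollover_max_age="1d",      # assumed value, not from this issue
    rollover_max_size="50gb",   # assumed value, not from this issue
    warm_after="2d",            # assumed value, not from this issue
    delete_after="7d",          # assumed value, not from this issue
    warm_attr="data",           # assumed node attribute name (node.attr.data)
    warm_value="warm",          # assumed attribute value set on the warm VMs
):
    """Return an ILM policy body with hot, warm and delete phases."""
    return {
        "policy": {
            "phases": {
                # Hot phase: keep writing to the index until it is old or large
                # enough, then roll over to a fresh index behind the alias.
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                # Warm phase: relocate shards to nodes tagged as warm.
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        "allocate": {"require": {warm_attr: warm_value}}
                    },
                },
                # Delete phase: drop the index once it is past retention.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the body so it can be reviewed before being applied to a cluster.
    print(json.dumps(build_hot_warm_policy(), indent=2))
```

Building the body in one place makes it easier to apply the same policy shape to the nonprod and prod clusters with different thresholds.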
Roll out:
- 2019-10-x announce the change again in the following Slack channels: security, infrastructure-lounge, production, backend, development, support_gitlab-com:
  > Hello! In the coming days we will be applying a number of changes to our logging infrastructure. We will be switching production logs to the latest ES version (gstg, dr, pre, ops are already on it). Logs will be gradually switched from https://log.gitlab.net/ to https://log.gprd.gitlab.net/ . For more details please see: https://gitlab.com/gitlab-com/gl-infra/production/issues/1098 If you have any concerns or questions please reach out to the infra team in the #infrastructure-lounge Slack channel.
- 2019-10-x put a MOTD in the elastic clusters explaining we're in the middle of a switch and link to this issue
- for each pubsubbeat VM:
  - `systemctl stop chef-client; systemctl disable chef-client;`
  - manually download binary compatible with ES7 and manually update config file (see diff of the cookbook between latest version and current version in prod)

| status | Date | indices | notes |
|---|---|---|---|
| | 2019-10-x 12:00:00 UTC | rails, workhorse, unicorn | |
| | 2019-10-x 12:00:00 UTC | all remaining logs | |
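
As a rough sketch of the per-VM step above, the snippet below builds (but does not execute) one `ssh` command per pubsubbeat VM that stops and disables chef-client. The hostnames are hypothetical placeholders, and plain `ssh` access with sufficient privileges is an assumption; the commands are joined with `&&` rather than `;` so that the disable only runs if the stop succeeded.

```python
import shlex

# Commands from the roll-out step, chained so a failed stop aborts the disable.
FREEZE_CHEF = "systemctl stop chef-client && systemctl disable chef-client"


def freeze_chef_commands(hosts):
    """Return one argv list per host, suitable for subprocess.run()."""
    return [["ssh", host, FREEZE_CHEF] for host in hosts]


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    example_hosts = ["pubsubbeat-example-01", "pubsubbeat-example-02"]
    for argv in freeze_chef_commands(example_hosts):
        # Dry run: print the command instead of executing it.
        print(shlex.join(argv))
```

Printing the commands first gives a chance to review the host list before anything is changed on the VMs.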
Post:
- after all pubsubbeat VMs have new config:
  - merge: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1703
  - update credentials for `gitlab-elk` cookbook with credentials for the new cluster
  - run chef-client in why-run mode (no changes should be listed)
  - `systemctl start chef-client; systemctl enable chef-client`. No changes should be applied
- move all saved dashboards, searches, visualizations, watchers. See: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8005
Roll back:
- revert MR that covers the relevant env
- revert creds in GKMS for `gitlab-elk` (so that logs are sent back to the old cluster)
Edited by Michal Wasilewski