Incremental Rollout of Pages API based config source after "other" fixes
C4
Production Change - Criticality 4
Change Objective | Incrementally roll out the new Pages API based config source: start with the list of predefined domains, then proceed with all other domains in batches. |
---|---|
Change Type | ConfigurationChange |
Services Impacted | GitLab-Pages |
Change Team Members | @vshushlin @krasio @grzesiek @aamarsanaa |
Change Severity | C4 |
Change Reviewer or tested in staging | Similar change was applied to staging: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8941 |
Dry-run output | If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result |
Due Date | Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change |
Time tracking | To estimate and record times associated with changes ( including a possible rollback ) |
Precondition
- Make sure Pages with the fixes is deployed to production: `/chatops run auto_deploy status 05d583ae7560672d1ea78f9fb5fc76f95d1dbf52`
- Make sure the test domains are served successfully (just open them in a browser):
  - vshushlin.gitlab.io
  - shushlin.dev
  - pages-rollout.gitlab.io
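The browser check above can also be scripted. The following is a minimal sketch, assuming `curl` is installed and that a final 2xx status (after following redirects) counts as "served successfully"; the function names `is_success_status` and `check_test_domains` are hypothetical and not part of any GitLab tooling.

```shell
# Precondition check sketch: request each test domain and report its HTTP status.
# Assumption: a 2xx status after redirects means the domain is served correctly.
TEST_DOMAINS="vshushlin.gitlab.io shushlin.dev pages-rollout.gitlab.io"

# Return 0 only for 2xx status codes.
is_success_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch every test domain over HTTPS; return non-zero if any of them fails.
check_test_domains() {
  local failed=0 domain status
  for domain in $TEST_DOMAINS; do
    # curl prints 000 as the status when the request itself fails.
    status=$(curl -s -L -o /dev/null --max-time 10 -w '%{http_code}' "https://${domain}/")
    if is_success_status "$status"; then
      echo "OK   ${domain} (${status})"
    else
      echo "FAIL ${domain} (${status})"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_test_domains` before and after each rollout step gives a quick pass/fail signal, but it does not replace looking at the rendered pages.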
Detailed steps for the change
On Infra side
- 2020-03-11 01:00 (UTC): Roll out to 20% of the domains
  - `ssh` to `pages-01-stor-gprd`
  - `sudo vi /var/opt/gitlab/gitlab-rails/shared/pages/.gitlab-source-config.yml`
  - Replace the file content with the content below, then save and quit:

        domains:
          enabled:
            - vshushlin.gitlab.io
            - shushlin.dev
            - pages-rollout.gitlab.io
          rollout:
            percentage: 20

  - `sudo chown git:git /var/opt/gitlab/gitlab-rails/shared/pages/.gitlab-source-config.yml`
- 2020-03-12 04:34 (UTC): Roll out to 50% of the domains. Repeat the previous steps, but change `20` to `50`.
- 2020-03-13 01:00 (UTC): Roll out to 100% of the domains. Repeat the previous steps, but change `50` to `100`.
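Because the three rollout steps differ only in the percentage, the file content could be generated instead of edited by hand. The sketch below is one way to do that, not the procedure that was actually run: `render_source_config` is a hypothetical helper, and the nesting of `rollout` under `domains` is an assumption about the YAML layout. Printing to stdout first doubles as a dry run.

```shell
# Sketch: render the source config for a given rollout percentage.
# Assumptions: render_source_config is a hypothetical helper, and the
# rollout key is nested under domains in the YAML layout.
CONFIG_PATH=/var/opt/gitlab/gitlab-rails/shared/pages/.gitlab-source-config.yml

render_source_config() {
  local percentage="$1"
  # Reject empty or non-numeric input before comparing it as a number.
  case "$percentage" in
    ''|*[!0-9]*) echo "percentage must be an integer" >&2; return 1 ;;
  esac
  if [ "$percentage" -gt 100 ]; then
    echo "percentage must be between 0 and 100" >&2
    return 1
  fi
  cat <<EOF
domains:
  enabled:
    - vshushlin.gitlab.io
    - shushlin.dev
    - pages-rollout.gitlab.io
  rollout:
    percentage: ${percentage}
EOF
}

# Dry run (prints the file content only):
#   render_source_config 50
# Apply on pages-01-stor-gprd, then restore ownership:
#   render_source_config 50 | sudo tee "$CONFIG_PATH" >/dev/null
#   sudo chown git:git "$CONFIG_PATH"
```

Comparing the dry-run output with the current file before applying makes it easy to confirm that only the percentage changes between steps.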
Monitoring / Validation
Visit https://vshushlin.gitlab.io, https://vshushlin.gitlab.io/gitlab-meetup-pages, https://shushlin.dev/, and http://pages-rollout.gitlab.io/ several times each. The requests should show up in the logs and dashboards below:
- API endpoint logs
- Visualization of API endpoint request duration based on application logs
- Grafana dashboard for the API endpoint
- web-pages service overview
- API Service Overview
- (optional for this issue) Prometheus graph:
  - should increase every time you access shushlin.dev or any other domain
  - an increase in 400s or 500s indicates a bug
- (optional for this issue) Grafana dashboard: https://dashboards.gitlab.net/d/_IQB_rSmk/pages?orgId=1&refresh=1m&from=now-3h&to=now&var-worker=All
- CPU graph across the web-pages fleet: https://thanos-query.ops.gitlab.net/graph?g0.range_input=2d&g0.max_source_resolution=0s&g0.expr=instance%3Acpu_utilization%3Aratio_avg%7Bfqdn%3D~%22web-pages-.*%22%2C%20environment%3D%22gprd%22%7D&g0.tab=0
- Memory graph across the web-pages fleet: https://thanos-query.ops.gitlab.net/graph?g0.range_input=2d&g0.max_source_resolution=0s&g0.expr=instance%3Amemory_utilization%3Aratio_avg%7Bfqdn%3D~%22web-pages.*%22%2C%20environment%3D%22gprd%22%7D&g0.tab=0
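Visiting the test URLs repeatedly can be done from a terminal as well. This is a rough sketch assuming `curl` is available; `generate_traffic` and the `CURL_BIN` override are hypothetical names introduced here, and setting `CURL_BIN=echo` turns the loop into a dry run that only prints the commands' arguments.

```shell
# Sketch: request each test URL several times so the traffic appears in the
# logs and dashboards. CURL_BIN and generate_traffic are hypothetical names.
CURL_BIN="${CURL_BIN:-curl}"
TEST_URLS="https://vshushlin.gitlab.io https://vshushlin.gitlab.io/gitlab-meetup-pages https://shushlin.dev/ http://pages-rollout.gitlab.io/"

# Usage: generate_traffic [rounds]   (defaults to 10 rounds over all URLs)
generate_traffic() {
  local rounds="${1:-10}" i=1 url
  while [ "$i" -le "$rounds" ]; do
    for url in $TEST_URLS; do
      # Print the status code and final URL for each request; discard the body.
      "$CURL_BIN" -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$url"
    done
    i=$((i + 1))
  done
}
```

A run such as `generate_traffic 20` should produce a visible bump in the API endpoint logs; a growing share of 4xx or 5xx codes in its output is worth investigating before moving to the next percentage.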
Rollback steps
- `ssh` to `pages-01-stor-gprd`
- `sudo vi /var/opt/gitlab/gitlab-rails/shared/pages/.gitlab-source-config.yml`
- Replace the file content with the content below, then save and quit:

      domains:
        enabled:
          - vshushlin.gitlab.io
          - shushlin.dev
          - pages-rollout.gitlab.io
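After a rollback it can help to confirm that the file no longer carries a rollout percentage. The helper below is a small sketch for that check; `rollout_percentage` is a hypothetical name, and it does a plain text match with `awk` rather than parsing the YAML, so it assumes the `percentage:` key appears at most once in the file.

```shell
# Sketch: print the rollout percentage found in a source config file,
# or "none" when the file has no percentage key (the rolled-back state).
# Assumption: a simple text match is enough; this is not a YAML parser.
rollout_percentage() {
  local file="$1" value
  value=$(awk '$1 == "percentage:" { print $2; exit }' "$file")
  echo "${value:-none}"
}

# Example on pages-01-stor-gprd (may need sudo to read the file):
#   rollout_percentage /var/opt/gitlab/gitlab-rails/shared/pages/.gitlab-source-config.yml
```

The same check is useful between rollout steps to confirm which percentage is currently in effect on disk.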
Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- Person on-call has been informed prior to change being rolled out
Edited by Krasimir Angelov