DON'T DO IT: 5% rollout of Pages API
Production Change - Criticality 2 (C2)
THIS IS BLOCKED BY:
- testing percentage rollout on staging: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8941#note_285210117
- testing Pages API in production: #1639 (closed)
| Change Objective | Test the Pages API on 5% of Pages domains |
|---|---|
| Change Type | ConfigurationChange |
| Services Impacted | GitLab-Pages |
| Change Team Members | Name of the engineers involved in the change |
| Change Severity | C2 |
| Change Reviewer | A colleague who will review the change |
| Tested in staging | Evidence or assertion that the change was tested in the staging environment |
| Dry-run output | If the change is done through a script, the script must support a dry run: run the change in dry-run mode and paste the output here |
| Due Date | Date and time (UTC) for the execution of the change; if possible, also give the local time zone of the engineer executing the change |
| Time tracking | Estimated and recorded times for the change (including a possible rollback) |
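This change is a single file placement, so a script is optional. If one were used, the Dry-run output row would require a dry-run mode; the sketch below shows one way to do that. `render_config`, `apply_change` and the write-then-rename approach are hypothetical choices for illustration, not an existing GitLab tool; the rendered YAML mirrors the rollout config in the detailed steps.

```python
import os
from typing import Iterable

CONFIG_NAME = ".gitlab-source-config.yml"


def render_config(enabled: Iterable[str], percentage: int) -> str:
    """Render the rollout config as YAML text (same shape as in the detailed steps)."""
    if not 0 <= percentage <= 100:
        raise ValueError(f"percentage must be between 0 and 100, got {percentage}")
    lines = ["domains:", "  enabled:"]
    lines += [f"    - {domain}" for domain in enabled]
    lines += ["  rollout:", f"    percentage: {percentage}"]
    return "\n".join(lines) + "\n"


def apply_change(pages_dir: str, enabled: Iterable[str], percentage: int,
                 dry_run: bool = True) -> str:
    """Write the config into pages_dir, or only print it when dry_run is True."""
    path = os.path.join(pages_dir, CONFIG_NAME)
    content = render_config(enabled, percentage)
    if dry_run:
        # Dry run: show exactly what would be written and touch nothing on disk.
        print(f"[dry-run] would write {path}:\n{content}", end="")
        return content
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(content)
    # Rename into place in one step so Pages never reads a half-written file.
    os.replace(tmp_path, path)
    return content
```

Dry-run is the default, so `apply_change("/var/opt/gitlab/gitlab-rails/shared/pages", ["vshushlin.gitlab.io", "shushlin.dev", "pages-rollout.gitlab.io"], 5)` only prints the file; passing `dry_run=False` writes it.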
Detailed steps for the change
The rollout is controlled by a file that Pages watches. The file must be named `.gitlab-source-config.yml` and placed in the directory the Pages process runs from, the same directory as the `.update` file currently used. This should be `Settings.pages.path`, which I think in the case of staging and production should be `/var/opt/gitlab/gitlab-rails/shared/pages`. The file should contain:
```yaml
domains:
  enabled:
    - vshushlin.gitlab.io
    - shushlin.dev
    - pages-rollout.gitlab.io
  rollout:
    percentage: 5
```
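The config combines an explicit allow-list (`enabled`) with a percentage for all other domains. The issue does not say how Pages picks the 5%, so the sketch below assumes a stable hash of the domain name into 100 buckets. `rollout_bucket` and `uses_api_source` are hypothetical names, and `config` stands for the YAML above after parsing (the YAML parser itself is not shown).

```python
import hashlib
from typing import Any, Mapping, Optional


def rollout_bucket(domain: str) -> int:
    """Map a domain to a stable bucket in [0, 100) (hypothetical hashing scheme)."""
    # sha256 is deterministic across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def uses_api_source(domain: str, config: Optional[Mapping[str, Any]]) -> bool:
    """Return True if `domain` should be resolved through the Pages API source."""
    if not config:
        # No config file at all: every domain stays on the current source.
        return False
    domains = config.get("domains") or {}
    # Domains listed under `enabled` always use the API source.
    enabled = {d.lower() for d in domains.get("enabled") or []}
    if domain.lower() in enabled:
        return True
    percentage = int((domains.get("rollout") or {}).get("percentage", 0))
    # A domain always hashes to the same bucket, so its source does not flip between requests.
    return rollout_bucket(domain) < percentage


# The YAML from the detailed steps, as a YAML parser would return it.
config = {
    "domains": {
        "enabled": ["vshushlin.gitlab.io", "shushlin.dev", "pages-rollout.gitlab.io"],
        "rollout": {"percentage": 5},
    }
}
```

With bucket-based selection, raising `percentage` from 5 to a higher value only adds domains to the API source; domains already in the rollout stay in it.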
Rollback steps
Remove the `.gitlab-source-config.yml` file.
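Rollback works because Pages watches the file. The issue does not describe the watching mechanism; the sketch below assumes a simple poll on the file's modification time, with a missing file meaning "no rollout config". `SourceConfigWatcher` is a hypothetical name, and gitlab-pages may implement this differently.

```python
import os
import tempfile
from typing import Optional


class SourceConfigWatcher:
    """Polls a config file and tracks its content (hypothetical helper)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self._mtime: Optional[float] = None
        self.content: Optional[str] = None

    def poll(self) -> bool:
        """Reload if the file appeared, changed or disappeared; return True on change."""
        try:
            mtime = os.stat(self.path).st_mtime
        except FileNotFoundError:
            # File removed (the rollback step): drop the config entirely.
            changed = self.content is not None
            self._mtime, self.content = None, None
            return changed
        if mtime == self._mtime:
            return False
        with open(self.path, encoding="utf-8") as f:
            self.content = f.read()
        self._mtime = mtime
        return True


# Usage: simulate creating the file, then the rollback step of removing it.
with tempfile.TemporaryDirectory() as pages_dir:
    cfg = os.path.join(pages_dir, ".gitlab-source-config.yml")
    watcher = SourceConfigWatcher(cfg)
    events = [watcher.poll()]          # no file yet -> nothing to load
    with open(cfg, "w", encoding="utf-8") as f:
        f.write("domains:\n  rollout:\n    percentage: 5\n")
    events.append(watcher.poll())      # file created -> config loaded
    loaded = watcher.content
    os.remove(cfg)                     # rollback: remove the file
    events.append(watcher.poll())      # file gone -> config dropped
    after_rollback = watcher.content
```

Polling on modification time can miss two writes within the filesystem's timestamp resolution; a real daemon might also compare content or use filesystem notifications.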
Changes checklist
- Detailed steps and rollback steps have been filled in prior to commencing work
- The person on-call has been informed prior to the change being rolled out
Verifying that it works
These two metrics should go up:
- `gitlab_pages_domains_source_cache_hit`: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=auto&g0.expr=gitlab_pages_domains_source_cache_hit&g0.tab=0&g1.range_input=1h&g1.max_source_resolution=0s&g1.expr=&g1.tab=1
- `gitlab_pages_domains_source_cache_miss`: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=auto&g0.expr=gitlab_pages_domains_source_cache_miss&g0.tab=0&g1.range_input=1h&g1.max_source_resolution=0s&g1.expr=&g1.tab=1
Also check:
- Status codes should not change; an increase in 4xx or 5xx responses would indicate a bug: https://dashboards.gitlab.net/d/_IQB_rSmk/pages?orgId=1&refresh=1m&from=now-3h&to=now&var-worker=All
- API load: we might generate a slight increase, but it will probably only be noticeable at a higher rollout percentage: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=1581488564810&to=1581510164810&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
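The same checks can be scripted against Thanos, which serves the Prometheus-compatible `/api/v1/query` endpoint. This sketch assumes the two cache metrics are counters (so `rate()` applies) and that you have access to the ops Thanos instance. The helper names, the 5m window and `error_ratio` are illustrative; the per-status request counts would come from the query behind the Pages dashboard, which the issue does not show.

```python
import json
from urllib.parse import urlencode

THANOS = "https://thanos-query.ops.gitlab.net"

# Hypothetical PromQL, assuming the cache metrics are counters.
HIT_RATE = "sum(rate(gitlab_pages_domains_source_cache_hit[5m]))"
MISS_RATE = "sum(rate(gitlab_pages_domains_source_cache_miss[5m]))"


def instant_query_url(expr: str, base: str = THANOS) -> str:
    """Build an instant-query URL for the Prometheus-compatible HTTP API."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def vector_sum(body: str) -> float:
    """Sum the sample values of an instant-vector response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    # Each sample is {"metric": {...}, "value": [<unix ts>, "<value as string>"]}.
    return sum(float(s["value"][1]) for s in payload["data"]["result"])


def error_ratio(errors: float, total: float) -> float:
    """Fraction of responses that were errors; 0 when there was no traffic."""
    return errors / total if total else 0.0
```

Fetch `instant_query_url(HIT_RATE)` with any HTTP client that can reach the ops Thanos instance and pass the response body to `vector_sum`; a rate above zero once the config file is in place suggests the API source is being used. Comparing `error_ratio` before and after the change covers the status-code check.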
Edited by Vladimir Shushlin