
DON'T DO IT: 5% rollout of Pages API

Production Change - Criticality 2 C2

THIS IS BLOCKED BY

  1. testing percentage rollout on staging: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8941#note_285210117
  2. testing pages API in production: #1639 (closed)

Change Objective: Test Pages API on 5% of pages domains
Change Type: ConfigurationChange
Services Impacted: GitLab-Pages
Change Team Members: Name of the engineers involved in the change
Change Severity: C2
Change Reviewer: A colleague who will review the change
Tested in staging: Evidence or assertion the change was tested on staging environment
Dry-run output: If the change is done through a script, it is mandatory to have a dry-run capability in the script, run the change in dry-run mode and output the result
Due Date: Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change
Time tracking: To estimate and record times associated with changes (including a possible rollback)

Detailed steps for the change

The rollout is controlled by a file that Pages watches. The file must be named .gitlab-source-config.yml and placed in the directory the Pages process runs from, the same location as the .update file currently in use. That directory should be Settings.pages.path, which I think is /var/opt/gitlab/gitlab-rails/shared/pages on both staging and production. The file contents are:

domains:
  enabled:
    - vshushlin.gitlab.io
    - shushlin.dev
    - pages-rollout.gitlab.io
  rollout:
    percentage: 5
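
To make the intent of this config concrete, here is a minimal Python sketch of how an allow-list combined with a percentage rollout could be evaluated. It is an illustration under assumptions, not the GitLab Pages implementation: the function name `uses_api_source`, the CRC32 bucketing, and the rule that `enabled` domains always qualify are all hypothetical.

```python
import zlib


def uses_api_source(domain: str, enabled: list[str], percentage: int) -> bool:
    """Return True if `domain` would be served from the Pages API source.

    Assumed semantics (hypothetical, not taken from the Pages code):
    domains listed under `enabled` always qualify; every other domain is
    hashed into one of 100 buckets and qualifies when its bucket number
    is below `percentage`.
    """
    if domain in enabled:
        return True
    bucket = zlib.crc32(domain.encode("utf-8")) % 100
    return bucket < percentage


# Values mirror the config above.
ENABLED = ["vshushlin.gitlab.io", "shushlin.dev", "pages-rollout.gitlab.io"]

if __name__ == "__main__":
    # Made-up sample domains, only to show the selection being applied.
    sample = [f"project-{i}.gitlab.io" for i in range(10_000)]
    selected = sum(uses_api_source(d, ENABLED, 5) for d in sample)
    print(f"{selected} of {len(sample)} sample domains selected")
```

Hashing the domain name, rather than drawing a random number per request, keeps the decision for a given domain stable across requests and restarts. Whether Pages selects domains this way is an assumption of the sketch.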

Rollback steps

Remove the .gitlab-source-config.yml file.
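
Because the change template asks for a dry-run capability whenever a change is scripted, the apply and rollback steps could be wrapped in a small script. The following Python sketch is one possible way to do that. The function names and the `dry_run` flag are hypothetical, and the target directory is supplied by the caller rather than assumed.

```python
import os
import tempfile
from pathlib import Path

CONFIG_NAME = ".gitlab-source-config.yml"  # file name from the detailed steps

CONFIG_BODY = """\
domains:
  enabled:
    - vshushlin.gitlab.io
    - shushlin.dev
    - pages-rollout.gitlab.io
  rollout:
    percentage: 5
"""


def apply_config(pages_dir: Path, dry_run: bool = True) -> str:
    """Write the rollout config into pages_dir; only report when dry_run."""
    target = pages_dir / CONFIG_NAME
    if dry_run:
        return f"DRY RUN: would write {target}"
    # Write to a temporary file in the same directory, then rename it,
    # so a watcher never sees a half-written config under the final name.
    fd, tmp = tempfile.mkstemp(dir=pages_dir, prefix=".tmp-source-config-")
    with os.fdopen(fd, "w") as handle:
        handle.write(CONFIG_BODY)
    os.replace(tmp, target)
    return f"wrote {target}"


def rollback(pages_dir: Path, dry_run: bool = True) -> str:
    """Remove the rollout config from pages_dir; only report when dry_run."""
    target = pages_dir / CONFIG_NAME
    if dry_run:
        return f"DRY RUN: would remove {target}"
    target.unlink(missing_ok=True)  # no error if the file is already gone
    return f"removed {target}"
```

Both functions default to `dry_run=True`, so the real write or delete has to be requested explicitly. Note that `tempfile.mkstemp` creates the file readable by its owner only, so the mode or ownership may need adjusting if the Pages process runs as a different user.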

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • Person on-call has been informed prior to change being rolled out

Verifying that it works

The first two metrics below should go up; the remaining items are dashboards to watch for regressions:

  1. https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=auto&g0.expr=gitlab_pages_domains_source_cache_hit&g0.tab=0&g1.range_input=1h&g1.max_source_resolution=0s&g1.expr=&g1.tab=1
  2. https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.max_source_resolution=auto&g0.expr=gitlab_pages_domains_source_cache_miss&g0.tab=0&g1.range_input=1h&g1.max_source_resolution=0s&g1.expr=&g1.tab=1
  3. Status codes should not change: https://dashboards.gitlab.net/d/_IQB_rSmk/pages?orgId=1&refresh=1m&from=now-3h&to=now&var-worker=All , an increase in 4xx or 5xx responses would indicate a bug
  4. https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&from=1581488564810&to=1581510164810&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2 - we might generate a slight increase in API load, but that will probably only be noticeable at a higher rollout percentage.
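
The checks above can also be expressed numerically. As a rough sketch, assuming the two counters and the per-status request counts have been sampled before and after the change, the comparison might look like the following. The sampling itself, the function names, and the 1 percentage point tolerance are hypothetical.

```python
CACHE_COUNTERS = (
    "gitlab_pages_domains_source_cache_hit",
    "gitlab_pages_domains_source_cache_miss",
)


def counters_increased(before: dict[str, float], after: dict[str, float]) -> bool:
    """True if both source-cache counters grew between the two samples."""
    return all(after[name] > before[name] for name in CACHE_COUNTERS)


def error_share(status_counts: dict[int, int]) -> float:
    """Fraction of responses with a 4xx or 5xx status code."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    return errors / total


def error_share_regressed(
    before: dict[int, int], after: dict[int, int], tolerance: float = 0.01
) -> bool:
    """True if the 4xx/5xx share rose by more than `tolerance` (assumed threshold)."""
    return error_share(after) - error_share(before) > tolerance
```

Comparing shares rather than raw counts avoids flagging a change in overall traffic as a regression.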
Edited by Vladimir Shushlin