Rollout serving migrated data (feature flag `pages_serve_from_migrated_zip`)

What

Roll out the :pages_serve_from_migrated_zip feature flag, which makes us serve the data migrated as part of https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3252.

Introduced by: !52573 (merged)

Owners

  • Team: GitLab Pages
  • Most appropriate Slack channel to reach out to: #gitlab_pages
  • Best individual to reach out to: @vshushlin @ayufan

Expectations

What are we expecting to happen?

Migrated projects will be served from the ZIP artifact in Object Storage instead of from VFS disk.

What might happen if this goes wrong?

Data might not be served properly.

What can we monitor to detect problems with this?

Similar to: gitlab-com/gl-infra/production#2808 (comment 430825927)

  • Monitor the percentage of pages using VFS zip: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.stacked=1&g0.max_source_resolution=0s&g0.expr=sum(rate(gitlab_pages_vfs_operations_total%5B5m%5D))%20by%20(vfs_name)&g0.tab=0 => The number of files served with zip should increase
  • Monitor TTFB for Object Storage: https://thanos-query.ops.gitlab.net/graph?g0.range_input=12h&g0.max_source_resolution=0s&g0.expr=avg(rate(gitlab_pages_httprange_trace_duration_sum%7Brequest_stage%3D%22httptrace.ClientTrace.GotFirstResponseByte%22%7D%5B5m%5D)%2Frate(gitlab_pages_httprange_trace_duration_count%7Brequest_stage%3D%22httptrace.ClientTrace.GotFirstResponseByte%22%7D%5B5m%5D))&g0.tab=0
  • Monitor average latency: ZIP vs NFS: https://log.gprd.gitlab.net/goto/c6ef577321c0cba3c77086aef974202f
  • Unique domains: https://log.gprd.gitlab.net/goto/80a0b0682e4987c0180bac8d39b80143
  • Caching of ZIP archives: https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.moment_input=2021-02-22%2015%3A15%3A02&g0.max_source_resolution=0s&g0.expr=avg(gitlab_pages_zip_cached_entries%7Bop%3D%22archive%22%7D)&g0.tab=0
  • Amount of cached ZIP entries: https://prometheus-app.gprd.gitlab.net/graph?g0.expr=avg(gitlab_pages_zip_archive_entries_cached)&g0.tab=0&g0.stacked=0&g0.range_input=1h
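The PromQL expressions in the links above are URL-encoded, which makes them hard to review at a glance. As a minimal sketch (the helper name `promql_from_graph_url` is hypothetical, and it assumes the expression lives in a `gN.expr` query parameter, as it does in the links above), the query can be decoded with the Python standard library:

```python
from urllib.parse import urlsplit, parse_qs


def promql_from_graph_url(url: str, panel: int = 0) -> str:
    """Return the decoded PromQL expression of one panel in a graph URL."""
    # parse_qs percent-decodes values, so %5B5m%5D becomes [5m].
    query = parse_qs(urlsplit(url).query)
    return query[f"g{panel}.expr"][0]


# The "percentage of pages using VFS zip" link from the list above.
vfs_url = (
    "https://thanos-query.ops.gitlab.net/graph?g0.range_input=1h&g0.stacked=1"
    "&g0.max_source_resolution=0s"
    "&g0.expr=sum(rate(gitlab_pages_vfs_operations_total%5B5m%5D))%20by%20(vfs_name)"
    "&g0.tab=0"
)

print(promql_from_graph_url(vfs_url))
```

For the first link this yields the per-`vfs_name` rate of VFS operations, which is the series expected to shift toward zip as the rollout progresses.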

We do a percentage rollout:

# 5% of projects
/chatops run feature set pages_serve_from_migrated_zip 5 --actors

# 10% of projects
/chatops run feature set pages_serve_from_migrated_zip 10 --actors

# 25% of projects
/chatops run feature set pages_serve_from_migrated_zip 25 --actors

# 50% of projects
/chatops run feature set pages_serve_from_migrated_zip 50 --actors

# 100% of projects
/chatops run feature set pages_serve_from_migrated_zip 100 --actors
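To illustrate why a percentage-of-actors rollout only ever adds projects as the percentage grows, here is a simplified sketch. It is an assumption-laden model, not GitLab's actual feature-flag implementation: the hashing scheme, the `enabled_for` helper, and the `Project:<id>` actor names are all hypothetical.

```python
import zlib

FLAG = "pages_serve_from_migrated_zip"


def enabled_for(actor_id: str, percentage: int, flag: str = FLAG) -> bool:
    """Deterministically decide whether an actor falls inside the rollout."""
    # Hash flag name + actor into a stable bucket in [0, 100).
    # Assumption: the real backend may use a different hash and granularity.
    bucket = zlib.crc32(f"{flag}{actor_id}".encode()) % 100
    return bucket < percentage


# Hypothetical actors standing in for Pages projects.
actors = [f"Project:{i}" for i in range(1, 1001)]

for pct in (5, 10, 25, 50, 100):
    count = sum(enabled_for(a, pct) for a in actors)
    print(f"{pct:>3}% -> {count} of {len(actors)} hypothetical projects enabled")
```

Because each actor's bucket is fixed, a project enabled at 5% stays enabled at 10%, 25%, and so on, so each step only widens the set of projects served from ZIP.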

Roll Out Steps

  • Enable on staging (/chatops run feature set pages_serve_from_migrated_zip true --staging)
  • Test on staging
  • Ensure that documentation has been updated
  • [-] Continue performing percentage rollout of actors
  • Enable on production for a specific project (/chatops run feature set --project=ayufan/pages-jekyll pages_serve_from_migrated_zip true)
  • Coordinate a time to enable the flag with the SRE oncall and release managers
    • In #production mention @sre-oncall and @release-managers. Once an SRE on call and Release Manager on call confirm, you can proceed with the rollout
  • 5% rollout /chatops run feature set pages_serve_from_migrated_zip 5 --actors
  • 10% rollout /chatops run feature set pages_serve_from_migrated_zip 10 --actors
  • 25% rollout /chatops run feature set pages_serve_from_migrated_zip 25 --actors
  • 50% rollout /chatops run feature set pages_serve_from_migrated_zip 50 --actors
  • Enable a 100% rollout on GitLab.com by running the chatops command in #production (/chatops run feature set pages_serve_from_migrated_zip true)
  • Cross post chatops Slack command to #support_gitlab-com (more guidance when this is necessary in the dev docs) and in your team channel
  • Announce on the issue that the flag has been enabled
  • Remove feature flag and add changelog entry
  • After the flag removal is deployed, clean up the feature flag by running the chatops command in the #production channel

Rollback Steps

  • This feature can be disabled by running the following Chatops command:
/chatops run feature delete pages_serve_from_migrated_zip
Edited Mar 03, 2021 by Vladimir Shushlin