Rollout zip VFS for GitLab Pages

Production Change

Change Summary

We want to widely test the new GitLab Pages architecture, which serves sites from zip archives stored in object storage.

This issue tracks the feature flag rollout.

The original discussion is in gitlab-org/gitlab#231459 (closed).

Possible expected problems

  • Pages daemon consuming too much memory

    The Pages daemon caches in memory a list of all files of the websites it serves. If memory usage becomes a problem, it may be necessary to restart the Pages daemon as well after disabling the feature flag.

  • Users complain about range-requests not working

    This is expected because the new architecture does not support them yet. It may be necessary to roll back the feature.

Change Details

/label C2 changeunscheduled

  1. Services Impacted - GitLab Pages / artifacts object storage bucket
  2. Change Technician - @vshushlin
  3. Change Criticality - C2
  4. Change Type - changeunscheduled
  5. Change Reviewer - @ayufan
  6. Due Date - 2020-10-08 12:00 UTC
  7. Time tracking - no time, just feature flag change
  8. Downtime Component - no downtime expected

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Nothing

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - {1}

  • {enable feature flag}
/chatops run feature set pages_artifacts_archive 20 --actors

Only about 10% of deployed projects have a pages_metadatum with an archive (see the counts below), so enabling the flag for a given percentage of projects is only about 10% as effective as the percentage suggests.

https://gitlab.slack.com/archives/C1BSEQ138/p1602147362120400?thread_ts=1602147179.119100&cid=C1BSEQ138

[ gprd ] production> ProjectPagesMetadatum.where(deployed: true).count
=> 242850
[ gprd ] production> ProjectPagesMetadatum.where(deployed: true).where.not(artifacts_archive_id: nil).count
=> 25056
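
As a rough sanity check on the ~10% figure, the arithmetic can be sketched in Ruby using the two counts from the console output above. The helper name `effective_rollout` is made up for illustration and is not part of GitLab. The estimate assumes the share of projects with archives stays at the measured value, which may not hold as the rollout progresses.

```ruby
# Counts copied from the production console output above.
DEPLOYED_TOTAL = 242_850
DEPLOYED_WITH_ARCHIVE = 25_056

# Share of deployed projects that already have an artifacts archive.
ARCHIVE_SHARE = DEPLOYED_WITH_ARCHIVE.to_f / DEPLOYED_TOTAL

# Hypothetical helper (not a GitLab API): estimates the percentage of all
# deployed Pages projects served from zip for a given flag percentage,
# assuming the archive share does not change.
def effective_rollout(flag_percentage)
  flag_percentage * ARCHIVE_SHARE
end

puts format('archive share: %.1f%%', ARCHIVE_SHARE * 100)
puts format('5%% flag -> %.2f%% effective', effective_rollout(5))
```

With these counts, a 5% flag setting comes out at roughly half a percent of deployed projects.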

  • 2020-10-19 00:00 UTC Increase rollout to 5% (0.5% effective)
  • 2020-10-19 23:33 UTC Increase rollout to 25% (10% effective)
  • 2020-10-20 23:26 UTC Increase rollout to 50% (~42% effective)
  • 2020-10-22 01:12 UTC Increase rollout to 75%
  • 2020-10-22 06:25 UTC Increase rollout to 100%

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

Nothing

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

  • {disable feature flag}
/chatops run feature set pages_artifacts_archive 0 --actors

Monitoring

Key metrics to observe

  • Grafana dashboards:
  • Profiler
    • If you switch to "Heap", you'll see that (*zipArchive).readArchive accounts for a chunk of about 30%. This is the memory used to store the cached index of all files served by Pages with zip serving enabled.
    • It didn't change much after enabling the feature for 5%, most likely because docs.gitlab.com is huge and takes up most of this space, so 5% of the other projects don't make a noticeable difference.
  • cached entries
    • File entries cached. E.g. if we cache 3 zip archives with 100 files each, this metric should be 300. After the 5% rollout this metric grew on one of the servers, but memory usage didn't. Most likely we don't decrease this metric properly.
  • cached zip archives
    • Basically the number of projects that were recently served by the zip architecture and cached (for 1 minute). Started to grow after the 5% rollout. We think the gradual increase can be explained by the usual increase in Pages users over the course of the day: the more people use Pages, the more projects stay in the cache.
  • cache hits/misses
  • open archives count
    • Same as cached_zip_archives, but without the cache, so this counts individual projects accessed regardless of caching. Jumped after the 5% rollout and stayed at almost the same level, which is expected.
  • vfs operations total ZIP vs Local
    • Shows disk operations performed on the ZIP VFS (object storage) vs the local disk. Didn't change much with the 5% rollout; the majority of all operations are still for docs.gitlab.com.
  • vfs file open ZIP vs local
    • Same as above, but only for open file operations.
  • rate of opening new ZIP archives
    • Number of zip archives opened in 5 minutes. Jumped after the 5% rollout.
  • currently allocated memory in Go process
    • Didn't change after the 5% rollout.
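
To make the relationship between the cache metrics above concrete, here is a minimal Ruby sketch of a cache with a one-minute expiry that tracks cached archives, cached entries, and hits/misses. It is only an illustration: the Pages daemon is written in Go and its actual cache is not shown in this issue, so the class name, method names, and the fixed (non-refreshing) expiry are all assumptions.

```ruby
# Illustrative sketch only; every name here is hypothetical and does not
# mirror the real Pages daemon implementation.
class ArchiveCache
  TTL_SECONDS = 60 # "cached (for 1 minute)" per the metric description above

  attr_reader :cached_archives, :cached_entries, :hits, :misses

  # The clock is injectable so expiry can be exercised without sleeping.
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @store = {} # project path => { files: [...], expires_at: Float }
    @cached_archives = 0
    @cached_entries = 0
    @hits = 0
    @misses = 0
  end

  # Returns the cached file list, or yields to read the archive index on a miss.
  def fetch(project)
    evict_expired
    if (entry = @store[project])
      @hits += 1
      return entry[:files]
    end

    @misses += 1
    files = yield
    @store[project] = { files: files, expires_at: @clock.call + TTL_SECONDS }
    @cached_archives += 1
    @cached_entries += files.size
    files
  end

  private

  # Both gauges are decremented on eviction. Skipping the second decrement
  # would leave cached_entries growing even though memory is released.
  def evict_expired
    now = @clock.call
    @store.delete_if do |_project, entry|
      next false if entry[:expires_at] > now

      @cached_archives -= 1
      @cached_entries -= entry[:files].size
      true
    end
  end
end

# Usage: 3 archives with 100 files each, as in the "cached entries" example.
cache = ArchiveCache.new
3.times { |i| cache.fetch("project-#{i}") { Array.new(100) { |n| "file-#{n}.html" } } }
puts cache.cached_entries
```

In this sketch an entries gauge that keeps growing while memory stays flat would point at a missing decrement on eviction, which is one possible reading of the observation above, not a confirmed diagnosis.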

Summary of infrastructure changes

  • Does this change introduce new compute instances? - No
  • Does this change re-size any existing compute instances? - No
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc? - Adds some traffic to the artifacts object storage bucket

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
  • There are currently no active incidents.
Edited by Vladimir Shushlin