Rollout zip VFS for GitLab Pages
Production Change
Change Summary
We want to widely test the new GitLab Pages architecture, which serves sites from zip archives stored in object storage.
This issue tracks the feature flag rollout.
The original discussion is in gitlab-org/gitlab#231459 (closed).
Possible expected problems
-
Pages daemon consuming too much memory
The Pages daemon caches in memory a list of all files of the websites it serves. In this case, it may be necessary to restart the Pages daemon as well after disabling the feature flag.
-
Users complain about range-requests not working
This is expected because the new architecture does not support them yet. It may be necessary to roll back the feature.
Change Details
/label C2 changeunscheduled
- Services Impacted - GitLab Pages / artifacts object storage bucket
- Change Technician - @vshushlin
- Change Criticality - C2
- Change Type - changeunscheduled
- Change Reviewer - @ayufan
- Due Date - 2020-10-08 12-00 UTC
- Time tracking - no time, just feature flag change
- Downtime Component - no downtime expected
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
Nothing
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - {1}
-
{enable feature flag}
/chatops run feature set pages_artifacts_archive 20 --actors
Only about 10% of deployed projects with a pages_metadatum record have an artifacts archive, so enabling the flag for a given percentage is only ~10% as effective:
[ gprd ] production> ProjectPagesMetadatum.where(deployed: true).count
=> 242850
[ gprd ] production> ProjectPagesMetadatum.where(deployed: true).where.not(artifacts_archive_id: nil).count
=> 25056
- 2020-10-08 04:00 UTC Increase rollout to 5% (.5% effective) - https://gitlab.slack.com/archives/C101F3796/p1602129966067800
- 2020-10-08 12:00 UTC Increase rollout to 20% (2% effective) - https://gitlab.slack.com/archives/C101F3796/p1602158634078800
- 2020-10-13 00:10 UTC Increase rollout to 50% (5% effective) - https://gitlab.slack.com/archives/C101F3796/p1602547753160500
- 2020-10-13 9:53 UTC rolled back to docs.gitlab.com only after #2808 (comment 430655227)
- 2020-10-19 00:00 UTC Increase rollout to 5% (.5% effective)
- 2020-10-19 23:33 UTC Increase rollout to 25% (10% effective)
- 2020-10-20 23:26 UTC Increase rollout to 50% (~42% effective)
- 2020-10-22 01:12 UTC Increase rollout to 75%
- 2020-10-22 06:25 UTC Increase rollout to 100%
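The "effective" figures on the first three timeline entries follow from the two console counts quoted in the change steps. A minimal sketch of that arithmetic, assuming the effective rollout is simply the flag percentage multiplied by the share of deployed projects that have an archive; the constant and method names are hypothetical and exist only for illustration. The later effective figures in the timeline are not derivable from these two counts.

```ruby
# Counts copied from the production console output in the change steps.
DEPLOYED_PROJECTS = 242_850
PROJECTS_WITH_ARCHIVE = 25_056

# Share of deployed Pages projects that have an artifacts archive.
ARCHIVE_SHARE = PROJECTS_WITH_ARCHIVE.to_f / DEPLOYED_PROJECTS

# Hypothetical helper: estimated percentage of all deployed projects that are
# served from zip when the flag is enabled for `flag_percentage` percent.
def effective_percentage(flag_percentage)
  (flag_percentage * ARCHIVE_SHARE).round(2)
end

puts format("archive share: %.1f%%", ARCHIVE_SHARE * 100)
[5, 20, 50].each do |pct|
  puts format("flag at %d%% is about %.2f%% effective", pct, effective_percentage(pct))
end
```

The results round to the .5%, 2%, and 5% noted for the 5%, 20%, and 50% steps.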
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
Nothing
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
-
{disable feature flag}
/chatops run feature set pages_artifacts_archive 0 --actors
-
2020-10-13 9:53 UTC rolled back to docs.gitlab.com only after #2808 (comment 430655227)
Monitoring
Key metrics to observe
- Grafana dashboards:
- https://dashboards.gitlab.net/d/pages-main/pages-overview?orgId=1 (no visible changes after 5% enabling)
- https://dashboards.gitlab.net/d/web-pages-main/web-pages-overview?orgId=1 (no visible changes after 5% enabling)
-
Profiler
- If you switch to "Heap" you'll see a 30% chunk of (*zipArchive).readArchive - it shows the memory used to store the cached index of all files served by Pages with zip serving enabled.
- It didn't change much after enabling the feature for 5%, most likely because docs.gitlab.com is huge and takes up most of this space, so 5% of other projects don't make a difference.
-
cached entries
- File entries cached. E.g. if we cache 3 zip archives with 100 files each, this metric should be 300. After the 5% rollout this metric grew on one of the servers, but memory usage didn't. Most likely we don't decrease this metric properly.
-
cached zip archives
- Basically the number of projects that were recently served by the zip architecture and cached (for 1 minute). Started to grow after the 5% rollout. We think the gradual increase can be explained by the usual increase in Pages users over the course of the day: the more people use Pages, the more projects stay in cache.
- cache hits/misses
-
open archives count
- Same as cached_zip_archives, but without the cache, so this counts individual projects accessed regardless of caching. Jumped after the 5% rollout and stayed at almost the same level, which is expected.
-
vfs operations total ZIP vs Local
- Shows disk operations performed on the ZIP VFS (object storage) vs local disk. Didn't change much with the 5% rollout; the majority of all operations are still for docs.gitlab.com.
-
vfs file open ZIP vs local
- same as above, but only for open file operations
-
rate of opening new ZIP archives
- number of zip archives opened in 5 min, jumped after 5% rollout
-
currently allocated memory in Go process
- didn't change after 5% rollout
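The relationship between the "cached zip archives" and "cached entries" metrics above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Pages daemon's code: the `ArchiveCache` class, its method names, and the explicit `now` argument are all hypothetical, and the only details taken from this issue are the 1 minute cache lifetime and the 3 archives with 100 files example. Both gauges are derived from the cache contents, so they drop when an archive expires.

```ruby
# Toy model of the two cache gauges; not the Pages daemon's actual code.
class ArchiveCache
  TTL_SECONDS = 60 # the issue says archives stay cached for 1 minute

  def initialize
    @archives = {} # project name => { files: [...], expires_at: seconds }
  end

  # Cache the file index of one project's zip archive.
  def store(project, files, now)
    @archives[project] = { files: files, expires_at: now + TTL_SECONDS }
  end

  # Drop expired archives; both gauges are computed from what remains.
  def evict_expired(now)
    @archives.delete_if { |_project, entry| entry[:expires_at] <= now }
  end

  # Models the "cached zip archives" gauge.
  def cached_archives
    @archives.size
  end

  # Models the "cached entries" gauge: total files across cached archives.
  def cached_entries
    @archives.values.sum { |entry| entry[:files].size }
  end
end

cache = ArchiveCache.new
3.times do |i|
  cache.store("project-#{i}", Array.new(100) { |n| "file-#{n}.html" }, 0)
end
puts cache.cached_entries # 3 archives with 100 files each

cache.evict_expired(61)   # all three archives are past their TTL
puts cache.cached_entries
```

A counter that is only incremented when an archive is stored would keep growing in this model, which would match the suspicion noted above that the metric is not decreased properly.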
Summary of infrastructure changes
- Does this change introduce new compute instances? - No
- Does this change re-size any existing compute instances? - No
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? - Adds some traffic to the artifacts object storage bucket
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled).
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
- There are currently no active incidents.