2023-01-27: [STAGING] Improve caching policy in Cloudflare
Production Change
Change Summary
During a production issue on Sep 8, 2022, we realized that caching for some endpoints, such as raw and archive, is not implemented efficiently. We do not set the correct HTTP headers for some of our endpoints and, as a result, the content for these endpoints is not cached properly at our edge network (Cloudflare). See this issue for more context.
Furthermore, our improper caching caused unexpected behavior: even with the Snippets permission set to Only Project Members, the raw pages of snippets from that project were still being cached, which could leak their content. See this issue for more context.
We are tracking all the latest changes that need to be made in this issue.
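The relationship between the response headers and edge caching can be sketched as follows. This is a simplified illustration based on general HTTP caching semantics (RFC 9111), not Cloudflare's actual decision logic; the function names are hypothetical, and the sketch conservatively treats a response as storable by a shared cache only when it explicitly opts in.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Parse a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if arg else None
    return directives


def shared_cache_may_store(cache_control: str) -> bool:
    """Simplified check: may a shared cache (such as a CDN edge) store this response?

    Assumption of this sketch: `private` and `no-store` always forbid storing,
    and anything without `public`, `s-maxage`, or `max-age` is treated as not
    storable (a conservative reading, not the full RFC 9111 rules).
    """
    d = parse_cache_control(cache_control)
    if "no-store" in d or "private" in d:
        return False
    return "public" in d or "s-maxage" in d or "max-age" in d


# Header values taken from the validation checklist in this issue.
print(shared_cache_may_store("max-age=0, private, must-revalidate"))
print(shared_cache_may_store(
    "max-age=60, public, must-revalidate, stale-while-revalidate=60, "
    "stale-if-error=300, s-maxage=60"
))
```

The first call reports that a `private` response should stay out of the edge cache, while the second reports that the `public` raw and archive responses may be stored.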
Change Details
- Services Impacted - Service::Cloudflare
- Change Technician - @miladx
- Change Reviewer - @T4cC0re @f_santos
- Time tracking - unknown
- Downtime Component - none
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - Roughly 20-30 minutes
- Set label change::in-progress: /label ~change::in-progress
- Merge this MR for sending the proper HTTP headers from the application.
- Merge this MR for changing the page rules on Cloudflare in Staging.
- Validate the change in staging.
  - Test URLs behave as expected.
    - curl -s -I https://staging.gitlab.com/miladx
      - cache-control: max-age=0, private, must-revalidate
      - etag: Weak
      - pragma: no-cache
      - vary: Accept
    - curl -s -I https://staging.gitlab.com/miladx.keys
      - cache-control: max-age=0, private, must-revalidate
      - etag: Weak
      - pragma: no-cache
      - vary: Accept
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/issues/1
      - cache-control: max-age=0, private, must-revalidate
      - etag: Weak
      - pragma: no-cache
      - vary: Accept
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/merge_requests/1
      - cache-control: max-age=0, private, must-revalidate
      - etag: Weak
      - pragma: no-cache
      - vary: Accept
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/merge_requests/1.diff
      - cache-control: no-cache
      - pragma: no-cache
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/merge_requests/1.patch
      - cache-control: no-cache
      - pragma: no-cache
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/commit/5d0434602afee4f7fcc87d6662956eb2b7a2271f.diff
      - cache-control: no-cache
      - pragma: no-cache
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/commit/5d0434602afee4f7fcc87d6662956eb2b7a2271f.patch
      - cache-control: no-cache
      - pragma: no-cache
    - curl -s -I https://staging.gitlab.com/
      - cache-control: no-cache, no-store, must-revalidate
      - pragma: no-cache
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/blob/main/README.md
      - cache-control: max-age=0, private, must-revalidate
      - etag: Weak
      - pragma: no-cache
      - vary: Accept
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/blob/ca8f6a6453cce16f53123e95032cb3297e237cdf/README.md
      - cache-control: max-age=0, private, must-revalidate
      - etag: Weak
      - pragma: no-cache
      - vary: Accept
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/raw/main/README.md
      - cache-control: max-age=60, public, must-revalidate, stale-while-revalidate=60, stale-if-error=300, s-maxage=60
      - etag: Strong
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/raw/ca8f6a6453cce16f53123e95032cb3297e237cdf/README.md
      - cache-control: max-age=3600, public, must-revalidate, stale-while-revalidate=60, stale-if-error=300, s-maxage=60
      - etag: Strong
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/archive/main/public-repo-main.zip
      - cache-control: max-age=60, public, must-revalidate, stale-while-revalidate=60, stale-if-error=300, s-maxage=60
      - etag: Strong
    - curl -s -I https://staging.gitlab.com/my-public-space/public-repo/-/archive/ca8f6a6453cce16f53123e95032cb3297e237cdf/public-repo-ca8f6a6453cce16f53123e95032cb3297e237cdf.zip
      - cache-control: max-age=3600, public, must-revalidate, stale-while-revalidate=60, stale-if-error=300, s-maxage=60
      - etag: Strong
  - The cache-related bug is fixed.
    - In a private project, go to the "Settings • General" page.
    - Expand "Visibility, project features, permissions".
    - Ensure "Snippets" is set to "Only Project Members".
    - Go to the "Snippets" section and create a new public snippet.
    - Then click the "Open Raw" icon.
    - Make a change to the snippet and save it.
    - Open the URL for raw content in an incognito window.
    - The content should be updated at this point and the stale cached content should NOT be displayed.
- Set label change::complete: /label ~change::complete
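The header checks in the validation step could be automated along the following lines. This is a minimal sketch, not part of the change plan: it assumes the output of `curl -s -I` has been captured as text, the helper names are hypothetical, and the sample response is made up for illustration rather than captured from staging. Header values are compared as exact strings, so a different directive order would be reported as a mismatch.

```python
from typing import Dict, List, Optional


def parse_head_response(raw: str) -> Dict[str, str]:
    """Turn the text printed by `curl -s -I` into {lowercased header name: value}."""
    headers: Dict[str, str] = {}
    for line in raw.splitlines():
        if line.startswith("HTTP/") or ":" not in line:
            continue  # skip the status line and blank separators
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def etag_kind(etag: Optional[str]) -> Optional[str]:
    """Classify an ETag as "Weak" (W/ prefix) or "Strong"; None if absent."""
    if not etag:
        return None
    return "Weak" if etag.startswith("W/") else "Strong"


def check_headers(raw: str, expected: Dict[str, str]) -> List[str]:
    """Return a list of mismatches; an empty list means the URL behaves as expected."""
    headers = parse_head_response(raw)
    problems = []
    for name, want in expected.items():
        got = etag_kind(headers.get("etag")) if name == "etag" else headers.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    return problems


# Expectation copied from the checklist entry for /-/raw/main/README.md.
EXPECTED_RAW_MAIN = {
    "cache-control": (
        "max-age=60, public, must-revalidate, stale-while-revalidate=60, "
        "stale-if-error=300, s-maxage=60"
    ),
    "etag": "Strong",
}

# Made-up sample output for illustration only (hypothetical ETag value).
SAMPLE = (
    "HTTP/2 200\n"
    "cache-control: max-age=60, public, must-revalidate, "
    "stale-while-revalidate=60, stale-if-error=300, s-maxage=60\n"
    'etag: "0123456789abcdef"\n'
)

print(check_headers(SAMPLE, EXPECTED_RAW_MAIN))  # prints []
```

In practice the `raw` argument would come from running each `curl -s -I` command in the checklist, with one expectation dictionary per URL.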
Rollback
If anything goes wrong, we first manually undo the Cloudflare page rule changes and then roll back the merge requests.
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Roughly 10-15 minutes
- Go to Cloudflare and select the staging.gitlab.com website.
- Navigate to "Page Rules", disable the new rule and re-enable the old ones.
  - Disable Page Rule for staging.gitlab.com/*
  - Enable Page Rule for staging.gitlab.com/*/repository/*/archive.*
  - Enable Page Rule for staging.gitlab.com/*/repository/archive.*
  - Enable Page Rule for staging.gitlab.com/*/raw/*
  - Enable Page Rule for staging.gitlab.com/*/-/archive/*
- Purge Cloudflare CDN cache for staging at Cloudflare > Account > Domain > Caching > Configuration.
- Revert the MR.
- Set label change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: Requests vs. Time by Cache Status
  - Location: Cloudflare Cache Analytics
  - Location: Cloudflare Traffic Analytics by Cache Status
  - We expect to see an increase in the amount of cached data and the number of cache hits due to the new general Cloudflare Page Rule for caching all endpoints (the caching uses strong and weak ETags for content validation).
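The expected increase in cache hits can be expressed as a hit ratio over the cache statuses of individual requests. The sketch below is illustrative only: the function name is hypothetical, the status lists are made-up sample data, and the choice of which Cloudflare cache statuses count as "served from cache" is an assumption of this sketch rather than the definition used by Cloudflare Analytics.

```python
from collections import Counter
from typing import FrozenSet, Iterable

# Assumption: these cf-cache-status values are treated as served from the edge cache.
SERVED_FROM_CACHE = frozenset({"HIT", "STALE", "REVALIDATED", "UPDATING"})


def cache_hit_ratio(
    statuses: Iterable[str],
    served_from_cache: FrozenSet[str] = SERVED_FROM_CACHE,
) -> float:
    """Fraction of requests whose cache status counts as served from cache."""
    counts = Counter(status.upper() for status in statuses)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no traffic observed, so no ratio to report
    hits = sum(counts[status] for status in served_from_cache)
    return hits / total


# Made-up samples: cache statuses before and after enabling the new page rule.
before = ["MISS", "DYNAMIC", "MISS", "HIT"]
after = ["HIT", "HIT", "REVALIDATED", "MISS"]

print(cache_hit_ratio(before))  # prints 0.25
print(cache_hit_ratio(after))   # prints 0.75
```

A ratio that stays flat or drops after the change would be a signal to review the page rule and the origin headers.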
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
  - There are currently no active incidents that are severity::1 or severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.