Draft: Add means to bump Markdown cache gradually
What does this MR do and why?
Closes Bumping CACHE_COMMONMARK_VERSION is risky (#597379) by adding a vetted and repeatable method to bump CACHE_COMMONMARK_VERSION. The bump is controlled by a pair of version constants (instead of our current single one) and an ops FF that manages the rollout, because every bump is in fact a rollout. The process is:
- Check the FF is at 0%.
- Land MR declaring the cache version we're rolling forward to. There is no behavioural change yet, but the FF is now being checked.
- Adjust the FF to control the percentage of cache version checks that report the new version, ramping up to 100% gradually while watching database load. This process can take as long as it needs: hours, days, weeks.
- Land MR declaring the roll-forward complete. The FF is now ignored.
- Reset the FF to 0%.
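A minimal sketch of how the pair of constants and the FF could interact across these steps. The module, method, and parameter names are hypothetical and are not the ones in this MR's diff; `flag_check` stands in for the ops FF read.

```ruby
module RolloutSketch
  # Returns the version a single cache check should compare against.
  # rolling_forward is nil whenever no rollout is in progress.
  def self.declared_version(current:, rolling_forward:, flag_check:)
    # No rollout declared: the FF is never consulted.
    return current if rolling_forward.nil?

    # Rollout declared: each check asks the FF which version to report.
    flag_check.call ? rolling_forward : current
  end
end

ff_at_0   = -> { false } # stand-in for the FF at 0%
ff_at_100 = -> { true }  # stand-in for the FF at 100%

# After the first MR lands, with the FF still at 0%: no behavioural change.
RolloutSketch.declared_version(current: 10, rolling_forward: 11, flag_check: ff_at_0)   # => 10

# Fully ramped up: every check reports the new version.
RolloutSketch.declared_version(current: 10, rolling_forward: 11, flag_check: ff_at_100) # => 11

# After the second MR lands: the FF is not read at all.
unread = -> { raise "FF must not be read" }
RolloutSketch.declared_version(current: 11, rolling_forward: nil, flag_check: unread)   # => 11
```

Passing the flag read in as a callable keeps the sketch deterministic; the real check would call the feature flag directly.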
Markdown cache updates currently happen in the following circumstances:
- The cache column is empty for whatever reason (e.g. not yet generated).
- The source column for the Markdown has changed.
- The cached_markdown_version column contains a value less than the current declared cache version.
Note that while I use "cache column" and "source column" in the singular, there can be multiple Markdown columns per record, so these conditions may apply to several columns at once. There is only one cached_markdown_version per record, however, so bumping that forces a refresh of all caches in that record.
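The three conditions above can be sketched as a single predicate. This is an illustration only: the `Record` and `Field` structs and the `source_changed` attribute are hypothetical, and the sketch collapses the result to one boolean per record.

```ruby
module RefreshSketch
  # Hypothetical shapes: one source/cache pair per Markdown field,
  # plus a single version shared by every field on the record.
  Field  = Struct.new(:source, :cached_html, :source_changed, keyword_init: true)
  Record = Struct.new(:fields, :cached_markdown_version, keyword_init: true)

  def self.refresh_needed?(record, declared_version)
    # 1. A cache column is empty (e.g. not yet generated).
    return true if record.fields.any? { |f| f.cached_html.nil? }
    # 2. A source column has changed.
    return true if record.fields.any?(&:source_changed)

    # 3. The record-wide version is behind the declared one.
    #    A nil version is treated as 0 here, which is an assumption.
    record.cached_markdown_version.to_i < declared_version
  end
end

field = RefreshSketch::Field.new(source: "*hi*", cached_html: "<em>hi</em>", source_changed: false)
fresh = RefreshSketch::Record.new(fields: [field], cached_markdown_version: 10)

RefreshSketch.refresh_needed?(fresh, 10) # => false
RefreshSketch.refresh_needed?(fresh, 11) # => true, and every field on the record is affected
```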
This MR updates the behaviour of the cached_markdown_version column check. When a "rolling forward" version is declared, the "declared cache version" now depends on an FF read. It's OK that a single row may be checked sometimes against the current version and sometimes against the later one; we do not regress the version when the record's cache version is newer than the current one.
We always write the latest cache version regardless of the cache version check result, as the new write is by definition current.
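A small sketch of why mixed check results are safe, assuming a write always stamps the newest declared version. All names here are hypothetical.

```ruby
module WriteSketch
  Row = Struct.new(:cached_markdown_version)

  # The version stamped on every write: the newest one declared.
  def self.latest_version(current:, rolling_forward:)
    rolling_forward || current
  end

  # A row is stale only if it is strictly behind the version this check drew.
  def self.stale?(row, checked_version)
    row.cached_markdown_version < checked_version
  end

  # Re-rendering is omitted; only the version stamp is modelled.
  def self.refresh!(row, current:, rolling_forward:)
    row.cached_markdown_version = latest_version(current: current, rolling_forward: rolling_forward)
    row
  end
end

row = WriteSketch::Row.new(10)
WriteSketch.stale?(row, 10) # => false: this check drew the current version
WriteSketch.stale?(row, 11) # => true: this check drew the rolling-forward version

WriteSketch.refresh!(row, current: 10, rolling_forward: 11)
WriteSketch.stale?(row, 10) # => false: a newer row is never regressed
WriteSketch.stale?(row, 11) # => false
```

Once a row has been written during a rollout it satisfies both versions, so later checks cannot trigger another refresh whichever version they draw.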
On using a percentage_of_time FF
Please read the "Banzai pipeline and parsing" docs, which this MR updates extensively, regarding the FF type selection. The choice of percentage_of_time is by design, and does not trip the concerns that led to its being marked as deprecated (Percentage-based Feature Flags should return th... (#425202 - closed), 2023-09-14: Issues and comments not loading cor... (gitlab-com/gl-infra/production#16366 - closed)). We explicitly do not want Flipper to return the same value for multiple calls within one request.
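As a rough model of what per-call sampling buys us, assume each version check is an independent draw that reports the new version with probability equal to the FF percentage. That independence is an assumption of this sketch, and the function name is hypothetical. The chance that a stale row has been refreshed then grows with every read:

```ruby
# Probability that a stale row has been refreshed after `reads` independent
# checks, when each check reports the new version with probability `fraction`.
def refreshed_probability(fraction, reads)
  1.0 - (1.0 - fraction)**reads
end

refreshed_probability(0.0, 100) # => 0.0: at 0% nothing is ever refreshed
refreshed_probability(1.0, 1)   # => 1.0: at 100% the first read refreshes
refreshed_probability(0.1, 10)  # roughly 0.65
```

Under this model, frequently read rows are refreshed early in the ramp and rarely read rows later, which spreads the database load over time.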
References
- 2021-05-05: Slow DB queries affecting shared_ru... (gitlab-com/gl-infra/production#4481 - closed): incident caused by bumping the version number outright
- Prevent markdown version changes from impacting... (#330313 - closed): issue meant to resolve this problem, which was closed by adding a line saying "don't change this number"
- #330313 (comment 573768166): comment from the previous Markdown DRI noting we might have to change this number!
- !183564 (comment 2769977228): example of feature development complicated by the necessity to support all possible cached renders
Screenshots or screen recordings
| Before | After |
|---|---|
How to set up and validate locally
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.