Tune GitLab Pages caches to increase hit-rates on gitlab.com
We currently have very low hit-rates on zip-archive caches in gitlab-pages: https://thanos-query.ops.gitlab.net/graph?g0.expr=sum(gitlab_pages_zip_cached_entries%7Benv%3D%22gprd%22%7D)%20by%20(op)&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=sum(rate(gitlab_pages_zip_cache_requests%7Benv%3D%22gprd%22%2Ccache%3D%22hit%22%7D%5B1h%5D))%20by%20(op%20)%20%2F%20sum(rate(gitlab_pages_zip_cache_requests%7Benv%3D%22gprd%22%20%7D%5B1h%5D))%20by%20(op%20)&g1.tab=0&g1.stacked=0&g1.range_input=1w&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D:
There are 3 different hit rates there, including "archive", and we have configuration options for it:
zipCacheExpiration = flag.Duration("zip-cache-expiration", 60*time.Second, "Zip serving archive cache expiration interval")
zipCacheCleanup = flag.Duration("zip-cache-cleanup", 30*time.Second, "Zip serving archive cache cleanup interval")
zipCacheRefresh = flag.Duration("zip-cache-refresh", 30*time.Second, "Zip serving archive cache refresh interval")
These options work like this:
- every zipCacheCleanup interval we scan the cache and remove everything that was added to the cache longer than zipCacheExpiration ago
- every time we hit the cache, if the archive was added to the cache longer than zipCacheRefresh ago, we update it, basically making it look like it was just added to the cache
So the tradeoffs work like this:
- longer zipCacheExpiration - takes more memory, but increases the hit rate as we have more data in the cache.
- longer zipCacheCleanup - saves CPU time, but increases memory consumption, as we do garbage collection less often.
- longer zipCacheRefresh - saves some CPU by doing fewer operations with memory, but makes the cache less "up to date", as we can evict cache entries that were accessed recently. (TBH, I don't know why we can't just make this 0; I'm not sure how big the CPU impact of this would be.)
However, the pages daemon already takes quite a large amount of memory, up to 1.7 GB on gitlab.com, and 90% of this memory is used by the zip archives cache. There is an issue to optimize this. So if we want to include more data in this cache, we need to do that very carefully.
I think the default 30 sec for zipCacheCleanup and zipCacheRefresh is OK. But I want to try increasing zipCacheExpiration from the default 1 min to 2, or maybe 5-10 minutes, depending on the memory impact of such a change.
Success criteria:
- hit-rate for the archive operation on the graph above goes up
- the 95% duration_ms percentile for files under 1 MB goes down
- ttfb (from the same dashboard ☝) goes down
I'll set the weight to 4 to carefully set zipCacheExpiration on:
- production to 1.5 min
- production to 2 min
- production to 5 min
- production to 10 min
We can stop after each step if it doesn't affect duration_ms or if memory usage goes too high.
(We also need to set the same values on stg/pre every time, so there will be 8 MRs if we go all the way, but I don't want to put the scary 8 weight on the issue.)
References to similar MRs:
- changing pages config on staging/pre: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1499 (merged)
- in production: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1500 (merged)