Tune GitLab Pages caches to increase hit-rates on gitlab.com
We currently have very low hit-rates on zip-archive caches in gitlab-pages: https://thanos-query.ops.gitlab.net/graph?g0.expr=sum(gitlab_pages_zip_cached_entries%7Benv%3D%22gprd%22%7D)%20by%20(op)&g0.tab=0&g0.stacked=0&g0.range_input=1w&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g1.expr=sum(rate(gitlab_pages_zip_cache_requests%7Benv%3D%22gprd%22%2Ccache%3D%22hit%22%7D%5B1h%5D))%20by%20(op%20)%20%2F%20sum(rate(gitlab_pages_zip_cache_requests%7Benv%3D%22gprd%22%20%7D%5B1h%5D))%20by%20(op%20)&g1.tab=0&g1.stacked=0&g1.range_input=1w&g1.max_source_resolution=0s&g1.deduplicate=1&g1.partial_response=0&g1.store_matches=%5B%5D:
There are 3 different hit rates there; the relevant one here is "archive", and we have configuration options for it:
zipCacheExpiration = flag.Duration("zip-cache-expiration", 60*time.Second, "Zip serving archive cache expiration interval")
zipCacheCleanup = flag.Duration("zip-cache-cleanup", 30*time.Second, "Zip serving archive cache cleanup interval")
zipCacheRefresh = flag.Duration("zip-cache-refresh", 30*time.Second, "Zip serving archive cache refresh interval")
These options work like this:
- every zipCacheCleanup interval we scan the cache and remove everything that was added to the cache longer than zipCacheExpiration ago
- every time we hit the cache, if the archive was added to the cache longer than zipCacheRefresh ago, we update it, basically making it look like it was just added to the cache
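To make the interplay of the three intervals easier to reason about, here is a minimal sketch of the behaviour described above. It is a toy model, not the gitlab-pages implementation: `toyCache`, `sec` and `hitsForPeriodicAccess` are hypothetical names, time is simulated in whole seconds, and the lookup itself does not check expiration (only the cleanup pass removes entries), which is an assumption based on the description in this issue.

```go
package main

import "time"

// toyCache is a hypothetical stand-in for the zip archive cache.
type toyCache struct {
	expiration time.Duration // plays the role of zip-cache-expiration
	refresh    time.Duration // plays the role of zip-cache-refresh
	addedAt    map[string]time.Time
}

func newToyCache(expiration, refresh time.Duration) *toyCache {
	return &toyCache{expiration: expiration, refresh: refresh, addedAt: map[string]time.Time{}}
}

// Get reports whether key was cached. A hit on an entry older than refresh
// resets its age; a miss stores the entry with the current time.
func (c *toyCache) Get(key string, now time.Time) bool {
	added, ok := c.addedAt[key]
	if !ok {
		c.addedAt[key] = now
		return false
	}
	if now.Sub(added) > c.refresh {
		c.addedAt[key] = now // make it look like it was just added
	}
	return true
}

// Cleanup removes entries older than expiration and returns how many were
// removed. It is meant to run once per zip-cache-cleanup interval.
func (c *toyCache) Cleanup(now time.Time) int {
	removed := 0
	for key, added := range c.addedAt {
		if now.Sub(added) > c.expiration {
			delete(c.addedAt, key)
			removed++
		}
	}
	return removed
}

func sec(n int) time.Duration { return time.Duration(n) * time.Second }

// hitsForPeriodicAccess requests a single archive every gapSec seconds for
// totalSec simulated seconds, runs Cleanup every cleanupSec seconds, and
// returns the number of cache hits.
func hitsForPeriodicAccess(expirationSec, refreshSec, cleanupSec, gapSec, totalSec int) int {
	c := newToyCache(sec(expirationSec), sec(refreshSec))
	start := time.Unix(0, 0)
	hits := 0
	for t := 0; t <= totalSec; t++ {
		now := start.Add(sec(t))
		if t > 0 && t%cleanupSec == 0 {
			c.Cleanup(now)
		}
		if t%gapSec == 0 && c.Get("site.zip", now) {
			hits++
		}
	}
	return hits
}
```

In this model, with the default 60/30/30 settings a site requested every 100 seconds never hits the cache over a 300 second run, while with a 2 minute expiration every request after the first one is a hit.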
So the tradeoffs work like this:
- longer zipCacheExpiration - takes more memory, but increases the hit rate as we have more data in cache.
- longer zipCacheCleanup - saves CPU time, but increases memory consumption, as we do garbage collection less often.
- longer zipCacheRefresh - saves some CPU by doing fewer operations with memory, but makes the cache less "up to date", as we can evict cache entries that were accessed recently. (TBH, I don't know why we can't just make this 0; I'm not sure how big the CPU impact of this will be.)
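The "we can evict cache entries that were accessed recently" part can be put into numbers. Assuming the semantics described in this issue (age is reset only when a hit finds the entry older than zipCacheRefresh, and removal happens only on a cleanup pass), an entry stays cached after its last hit for somewhere between expiration minus refresh and expiration plus cleanup. The helper names below are hypothetical and this is only a sketch of that arithmetic.

```go
package main

import "time"

// idleLifetimeBounds returns how long an entry may stay cached after its
// last hit, assuming the behaviour described in this issue:
//   - shortest: the last hit found the entry just under `refresh` old, so its
//     age was not reset, and a cleanup pass ran right as it expired;
//   - longest: the last hit reset the age, and the entry expired just after
//     a cleanup pass, so it waited for the next one.
func idleLifetimeBounds(expiration, cleanup, refresh time.Duration) (shortest, longest time.Duration) {
	shortest = expiration - refresh
	if shortest < 0 {
		shortest = 0
	}
	longest = expiration + cleanup
	return shortest, longest
}

// idleLifetimeBoundsSec is a convenience wrapper taking whole seconds.
func idleLifetimeBoundsSec(expirationSec, cleanupSec, refreshSec int) (shortest, longest time.Duration) {
	return idleLifetimeBounds(
		time.Duration(expirationSec)*time.Second,
		time.Duration(cleanupSec)*time.Second,
		time.Duration(refreshSec)*time.Second,
	)
}
```

With the defaults (60s, 30s, 30s) this gives 30s to 90s, so an archive can be dropped as little as 30 seconds after it was last served. With refresh set to 0 the lower bound becomes the full expiration interval.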
However, the pages daemon already takes quite a lot of memory, up to 1.7 GB on gitlab.com, and 90% of this memory is used by the zip archive cache. There is an issue to optimize this. So if we want to include more data in this cache, we need to do that very carefully.
I think the default 30 sec for zipCacheCleanup and zipCacheRefresh is OK. But I want to try increasing zipCacheExpiration from the default 1 min to 2, or maybe 5-10 minutes, depending on the memory impact of such a change.
Success criteria:
- hit-rate for the archive operation on the graph above goes up
- 95th percentile of duration_ms for files under 1 MB goes down
- ttfb (from the same dashboard ☝) goes down
I'll set the weight to 4 to carefully set zipCacheExpiration on:
- production to 1.5 min
- production to 2 min
- production to 5 min
- production to 10 min
We can stop after each step if it doesn't affect duration_ms or if memory usage goes too high.
(we also need to set the same values on stg/pre every time, so there will be 8 MRs if we go all the way, but I don't want to put the scary weight of 8 on the issue)
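For reference, the steps above map to the following values of the zip-cache-expiration flag. This is only a sketch: `rolloutSteps`, `expirationFlag` and `rolloutFlags` are hypothetical helpers, and on gitlab.com the value is set through the k8s-workloads configuration (see the MRs referenced in this issue) rather than by passing the flag by hand; the exact configuration key is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// rolloutSteps lists the zipCacheExpiration values proposed for production,
// in the order they would be applied.
var rolloutSteps = []time.Duration{
	90 * time.Second, // 1.5 min
	2 * time.Minute,
	5 * time.Minute,
	10 * time.Minute,
}

// expirationFlag renders one step as a command-line argument. The value uses
// Go's duration syntax, which flag.Duration parses via time.ParseDuration.
func expirationFlag(step time.Duration) string {
	return fmt.Sprintf("-zip-cache-expiration=%s", step)
}

// rolloutFlags returns one argument per step, in rollout order.
func rolloutFlags() []string {
	args := make([]string, 0, len(rolloutSteps))
	for _, step := range rolloutSteps {
		args = append(args, expirationFlag(step))
	}
	return args
}
```

The first step renders as `-zip-cache-expiration=1m30s`; an equivalent spelling such as `90s` is also accepted by the duration parser.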
References of similar MRs:
- changing pages config on staging/pre: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1499 (merged)
- in production - gitlab-com/gl-infra/k8s-workloads/gitlab-com!1500 (merged)
