This issue has been moved and the description cleared of content to avoid polluting the search results of this tracker; see the moved issue link for the original newsletter.
The EMEA shift was overall good, with some noise, but nothing that should bleed over to the next shift. Some things worth mentioning:
wal-g basebackup keeps failing randomly. Rehab is deploying a new version of wal-g that might help with this: production#5550 (closed)
There was one of those random GCP reboots on a Gitaly node: production#5558 (closed). Gitaly HA would help avoid downtime for projects on that shard.
I was only involved in the cleanup, but the redis-persistent saturation incident also consumed some of our time: production#5546 (closed)
The effect of the cron change is that this won't fully summarize the week of the rotation, but I think a general summary of the issues once per week is probably good.
One thing we noticed: for reasons explained in production#5546 (closed), when blocking a namespace/project for a ToS violation, we need to block it at the edge too.
A note from our handover meeting: PCL timing across days can be confusing. While the handbook spells out that, by default, a PCL runs from 9 am UTC to 9 am UTC the next day, we figured it would be best to include this timing information directly in the PCL dates to avoid confusion.
APAC had nothing major, but a steady stream of little things. In particular:
Wal-G: already noted above; one more day until the fix is deployed and my sanity can be restored.
Gitaly FindTag slowness: one customer, one project, 270K+ tags; FindTag gRPC calls were taking 3-8s, significantly affecting apdex. A housekeeping run on the repo packed the tag refs into a single file, after which FindTag calls took 70-80ms instead. Lesson: there may be things we can do to repos, like housekeeping, to clean up some of these Gitaly apdex issues.
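As a hedged illustration of that lesson, a minimal sketch of kicking off housekeeping for a project via the GitLab REST API (housekeeping runs gc/repack, which among other things packs loose refs); the project ID and token below are placeholders, and on gitlab.com we would more likely trigger this internally than through the public API:

```python
# Minimal sketch: trigger GitLab housekeeping for a project via the REST API.
# The project ID and token are placeholders, not the ones from this incident.
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = 12345          # hypothetical project ID
TOKEN = "glpat-..."         # token with api scope and sufficient access

resp = requests.post(
    f"{GITLAB_API}/projects/{PROJECT_ID}/housekeeping",
    headers={"PRIVATE-TOKEN": TOKEN},
    timeout=30,
)
resp.raise_for_status()  # 201 means the housekeeping task was queued
```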
KAS: At quieter times, KAS apdex alerts fire. production#5561 (closed) eventually found that this is caused by nginx-ingress-controller scaleup events producing low-grade 502s, which are just more noticeable for KAS because of its low request rates. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14210 covers trying to find out exactly why, but Delivery are removing nginx from the API fleet next week (from 2021-09-27), so unless I find my plate miraculously clean of other more timely work and need to pick something up, any further investigation is probably a waste of time. For the next week or two, we may just have to suck up those alerts and ignore them (I don't think a silence is correct for that period of time, as it would hide bigger failures).
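For context on why the same 502s hurt KAS more than busier fleets, a back-of-the-envelope calculation with made-up numbers (a single scaleup burst is assumed to drop ~10 requests):

```python
# Illustrative arithmetic only; the burst size and request rates are made up.
BURST_502S = 10          # hypothetical 502s dropped during one scaleup event
WINDOW_SECONDS = 60

for name, rps in [("busy API fleet", 50), ("KAS at a quiet time", 2)]:
    total_requests = rps * WINDOW_SECONDS
    print(f"{name}: {BURST_502S / total_requests:.1%} of requests failed in that minute")
# busy API fleet: 0.3%; KAS at a quiet time: 8.3% -- enough to trip an apdex/error-rate alert
```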
Prometheus/Alertmanager: When Alertmanager cycles pods, some of the prometheii can sometimes lose track of one of them (they should do a DNS refresh, but they don't). It's happened 3-4 times in the last couple of months; restarting Prometheus (one at a time, waiting for each to be fully up before moving on) is the immediate answer. Code spelunking has not thrown up anything useful yet, but I'm hunting for a sporadic/random/race-condition bug, so it must be non-obvious.
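For the runbook, a rough sketch of that "one at a time, wait until fully up" restart loop, assuming SSH plus systemd for the restart (the host list and restart command are environment-specific placeholders; /-/ready is Prometheus's standard readiness endpoint):

```python
# Rolling restart of Prometheus instances, one at a time, waiting for readiness.
# Hosts and the restart mechanism below are placeholders for illustration.
import subprocess
import time

import requests

PROM_HOSTS = ["prometheus-01.example.net", "prometheus-02.example.net"]

def wait_until_ready(host, timeout=600):
    """Poll Prometheus's /-/ready endpoint until it reports ready or we time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://{host}:9090/-/ready", timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass  # not up yet; keep waiting
        time.sleep(10)
    raise TimeoutError(f"{host} not ready after {timeout}s")

for host in PROM_HOSTS:
    # Environment-specific restart; could equally be a kubectl rollout or chef-client run.
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "prometheus"], check=True)
    wait_until_ready(host)
```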