This issue has been moved and the description cleared of content to avoid polluting the search results of this tracker; see the moved issue link for the original newsletter.
The EMEA shift was overall good, with some noise, but nothing that should bleed over to the next shift. Some things worth mentioning:
wal-g basebackup keeps failing randomly. Rehab is deploying a new version of wal-g that might help with this: production#5550 (closed)
There was one of those random GCP reboots on a Gitaly node: production#5558 (closed). Gitaly HA would help avoid downtime for projects on that shard.
I was only involved in the cleanup, but the redis-persistent saturation incident also consumed some of our time: production#5546 (closed)
The effect of the cron change is that this won't fully summarize the week of the rotation, but I think a general summary of the issues once per week is probably good.
One thing we noticed: for reasons explained in production#5546 (closed), when blocking a namespace/project for a ToS violation, we need to block it at the edge too.
A note from our handover meeting: PCL timing across days can be confusing. While the handbook spells out that, by default, a PCL runs from 9 am UTC to 9 am UTC the next day, we figured it would be best to include this timing information directly in the PCL dates to avoid confusion.
APAC had nothing major, but a steady stream of little things. In particular:
Wal-G: already noted above; one more day until the fix is deployed and my sanity can be restored.
Gitaly FindTag slowness: one customer, one project, 270K+ tags; FindTag gRPC calls were taking 3-8s, significantly affecting apdex. A housekeeping run on the repo packed the tag refs into a single file, after which FindTag calls took 70-80ms instead. Lesson: there may be things we can do to repos, like housekeeping, to clean up some of these Gitaly apdex issues.
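As a hedged illustration of that lesson, a minimal sketch of kicking off housekeeping for a project via the GitLab REST API (housekeeping runs gc/repack, which among other things packs loose refs); the project ID and token below are placeholders, and on gitlab.com we would more likely trigger this internally than through the public API:

```python
# Minimal sketch: trigger GitLab housekeeping for a project via the REST API.
# The project ID and token are placeholders, not the ones from this incident.
import requests

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = 12345          # hypothetical project ID
TOKEN = "glpat-..."         # token with api scope and sufficient access

resp = requests.post(
    f"{GITLAB_API}/projects/{PROJECT_ID}/housekeeping",
    headers={"PRIVATE-TOKEN": TOKEN},
    timeout=30,
)
resp.raise_for_status()  # 201 means the housekeeping task was queued
```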
KAS: At quieter times, KAS apdex alerts fire. production#5561 (closed) eventually found that this is caused by nginx-ingress-controller scaleup events producing low-grade 502s, which are just more noticeable for KAS because of its low request rates. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14210 covers trying to find out exactly why, but Delivery are removing nginx from the API fleet next week (from 2021-09-27), so unless I find my plate miraculously clean of other more timely work and need to pick something up, any further investigation is probably a waste of time. For the next week or two, we may just have to suck up those alerts and ignore them (I don't think a silence is correct for that period of time, as it would hide bigger failures).
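For context on why the same 502s hurt KAS more than busier fleets, a back-of-the-envelope calculation with made-up numbers (a single scaleup burst is assumed to drop ~10 requests):

```python
# Illustrative arithmetic only; the burst size and request rates are made up.
BURST_502S = 10          # hypothetical 502s dropped during one scaleup event
WINDOW_SECONDS = 60

for name, rps in [("busy API fleet", 50), ("KAS at a quiet time", 2)]:
    total_requests = rps * WINDOW_SECONDS
    print(f"{name}: {BURST_502S / total_requests:.1%} of requests failed in that minute")
# busy API fleet: 0.3%; KAS at a quiet time: 8.3% -- enough to trip an apdex/error-rate alert
```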
Prometheus/Alertmanager: When Alertmanager cycles pods, some of the prometheii can sometimes lose track of one of them (they should do a DNS refresh, but they don't). It's happened 3-4 times in the last couple of months; restarting Prometheus (one at a time, waiting for each to be fully up before moving on) is the immediate answer. Code spelunking has not thrown up anything useful yet, but I'm hunting for a sporadic/random/race-condition bug, so it must be non-obvious.
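For the runbook, a rough sketch of that "one at a time, wait until fully up" restart loop, assuming SSH plus systemd for the restart (the host list and restart command are environment-specific placeholders; /-/ready is Prometheus's standard readiness endpoint):

```python
# Rolling restart of Prometheus instances, one at a time, waiting for readiness.
# Hosts and the restart mechanism below are placeholders for illustration.
import subprocess
import time

import requests

PROM_HOSTS = ["prometheus-01.example.net", "prometheus-02.example.net"]

def wait_until_ready(host, timeout=600):
    """Poll Prometheus's /-/ready endpoint until it reports ready or we time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"http://{host}:9090/-/ready", timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass  # not up yet; keep waiting
        time.sleep(10)
    raise TimeoutError(f"{host} not ready after {timeout}s")

for host in PROM_HOSTS:
    # Environment-specific restart; could equally be a kubectl rollout or chef-client run.
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "prometheus"], check=True)
    wait_until_ready(host)
```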