On-Call Handover 2020-03-11 23:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
Summary:
- One Gitaly node (`file-45`) saturated its CPUs for 25 minutes.
- An elevated error rate on `staging.gitlab.com` led to a series of PagerDuty alerts.
- No noteworthy fall-out from the change requests or feature-flag toggles (which I really appreciate people giving a heads-up about!).
- During the previous on-call shift, redis-cache had response time spikes. No similar events occurred during this shift, but mitigation work is in progress and may make its way into production quickly.
- Supplemental notes:
  - My impression is that the daily traffic peak will remain at risk of starving redis-cache of CPU time until improvements are made.
  - One idea is that switching from unicorn to puma may have implicitly reduced the lifespan of thread-local caching, because puma threads are short-lived. Changing the cache's locality back from thread-specific to process-specific should restore the effective cache lifespan we saw under Unicorn. The implementing class (`ActiveSupport::Cache::MemoryStore`) is expected to be thread-safe (i.e. it locks around reads and writes to protect threads from concurrent mutations). A minimal sketch of the cache-locality idea follows this list. gitlab-org/gitlab!26935 (merged)
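Since the mitigation hinges on where the cache lives, here is a minimal Ruby sketch of the idea (illustrative only, not the code from the linked MR; the method names and the `PROCESS_CACHE` constant are made up for this example), contrasting a thread-local memo with a shared `ActiveSupport::Cache::MemoryStore`:

```ruby
# Illustrative sketch only (not GitLab's implementation): why a thread-local
# cache loses its contents under short-lived Puma threads, and how a
# process-wide ActiveSupport::Cache::MemoryStore keeps entries warm instead.
require 'active_support'
require 'active_support/cache'

# Thread-local cache: each thread gets its own hash, so a freshly spawned
# Puma thread always starts cold and its entries die with the thread.
def thread_local_fetch(key)
  store = Thread.current[:memo_cache] ||= {}
  store[key] ||= yield
end

# Process-wide cache: one MemoryStore shared by every thread in the worker
# process. MemoryStore synchronizes reads and writes internally, so it is
# safe to share, and entries survive for the life of the process.
PROCESS_CACHE = ActiveSupport::Cache::MemoryStore.new

def process_wide_fetch(key, &block)
  PROCESS_CACHE.fetch(key, &block)
end

# The expensive block runs once per thread in the thread-local case; in the
# shared case, later calls hit the cache once the first write lands.
threads = 4.times.map do
  Thread.new do
    thread_local_fetch(:expensive) { sleep(0.1); :thread_local_value }
    process_wide_fetch(:expensive) { sleep(0.1); :shared_value }
  end
end
threads.each(&:join)
```

The trade-off is that the process-wide store must synchronize access (which `MemoryStore` does internally), while the thread-local hash needs no locking but is thrown away whenever Puma retires its thread.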
Ongoing alerts/incidents:
- Postgres replication lag is back again on the dr-archive replica.
  - Still ignorable in the short term.
  - Did not have time to revisit, but based on yesterday's observations I still think having wal-e fetch more WAL files in advance would prevent more than half of the pauses in transaction replay that postgres currently incurs (see the config sketch after this list).
  - PagerDuty: https://gitlab.pagerduty.com/incidents/PE9RIOA - [#18527] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica
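For context on that idea, below is a hedged sketch of where the prefetch tuning would live on the replica; the exact restore_command, the `--prefetch` flag, and the value 16 are assumptions to verify against the wal-e version and recovery setup actually in use, not configuration taken from production:

```
# recovery.conf on the archive recovery replica (postgresql.conf on PG 12+).
# Assumption: wal-e's wal-fetch accepts a --prefetch option (defaulting to a
# small number of segments); raising it stages more WAL locally before
# postgres asks for it, so transaction replay pauses less often on fetches.
restore_command = 'wal-e wal-fetch --prefetch 16 "%f" "%p"'
```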
Resolved actionable alerts:
- A single Gitaly node (`file-45`) saturated its CPU and spiked its memory usage.
  - PagerDuty: https://gitlab.pagerduty.com/incidents/PRHLS53 - [#18523] Firing 1 - Gitaly error rate is too high: 18.71
  - Incident: production#1757 (closed)
  - Root cause investigation is still underway. We identified a strongly correlated set of gRPC calls, but whether they were the trigger or just victims is an open question.
- `staging.gitlab.com` has an elevated rate of HTTP 500 responses.
  - PagerDuty: https://gitlab.pagerduty.com/incidents/P4M1O08 - [#18535] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone
  - The PagerDuty alert described Cloudflare's edge responses, but follow-up showed that the HTTP 500 responses were indeed coming from origin (GCP). They started at 20:07 UTC, over an hour before the alert triggered.
  - Active discussion via Slack thread: https://gitlab.slack.com/archives/C101F3796/p1583962314163600
  - The PagerDuty alert's graph link is broken, so I created a disposable one for now: https://dashboards.gitlab.net/d/7ef50NuWk/cloudflare-http-response-codes-by-zone
  - This alert re-triggered several times, but the staging env has now stabilized. See note from @rspeicher: https://gitlab.slack.com/archives/C101F3796/p1583966140184100?thread_ts=1583962314.163600&cid=C101F3796
Unactionable alerts:
Resolved production incidents:
Change issues:
- Elasticsearch logs for Registry are fixed, but historical logs had to be dropped due to corrupt timestamps: production#1756 (closed)
  - For part of the day, I thought the Gitaly index was also fixed by this maintenance, but I later discovered that Kibana was still showing me the Registry index's logs even though I had explicitly switched indexes and clicked "Refresh". Screenshots confirmed the observation. After I restarted my browser and cleared cached state, I haven't seen it recur. Your mileage may vary.
- Ongoing Gitaly repo migrations from file-34 to file-44: https://gitlab.com/gitlab-com/gl-infra/production/issues/1250#note_303368434