
On-Call Handover 2020-03-11 23:00 UTC

On-Call Handover

Brought to you by the Slack slash command: /sre-oncall handover

Summary:

  • One Gitaly node (file-45) saturated its CPUs for 25 minutes.
  • An elevated error rate on staging.gitlab.com led to a series of PagerDuty alerts.
  • No noteworthy fall-out from the change requests or feature-flag toggles (which I really appreciate people giving a heads-up about!).
  • During the previous on-call shift, redis-cache had response-time spikes. No similar events occurred during this shift, but mitigation work is in progress and may reach production soon.
    • Supplemental notes:
      • My impression is that redis-cache will remain at risk of CPU starvation during the daily traffic peak until improvements are made.
      • One idea is that switching from Unicorn to Puma may have implicitly shortened the lifespan of thread-local caching, because Puma threads are short-lived. Changing the cache's locality from thread-specific back to process-specific should restore the effective cache lifespan we saw under Unicorn; see the sketch after this list. The implementing class (ActiveSupport::Cache::MemoryStore) is expected to be thread-safe (i.e. it locks during reads and writes to protect threads from concurrent mutations). gitlab-org/gitlab!26935 (merged)
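
For illustration only (this is not the diff in gitlab-org/gitlab!26935, just a minimal sketch of the idea): replace a per-thread cache with a single process-wide ActiveSupport::Cache::MemoryStore shared across Puma threads. Names like expensive_lookup are hypothetical placeholders.

    require 'active_support'
    require 'active_support/cache'

    # Per-thread cache: under Puma, threads are short-lived, so each new
    # thread starts with a cold cache.
    def thread_local_cache
      Thread.current[:local_cache] ||= ActiveSupport::Cache::MemoryStore.new
    end

    # Process-wide cache: shared by every thread in the worker process, so
    # entries survive thread turnover. MemoryStore synchronizes reads and
    # writes internally, so concurrent access from Puma threads is safe.
    PROCESS_CACHE = ActiveSupport::Cache::MemoryStore.new

    # Hypothetical stand-in for the expensive work being cached.
    def expensive_lookup(key)
      "value-for-#{key}"
    end

    def cached_value(key)
      PROCESS_CACHE.fetch(key, expires_in: 60) do
        expensive_lookup(key)
      end
    end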

Ongoing alerts/incidents:

  • Postgres replication lag is back again on the dr-archive replica.
    • Still ignorable in the short term.
    • Did not have time to revisit, but based on yesterday's observations I still think having wal-e prefetch more WAL files would avoid more than half of the occasions on which Postgres has to pause transaction replay; see the sketch after this list.
    • PagerDuty: https://gitlab.pagerduty.com/incidents/PE9RIOA - [#18527] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica
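
For concreteness, a hedged sketch of that idea (the wal-e binary path, envdir path, and prefetch value below are assumptions, not taken from the replica's actual configuration): wal-e's wal-fetch subcommand has a --prefetch option controlling how many upcoming WAL segments it downloads ahead of replay, so the replica's restore_command could be tuned along these lines.

    # recovery.conf on the archive recovery replica (illustrative paths and values)
    restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch --prefetch 16 "%f" "%p"'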

Resolved actionable alerts:

Unactionable alerts:

Resolved production incidents:

Change issues:

  • Elasticsearch logs for Registry are fixed, but historical logs had to be dropped due to corrupt timestamps: production#1756 (closed)
    • For part of the day, I thought the Gitaly index was also fixed by this maintenance, but I later discovered that Kibana was still showing me the Registry index's logs even though I had explicitly switched indexes and clicked "Refresh". Screenshots confirmed the observation. I restarted my browser and cleared cached state, and I haven't seen it reoccur since. Your mileage may vary.
  • Ongoing Gitaly repo migrations from file-34 to file-44: https://gitlab.com/gitlab-com/gl-infra/production/issues/1250#note_303368434