Squad 3 W11, W12 · GitLab Infrastructure Team · GitLab

2022-03-24

🎉 Done

Prometheus OOM kills on startup
- Prometheus version bumps are done and we are running the latest version (v2.34.0) 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15310. This should help with WAL recovery time.
- Blocked: waiting on review for https://github.com/prometheus/prometheus/pull/10406
Vertically scale instances for osquery performance
- Update redis-cache-sentinel in production 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15428
Staging environment for Thanos Query
- We've already found this useful! We've upgraded to the latest version and didn't break production 👉 gitlab-com/gl-infra/k8s-workloads/tanka-deployments!350 (comment 883867678)
Upgrade to Thanos 0.24.0
- We've found our blocker for deployments and have plans to address it.
Interruptions

⌛ In Progress

Increased latency from us-east1-d for GCS buckets
- GCP was able to use https://gitlab.com/gitlab-com/gl-infra/perfdiag to reproduce the problem on their end and is seeing the increased latencies in US timezones.
ES diagnostics are no longer running

⏭ Next

Focus on SRE tech screening revamp
Continue working on Thanos OOM kills on DNS resolution
Finish Thanos upgrade to v0.24.0

📣 Shoutout

@f_santos for fixing the issue where Prometheus sometimes doesn't refresh its Alertmanager IPs when they cycle

2022-03-17

🎉 Done

Prometheus OOM kills on startup
- We've upgraded all staging Prometheus servers to the latest version to help with WAL corruption 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!636 (merged) and https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1542
Vertically scale services to compensate for osqueryd saturation
- Vertically scale redis-cache-sentinel on staging 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15402
Miscellaneous
- Fix labeling issues with CustomerDot metrics 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15442
- Remove dead config from Prometheus 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15463

⏳ In Progress

Increased latency from us-east1-d for GCS buckets
- GCP still acknowledges that this is a problem even if their systems say that everything is OK.
- They have looked at the networking infrastructure, and everything looks good at both the path level and the hardware level.
- They are shifting focus to our GKE clusters to see if they can find anything there.
Fix thanos-store OOM kills
- No update, engineer still on PTO.
Prometheus OOM kills on startup
- Working with the Prometheus community on our upstream patch
- Finish the Prometheus server upgrades early next week.

⏭ Next

Vertically scale services to compensate for osqueryd saturation
- Update redis-cache-sentinel in production 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15428
Set up Thanos-Query staging environment
- Opening a change management issue to roll out the new environment.

📣 Shoutout

@f_santos for enabling persistent state for Alertmanager in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15411