Closed
Milestone
Mar 14, 2022–Mar 26, 2022
Squad 3 W11, W12
2022-03-24
🎉 Done
-
Prometheus OOM kills on startup
- Prometheus version bumps are done and we are running the latest version (v2.34.0) 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15310. This should help with WAL recovery time.
- Blocked: waiting on review for https://github.com/prometheus/prometheus/pull/10406
-
Vertical scale instance for OSquery performance
- Update redis-cache-sentinel in production 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15428
-
Staging environment for Thanos Query
- We've already found this useful! We upgraded to the latest version without breaking production 👉 gitlab-com/gl-infra/k8s-workloads/tanka-deployments!350 (comment 883867678)
-
Upgrade to Thanos 0.24.0
- We've found our blocker for deployments and have plans to address it.
- Interruptions
⌛ In Progress
-
Increased latency from us-east1-d for GCS buckets
- GCP was able to use https://gitlab.com/gitlab-com/gl-infra/perfdiag to reproduce the problem on their end and is seeing the increased latencies in US timezones.
- ES diagnostics are no longer running
⏭ Next
- Focus on SRE tech screening revamp
- Continue working on Thanos OOM kills on DNS resolution
- Finish Thanos upgrade to v0.24.0
📣 Shoutout
2022-03-17
🎉 Done
-
Prometheus OOM kills on startup
- We've upgraded all staging Prometheus servers to the latest version to help with WAL corruption 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!636 (merged) and https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1542
-
Vertically scale services to compensate for osqueryd saturation
- Vertically scale redis-cache-sentinel on staging 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15402
- Miscellaneous
- Fix labeling issues with CustomerDot metrics 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15442
- Remove dead config from Prometheus 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15463
⏳ In Progress
-
Increased latency from us-east1-d for GCS buckets
- GCP still acknowledges that this is a problem even if their systems say that everything is OK.
- They have looked at the networking infrastructure and both on a path level and hardware level everything looks good.
- They are shifting focus to our GKE clusters to see whether they can find anything there.
-
Fix thanos-store OOM kills
- No update, engineer still on PTO.
-
Prometheus OOM kills on startup
- Working with the Prometheus community for our upstream patch
- Finish the Prometheus server upgrades early next week.
⏭ Next
-
Vertically scale services to compensate for osqueryd saturation
- Update redis-cache-sentinel in production 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15428
-
Set up Thanos-Query staging environment
- Opening a change management issue to roll out the new environment.
📣 Shoutout
- @f_santos Enabling persistent state for Alertmanager in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15411