Closed
Milestone
Mar 28, 2022–Apr 9, 2022
Squad 3 W13, W14
2022-04-07
🎉 Done
- Increased latency from us-east1-d for GCS buckets
  - GCP found the issue on their end and we've seen massive improvements
  - We are waiting on an RCA about this from GCP
- SLI of the web-pages service in region us-east has an error rate violating SLO
  - Fixed 502 errors on pod startup 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1670 (merged). This has nothing to do with the SLO, but it was a red herring that made debugging harder.
  - Fixed 502 errors on pod shutdown in gstg 👉 gitlab-com/gl-infra/k8s-workloads/gitlab-com!1686 (merged)
⏭ Next
- SLI of the web-pages service in region us-east has an error rate violating SLO
  - Rollout change in gprd
  - Upstream fix in our helm chart
- Thanos upgrade
  - @swainaina is back from vacation so we'll pick up this work again.
  - The first step is to update thanos-store and thanos-query
🔦 Interruptions/Reduced capacity
- On-Call
- PTO/sick leave
2022-04-01
🎉 Done
- Upgrade to Thanos v0.24.0
  - Updated alerts to account for the metrics that were renamed.
- Increased latency from us-east1-d for GCS buckets
  - Relaxed imagescaler SLO since it was generating a lot of alerts 👉 gitlab-com/runbooks!4477 (merged)
⏳ In Progress
- Increased latency from us-east1-d for GCS buckets
  - GCP is doing a deeper analysis of their network, using heavier tooling that captures all incoming requests rather than sampling them.
- Prometheus recovery on WAL file corruption
  - Still waiting on the review of https://github.com/prometheus/prometheus/pull/10406
- Implement a continuous profiler tool for golang binaries with pprof
  - Going through the code review process 👉 woodhouse!135 (closed)
⏭ Next
- @steveazz and @nnelson will focus on &717 (closed)
🔦 Miscellaneous/Interruptions
- Advanced search elasticsearch instability
- Help with CI database configuration changes 👉 production#6730 (closed)
- Pubsub investigations 👉 https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15501#note_893114787
- ES diagnostics 👉 es-diagnostics!9 (merged)