Production Change

Change Summary

This change adds and enable autoscaling based on PubSub metrics for the pubsubbeat deployment in production, so that it can handle large sudden influx of logs in a timely manner while not being over-provisioned in normal times.

See https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15451

Change Details

Services Impacted - ServiceLogging
Change Technician - @pguinoiseau
Change Reviewer - @gsgl
Time tracking - 15 minutes
Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 minutes

Obtain approval for the MR gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!648 (merged)
Set label changein-progress on this issue

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 5 minutes

Merge the MR gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!648 (merged) and make sure it applies successfully

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

Verify that the pubsubbeat deployments are healthy:
```
kubectl --namespace pubsubbeat get deployments
```

Verify that the pubsubbeat HPAs have created and are able to get PubSub metrics:

kubectl --namespace pubsubbeat get hpa
kubectl --namespace pubsubbeat describe pubsubbeat-pubsub-rails-inf-gprd

should show something like:

Metrics:                                                                                 ( current / target )
  "pubsub.googleapis.com|subscription|num_undelivered_messages" (target average value):  0 / 80k
  resource cpu on pods  (as a percentage of request):                                    1% (12m) / 80%

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5 minutes

Revert the MR gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!648 (merged) and merge/apply it

Monitoring

Key metrics to observe

Metric: PubSub subscriptions Undelivered messages
- Location: https://dashboards.gitlab.net/d/logging-main/logging-overview?orgId=1&viewPanel=106&from=now-1h&to=now
- What changes to this metric should prompt a rollback: Undelivered messages count climbing above 80k and pubsubbeat pods not scaling out
Metric: pubsubbeat SLI Error Ratio
- Location: https://dashboards.gitlab.net/d/logging-main/logging-overview?orgId=1&from=now-1h&to=now&viewPanel=2430725542
- What changes to this metric should prompt a rollback: Error ratio climbing above 1%

Summary of infrastructure changes

Does this change introduce new compute instances?
Does this change re-size any existing compute instances?
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Change Reviewer checklist

C4 C3 C2 C1:

The scheduled day and time of execution of the change is appropriate.
The change plan is technically accurate.
The change plan includes estimated timing values based on previous testing.
The change plan includes a viable rollback plan.
The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
The change plan includes success measures for all steps/milestones during the execution.
The change adequately minimizes risk within the environment/service.
The performance implications of executing the change are well-understood and documented.
The specified metrics/monitoring dashboards provide sufficient visibility for the change. - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
The change has a primary and secondary SRE with knowledge of the details available during the change window.

2022-03-21: Enable autoscaling for pubsubbeat

Production Change

Change Summary

Change Details

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Change Steps - steps to take to execute the change

Post-Change Steps - steps to take to verify the change

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Monitoring

Key metrics to observe

Summary of infrastructure changes

Change Reviewer checklist

Change Technician checklist