Observability for customers.gitlab.com
Problem
We received a weekend page for customers.gitlab.com (gitlab-com/gl-infra/production#2435 (closed)) that resolved itself quickly.
Because we don't have Kibana or Prometheus set up for this host, it's difficult to diagnose what happened. The only option seems to be sifting through logs on the box itself.
AFAIK this is the only host running in Azure, and Prometheus has been broken there. cc @bjk-gitlab
Proposal
We should prioritize work to get observability for this application. This may be easier if we migrate away from the current snowflake-y setup, for example to k8s: #671 (closed).
Result
- We will be able to more effectively operate this service and diagnose issues.
- We will be able to measure availability and define SLOs.
- We will be able to improve availability.
- We will reduce pager load for the on-call.
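To make "measure availability and define SLOs" concrete, here is a minimal sketch of the arithmetic, assuming a request-based SLI. The 99.9% target, the function name `evaluate_slo`, and the request counts are all hypothetical and not taken from this issue; in practice the counts would come from whatever metrics we end up collecting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    availability: float            # fraction of requests that succeeded
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = spent, < 0 = overspent
    slo_met: bool


def evaluate_slo(total_requests: int, failed_requests: int,
                 target: float = 0.999) -> SloReport:
    """Compare observed availability against an SLO target.

    The default target of 99.9% is an illustrative assumption only.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")

    availability = (total_requests - failed_requests) / total_requests
    # The error budget is the number of failures the target tolerates.
    allowed_failures = (1 - target) * total_requests
    remaining = 1 - failed_requests / allowed_failures
    return SloReport(availability, remaining, availability >= target)


if __name__ == "__main__":
    # Hypothetical numbers for one measurement window.
    print(evaluate_slo(total_requests=1_000_000, failed_requests=500))
```

A negative `error_budget_remaining` means the window burned more than its budget, which is the kind of signal an SLO alert would fire on.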
Next steps (if any)
cc @brentnewton, @dawsmith, @AnthonySandoval, and @chris_baus for prioritization.
How will we measure success?
- Logs are available in Kibana
- Metrics are available in Prometheus
- We have SLO dashboards and alerts