Automated weekly error budget report on Slack
Original ask with screenshots and links
Recently, I have noticed that group product managers have a weekly task of reporting on error budgets. Here are a few examples: here and here, with images included below. After some initial research, I found https://api.slack.com/messaging/webhooks, which describes how to create a Slack app that accepts and processes an incoming webhook (Grafana). I'm wondering whether this is an opportunity to automate some of these updates to specific channels. Given all the work already done on error budgets, I thought this could extend those efforts and make reporting more efficient.
Plan to resolve
Current setup
Error budgets are currently reported in several different ways: by humans, through various snippets (https://gitlab.com/gitlab-com/gl-infra/scalability/-/snippets/2299544), or not at all.
The ask is to allow all teams to receive weekly error budget status updates in Slack, including current availability, error budget used, and error budget remaining. I suspect that a link to a dashboard or other location with more information would also be very helpful in these alerts.
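To make the three requested fields concrete, here is a minimal Python sketch of how they relate and how they could be turned into a Slack message. It is an illustration under assumptions, not the actual tool: the `BudgetStatus` class, the SLO target, the reporting window, and the message wording are all hypothetical. The only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field. The real tool writes to Slack via slackline, which is not shown here.

```python
import json
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    """Hypothetical container for one stage group's weekly numbers."""
    group: str
    availability: float   # measured fraction of successful operations, e.g. 0.999
    target: float         # assumed availability target, e.g. 0.9995
    window_seconds: int   # assumed reporting window, e.g. 7 days

    @property
    def budget_total(self) -> float:
        # Seconds of "allowed" unavailability in the window.
        return (1 - self.target) * self.window_seconds

    @property
    def budget_used(self) -> float:
        # Seconds of unavailability actually observed in the window.
        return (1 - self.availability) * self.window_seconds

    @property
    def budget_remaining(self) -> float:
        # Negative when the group has overspent its budget.
        return self.budget_total - self.budget_used


def format_message(status: BudgetStatus, dashboard_url: str) -> str:
    """Render the fields the ask mentions, plus a dashboard link."""
    return (
        f"Weekly error budget for {status.group}\n"
        f"Availability: {status.availability:.2%}\n"
        f"Budget used: {status.budget_used / 60:.1f} min\n"
        f"Budget remaining: {status.budget_remaining / 60:.1f} min\n"
        f"Details: {dashboard_url}"
    )


def webhook_payload(status: BudgetStatus, dashboard_url: str) -> str:
    """JSON body in the shape Slack incoming webhooks accept."""
    return json.dumps({"text": format_message(status, dashboard_url)})
```

A negative "budget remaining" value signals that the group has spent more than its budget for the window, which is probably the case most worth highlighting in the message.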
Requirements
- Update weekly error budgets per stage group in the periodic Thanos queries bucket. Weekly error budget analysis per stage group. gitlab-com/runbooks!4850 (merged)
- A tool that manages the opt-in or opt-out of stage groups and the lookup of the appropriate Slack channel names, and that links the current error budget dashboards per group. (The error budget detail dashboard is generated using a predictable slug based on the group name here, using the stageGroupDashboards.dashboardUid function, so I think we should be able to build or get to that slug from wherever we report.) This tool will also need to write to Slack using slackline.
- A scheduled pipeline to run this once a week, similar to how the error budget report works today: https://gitlab.com/gitlab-org/error-budget-reports/-/blob/d18c3afef93dc8d0fcca52f6d7168d953c60680a/.gitlab-ci.yml
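The opt-in and channel lookup requirement could be sketched roughly as follows in Python. Everything here is hypothetical: the `GroupConfig` fields, the channel names, the dashboard host, and especially the `dashboard_uid` rule, which is only a placeholder standing in for whatever stageGroupDashboards.dashboardUid actually produces. The real slug logic would need to be taken from the runbooks.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Placeholder host; the real Grafana base URL is not given in this document.
DASHBOARD_BASE = "https://dashboards.example.com/d"


@dataclass(frozen=True)
class GroupConfig:
    """Hypothetical per-group settings managed by the tool."""
    key: str                  # stage group key, e.g. "pipeline_insights"
    channel: Optional[str]    # Slack channel to report to, if known
    opted_in: bool = False    # groups are skipped unless they opt in


def dashboard_uid(group_key: str) -> str:
    """Placeholder slug rule, NOT the real dashboardUid implementation."""
    return "error-budget-" + group_key.replace("_", "-")


def report_targets(groups: List[GroupConfig]) -> List[Tuple[str, str]]:
    """Return (channel, dashboard URL) pairs for groups that should get a report."""
    targets = []
    for group in groups:
        # Skip groups that opted out or have no channel configured.
        if not group.opted_in or not group.channel:
            continue
        url = f"{DASHBOARD_BASE}/{dashboard_uid(group.key)}"
        targets.append((group.channel, url))
    return targets
```

A weekly scheduled job would then iterate over `report_targets(...)` and send one message per pair. Defaulting `opted_in` to `False` keeps the rollout limited to teams that have explicitly agreed, which matches the beta approach described here.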
Status
Periodic queries are complete, the tool has been created, and the Slack app has been approved. The next steps are to set up the scheduled pipeline and begin beta testing.
Beta teams
~"group::runner" and ~"group::pipeline insights" to start with. Confirming with the teams in question that they're still ready to do this.