Database SLAs per team
Sometime during 2018 I want to start defining database SLAs for the different teams. One example, discussed in https://gitlab.com/gitlab-com/infrastructure/issues/3198, is an SLA for writes, but I want to cover much more than that. Currently I'm thinking of the following:
- The number of SQL queries executed per controller/API must not exceed 100
  - 100 is arbitrary, but I'd like to think you really shouldn't need more than that for anything sane
- The total time spent in SQL per controller must not exceed 100 milliseconds
- Database tables may have at most X inserts, Y updates, and Z deletes in a given time period
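To make the thresholds above concrete, here is a minimal Python sketch of how such a check could look. It is only an illustration: the names (`ControllerStats`, `sla_violations`, `table_violations`) are hypothetical, the stats are assumed to be per-request values already pulled from the metrics store, and the table limits are passed in by the caller because X, Y, and Z are not defined yet.

```python
from dataclasses import dataclass

# Thresholds taken from the list above. The constant names are hypothetical.
MAX_SQL_QUERIES = 100      # queries per controller/API
MAX_SQL_DURATION_MS = 100  # total SQL time per controller, in milliseconds


@dataclass
class ControllerStats:
    """SQL metrics for one controller (assumed to come from existing metrics)."""
    controller: str
    sql_queries: int
    sql_duration_ms: float


def sla_violations(stats):
    """Return the names of the SLAs this controller does not meet."""
    violations = []
    if stats.sql_queries > MAX_SQL_QUERIES:
        violations.append("query_count")
    if stats.sql_duration_ms > MAX_SQL_DURATION_MS:
        violations.append("sql_duration")
    return violations


def table_violations(counts, limits):
    """Compare per-table write counts against caller-supplied limits.

    Both arguments are dicts keyed by "inserts", "updates", and "deletes".
    The limits stand in for the still-undefined X, Y, and Z.
    """
    return [op for op, limit in limits.items() if counts.get(op, 0) > limit]
```

For example, `sla_violations(ControllerStats("SomeController#index", 150, 80.0))` reports only the query count, since the SQL time is within budget.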
In all cases we already have the data (though the SQL timings/numbers are currently only in Influx until Prometheus for app metrics is back). The tricky part will be everything around it: setting up dashboards and alerts, getting teams to work on these issues, and so on. I'm also not entirely sure how best to go about this. Do we just define these SLAs and then get grumpy when they aren't met? How are we going to ensure teams actively work to meet these SLAs instead of ignoring them for long periods of time? Are these thresholds even reasonable?
Alerting-wise, @bjk-gitlab mentioned it's possible to set up alerts that only trigger at a given interval (https://prometheus.io/docs/alerting/configuration/#route). This means we could set up an alert that fires once a month if a certain controller does not meet the SLA, instead of firing continuously for months.
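The route configuration linked above has a `repeat_interval` setting for this. The Python sketch below only illustrates the behaviour we'd want (notify at most once per interval per controller); it is not how Alertmanager implements it, and `ThrottledAlert` is a hypothetical name.

```python
from datetime import datetime, timedelta


class ThrottledAlert:
    """Notify at most once per repeat interval for each alert key."""

    def __init__(self, repeat_interval):
        self.repeat_interval = repeat_interval
        self._last_sent = {}  # alert key -> time of the last notification

    def should_notify(self, key, now):
        """Return True and record the time if enough time has passed."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.repeat_interval:
            return False
        self._last_sent[key] = now
        return True


# Usage: a controller that keeps violating its SLA is reported once a month.
alert = ThrottledAlert(timedelta(days=30))
start = datetime(2018, 1, 1)
first = alert.should_notify("SomeController#index", start)
next_day = alert.should_notify("SomeController#index", start + timedelta(days=1))
next_month = alert.should_notify("SomeController#index", start + timedelta(days=30))
```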
Dashboard-wise, we probably need at least a separate section per team, as simply showing 500 graphs isn't going to work. We'd also have to come up with a way to map controllers to teams.
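One possible way to map controllers to teams is a prefix table where the longest matching prefix wins. The Python sketch below assumes that approach; the prefixes, team names, and function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical ownership table: controller name prefix -> team.
TEAM_BY_PREFIX = {
    "Projects::": "team-a",
    "Projects::MergeRequests": "team-b",
}

UNOWNED = "unowned"


def team_for(controller, mapping=TEAM_BY_PREFIX):
    """Return the team owning a controller, using the longest matching prefix."""
    for prefix in sorted(mapping, key=len, reverse=True):
        if controller.startswith(prefix):
            return mapping[prefix]
    return UNOWNED


def group_by_team(controllers, mapping=TEAM_BY_PREFIX):
    """Group controller names into one dashboard section per team."""
    sections = defaultdict(list)
    for controller in controllers:
        sections[team_for(controller, mapping)].append(controller)
    return dict(sections)
```

Controllers that match no prefix end up in an `unowned` section, which would also show where the mapping still has gaps.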
Workflow-wise, we need to come up with a way to essentially force teams to work on these issues when SLAs are not met. That means setting up a budget and some kind of "motivation" to stay within it. Product managers also need to be aware of this and take it into account when planning.
Ultimately the idea is to have a clear view of how we are doing across the board, which teams may need help or time with these issues, where our big problems are, etc.
@andrewn @DouweM @smcgivern @rymai: I'd love your thoughts / suggestions on this.
⚠ Blocked By
- #54 (moved): apdex scores would be used for setting SLAs
- https://gitlab.com/gitlab-com/infrastructure/issues/1962: blocked until https://gitlab.com/gitlab-com/infrastructure/issues/1962#note_68292200 is resolved