Push background processing incident resolution left (towards Development)
Entirely copied from @andrewn's excellent comment at #86 (comment 266907087):
However, I would like to propose that we, as the infrastructure team, also start putting processes in place to push the resolution left.
What do I mean by this?
- We cannot guarantee infinite scalability. We need to define reasonable limits for the workload that the application can create.
- While solutions like #42 (closed) will help, ultimately we should define clearer SLOs and split responsibilities for each job.
- Much of the groundwork for these SLOs has already been done as part of the worker attribution work, but the SLOs have yet to be formalised (#25 (closed)) and communicated to the wider engineering group.
- What are the SLOs? (These are example numbers, real values may differ)
- The infrastructure team undertakes to:
- For 95% of latency sensitive workers, maintain a maximum queuing duration of 5 seconds
- For 95% of non-latency sensitive workers, maintain a maximum queuing duration of 30 seconds
- The engineering team responsible for each worker undertakes to:
- For 95% of latency sensitive workers, ensure a maximum execution duration of 5 seconds
- For 95% of non-latency sensitive workers, ensure a maximum execution duration of 120 seconds
- Put simply, we will uphold our end of the agreement (to maintain queue durations) if the engineering teams uphold their end of the agreement.
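As a rough illustration of that split, the sketch below encodes the example thresholds as data and reports which party's side of the agreement is out of budget. The names and structure are hypothetical, not code from the GitLab application.

```ruby
# Illustrative sketch only: the example SLO thresholds above expressed as data,
# plus a tiny check showing which party is out of budget. Names are hypothetical.
SLO_THRESHOLDS = {
  latency_sensitive:     { queueing_p95: 5,  execution_p95: 5 },
  non_latency_sensitive: { queueing_p95: 30, execution_p95: 120 }
}.freeze

# observed durations are p95 values in seconds, e.g. taken from monitoring
def slo_breaches(sensitivity, observed)
  thresholds = SLO_THRESHOLDS.fetch(sensitivity)
  breaches = []
  # Queueing duration is the infrastructure team's side of the agreement.
  breaches << :infrastructure_team if observed[:queueing_p95] > thresholds[:queueing_p95]
  # Execution duration is the owning engineering team's side of the agreement.
  breaches << :owning_team if observed[:execution_p95] > thresholds[:execution_p95]
  breaches
end

slo_breaches(:latency_sensitive, queueing_p95: 2.1, execution_p95: 9.4)
# => [:owning_team]
```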
But how does this tie in with making these alerts actionable?
We, as the infrastructure team, can only ensure the required throughput on GitLab.com if the application teams ensure that their workers perform within certain limits. For example, it's impossible for us to guarantee a 1s queueing duration SLO while running 10k jobs that each take 10 minutes.
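To make the arithmetic behind that example concrete, here is a back-of-the-envelope sketch; the concurrency figure is made up for illustration and is not a real GitLab.com capacity number.

```ruby
# Rough sketch for the burst example above. A thread that picks up one of these
# jobs is busy for the full 10 minutes, so queueing time grows in 10-minute "rounds".
burst_size     = 10_000    # jobs enqueued at roughly the same time
execution_secs = 10 * 60   # 10 minutes per job
concurrency    = 1_000     # hypothetical fleet-wide Sidekiq threads

rounds_waited       = (burst_size.to_f / concurrency).ceil - 1
worst_case_queueing = rounds_waited * execution_secs
puts "worst-case queueing: #{worst_case_queueing}s" # => 5400s, nowhere near 1s

# To start every job within ~1s we would need roughly burst_size (~10,000)
# concurrent threads dedicated to this one workload, which is why the queueing
# SLO can only hold if execution durations stay within their own SLO.
```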
Having split responsibilities ensures that each stakeholder is responsible for their end of the agreement.
- If the queuing duration SLO (5 seconds for latency sensitive jobs, 30 seconds for non-latency sensitive jobs) is exceeded, the infrastructure team (EOC) is responsible and needs to investigate.
- Possible reasons include database slowdowns or a lack of capacity for processing jobs
- Possible actions include scaling up the Sidekiq fleet or resolving a slowdown in the infrastructure.
- However, if the execution duration SLO is not being met and there is no clear slowdown on the infrastructure side (i.e. a specific Gitaly or Postgres issue), then the actions would be to:
- Engage with the engineering team responsible for that Sidekiq worker by creating an issue and assigning it to the team
- Silence the alert (for that single worker class) with a link to the issue, for a suggested duration of 1 week (a sketch of the silence call follows below)
At this point, the SLO is not being met, and the error budget for the appropriate feature is being used up, but the infrastructure team are not being alerted.
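For illustration, the silence step could look roughly like the following call against the Alertmanager v2 silences API; the label names, the alert name and the URL are assumptions for the sketch, not our actual alerting configuration.

```ruby
# Hedged sketch of creating a one-week, per-worker-class silence via the
# Alertmanager v2 API. Label names and alert name are assumed, not real config.
require 'net/http'
require 'json'
require 'time'

def silence_worker_alert(alertmanager_url, worker_class, issue_url, duration_days: 7)
  now = Time.now.utc
  silence = {
    matchers: [
      { name: 'alertname', value: 'SidekiqExecutionSLOViolation', isRegex: false }, # assumed alert name
      { name: 'worker',    value: worker_class,                   isRegex: false }  # assumed label
    ],
    startsAt:  now.iso8601,
    endsAt:    (now + duration_days * 24 * 3600).iso8601,
    createdBy: 'eoc',
    comment:   "Execution SLO is owned by the worker's team, tracked in #{issue_url}"
  }

  uri = URI("#{alertmanager_url}/api/v2/silences")
  Net::HTTP.post(uri, silence.to_json, 'Content-Type' => 'application/json')
end

# silence_worker_alert('http://alertmanager.example.com', 'SomeSlowWorker',
#                      'https://gitlab.com/some-group/some-project/issues/NNN')
```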
If the problem persists over a long period, we could request that the team remove the `latency_sensitivity` attribute for the job. For example, in the case of `authorized_projects`, this might be the simplest course of action, if the team responsible for the job agree.
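Purely as a sketch of what that change could look like on a worker class: the `latency_sensitive_worker!` marker and the class body below are assumed names for illustration, not the actual implementation in the GitLab codebase.

```ruby
# Illustrative only: not the real GitLab worker or attribute DSL.
require 'sidekiq'

class AuthorizedProjectsWorker
  include Sidekiq::Worker

  # Before: an (assumed) marker that opts the worker into the strict 5s SLOs.
  # latency_sensitive_worker!
  #
  # With the marker removed, the worker falls under the relaxed example SLOs
  # (30s queueing / 120s execution), so the recurring execution-duration alert
  # stops paging the EOC without any change to the job's logic.

  def perform(user_id)
    # refresh the user's project authorizations ...
  end
end
```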
I'm creating a separate issue to make sure we don't lose this.