Push background processing incident resolution left (towards Development)
Entirely copied from @andrewn's excellent comment at #86 (comment 266907087):
However, I would like to propose that we, as the infrastructure team, also start putting processes in place to push the resolution left.
What do I mean by this?
- We cannot guarantee infinite scalability. We need to define reasonable limits for the workload that the application can create.
- While solutions like #42 (closed) will help, ultimately we should define clearer SLOs and split responsibilities for each job.
- Much of the groundwork for these SLOs has already been done as part of the worker attribution work, but the SLOs have yet to be formalised (#25 (closed)) and communicated to the wider engineering group.
- What are the SLOs? (These are example numbers, real values may differ)
- The infrastructure team undertakes to:
- For 95% of latency sensitive workers, maintain a maximum queuing duration of 5 seconds
- For 95% of non-latency sensitive workers, maintain a maximum queuing duration of 30 seconds
- The engineering team responsible for each worker undertakes to:
- For 95% of latency sensitive workers, ensure a maximum execution duration of 5 seconds
- For 95% of non-latency sensitive workers, ensure a maximum execution duration of 120 seconds
- Put simply, we will uphold our end of the agreement (to maintain queue durations) if the engineering teams uphold their end of the agreement.
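As a rough illustration of that split, the sketch below encodes the example thresholds as data and reports which party's side of the agreement is out of budget. The names and structure are hypothetical, not code from the GitLab application.

```ruby
# Illustrative sketch only: the example SLO thresholds above expressed as data,
# plus a tiny check showing which party is out of budget. Names are hypothetical.
SLO_THRESHOLDS = {
  latency_sensitive:     { queueing_p95: 5,  execution_p95: 5 },
  non_latency_sensitive: { queueing_p95: 30, execution_p95: 120 }
}.freeze

# observed durations are p95 values in seconds, e.g. taken from monitoring
def slo_breaches(sensitivity, observed)
  thresholds = SLO_THRESHOLDS.fetch(sensitivity)
  breaches = []
  # Queueing duration is the infrastructure team's side of the agreement.
  breaches << :infrastructure_team if observed[:queueing_p95] > thresholds[:queueing_p95]
  # Execution duration is the owning engineering team's side of the agreement.
  breaches << :owning_team if observed[:execution_p95] > thresholds[:execution_p95]
  breaches
end

slo_breaches(:latency_sensitive, queueing_p95: 2.1, execution_p95: 9.4)
# => [:owning_team]
```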
But how does this tie in with making these alerts actionable?
We, as the infrastructure team, can only ensure the required throughput on GitLab.com if the application teams ensure that their workers perform within certain limits. For example, it's impossible for us to guarantee a 1s queueing duration SLO while running 10k jobs that each take 10 minutes.
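To make the arithmetic behind that example concrete, here is a back-of-the-envelope sketch; the concurrency figure is made up for illustration and is not a real GitLab.com capacity number.

```ruby
# Rough sketch for the burst example above. A thread that picks up one of these
# jobs is busy for the full 10 minutes, so queueing time grows in 10-minute "rounds".
burst_size     = 10_000    # jobs enqueued at roughly the same time
execution_secs = 10 * 60   # 10 minutes per job
concurrency    = 1_000     # hypothetical fleet-wide Sidekiq threads

rounds_waited       = (burst_size.to_f / concurrency).ceil - 1
worst_case_queueing = rounds_waited * execution_secs
puts "worst-case queueing: #{worst_case_queueing}s" # => 5400s, nowhere near 1s

# To start every job within ~1s we would need roughly burst_size (~10,000)
# concurrent threads dedicated to this one workload, which is why the queueing
# SLO can only hold if execution durations stay within their own SLO.
```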
Having split responsibilities ensures that each stakeholder is responsible for their end of the agreement.
- If the queuing duration SLO (5 seconds for latency sensitive jobs, 30 seconds for non-latency sensitive jobs) is exceeded, the infrastructure team (EOC) is responsible and needs to investigate.
- Possible reasons include database slowdowns or a lack of capacity for processing jobs
- Possible actions include scaling up the Sidekiq fleet or resolving a slowdown in the infrastructure.
- However, if the execution duration SLO is not being met and there is no clear slowdown on the infrastructure side (i.e. a specific Gitaly or Postgres issue), then the actions would be to:
- Engage with the engineering team responsible for that Sidekiq worker by creating an issue and assigning it to the team
- Silence the alert (for that single worker class) with a link to the issue, for a suggested duration of 1 week (a sketch of the silence call follows below)
At this point, the SLO is not being met, and the error budget for the appropriate feature is being used up, but the infrastructure team are not being alerted.
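For illustration, the silence step could look roughly like the following call against the Alertmanager v2 silences API; the label names, the alert name and the URL are assumptions for the sketch, not our actual alerting configuration.

```ruby
# Hedged sketch of creating a one-week, per-worker-class silence via the
# Alertmanager v2 API. Label names and alert name are assumed, not real config.
require 'net/http'
require 'json'
require 'time'

def silence_worker_alert(alertmanager_url, worker_class, issue_url, duration_days: 7)
  now = Time.now.utc
  silence = {
    matchers: [
      { name: 'alertname', value: 'SidekiqExecutionSLOViolation', isRegex: false }, # assumed alert name
      { name: 'worker',    value: worker_class,                   isRegex: false }  # assumed label
    ],
    startsAt:  now.iso8601,
    endsAt:    (now + duration_days * 24 * 3600).iso8601,
    createdBy: 'eoc',
    comment:   "Execution SLO is owned by the worker's team, tracked in #{issue_url}"
  }

  uri = URI("#{alertmanager_url}/api/v2/silences")
  Net::HTTP.post(uri, silence.to_json, 'Content-Type' => 'application/json')
end

# silence_worker_alert('http://alertmanager.example.com', 'SomeSlowWorker',
#                      'https://gitlab.com/some-group/some-project/issues/NNN')
```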
If the problem persists over a long period, we could request that the team remove the `latency_sensitivity` attribute for the job. For example, in the case of `authorized_projects`, this might be the simplest course of action, if the team responsible for the job agree.
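Purely as a sketch of what that change could look like on a worker class: the `latency_sensitive_worker!` marker and the class body below are assumed names for illustration, not the actual implementation in the GitLab codebase.

```ruby
# Illustrative only: not the real GitLab worker or attribute DSL.
require 'sidekiq'

class AuthorizedProjectsWorker
  include Sidekiq::Worker

  # Before: an (assumed) marker that opts the worker into the strict 5s SLOs.
  # latency_sensitive_worker!
  #
  # With the marker removed, the worker falls under the relaxed example SLOs
  # (30s queueing / 120s execution), so the recurring execution-duration alert
  # stops paging the EOC without any change to the job's logic.

  def perform(user_id)
    # refresh the user's project authorizations ...
  end
end
```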
I'm creating a separate issue to make sure we don't lose this.