The fluentd_log_output SLI of the logging service (`main` stage) has an error rate violating SLO
Start time: 01 February 2021, 11:05 AM (UTC)
Severity: critical
full_query: ((gitlab_component_errors:ratio_1h{component="fluentd_log_output",monitor="global",type="logging"} > (14.4 * 0.001)) and (gitlab_component_errors:ratio_5m{component="fluentd_log_output",monitor="global",type="logging"} > (14.4 * 0.001)) or (gitlab_component_errors:ratio_6h{component="fluentd_log_output",monitor="global",type="logging"} > (6 * 0.001)) and (gitlab_component_errors:ratio_30m{component="fluentd_log_output",monitor="global",type="logging"} > (6 * 0.001))) and on(env, environment, tier, type, stage, component) (sum by(env, environment, tier, type, stage, component) (gitlab_component_ops:rate_1h{component="fluentd_log_output",monitor="global",type="logging"}) >= 1)
Monitoring tool: Prometheus
Description: This SLI monitors fluentd log output by tracking the number of output errors in fluentd.
The error rate is currently 0.7769%.
GitLab alert: https://gitlab.com/gitlab-com/gl-infra/production/-/alert_management/86/details
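The full_query is a standard multi-window, multi-burn-rate SLO alert: it fires when both the 1h and 5m error ratios exceed 14.4 × the 0.1% target implied by the query (i.e. 1.44%), or when both the 6h and 30m ratios exceed 6 × that target (0.6%), and only while the component has served roughly at least one operation per second averaged over the last hour. The reported 0.7769% error rate is above the 0.6% slow-burn threshold but below the 1.44% fast-burn threshold, which suggests the 6h/30m pair is the condition that fired. Below is a minimal sketch of that logic, assuming the recording rules mean what their names imply (error ratio and operation rate per window); the function and constant names are illustrative, not part of the alerting pipeline.

```python
# Illustrative sketch only: re-states the burn-rate condition encoded in the PromQL above.
SLO_TARGET = 0.001  # 0.1% allowed error ratio for this SLI (from the "* 0.001" factors)

def slo_alert_firing(ratio_1h, ratio_5m, ratio_6h, ratio_30m, ops_rate_1h):
    """Return True when the multi-window, multi-burn-rate alert would fire.

    Fast burn: both the 1h and 5m error ratios exceed 14.4x the target (1.44%).
    Slow burn: both the 6h and 30m error ratios exceed 6x the target (0.6%).
    Either condition only counts while the component handles >= 1 op over the 1h rate window.
    """
    fast_burn = ratio_1h > 14.4 * SLO_TARGET and ratio_5m > 14.4 * SLO_TARGET
    slow_burn = ratio_6h > 6 * SLO_TARGET and ratio_30m > 6 * SLO_TARGET
    return (fast_burn or slow_burn) and ops_rate_1h >= 1

# Example: a 0.7769% (0.007769) error ratio in the longer windows breaches the
# 0.6% slow-burn threshold but not the 1.44% fast-burn threshold.
print(slo_alert_firing(0.007769, 0.007769, 0.007769, 0.007769, ops_rate_1h=10))  # True
```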
Summary
More information will be added as we investigate the issue.
Timeline
All times UTC.
`YYYY-MM-DD` - `00:00` - ...
Corrective Actions
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Time to detection:
- Minutes of downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would have prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was used, the recording will be automatically uploaded to the Incident room Google Drive folder (private)