Summary of issues not in Epics (autogenerated)

Summary of issues that are not in an Epic

Total Issues: 131

Team Tasks

Issues: 14 team-tasks

Topic Team Service Board Workflow Status Due Date
Keeping documentation up to date
#4016
workflow-infraProposal
Document functionality to create GitLab issues from prometheus alerts
#4013
boardplanning workflow-infraTriage
Dreaming of 2025: Observability Wish List!
#4012
workflow-infraStalled
Metrics data for consumption and analysis
#3967
boardplanning workflow-infraTriage
Add script to update feature categories to stage-groups-index
#3962
workflow-infraTriage
Remove promtool from the runbooks image
#3960
workflow-infraTriage
Route saturation alerts to service owners
#3941
workflow-infraTriage
Metric / o11y on inactive sidekiq threads
#3808
workflow-infraStalled
jsonnet-tool should pass along JSONNET_PATH
#3678
workflow-infraTriage
Link the unwinded source metrics of an alert in an alert message
#3662
workflow-infraTriage
Commit and push feature categories should alert on failure
#3616
workflow-infraTriage
Tamland documentation: a day-in-the-life of Capacity planning
#3578
boardbuild workflow-infraReady
Automate creation of MR to update feature categories
#3400
workflow-infraStalled
Summary of issues not in Epics (autogenerated)
#538

Service::AIGateway

Issues: 2 ServiceAIGateway

Topic Team Service::AIGateway Board Workflow Status
Analysis of frequency and duration of specific AI Gateway errors with code 429
#4210
ServiceAIGateway workflow-infraTriage
Monitor for high percentage of non-200 requests to the AI gateway
#3600
ServiceAIGateway workflow-infraTriage

Service::AlertManager

Issues: 2 ServiceAlertManager

Topic Team Service::AlertManager Board Workflow Status
Corrective action: Workhorse and Load Balancer SLI interdependency for alerts
#2955
ServiceAlertManager workflow-infraTriage
Traffic absent alerts causing pager noise
#3276
ServiceAlertManager workflow-infraProposal

Service::ClickHouseCloud

Issues: 1 ServiceClickHouseCloud

Topic Team Service::ClickHouseCloud Board Workflow Status
Help setup Clickhouse Rails logs
#2982
ServiceClickHouseCloud workflow-infraStalled

Service::Container Registry

Issues: 1 ServiceContainer Registry

Topic Team Service::Container Registry Board Workflow Status
Include a link to a specific kibana error log search in the alert definition for the garbage collection component of the container registry service
#3293
ServiceContainer Registry workflow-infraTriage

Service::Database

Issues: 1 ServiceDatabase

Topic Team Service::Database Board Workflow Status
Stage group index is broken again.
#4043
ServiceDatabase workflow-infraStalled

Service::Elasticsearch

Issues: 4 ServiceElasticsearch

Topic Team Service::Elasticsearch Board Workflow Status
Create test deployment for gitlab-docs-website for Localization team
#4141
ServiceElasticsearch workflow-infraTriage
Move logging and elasticsearch alerts to #g_infra_observability_alerts.
#2941
ServiceElasticsearch workflow-infraStalled
Migrate elasticsearch configuration from custom scripts to terraform
#2966
ServiceElasticsearch workflow-infraBlocked
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3027
ServiceElasticsearch workflow-infraReady

Service::GCP

Issues: 2 ServiceGCP

Topic Team Service::GCP Board Workflow Status
Review new Google Cloud Logging regional ingestion quotas
#4056
ServiceGCP workflow-infraTriage
Implement GCP scheduled snapshots health check
#3257
ServiceGCP workflow-infraTriage

Service::GitLab Rails

Issues: 1 ServiceGitLab Rails

Topic Team Service::GitLab Rails Board Workflow Status
Review Request: maven virtual registry, multiple upstreams support
#4096
teamScalability ServiceGitLab Rails boardbuild workflow-infraReady

Service::Gitaly

Issues: 1 ServiceGitaly

Topic Team Service::Gitaly Board Workflow Status
Corrective action: alert on GitLab pipeline failures due to load.
#3245
ServiceGitaly workflow-infraTriage

Service::Grafana

Issues: 12 ServiceGrafana

Topic Team Service::Grafana Board Workflow Status
Adding a grafana datasource through configuration fails on secrets
#4214
ServiceGrafana workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4193
ServiceGrafana workflow-infraTriage
SLI detail panels should apply the same selectors as the SLI itself does
#4189
ServiceGrafana workflow-infraTriage
Replace redis-sidekiq shard template to use the generic shard template
#4171
ServiceGrafana workflow-infraTriage
Make the service overview show the SLI for each shard in a different colour
#4169
ServiceGrafana workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4079
ServiceGrafana workflow-infraBacklog
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4044
ServiceGrafana workflow-infraTriage
Migrate Grafana to Okta
#4003
ServiceGrafana workflow-infraProposal
Webservice dashboard link to kibana slow rails requests broken
#3855
ServiceGrafana workflow-infraTriage
Escaping of promql queries in alertmanager Slack alerts broken
#3854
ServiceGrafana workflow-infraTriage
Streamline latency attribution via service dashboards
#3849
ServiceGrafana workflow-infraTriage
Review grafana monitoring and alerting rules.
#2971
ServiceGrafana workflow-infraTriage

Service::Kube

Issues: 3 ServiceKube

Topic Team Service::Kube Board Workflow Status
Create process to periodically review nodepool instance families in kubernetes
#4173
ServiceKube workflow-infraProposal
Monitor kubernetes node CPU wait / noisy neighbour
#4172
ServiceKube workflow-infraProposal
Corrective action: The cluster_scaleups SLI of the kube service (main stage) has an error rate violating SLO
#3256
ServiceKube workflow-infraTriage

Service::Logging

Issues: 15 ServiceLogging

Topic Team Service::Logging Board Workflow Status
Update runbooks and docs-hub logging documentation
#4217
ServiceLogging workflow-infraProposal
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4128
ServiceLogging workflow-infraStalled
Audit GCP Cloud Logs usage
#4118
ServiceLogging workflow-infraTriage
Decommission Loki
#4116
ServiceLogging workflow-infraTriage
Ingest sampled logs for some percentage of gitlab rails sql queries
#4107
ServiceLogging workflow-infraTriage
Differentiate disk space in Elastic by data tier
#4066
ServiceLogging workflow-infraTriage
How should we handle large json.message fields in Elastic.
#4052
ServiceLogging workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3880
ServiceLogging workflow-infraTriage
fluentbit requesting a large percentage of cpu resources on k8s nodes
#3836
ServiceLogging workflow-infraTriage
clean up duplicate logs from our GCS logging archive
#3019
ServiceLogging workflow-infraReady
Improve the pubsubbeat deployment
#3255
ServiceLogging workflow-infraTriage
Push Elasticsearch ILM policies and index templates on a schedule
#3292
ServiceLogging workflow-infraTriage
Add Kibana fields for Postgres autovacuum auto-analyze log messages
#3232
ServiceLogging workflow-infraTriage
investigate missing stacks in flamegraphs generation in es-diagnostics
#3288
ServiceLogging workflow-infraTriage
Find what is putting extra lines in prometheus logs
#3296
ServiceLogging workflow-infraTriage

Service::Mimir

Issues: 10 ServiceMimir

Topic Team Service::Mimir Board Workflow Status
Implement aggregation for metrics with endpoint_id
#4139
ServiceMimir workflow-infraProposal
Increase in Mimir store getRange latencies since upgrade to 2.15.0
#4124
ServiceMimir workflow-infraStalled
Observability improvements for Mimir
#3891
ServiceMimir workflow-infraTriage
Validating alert recording rules on live metrics
#3853
ServiceMimir workflow-infraTriage
Create a testing framework for recording- and alerting rules
#3851
ServiceMimir workflow-infraTriage
Add Metric Management information to Monitoring section of handbook
#3704
ServiceMimir workflow-infraTriage
Request to update prometheus blackbox config for handbook website
#3675
ServiceMimir workflow-infraTriage
Rename GCP bucket thanos-periodic-queries to periodic-queries
#3519
ServiceMimir workflow-infraTriage
Move periodic queries execution from ops to GitLab.com
#3512
ServiceMimir workflow-infraReady
Combine enqueued_jobs and sidekiq_queueing SLI in Sidekiq
#3488
ServiceMimir boardplanning workflow-infraTriage

Service::Monitoring-Other

Issues: 4 ServiceMonitoring-Other

Topic Team Service::Monitoring-Other Board Workflow Status
Provide capability to backtest new alert definitions
#4143
ServiceMonitoring-Other workflow-infraTriage
Add apdex and error metrics to the git/gitlab_shell SLI
#3239
ServiceMonitoring-Other workflow-infraTriage
Review kubernetes container resource saturation monitoring
#3076
ServiceMonitoring-Other workflow-infraReady
Reduce GitLab's histograms to 3-5 buckets for most histograms
#476
ServiceMonitoring-Other workflow-infraTriage

Service::Oncall-Tooling

Issues: 3 ServiceOncall-Tooling

Topic Team Service::Oncall-Tooling Board Workflow Status
Severity labels not being applied consistently to incident issues
#3236
ServiceOncall-Tooling workflow-infraReady
Create SLI / SLO for the autocomplete endpoints in the runbooks
#3280
ServiceOncall-Tooling workflow-infraTriage
Rename prometheus missing from cluster notifications to be more helpful
#3241
ServiceOncall-Tooling workflow-infraTriage

Service::PVS

Issues: 1 ServicePVS

Topic Team Service::PVS Board Workflow Status
Fix the PvsServiceHttpApdexSLOViolation source metric
#2968
ServicePVS workflow-infraTriage

Service::Pages

Issues: 2 ServicePages

Topic Team Service::Pages Board Workflow Status
Investigate alerting thresholds for WebPagesServiceWebPagesServerApdexSLOViolationRegional
#3503
ServicePages workflow-infraTriage
Corrective Action: Implement Transactional Monitoring for Pages service
#3502
ServicePages workflow-infraTriage

Service::Patroni

Issues: 1 ServicePatroni

Topic Team Service::Patroni Board Workflow Status
Monitor Postgres TOAST oid exhaustion
#3180
ServicePatroni boardplanning workflow-infraNeeds More Info

Service::Pgbouncer

Issues: 1 ServicePgbouncer

Topic Team Service::Pgbouncer Board Workflow Status
Review capacity planning for pgbouncer async primary pool
#3917
ServicePgbouncer workflow-infraTriage

Service::Postgres

Issues: 1 ServicePostgres

Topic Team Service::Postgres Board Workflow Status
Patroni main, data growth drill-down
#3784
ServicePostgres boardplanning workflow-infraTriage

Service::Prometheus

Issues: 6 ServicePrometheus

Topic Team Service::Prometheus Board Workflow Status
Test Prometheus 3.0
#4000
ServicePrometheus workflow-infraTriage
Corrective action: Update runbook for prometheus increase storage
#2967
ServicePrometheus workflow-infraTriage
Create a more general purpose stackdriver-exporter for teams
#2997
ServicePrometheus workflow-infraReady
Deploy Prometheus Rules and Alertmanager from gitlab-helmfiles instead of runbooks
#3267
ServicePrometheus workflow-infraTriage
Add monitoring for OAuth2 login endpoints
#3228
ServicePrometheus workflow-infraTriage
Implement memory and CPU limits to the Prometheus processes in VMs
#3273
ServicePrometheus workflow-infraTriage

Service::Redis

Issues: 2 ServiceRedis

Topic Team Service::Redis Board Workflow Status
Mirror process-exporter image to be resilient to docker registry failure
#1709
ServiceRedis workflow-infraTriage
Evaluate porting scheduled CPU profiles for redis observability on Kubernetes
#1633
ServiceRedis workflow-infraTriage

Service::Runbooks

Issues: 2 ServiceRunbooks

Topic Team Service::Runbooks Board Workflow Status
Report availability per service and overall GitLab availability.
#4082
ServiceRunbooks workflow-infraTriage
Fix update feature categories script on runbooks
#3961
ServiceRunbooks workflow-infraTriage

Service::Sentry

Issues: 3 ServiceSentry

Topic Team Service::Sentry Board Workflow Status
Fix gap in sentry monitoring
#4224
ServiceSentry workflow-infraTriage
Sentry is not processing events
#4223
ServiceSentry workflow-infraTriage
Corrective Action: relieve memory pressure issues with Sentry's kafka
#4026
ServiceSentry workflow-infraTriage

Service::Sidekiq

Issues: 2 ServiceSidekiq

Topic Team Service::Sidekiq Board Workflow Status
Continuous profiling for Ruby Projects
#3827
ServiceSidekiq workflow-infraTriage
Discuss removal of histogram metrics on Sidekiq for self-managed
#2474
ServiceSidekiq workflow-infraProposal

Service::Thanos

Issues: 1 ServiceThanos

Topic Team Service::Thanos Board Workflow Status
Remove remaining Thanos components
#4008
ServiceThanos workflow-infraProposal

Service::Unknown

Issues: 5 ServiceUnknown

Topic Team Service::Unknown Board Workflow Status
Make the pager stop melting if the world is on fire.
#4222
ServiceUnknown workflow-infraTriage
tenant-observability-stack: Add support for node selector in tenant-observability-config-manager job
#4158
ServiceUnknown workflow-infraTriage
tenant-observability-stack: Make images configurable for ARM support
#4157
ServiceUnknown workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4151
ServiceUnknown workflow-infraStalled
Creating test dashboard does not work properly
#4144
ServiceUnknown workflow-infraTriage

Service::Web

Issues: 3 ServiceWeb

Topic Team Service::Web Board Workflow Status
Web pods are being throttled
#4205
ServiceWeb workflow-infraTriage
Add job to update feature categories to the rails app
#3963
ServiceWeb workflow-infraTriage
Alert for fatal errors [Corrective action] Mixed deployment issues with WebAuthn logins
#3259
ServiceWeb workflow-infraTriage

Other

Topic Team Board Workflow Status
Can we make the triage dashboards useful?
#4221
Introduce floor threshold into our Capacity Planning process to improve financial efficiency
#4108
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4104
Implement a feature registry
#4099
Warn of imminent saturation in a more urgent way.
#4054
Dimension lookup during reporting and issue management
#4033
Discussion: Observability Service topology for metrics in cells
#4029
patroni.disk_sustained_write_iops and patroni.disk_sustained_read_iops missing graphs
#4032
Error Budgets should be based on full calendar month
#3979
Stewardship for common-ci-tasks and related projects
#3948
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3840
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3835
Early Feedback Cost Dashboards
#3794
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3729
Tamland runner shard saturates on CPU
#3712
Synthetic Monitoring / Testing
#3637
Move product error budget dashboards out of the current folder
#3618
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3441
boardplanning
Use expanded labels recording rule for alerting dashboards
#3426
boardplanning
Observability Feedback from Engineering Productivity Pulse Survey - FY25Q1
#2953
Turn the get-hybrid monitoring config into a monitoring mixin
#2832
boardplanning
Labkit as the in-application platform toolkit
#2793
Introduce open_fds saturation point for process_exporter
#2778
Rename SLOs we use in saturation points
#2168
Remove custom feature category recordings for the puma component
#1481
Edited by service-epic-status-automation