Summary of issues not in Epics (autogenerated)
Summary of issues that are not in an Epic
Total Issues: 131
Team Tasks
Issues: 14 team-tasks
Topic | Team | Service | Board | Workflow Status | Due Date |
---|---|---|---|---|---|
Keeping documentation up to date #4016 | | | | workflow-infraProposal | |
Document functionality to create GitLab issues from prometheus alerts #4013 | | | boardplanning | workflow-infraTriage | |
Dreaming of 2025: Observability Wish List! #4012 | | | | workflow-infraStalled | |
Metrics data for consumption and analysis #3967 | | | boardplanning | workflow-infraTriage | |
Add script to update feature categories to stage-groups-index #3962 | | | | workflow-infraTriage | |
Remove promtool from the runbooks image #3960 | | | | workflow-infraTriage | |
Route saturation alerts to service owners #3941 | | | | workflow-infraTriage | |
Metric / o11y on inactive sidekiq threads #3808 | | | | workflow-infraStalled | |
jsonnet-tool should pass along JSONNET_PATH #3678 | | | | workflow-infraTriage | |
Link the unwinded source metrics of an alert in an alert message #3662 | | | | workflow-infraTriage | |
Commit and push feature categories should alert on failure #3616 | | | | workflow-infraTriage | |
Tamland documentation: a day-in-the-life of Capacity planning #3578 | | | boardbuild | workflow-infraReady | |
Automate creation of MR to update feature categories #3400 | | | | workflow-infraStalled | |
Summary of issues not in Epics (autogenerated) #538 | | | | | |
Service::AIGateway
Issues: 2 ServiceAIGateway
Topic | Team | Service::AIGateway | Board | Workflow Status |
---|---|---|---|---|
Analysis of frequency and duration of specific AI Gateway errors with code 429 #4210 | | ServiceAIGateway | | workflow-infraTriage |
Monitor for high percentage of non-200 requests to the AI gateway #3600 | | ServiceAIGateway | | workflow-infraTriage |
Service::AlertManager
Issues: 2 ServiceAlertManager
Topic | Team | Service::AlertManager | Board | Workflow Status |
---|---|---|---|---|
Corrective action: Workhorse and Load Balancer SLI interdependency for alerts #2955 | | ServiceAlertManager | | workflow-infraTriage |
Traffic absent alerts causing pager noise #3276 | | ServiceAlertManager | | workflow-infraProposal |
Service::ClickHouseCloud
Issues: 1 ServiceClickHouseCloud
Topic | Team | Service::ClickHouseCloud | Board | Workflow Status |
---|---|---|---|---|
Help setup Clickhouse Rails logs #2982 | | ServiceClickHouseCloud | | workflow-infraStalled |
Service::Container Registry
Issues: 1 ServiceContainer Registry
Topic | Team | Service::Container Registry | Board | Workflow Status |
---|---|---|---|---|
Include a link to a specific kibana error log search in the alert definition for the garbage collection component of the container registry service #3293 | | ServiceContainer Registry | | workflow-infraTriage |
Service::Database
Issues: 1 ServiceDatabase
Topic | Team | Service::Database | Board | Workflow Status |
---|---|---|---|---|
Stage group index is broken again. #4043 | | ServiceDatabase | | workflow-infraStalled |
Service::Elasticsearch
Issues: 4 ServiceElasticsearch
Topic | Team | Service::Elasticsearch | Board | Workflow Status |
---|---|---|---|---|
Create test deployment for gitlab-docs-website for Localization team #4141 | | ServiceElasticsearch | | workflow-infraTriage |
Move logging and elasticsearch alerts to #g_infra_observability_alerts. #2941 | | ServiceElasticsearch | | workflow-infraStalled |
Migrate elasticsearch configuration from custom scripts to terraform #2966 | | ServiceElasticsearch | | workflow-infraBlocked |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3027 | | ServiceElasticsearch | | workflow-infraReady |
Service::GCP
Issues: 2 ServiceGCP
Topic | Team | Service::GCP | Board | Workflow Status |
---|---|---|---|---|
Review new Google Cloud Logging regional ingestion quotas #4056 | | ServiceGCP | | workflow-infraTriage |
Implement GCP scheduled snapshots health check #3257 | | ServiceGCP | | workflow-infraTriage |
Service::GitLab Rails
Issues: 1 ServiceGitLab Rails
Topic | Team | Service::GitLab Rails | Board | Workflow Status |
---|---|---|---|---|
Review Request: maven virtual registry, multiple usptreams support #4096 | teamScalability | ServiceGitLab Rails | boardbuild | workflow-infraReady |
Service::Gitaly
Issues: 1 ServiceGitaly
Topic | Team | Service::Gitaly | Board | Workflow Status |
---|---|---|---|---|
Corrective action: alert on GitLab pipeline failures due to load. #3245 | | ServiceGitaly | | workflow-infraTriage |
Service::Grafana
Issues: 12 ServiceGrafana
Topic | Team | Service::Grafana | Board | Workflow Status |
---|---|---|---|---|
Adding a grafana datasource to through configuration fails on secrets #4214 | | ServiceGrafana | | workflow-infraTriage |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4193 | | ServiceGrafana | | workflow-infraTriage |
SLI detail panels should apply the same selectors as the SLI itself does #4189 | | ServiceGrafana | | workflow-infraTriage |
Replace redis-sidekiq shard template to use the generic shard template #4171 | | ServiceGrafana | | workflow-infraTriage |
Make the service overview show the SLI for each shard in a different colour #4169 | | ServiceGrafana | | workflow-infraTriage |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4079 | | ServiceGrafana | | workflow-infraBacklog |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4044 | | ServiceGrafana | | workflow-infraTriage |
Migrate Grafana to Okta #4003 | | ServiceGrafana | | workflow-infraProposal |
Webservice dashboard link to kibana slow rails requests broken #3855 | | ServiceGrafana | | workflow-infraTriage |
Escaping of promql queries in alertmanager Slack alerts broken #3854 | | ServiceGrafana | | workflow-infraTriage |
Streamline latency attribution via service dashboards #3849 | | ServiceGrafana | | workflow-infraTriage |
Review grafana monitoring and alerting rules. #2971 | | ServiceGrafana | | workflow-infraTriage |
Service::Kube
Issues: 3 ServiceKube
Topic | Team | Service::Kube | Board | Workflow Status |
---|---|---|---|---|
Create process to periodically review nodepool instance families in kubernetes #4173 | | ServiceKube | | workflow-infraProposal |
Monitor kubernetes node CPU wait / noisy neighbour #4172 | | ServiceKube | | workflow-infraProposal |
Corrective action: The cluster_scaleups SLI of the kube service (main stage) has an error rate violating SLO #3256 | | ServiceKube | | workflow-infraTriage |
Service::Logging
Issues: 15 ServiceLogging
Topic | Team | Service::Logging | Board | Workflow Status |
---|---|---|---|---|
Update runbooks and docs-hub logging documentation #4217 | | ServiceLogging | | workflow-infraProposal |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4128 | | ServiceLogging | | workflow-infraStalled |
Audit GCP Cloud Logs usage #4118 | | ServiceLogging | | workflow-infraTriage |
Decommission Loki #4116 | | ServiceLogging | | workflow-infraTriage |
Ingest sampled logs for some percentage of gitlab rails sql queries #4107 | | ServiceLogging | | workflow-infraTriage |
Differentiate disk space in Elastic by data tier #4066 | | ServiceLogging | | workflow-infraTriage |
How should we handle large json.message fields in Elastic. #4052 | | ServiceLogging | | workflow-infraTriage |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3880 | | ServiceLogging | | workflow-infraTriage |
fluentbit requesting a large percentage of cpu resources on k8s nodes #3836 | | ServiceLogging | | workflow-infraTriage |
clean up duplicate logs from our GCS logging archive #3019 | | ServiceLogging | | workflow-infraReady |
Improve the pubsubbeat deployment #3255 | | ServiceLogging | | workflow-infraTriage |
Push Elasticsearch ILM policies and index templates on a schedule #3292 | | ServiceLogging | | workflow-infraTriage |
Add Kibana fields for Postgres autovacuum auto-analyze log messages #3232 | | ServiceLogging | | workflow-infraTriage |
investigate mising stacks in flamegraphs generation in es-diagnostics #3288 | | ServiceLogging | | workflow-infraTriage |
Find what is putting extra lines in prometheus logs #3296 | | ServiceLogging | | workflow-infraTriage |
Service::Mimir
Issues: 10 ServiceMimir
Topic | Team | Service::Mimir | Board | Workflow Status |
---|---|---|---|---|
Implement aggregation for metrics with endpoint_id #4139 | | ServiceMimir | | workflow-infraProposal |
Increase in Mimir store getRange latencies since upgrade to 2.15.0 #4124 | | ServiceMimir | | workflow-infraStalled |
Observability improvements for Mimir #3891 | | ServiceMimir | | workflow-infraTriage |
Validating alert recording rules on live metrics #3853 | | ServiceMimir | | workflow-infraTriage |
Create a testing framework for recording- and alerting rules #3851 | | ServiceMimir | | workflow-infraTriage |
Add Metric Management information to Monitoring section of handbook #3704 | | ServiceMimir | | workflow-infraTriage |
Request to update prometheus blackbox config for handbook website #3675 | | ServiceMimir | | workflow-infraTriage |
Rename GCP bucket thanos-periodic-queries to periodic-queries #3519 | | ServiceMimir | | workflow-infraTriage |
Move periodic queries execution from ops to GitLab.com #3512 | | ServiceMimir | | workflow-infraReady |
Combine enqueued_jobs and sidekiq_queueing SLI in Sidekiq #3488 | | ServiceMimir | boardplanning | workflow-infraTriage |
Service::Monitoring-Other
Issues: 4 ServiceMonitoring-Other
Topic | Team | Service::Monitoring-Other | Board | Workflow Status |
---|---|---|---|---|
Provide capability to backtest new alert definitions #4143 | | ServiceMonitoring-Other | | workflow-infraTriage |
Add apdex and error metrics to the git/gitlab_shell SLI #3239 | | ServiceMonitoring-Other | | workflow-infraTriage |
Review kubernetes container resource saturation monitoring #3076 | | ServiceMonitoring-Other | | workflow-infraReady |
Reduce GitLab's histograms to 3-5 buckets for most histograms #476 | | ServiceMonitoring-Other | | workflow-infraTriage |
Service::Oncall-Tooling
Issues: 3 ServiceOncall-Tooling
Topic | Team | Service::Oncall-Tooling | Board | Workflow Status |
---|---|---|---|---|
Severity labels not being applied consistently to incident issues #3236 | | ServiceOncall-Tooling | | workflow-infraReady |
Create SLI / SLO for the autocomplete endpoints in the runbooks #3280 | | ServiceOncall-Tooling | | workflow-infraTriage |
Rename prometheus missing from cluster notifications to be more helpful #3241 | | ServiceOncall-Tooling | | workflow-infraTriage |
Service::PVS
Issues: 1 ServicePVS
Topic | Team | Service::PVS | Board | Workflow Status |
---|---|---|---|---|
Fix the PvsServiceHttpApdexSLOViolation source metric #2968 | | ServicePVS | | workflow-infraTriage |
Service::Pages
Issues: 2 ServicePages
Topic | Team | Service::Pages | Board | Workflow Status |
---|---|---|---|---|
Investigate alerting thresholds for WebPagesServiceWebPagesServerApdexSLOViolationRegional #3503 | | ServicePages | | workflow-infraTriage |
Corrective Action: Implement Transactional Monitoring for Pages service #3502 | | ServicePages | | workflow-infraTriage |
Service::Patroni
Issues: 1 ServicePatroni
Topic | Team | Service::Patroni | Board | Workflow Status |
---|---|---|---|---|
Monitor Postgres TOAST oid exhaustion #3180 | | ServicePatroni | boardplanning | workflow-infraNeeds More Info |
Service::Pgbouncer
Issues: 1 ServicePgbouncer
Topic | Team | Service::Pgbouncer | Board | Workflow Status |
---|---|---|---|---|
Review capacity planning for pgbouncer async primary pool #3917 | | ServicePgbouncer | | workflow-infraTriage |
Service::Postgres
Issues: 1 ServicePostgres
Topic | Team | Service::Postgres | Board | Workflow Status |
---|---|---|---|---|
Patroni main, data growth drill-down #3784 | | ServicePostgres | boardplanning | workflow-infraTriage |
Service::Prometheus
Issues: 6 ServicePrometheus
Topic | Team | Service::Prometheus | Board | Workflow Status |
---|---|---|---|---|
Test Prometheus 3.0 #4000 | | ServicePrometheus | | workflow-infraTriage |
Corrective action: Update runbook for prometheus increase storage #2967 | | ServicePrometheus | | workflow-infraTriage |
Create a more general purpose stackdriver-exporter for teams #2997 | | ServicePrometheus | | workflow-infraReady |
Deploy Prometheus Rules and Alertmanager from gitlab-helmfiles instead of runbooks #3267 | | ServicePrometheus | | workflow-infraTriage |
Add monitoring for OAuth2 login endpoints #3228 | | ServicePrometheus | | workflow-infraTriage |
Implement memory and CPU limits to the Prometheus processes in VMs #3273 | | ServicePrometheus | | workflow-infraTriage |
Service::Redis
Issues: 2 ServiceRedis
Topic | Team | Service::Redis | Board | Workflow Status |
---|---|---|---|---|
Mirror process-exporter image to be resilient to docker registry failure #1709 | | ServiceRedis | | workflow-infraTriage |
Evaluate porting scheduled CPU profiles for redis observability on Kubernetes #1633 | | ServiceRedis | | workflow-infraTriage |
Service::Runbooks
Issues: 2 ServiceRunbooks
Topic | Team | Service::Runbooks | Board | Workflow Status |
---|---|---|---|---|
Report availability per service and overall GitLab availability. #4082 | | ServiceRunbooks | | workflow-infraTriage |
Fix update feature categories script on runbooks #3961 | | ServiceRunbooks | | workflow-infraTriage |
Service::Sentry
Issues: 3 ServiceSentry
Topic | Team | Service::Sentry | Board | Workflow Status |
---|---|---|---|---|
Fix gap in sentry monitoring #4224 | | ServiceSentry | | workflow-infraTriage |
Sentry is not processing events #4223 | | ServiceSentry | | workflow-infraTriage |
Corrective Action: relieve memory pressure issues with Sentry's kafka #4026 | | ServiceSentry | | workflow-infraTriage |
Service::Sidekiq
Issues: 2 ServiceSidekiq
Topic | Team | Service::Sidekiq | Board | Workflow Status |
---|---|---|---|---|
Continuous profiling for Ruby Projects #3827 | | ServiceSidekiq | | workflow-infraTriage |
Discuss removal of histogram metrics on Sidekiq for self-managed #2474 | | ServiceSidekiq | | workflow-infraProposal |
Service::Thanos
Issues: 1 ServiceThanos
Topic | Team | Service::Thanos | Board | Workflow Status |
---|---|---|---|---|
Remove remaining Thanos components #4008 | | ServiceThanos | | workflow-infraProposal |
Service::Unknown
Issues: 5 ServiceUnknown
Topic | Team | Service::Unknown | Board | Workflow Status |
---|---|---|---|---|
Make the pager stop melting if the world is on fire. #4222 | | ServiceUnknown | | workflow-infraTriage |
tenant-observability-stack: Add support for node selector in tenant-observability-config-manager job #4158 | | ServiceUnknown | | workflow-infraTriage |
tenant-observability-stack: Make images configurable for ARM support #4157 | | ServiceUnknown | | workflow-infraTriage |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4151 | | ServiceUnknown | | workflow-infraStalled |
Creating test dashboard do not properly work #4144 | | ServiceUnknown | | workflow-infraTriage |
Service::Web
Issues: 3 ServiceWeb
Topic | Team | Service::Web | Board | Workflow Status |
---|---|---|---|---|
Web pods are being throttled #4205 | | ServiceWeb | | workflow-infraTriage |
Add job to update feature categories to the rails app #3963 | | ServiceWeb | | workflow-infraTriage |
Alert for fatal errors [Corrective action] Mixed deployment issues with WebAuthn logins #3259 | | ServiceWeb | | workflow-infraTriage |
Other
Topic | Team | Board | Workflow Status |
---|---|---|---|
Can we make the triage dashboards useful? #4221 | | | |
Introduce floor threshold into our Capacity Planning process to improve financial efficiency #4108 | | | |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4104 | | | |
Implement a feature registry #4099 | | | |
Warn of imminent saturation in a more urgent way. #4054 | | | |
Dimension lookup during reporting and issue management #4033 | | | |
Discussion: Observability Service topology for metrics in cells #4029 | | | |
patroni.disk_sustained_write_iops and patroni.disk_sustained_read_iops missing graphs #4032 | | | |
Error Budgets should be based on full calendar month #3979 | | | |
Stewardship for common-ci-tasks and related projects #3948 | | | |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3840 | | | |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3835 | | | |
Early Feedback Cost Dashboards #3794 | | | |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3729 | | | |
Tamland runner shard saturates on CPU #3712 | | | |
Synthetic Monitoring / Testing #3637 | | | |
Move product error budget dashboards out of the current folder #3618 | | | |
Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3441 | | boardplanning | |
Use expanded labels recording rule for alerting dashboards #3426 | | boardplanning | |
Observability Feedback from Engineering Productivity Pulse Survey - FY25Q1 #2953 | | | |
Turn the get-hybrid monitoring config into a monitoring mixin #2832 | | boardplanning | |
Labkit as the in-application platform toolkit #2793 | | | |
Introduce open_fds saturation point for process_exporter #2778 | | | |
Rename SLOs we use in saturation points #2168 | | | |
Remove custom feature category recordings for the puma component #1481 | | | |