Summary of issues not in Epics (autogenerated)
Summary of issues that are not in an Epic
Total Issues: 129
Team Tasks
Issues: 14 team-tasks
| Topic | Team | Service | Board | Workflow Status | Due Date |
|---|---|---|---|---|---|
| Keeping documentation up to date #4016 | | | | workflow-infraProposal | |
| Document functionality to create GitLab issues from prometheus alerts #4013 | | | boardplanning | workflow-infraTriage | |
| Dreaming of 2025: Observability Wish List! #4012 | | | | workflow-infraStalled | |
| Metrics data for consumption and analysis #3967 | | | boardplanning | workflow-infraTriage | |
| Add script to update feature categories to stage-groups-index #3962 | | | | workflow-infraTriage | |
| Remove promtool from the runbooks image #3960 | | | | workflow-infraTriage | |
| Route saturation alerts to service owners #3941 | | | | workflow-infraTriage | |
| Metric / o11y on inactive sidekiq threads #3808 | | | | workflow-infraStalled | |
| jsonnet-tool should pass along JSONNET_PATH #3678 | | | | workflow-infraTriage | |
| Link the unwinded source metrics of an alert in an alert message #3662 | | | | workflow-infraTriage | |
| Commit and push feature categories should alert on failure #3616 | | | | workflow-infraTriage | |
| Tamland documentation: a day-in-the-life of Capacity planning #3578 | | | boardbuild | workflow-infraReady | |
| Automate creation of MR to update feature categories #3400 | | | | workflow-infraStalled | |
| Summary of issues not in Epics (autogenerated) #538 | | | | | |
Service::AIGateway
Issues: 2 ServiceAIGateway
| Topic | Team | Service::AIGateway | Board | Workflow Status |
|---|---|---|---|---|
| Analysis of frequency and duration of specific AI Gateway errors with code 429 #4210 | | ServiceAIGateway | | workflow-infraTriage |
| Monitor for high percentage of non-200 requests to the AI gateway #3600 | | ServiceAIGateway | | workflow-infraTriage |
Service::AlertManager
Issues: 2 ServiceAlertManager
| Topic | Team | Service::AlertManager | Board | Workflow Status |
|---|---|---|---|---|
| Corrective action: Workhorse and Load Balancer SLI interdependency for alerts #2955 | | ServiceAlertManager | | workflow-infraTriage |
| Traffic absent alerts causing pager noise #3276 | | ServiceAlertManager | | workflow-infraProposal |
Service::ClickHouseCloud
Issues: 1 ServiceClickHouseCloud
| Topic | Team | Service::ClickHouseCloud | Board | Workflow Status |
|---|---|---|---|---|
| Help setup Clickhouse Rails logs #2982 | | ServiceClickHouseCloud | | workflow-infraStalled |
Service::Container Registry
Issues: 1 ServiceContainer Registry
| Topic | Team | Service::Container Registry | Board | Workflow Status |
|---|---|---|---|---|
| Include a link to a specific kibana error log search in the alert definition for the garbage collection component of the container registry service #3293 | | ServiceContainer Registry | | workflow-infraTriage |
Service::Database
Issues: 1 ServiceDatabase
| Topic | Team | Service::Database | Board | Workflow Status |
|---|---|---|---|---|
| Stage group index is broken again. #4043 | | ServiceDatabase | | workflow-infraStalled |
Service::Elasticsearch
Issues: 5 ServiceElasticsearch
| Topic | Team | Service::Elasticsearch | Board | Workflow Status |
|---|---|---|---|---|
| Create Elastic Cloud Serverless Project with Elasticsearch project gitlab-docs-website for TW team #4370 | | ServiceElasticsearch | | |
| Create test deployment for gitlab-docs-website for Localization team #4141 | | ServiceElasticsearch | | workflow-infraTriage |
| Move logging and elasticsearch alerts to #g_infra_observability_alerts. #2941 | | ServiceElasticsearch | | workflow-infraStalled |
| Migrate elasticsearch configuration from custom scripts to terraform #2966 | | ServiceElasticsearch | | workflow-infraBlocked |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3027 | | ServiceElasticsearch | | workflow-infraReady |
Service::GCP
Issues: 2 ServiceGCP
| Topic | Team | Service::GCP | Board | Workflow Status |
|---|---|---|---|---|
| Review new Google Cloud Logging regional ingestion quotas #4056 | | ServiceGCP | | workflow-infraTriage |
| Implement GCP scheduled snapshots health check #3257 | | ServiceGCP | | workflow-infraTriage |
Service::GitLab Rails
Issues: 1 ServiceGitLab Rails
| Topic | Team | Service::GitLab Rails | Board | Workflow Status |
|---|---|---|---|---|
| Review Request: maven virtual registry, multiple usptreams support #4096 | teamScalability | ServiceGitLab Rails | boardbuild | workflow-infraReady |
Service::Gitaly
Issues: 1 ServiceGitaly
| Topic | Team | Service::Gitaly | Board | Workflow Status |
|---|---|---|---|---|
| Corrective action: alert on GitLab pipeline failures due to load. #3245 | | ServiceGitaly | | workflow-infraTriage |
Service::Grafana
Issues: 12 ServiceGrafana
| Topic | Team | Service::Grafana | Board | Workflow Status |
|---|---|---|---|---|
| Adding a grafana datasource to through configuration fails on secrets #4214 | | ServiceGrafana | | workflow-infraTriage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4193 | | ServiceGrafana | | workflow-infraTriage |
| SLI detail panels should apply the same selectors as the SLI itself does #4189 | | ServiceGrafana | | workflow-infraTriage |
| Replace redis-sidekiq shard template to use the generic shard template #4171 | | ServiceGrafana | | workflow-infraTriage |
| Make the service overview show the SLI for each shard in a different colour #4169 | | ServiceGrafana | | workflow-infraTriage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4079 | | ServiceGrafana | | workflow-infraBacklog |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4044 | | ServiceGrafana | | workflow-infraTriage |
| Migrate Grafana to Okta #4003 | | ServiceGrafana | | workflow-infraProposal |
| Webservice dashboard link to kibana slow rails requests broken #3855 | | ServiceGrafana | | workflow-infraTriage |
| Escaping of promql queries in alertmanager Slack alerts broken #3854 | | ServiceGrafana | | workflow-infraTriage |
| Streamline latency attribution via service dashboards #3849 | | ServiceGrafana | | workflow-infraTriage |
| Review grafana monitoring and alerting rules. #2971 | | ServiceGrafana | | workflow-infraTriage |
Service::Kube
Issues: 3 ServiceKube
| Topic | Team | Service::Kube | Board | Workflow Status |
|---|---|---|---|---|
| Create process to periodically review nodepool instance families in kubernetes #4173 | | ServiceKube | | workflow-infraProposal |
| Monitor kubernetes node CPU wait / noisy neighbour #4172 | | ServiceKube | | workflow-infraProposal |
| Corrective action: The cluster_scaleups SLI of the kube service (main stage) has an error rate violating SLO #3256 | | ServiceKube | | workflow-infraTriage |
Service::Logging
Issues: 13 ServiceLogging
| Topic | Team | Service::Logging | Board | Workflow Status |
|---|---|---|---|---|
| Update runbooks and docs-hub logging documentation #4217 | | ServiceLogging | | workflow-infraProposal |
| Audit GCP Cloud Logs usage #4118 | | ServiceLogging | | workflow-infraTriage |
| Decommission Loki #4116 | | ServiceLogging | | workflow-infraTriage |
| Ingest sampled logs for some percentage of gitlab rails sql queries #4107 | | ServiceLogging | | workflow-infraTriage |
| Differentiate disk space in Elastic by data tier #4066 | | ServiceLogging | | workflow-infraTriage |
| How should we handle large json.message fields in Elastic. #4052 | | ServiceLogging | | workflow-infraTriage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3880 | | ServiceLogging | | workflow-infraTriage |
| fluentbit requesting a large percentage of cpu resources on k8s nodes #3836 | | ServiceLogging | | workflow-infraTriage |
| Improve the pubsubbeat deployment #3255 | | ServiceLogging | | workflow-infraTriage |
| Push Elasticsearch ILM policies and index templates on a schedule #3292 | | ServiceLogging | | workflow-infraTriage |
| Add Kibana fields for Postgres autovacuum auto-analyze log messages #3232 | | ServiceLogging | | workflow-infraTriage |
| investigate mising stacks in flamegraphs generation in es-diagnostics #3288 | | ServiceLogging | | workflow-infraTriage |
| Find what is putting extra lines in prometheus logs #3296 | | ServiceLogging | | workflow-infraTriage |
Service::Mimir
Issues: 10 ServiceMimir
| Topic | Team | Service::Mimir | Board | Workflow Status |
|---|---|---|---|---|
| Implement aggregation for metrics with endpoint_id #4139 | | ServiceMimir | | workflow-infraProposal |
| Increase in Mimir store getRange latencies since upgrade to 2.15.0 #4124 | | ServiceMimir | | workflow-infraStalled |
| Observability improvements for Mimir #3891 | | ServiceMimir | | workflow-infraTriage |
| Validating alert recording rules on live metrics #3853 | | ServiceMimir | | workflow-infraTriage |
| Create a testing framework for recording- and alerting rules #3851 | | ServiceMimir | | workflow-infraTriage |
| Add Metric Management information to Monitoring section of handbook #3704 | | ServiceMimir | | workflow-infraTriage |
| Request to update prometheus blackbox config for handbook website #3675 | | ServiceMimir | | workflow-infraTriage |
| Rename GCP bucket thanos-periodic-queries to periodic-queries #3519 | | ServiceMimir | | workflow-infraTriage |
| Move periodic queries execution from ops to GitLab.com #3512 | | ServiceMimir | | workflow-infraReady |
| Combine enqueued_jobs and sidekiq_queueing SLI in Sidekiq #3488 | | ServiceMimir | boardplanning | workflow-infraTriage |
Service::Monitoring-Other
Issues: 4 ServiceMonitoring-Other
| Topic | Team | Service::Monitoring-Other | Board | Workflow Status |
|---|---|---|---|---|
| Provide capability to backtest new alert definitions #4143 | | ServiceMonitoring-Other | | workflow-infraTriage |
| Add apdex and error metrics to the git/gitlab_shell SLI #3239 | | ServiceMonitoring-Other | | workflow-infraTriage |
| Review kubernetes container resource saturation monitoring #3076 | | ServiceMonitoring-Other | | workflow-infraReady |
| Reduce GitLab's histograms to 3-5 buckets for most histograms #476 | | ServiceMonitoring-Other | | workflow-infraTriage |
Service::Oncall-Tooling
Issues: 3 ServiceOncall-Tooling
| Topic | Team | Service::Oncall-Tooling | Board | Workflow Status |
|---|---|---|---|---|
| Severity labels not being applied consistently to incident issues #3236 | | ServiceOncall-Tooling | | workflow-infraReady |
| Create SLI / SLO for the autocomplete endpoints in the runbooks #3280 | | ServiceOncall-Tooling | | workflow-infraTriage |
| Rename prometheus missing from cluster notifications to be more helpful #3241 | | ServiceOncall-Tooling | | workflow-infraTriage |
Service::Pages
Issues: 2 ServicePages
| Topic | Team | Service::Pages | Board | Workflow Status |
|---|---|---|---|---|
| Investigate alerting thresholds for WebPagesServiceWebPagesServerApdexSLOViolationRegional #3503 | | ServicePages | | workflow-infraTriage |
| Corrective Action: Implement Transactional Monitoring for Pages service #3502 | | ServicePages | | workflow-infraTriage |
Service::Patroni
Issues: 1 ServicePatroni
| Topic | Team | Service::Patroni | Board | Workflow Status |
|---|---|---|---|---|
| Monitor Postgres TOAST oid exhaustion #3180 | | ServicePatroni | boardplanning | workflow-infraNeeds More Info |
Service::Pgbouncer
Issues: 1 ServicePgbouncer
| Topic | Team | Service::Pgbouncer | Board | Workflow Status |
|---|---|---|---|---|
| Review capacity planning for pgbouncer async primary pool #3917 | | ServicePgbouncer | | workflow-infraTriage |
Service::Postgres
Issues: 1 ServicePostgres
| Topic | Team | Service::Postgres | Board | Workflow Status |
|---|---|---|---|---|
| Patroni main, data growth drill-down #3784 | | ServicePostgres | boardplanning | workflow-infraTriage |
Service::Prometheus
Issues: 6 ServicePrometheus
| Topic | Team | Service::Prometheus | Board | Workflow Status |
|---|---|---|---|---|
| Test Prometheus 3.0 #4000 | | ServicePrometheus | | workflow-infraTriage |
| Corrective action: Update runbook for prometheus increase storage #2967 | | ServicePrometheus | | workflow-infraTriage |
| Create a more general purpose stackdriver-exporter for teams #2997 | | ServicePrometheus | | workflow-infraReady |
| Deploy Prometheus Rules and Alertmanager from gitlab-helmfiles instead of runbooks #3267 | | ServicePrometheus | | workflow-infraTriage |
| Add monitoring for OAuth2 login endpoints #3228 | | ServicePrometheus | | workflow-infraTriage |
| Implement memory and CPU limits to the Prometheus processes in VMs #3273 | | ServicePrometheus | | workflow-infraTriage |
Service::Redis
Issues: 2 ServiceRedis
| Topic | Team | Service::Redis | Board | Workflow Status |
|---|---|---|---|---|
| Mirror process-exporter image to be resilient to docker registry failure #1709 | | ServiceRedis | | workflow-infraTriage |
| Evaluate porting scheduled CPU profiles for redis observability on Kubernetes #1633 | | ServiceRedis | | workflow-infraTriage |
Service::Runbooks
Issues: 2 ServiceRunbooks
| Topic | Team | Service::Runbooks | Board | Workflow Status |
|---|---|---|---|---|
| Report availability per service and overall GitLab availability. #4082 | | ServiceRunbooks | | workflow-infraTriage |
| Fix update feature categories script on runbooks #3961 | | ServiceRunbooks | | workflow-infraTriage |
Service::Sentry
Issues: 3 ServiceSentry
| Topic | Team | Service::Sentry | Board | Workflow Status |
|---|---|---|---|---|
| Fix gap in sentry monitoring #4224 | | ServiceSentry | | workflow-infraTriage |
| Sentry is not processing events #4223 | | ServiceSentry | | workflow-infraTriage |
| Corrective Action: relieve memory pressure issues with Sentry's kafka #4026 | | ServiceSentry | | workflow-infraTriage |
Service::Sidekiq
Issues: 2 ServiceSidekiq
| Topic | Team | Service::Sidekiq | Board | Workflow Status |
|---|---|---|---|---|
| Continuous profiling for Ruby Projects #3827 | | ServiceSidekiq | | workflow-infraTriage |
| Discuss removal of histogram metrics on Sidekiq for self-managed #2474 | | ServiceSidekiq | | workflow-infraProposal |
Service::Thanos
Issues: 1 ServiceThanos
| Topic | Team | Service::Thanos | Board | Workflow Status |
|---|---|---|---|---|
| Remove remaining Thanos components #4008 | | ServiceThanos | | workflow-infraProposal |
Service::Unknown
Issues: 5 ServiceUnknown
| Topic | Team | Service::Unknown | Board | Workflow Status |
|---|---|---|---|---|
| Make the pager stop melting if the world is on fire. #4222 | | ServiceUnknown | | workflow-infraTriage |
| tenant-observability-stack: Add support for node selector in tenant-observability-config-manager job #4158 | | ServiceUnknown | | workflow-infraTriage |
| tenant-observability-stack: Make images configurable for ARM support #4157 | | ServiceUnknown | | workflow-infraTriage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4151 | | ServiceUnknown | | workflow-infraStalled |
| Creating test dashboard do not properly work #4144 | | ServiceUnknown | | workflow-infraTriage |
Service::Web
Issues: 3 ServiceWeb
| Topic | Team | Service::Web | Board | Workflow Status |
|---|---|---|---|---|
| Web pods are being throttled #4205 | | ServiceWeb | | workflow-infraTriage |
| Add job to update feature categories to the rails app #3963 | | ServiceWeb | | workflow-infraTriage |
| Alert for fatal errors [Corrective action] Mixed deployment issues with WebAuthn logins #3259 | | ServiceWeb | | workflow-infraTriage |
Other
| Topic | Team | Board | Workflow Status |
|---|---|---|---|
| Can we make the triage dashboards useful? #4221 | | | |
| Introduce floor threshold into our Capacity Planning process to improve financial efficiency #4108 | | | |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/4104 | | | |
| Implement a feature registry #4099 | | | |
| Warn of imminent saturation in a more urgent way. #4054 | | | |
| Dimension lookup during reporting and issue management #4033 | | | |
| Discussion: Observability Service topology for metrics in cells #4029 | | | |
| patroni.disk_sustained_write_iops and patroni.disk_sustained_read_iops missing graphs #4032 | | | |
| Error Budgets should be based on full calendar month #3979 | | | |
| Stewardship for common-ci-tasks and related projects #3948 | | | |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3840 | | | |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3835 | | | |
| Early Feedback Cost Dashboards #3794 | | | |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3729 | | | |
| Tamland runner shard saturates on CPU #3712 | | | |
| Synthetic Monitoring / Testing #3637 | | | |
| Move product error budget dashboards out of the current folder #3618 | | | |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/observability/team/-/issues/3441 | | boardplanning | |
| Use expanded labels recording rule for alerting dashboards #3426 | | boardplanning | |
| Observability Feedback from Engineering Productivity Pulse Survey - FY25Q1 #2953 | | | |
| Turn the get-hybrid monitoring config into a monitoring mixin #2832 | | boardplanning | |
| Labkit as the in-application platform toolkit #2793 | | | |
| Introduce open_fds saturation point for process_exporter #2778 | | | |
| Rename SLOs we use in saturation points #2168 | | | |
| Remove custom feature category recordings for the puma component #1481 | | | |