Feature Category Summary for Core Services Scalability

Workflow board for Core Services Scalability: Board

Project Work

Topic Links Status Summary
GitLab SaaS Availability reporting improvements
@stejacks-gitlab
&1012
Board
workflow-infraIn Progress 2023-09-20

Our current focus is completing a fully functional Thanos staging environment (&1068), which will allow us to close out the final task in the precision subepic (&1098) and begin work on shardable recording rules (&1107). In the past week we deployed to the staging environment, refactored the sharding method, and improved observability there.
Stop using namespaces in sidekiq
@schin1
&944
Board
workflow-infraIn Progress 2023-09-20

Only the Charts MR remains, which means 16.5 will be our first release. This epic will end when 16.7 is released (see the release plan at #2288 (closed)). Updates to this epic will be infrequent.
Functional Partitioning options to prevent Redis saturation
@lmcandrew
&900
Board
workflow-infraTriage
Functional Partitioning for pub/sub in Redis
@schin1
&1066
Board
workflow-infraIn Progress 2023-09-20

The Workhorse workload on gstg will be migrated this week. Provisioning for gprd is delayed because the GKE cluster is blocked on external-dns refactoring efforts. We may need to move the due date by one week.
Migrate exclusive lease keys from Redis persistent to redis-cluster-shared-state
@marcogreg
&1094
Board
workflow-infraIn Progress 2023-09-20

Staging:

* Exclusive lease keys have been migrated to the new redis-cluster-shared-state.

Prod:

* Sizing properties are under review.
* Provisioning the cluster (CR) will follow.
Rails metrics cardinality review
&330
Board
workflow-infraTriage
Horizontally Scale redis persistent using Redis Cluster
&1055
Board
workflow-infraTriage
Upgrading Sidekiq and Redis gems
&941
Board
workflow-infraTriage
Application optimizations for Sidekiq in Kubernetes
@msmiley
&539
Board
workflow-infraTriage
Scaling out and online re-sharding of Redis Cluster
&1105
Board
workflow-infraTriage

Issues Not in Epics

Summary of issues that are not in an Epic (for Core Services Scalability)

Total Issues: 92

Topic Service Board Workflow Status
Add traceability to Sidekiq worker type feature flags
#2529 (closed)
Category:Core Services Scalability
Update run_sidekiq_jobs and drop_sidekiq_jobs FF to use request as actor
#2528 (closed)
ServiceSidekiq workflow-infraProposal
Use correct store and logical abstractions when rate-limiting instead of using exclusive lease
#2513 (closed)
ServiceRateLimiting
GITLAB_THROTTLE_USER_ALLOWLIST still rate limits user
#2514 (closed)
ServiceRateLimiting workflow-infraTriage
Possible application-side performance regressions in cache due to Redis Cluster
#2503 (moved)
ServiceRedisClusterCache workflow-infraTriage
Discuss removal of histogram metrics on Sidekiq for self-managed
#2474
ServiceSidekiq workflow-infraProposal
Discussion: when should we scale up our Redis Clusters
#2466 (closed)
ServiceRedisClusterRateLimiting
Support Redis Cluster in gitlab compose kit
#2465 (closed)
ServiceRedis
Knowledge sharing: Redis
#2432 (closed)
ServiceRedis workflow-infraTriage
[Discussion] Improve redis.yml robustness
#2413 (closed)
ServiceRedis
Discussion: Redis password rotation process
#2402 (closed)
ServiceRedis
Discussion: Scalability group's stance on future Redis deployments (k8s or VMs)
#2340 (closed)
ServiceRedis
Audit unused Sidekiq metrics
#2297 (closed)
ServiceSidekiq workflow-infraTriage
Set Sidekiq cron jobs to be idempotent
#2289 (closed)
ServiceSidekiq workflow-infraTriage
Evaluate redis-sessions workload's Redis Cluster compatibility
#2270 (moved)
ServiceRedisSessions workflow-infraTriage
Increased eval latency and elevated memory utilization on redis shared state
#2263 (closed)
ServiceRedis workflow-infraTriage
Capture redis slowlogs from Redis Cluster nodes
#2258 (closed)
ServiceRedisCache workflow-infraTriage
Removal of queue selector
#2220 (closed)
ServiceSidekiq workflow-infraProposal
Saturated disk write throughput on Gitaly VMs
#2208 (closed)
ServiceGitaly workflow-infraTriage
Update object storage blueprint
#2207 (closed)
ServiceAPI workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2202
ServiceRedis workflow-infraTriage
Investigate missing /var/log/syslog on redis-repository-cache in gprd
#2188 (closed)
ServiceRedis workflow-infraTriage
SaaS Platforms' Redis Roadmap
#2155 (closed)
ServiceRedis
Add "preserve config" mode to Redis and Sentinel in Omnibus
#2154 (closed)
ServiceRedis workflow-infraTriage
Disable BGSAVE for redis-registry-cache
#2102 (closed)
ServiceRedis workflow-infraTriage
[pre] Metrics from VMs in the pre-environment aren't scraped
#2007 (closed)
ServiceRedis workflow-infraTriage
Fix the chef config for pre not using the correct ratelimiting instances.
#1966 (closed)
ServiceRedis workflow-infraTriage
Use a headless service to determine redis sentinels to connect to from the application
#1961 (closed)
Category:Core Services Scalability
Scalability's involvement with Application Rate Limiting architecture
#1947 (closed)
ServiceWeb workflow-infraTriage
Update sidekiq guides to restrict use of sidekiq APIs
#1939 (closed)
ServiceSidekiq workflow-infraTriage
Recalibrate redis IO threads after eliminating wide fan-out of PUBLISH events
#1935 (closed)
ServiceRedis workflow-infraIn Progress
Consume slowlogs from redis in k8s
#1911 (closed)
ServiceRedis workflow-infraTriage
Fluentd single-threaded bottleneck
#1906 (closed)
ServiceLogging workflow-infraTriage
Establish keys subset to shard out of SharedState
#1863 (closed)
ServiceRedis workflow-infraTriage
Direct reads from Sidekiq to read-only replicas by default
#1811 (closed)
ServicePatroni workflow-infraTriage
Improve the Gitaly weight assigner to take CPU utilization into account
#1782 (closed)
ServiceGitaly workflow-infraTriage
Add a saturation metric for redis memory usage compared to the configured maxmemory
#1765 (closed)
ServiceRedis workflow-infraTriage
Reduce process-exporter scrape interval on Redis nodes
#1742 (closed)
ServiceRedis workflow-infraTriage
Doc: update user documentation of webhook log
#1740 (closed)
ServiceWeb workflow-infraTriage
Data for MailRoom ownership
#1737 (closed)
ServiceMailroom workflow-infraTriage
Set context types of MailRoom's postback requests to text/plain in the upstream gem
#1734 (closed)
ServiceMailroom workflow-infraTriage
Mirror process-exporter image to be resilient to docker registry failure
#1709
ServiceRedis workflow-infraTriage
Object Storage Roadmap
#1683 (closed)
ServiceWeb workflow-infraTriage
Separate pgbouncer connection pools for latency sensitive Sidekiq workloads
#1682 (closed)
ServiceSidekiq workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1656
ServiceSidekiq workflow-infraTriage
Do not log Sentry exceptions by default
#1583 (closed)
ServiceLogging workflow-infraTriage
Investigate: Pod failures and PVC's
#1563 (closed)
ServiceRedis workflow-infraTriage
Test data restoration practices
#1539 (closed)
ServiceRedis workflow-infraTriage
Ensure sysctls are set for Redis nodes in GKE
#1803 (closed)
ServiceRedis workflow-infraTriage
Reject oversized emails early before reading their request bodies
#1521 (closed)
ServiceMailroom workflow-infraTriage
Investigate saturation risk for PgBouncer client connections
#1503 (closed)
ServicePgbouncer workflow-infraTriage
Rework Sidekiq style guide
#1495 (closed)
ServiceSidekiq boardbuild workflow-infraReady
Support Mailroom postback retry attempt
#1487 (closed)
ServiceMailroom workflow-infraProposal
Clean up mailroom Sidekiq delivery strategy
#1463 (closed)
ServiceSidekiq workflow-infraTriage
Generate Redis interaction map systematically
#1452 (closed)
ServiceRedis workflow-infraTriage
Redis Cache Sentinel node_schedstat_waiting
#1427 (closed)
ServiceRedis workflow-infraTriage
Redis failover inhibited by residual phantom sentinel voters
#1385 (closed)
ServiceRedis workflow-infraTriage
Handle Sidekiq jobs with a compressed payload > 5MB being rejected
#1349 (closed)
ServiceSidekiq boardplanning workflow-infraTriage
Move workers off of quarantine shard
#1211 (closed)
ServiceSidekiq workflow-infraTriage
Make Sidekiq routing rule validation noisier
#1182 (closed)
ServiceSidekiq workflow-infraTriage
Investigation into INCRBY operations showing in the slowlog
#1168 (closed)
ServiceRedis workflow-infraTriage
Upgrade kitchen-inspec and improve parallel specs in gitlab_fluentd cookbook
#1165 (closed)
ServiceMonitoring-Other workflow-infraTriage
Move the SSL handshake to cloudflare for GitLab-pages accessed through GitLab.io
#1163 (closed)
ServicePages workflow-infraTriage
Sidekiq management: move selected enqueued jobs to another queue
#1080 (closed)
ServiceSidekiq workflow-infraTriage
Add sidekiq queries to the patroni SLI
#1059 (closed)
ServicePostgres workflow-infraTriage
Add a method to allow SREs to intervene in the processing of jobs for specific workers
#997 (closed)
ServiceSidekiq workflow-infraProposal
Increase Sidekiq retries from 3 back to 25
#986 (closed)
ServiceSidekiq workflow-infraTriage
Keep Redis' production flamegraphs longer
#860 (closed)
ServiceRedis workflow-infraProposal
Integration tests for HAProxy
#817 (closed)
ServiceHAProxy workflow-infraTriage
Add retry-after header into HAProxy
#816 (closed)
ServiceHAProxy workflow-infraTriage
Review duplication in fields logged from Rails
#770 (closed)
ServiceLogging workflow-infraTriage
Some queues send to the dead jobs queue very frequently
#715 (closed)
ServiceSidekiq workflow-infraTriage
Rate limiting git traffic
#691 (closed)
ServiceGit workflow-infraTriage
puma performance for git ssh and git https
#673 (closed)
ServiceGit workflow-infraBlocked
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/660
ServiceWeb workflow-infraProposal
Move Sidekiq throttling functionality into the application
#638 (closed)
ServiceSidekiq workflow-infraTriage
Confidential Issue
https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/617
ServiceWeb workflow-infraTriage
Workhorse logs should report bytes read via labkit
#587 (closed)
ServiceWeb workflow-infraReady
Reduce redundancy in Sidekiq process management
#575 (closed)
ServiceRedis workflow-infraBlocked
In process memory caching on the client side for redis
#570 (closed)
ServiceRedis workflow-infraTriage
Sidekiq worker classes should be annotated with criticality
#541 (closed)
ServiceSidekiq workflow-infraTriage
Define an SLO for rails pod startup times
#466 (closed)
ServiceWeb workflow-infraTriage
Document what a 'Sidekiq Shard' is (GitLab.com terminology)
#454 (closed)
ServiceSidekiq workflow-infraTriage
Resolve name consistency of Redis instances/nodes
#444 (closed)
ServiceRedis workflow-infraTriage
Duplicate controller and action information in Rails logs
#383 (closed)
ServiceLogging workflow-infraTriage
CI Runner Structured Log Cleanup
#371 (closed)
ServiceCI Runners workflow-infraTriage
Redis keyspace analysis: alerting
#363 (closed)
ServiceRedis workflow-infraBlocked
Redis key analysis: automate analysis of entire keyspace
#362 (closed)
ServiceRedis workflow-infraStalled
PeriodicExclusiveDaemon Application Infrastructure
#342 (closed)
ServiceGitLab Rails workflow-infraTriage
Remove --experimental-queue-selector flag from sidekiq-cluster
#260 (closed)
ServiceSidekiq workflow-infraTriage
Redis failover should be tested as part of our QA integration test suite
#133 (closed)
ServiceRedis workflow-infraTriage
Use multiple Redis cache instances in Rails.cache
#49 (closed)
ServiceRedis boardplanning workflow-infraIn Progress
Edited by Rachel Nienaber