Feature Category Summary for Core Services Scalability
Workflow board for Core Services Scalability: Board
Project Work
| Topic | Links | Status | Summary |
|---|---|---|---|
| GitLab SaaS Availability reporting improvements @stejacks-gitlab | &1012 Board | workflow-infra::In Progress | 2023-09-20 Our current focus is completing a fully functional Thanos staging environment (&1068), which will allow us to close out the final task in the precision sub-epic (&1098) and begin work on shardable recording rules (&1107). In the past week we deployed to the staging environment, refactored the sharding method, and improved observability in staging. |
| Stop using namespaces in sidekiq @schin1 | &944 Board | workflow-infra::In Progress | 2023-09-20 Only the Charts MR remains, which means 16.5 will be our first release. This epic will end when 16.7 is released (see the release plan at #2288 (closed)). Updates to this epic will be infrequent. |
| Functional Partitioning options to prevent Redis saturation @lmcandrew | &900 Board | workflow-infra::Triage | |
| Functional Partitioning for pub/sub in Redis @schin1 | &1066 Board | workflow-infra::In Progress | 2023-09-20 The Workhorse workload on gstg will be migrated this week. gprd provisioning is delayed because the GKE cluster is blocked on external-dns refactoring. We may need to move the due date out by one week. |
| Migrate exclusive lease keys from Redis persistent to redis-cluster-shared-state @marcogreg | &1094 Board | workflow-infra::In Progress | 2023-09-20 Staging: exclusive lease keys have been migrated to the new redis-cluster-shared-state. Prod: sizing properties are under review, to be followed by provisioning the cluster (CR). |
| Rails metrics cardinality review | &330 Board | workflow-infra::Triage | |
| Horizontally Scale redis persistent using Redis Cluster | &1055 Board | workflow-infra::Triage | |
| Upgrading Sidekiq and Redis gems | &941 Board | workflow-infra::Triage | |
| Application optimizations for Sidekiq in Kubernetes @msmiley | &539 Board | workflow-infra::Triage | |
| Scaling out and online re-sharding of Redis Cluster | &1105 Board | workflow-infra::Triage | |
Issues Not in Epics
Summary of issues that are not in an Epic (for Core Services Scalability)
Total Issues: 92
| Topic | Service | Board | Workflow Status |
|---|---|---|---|
| Add traceability to Sidekiq worker type feature flags #2529 (closed) | Category:Core Services Scalability | | |
| Update run_sidekiq_jobs and drop_sidekiq_jobs FF to use request as actor #2528 (closed) | Service::Sidekiq | | workflow-infra::Proposal |
| Use correct store and logical abstractions when rate-limiting instead of using exclusive lease #2513 (closed) | Service::RateLimiting | | |
| GITLAB_THROTTLE_USER_ALLOWLIST still rate limit user #2514 (closed) | Service::RateLimiting | | workflow-infra::Triage |
| Possible application-side performance regressions in cache due to Redis Cluster #2503 (moved) | Service::RedisClusterCache | | workflow-infra::Triage |
| Discuss removal of histogram metrics on Sidekiq for self-managed #2474 | Service::Sidekiq | | workflow-infra::Proposal |
| Discussion: when should we scale up our Redis Clusters #2466 (closed) | Service::RedisClusterRateLimiting | | |
| Support Redis Cluster in gitlab compose kit #2465 (closed) | Service::Redis | | |
| Knowledge sharing: Redis #2432 (closed) | Service::Redis | | workflow-infra::Triage |
| [Discussion] Improve redis.yml robustness #2413 (closed) | Service::Redis | | |
| Discussion: Redis password rotation process #2402 (closed) | Service::Redis | | |
| Discussion: Scalability group's stance on future Redis deployments (k8s or VMs) #2340 (closed) | Service::Redis | | |
| Audit unused Sidekiq metrics #2297 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Set Sidekiq cron jobs to be idempotent #2289 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Evaluate redis-sessions workload's Redis Cluster compatibility #2270 (moved) | Service::RedisSessions | | workflow-infra::Triage |
| Increased eval latency and elevated memory utilization on redis shared state #2263 (closed) | Service::Redis | | workflow-infra::Triage |
| Capture redis slowlogs from Redis Cluster nodes #2258 (closed) | Service::RedisCache | | workflow-infra::Triage |
| Removal of queue selector #2220 (closed) | Service::Sidekiq | | workflow-infra::Proposal |
| Saturated disk write throughput on Gitaly VMs #2208 (closed) | Service::Gitaly | | workflow-infra::Triage |
| Update object storage blueprint #2207 (closed) | Service::API | | workflow-infra::Triage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2202 | Service::Redis | | workflow-infra::Triage |
| Investigate missing /var/log/syslog on redis-repository-cache in gprd #2188 (closed) | Service::Redis | | workflow-infra::Triage |
| SaaS Platforms' Redis Roadmap #2155 (closed) | Service::Redis | | |
| Add "preserve config" mode to Redis and Sentinel in Omnibus #2154 (closed) | Service::Redis | | workflow-infra::Triage |
| Disable BGSAVE for redis-registry-cache #2102 (closed) | Service::Redis | | workflow-infra::Triage |
| [pre] Metrics from VMs in the pre-environment aren't scraped #2007 (closed) | Service::Redis | | workflow-infra::Triage |
| Fix the chef config for pre not using the correct ratelimiting instances. #1966 (closed) | Service::Redis | | workflow-infra::Triage |
| Use a headless service to determine redis sentinels to connect to from the application #1961 (closed) | Category:Core Services Scalability | | |
| Scalability's involvement with Application Rate Limiting architecture #1947 (closed) | Service::Web | | workflow-infra::Triage |
| Update sidekiq guides to restrict use of sidekiq APIs #1939 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Recalibrate redis IO threads after eliminating wide fan-out of PUBLISH events #1935 (closed) | Service::Redis | | workflow-infra::In Progress |
| Consume slowlogs from redis in k8s #1911 (closed) | Service::Redis | | workflow-infra::Triage |
| Fluentd single-threaded bottleneck #1906 (closed) | Service::Logging | | workflow-infra::Triage |
| Establish keys subset to shard out of SharedState #1863 (closed) | Service::Redis | | workflow-infra::Triage |
| Direct reads from Sidekiq to read-only replicas by default #1811 (closed) | Service::Patroni | | workflow-infra::Triage |
| Improve the Gitaly weight assigner to take CPU utilization into account #1782 (closed) | Service::Gitaly | | workflow-infra::Triage |
| Add a saturation metric for redis memory usage compared to the configured maxmemory #1765 (closed) | Service::Redis | | workflow-infra::Triage |
| Reduce process-exporter scrape interval on Redis nodes #1742 (closed) | Service::Redis | | workflow-infra::Triage |
| Doc: update user documentation of webhook log #1740 (closed) | Service::Web | | workflow-infra::Triage |
| Data for MailRoom ownership #1737 (closed) | Service::Mailroom | | workflow-infra::Triage |
| Set context types of MailRoom's postback requests to text/plain in the upstream gem #1734 (closed) | Service::Mailroom | | workflow-infra::Triage |
| Mirror process-exporter image to be resilient to docker registry failure #1709 | Service::Redis | | workflow-infra::Triage |
| Object Storage Roadmap #1683 (closed) | Service::Web | | workflow-infra::Triage |
| Separate pgbouncer connection pools for latency sensitive Sidekiq workloads #1682 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/1656 | Service::Sidekiq | | workflow-infra::Triage |
| Do not log Sentry exceptions by default #1583 (closed) | Service::Logging | | workflow-infra::Triage |
| Investigate: Pod failures and PVC's #1563 (closed) | Service::Redis | | workflow-infra::Triage |
| Test data restoration practices #1539 (closed) | Service::Redis | | workflow-infra::Triage |
| Ensure sysctls are set for Redis nodes in GKE #1803 (closed) | Service::Redis | | workflow-infra::Triage |
| Reject oversized emails early before reading their request bodies #1521 (closed) | Service::Mailroom | | workflow-infra::Triage |
| Investigate saturation risk for PgBouncer client connections #1503 (closed) | Service::Pgbouncer | | workflow-infra::Triage |
| Rework Sidekiq style guide #1495 (closed) | Service::Sidekiq | board::build | workflow-infra::Ready |
| Support Mailroom postback retry attempt #1487 (closed) | Service::Mailroom | | workflow-infra::Proposal |
| Clean up mailroom Sidekiq delivery strategy #1463 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Generate Redis interaction map systematically #1452 (closed) | Service::Redis | | workflow-infra::Triage |
| Redis Cache Sentinel node_schedstat_waiting #1427 (closed) | Service::Redis | | workflow-infra::Triage |
| Redis failover inhibited by residual phantom sentinel voters #1385 (closed) | Service::Redis | | workflow-infra::Triage |
| Handle Sidekiq jobs with a compressed payload > 5MB are being rejected #1349 (closed) | Service::Sidekiq | board::planning | workflow-infra::Triage |
| Move workers off of quarantine shard #1211 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Make Sidekiq routing rule validation noisier #1182 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Investigation into INCRBY operations showing in the slowlog #1168 (closed) | Service::Redis | | workflow-infra::Triage |
| Upgrade kitchen-inspec and improve parallel specs in gitlab_fluentd cookbook #1165 (closed) | Service::Monitoring-Other | | workflow-infra::Triage |
| Move the SSL handshake to cloudflare for GitLab-pages accessed through GitLab.io #1163 (closed) | Service::Pages | | workflow-infra::Triage |
| Sidekiq management: move selected enqueued jobs to another queue #1080 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Add sidekiq queries to the patroni SLI #1059 (closed) | Service::Postgres | | workflow-infra::Triage |
| Add a method to allow SREs to intervene in the processing of jobs for specific workers #997 (closed) | Service::Sidekiq | | workflow-infra::Proposal |
| Increase Sidekiq retries from 3 back to 25 #986 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Keep Redis' production flamegraphs longer #860 (closed) | Service::Redis | | workflow-infra::Proposal |
| Integration tests for HAProxy #817 (closed) | Service::HAProxy | | workflow-infra::Triage |
| Add retry-after header into HAProxy #816 (closed) | Service::HAProxy | | workflow-infra::Triage |
| Review duplication in fields logged from Rails #770 (closed) | Service::Logging | | workflow-infra::Triage |
| Some queues send to the dead jobs queue very frequently #715 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Rate limiting git traffic #691 (closed) | Service::Git | | workflow-infra::Triage |
| puma performance for git ssh and git https #673 (closed) | Service::Git | | workflow-infra::Blocked |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/660 | Service::Web | | workflow-infra::Proposal |
| Move Sidekiq throttling functionality into the application #638 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Confidential Issue https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/617 | Service::Web | | workflow-infra::Triage |
| Workhorse logs should report bytes read via labkit #587 (closed) | Service::Web | | workflow-infra::Ready |
| Reduce redundancy in Sidekiq process management #575 (closed) | Service::Redis | | workflow-infra::Blocked |
| In process memory caching on the client side for redis #570 (closed) | Service::Redis | | workflow-infra::Triage |
| Sidekiq worker classes should be annotated with criticality #541 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Define an SLO for rails pod startup times #466 (closed) | Service::Web | | workflow-infra::Triage |
| Document what a 'Sidekiq Shard' is (GitLab.com terminology) #454 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Resolve name consistency of Redis instances/nodes #444 (closed) | Service::Redis | | workflow-infra::Triage |
| Duplicate information controller and action information in rails logs #383 (closed) | Service::Logging | | workflow-infra::Triage |
| CI Runner Structured Log Cleanup #371 (closed) | Service::CI Runners | | workflow-infra::Triage |
| Redis keyspace analysis: alerting #363 (closed) | Service::Redis | | workflow-infra::Blocked |
| Redis key analysis: automate analysis of entire keyspace #362 (closed) | Service::Redis | | workflow-infra::Stalled |
| PeriodicExclusiveDaemon Application Infrastructure #342 (closed) | Service::GitLab Rails | | workflow-infra::Triage |
| Remove --experimental-queue-selector flag from sidekiq-cluster #260 (closed) | Service::Sidekiq | | workflow-infra::Triage |
| Redis failover should be tested as part of our QA integration test suite #133 (closed) | Service::Redis | | workflow-infra::Triage |
| Use multiple Redis cache instances in Rails.cache #49 (closed) | Service::Redis | board::planning | workflow-infra::In Progress |
Edited by Rachel Nienaber