Memory Group - 14.10 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
No noteworthy PTO is expected that could impact capacity for %14.10.
Planning
Top Priorities for %14.10
Improve efficiency and maintainability of application metrics exporters
Who: @mkaeppler
We want to review our approach for serving application metrics to Prometheus.
Priority topics for %14.10:
- Move metrics server out of Puma primary
Following the successful completion of exporting Sidekiq metrics from a separate process, we are looking to extract the metrics server thread running in the Puma primary into a separate server process to improve fault tolerance and GitLab availability. This is in response to past incidents such as gitlab-org/gitlab#118839 (closed), in which we found that the in-process Rack server in the Puma primary can lock up the entire process.
In %14.9 we continued our work on spawning a dedicated metrics-server process from the Puma primary, which we plan to complete early in %14.10.
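As a rough illustration of the fault-isolation idea (not the actual implementation), the sketch below shows a Puma primary that only spawns and supervises a separate metrics server process instead of serving metrics from an in-process Rack thread. The MetricsServerSupervisor class and the metrics-server command are hypothetical names.

```ruby
# Hypothetical sketch: the Puma primary no longer runs a Rack server thread
# for /metrics; it only spawns and restarts a separate metrics server process.
class MetricsServerSupervisor
  def initialize(command: %w[metrics-server --listen 0.0.0.0:9229])
    @command = command
  end

  # Called once from the Puma primary after boot.
  def start
    spawn_server
    Thread.new { supervise }
  end

  private

  def spawn_server
    @pid = Process.spawn(*@command, pgroup: true)
  end

  # If the metrics server hangs or crashes, only that process is affected;
  # the primary restarts it and keeps serving application traffic.
  def supervise
    loop do
      Process.wait(@pid)
      spawn_server
    end
  end
end
```

In this shape, a lock-up like the one described in gitlab-org/gitlab#118839 would be contained to the exporter process rather than taking down the Puma primary.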
- Productionize Golang application exporter
We want to implement a new integrated way of exporting metrics that:
- provides a single application exporter system, which subsumes both gitlab-exporter and app-internal exporters
- runs outside of the Rails monolith
- performs efficiently in the face of large data volumes (tens to hundreds of thousands of samples per scrape)
Our plan is to base our implementation on the prototype Golang exporter that we have built in past milestones. It has proven to be 8 times faster than the existing exporter while using a similar amount of memory.
Performance-related tooling
Who: @alipniagov, @rzwambag
In %14.9 we completed our initiative to consolidate our profiling tools (summary of current status).
In %14.10, as the last pending task to complete this initial iteration on performance-related tooling, we will release documentation on performance profiling and consolidate clear guidelines for developers in one place.
Address rubyzip-related issues
Who: @alipniagov, @rzwambag
We have found that rubyzip can run into performance issues whenever iterating over zip files or reading the central directory is required. This is a well-defined, known performance issue (gitlab-org/gitlab#345673) that we would like to address.
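As a hedged illustration of the two access patterns involved (the file name is made up, and this is not taken from the issue above): Zip::File parses the archive's central directory and builds an entry object for every file up front, whereas Zip::InputStream streams entries sequentially from the local headers, which can be considerably cheaper when a single pass over the entries is enough.

```ruby
require 'zip'

# Zip::File.open parses the whole central directory and instantiates an entry
# object per file, which becomes expensive for archives with many entries:
Zip::File.open('artifacts.zip') do |zip|
  zip.each { |entry| puts entry.name }
end

# Zip::InputStream reads entries one by one from the local headers instead,
# avoiding an up-front read of the central directory:
Zip::InputStream.open('artifacts.zip') do |io|
  while (entry = io.get_next_entry)
    puts entry.name
  end
end
```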
Priority topics for %14.10:
Optimize workers that consume a lot of memory and cause OOM kills
Who: @nmilojevic1
We have identified several workers that occasionally consume more than 1 GB of memory and regularly hit out-of-memory (OOM) conditions, resulting in more than 1000 OOM kills observed on Sidekiq containers per day.
In %14.9 we addressed issues with CoverageReportWorker, which resulted in an 80% reduction in workers being terminated due to out-of-memory events.
In %14.10 we plan to investigate and evaluate as many of the remaining top "offender" workers as possible. If we identify a root cause, we'll try to optimize at least one of them; one possible way to measure per-job memory growth is sketched after the list below.
Priority topics for %14.10:
- Investigate ActionMailer::MailDeliveryJob
- Investigate RepositoryUpdateMirrorWorker
- Investigate Ci::ArchiveTraceWorker
- Investigate ExternalServiceReactiveCachingWorker
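One possible way to narrow down which jobs drive the OOM kills, sketched under the assumption of a Linux host and plain Sidekiq server middleware (the MemoryDeltaLogger class, the threshold, and the log format are illustrative, not GitLab's actual instrumentation):

```ruby
# Hypothetical Sidekiq server middleware that logs how much the process RSS
# grew while a job ran, so memory growth can be attributed to worker classes.
class MemoryDeltaLogger
  THRESHOLD_BYTES = 100 * 1024 * 1024 # only log jobs that grow RSS by >100 MB

  def call(worker, job, _queue)
    before = current_rss_bytes
    yield
  ensure
    delta = current_rss_bytes - before
    if delta > THRESHOLD_BYTES
      Sidekiq.logger.warn("#{worker.class} grew RSS by #{delta / 1024 / 1024} MB (jid=#{job['jid']})")
    end
  end

  private

  # Linux-specific: read the resident set size from /proc.
  def current_rss_bytes
    File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+) kB/, 1].to_i * 1024
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MemoryDeltaLogger
  end
end
```

Logging only large RSS deltas keeps the noise down while still attributing memory growth to specific worker classes and job IDs.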
Update supported Ruby version to 3.0
Who: @alipniagov, @rzwambag
Work on the Ruby 3 upgrade has been on hold since %14.4 (gitlab-org/memory-team/team-tasks#99), when we shifted our focus to creating a dedicated Redis instance for session keys.
With that initiative successfully completed and all the other follow-up initiatives already planned and on track, we plan to revisit our work on the Ruby 3 migration and drive it to completion.
Additional Issues for consideration
- Catalog data sources that reveal performance bottlenecks
Open question: What would the next steps for this one be?