Memory Group - 14.10 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
No noteworthy PTO is expected that could impact capacity for %14.10.
Planning
Top Priorities for %14.10
Improve efficiency and maintainability of application metrics exporters
Who: @mkaeppler
We want to review our approach for serving application metrics to Prometheus.
Priority topics for %14.10:
- Move metrics server out of Puma primary
Following the successful completion of exporting Sidekiq metrics from a separate process, we are looking to extract the metrics server thread running in the Puma primary into a separate server process to improve fault tolerance and GitLab availability. This is in response to past incidents such as gitlab-org/gitlab#118839 (closed), in which we found that the in-process Rack server in the Puma primary can lock up the entire process.
In %14.9 we continued our work on spawning a dedicated metrics-server process from the Puma primary, which we plan to complete early in %14.10.
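As a rough illustration of the fault-isolation idea (not the actual implementation), the sketch below shows a Puma primary that only spawns and supervises a separate metrics server process instead of serving metrics from an in-process Rack thread. The MetricsServerSupervisor class and the metrics-server command are hypothetical names.

```ruby
# Hypothetical sketch: the Puma primary no longer runs a Rack server thread
# for /metrics; it only spawns and restarts a separate metrics server process.
class MetricsServerSupervisor
  def initialize(command: %w[metrics-server --listen 0.0.0.0:9229])
    @command = command
  end

  # Called once from the Puma primary after boot.
  def start
    spawn_server
    Thread.new { supervise }
  end

  private

  def spawn_server
    @pid = Process.spawn(*@command, pgroup: true)
  end

  # If the metrics server hangs or crashes, only that process is affected;
  # the primary restarts it and keeps serving application traffic.
  def supervise
    loop do
      Process.wait(@pid)
      spawn_server
    end
  end
end
```

In this shape, a lock-up like the one described in gitlab-org/gitlab#118839 would be contained to the exporter process rather than taking down the Puma primary.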
- Productionize Golang application exporter
We want to implement a new integrated way of exporting metrics that:
- provides a single application exporter system, which subsumes both gitlab-exporter and app-internal exporters
- runs outside of the Rails monolith
- performs efficiently in the face of large data volumes (tens to hundreds of thousands of samples per scrape)
Our plan is to base our implementation on the prototype Golang exporter that we have built in past milestones. It has proven to be 8 times faster than the existing exporter while using a similar amount of memory.
Performance-related tooling
Who: @alipniagov, @rzwambag
In %14.9 we completed our initiative to consolidate our profiling tools (summary of current status).
In %14.10, as the last pending task to complete this initial iteration on performance-related tooling, we will release documentation on performance profiling and consolidate clear guidelines for developers in one place.
Address rubyzip-related issues
Who: @alipniagov, @rzwambag
We have found that rubyzip can run into performance issues whenever iterating over zip files or reading the central directory is required. This is a well-defined, known performance issue (gitlab-org/gitlab#345673) that we would like to address.
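As a hedged illustration of the two access patterns involved (the file name is made up, and this is not taken from the issue above): Zip::File parses the archive's central directory and builds an entry object for every file up front, whereas Zip::InputStream streams entries sequentially from the local headers, which can be considerably cheaper when a single pass over the entries is enough.

```ruby
require 'zip'

# Zip::File.open parses the whole central directory and instantiates an entry
# object per file, which becomes expensive for archives with many entries:
Zip::File.open('artifacts.zip') do |zip|
  zip.each { |entry| puts entry.name }
end

# Zip::InputStream reads entries one by one from the local headers instead,
# avoiding an up-front read of the central directory:
Zip::InputStream.open('artifacts.zip') do |io|
  while (entry = io.get_next_entry)
    puts entry.name
  end
end
```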
Priority topics for %14.10:
Optimize workers that consume a lot of memory and cause OOM kills
Who: @nmilojevic1
We have identified several workers that occasionally consume more than 1 GB of memory and regularly hit out-of-memory (OOM) conditions, resulting in more than 1000 OOM kills observed on Sidekiq containers per day.
In %14.9 we addressed issues with CoverageReportWorker, which resulted in an 80% reduction in workers being terminated due to out-of-memory events.
In %14.10 we plan to investigate and evaluate as many of the remaining top "offender" workers as possible. If we identify a root cause, we'll try to optimize at least one of them; one possible way to measure per-job memory growth is sketched after the list below.
Priority topics for %14.10:
- Investigate ActionMailer::MailDeliveryJob
- Investigate RepositoryUpdateMirrorWorker
- Investigate Ci::ArchiveTraceWorker
- Investigate ExternalServiceReactiveCachingWorker
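One possible way to narrow down which jobs drive the OOM kills, sketched under the assumption of a Linux host and plain Sidekiq server middleware (the MemoryDeltaLogger class, the threshold, and the log format are illustrative, not GitLab's actual instrumentation):

```ruby
# Hypothetical Sidekiq server middleware that logs how much the process RSS
# grew while a job ran, so memory growth can be attributed to worker classes.
class MemoryDeltaLogger
  THRESHOLD_BYTES = 100 * 1024 * 1024 # only log jobs that grow RSS by >100 MB

  def call(worker, job, _queue)
    before = current_rss_bytes
    yield
  ensure
    delta = current_rss_bytes - before
    if delta > THRESHOLD_BYTES
      Sidekiq.logger.warn("#{worker.class} grew RSS by #{delta / 1024 / 1024} MB (jid=#{job['jid']})")
    end
  end

  private

  # Linux-specific: read the resident set size from /proc.
  def current_rss_bytes
    File.read("/proc/#{Process.pid}/status")[/VmRSS:\s+(\d+) kB/, 1].to_i * 1024
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add MemoryDeltaLogger
  end
end
```

Logging only large RSS deltas keeps the noise down while still attributing memory growth to specific worker classes and job IDs.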
Update supported Ruby version to 3.0
Who: @alipniagov, @rzwambag
Work on the Ruby 3 upgrade has been on hold since %14.4 (gitlab-org/memory-team/team-tasks#99), when we shifted our focus to creating a dedicated Redis instance for session keys.
With that initiative successfully completed and all the other follow-up initiatives already planned and on track, we plan to revisit our work on the Ruby 3 migration and drive it to completion.
Additional Issues for consideration
- Catalog data sources that reveal performance bottlenecks
Open question: What would the next steps for this one be?