Memory Group - 15.3 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
No noteworthy PTO is expected that could impact the capacity for %15.3.
Planning
%15.3
Top Priorities
Investigate Puma long-term memory use
Who: @mkaeppler, @alipniagov
During our investigation into the Puma runaway memory issues, we identified the need to automatically collect diagnostic data from production Puma instances.
In this first iteration during %15.3, we will focus on producing such reports on a running Puma production instance. Collecting the reports will still require an SRE's help to copy the files off the node for analysis. We will also work on identifying which diagnostic reports to collect.
Priority topics for %15.3:
- Puma diagnostic reports (see the sketch after this list)
- Tune jemalloc settings for SaaS (a memory-efficiency improvement folded into this effort, since it ties in with the rest of the work)
- Identify useful diagnostic reports
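To make the idea concrete, here is a minimal sketch of what producing such a report on a live worker could look like. It is illustrative only: the module name, report contents, and the /tmp/puma-reports path are assumptions, not the actual implementation. As described above, an SRE would still need to copy the resulting file off the node.

```ruby
require 'fileutils'
require 'json'
require 'time'

# Hypothetical sketch of a per-worker diagnostic report (not the actual
# GitLab implementation). Writes GC and RSS statistics to a timestamped
# file that an SRE can later copy off the node for analysis.
module PumaDiagnostics
  REPORT_DIR = '/tmp/puma-reports' # assumption: a node-local scratch dir

  # Write a point-in-time memory report for this worker process.
  def self.write_report
    FileUtils.mkdir_p(REPORT_DIR)
    report = {
      pid: Process.pid,
      timestamp: Time.now.utc.iso8601,
      gc_stat: GC.stat,                         # heap slots, GC runs, etc.
      object_counts: ObjectSpace.count_objects, # live objects by type
      rss_kb: rss_kb                            # resident set size
    }
    path = File.join(REPORT_DIR, "report-#{Process.pid}-#{Time.now.to_i}.json")
    File.write(path, JSON.pretty_generate(report))
    path
  end

  # Linux-only: parse VmRSS out of /proc/self/status.
  def self.rss_kb
    File.read('/proc/self/status')[/^VmRSS:\s+(\d+)/, 1].to_i
  end
end
```

The jemalloc tuning item, by contrast, typically happens outside Ruby: jemalloc reads its settings from the MALLOC_CONF environment variable (options such as background_thread and dirty_decay_ms), so that work is mostly about finding the right values for SaaS nodes.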
Memory killer improvements and evolution
Who: @mkaeppler (@nmilojevic1)
Investigate issues with puma-worker-killer and figure out ways to either improve it or replace it.
We are working on adding a new memory watchdog for Puma that instead optimizes for the following (a sketch follows this list):
- Maintain high heap utilization in Puma workers.
- Avoid node memory saturation caused by workers expanding too far into available memory.
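As a rough illustration of the difference from puma-worker-killer's fixed per-worker limit, the sketch below watches node-wide available memory and only intervenes near saturation, letting workers otherwise keep their heaps warm. All names and thresholds are assumptions, not the actual design.

```ruby
# Hypothetical watchdog sketch (not GitLab's actual implementation):
# rather than enforcing a fixed per-worker RSS cap, it lets workers use
# their heaps freely and only acts when the node itself runs low on memory.
module MemoryWatchdog
  CHECK_INTERVAL_SECONDS = 60   # assumed polling interval
  MIN_AVAILABLE_FRACTION = 0.10 # assumed threshold: act when <10% of RAM is free

  # Start a background thread inside a Puma worker.
  def self.start
    Thread.new do
      loop do
        sleep CHECK_INTERVAL_SECONDS
        if available_fraction < MIN_AVAILABLE_FRACTION
          # Ask this worker to shut down gracefully; the Puma master
          # process then forks a fresh replacement worker.
          Process.kill('TERM', Process.pid)
        end
      end
    end
  end

  # Linux-only: node-wide available memory as a fraction of total,
  # read from /proc/meminfo.
  def self.available_fraction
    meminfo = File.read('/proc/meminfo')
    meminfo[/^MemAvailable:\s+(\d+)/, 1].to_f / meminfo[/^MemTotal:\s+(\d+)/, 1].to_f
  end
end
```

A watchdog like this would plausibly be started from an `on_worker_boot` hook in the Puma configuration, so each forked worker monitors on its own behalf.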
Priority topics for %15.3:
Create custom SLIs for Global Search
Who: @rzwambag
We continue our work supporting group::global search in setting up the custom SLIs for the SearchController and the Search API.
Priority topics for %15.3:
- Expose global_search_apdex as a Prometheus metric
- Expose global_search_success as a Prometheus metric (see the sketch after this list)
- Determine SLO for success rate
- Determine SLO for latency
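For illustration, here is a minimal sketch of exposing the apdex SLI as a pair of Prometheus counters using the prometheus-client gem. GitLab's real implementation goes through its own metrics layer, and the metric names, label, and 1-second threshold below are assumptions pending the SLO work.

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

# Two counters: total measured requests, and requests fast enough to
# count as apdex successes. The apdex ratio is computed later in PromQL.
SEARCH_APDEX_TOTAL = Prometheus::Client::Counter.new(
  :global_search_apdex_total,
  docstring: 'Global search requests measured for apdex',
  labels: [:endpoint]
)
SEARCH_APDEX_SUCCESS = Prometheus::Client::Counter.new(
  :global_search_apdex_success_total,
  docstring: 'Global search requests faster than the apdex threshold',
  labels: [:endpoint]
)
registry.register(SEARCH_APDEX_TOTAL)
registry.register(SEARCH_APDEX_SUCCESS)

APDEX_THRESHOLD_SECONDS = 1.0 # assumed target latency

# Called from the SearchController / Search API after each request.
def record_search_apdex(endpoint:, duration:)
  SEARCH_APDEX_TOTAL.increment(labels: { endpoint: endpoint })
  SEARCH_APDEX_SUCCESS.increment(labels: { endpoint: endpoint }) if duration <= APDEX_THRESHOLD_SECONDS
end
```

The apdex itself is then a ratio of the two counters in PromQL (rate of successes over rate of totals); global_search_success would follow the same pattern, with an error flag in place of the latency threshold.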
Optimize workers that consume a lot of memory and cause OOM kills
Who: @nmilojevic1
Before we continue optimizing more workers that consume a lot of memory, we first want to improve our monitoring and metrics for OOM kills and find more reliable ways to gather the data. Once we have validated that all the metrics we want are in place, we will focus on the candidate workers those metrics surface.
Priority topics for %15.3:
- Improve monitoring and metrics of workers that cause OOM kills (see the sketch after this list)
  - Could we use the Sidekiq Memory Killer to track workers that cause OOM kills? It appears the Sidekiq memory killer never reached its soft limit, resulting in the observed OOM kills.
  - Could we use the GitLab Sidekiq Reliable Fetcher to track workers that cause OOM kills via the `interrupted_count` metric?
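One crude way to surface memory-hungry workers, sketched below, is a Sidekiq server middleware that logs how much RSS each job adds. This is a hypothetical illustration, not GitLab's actual tooling, and it is separate from the Reliable Fetcher approach; the 10 MB threshold is an assumption.

```ruby
require 'sidekiq'

# Hypothetical Sidekiq server middleware: logs jobs whose execution grew
# this process's RSS noticeably, to help identify OOM-kill candidates.
class RssGrowthMiddleware
  def call(worker, job, queue)
    before = rss_kb
    yield
  ensure
    growth = rss_kb - before
    if growth > 10_240 # only log jobs that grew RSS by more than 10 MB
      Sidekiq.logger.info(
        "worker=#{worker.class} queue=#{queue} rss_growth_kb=#{growth}"
      )
    end
  end

  private

  # Linux-only: parse resident set size from /proc/self/status.
  def rss_kb
    File.read('/proc/self/status')[/^VmRSS:\s+(\d+)/, 1].to_i
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add RssGrowthMiddleware
  end
end
```

The `interrupted_count` approach is more direct for OOM kills specifically: the Reliable Fetcher re-picks up jobs whose process died mid-run and counts the interruptions, so a high count points at jobs whose host process keeps getting killed.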
High Severity / Priority issues
Update [redacted] to mitigate VULNDB-xxxx (bug::vulnerability)
Who: @alipniagov
Skipping details on this one, as it is a confidential issue about a vulnerability.