Memory Group - 15.3 Planning
This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
No noteworthy PTO is expected that could impact the capacity for %15.3.
Planning
%15.3
Top Priorities
Investigate Puma long-term memory use
Who: @mkaeppler, @alipniagov
During our investigation into the Puma runaway memory issues, we identified the need to automatically collect diagnostic data from production Puma instances.
In this first iteration during %15.3, we will focus on producing such reports on a running Puma production instance. Collecting the reports will still require an SRE's help to copy the files off the node for analysis. We will also work on identifying which diagnostic reports to collect.
Priority topics for %15.3:
- Puma diagnostic reports (see the sketch after this list)
- Tune jemalloc settings for SaaS (a memory-efficiency improvement folded into this effort, since it ties in with the rest of the work)
- Identify useful diagnostic reports
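To make the idea concrete, here is a minimal sketch of what producing such a report on a live worker could look like. It is illustrative only: the module name, report contents, and the /tmp/puma-reports path are assumptions, not the actual implementation. As described above, an SRE would still need to copy the resulting file off the node.

```ruby
require 'fileutils'
require 'json'
require 'time'

# Hypothetical sketch of a per-worker diagnostic report (not the actual
# GitLab implementation). Writes GC and RSS statistics to a timestamped
# file that an SRE can later copy off the node for analysis.
module PumaDiagnostics
  REPORT_DIR = '/tmp/puma-reports' # assumption: a node-local scratch dir

  # Write a point-in-time memory report for this worker process.
  def self.write_report
    FileUtils.mkdir_p(REPORT_DIR)
    report = {
      pid: Process.pid,
      timestamp: Time.now.utc.iso8601,
      gc_stat: GC.stat,                         # heap slots, GC runs, etc.
      object_counts: ObjectSpace.count_objects, # live objects by type
      rss_kb: rss_kb                            # resident set size
    }
    path = File.join(REPORT_DIR, "report-#{Process.pid}-#{Time.now.to_i}.json")
    File.write(path, JSON.pretty_generate(report))
    path
  end

  # Linux-only: parse VmRSS out of /proc/self/status.
  def self.rss_kb
    File.read('/proc/self/status')[/^VmRSS:\s+(\d+)/, 1].to_i
  end
end
```

The jemalloc tuning item, by contrast, typically happens outside Ruby: jemalloc reads its settings from the MALLOC_CONF environment variable (options such as background_thread and dirty_decay_ms), so that work is mostly about finding the right values for SaaS nodes.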
Memory killer improvements and evolution
Who: @mkaeppler (@nmilojevic1)
Investigate issues with puma-worker-killer and figure out ways to either improve it or replace it.
We are working on adding a new memory watchdog for Puma that instead optimizes for the following (a sketch follows this list):
- Maintain high heap utilization in Puma workers.
- Avoid node memory saturation caused by workers expanding too far into available memory.
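As a rough illustration of the difference from puma-worker-killer's fixed per-worker limit, the sketch below watches node-wide available memory and only intervenes near saturation, letting workers otherwise keep their heaps warm. All names and thresholds are assumptions, not the actual design.

```ruby
# Hypothetical watchdog sketch (not GitLab's actual implementation):
# rather than enforcing a fixed per-worker RSS cap, it lets workers use
# their heaps freely and only acts when the node itself runs low on memory.
module MemoryWatchdog
  CHECK_INTERVAL_SECONDS = 60   # assumed polling interval
  MIN_AVAILABLE_FRACTION = 0.10 # assumed threshold: act when <10% of RAM is free

  # Start a background thread inside a Puma worker.
  def self.start
    Thread.new do
      loop do
        sleep CHECK_INTERVAL_SECONDS
        if available_fraction < MIN_AVAILABLE_FRACTION
          # Ask this worker to shut down gracefully; the Puma master
          # process then forks a fresh replacement worker.
          Process.kill('TERM', Process.pid)
        end
      end
    end
  end

  # Linux-only: node-wide available memory as a fraction of total,
  # read from /proc/meminfo.
  def self.available_fraction
    meminfo = File.read('/proc/meminfo')
    meminfo[/^MemAvailable:\s+(\d+)/, 1].to_f / meminfo[/^MemTotal:\s+(\d+)/, 1].to_f
  end
end
```

A watchdog like this would plausibly be started from an `on_worker_boot` hook in the Puma configuration, so each forked worker monitors on its own behalf.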
Priority topics for %15.3:
Create custom SLIs for Global Search
Who: @rzwambag
We continue our work supporting group::global search in setting up the custom SLIs for the SearchController and the Search API.
Priority topics for %15.3:
- Expose global_search_apdex as a Prometheus metric
- Expose global_search_success as a Prometheus metric (see the sketch after this list)
- Determine SLO for success rate
- Determine SLO for latency
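For illustration, here is a minimal sketch of exposing the apdex SLI as a pair of Prometheus counters using the prometheus-client gem. GitLab's real implementation goes through its own metrics layer, and the metric names, label, and 1-second threshold below are assumptions pending the SLO work.

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

# Two counters: total measured requests, and requests fast enough to
# count as apdex successes. The apdex ratio is computed later in PromQL.
SEARCH_APDEX_TOTAL = Prometheus::Client::Counter.new(
  :global_search_apdex_total,
  docstring: 'Global search requests measured for apdex',
  labels: [:endpoint]
)
SEARCH_APDEX_SUCCESS = Prometheus::Client::Counter.new(
  :global_search_apdex_success_total,
  docstring: 'Global search requests faster than the apdex threshold',
  labels: [:endpoint]
)
registry.register(SEARCH_APDEX_TOTAL)
registry.register(SEARCH_APDEX_SUCCESS)

APDEX_THRESHOLD_SECONDS = 1.0 # assumed target latency

# Called from the SearchController / Search API after each request.
def record_search_apdex(endpoint:, duration:)
  SEARCH_APDEX_TOTAL.increment(labels: { endpoint: endpoint })
  SEARCH_APDEX_SUCCESS.increment(labels: { endpoint: endpoint }) if duration <= APDEX_THRESHOLD_SECONDS
end
```

The apdex itself is then a ratio of the two counters in PromQL (rate of successes over rate of totals); global_search_success would follow the same pattern, with an error flag in place of the latency threshold.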
Optimize workers that consume a lot of memory and cause OOM kills
Who: @nmilojevic1
Before we continue optimizing more workers that consume a lot of memory, we first want to improve our monitoring and metrics for OOM kills and find more reliable ways to gather the data. Once we have validated that all the metrics we want are in place, we will focus on the candidate workers those metrics surface.
Priority topics for %15.3:
- Improve monitoring and metrics of workers that cause OOM kills (see the sketch after this list)
  - Could we use the Sidekiq Memory Killer to track workers that cause OOM kills? It appears the Sidekiq memory killer never reached its soft limit, resulting in the observed OOM kills.
  - Could we use the GitLab Sidekiq Reliable Fetcher to track workers that cause OOM kills via the `interrupted_count` metric?
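One crude way to surface memory-hungry workers, sketched below, is a Sidekiq server middleware that logs how much RSS each job adds. This is a hypothetical illustration, not GitLab's actual tooling, and it is separate from the Reliable Fetcher approach; the 10 MB threshold is an assumption.

```ruby
require 'sidekiq'

# Hypothetical Sidekiq server middleware: logs jobs whose execution grew
# this process's RSS noticeably, to help identify OOM-kill candidates.
class RssGrowthMiddleware
  def call(worker, job, queue)
    before = rss_kb
    yield
  ensure
    growth = rss_kb - before
    if growth > 10_240 # only log jobs that grew RSS by more than 10 MB
      Sidekiq.logger.info(
        "worker=#{worker.class} queue=#{queue} rss_growth_kb=#{growth}"
      )
    end
  end

  private

  # Linux-only: parse resident set size from /proc/self/status.
  def rss_kb
    File.read('/proc/self/status')[/^VmRSS:\s+(\d+)/, 1].to_i
  end
end

Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    chain.add RssGrowthMiddleware
  end
end
```

The `interrupted_count` approach is more direct for OOM kills specifically: the Reliable Fetcher re-picks up jobs whose process died mid-run and counts the interruptions, so a high count points at jobs whose host process keeps getting killed.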
High Severity / Priority issues
Update [redacted] to mitigate VULNDB-xxxx (bug::vulnerability)
Who: @alipniagov
Skipping details on this one, as it is a confidential issue about a vulnerability.