Application Performance Group - 15.10 Planning
This page may contain information related to upcoming products, features, and functionality. The information presented is for informational purposes only; please do not rely on it for purchasing or planning purposes. As with all projects, the items mentioned on this page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Capacity
In %15.10, the Application Performance team will be operating at ~70% capacity. In particular, @nmilojevic1
will be away for two months on paternity leave; he will return at the end of %15.11.
Upcoming PTO
| Type | Who | When |
|---|---|---|
| Paternity leave | Nikola Milojevic | 2023-02-15 - 2023-04-15 |
| PTO | Aleksei Lipniagov | 2023-02-10 - 2023-02-24 |
| PTO | Roy Zwambag | 2023-02-22 - 2023-02-24 |
| PTO | Matthias Käppler | 2023-02-23 - 2023-02-24 |
Planning
Indicate major efforts by linking to the appropriate epic. Under each epic indicate who is focusing on these efforts and list the individual issues assigned for this milestone.
%15.10
Top Priorities
TL;DR: Ruby 3 roll-out continues to be the team's top priority. Our focus is shifting from memory optimizations to exploring other areas of perceived performance: application profiling and real-time capabilities for GitLab.
Update supported Ruby version to 3.0
Who: @mkaeppler, @rzwambag, @alipniagov
We have scheduled a hard PCL (production change lock) on March 7th, 2023 for the Ruby 3 production roll-out. Ruby 2.7 reaches end of life on March 31, 2023, so we must cut over all systems to Ruby 3 before then. Our biggest remaining unknown is the schedule of events for roll-out day (including DRI availability). A weekly status overview is available here.
- We are currently ON TRACK for a Ruby 3 production roll-out in %15.10.
- Roll-out day source-of-truth
- After the roll-out, we will continue Ruby 3: Post-release tasks (gitlab-org&9635).
- We will also be planning a Ruby 3 retro in %15.10.
- Some highlights from %15.9 include:
  - We have resolved the majority of unknowns in our Ruby 3 list of adoption blockers. The two remaining items are on track for our target roll-out date. (A generic example of the kind of breaking change these blockers guard against follows this list.)
  - We have concluded pre-launch manual testing.
  - As of January 23, Ruby 3 is the default for developers using the GDK and GCK.
  - We have defined https://gitlab.com/gitlab-org/application-performance-team/team-tasks/-/issues/133+
  - We have clarified that GitLab's regular rollback strategies will be sufficient for the Ruby 3 deployment.
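For readers outside the team, a generic illustration of the kind of breaking change our adoption blockers guard against. This is the well-known keyword-argument separation finalized in Ruby 3.0, not one of our specific blockers:

```ruby
# Ruby 3.0 completes the separation of positional and keyword arguments.
def create_user(name, admin: false)
  { name: name, admin: admin }
end

opts = { admin: true }

create_user("alice", opts)   # Ruby 2.7: works, with a deprecation warning.
                             # Ruby 3.0: ArgumentError (given 2, expected 1).
create_user("alice", **opts) # Works on both: explicitly splats the hash as keywords.
```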
Tooling: Spike Google Cloud Profiler to assess its suitability for identifying GitLab performance bottlenecks
Who: @alipniagov, @rzwambag
The Application Performance team is time-boxing an exploration to identify cross-cutting opportunities to improve GitLab's application speed. We are assessing Google Cloud Profiler for Ruby and, if it proves suitable, will deploy it to staging in an effort to identify low-hanging fruit for optimization and improvements.
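For context on the kind of data we would be evaluating: GitLab already supports on-demand sampling profiles via stackprof, and Cloud Profiler would, if adopted, collect similar flame-graph data continuously in production. A minimal stackprof sketch; the workload and reporting options below are placeholders:

```ruby
require "digest"
require "securerandom"
require "stackprof"

# Sample the call stack every 1000 microseconds while the block runs.
profile = StackProf.run(mode: :wall, interval: 1000, raw: true) do
  # Placeholder workload standing in for a real GitLab code path.
  10_000.times { Digest::SHA256.hexdigest(SecureRandom.hex(64)) }
end

# Print the ten hottest frames; in practice we would render a flamegraph.
StackProf::Report.new(profile).print_text(false, 10)
```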
Research on expanding Real-Time Features Across GitLab
Who: @mkaeppler
This work stream is aligned with GitLab's product focus on usability. Real-time collaboration is a table-stakes feature for GitLab users. Historically, GitLab has taken an incremental, iterative approach toward real-time features. The Application Performance team will dive into how real-time features currently work at GitLab, with the aim of understanding how we can accelerate real-time feature development.
The team does not have much familiarity with this domain and will need time to ramp up. We are prioritizing a context transfer from the Real-Time Text Editing Single-Engineer Group due to a team member's departure on March 16th. This means that other real-time research will likely be pushed to %15.11.
%15.10 Focus:
- Ramp up on how real-time text editing is being built at GitLab: https://gitlab.com/gitlab-org/application-performance-team/team-tasks/-/issues/136+
- Identify a small piece of functionality (for example, something Performance Bar related) and look at adding real-time capabilities to it, possibly reviewing our usage of the `peek` library at the same time; a sketch of the mechanism follows this list.
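To make that target concrete, a minimal sketch using Rails ActionCable, which underpins GitLab's existing real-time features; the channel, stream names, and payload here are hypothetical:

```ruby
# Server side: a channel that clients subscribe to for live updates.
class PerformanceBarChannel < ApplicationCable::Channel
  def subscribed
    # Scope the stream to a single request so clients only get relevant data.
    stream_from "performance_bar:#{params[:request_id]}"
  end
end

# Elsewhere in the app (e.g. after timings are collected), push an update
# to every subscribed client without waiting for them to poll:
ActionCable.server.broadcast(
  "performance_bar:#{request_id}",
  { source: "redis", duration_ms: 42 }
)
```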
%15.11 Onward:
- Team demo of research results, around end of March
- Expand on real-time developer docs (gitlab-org/gitlab#390366 - closed)
- Docs tutorial: How to contribute real-time feat... (gitlab-org/gitlab#390402 - closed)
- Performance analysis of GraphQL-to-REST POC (gitlab-org/gitlab#369097)
Restrict Access to Redis Commands
Stretch goal. Who: @rzwambag
Security continues to be an important facet of GitLab's product offering. This effort will prevent default users from executing Redis commands tagged `@admin`, `@connection`, `@dangerous`, and others. It has two main components (a sketch of the kind of ACL rule involved follows this list):
- Enabling Redis Access Control Lists (ACLs) across dev/helm/omnibus, and
- Rolling them out to both SaaS and self-managed environments.
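For illustration, a sketch of the kind of ACL rule this involves, issued here through the redis-rb client; the user name, password handling, and exact category list are placeholders, not the rules we will ship:

```ruby
require "redis"

redis = Redis.new(url: ENV["REDIS_URL"])

# Define a restricted application user: full access to keys and ordinary
# commands, but no admin or dangerous commands.
redis.call(
  "ACL", "SETUSER", "gitlab-app",
  "on",          # enable the user
  ">s3cret",     # placeholder password
  "~*",          # may touch all keys
  "+@all",       # start from all commands...
  "-@admin",     # ...then deny admin commands (CONFIG, SHUTDOWN, ...)
  "-@dangerous"  # ...and dangerous ones (FLUSHALL, KEYS, DEBUG, ...)
)
```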
Closing out items
GitLab Metrics Exporter: production rollout
Who: @alipniagov, @mkaeppler
Current status: We decided to park GME work. Currently, all environments use the default Ruby metrics exporters, not GME.
The reason: On staging, we observed memory usage spikes. Because we bundle GME as a "sidecar" alongside our web server, these spikes would affect the overall memory budget for the node. We time-boxed our investigation but were not able to identify the root cause, so we switched to higher-priority items.
What's next for GME: We did not remove the GME code entirely; we only turned off its switches, keeping the option open to revisit it later. We believe that enabling Google Cloud Profiler for GME (gitlab-org/gitlab-metrics-exporter#21) could give us valuable insight into where memory is allocated.
Fixing and fine-tuning the Sidekiq memory killer
Who: @nmilojevic1
Over the past few months, we shipped our memory watchdog tool; it replaces both the Sidekiq and Puma worker killers, consolidating three different memory killers into a single tool. The watchdog is now deployed across both our SaaS and self-managed fleets; it is enabled on the `catchall`, `urgent-other`, and `urgent-cpu-bound` shards, since our investigation showed these suffered the most from OOM kills. Our optimization efforts have reduced out-of-memory kills from ~2000/day down to ~4-20/day. A more detailed write-up of our efforts and impact is available. A minimal sketch of the watchdog pattern follows below.
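The sketch below shows the general pattern, not GitLab's actual implementation; the limit, strike count, and signalling are illustrative:

```ruby
RSS_LIMIT_BYTES = 2 * 1024**3 # 2 GiB memory budget per worker (placeholder)
MAX_STRIKES     = 5           # tolerate short-lived spikes
CHECK_INTERVAL  = 60          # seconds between samples

# Linux-only: read the resident set size of this process from procfs.
def current_rss_bytes
  File.read("/proc/self/status")[/VmRSS:\s+(\d+) kB/, 1].to_i * 1024
end

Thread.new do
  strikes = 0
  loop do
    sleep CHECK_INTERVAL
    if current_rss_bytes > RSS_LIMIT_BYTES
      strikes += 1
      # Only act after sustained violations; TERM lets the worker finish
      # in-flight jobs and re-queue the rest before it is recycled.
      Process.kill("TERM", Process.pid) if strikes >= MAX_STRIKES
    else
      strikes = 0
    end
  end
end
```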
GitLab is now in a strong position to understand what happens when our workers exceed the memory limit: we have metrics in place that indicate which workers were running at the time the resource limits were exceeded. This can help feature groups identify workers that need further optimization (gitlab-org&7553 (comment 1269056439)).
Even though we've improved memory usage for some workers and improved general health by allowing our jobs to complete and be re-queued, Sidekiq is a very big topic. We will likely see similar issues crop up in the near future due to the coarse granularity of our saturation forecasts. Still, we are confident that our investments to date leave GitLab stable enough to pause this work stream.
Investigate Puma long-term memory use
Who: @mkaeppler, @alipniagov (and potentially @rzwambag and @nmilojevic1 as well)
Having spent a few milestones exploring this space, we do not have a strong hypothesis about the underlying cause. There also no longer appear to be any immediately pressing memory saturation issues on these nodes. We will be winding down this effort in favour of other priorities.