Improve Application Performance team self-sufficiency

Problem statement

Since its inception, our team has lacked access to production nodes, which we need in order to collect application performance diagnostics that go beyond the generally available data (metrics, logs). We do have access to application logs and Prometheus metrics, but these typically tell you what is slow, not why. Our team's primary function is to answer the "why" of performance issues and to propose or implement solutions. Moreover, certain classes of performance degradation cannot be reproduced in development or pre-production environments because these lack real-world traffic or data volume. We tried to expand our team's access rights to match those of an SRE, but eventually found this not to be possible, for reasons explained in this Access Request.

So far we have relied on ad-hoc communication with SREs (e.g. via Slack) to get assistance, but as far as I am aware there is no actual process in place for this. I have discussed this with @igorwwwwwwwwwwwwwwwwwwww, @skarbek and @steveazz, who kindly offered to pair with me on such problems. We seem to agree that while this is a good way to explore each other's fields and exchange knowledge, it is not a sustainable approach.

Examples of cases where this happened in the past:

Obtaining CDN or application logs older than 7 days (we fixed this already by granting some team members access to the respective GCS buckets, but I am mentioning it because it falls into the same general category and it took weeks to resolve)
Obtaining Prometheus metrics files from production Puma and Sidekiq nodes
Inspecting process listings and port bindings (staging would have sufficed, but we equally have no node access here, and Teleport only provides access to the Rails console)
Obtaining core dumps and process maps (this includes our CI infrastructure, where we had multiple occurrences of Ruby VM segfaults we weren't able to debug ourselves)
Most recently, this issue: gitlab-org&8105

To do our job effectively, we should try to find solutions that make our team more self-sufficient, with the aim being to minimize, though not remove, reliance on SREs.

This issue aims to summarize at a high level:

Our team's workflows and tooling requirements.
Proposals for general approaches to enable them.

Workflows and tooling

From my personal point of view, the main difference between SRE and Memory team workflows and tooling is that, in case of incidents or performance problems, site reliability tends to focus on the node or container level (i.e. takes more of a systems-level, black-box approach to diagnosis), whereas our team focuses on the application level (i.e. takes more of a process-level, glass-box approach). This observation is mostly anecdotal and based on the structure of the runbooks and the scripts we maintain, which are almost entirely based on tooling that operates at the cluster or container level.

Our team's focus is not SaaS specifically, but performance at any scale, including single-node Omnibus deployments where containers do not exist. We are therefore focused on application internals such as:

Application memory footprint ("how much memory do we use")
Object allocations and GC churn ("how efficiently is that memory allocated")
Performance related implementation patterns ("how CPU or memory-efficient is this code")
Threading and parallelism ("is this a parallel problem and are we leveraging concurrency")
Inefficient database queries or use of ActiveRecord ("how efficient is this DB query")
Efficient collection of application metrics ("how efficient is this sampler or exporter")
Code extraction into more efficient technologies ("how efficient is this whole system")
Building or documenting developer tooling and workflows related to any of the above

It should be evident from this list that most of these tasks require inspecting application code and processes, i.e. the Ruby VM rather than containers (though there can be overlap). This means most of the analysis we do requires probing into running Linux tasks (processes, threads), either from inside or from outside the application. Probing from outside requires access to the host OS on which the task executes.
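To illustrate what probing "from inside the application" can look like, here is a minimal sketch that counts object allocations around a block of code using `GC.stat`. The helper name `allocations_during` is made up for this example and is not an existing GitLab utility. The count is process-wide, so allocations from other threads running at the same time would be included.

```ruby
# Hypothetical helper (not part of GitLab): returns the number of objects
# the Ruby VM allocated while the given block was running.
# GC.stat(:total_allocated_objects) is a process-wide, monotonically
# increasing counter, so concurrent threads also contribute to the result.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Example usage: building 1,000 interpolated strings.
count = allocations_during do
  Array.new(1_000) { |i| "item-#{i}" }
end
puts "objects allocated: #{count}"
```

This kind of measurement needs no node access, but it only sees what the VM itself exposes.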

Tooling + reports

Below is an (incomplete) list of tools we use frequently, with a focus on the reports they produce, categorized by the kind of access they require and how expensive they are to run. I list cost (execution time, report size) as well because some of these tools will only be feasible to run ad hoc, i.e. manually.

| Report | Tool/API | Interface | Node access required | Execution time | Report size |
|---|---|---|---|---|---|
| Ruby heap dumps | `rbtrace` | CLI | Yes | seconds | 100s of MB to GBs |
| Ruby heap dumps | `ObjectSpace` | Programmatic | No | seconds | 100s of MB to GBs |
| Ruby call traces | `rbtrace` | CLI | Yes | seconds | kBs to MBs |
| Application state (inject VM instructions) | `rbtrace` | CLI | Yes | seconds | kBs to MBs |
| CPU and allocation profiles | `stackprof` | Programmatic, signals | Yes | seconds to minutes | 10s to 100s of MB |
| Ruby GC stats | `GC.stat` | Programmatic, `rbtrace` | No | < 1s | < 100 kB |
| Process maps | `pmap` | CLI | Yes | < 1s | < 1 MB |
| Memory + core dumps | `gdb` | CLI | Yes | seconds | > 1 GB |

Any row with Node access required = Yes poses a problem for our team, since we can only run those tools in development environments.
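For contrast, the two rows with Node access required = No can be produced entirely from within the Ruby process. The following is a rough sketch of that, using `GC.stat` and `ObjectSpace.dump_all` from the standard library. The helper name `write_diagnostics` and the file names are assumptions made for this example. On a real Rails process the heap dump would be large and take seconds to write, as the table indicates.

```ruby
require "objspace"
require "json"
require "tmpdir"

# Hypothetical helper: writes a GC stats snapshot and a full heap dump
# into `dir` and returns the paths of the files it created.
def write_diagnostics(dir)
  gc_path = File.join(dir, "gc_stat.json")
  File.write(gc_path, JSON.generate(GC.stat))

  # ObjectSpace.dump_all emits one JSON document per line, one per heap object.
  heap_path = File.join(dir, "heap_dump.ndjson")
  File.open(heap_path, "w") do |io|
    ObjectSpace.dump_all(output: io)
  end

  { gc_stat: gc_path, heap_dump: heap_path }
end

# Example usage when run as a script.
if __FILE__ == $PROGRAM_NAME
  Dir.mktmpdir do |dir|
    paths = write_diagnostics(dir)
    paths.each { |name, path| puts "#{name}: #{File.size(path)} bytes" }
  end
end
```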

Proposal

I don't think there is a one-size-fits-all solution to this, but I can think of several ways to improve on the status quo:

Establish a framework for closer collaboration between SREs and Memory team. There are some tasks that will be hard or impossible to automate, such as obtaining core dumps from a production node. In these cases, pairing with SREs is necessary. Some ideas:
- Establish stable counterparts. We could identify 1-2 production engineers who can help our team even at short notice. This means we have a well-known list of individuals to reach out to, and there are clearer expectations. The downside is that this could still be disruptive to SREs.
- Use light-weight change requests. This is slightly more process-heavy, but we could have a lighter version of our Change Requests. This would have the benefit that SREs can plan work better. Maybe via the request-general issue template?
Create tooling for remote diagnosis. Since some tools require node access, we could look into ways to make these tools safe(r) to execute remotely by our team on-demand. Teleport is a solution we use to provide authenticated and audited rails console access to production and staging nodes already. We are also looking to extend this to full SSH node access for staging-ref in gitlab-org/gitlab#343938. If we had something similar in place that would allow us to remotely execute a well-defined list of read-only commands in gstg and gprd, this would be tremendously useful. Listing node processes and pulling process maps would be a good use case for this.
Extend tooling and reporting inside the Rails app. We are already exploring this idea in gitlab-org&8105. This would see us invest more in in-application diagnostic features that produce reports accessible to our team, e.g. by uploading them to GCS on a regular basis or in response to system conditions. The main constraint is that there is a limit to what this can achieve, since it typically only provides us with an "in-process" view. It must also be done in a way that does not put system availability at risk.
Utilize GET deployments for performance tests. QA engineers already use GET to run performance tests. However, this creates a similar dependency between us and their team, and we similarly lack the process-level access needed for detailed analysis. This is more something for our team to explore, i.e. whether we have (or can obtain) the ability to spin up e.g. a hybrid 10k deployment with full node access for our own use. It would also not allow us to diagnose issues that arise from real-world traffic and usage, so I see this as complementary to the approaches above.
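To make the "well-defined list of read-only commands" idea from the remote diagnosis proposal more concrete, here is a minimal sketch of an allowlist check. Everything in it is hypothetical: the constant `READ_ONLY_COMMANDS`, the method `permitted?`, and the choice of commands and flags are illustrative only and do not describe how Teleport or any existing GitLab tooling works.

```ruby
require "shellwords"

# Hypothetical allowlist: program name => flags a requester may pass.
# The entries are illustrative, not a vetted policy.
READ_ONLY_COMMANDS = {
  "ps"   => %w[-e -f aux],
  "pmap" => %w[-x],
  "ss"   => %w[-l -t -n -p],
}.freeze

# Returns true only if the program is allowlisted and every argument is
# either an allowlisted flag or a plain number (e.g. a PID).
def permitted?(command_line)
  program, *args = Shellwords.split(command_line)
  allowed_flags = READ_ONLY_COMMANDS[program]
  return false unless allowed_flags

  args.all? { |arg| allowed_flags.include?(arg) || arg.match?(/\A\d+\z/) }
rescue ArgumentError
  # Shellwords.split raises on unbalanced quotes; treat that as a rejection.
  false
end

["pmap -x 1234", "ps aux; rm -rf /", "gdb -p 1"].each do |cmd|
  puts "#{cmd.inspect} permitted? #{permitted?(cmd)}"
end
```

A real implementation would also need to execute the accepted command as an argument vector rather than through a shell, and would sit behind the same authentication and audit trail that Teleport already provides for the Rails console.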

Edited Aug 30, 2022 by Matthias Käppler