
Draft: PoC diag reports uploader: from Puma

Aleksei Lipniagov requested to merge 362902-poc-reports-uploads into master

What does this MR do and why?

🛠 A rough PoC, to explore the implementation approach.

Works on localhost.

How to set up and validate locally

  • I tested on a real GCS bucket: created my own GCP project via Sandbox Cloud Realm (guide), updated the config in gdk.yml according to GDK how-to
  • Pull the branch
  • set ENV var 1: export GITLAB_DIAGNOSTIC_REPORTS_ENABLED=true
  • set ENV var 2: export GITLAB_DIAGNOSTIC_REPORTS_PATH='/Users/al/dev/tmp-diag-reports-uploads' (a path of your choice, or use the default)
  • gdk restart or gdk restart rails-web
  • Put a file into the designated dir
  • It will be uploaded to the GCS bucket every 10s (and re-uploaded each cycle). Cleanup: work in progress.

Checking PoC against requirements

Requirements and discussion: #362902 (comment 1075800269)

"Uploading potentially large files can be costly; any solution should have minimal impact on the ability of this node to serve user traffic." I'd prefer to set some expectations regarding the maximum size. MK: Even with my devkit, a Puma master heap dump after a cold start is 800MB of uncompressed JSON. A worker heap dump from a long-running production node will likely be several times that size.

Status: 👀 (waiting for Jacob's opinion on CPU usage)

Comment: Regarding CPU usage, see !96045 (comment 1082558208). In short: in our experiments, the upload spends ~10% of wall time in Ruby land (which blocks the process from serving requests) and scales linearly with file size; for a 1GB file, it took 3s of CPU out of 30s of wall time.

Regarding Memory usage, see the dedicated requirement below.
Regarding how long it takes: I made a couple of experiments, see the results below.
I used a real GCS in europe-west4 (Netherlands), connected to my GDK (I am also located in the Netherlands).
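The CPU-vs-wall-time split above can be measured with Ruby's monotonic and per-process CPU clocks. A minimal sketch (the `sleep` is a stand-in for the IO-bound upload, which consumes wall time but almost no CPU):

```ruby
# Sketch: compare CPU time vs wall time around a (simulated) upload.
# An IO-bound operation releases the GVL, so CPU time stays far below wall time.
wall_start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
cpu_start  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)

sleep 0.2 # stand-in for the IO-bound upload

wall_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall_start
cpu_time  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu_start

puts format('wall: %.2fs, cpu: %.2fs', wall_time, cpu_time)
```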

Regarding other aspects of availability: any exception in the thread will break the uploader, but it shouldn't affect the worker's other threads. We don't want the uploader to break either, so we need to handle possible exceptions carefully and recover. Logging would also greatly help with troubleshooting if needed.
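A minimal sketch of that defensive loop (`upload_pending_reports` and the class name are hypothetical; the point is that a rescue per cycle keeps one failure from killing the thread, and errors get logged):

```ruby
require 'logger'

# Hypothetical sketch: an uploader loop that survives exceptions and logs them.
class ReportsUploader
  attr_reader :cycles, :errors

  def initialize(logger: Logger.new($stdout), interval: 10)
    @logger = logger
    @interval = interval
    @cycles = 0
    @errors = 0
  end

  def run(max_cycles: nil)
    loop do
      begin
        upload_pending_reports # hypothetical: scan the dir and upload files
      rescue StandardError => e
        @errors += 1
        @logger.error("diagnostic reports uploader: #{e.class}: #{e.message}")
        # swallow the error so the loop (and its thread) keeps running
      end
      @cycles += 1
      break if max_cycles && @cycles >= max_cycles

      sleep @interval
    end
  end

  private

  def upload_pending_reports
    raise 'simulated upload failure' # stand-in for the real upload
  end
end
```

In the real worker this would run as `Thread.new { uploader.run }`, so an exception escaping the rescue would only terminate that thread, not the worker.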

It would be good to confirm that uploading a large file does not allocate a lot of memory (otherwise, the reporting would make memory usage worse, while it is meant to help reduce it).

Status:

Comment: I traced the memory usage of the process during the upload and didn't notice anything unusual. There was a bump (~5%) during the first large file upload (10GB) after the restart, but after that memory only fluctuated slightly.

We want to compress the reports in the future, so we need to plan for that: think about how compression should be done and where it fits into the given solution, while making sure we minimally affect Puma workers' CPU usage.
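As a rough idea of what stdlib-based compression could look like (a sketch, not the planned implementation): streaming through `Zlib::GzipWriter` in fixed-size chunks keeps memory bounded regardless of report size.

```ruby
require 'zlib'

# Sketch: compress a report file in fixed-size chunks so memory usage
# stays bounded even for multi-GB dumps.
def compress_report(src_path, dest_path, chunk_size: 1 << 20)
  Zlib::GzipWriter.open(dest_path) do |gz|
    File.open(src_path, 'rb') do |src|
      while (chunk = src.read(chunk_size))
        gz.write(chunk)
      end
    end
  end
end
```

Note the CPU cost of compression would still land on the worker, which is exactly the concern above about minimally affecting Puma workers' CPU usage.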

TBD: TODO

We need to avoid uploading reports that are "still streaming". That means our code should either understand when the file is closed (which would require some synchronization?) or simply never upload "fresh enough" reports (e.g. less than 30 minutes old).

Status: 🟰 (could be solved in a follow-up to the code that dumps reports)

Comment: we could either 1. change the way we dump them: stream into a file in a separate dir, or modify its name (e.g. add a .tmp extension), then mv it when ready, or 2. simply never upload "fresh" files (e.g. only upload those that are 30+ minutes old).

I suggest opening a follow-up to the Jemalloc reports code.
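Option 2 could be as simple as an mtime filter in the uploader's scan. A sketch (the 30-minute threshold is the example value from the discussion, and the helper name is hypothetical):

```ruby
# Sketch: skip reports that may still be streaming by only picking up
# files whose mtime is older than a threshold.
def uploadable_reports(dir, min_age_seconds: 30 * 60)
  Dir.glob(File.join(dir, '*')).select do |path|
    File.file?(path) && (Time.now - File.mtime(path)) >= min_age_seconds
  end
end
```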

"The uploader will scan a local directory for pending reports, and upload them to a GCS bucket." It's not important where the reports go, as long as they're accessible by engineers and we can store and access 1-2+ weeks of data.

Status:

The uploaded reports should be easily accessible by team members. I think this is the most important requirement after availability. The smallest iteration goal is to access our Jemalloc reports, and future heap dump reports, without SRE change requests.

Status: (with a bit more work to obtain a bucket)

Comment: To achieve this, we need to create a dedicated bucket and configure the app, so it'll likely require additional interaction with SRE.
But it's just a matter of work, so I think our implementation meets that requirement.

"After a successful upload, local reports should be removed from GPRD to free up storage."

Status:

Comment: This is implemented in the PoC.

We need to be aware of storage costs in GCS. MK: We can double-check with infrastructure. I am not sure what the cost would be of, say, these files piling up in GCS. I would suggest running the math on how many files we expect to store and how that would grow over time, then passing this by a production or scalability engineer. I would also suggest exploring storage-side solutions. I know that in BigQuery you can set TTLs. Perhaps you can do that with GCS so that files older than 1 month or so get wiped automatically.

Status: 🟰

Comment: I personally don't think this should be part of the PoC implementation. We could set expiration time metadata (https://cloud.google.com/storage/docs/lifecycle#expirationtime), or simply run a process that cleans up; using fog, we can easily access the GCS API. Also, because we haven't yet implemented heap dump reports, our MVC will only upload Jemalloc reports at first, so this is not a pressing concern and could be addressed later, in parallel with heap dumps. The current approach doesn't impose any limitations on how we deal with it.
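For the storage-side TTL idea, GCS lifecycle rules can auto-delete objects past a given age. A sketch of the rule payload, expressed here as a Ruby hash (the 30-day value is an arbitrary example; the policy would be applied to the bucket via gsutil or the GCS API, not from the Rails app):

```ruby
require 'json'

# Sketch: a GCS lifecycle policy that auto-deletes objects older than 30 days.
LIFECYCLE_POLICY = {
  'rule' => [
    {
      'action'    => { 'type' => 'Delete' },
      'condition' => { 'age' => 30 } # days since object creation
    }
  ]
}.freeze

puts JSON.pretty_generate(LIFECYCLE_POLICY)
```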

Do we want to upload right after the report is dumped, or maybe even stream the report to the endpoint? In my initial implementation, I decided not to couple these operations: the uploader just scans the dir, with no synchronization with the report creators.

Status: (by design)

Comment: In this approach, we don't explore a synchronous flow (dumping -> uploading); the reports daemons and the uploader are unaware of each other. I personally prefer this design (less synchronization is simpler, IMO). Streaming the report directly to the upload endpoint could result in more efficient memory usage and a faster upload overall, but I see that as an idea to explore in a separate PoC, if we are interested.

If we do it from Puma, we need to avoid workers racing over the same report. We obviously don't want to fork from the master (to keep its state clean). The solution I tried is to always spawn the thread from puma_0. The concern is that this makes workers non-equivalent, with puma_0 doing more work in its threads, which is not ideal for the Puma model, where workers should be interchangeable and non-unique.

Status: 👀

Comment: Currently we only spawn the thread from puma_0. Although this makes that worker "unique", which goes against Puma's philosophy of interchangeable workers, the impact on it should be minimal: the main operation is the upload, which is IO-bound and therefore releases the GVL. I'd say that if we meet the main requirement (minimal performance impact on the worker), spawning a thread only from puma_0 is fine.
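A sketch of that gating (`worker_name` is a hypothetical stand-in for however the worker identifies itself, e.g. derived from the `puma_0` process title):

```ruby
# Hypothetical sketch: only the first Puma worker spawns the uploader thread,
# so workers never race over the same report.
def start_uploader_if_first_worker(worker_name)
  return unless worker_name == 'puma_0'

  Thread.new do
    loop do
      # scan the reports dir and upload pending files (IO-bound, releases GVL)
      sleep 10
    end
  end
end
```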

TODO / implementation concerns (some for MVC)

  1. Handle all exceptions
  2. Custom bucket configuration: should we add a separate config, or just use ENV vars/hardcoded values for now?
  3. Replace Daemon with Background Task?
  4. Adjust sleep timers (currently set to a low value for dev)
  5. Add jitter to timers
  6. Add logging (successful upload, timings)
  7. Add metrics
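For items 4-5, a sketch of jittered intervals (the base and jitter values are placeholders) so uploader cycles across workers and restarts don't wake up in lockstep:

```ruby
# Sketch: compute a sleep interval with random jitter. Returning the value
# (rather than sleeping) keeps it easy to test; the loop would then do
# `sleep jittered_interval`.
def jittered_interval(base: 60, jitter: 10)
  base + rand * jitter
end
```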

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #362902 (closed)

Edited by Aleksei Lipniagov
