Draft: Upload diagnostic reports to GCS: uploader part

What does this MR do and why?

Continues !97155 (merged).

Context

We need more data, especially from production instances, to analyze performance and memory problems in GitLab. This includes collecting jemalloc statistics or obtaining Ruby heap dumps from web server workers.

Previously, we had to rely on SRE support to trigger and obtain these reports, which was neither efficient nor convenient.
We decided to build the ability to trigger these reports in the application. This is implemented in !91283 (merged). We started with jemalloc reports. Currently, this is the only report we run, but we plan to add Ruby heap dump reports (issue) soon.

The next step is to make these reports available to engineers.
We decided to build an automatic uploader that uploads them to a dedicated GCS bucket.
This bucket will be accessible to the GitLab team.

Implementation approach

There was a long discussion regarding how this should be implemented. We evaluated multiple approaches.
We came up with multiple PoCs: uploading via fog and uploading via curl. We also evaluated a dedicated process but decided not to go with it just yet (more in the discussion).

Our primary requirements were listed in: !96045 (closed)
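
To make the idea concrete, here is a minimal sketch of one upload-and-cleanup pass. It is not the code from this MR: the class name `ReportsUploadLoop`, the `uploader` callable, and the choice to keep a file when its upload fails are all assumptions made for illustration. The uploader is injected so that the same loop could wrap fog, curl, or anything else.

```ruby
require 'fileutils'

# Hypothetical sketch: names are illustrative, not the MR's actual classes.
class ReportsUploadLoop
  # reports_path: directory that holds finished reports
  # uploader:     any object responding to #call(path); it could wrap fog or curl
  # logger:       any object responding to #call(message)
  def initialize(reports_path:, uploader:, logger: ->(msg) { puts msg })
    @reports_path = reports_path
    @uploader = uploader
    @logger = logger
  end

  # Runs one pass over the directory and returns the names of uploaded files.
  def run_once
    uploaded = []

    Dir.children(@reports_path).sort.each do |name|
      path = File.join(@reports_path, name)
      next unless File.file?(path)

      begin
        @uploader.call(path)
        # Assumption: remove the local copy only after a successful upload.
        FileUtils.rm_f(path)
        uploaded << name
        @logger.call("uploaded #{name}")
      rescue StandardError => e
        # Assumption: keep the file so that a later pass can retry it.
        @logger.call("failed to upload #{name}: #{e.message}")
      end
    end

    uploaded
  end
end
```

A caller would invoke `run_once` periodically, sleeping between passes for a configurable interval.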

What I am considering for follow-ups

  1. Add Prometheus metrics, if we decide we need them. I log both errors and successful uploads, which should be enough to validate that the uploader works as expected.
  2. Mark files that are still streaming and not yet ready to upload: #372850 (closed). Currently, I use a simple workaround to avoid uploading files that have not finished writing. I think this is fine for starting out and validating our solution, and we can improve it in a follow-up. Also, the main change should happen on the reporting side.
  3. Clean up the GCS bucket: #372848 (closed). We'll try to do this via a storage-side solution, e.g. TTL. We are not going to upload large files until we implement Ruby heap dump reports. I suggest we work on expiry and bucket cleanup separately to reduce the scope of this MR.
  4. Consider adding compression (shell out to zip). I think we should start by uploading large reports uncompressed and monitor that. Maybe it'll be fine as is.
  5. Not directly related to this MR, but to proceed with heap dumps we'll need to increase the extraDir size limit: it's currently set to 1G, which is fine for small jemalloc reports but won't fit huge Ruby heap dumps.
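
The workaround for unfinished files (item 2) is not spelled out in this description. One common heuristic, shown here purely as an assumption and not as what the MR implements, is to treat a file as ready only once it has not been modified for some minimum time. The method signature and the 60-second default are hypothetical.

```ruby
# Hypothetical helper: the mtime heuristic and the threshold are assumptions.
# Returns true for a regular file whose last modification is at least
# min_age_s seconds before `now`.
def ready_to_upload?(path, min_age_s: 60, now: Time.now)
  File.file?(path) && (now - File.mtime(path)) >= min_age_s
end
```

A heuristic like this can still misfire if a writer stalls for longer than the threshold, which is why an explicit marker on the reporting side is the more robust follow-up.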

Other notes

Self-managed is not affected because you need to set GITLAB_DIAGNOSTIC_REPORTS_ENABLED. Currently, we are not working on enabling this for self-managed.

The local dev environment is not affected unless you set GITLAB_DIAGNOSTIC_REPORTS_ENABLED. You also need the FF enabled.

Prerequisite: Obtain and configure GCS bucket for diagnostic reports (#372239 (closed)) - we need SRE help for that.

FF rollout issue: #372771 (closed)

How to set up and validate locally

  1. I tested on a real GCS bucket and created my own GCP project via Sandbox Cloud Realm (guide).

  2. Pull the branch

  3. Set necessary ENV vars (replace token/bucket/path with your choice):

export GITLAB_DIAGNOSTIC_REPORTS_UPLOADER_SLEEP_S=10
export GITLAB_DIAGNOSTIC_REPORTS_GCS_TOKEN=`gcloud auth print-access-token`
export GITLAB_DIAGNOSTIC_REPORTS_GCS_BUCKET='diag_reports'
export GITLAB_DIAGNOSTIC_REPORTS_ENABLED='true'
export GITLAB_DIAGNOSTIC_REPORTS_PATH='/Users/al/dev/tmp-diag-reports-uploads'
  4. Hack UploadAndCleanupReports#ready_to_upload?: return true to upload any files that you put into the dir

  5. gdk restart to pick up ENV vars

  6. Enable FF via rails c: Feature.enable(:gitlab_diagnostic_reports_uploader)

  7. Put any file into the designated dir

  8. It should be uploaded (check the bucket) and removed from the dir (check the dir)

  9. Check tail -f log/application_json.log for log records on successful uploads or any issues
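
If nothing gets uploaded, a quick sanity check of the environment can help. The sketch below uses the variable names from the exports above. The helper `missing_report_vars` and the assumption that these four variables are all required are mine, not taken from the MR. GITLAB_DIAGNOSTIC_REPORTS_UPLOADER_SLEEP_S is left out on the assumption that it is optional.

```ruby
# Hypothetical check; treating these four variables as required is an assumption.
REQUIRED_VARS = %w[
  GITLAB_DIAGNOSTIC_REPORTS_ENABLED
  GITLAB_DIAGNOSTIC_REPORTS_PATH
  GITLAB_DIAGNOSTIC_REPORTS_GCS_BUCKET
  GITLAB_DIAGNOSTIC_REPORTS_GCS_TOKEN
].freeze

# Returns the names of variables that are unset or blank in `env`.
def missing_report_vars(env = ENV)
  REQUIRED_VARS.select { |name| env[name].to_s.strip.empty? }
end
```

Calling `missing_report_vars` from rails c after gdk restart shows whether the process actually picked up the exported values.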

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #372242 (closed)
