Analyze merge_request_diff_files data

Taken from: gitlab-com/database#110 (comment 97527141)

Let's analyze the data in merge_request_diff_files and compile statistics for:

  1. Size of records over time (on a daily basis).
  2. Size of individual MRs and the distribution of those sizes: are there any outlier MRs we may be able to delete data from?
  3. Number of distinct file data entries (useful for https://gitlab.com/gitlab-org/gitlab-ce/issues/37632 in the context of deduplication).
  4. What else?
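
As a rough illustration of the first three statistics, here is a minimal Python sketch that works on rows already exported from the table. The row shape `(day, mr_id, diff_text)` and the function name `summarize` are assumptions made for this example, not the actual schema or any existing tooling, and the sample rows are made up. Hashing each diff is just one possible way to count distinct file data.

```python
import hashlib
from collections import defaultdict


def summarize(rows):
    """Aggregate diff sizes from rows of (day, mr_id, diff_text).

    The row shape is a hypothetical export format, not the real
    merge_request_diff_files schema.
    """
    bytes_per_day = defaultdict(int)  # statistic 1: size over time
    bytes_per_mr = defaultdict(int)   # statistic 2: size per MR
    distinct_hashes = set()           # statistic 3: distinct file data
    total_rows = 0

    for day, mr_id, diff_text in rows:
        raw = diff_text.encode("utf-8")
        bytes_per_day[day] += len(raw)
        bytes_per_mr[mr_id] += len(raw)
        # Identical diffs hash to the same digest, so the set size
        # is the number of distinct diffs.
        distinct_hashes.add(hashlib.sha256(raw).hexdigest())
        total_rows += 1

    # Largest MRs first, to spot outliers quickly.
    largest_mrs = sorted(bytes_per_mr.items(), key=lambda kv: kv[1], reverse=True)

    return {
        "bytes_per_day": dict(bytes_per_day),
        "largest_mrs": largest_mrs,
        "distinct_diffs": len(distinct_hashes),
        "total_rows": total_rows,
    }


# Made-up sample rows, for illustration only.
sample_rows = [
    ("2018-08-01", 1, "aaaa"),
    ("2018-08-01", 2, "bb"),
    ("2018-08-02", 1, "aaaa"),
    ("2018-08-02", 3, "cccccc"),
]

stats = summarize(sample_rows)
print(stats)
```

On real data the same aggregation would more likely be done in SQL on the replica; the sketch is only meant to pin down what each statistic measures.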

I'm thinking we should stand up a database in gstg for this analysis, or at least use a production replica (maybe one that is not running other transactions).

Edited by Andreas Brandl