# Partitioning: `web_hook_logs` table
This epic tracks the effort of partitioning the `web_hook_logs` table.

Related issues:

- gitlab-org/gitlab#256088 | initial discussion on the necessity of partitioning the `web_hook_logs` table.
- gitlab-org/gitlab#276021 | prioritization of partitioning the `web_hook_logs` table due to the primary key integer overflow risk, with additional discussion on the partitioning effort.

We have decided to partition the table for the following reasons:

1. There is an initial consensus in gitlab-org/gitlab#256088 that partitioning `web_hook_logs` by month is desirable: it would address the issues we are facing with `PruneWebHookLogsWorker` and would provide a path forward for setting up a retention policy.
1. `web_hook_logs` has no foreign keys referencing it, making the partitioning of the table feasible with our current tools (migration helpers).
1. Partitioning `web_hook_logs` by month has the same requirements and logic as the partitioning of `audit_events`, which we [recently completed successfully](https://gitlab.com/groups/gitlab-org/-/epics/3206). This gives us additional confidence that we can complete the process before the primary key overflows, which in the worst case scenario happens in August.
1. It will also address the primary key integer overflow risk that is discussed in detail in gitlab-org/gitlab#276021. According to our [forecast on 2021-02-22](https://gitlab.com/gitlab-org/gitlab/-/issues/320898#note_514276636), the `web_hook_logs` primary key value is the fastest growing, expected to overflow between August 2021 (worst case scenario, using an exponential forecast) and November 2021 (following a polynomial model).

The agreed implementation plan follows:

1. Initialize the partitioning of `web_hook_logs` by mid %"13.10" (early March), while also updating the primary key of the new partitioned table to `bigint`.
2. Finalize the partitioning of `web_hook_logs` by the end of %"13.11" (mid April) or, worst case scenario, early %"14.0" (late April).
3. Worst case scenario, swap the partitioned table with the old table in early %"14.1" (late May - early June).
4. Worst case scenario, drop the old `web_hook_logs` table in early %"14.2" (late June).

### Additional information

As of 2021-03-01, `web_hook_logs` has \~527M records and a total size of \~1TB:

| Table | Rows | Total size | Table size | Index(es) size | TOAST size |
|-------|------|------------|------------|----------------|------------|
| `web_hook_logs` | \~527M | 1.02 TiB (10.46%) | 713.02 GiB (13.37%) | 42.26 GiB (1.10%) | 279.01 GiB (38.56%) |

The earliest record is from July 2020, which means that we expect 16 partitions to be created initially (the default partition, 9 partitions with data for July 2020 to March 2021, and 6 future partitions):

```sql
gitlabhq_production=> SELECT created_at FROM web_hook_logs ORDER BY id ASC LIMIT 1;
         created_at
----------------------------
 2020-07-26 23:40:36.382274
```

It has 2 secondary indexes defined and no foreign keys referencing it:

```
gitlabhq_production=> \d web_hook_logs
                                        Table "public.web_hook_logs"
         Column         |            Type             | Collation | Nullable |                  Default
------------------------+-----------------------------+-----------+----------+-------------------------------------------
 id                     | integer                     |           | not null | nextval('web_hook_logs_id_seq'::regclass)
 web_hook_id            | integer                     |           | not null |
 trigger                | character varying           |           |          |
 url                    | character varying           |           |          |
 request_headers        | text                        |           |          |
 request_data           | text                        |           |          |
 response_headers       | text                        |           |          |
 response_body          | text                        |           |          |
 response_status        | character varying           |           |          |
 execution_duration     | double precision            |           |          |
 internal_error_message | character varying           |           |          |
 created_at             | timestamp without time zone |           | not null |
 updated_at             | timestamp without time zone |           | not null |
Indexes:
    "web_hook_logs_pkey" PRIMARY KEY, btree (id)
    "index_web_hook_logs_on_created_at_and_web_hook_id" btree (created_at, web_hook_id)
    "index_web_hook_logs_on_web_hook_id" btree (web_hook_id)
Foreign-key constraints:
    "fk_rails_666826e111" FOREIGN KEY (web_hook_id) REFERENCES web_hooks(id) ON DELETE CASCADE
```

The maximum and minimum ids are consistent with the expected \~527M records (the difference is \~567M):

```sql
SELECT max(id) FROM web_hook_logs;
    max
------------
 1088454310

SELECT min(id) FROM web_hook_logs;
    min
-----------
 521020087
```

Given that our scheduling migration will schedule one `BackFillPartitionedTable` job every 2 minutes, copying 50K records in sub-batches of 2500, we expect the back-fill migration to take:

`527M / 50K = 10,540 jobs, one running every 2 minutes = 351.33 hours = 14.64 days`
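The duration estimate above can be reproduced with a short script. The figures are taken directly from this epic's text; only the arithmetic is shown, not the actual scheduling logic of the migration:

```python
# Back-of-the-envelope check of the back-fill duration estimate.
ROWS = 527_000_000    # ~527M rows in web_hook_logs (as of 2021-03-01)
BATCH = 50_000        # rows copied per scheduled back-fill job
JOB_INTERVAL_MIN = 2  # one job scheduled every 2 minutes

jobs = ROWS // BATCH                   # 10,540 jobs
total_minutes = jobs * JOB_INTERVAL_MIN
hours = total_minutes / 60             # ~351.33 hours
days = hours / 24                      # ~14.64 days

print(jobs, round(hours, 2), round(days, 2))  # 10540 351.33 14.64
```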
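The "16 initial partitions" expectation from the Additional information section can also be sanity-checked. The sketch below simply counts one monthly partition per month from the earliest record's month through 6 months past the current month, plus the default catch-all partition; it is an illustration of the counting, not GitLab's actual partition-creation logic:

```python
from datetime import date

def expected_partitions(earliest: date, today: date, months_ahead: int = 6) -> int:
    """Count monthly range partitions from the earliest record's month
    through `months_ahead` months after the current month, plus one
    default (catch-all) partition."""
    start = earliest.year * 12 + (earliest.month - 1)
    end = today.year * 12 + (today.month - 1) + months_ahead
    return (end - start + 1) + 1  # +1 for the default partition

# Earliest record 2020-07-26, evaluated as of 2021-03-01:
print(expected_partitions(date(2020, 7, 26), date(2021, 3, 1)))  # 16
```

This matches the breakdown in the text: 9 months with data (July 2020 to March 2021), 6 future months, and the default partition.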