Partitioning: web_hook_logs table
This epic tracks the effort of partitioning the `web_hook_logs` table.
Related issues:
- gitlab-org/gitlab#256088 | initial discussion on the necessity to partition the `web_hook_logs` table.
- gitlab-org/gitlab#276021 | prioritization of partitioning the `web_hook_logs` table due to Primary key integer overflow risk and additional discussion on the partitioning effort.
We have decided to partition the table for the following reasons:
1. There is initial consensus in gitlab-org/gitlab#256088 that partitioning `web_hook_logs` by month is desirable: it would address the issues we are facing with `PruneWebHookLogsWorker` and provide a path forward for setting up a retention policy.
1. `web_hook_logs` has no foreign keys referencing it, which makes partitioning the table feasible with our current tooling (migration helpers).
1. Partitioning `web_hook_logs` by month has the same requirements and logic as the partitioning of `audit_events`, which we [recently completed successfully](https://gitlab.com/groups/gitlab-org/-/epics/3206). That gives us additional confidence that we can complete the process before the Primary Key overflows, which in the worst-case scenario happens in August.
1. It also addresses the Primary Key integer overflow risk discussed in detail in gitlab-org/gitlab#276021. According to our [forecast on 2021-02-22](https://gitlab.com/gitlab-org/gitlab/-/issues/320898#note_514276636), the `web_hook_logs` Primary Key is the fastest growing in the database, expected to overflow between August 2021 (worst-case scenario, using an exponential forecast) and November 2021 (following a polynomial model).
The agreed implementation plan follows:
1. Initialize the partitioning of `web_hook_logs` by mid %"13.10" (early March) while also updating the Primary Key of the new partitioned table to `bigint`
2. Finalize the partitioning of `web_hook_logs` by the end of %"13.11" (mid April) or worst case scenario early %"14.0" (late April)
3. Worst case scenario, swap the partitioned table with the old table in early %"14.1" (late May - early June)
4. Worst case scenario, drop the old `web_hook_logs` in early %"14.2" (late June)
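At the SQL level, the first step of the plan amounts to creating a new range-partitioned copy of the table with a `bigint` Primary Key. A minimal sketch follows; the table name `web_hook_logs_part` and the column list are illustrative, not the exact DDL the migration helpers generate:

```sql
-- Sketch only: name and columns are illustrative.
CREATE TABLE web_hook_logs_part (
  id bigint NOT NULL DEFAULT nextval('web_hook_logs_id_seq'::regclass),
  web_hook_id integer NOT NULL,
  -- ... remaining columns as in web_hook_logs ...
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  -- PostgreSQL requires the partition key to be part of the Primary Key
  PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
```

Note that the Primary Key of the partitioned table becomes composite (`id, created_at`), since PostgreSQL requires the partitioning column to be included in any unique constraint.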
### Additional information
As of 2021-03-01, `web_hook_logs` has \~527M records and a total size of \~1 TiB:
| Table | Rows | Total size | Table size | Index(es) Size | TOAST Size |
|-------|------|--------------|------------|----------------|------------|
| `web_hook_logs` | \~527M | 1.02 TiB (10.46%) | 713.02 GiB (13.37%) | 42.26 GiB (1.10%) | 279.01 GiB (38.56%) |
The earliest record is from July 2020, which means that we expect 16 partitions to be created initially: the default partition, plus 9 partitions holding data (July 2020 through March 2021), plus 6 future partitions:
```sql
gitlabhq_production=> SELECT created_at FROM web_hook_logs ORDER BY id asc LIMIT 1;
created_at
----------------------------
2020-07-26 23:40:36.382274
```
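The monthly range partitions implied above can be sketched as follows; the partition naming scheme and parent table name are illustrative assumptions, not the exact names the helpers produce:

```sql
-- Default partition catches rows outside any monthly range (names illustrative)
CREATE TABLE web_hook_logs_000000 PARTITION OF web_hook_logs_part DEFAULT;

-- One partition per month, e.g. for July 2020 (the earliest data):
CREATE TABLE web_hook_logs_202007 PARTITION OF web_hook_logs_part
  FOR VALUES FROM ('2020-07-01') TO ('2020-08-01');
```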
It has 2 secondary indexes defined and no foreign keys referencing it:
```
gitlabhq_production=> \d web_hook_logs
Table "public.web_hook_logs"
Column | Type | Collation | Nullable | Default
------------------------+-----------------------------+-----------+----------+-------------------------------------------
id | integer | | not null | nextval('web_hook_logs_id_seq'::regclass)
web_hook_id | integer | | not null |
trigger | character varying | | |
url | character varying | | |
request_headers | text | | |
request_data | text | | |
response_headers | text | | |
response_body | text | | |
response_status | character varying | | |
execution_duration | double precision | | |
internal_error_message | character varying | | |
created_at | timestamp without time zone | | not null |
updated_at | timestamp without time zone | | not null |
Indexes:
"web_hook_logs_pkey" PRIMARY KEY, btree (id)
"index_web_hook_logs_on_created_at_and_web_hook_id" btree (created_at, web_hook_id)
"index_web_hook_logs_on_web_hook_id" btree (web_hook_id)
Foreign-key constraints:
"fk_rails_666826e111" FOREIGN KEY (web_hook_id) REFERENCES web_hooks(id) ON DELETE CASCADE
```
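The `\d` output above shows the table's own outgoing foreign key to `web_hooks`, but the relevant check for partitioning is that no other table references `web_hook_logs`. That can be confirmed directly against the catalog:

```sql
-- Lists any foreign keys that reference web_hook_logs; we expect zero rows
SELECT conname, conrelid::regclass AS referencing_table
FROM pg_constraint
WHERE contype = 'f'
  AND confrelid = 'web_hook_logs'::regclass;
```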
The maximum and minimum ids are consistent with the expected \~527M records (the difference is \~567M, the gap being deleted records):
```sql
SELECT max(id) FROM web_hook_logs;
max
------------
1088454310
SELECT min(id) FROM web_hook_logs;
min
-----------
521020087
```
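The overflow risk can also be expressed as remaining id headroom against the signed 4-byte integer maximum (a quick sanity check, not part of the migration itself):

```sql
-- Signed 4-byte integer maximum is 2147483647; remaining id headroom:
SELECT 2147483647 - max(id) AS ids_remaining FROM web_hook_logs;
-- With max(id) around 1.09B, roughly 1.06B ids remain before overflow
```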
Given that our scheduling migration will schedule one `BackFillPartitionedTable` job every 2 minutes, each copying 50K records in sub-batches of 2,500, we expect the back-fill migration to take:
`527M / 50K = 10,540 jobs, one every 2 minutes = 21,080 minutes ≈ 351.33 hours ≈ 14.6 days`
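Each sub-batch boils down to an idempotent id-range copy into the partitioned table. A rough sketch, where the target table name, the concrete id bounds, and the conflict target are illustrative assumptions:

```sql
-- Illustrative sub-batch copy (one 2,500-id range starting at the table's min id);
-- ON CONFLICT makes the job safe to retry without duplicating rows.
INSERT INTO web_hook_logs_part
SELECT * FROM web_hook_logs
WHERE id >= 521020087 AND id < 521022587
ON CONFLICT (id, created_at) DO NOTHING;
```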