Add new table to for unique job names (Part 1)

Actual Piece to get this working will be done in: #446304

Problem

We need to have the ability to filter jobs by name for !106458. We are unable to due this due to the current table size and needing to add a index. This MR needed to be closed !107102 (closed) due to that problem.

Solution

CREATE TABLE p_ci_build_names (
    build_id bigint NOT NULL,
    partition_id bigint NOT NULL,
    project_id integer NOT NULL,
    name text,
    PRIMARY KEY (build_id, partition_id),
    FOREIGN KEY (partition_id, build_id) REFERENCES p_ci_builds (partition_id, id) ON UPDATE CASCADE ON DELETE CASCADE
)
PARTITION BY LIST (partition_id);

CREATE TABLE ci_build_names partition of p_ci_build_names for values in (100, 101);

After a pipeline is created, we schedule a sidekiq task to go through the jobs and update their build_name_id attribute with a record from p_ci_build_names. With this we can extend the index on p_ci_builds with only one bigint column which will increase its size by about 16 bytes per row. The queries could look something like this:

select * from p_ci_builds where build_name_id in (select id from p_ci_build_names where to_tsvector(name) @@ to_tsquery('test') and project_id = <>) and ...

We also need a specialized search index on p_ci_build_names.

I'm running select count(distinct(name)) from ci_builds where project_id = 278964; on dblab to see how many unique names we have in gitlab-org/gitlab.

Results:

gitlabhq_dblab=# select count(*) from ci_builds where project_id = 278964;
   count
-----------
 133611505
(1 row)

Time: 1731848.576 ms (28:51.849)

gitlabhq_dblab=# select count(distinct(name)) from ci_builds where project_id = 278964;
 count
-------
 25391
(1 row)

Time: 109566988.659 ms (1 d 06:26:06.989)

Edited Apr 04, 2024 by Max Fan