Partition the `duo_workflows_checkpoints` table using daily time-based partitions

Problem

The checkpoints table is likely to grow substantially, and old checkpoints become less valuable over time. We should look into partitioning this data before the feature goes to Beta, as making complex schema changes later on is likely to be very time consuming.

Current schema

Introduced in !156872 (merged), the current schema is:

CREATE TABLE duo_workflows_workflows (
    id bigint NOT NULL,
    user_id bigint NOT NULL,
    project_id bigint NOT NULL,
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL
);

CREATE TABLE duo_workflows_checkpoints (
    id bigint NOT NULL,
    workflow_id bigint NOT NULL,
    project_id bigint NOT NULL,
    created_at timestamp with time zone NOT NULL,
    updated_at timestamp with time zone NOT NULL,
    thread_ts text NOT NULL,
    parent_ts text,
    checkpoint jsonb NOT NULL,
    metadata jsonb NOT NULL,
    CONSTRAINT check_3dcc551d16 CHECK ((char_length(parent_ts) <= 255)),
    CONSTRAINT check_5d3139b983 CHECK ((char_length(thread_ts) <= 255))
);

Considerations

A single workflow might be checkpointed hundreds of times, so this table will grow much faster than the workflows table.

We probably don't want old workflows to be deleted over time, but old workflows will not need to be restartable. This is very similar to CI pipelines, where after a point it no longer makes sense to restart them. Similarly, the checkpoints will contain a lot of metadata about what happened during the workflow. If a workflow does not need to be restarted, this can probably be summarised, and we wouldn't need to keep all that metadata, which is mostly internal state.

Docs

  1. Time decay concepts: https://handbook.gitlab.com/handbook/company/working-groups/database-scalability/time-decay/
  2. Partitioning high level docs: https://handbook.gitlab.com/handbook/engineering/infrastructure/core-platform/data_stores/database/doc/partitioning/
  3. Partitioning developer docs: https://docs.gitlab.com/ee/development/database/partitioning/
  4. Different partitioning strategies already available in GitLab models:
    1. https://docs.gitlab.com/ee/development/database/partitioning/date_range.html
    2. https://docs.gitlab.com/ee/development/database/partitioning/hash.html
    3. https://docs.gitlab.com/ee/development/database/partitioning/int_range.html
    4. https://docs.gitlab.com/ee/development/database/partitioning/list.html

Proposal

Use date range partitioning for checkpoints and drop the checkpoints after 30 days

If we assume that every workflow gets some kind of final summary generated when it completes, then that summary may be sufficient for historical purposes, and at that point the specific checkpoints would no longer be needed. We could also generate this summary just before dropping the checkpoints if we wanted, but ideally the summary should be much smaller than all the checkpoints.
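As a rough sketch (nothing here exists yet; the worker, service, model, and column names are all hypothetical), summarising could run as a scheduled job that picks up workflows whose checkpoints are about to age out of the retention window:

# Hypothetical sketch only: summarise workflows whose checkpoints are close to
# the retention cut-off, before their daily partitions are dropped.
class Ai::DuoWorkflows::SummarizeExpiringWorkflowsWorker
  include ApplicationWorker

  RETENTION_PERIOD = 30.days

  def perform
    Ai::DuoWorkflows::Workflow
      .where('created_at < ?', (RETENTION_PERIOD - 1.day).ago) # checkpoints about to be dropped
      .where(summary: nil)                                     # assumes a summary column on workflows
      .find_each do |workflow|
        Ai::DuoWorkflows::SummarizeWorkflowService.new(workflow).execute
      end
  end
end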

There should be no reason to restart a workflow that is more than 30 days old, but we can always tweak the 30-day number if we think we want more time.

We should use daily partitions to keep each partition small. The docs can be found at https://docs.gitlab.com/development/database/partitioning/date_range/#daily-strategy-daily.
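For illustration, assuming the daily strategy takes the same options as the monthly strategy in those docs (the class name and the 30-day retain_for value are our assumptions, not settled), the model declaration could look something like:

# Sketch, not settled: one partition per day on created_at, dropping partitions
# older than 30 days. Class name and retain_for value are assumptions.
module Ai
  module DuoWorkflows
    class Checkpoint < ApplicationRecord
      include PartitionedTable

      self.table_name = 'duo_workflows_checkpoints'

      partitioned_by :created_at, strategy: :daily, retain_for: 30.days
    end
  end
end

Per the linked docs, creating upcoming partitions and detaching expired ones would then be handled by the routine partition management jobs rather than by application code.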

Since we will have APIs that load all checkpoints for a given workflow, we'll need to make sure these queries search the partitions efficiently. To do this they should include a filter on the partitioning key. Since we know that all checkpoints are created after the workflow, we can do something like:

# Give a 5 minute buffer for clock sync issues, but otherwise assume all
# checkpoints are created after the workflow, for efficient retrieval from
# the partitioned table
checkpoints.where("created_at > ?", workflow.created_at - 5.minutes)
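One way to keep this filter consistent across call sites (a sketch; the scope name, class name, and buffer constant are illustrative) is to wrap it in a scope on the checkpoint model:

# Sketch: apply the partition-pruning filter wherever checkpoints are loaded
# for a workflow. The scope name and CLOCK_SKEW_BUFFER are assumptions.
class Ai::DuoWorkflows::Checkpoint < ApplicationRecord
  CLOCK_SKEW_BUFFER = 5.minutes

  scope :for_workflow, ->(workflow) do
    where(workflow_id: workflow.id)
      .where('created_at > ?', workflow.created_at - CLOCK_SKEW_BUFFER)
  end
end

With that in place, API code would call Ai::DuoWorkflows::Checkpoint.for_workflow(workflow), and the created_at condition lets the planner prune down to the daily partitions created on or after the workflow started.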

We do not need to worry about backwards compatibility during our Alpha release, so we can make breaking schema changes. We should probably just use a new feature flag to toggle the use of a new table for this purpose. Once the feature flag is fully rolled out we can simply drop the old table.
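For example (the flag and class names here are hypothetical, and which actor the flag is checked against is an open choice), the write path could pick the table via a feature flag:

# Hypothetical sketch: route new checkpoints to the partitioned table while the
# flag rolls out; the old table keeps its existing rows until we drop it.
def checkpoint_class(project)
  if Feature.enabled?(:duo_workflows_partitioned_checkpoints, project)
    Ai::DuoWorkflows::PartitionedCheckpoint
  else
    Ai::DuoWorkflows::Checkpoint
  end
end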