Fix duplicated ref tracking

Local Testing procedure for this branch:

git checkout main
rm -r db
scripts/init_db_if_missing.sh
# Start up Lorry as normal and mirror the default repositories
scripts/watch.sh
# Shut down Lorry
git checkout ks/fix-excessive-ref-recording
# manually run the migration since
cargo sqlx migrate run --source lorry/migrations -D sqlite://db/lorries.sqlite
# prepare the sqlx queries such that you can compile the new branch
cargo sqlx prepare --workspace
# Finally start the new version and confirm that refs are maintained from
# both the old and new version of Lorry.

Commit:

Fix duplicated ref tracking

Previously, Lorry inserted ref state tracking information into the refs
table for every job. This meant that each job inserted one row per Git ref
of the repository it mirrored. When mirroring thousands of repositories, some
with tens of thousands of refs, this table could grow very quickly.

Two columns have variable size in the existing table:

name - The name of the ref itself which in many cases can be quite long.
message - The line output of git push result for the particular ref.
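
As a rough illustration of the growth described above, the sketch below estimates row counts for the old table. The function names and the example numbers are hypothetical and are not measurements from a real Lorry deployment; the only assumption taken from the description is that the old table received one row per ref per job.

```python
def old_refs_rows(mirrors, refs_per_mirror, jobs_per_mirror):
    """Rows in the old refs table: one row per ref, per job, per mirror."""
    return mirrors * refs_per_mirror * jobs_per_mirror


def unique_ref_names(mirrors, refs_per_mirror):
    """Distinct (path, name) pairs, i.e. how often a name must be stored once."""
    return mirrors * refs_per_mirror


# Hypothetical deployment: 1,000 mirrors, 10,000 refs each, 10 jobs per mirror.
old_rows = old_refs_rows(1_000, 10_000, 10)
names = unique_ref_names(1_000, 10_000)
print(f"old table rows (each with name and message text): {old_rows:,}")
print(f"distinct ref names: {names:,}")
```

The old row count is multiplied by every additional job, and each of those rows repeated the variable-size name and message text, while the number of distinct ref names does not depend on the job count.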

In this change the existing refs table has been emptied and replaced with
two new tables:

ALTER TABLE refs RENAME TO refs_old;

CREATE TABLE refs (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (path, name)
);

CREATE TABLE ref_state (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL,
    ref_id INTEGER NOT NULL,
    status TEXT CHECK (
        status IN (
            'FastForward',
            'ForcedUpdate',
            'Deleted',
            'NewRef',
            'Rejected',
            'NoPush'
        )
    ) NOT NULL,
    successful BOOLEAN NOT NULL,
    FOREIGN KEY (ref_id) REFERENCES refs(id),
    FOREIGN KEY (job_id) REFERENCES jobs(id)
);

With this schema, each ref name is recorded only once per mirror, and the
state of each ref in the context of a job is now recorded in the ref_state
table.
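
The following is a minimal sketch of how a job might record ref state against this schema, using Python's standard sqlite3 module rather than Lorry's own code. The one-column jobs table, the record_ref_state helper, and the example path are assumptions made for illustration; only the refs and ref_state definitions come from the migration above.

```python
import sqlite3

# refs and ref_state are taken from the migration above. The jobs table is a
# hypothetical one-column stand-in, since the real one is not shown here.
SCHEMA = """
CREATE TABLE jobs (id INTEGER PRIMARY KEY);
CREATE TABLE refs (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (path, name)
);
CREATE TABLE ref_state (
    id INTEGER PRIMARY KEY,
    job_id INTEGER NOT NULL,
    ref_id INTEGER NOT NULL,
    status TEXT CHECK (
        status IN ('FastForward', 'ForcedUpdate', 'Deleted',
                   'NewRef', 'Rejected', 'NoPush')) NOT NULL,
    successful BOOLEAN NOT NULL,
    FOREIGN KEY (ref_id) REFERENCES refs(id),
    FOREIGN KEY (job_id) REFERENCES jobs(id)
);
"""


def record_ref_state(conn, job_id, path, name, status, successful):
    """Store the ref name once, then add one small state row for this job."""
    # UNIQUE (path, name) makes this a no-op when the ref is already known.
    conn.execute(
        "INSERT OR IGNORE INTO refs (path, name) VALUES (?, ?)", (path, name)
    )
    (ref_id,) = conn.execute(
        "SELECT id FROM refs WHERE path = ? AND name = ?", (path, name)
    ).fetchone()
    conn.execute(
        "INSERT INTO ref_state (job_id, ref_id, status, successful) "
        "VALUES (?, ?, ?, ?)",
        (job_id, ref_id, status, successful),
    )
    return ref_id


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO jobs (id) VALUES (?)", [(1,), (2,)])

# Two jobs touch the same ref of the same (hypothetical) mirror path.
record_ref_state(conn, 1, "example/repo", "refs/heads/main", "NewRef", True)
record_ref_state(conn, 2, "example/repo", "refs/heads/main", "FastForward", True)

ref_rows = conn.execute("SELECT COUNT(*) FROM refs").fetchone()[0]
state_rows = conn.execute("SELECT COUNT(*) FROM ref_state").fetchone()[0]

# Joining back to refs recovers the name for display alongside each status.
history = conn.execute(
    "SELECT s.job_id, r.name, s.status FROM ref_state AS s "
    "JOIN refs AS r ON r.id = s.ref_id ORDER BY s.job_id"
).fetchall()
print(ref_rows, state_rows, history)
```

In this sketch the ref name is written once even though two jobs report on it, and the CHECK constraint rejects any status outside the six listed values.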

The cardinality of data stored in the refs table has now been significantly
reduced, and per-job state is limited to a few Git-specific text values,
which we model as an enum in the application. The only variable-length text
still stored is the ref name, which the UI needs to keep presenting.
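
To illustrate what modelling these values as an enum can look like, here is a small Python sketch. It is not Lorry's actual type: the RefStatus name and the parse_status helper are hypothetical, and only the six string values are taken from the CHECK constraint above.

```python
from enum import Enum


class RefStatus(Enum):
    """The six status strings permitted by the ref_state CHECK constraint."""

    FAST_FORWARD = "FastForward"
    FORCED_UPDATE = "ForcedUpdate"
    DELETED = "Deleted"
    NEW_REF = "NewRef"
    REJECTED = "Rejected"
    NO_PUSH = "NoPush"


def parse_status(text):
    """Convert a stored status string into a RefStatus.

    Raises ValueError for any string outside the six known values.
    """
    return RefStatus(text)


status = parse_status("ForcedUpdate")
print(status.name, status.value)
```

Keeping the stored text identical to the enum values means the database constraint and the application type reject the same unknown strings.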

In large databases, this migration may take a considerable amount of time to
apply. Once it has been applied, administrators should manually run a VACUUM;
statement against the database to reclaim the empty space left by the old
schema.
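
The effect of VACUUM can be sketched with Python's standard sqlite3 module. The throwaway database, its refs_old stand-in table, and the vacuum_and_measure helper below are hypothetical; against a real deployment the same statement would be run on the actual database file (db/lorries.sqlite in the local testing procedure).

```python
import os
import sqlite3
import tempfile


def vacuum_and_measure(db_path):
    """Run VACUUM on the database and return the file size afterwards."""
    # isolation_level=None gives autocommit mode; VACUUM cannot run inside
    # an open transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return os.path.getsize(db_path)


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "demo.sqlite")
    conn = sqlite3.connect(db_path, isolation_level=None)
    # Hypothetical stand-in for the bulky data left behind by the old schema.
    conn.execute("CREATE TABLE refs_old (name TEXT, message TEXT)")
    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO refs_old VALUES (?, ?)",
        [(f"refs/heads/branch-{i}", "x" * 200) for i in range(5000)],
    )
    conn.execute("COMMIT")
    # Dropping the data frees pages inside the file but does not shrink it.
    conn.execute("DROP TABLE refs_old")
    conn.close()

    size_before = os.path.getsize(db_path)
    size_after = vacuum_and_measure(db_path)
    print(f"before VACUUM: {size_before} bytes, after: {size_after} bytes")
```

VACUUM rebuilds the database file, so on a large database it needs additional free disk space and time while it runs.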

The UI now reports a git porcelain enum status instead of the free-form text value.

Fixes #225 (closed)

Edited by Kevin Schoon
