Improvements to trace archival process and stale data handling in `p_ci_build_trace_metadata`

Context

We use `p_ci_build_trace_metadata` to keep track of archival attempts on job logs and to verify that the signatures of the archived files are correct. However, we don't need to keep this data around once archival has succeeded.

What we've done so far:

1. In #500654 (closed), we implemented a process to delete the metadata records of newly archived traces.
2. In #533933 (closed), we truncated the entire table.

With reference to (2), the fact that we could truncate the whole table without significant repercussions indicates that there is much room for improvement in our archival logic. We can either further minimize the retained data or remove the need to persist data in Postgres altogether.

Further context

(Ref: #533933 (comment 2438367011))

It seems that deleting all the existing trace metadata records would effectively just reset the archival attempts to 0 for all failed trace archives... and this doesn't seem so bad. If a job with an unarchived trace still has stale live trace data, we'd just be giving it another 5 attempts.

...

Moreover, I don't see what we actually do with records that have invalid remote_checksum values. It looks like we just log it and leave it for posterity. But I think we still read the invalid archived file anyway?

Possible ideas

(Ref: #533933 (comment 2438852003))

Update ArchiveTraceService so that it ignores stale live trace data that's too old, i.e. there is no point in re-attempting archival on it after some time period X.

With the above, there would be no need to preserve the trace metadata record after X time. We could also update ArchiveTraceService to dynamically delete trace metadata after X time going forward (after we truncate the table).
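To make the cutoff idea concrete, here is a minimal, language-neutral sketch in Python (the real service is not written in Python). The function names, the use of a job's finish time as the reference point, and the 7-day value for X are all assumptions for illustration; none of them come from the existing code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical value for the "X time period" discussed above.
MAX_ARCHIVAL_AGE = timedelta(days=7)


def should_attempt_archival(finished_at, now, max_age=MAX_ARCHIVAL_AGE):
    """Return True while the live trace is recent enough to be worth archiving."""
    return now - finished_at <= max_age


def metadata_is_expired(finished_at, now, max_age=MAX_ARCHIVAL_AGE):
    """Return True once the metadata record no longer serves a purpose.

    Under this sketch the record is only useful while archival may still be
    attempted, so it expires exactly when attempts stop.
    """
    return not should_attempt_archival(finished_at, now, max_age)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    recent = now - timedelta(days=1)
    stale = now - timedelta(days=30)
    print(should_attempt_archival(recent, now))  # still within the window
    print(metadata_is_expired(stale, now))       # safe to delete
```

Tying both decisions to the same cutoff means a record is never deleted while a retry could still read it.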

(Ref: #533933 (comment 2438856532))

Just to add, we could do the proposal in #533933 (comment 2438852003) as a first iteration. But this means that all of p_ci_build_trace_metadata is essentially ephemeral. So we could consider replacing the whole thing with Redis counters to track attempts instead of a database table. Based on the backoff logic, it looks like we only try all 5 archival attempts within a 7-day window. And that's easily handled by Redis.
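As a rough illustration of the counter idea, the sketch below tracks attempts per build with a key that expires after the retry window. It is written in Python against a small in-memory stand-in rather than a real Redis client, so it runs on its own. With Redis, the equivalent would presumably be an `INCR` on the key plus an `EXPIRE` set on the first increment. The key format, class, and function names are hypothetical; only the limit of 5 attempts and the 7-day window come from the discussion above.

```python
import time

MAX_ATTEMPTS = 5                          # attempt limit mentioned above
ATTEMPT_TTL_SECONDS = 7 * 24 * 60 * 60    # the 7-day window mentioned above


class InMemoryCounterStore:
    """Stand-in for Redis that only supports increment-with-expiry."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key, ttl):
        now = self._clock()
        count, expires_at = self._data.get(key, (0, None))
        if expires_at is not None and expires_at <= now:
            # The key has expired, so start again from zero.
            count, expires_at = 0, None
        count += 1
        if expires_at is None:
            # Set the TTL only on the first increment so the window is fixed.
            expires_at = now + ttl
        self._data[key] = (count, expires_at)
        return count


def attempt_key(build_id):
    # Hypothetical key naming scheme.
    return f"ci:trace_archive_attempts:{build_id}"


def register_attempt(store, build_id):
    """Record an archival attempt; return True if it is within the limit."""
    count = store.incr_with_ttl(attempt_key(build_id), ATTEMPT_TTL_SECONDS)
    return count <= MAX_ATTEMPTS
```

Because the counter disappears on its own when the window ends, no cleanup job or table truncation is needed. The trade-off is that losing the counter early (for example, through eviction) simply grants extra attempts, which matches the observation above that resetting attempts to 0 is not harmful.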

Proposal

Consider the ideas in the previous section as well as other ideas that may ultimately make our trace archival process much more efficient in terms of processing and data storage.

Edited Aug 29, 2025 by 🤖 GitLab Bot 🤖
