Improvements to trace archival process and stale data handling in p_ci_build_trace_metadata
Context
We use p_ci_build_trace_metadata to keep track of archival attempts on job logs and to verify that the signatures of the archived files are correct. However, we don't need to keep this data around after archival succeeds.
What we've done so far:
1. In #500654 (closed), we implemented a process to delete the metadata records of newly archived traces.
2. In #533933 (closed), we truncated the entire table.
With reference to (2), the fact that we could truncate the whole table without significant repercussions indicates that there is much room for improvement in our archival logic. We can either further minimize the retained data or remove the need to persist data in Postgres altogether.
Further context
(Ref: #533933 (comment 2438367011))
It seems that deleting all the existing trace metadata records would effectively just reset the archival attempts to 0 for all failed trace archives... and this doesn't seem so bad. If a job with an unarchived trace still has stale live trace data, we'd just be giving it another 5 attempts.
...
Moreover, I don't see what we actually do with records that have invalid remote_checksum values. It looks like we just log it and leave it for posterity. But I think we still read the invalid archived file anyway?
Possible ideas
- (Ref: #533933 (comment 2438852003))
Update ArchiveTraceService so that it ignores stale live trace data that's too old, i.e. there is no point in re-attempting archival on it after X time period.
With the above, there would be no need to preserve the trace metadata record after X time. We could also update ArchiveTraceService to dynamically delete trace metadata after X time going forward (after we truncate the table).
- (Ref: #533933 (comment 2438856532))
Just to add, we could do the proposal in #533933 (comment 2438852003) as a first iteration. But this means that all of p_ci_build_trace_metadata is essentially ephemeral. So we could consider replacing the whole thing with Redis counters to track attempts instead of a database table. Based on the backoff logic, it looks like we only try all 5 archival attempts within a 7-day window. And that's easily handled by Redis.
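To make the two ideas above concrete, here is a minimal sketch of how an expiring attempt counter and a staleness cutoff could work together. It is an illustration only, not the existing implementation: ExpiringCounters, ArchivalAttemptTracker, the key format, and the max_age parameter (the "X time period" above) are all hypothetical names. The in-memory store is a stand-in for Redis so the sketch runs without a server; a real version would presumably rely on Redis INCR and EXPIRE instead. The limit of 5 attempts and the 7-day window are taken from the comments quoted above.

```ruby
# Hypothetical sketch; none of these classes or keys exist in GitLab.
MAX_ATTEMPTS   = 5                  # from the issue: 5 archival attempts
ATTEMPT_WINDOW = 7 * 24 * 60 * 60   # from the issue: 7-day window, in seconds

# In-memory stand-in for a Redis counter with a TTL (assumption: the real
# store would use INCR plus EXPIRE). Times are plain epoch seconds.
class ExpiringCounters
  def initialize
    @entries = {} # key => [count, expires_at]
  end

  # Increments the counter. The TTL is set only when the key is created,
  # so all attempts for a key expire together.
  def increment(key, ttl:, now:)
    count, expires_at = @entries[key]
    if count.nil? || now >= expires_at
      count = 0
      expires_at = now + ttl
    end
    @entries[key] = [count + 1, expires_at]
    count + 1
  end

  # Returns the current count, or 0 if the key is missing or expired.
  def get(key, now:)
    count, expires_at = @entries[key]
    return 0 if count.nil? || now >= expires_at

    count
  end
end

class ArchivalAttemptTracker
  # max_age is the hypothetical "X time period" after which live trace
  # data is considered too old to be worth archiving.
  def initialize(store, max_age:)
    @store = store
    @max_age = max_age
  end

  def record_attempt(build_id, now:)
    @store.increment(key(build_id), ttl: ATTEMPT_WINDOW, now: now)
  end

  # An archival attempt is worthwhile only if the live trace is not stale
  # and the attempt budget is not used up.
  def attempt?(build_id, trace_started_at:, now:)
    return false if now - trace_started_at > @max_age

    @store.get(key(build_id), now: now) < MAX_ATTEMPTS
  end

  private

  def key(build_id)
    "ci:trace_archival_attempts:#{build_id}"
  end
end
```

Note the interaction between the two checks: once the counter expires, the attempt count silently resets to 0, which is the same effect as truncating the table. The staleness cutoff is what prevents an old trace from receiving another 5 attempts after that reset.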
Proposal
Consider the ideas in the previous section as well as other ideas that may ultimately make our trace archival process much more efficient in terms of processing and data storage.