Follow-up from "Avoid conflicts between ArchiveTracesCronWorker and ArchiveTraceWorker"

The following discussion from !31376 (merged) should be addressed:

  • @ayufan started a discussion: (+7 comments)

    Yes, checking only the .finished_at column is an expensive operation, as it effectively requires us to also loop through all running builds, which seems inefficient and will not scale when we have 10k-100k running builds at any given time. Actually, it would mean that we require an index on id+status.

    How could we remove this CronWorker? I think this is our biggest problem to solve. We should not need it in the first place, and should instead use a proper queuing mechanism. Maybe we need some extra table, ci_builds_finished, that we would use to do final processing of the build, including archiving of traces, and we would remove entries from this table once processed? It seems that adding such a table, inserting into it as part of after_transition, and having it processed as part of BuildFinishWorker would be way cheaper. For example, we are adding an additional helper for processing as part of https://gitlab.com/gitlab-org/gitlab-ce/issues/65538#note_204508591; this concept seems quite similar, but for a slightly different use case. We need a way to ensure that all builds are processed at the end via Sidekiq. So, technically, we replicate the queuing mechanism within the database :(

    So, what I'm asking and proposing is: can we somehow avoid running this expensive query and instead have a more event-based approach, in the form of an additional table/column, or just simply rely on Sidekiq? Because this is definitely a boring change, but it also does not scale :(

  • @dosuken123 started a discussion: (+7 comments)

    It's a much broader picture than the original intention of this MR, but I'll comment on your suggestion.

    > which seems inefficient and will not scale when we have 10k-100k running builds at any given time.

    This is definitely the worst-case scenario, because in most cases ArchiveTraceWorker successfully finishes its own task (occasionally it relies on a Sidekiq retry). And even if, say, an outage happened on GCS and it did not respond for an hour, ArchiveTracesCronWorker can rescue the stale live traces for the builds that finished during that hour. Having two workers for the same purpose would not be considered a clean architecture, but it works as a pragmatic approach.

    You're talking about scalability, but consider that we haven't even removed the feature flag to make the new live trace architecture enabled by default, which means we are not yet at the stage of discussing future scalability. It is NOT functional today (it causes a small amount of trace loss), and that's the problem we want to fix ASAP. So again, what we need to do here is fix the outstanding bug and make the new live trace architecture enabled by default. Next, we keep monitoring the performance metrics and optimize/re-architect if necessary (with whatever alternative you're suggesting above).

    > Maybe we need some extra table, ci_builds_finished, that we would use to do final processing of the build, including archiving of traces, and we would remove entries from this table once processed?

    I'm not really sure how this solves the problem we have today. It would help decouple some complexity of Ci::Build into a new place, but having a cron worker is still inevitable. What if BuildFinishWorker couldn't finish the whole process and its retries are exhausted? Who's going to complete the cleanup process? How does a queueing mechanism help that situation?

    > So, what I'm asking and proposing is: can we somehow avoid running this expensive query and instead have a more event-based approach, in the form of an additional table/column, or just simply rely on Sidekiq?

    I cannot allocate time for this, at least in this MR. I'd rather close this MR and ask the Verify group to properly re-architect according to your suggestion; well, actually we're still seeking the solution. One boring solution would be increasing ArchiveTraceWorker's retry count to an almost indefinite value (e.g. 999999) with some wait between attempts (e.g. every 10 minutes). Or we can rescue the exception and use perform_in to reprocess until it's done. Either way, it's scary to get stuck in an eternal loop.
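For reference, the table-based approach sketched in the first discussion could look roughly like the following. This is a minimal sketch only: the migration, the `Ci::BuildFinished` model, and the exact hook placement are illustrative assumptions rather than existing GitLab code, and it should be read as pseudocode for discussion, not a drop-in patch.

```ruby
# Hypothetical migration for the ci_builds_finished table suggested above.
class CreateCiBuildsFinished < ActiveRecord::Migration[5.2]
  def change
    create_table :ci_builds_finished do |t|
      t.references :build, null: false, index: { unique: true }
      t.timestamps
    end
  end
end

# Hypothetical model acting as a durable "pending final processing" queue.
module Ci
  class BuildFinished < ApplicationRecord
    self.table_name = 'ci_builds_finished'
  end
end

# Hook placement sketch: record the build when it transitions to a finished
# state, then let the worker do the final processing and remove the row.
module Ci
  class Build < ApplicationRecord
    state_machine :status do
      after_transition any => [:success, :failed, :canceled] do |build|
        # run_after_commit defers the enqueue until the transaction commits
        build.run_after_commit do
          Ci::BuildFinished.create!(build_id: build.id)
          BuildFinishWorker.perform_async(build.id)
        end
      end
    end
  end
end

class BuildFinishWorker
  include Sidekiq::Worker

  def perform(build_id)
    build = Ci::Build.find_by_id(build_id)
    return unless build

    build.trace.archive!
    # Only drop the row once final processing succeeded; any leftover rows
    # can later be drained by scanning this (small) table instead of
    # filtering all of ci_builds by status and finished_at.
    Ci::BuildFinished.where(build_id: build_id).delete_all
  end
end
```

The point of the extra table is that builds whose BuildFinishWorker job was lost could be recovered by periodically draining ci_builds_finished, which stays small, rather than by the expensive scan the cron worker does today.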

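Likewise, the "boring" fallback mentioned at the end of the second discussion (retrying the archive job itself with a delay instead of relying on the cron sweep) could be sketched roughly as below. The retry options, the wait, and the attempt cap are illustrative assumptions; the real ArchiveTraceWorker is configured differently.

```ruby
require 'sidekiq'

# Minimal sketch of re-enqueueing the archive job with perform_in until it
# succeeds. Constants below are illustrative, not GitLab defaults.
class ArchiveTraceWorker
  include Sidekiq::Worker

  sidekiq_options retry: 5   # leave short transient failures to Sidekiq itself

  RETRY_IN = 10 * 60         # seconds; the "some waits (e.g. per 10 minutes)" idea
  MAX_ATTEMPTS = 1_000       # upper bound so the job cannot loop forever

  def perform(build_id, attempts = 0)
    build = Ci::Build.find_by_id(build_id)
    return unless build

    build.trace.archive!
  rescue StandardError
    # Give up (and surface the error) once the cap is reached; otherwise
    # schedule another attempt later instead of failing permanently.
    raise if attempts >= MAX_ATTEMPTS

    self.class.perform_in(RETRY_IN, build_id, attempts + 1)
  end
end
```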