Sidekiq - Prevent potential loss of metrics data during process restarts
Problem Statement
This is a follow-up to !18568 (merged), which was more of a hotfix to prevent complete loss of Sidekiq metrics data in Prometheus. Due to a race condition during process (re)starts, the Prometheus database files were recreated before they were cleaned up, so the cleanup removed all metrics data. The hotfix mitigated the worst of the data loss by simply reinitializing (and thereby recreating) the proper file structure for metrics data.
However, one problem remains. While we now have some confidence that data won't be completely absent, there is still a race condition during worker start-up: early workers start writing metrics to the Prometheus database (the one local to the exporting node that is being scraped), and that data is subsequently wiped as more workers start up and go through the shared initialization routine, which performs a full removal of all those files. This might cause small blips in dashboards, where data that the scraper has already seen disappears again until all workers have fully started up.
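The sequence described above can be reproduced in a small, single-process simulation. This is only an illustrative sketch: `naive_initialize_metrics_dir`, the `.db` file naming, and the sample contents are hypothetical stand-ins, not the actual initialization code or file layout.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for the shared initialization routine: it removes
# every metrics file in the directory and makes sure the directory exists.
def naive_initialize_metrics_dir(dir)
  FileUtils.rm_rf(Dir.glob(File.join(dir, "*.db")))
  FileUtils.mkdir_p(dir)
end

# Simulates two workers starting one after the other and returns what a
# scraper would see before and after the second worker starts.
def simulate_naive_startup
  Dir.mktmpdir("metrics") do |dir|
    # Worker 1 starts, initializes, and records a sample.
    naive_initialize_metrics_dir(dir)
    File.write(File.join(dir, "counter_worker1.db"), "jobs_processed 5\n")
    seen_by_scraper = Dir.children(dir).sort

    # Worker 2 starts later and runs the same routine, wiping worker 1's file.
    naive_initialize_metrics_dir(dir)
    after_second_start = Dir.children(dir).sort

    [seen_by_scraper, after_second_start]
  end
end
```

Calling `simulate_naive_startup` returns the file listing before and after the second worker's start-up, which makes the disappearing data visible without needing a real cluster.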
We should find a solution where the Sidekiq Prometheus metrics database files cannot be compromised in the presence of multi-process concurrency, as is the case with Sidekiq clusters.
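One possible direction, sketched here under assumptions rather than proposed as the fix, is to make the cleanup run at most once per cluster start by guarding it with an exclusive file lock and a marker file. The helper name `initialize_metrics_dir_once`, the `.init.lock` and `.initialized` file names, the `.db` glob, and the `run_id` parameter are all hypothetical; `run_id` stands for some identifier shared by every worker of one cluster start (for example, the parent process PID).

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: wipes stale metrics files at most once per cluster run.
# Returns true if this call performed the cleanup, false if another worker
# of the same run had already done it.
def initialize_metrics_dir_once(dir, run_id)
  FileUtils.mkdir_p(dir)
  lock_path = File.join(dir, ".init.lock")
  marker_path = File.join(dir, ".initialized")

  File.open(lock_path, File::RDWR | File::CREAT, 0o644) do |lock|
    # Blocks until no other process holds the lock; the lock is released
    # when the file is closed at the end of this block.
    lock.flock(File::LOCK_EX)

    already_done = File.exist?(marker_path) && File.read(marker_path) == run_id.to_s
    next false if already_done

    # First worker of this run: remove stale files, then record the run.
    FileUtils.rm_f(Dir.glob(File.join(dir, "*.db")))
    File.write(marker_path, run_id.to_s)
    true
  end
end
```

With this shape, the first worker of a run cleans up files left over from the previous run, and later workers of the same run leave existing files alone. Whether a suitable `run_id` is available to all workers, and whether the cleanup would be better done once in the parent process before workers are spawned, are open questions.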
Reach
- Sasha (software engineer)
- Devon (devops engineer)
- Parker (product manager)
All of these personas want to see accurate cluster metrics.
Reach: 3 to 6? (not sure how to assess this)
Scale:
- 10.0 = Impacts the vast majority (~80% or greater) of our users, prospects, or customers.
- 6.0 = Impacts a large percentage (~50% to ~80%) of the above.
- 3.0 = Significant reach (~25% to ~50%).
- 1.5 = Small reach (~5% to ~25%).
- 0.5 = Minimal reach (less than ~5%).
Impact
0.25 (minimal), I think: the time window in which this can happen is fairly small and shouldn't leave a big dent in the numbers. The impact does grow with the frequency of worker restarts, however.
Scale:
- 3.0 = Massive impact
- 2.0 = High impact
- 1.0 = Medium impact
- 0.5 = Low impact
- 0.25 = Minimal impact
Confidence
Medium to high confidence that the problem exists, but we are not yet sure how often or how likely it is to happen. We would have to do the research first, but it's certainly possible.
Scale:
- 100% = High confidence
- 80% = Medium confidence
- 50% = Low confidence
Effort
To decide.