There are a couple of things we might do to improve reliability there, such as adding integration tests for each worker (&96 (comment 229002255)) ensuring that multiple calls (even on half-processed jobs) are still idempotent. It's not a catch-all solution (jobs can fail in multiple places, for multiple reasons, leaving unpredictable state behind, which is tricky to cover with only a few tests), but it's a starting point.
> i.e. it should be able to run multiple times without side effects
I don't know if this applies to all workers: some jobs are allowed to do their thing multiple times with different side effects every time (for example, my all-time favourite, RepositoryUpdateMirrorWorker, or UpdateProjectStatisticsWorker). In other cases, we actually don't want the same side effects twice (for example, NewNoteWorker should not send notifications twice).
I think a good first step might be adding a shared example that injects a sidekiq-server middleware that runs a job twice and ensures that the second run does not fail.
I think we can provide some helpers for ensuring there are no side effects on the second run, but in the end this will need to be tested for each job, as the side effects might vary greatly.
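To make that concrete, here's a rough sketch of what the middleware and shared example could look like (the class name, example name and `job_args` helper are made up, not existing GitLab code):

```ruby
# Rough sketch only: a Sidekiq server middleware that yields twice, so every
# job processed through it runs its perform body a second time.
class RunTwiceMiddleware
  def call(_worker, _job, _queue)
    yield # first run
    yield # second run must also succeed if the worker is idempotent
  end
end

# Shared example that drains the job through that middleware.
RSpec.shared_examples 'an idempotent worker' do
  before do
    Sidekiq::Testing.server_middleware do |chain|
      chain.add RunTwiceMiddleware
    end
  end

  after do
    Sidekiq::Testing.server_middleware(&:clear)
  end

  it 'does not fail when the job is executed a second time' do
    described_class.perform_async(*job_args)

    expect { described_class.drain }.not_to raise_error
  end
end
```

The shared example assumes the including spec defines `job_args` (the arguments the worker is enqueued with).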
@reprazent Let me see if I get your suggestion right:
> I think a good first step might be adding a shared example that injects a sidekiq-server middleware that runs a job twice and ensures that the second run does not fail.
Are you suggesting that we could reuse the existing Sidekiq tests, but instead of running the worker once (by calling perform), we'd run it twice (expecting the current expectations to still pass)? Or are you suggesting we'd have a separate test, in a shared example added for each worker, that would run it multiple times and ensure no exception is raised?
The former might be an interesting way of reusing the existing tests. We might also create a helper method that wraps the perform call (instead of calling it directly on the XWorker object), with no need for middleware magic.
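For instance (helper and method names are only illustrative):

```ruby
# Illustrative only: a spec helper that performs the job N times in a row, so
# the existing expectations in each worker spec implicitly cover idempotency.
module IdempotencyHelpers
  def perform_multiple(worker, *args, times: 2)
    times.times { worker.perform(*args) }
  end
end
```

In an existing worker spec, the single perform call would then become something like `perform_multiple(described_class.new, project.id)` and the current expectations should still pass.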
> I think a good first step might be adding a shared example that injects a sidekiq-server middleware that runs a job twice and ensures that the second run does not fail.
Good idea! Let's take this to ~"workflow::Ready". It's not a great test, but it's something to build on in future issues.
Following on from this, we could make this middleware toggle-able with, say, an environment variable. We could then toggle that on for an entire gitlab-qa run and see if anything breaks
> Following on from this, we could make this middleware toggle-able with, say, an environment variable. We could then toggle that on for an entire gitlab-qa run and see if anything breaks
@smcgivern Though that would mean this middleware should probably live in the application code itself and be toggled on when starting the processes, given https://gitlab.com/gitlab-org/gitlab-qa is sort of a black-box test environment (acting on an Omnibus instance)?
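For illustration, something along these lines in the application's Sidekiq configuration (the environment variable name is made up):

```ruby
# Sketch: only add the job-duplicating middleware when a (hypothetical)
# environment variable is set, so a full gitlab-qa or staging run can opt in.
Sidekiq.configure_server do |config|
  if ENV['SIDEKIQ_RUN_JOBS_TWICE'] == '1'
    config.server_middleware do |chain|
      chain.add RunTwiceMiddleware
    end
  end
end
```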
I think for this issue we could simplify things a bit. Instead of trying to catch all job specs at once, we can:
1. Explicitly mark all jobs as either `not_idempotent` or `idempotent` (no defaults here) - but initially, all but one should be idempotent (sketched below)
2. If neither of these methods was called, a cop will catch it
3. The cop should explain that an idempotent job is preferred, and that to ensure this the author will need to use our new helper methods / shared example (introduced in the same MR) and write unit tests (there will be docs for best practice)
4. (optional/next iteration?) If the worker is marked as idempotent and we're in the test or dev environment (and a certain ENV var is set), we could enforce running it multiple times
5. With 4 in place, we could start thinking about running it multiple times in gitlab-qa and even staging (considering it'd be behind an ENV var)
If we do just 1, 2 and 3, we can already use the idempotent tag for the job we plan to try at #42 (closed).
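To illustrate 1 and 2, the class-level marking could look roughly like this (module name, method names and wiring are only a sketch, not a final API):

```ruby
# Sketch of the explicit, no-default marking on each worker class.
module WorkerIdempotency
  extend ActiveSupport::Concern

  class_methods do
    def idempotent!
      @idempotent = true
    end

    def not_idempotent!
      @idempotent = false
    end

    def idempotent?
      @idempotent == true
    end
  end
end

class NewNoteWorker
  include ApplicationWorker
  include WorkerIdempotency # could eventually be mixed into ApplicationWorker itself

  not_idempotent! # sending the same notification twice is not acceptable

  def perform(note_id)
    # ...
  end
end
```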
Bad points:

- A somewhat limited approach for ensuring a job is idempotent, but we can iterate on that
- Not heavily automated, as one could just set a job to idempotent with no tests and that's it (though we trust our maintainers, right?)

Good point:

- We can start encouraging idempotent workers from the get-go by setting some basic standards
> Though that would mean this middleware should probably live in the application code itself and be toggled on when starting the processes, given https://gitlab.com/gitlab-org/gitlab-qa is sort of a black-box test environment (acting on an Omnibus instance)?
Yes, exactly! That would also allow us to (say) turn it on in staging if we wanted to, or in development.
> I think for this issue, we could simplify things a bit
Agreed, my suggestion doesn't have to be in scope for this issue.
> initially, all but one should be idempotent
I think you mean only one should be idempotent? Most of our workers are surely not idempotent right now.
> Not heavily automated as one could just set a job to idempotent with no tests and that's it (though we trust our maintainers right?)
I think this is OK because we can later add more checks!
Though I still feel that a failing test telling us to mark the worker as not_idempotent (or idempotent) doesn't have the same impact as a failing cop encouraging us to consider making the worker idempotent. It seems like a different kind of validation:
- spec/workers/every_sidekiq_worker_spec.rb has checks to validate the correctness of the labels and is 100% valid in what it's doing
- A cop would encourage making the worker idempotent instead of leaving the default, which I can't see happening through a test, since we can't really skip a test failure the way we can skip a cop
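To make this concrete, the cop could look roughly like this (cop name, namespace and message are hypothetical; it would be scoped to app/workers via its Include configuration):

```ruby
# Hypothetical cop: flags worker classes that declare neither `idempotent!`
# nor `not_idempotent!`, with a message nudging towards the idempotent option.
module RuboCop
  module Cop
    module Workers
      class IdempotencyDeclared < RuboCop::Cop::Cop
        MSG = 'Declare `idempotent!` (preferred - see the idempotent workers docs) ' \
              'or `not_idempotent!` in this worker class.'

        def_node_search :idempotency_declared?, '(send nil? {:idempotent! :not_idempotent!})'

        def on_class(node)
          return if idempotency_declared?(node)

          add_offense(node, location: :name)
        end
      end
    end
  end
end
```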
> It's not a great test, but it's something to build on in future issues
Yeah, it sounds like an initial step. To tag a job as idempotent, though, we might want to have a special ad-hoc test for each worker where we check the final expected application state: running it multiple times with the same arguments should not change that state. Ensuring it's transactional, though, would be another step ahead.
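For instance, something like this per worker (the worker, factory and attribute are placeholders only):

```ruby
# Placeholder sketch of an ad-hoc, state-based idempotency test: run the job
# twice with the same arguments and assert that the resulting application
# state does not change, rather than only checking that no error is raised.
RSpec.describe SomeWorker do # hypothetical worker, used only for illustration
  let(:record) { create(:some_record) }

  it 'leaves the record in the same state when performed twice' do
    described_class.new.perform(record.id)
    state_after_first_run = record.reload.some_attribute

    described_class.new.perform(record.id)

    expect(record.reload.some_attribute).to eq(state_after_first_run)
  end
end
```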
- (separate issue / MR) Label a few important workers as idempotent, adding tests that ensure the application state after multiple runs for these. Here we can set a few standards of what we expect from an "idempotent" worker in practice.
- Label all other workers as not idempotent, possibly adding cops so that new workers not labeled as idempotent from the get-go remain an exception.
- Old workers labeled as `idempotent: false` will need triage in a separate issue, possibly generating issues to handle them separately.