Make the BuildHooksWorker idempotent

Part of gitlab-com/gl-infra/scalability#178

The BuildHooksWorker processed 27k "duplicate" jobs in the past 7 days, and spent 3 hours on them. Duplicate jobs are jobs that get scheduled while there is already a job in the queue for the same worker with the same arguments.

If the job were (marked as) idempotent, we would be able to deduplicate those jobs when they get scheduled.
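
As a rough illustration of what deduplicating at scheduling time could look like, here is a minimal in-memory sketch. It is not Sidekiq or GitLab code: `DedupQueue`, its `schedule`/`pop` methods and the `idempotent` flag are hypothetical names made up for this example, and the build id `42` is arbitrary.

```python
from collections import deque


class DedupQueue:
    """Toy in-memory queue (hypothetical, for illustration only)."""

    def __init__(self):
        self._jobs = deque()    # jobs waiting to run, in scheduling order
        self._pending = set()   # (worker, args) keys currently in the queue
        self.dropped = 0        # number of duplicates that were skipped

    def schedule(self, worker, args, idempotent=False):
        """Enqueue a job; skip it if an identical idempotent job is waiting."""
        key = (worker, tuple(args))
        if idempotent and key in self._pending:
            self.dropped += 1
            return False
        self._jobs.append(key)
        self._pending.add(key)
        return True

    def pop(self):
        """Take the next job off the queue so it can run."""
        key = self._jobs.popleft()
        # Only forget the key once no identical job is left in the queue.
        if key not in self._jobs:
            self._pending.discard(key)
        return key

    def __len__(self):
        return len(self._jobs)


queue = DedupQueue()
first = queue.schedule("BuildHooksWorker", [42], idempotent=True)
second = queue.schedule("BuildHooksWorker", [42], idempotent=True)
```

In this sketch the second call is dropped because an identical job is still waiting. Once that job has been popped, scheduling the same worker and arguments again enqueues a new job.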

The BuildHooksWorker runs every time a build is created, started or finished (through the BuildFinishedWorker). The job itself only loads the build, and then executes hooks based on the state the record is currently in. The hooks are scheduled async with the data that was built when the BuildHooksWorker ran. This means that if a job is scheduled while a previously scheduled job hasn't started yet, both would do the same thing, for example:

  1. Job gets created -> BuildHooksWorker-1 scheduled for this
  2. Job starts -> BuildHooksWorker-2 scheduled for this
  3. BuildHooksWorker-1 runs, it will execute hooks with the data that's currently in the database (build running)
  4. BuildHooksWorker-2 runs, if nothing changed in the database in the meantime, the hooks that execute will look the same as what was run in BuildHooksWorker-1.
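
The steps above can be sketched with a small simulation. This is only a toy model under stated assumptions: the `database` dict, the `build_hooks_worker` function and the payload shape are invented for the example and do not reflect the real worker or hook payloads.

```python
# Hypothetical stand-in for the builds table: build id -> current record.
database = {42: {"status": "created"}}

# Hook payloads that were "delivered", in execution order.
delivered = []


def build_hooks_worker(build_id):
    """Load the build and emit a hook based on its *current* state."""
    build = database[build_id]
    delivered.append({"build_id": build_id, "status": build["status"]})


pending = []

# 1. Build gets created -> BuildHooksWorker-1 scheduled.
pending.append(42)

# 2. Build starts before BuildHooksWorker-1 has run -> BuildHooksWorker-2 scheduled.
database[42]["status"] = "running"
pending.append(42)

# 3. and 4. Both jobs run and read whatever is in the database at that moment.
for build_id in pending:
    build_hooks_worker(build_id)
```

Both entries in `delivered` carry the `running` status here. The `created` state is never delivered and the second job repeats the work of the first.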

Questions for group::pipeline execution in GitLab.com / GitLab Infrastructure Team / scalability: is this the correct behaviour, or should it be adjusted? I think we have 2 options:

  1. Should we drop the second hook, and only deliver the latest status when the first job executes?
  2. Should we deliver the 2 different payloads? And if we do, should we guarantee that they are delivered in order?
Edited by Bob Van Landuyt