BulkInsertableTask silently removes duplicates using unique_by attributes

Summary

Executing an insertable task that extends Gitlab::Ingestion::BulkInsertableTask causes the upsert to return an array smaller than its input whenever any of the input attribute hashes are duplicates with respect to the unique_by attributes. This behavior is unexpected because the documentation for the base class states the following about the unique_by variable.

    #   `unique_by`: Optional attribute names to set unique constraint which will be used by
    #                PostgreSQL to update records on conflict. The duplicate records will be
    #                ignored by PostgreSQL if this is not provided.

The documentation sets the expectation that the attributes specified here will be used only in the ON CONFLICT clause. In reality, the class also uses the unique_by variable to build a unique list of attributes, silently dropping duplicate rows before the insert.
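
For illustration, the deduplication described above is roughly equivalent to the snippet below; this is an approximation of what #unique_attributes does, not the exact source.

    # Rough approximation of the deduplication step, not the exact implementation.
    def unique_attributes
      return attributes unless unique_by.present?

      # Rows sharing the same values for the unique_by columns collapse into one,
      # so the upsert receives (and returns) fewer rows than were passed in.
      attributes.uniq { |row| row.values_at(*unique_by) }
    end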

Steps to reproduce

  1. Create a new instance of a class that extends BulkInsertableTask, for instance the Security::Ingestion::AbstractTask class.
  2. Add attributes to the class such that some of them are duplicates with respect to the unique_by attributes.
  3. Execute the insertion and confirm that the returned data contains fewer entries than the attributes passed in (a sketch of such a task follows below).

See GlobalAdvisoryScanWorker: null value in vulnera... (#432870 - closed) for an example of this.
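
A minimal sketch of such a task is shown below. The class, model, and column names are hypothetical placeholders invented for illustration; real ingestion tasks may wire these up differently.

    # Hypothetical task used only to illustrate the reproduction steps above;
    # the class name, model, and columns are placeholders, not real GitLab code.
    class IngestExampleRecords < Security::Ingestion::AbstractTask
      include Gitlab::Ingestion::BulkInsertableTask

      self.model = ExampleRecord            # placeholder model
      self.unique_by = %i[project_id name]  # conflict target columns

      private

      # Both rows share the same project_id and name, so they are duplicates
      # with respect to unique_by even though their severities differ.
      def attributes
        [
          { project_id: 1, name: 'example', severity: 'low' },
          { project_id: 1, name: 'example', severity: 'high' }
        ]
      end
    end

    # Executing this task would return fewer rows than attribute hashes were
    # passed in, e.g. a single result for the two hashes above.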

Example Project

N/A

What is the current bug behavior?

The data returned from the #execute method contains fewer entries than the attributes passed in.

What is the expected correct behavior?

The data returned from the #execute method always contains the same number of entries as the attributes passed in.

Relevant logs and/or screenshots

N/A

Possible fixes

  1. Remove the call to #unique_attributes.
  2. Group the return data by the unique_by attributes, and update the documentation to explain how to access the grouped data (see the sketch after this list).
  3. Leave the class unchanged and update only the documentation to explicitly state this side effect.
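
A rough sketch of option 2, assuming the return data is an array of hashes; the method name grouped_return_data is hypothetical and this is not a proposed patch.

    # Illustrative sketch of option 2: key the upserted rows by their unique_by
    # values so callers can map each input attribute hash back to its row.
    # The method name and the hash-shaped return data are assumptions.
    def grouped_return_data
      return_data.group_by { |row| row.values_at(*unique_by) }
    end

    # A caller could then look up the row for any input attribute hash:
    #   grouped = task.grouped_return_data
    #   grouped[attribute_hash.values_at(*unique_by)]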