BulkInsertableTask silently removes duplicates using unique_by attributes
Summary
Executing an insertable task that extends from Gitlab::Ingestion::BulkInsertableTask causes the upsert
to return an array smaller than the input whenever any of the input attribute rows are duplicates with
respect to the unique_by attributes. This behavior is unexpected because the documentation for the
base class states the following about the unique_by variable:
# `unique_by`: Optional attribute names to set unique constraint which will be used by
# PostgreSQL to update records on conflict. The duplicate records will be
# ignored by PostgreSQL if this is not provided.
The documentation sets the expectation that the attributes specified there are used only in the
ON CONFLICT clause. In reality, the class also uses the unique_by variable to build a unique list of
attributes, dropping rows that share the same unique_by values before the upsert.
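For illustration, the shrinking behavior can be mimicked with plain Array#uniq; the column names and rows below are made up and not taken from the codebase.

```ruby
# Illustrative only: two attribute rows that are duplicates with respect to
# the unique_by columns collapse into a single row before the upsert.
unique_by  = %i[vulnerability_id url]
attributes = [
  { vulnerability_id: 1, url: 'https://example.com', comment: 'first'  },
  { vulnerability_id: 1, url: 'https://example.com', comment: 'second' }
]

attributes.uniq { |row| row.values_at(*unique_by) }.size # => 1, not 2
```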
Steps to reproduce
- Create a new instance of a class that extends from BulkInsertableTask, for instance the Security::Ingestion::AbstractTask class.
- Add attributes to the class so that there are duplicates when considering the unique_by attributes.
- Execute the insertion and confirm that the return data size is less than the attributes size (see the sketch at the end of this section).
See GlobalAdvisoryScanWorker: null value in vulnera... (#432870 - closed) for an example of this.
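A self-contained toy sketch of the reproduction above; ToyBulkInsertableTask, its constructor, and the attribute shapes are invented for illustration and are not the real GitLab API.

```ruby
# Toy stand-in for a BulkInsertableTask subclass; NOT the real GitLab class.
class ToyBulkInsertableTask
  def initialize(attributes, unique_by:)
    @attributes = attributes
    @unique_by = unique_by
  end

  # Mimics the reported behavior: rows are de-duplicated by the unique_by
  # columns before the upsert, so the returned data can be smaller than the
  # attributes passed in.
  def execute
    @attributes.uniq { |row| row.values_at(*@unique_by) }
  end
end

attributes = [
  { uuid: 'abc', severity: 'high' },
  { uuid: 'abc', severity: 'low' } # duplicate with respect to unique_by
]

task = ToyBulkInsertableTask.new(attributes, unique_by: %i[uuid])
task.execute.size # => 1, while attributes.size is 2
```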
Example Project
N/A
What is the current bug behavior?
The size of the data returned from the #execute method is smaller than the size of the attributes passed in.
What is the expected correct behavior?
The size of the data returned always matches the size of the attributes passed in.
Relevant logs and/or screenshots
N/A
Possible fixes
- Remove the call to #unique_attributes.
- Use the unique_by attributes to group the return_data, and update the documentation to reflect how to access the data (see the sketch below).
- Do not change the class; only update the documentation to explicitly state this side effect.
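A rough sketch of the second option above, assuming the returned rows are hashes keyed by column name; the names and data shapes are assumptions, not the actual implementation.

```ruby
# Assumed shapes for illustration only.
unique_by  = %i[uuid]
attributes = [
  { uuid: 'abc', severity: 'high' },
  { uuid: 'abc', severity: 'low' }
]
# What the de-duplicated upsert currently returns:
return_data = [{ uuid: 'abc', id: 42 }]

# Group the returned rows by their unique_by values so that every input
# attribute row can still be resolved to its upserted record.
grouped = return_data.group_by { |row| row.values_at(*unique_by) }
attributes.map { |attrs| grouped.fetch(attrs.values_at(*unique_by)) }
# => two entries, both pointing at the single upserted record
```

This would keep the returned data addressable per input row without changing what is sent to PostgreSQL.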