Avoid storing duplicate licenses for the same spdx_identifier in software licenses table
During investigation of MIT
policy duplication in software_licenses
table while validating license policy test cases, we noticed that there are duplicate entries for the same spdx_identifier
in gitlab.com
production database.
The duplications are not only for MIT
license but exists for other identifiers as well.
[ gprd ] production> mit_licenses = SoftwareLicense.where('name LIKE ?', "%MIT%").where.not(spdx_identifier: nil)
[ gprd ] production> mit_licenses.count
=> 5
# Refer internal note below for the exact records
# https://gitlab.com/gitlab-org/gitlab/-/issues/395776#note_1308935395
[ gprd ] production> SoftwareLicense.where.not(spdx_identifier: nil).distinct.count(:spdx_identifier)
=> 433
[ gprd ] production> SoftwareLicense.where.not(spdx_identifier: nil).count
=> 457
Duplicate license records observed
[ gprd ] production> SoftwareLicense.where.not(spdx_identifier: nil).select(:spdx_identifier).group(:spdx_identifier).having("count(*) > 1").size
{"AGPL-1.0"=>2,
"AGPL-3.0"=>2,
"Apache-2.0"=>2,
"Artistic-2.0"=>2,
"BSD-2-Clause"=>2,
"BSD-3-Clause"=>2,
"BSD-4-Clause"=>2,
"CC0-1.0"=>2,
"CDDL-1.0"=>2,
"CDDL-1.1"=>2,
"CERN-OHL-1.1"=>2,
"EPL-1.0"=>2,
"EPL-2.0"=>2,
"ISC"=>2,
"LGPL-3.0-only"=>2,
"MIT"=>2,
"MS-PL"=>2,
"MS-RL"=>2,
"MirOS"=>2,
"Python-2.0"=>2,
"RSA-MD"=>2,
"Ruby"=>2,
"WTFPL"=>2,
"Zlib"=>2}
Impact
- Looks like we are using
SoftwareLicense.all_license_names
helper method to populate the UI. - This bug results in invalid Spdx names appearing in the scan result policies -> licenses dropdown which leads to invalid approval rule execution as well and results in the below mentioned approval failure case.
- Because of having an invalid license name in the policy configuration (i.e
MIT
instead ofMIT License
), test case number 6 fails. Example MR.
Implementation plan
- Add a column
:visible, :boolean, default: true, null: false
tosoftware_licenses
and display only visible licenses in the UI. - Add a migration to mark all duplicated licenses as
visible: false
- Cleanup license references in
software_license_policies
to link to visible licenses. Mostly like we do in the migration in this MR. - Remove duplicated licenses from the DB.
- Create unique index for
software_licenses
table onspdx_identifier
for not null spdx_identifier.
Edited by Andy Schoenen