Avoid storing duplicate licenses for the same spdx_identifier in software licenses table

During investigation of MIT policy duplication in software_licenses table while validating license policy test cases, we noticed that there are duplicate entries for the same spdx_identifier in gitlab.com production database.

The duplications are not only for MIT license but exists for other identifiers as well.

[ gprd ] production> mit_licenses = SoftwareLicense.where('name LIKE ?', "%MIT%").where.not(spdx_identifier: nil)
[ gprd ] production> mit_licenses.count
=> 5

# Refer internal note below for the exact records
# https://gitlab.com/gitlab-org/gitlab/-/issues/395776#note_1308935395
[ gprd ] production> SoftwareLicense.where.not(spdx_identifier: nil).distinct.count(:spdx_identifier)
=> 433
[ gprd ] production> SoftwareLicense.where.not(spdx_identifier: nil).count
=> 457

Duplicate license records observed

[ gprd ] production> SoftwareLicense.where.not(spdx_identifier: nil).select(:spdx_identifier).group(:spdx_identifier).having("count(*) > 1").size
{"AGPL-1.0"=>2,
 "AGPL-3.0"=>2,
 "Apache-2.0"=>2,
 "Artistic-2.0"=>2,
 "BSD-2-Clause"=>2,
 "BSD-3-Clause"=>2,
 "BSD-4-Clause"=>2,
 "CC0-1.0"=>2,
 "CDDL-1.0"=>2,
 "CDDL-1.1"=>2,
 "CERN-OHL-1.1"=>2,
 "EPL-1.0"=>2,
 "EPL-2.0"=>2,
 "ISC"=>2,
 "LGPL-3.0-only"=>2,
 "MIT"=>2,
 "MS-PL"=>2,
 "MS-RL"=>2,
 "MirOS"=>2,
 "Python-2.0"=>2,
 "RSA-MD"=>2,
 "Ruby"=>2,
 "WTFPL"=>2,
 "Zlib"=>2}

Impact

  • Looks like we are using SoftwareLicense.all_license_names helper method to populate the UI.
  • This bug results in invalid Spdx names appearing in the scan result policies -> licenses dropdown which leads to invalid approval rule execution as well and results in the below mentioned approval failure case.

Screenshot_2023-03-10_at_3.20.00_PM

  • Because of having an invalid license name in the policy configuration (i.e MIT instead of MIT License), test case number 6 fails. Example MR.

Screenshot_2023-04-11_at_5.01.47_PM

Implementation plan

  1. Add a column :visible, :boolean, default: true, null: false to software_licenses and display only visible licenses in the UI.
  2. Add a migration to mark all duplicated licenses as visible: false
  3. Cleanup license references in software_license_policies to link to visible licenses. Mostly like we do in the migration in this MR.
  4. Remove duplicated licenses from the DB.
  5. Create unique index for software_licenses table on spdx_identifier for not null spdx_identifier.
Edited by Andy Schoenen