[x] Use this repo as SSOT to import/update advisories into Gemnasium DB (so it will stay in sync by design after initial import). Moved to https://gitlab.com/gitlab-org/gitlab-ee/issues/11837. We will periodically resync until this is done.
Document public workflow to contribute to our advisories db:
create an issue or an MR to contribute and add an advisory to the DB. Contributing to and enriching metadata is open, but merging and publishing to the DB is restricted to maintainers (~Secure team).
Sorry if it wasn't clear; I was actually asking more for general guidelines about how such things get discussed/decided, because this might need a lot of context.
For this very case, a short description could be: we are about to give public access to a database of security advisories that was originally held privately by GitLab. This would take the form of a public project, the repository being the "Database" mentioned above.
The question about which License to use is to make sure we're not taking any legal risk, as this is a complex topic.
@kencjohnston what are the expectations about the access and contribution requirements? E.g. AFAIK using the EE License implies that contributors have a valid License too. It could be totally open (MIT?), but at the same time this database will only be useful for GitLab Ultimate features. Fully open access also means potential external usage.
So we simply have to iterate over the entries and create a JSON file for each of them. Using jq's interpolation syntax, the path of the JSON file would be: \(.package.slug)/\(.identifier).json
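For instance, a minimal sketch of that iteration in shell, assuming the export is a single JSON array saved as advisories.json (the file name is hypothetical):
# Split the export into one JSON file per advisory (sketch only).
jq -c '.[]' advisories.json | while read -r entry; do
  path=$(jq -r '"\(.package.slug)/\(.identifier).json"' <<< "$entry")
  mkdir -p "$(dirname "$path")"
  jq '.' <<< "$entry" > "$path"
done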
The problem is that some security advisories have no identifier, so we've got to either fix that or come up with another pattern for the file name. I suggest we create our own identifier to fill the gap, using a pattern that would be similar to the one used by CVE. To illustrate this, GMS-2019-001 would be the identifier of the first advisory of 2019. There would be no need to allocate a Gemnasium-specific identifier for advisories that already have an external identifier, like a CVE id.
@gonzoyumo I'm now looking for a way to fill the gaps and set the identifier where it's missing. That should be as simple as a single SQL query.
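Something along these lines could work; this is only a sketch, assuming the table is package_advisories with identifier and date columns (the actual column names may differ):
-- Sketch: allocate GMS-YYYY-NNN per year for advisories that have no identifier.
WITH missing AS (
  SELECT uuid,
         date_part('year', date)::int AS year,
         row_number() OVER (PARTITION BY date_part('year', date) ORDER BY date, uuid) AS seq
  FROM package_advisories
  WHERE identifier IS NULL OR identifier = ''
)
UPDATE package_advisories pa
SET identifier = format('GMS-%s-%s', m.year, lpad(m.seq::text, 3, '0'))
FROM missing m
WHERE pa.uuid = m.uuid;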
@gonzoyumo Are you OK with updating the Gemnasium DB so that all the advisories have a valid identifier? The identifier would be set to GMS-YYYY-XX when missing as explained above.
I would then suggest having a Gemnasium Identifier, which would always be set, for all advisories.
@plafoucriere But then we'd have to update the DB schema to handle multiple identifiers, which is not possible at the moment.
Also, security advisories already have UUIDs in the Gemnasium DB, something like 08ee7d04-c94e-4938-a745-ffdddab7bd3f. So tracking the advisories is not an issue.
We can't reuse DB column package_advisories.uuid to store something like GMS-2019-123 because technically this is a uuid, not a text column. See column definition:
uuid uuid DEFAULT uuid_generate_v4() NOT NULL
Let's be pragmatic and proceed in two steps:
first allocate Gemnasium-specific identifiers where there's no identifier
then update the DB schema to handle multiple identifiers and ensure there's a Gemnasium-specific identifier for every single advisory
To me the latter is out of scope, but worth creating an issue. @plafoucriere Do we agree on that? cc @gonzoyumo
I forgot about the uuid. The only problem with UUIDs is that they don't generate nice URLs. But since the DB will be exposed as YAML, I don't see that as an issue. So, it works for me. Thanks!
@fcatteau this totally makes sense to me. One small issue I see is how to quickly get the "next" identifier value when adding a new advisory without a public common identifier. It might be hard to find what the last GMS-2019-xxx value used was, so that we can generate the new one. Though this use case might be very limited, and we could do a manual check in the DB when we encounter it.
One small issue I see is how to quickly get the "next" identifier value when adding a new advisory without a public common identifier.
@gonzoyumo Yes, agreed, and I already had a response prepared for you: we'll see when it happens. Security advisories with no public identifier are uncommon, so we can deal with that later. Ideally we'll update the API endpoint used to create advisories so that it allocates a GMS identifier when needed. Technically, we'll have to change the SQL function api.create_advisory.
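For reference, a rough sketch of how the next value could be computed inside that function, again assuming the package_advisories table and identifier column (the actual names may differ):
-- Sketch: next free GMS-<current year>-NNN, zero-padded to 3 digits.
SELECT format('GMS-%s-%s',
              date_part('year', now())::int,
              lpad((coalesce(max(split_part(identifier, '-', 3)::int), 0) + 1)::text, 3, '0'))
FROM package_advisories
WHERE identifier LIKE format('GMS-%s-%%', date_part('year', now())::int);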
My opinion is that YAML makes it easier to write and update lists (for fixed versions and URLs) and multiline strings (for description and solution), so I suggest we switch to YAML.
By the way, Gemnasium supports Markdown syntax in the description. I'm not sure YAML plays well with Markdown, so we'd better check that out.
@brytannia I know you've already commented on this but it's not written anywhere. From what I remember you don't really care since the editor you use makes it easy to edit JSON files, right?
hint: the k8s community started with json (while yaml was also supported), and they ended up with yaml-only. Json is not the right format for hand editing ;)
By the way, Gemnasium supports Markdown syntax in the description. I'm not sure YAML plays well with Markdown, so we'd better check that out.
Neat, TIL! Markdown should work fine within YAML just by using a scalar; i.e.
description: |
  ### Stuff
  ## BIGGER STUFF
Between JSON and YAML, I would vote YAML, but if you're asking more generically then I'd try and sneak TOML into the mix since YAML is overly complex and has some crazy stuff
I don't think TOML brings a considerable advantage. Multi-line text must be wrapped with """, so it's actually more verbose.
@plafoucriere To me this is indeed a strong argument in favor of YAML. Also, YAML certainly has a larger ecosystem (e.g. set of tools) than TOML, right?
I'm adding this comment just for consistency. I also vote for YAML, although I don't have a strong opinion on this topic because my previous work with advisories followed this process: copy a template JSON, then fill it with data while viewing it in an IDE with nice syntax highlighting. However, YAML looks easier to read out of the box.
Here's the boilerplate we usually have in the CONTRIBUTING.md of projects maintained by GitLab:
Developer Certificate of Origin + License
By contributing to GitLab B.V., You accept and agree to the following terms and conditions for Your present and future Contributions submitted to GitLab B.V. Except for the license granted herein to GitLab B.V. and recipients of software distributed by GitLab B.V., You reserve all right, title, and interest in and to Your Contributions. All Contributions are subject to the following DCO + License terms.
This notice should stay as the first item in the CONTRIBUTING.md file.
I'm not sure it fits well since in this case users are contributing security advisories (data to be added to the Gemnasium DB). They're not contributing code. cc @jhurewitz @kencjohnston @gonzoyumo
@jhurewitz a security advisory is a public announcement of a security issue. We're using these announcements to fill the Gemnasium DB, in order to query them when we do automated security assessments (Dependency Scanning). Ex of an advisory: https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-8331
Now I think we have our answer: Kenny suggested keeping the EE license, so we should be good!
@plafoucriere Is the database exclusively populated by contributors or will we also be populating it? How will we be offering the use of the database to customers? Will there be a charge for access? If we are not charging to use the database, it would not make sense to use the commercial EE license, would it?
Also, the EE license is very specific to the GitLab product and licensing model so we may need to revisit it to see if we need to make any modifications to account for this. In addition, the EE license would only cover the end user's use of the database but it would not address the contribution to the database. There would still need to be some sort of terms governing that.
@jhurewitz this Database is already used by our Dependency Scanning feature which is currently part of the Ultimate tier offering.
The content is mainly populated by us from public external sources, but we may also enrich it in the process.
The primary goal of making this DB publicly accessible is to allow customers to check its content and ensure GitLab Dependency Scanning is effectively checking against publicly known vulnerabilities.
The secondary goal is to open it to external contributions in case we are actually missing some vulnerabilities.
There is currently no explicit goal from a Product perspective to allow any direct usage of this Database for e.g. creating a custom analysis tool, so there is no specific way to charge for access to this data, and we should be clear about its restricted usage.
If we want to make it publicly available, then it would not make sense to have it governed by the EE license; we would want it covered by our open source license along with the CE code. We can certainly add a disclaimer and should do so. As for contributions, do we want to add any sort of terms to protect us with regard to the content contributors are submitting?
@jhurewitz are we still on track to address this before Contribute or do you think it will require more time? We ultimately need to have it done before May 22nd but with Contribute and related PTOs, I'm worried about extra delays to get a final version.
@gonzoyumo @plafoucriere We should probably open up the client we use to publish advisories to Gemnasium, so that we can leverage it in the pipeline of the new gemnasium-db project:
to get a preview of the advisory submitted in a merge request; it's needed to review the affected and fixed versions
to publish the advisory and keep the uuid in the repo; this can be done with a Shell script but it's even better if we've got a client for that
in both cases, to better present errors to users
Sadly there's still no easy way to release a binary, as far as I know. But the usual Docker-based workaround would work.
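To make that concrete, here's a minimal sketch of what the MR preview job could look like in gemnasium-db's .gitlab-ci.yml; the image name and the preview subcommand are hypothetical, assuming the client is shipped as a Docker image:
# Sketch only: the image and the `preview` subcommand are assumptions, not the actual tooling.
preview-advisory:
  image: registry.example.com/gemnasium-db-client:latest
  stage: test
  only:
    - merge_requests
  script:
    # Preview the advisories changed in this MR so reviewers can check affected/fixed versions.
    - git fetch origin "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"
    - git diff --name-only "origin/$CI_MERGE_REQUEST_TARGET_BRANCH_NAME...HEAD" -- '*.yml' | xargs gemnasium-db-client preview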
We need to better define the "internal" contribution process for publishing advisories to the Gemnasium PostgreSQL DB before we know how to organize and expose the tools. Addressing that will open a lot of questions, and I'm worried about the time left in %11.11.
Though I agree automating that part would be a relief so I suggest attempting to add that in %12.0. I added #11293 (closed) to discuss this.
Though I agree automating that part would be a relief so I suggest attempting to add that in %12.0. I added #11293 (closed) to discuss this.
@gonzoyumo Yes, but I'm afraid this is not practical enough. It's all about velocity and the time it takes to publish new vulnerabilities on a daily basis. But OK, let's say we properly solve this in %12.0. Meanwhile we're likely to create scripts and ad-hoc tools to address velocity.
For instance, the script generates gem/nokogiri/CVE-2019-11068.yml with this content:
identifier: CVE-2019-11068
title: Bypass of a protection mechanism in libxslt
description: libxslt through 1.1.33 allows bypass of a protection mechanism because
  callers of xsltCheckRead and xsltCheckWrite permit access even upon receiving a
  -1 error code. xsltCheckRead can return -1 for a crafted URL that is not actually
  invalid and is subsequently loaded. Vendored version of libxslt has been patched
  to remediate this vulnerability. Note that this patch is not yet (as of 2019-04-22)
  in an upstream release of libxslt.
date: "2019-04-22"
affected_range: ""
fixed_versions:
- 1.10.3
affected_versions: Prior to 1.10.3
solution: Upgrade to latest version if using vendored version of libxslt OR update
  the system library libxslt to a fixed version
urls:
- https://github.com/sparklemotion/nokogiri/issues/1892
- https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-11068
- https://security-tracker.debian.org/tracker/CVE-2019-11068
- https://people.canonical.com/~ubuntu-security/cve/CVE-2019-11068
uuid: 1a2e2e6e-67ba-4142-bfa1-3391f5416e4c
package_slug: gem/nokogiri
It's still a WIP though because the affected_range is not returned by this specific API endpoint. I'm working on it.
Use this repo as SSOT to import/update advisories into Gemnasium DB (so it will stay in sync by design after initial import).
@gonzoyumo We may not be able to switch to gemnasium-db by May 7th because of the open issues in the advisories project, and also because the existing tools are not YAML compatible. So I'd rather not commit to that. cc @brytannia
That being said, the export-db command can be used to regularly resync the repo until we switch. Are you OK with that? If so, do you agree on a bi-weekly synchronization?
To me, the deadline to have this ready is May 22nd, when the 11.11 release is announced. We still have time to finalize this and migrate the tools/issues once this project is ready.
Thanks for suggesting the workaround; we'll keep it in mind in case we are still unable to rely entirely on this new project. But I really want us to aim at closing the advisories project before May 22nd.
Nope, #11293 (closed) is about the (currently private) process of pushing advisories to the PostgreSQL DB.
As you noted in your first answer, the missing doc here is to advertise, in the Dependency Scanning documentation, that this database content is publicly accessible.
@NicoleSchwartz @kencjohnston we never talked about "usage ping" for this. I don't see any clear way to do it, as it could be as simple as a pageview on the repository content, usage of the search function... or counting contributions (issues/MRs)... I don't see immediate value in this, compared to the usual feature usage pings, that could justify the effort, but feel free to prove me wrong.
@gonzoyumo We didn't port the labels of the advisories project to gemnasium-db.
What about these PackageType::* labels, like PackageType::gem? That would match what we have in the files, except that pypi and packagist are registries, not package types really.
Or that could be PackageRegistry:: instead, in which case the list would include Rubygems, Npm, MavenCentral, Packagist, and Pypi; gem and maven are package managers, not registries.
Or we could skip that and consider these labels redundant with the MRs - except that won't work for issues.
I suggest we embrace our mistakes for now, and have PackageType::* scoped labels matching Gemnasium's package type. WDYT?
Could you remind us of the purpose of these labels? Is this purely about metadata for e.g. helping to narrow search results or do we rely on them for workflows?
To choose between using package type vs package registry I'd follow the logic used for the current directory structure. We could always rename later if necessary.
We could also possibly add another label to identify package registry if this brings value (e.g. there is a specific workflow to process items for a particular registry) but not sure it's worth it now vs the overhead of maintaining these labels.
I also agree we should use scoped labels for this.
Could you remind us of the purpose of these labels? Is this purely about metadata for e.g. helping to narrow search results or do we rely on them for workflows?
Metadata mostly, making issues and MRs easier to find and easier to read. Also, I use these labels when batch processing issues related to the same package type; this is for efficiency, to reduce the context switch.
To choose between using package type vs package registry I'd follow the logic used for the current directory structure. We could always rename later if necessary.
I agree. This is ugly but at least this is consistent with the repo and somewhat predictable.
We could also possibly add another label to identify package registry if this brings value (e.g. there is a specific workflow to process items for a particular registry) but not sure it's worth it now vs the overhead of maintaining these labels.
Agreed. I would refrain from creating labels before we clearly know what their purpose is. Maintaining labels has a cost.
I also agree we should use scoped labels for this.
Then let's go for PackageType::xyz where xyz matches the package types used in the repo. Unless you don't like mixing CamelCase with lowercase.