Spike: Investigate options for searching vulnerabilities

Time-box: 4 days

Problem

Today, there is no efficient way to search for specific existing vulnerabilities within and across projects. For instance, the recent log4j vulnerability highlighted how difficult it is to search a GitLab instance for any vulnerability that references log4j in its Description or another data field, rather than only those carrying a specific identifier (CVE) associated with a known log4j vulnerability. Users are raising this need more and more often.

Our GraphQL API can support this, but it requires the customer to write a custom script that first pulls down vulnerability data and then searches it externally. The CSV export report could be used, particularly from a top-level group, to see many vulnerability records, but report generation can be very slow (tens of minutes) when a large number of records are involved. It is also cumbersome to tie records in the CSV back to the vulnerability records in GitLab.
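To make the current workaround concrete, the external search step of such a script might look like the sketch below. It assumes the vulnerability records have already been pulled down through the GraphQL API into a list of dictionaries. The field names (title, description, solution) and the helper search_vulnerabilities are illustrative assumptions, not the API's actual schema.

```python
from typing import Iterable

# Hypothetical field names; adjust to whatever the fetch script actually returns.
SEARCH_FIELDS = ("title", "description", "solution")


def search_vulnerabilities(records: Iterable[dict], term: str) -> list[dict]:
    """Return the records whose searchable fields contain `term`, ignoring case."""
    needle = term.casefold()
    matches = []
    for record in records:
        # Missing or null fields are treated as empty strings.
        haystack = " ".join(str(record.get(field) or "") for field in SEARCH_FIELDS)
        if needle in haystack.casefold():
            matches.append(record)
    return matches


if __name__ == "__main__":
    sample = [
        {"title": "RCE in logging library", "description": "Affects Log4j 2.x"},
        {"title": "Reflected XSS", "description": None},
    ]
    for hit in search_vulnerabilities(sample, "log4j"):
        print(hit["title"])
```

Every search of this kind scans the full exported data set, which is part of why the approach does not scale to a large instance.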

Proposal

The purpose of this spike is to investigate various search options and determine which ones are feasible. Feasibility considerations include performance/scalability, ongoing maintenance costs, and impact on customers (e.g. whether they have to install an additional component with complex setup/config). We should look for solutions that can be implemented by the group::threat insights team.

The research must consider types of solutions in the following order:

  1. Available in GitLab today (e.g. Postgres) or as an existing first-class integration (e.g. Elasticsearch).
  2. Available in the near future (i.e. ClickHouse).
  3. Net-new technologies/components we could include with GitLab (Omnibus). These would require approval and integration, so they should be considered a last resort and are likely to spawn their own spikes.

There may be other potential technologies or solutions that are already a required or optional part of GitLab through the GitLab Omnibus component list. If the outcome of this investigation is that the best option is an external component, we will incorporate this into designs so that search and search-related features (like advanced filtering) only appear when the optional component is present, as Global Search does today.

Use-cases

These use cases are in rough priority order, with the first item being of primary importance. If it turns out that a single technology is not suited to all of these use cases, focus should be on selecting one that can enable full text search.

  1. Full text search in selected fields. At minimum, this should include the vulnerability name/title, Description, and any Solution. Ideally, search can also include information in the generic details key/value structure ("Evidence" section on a vulnerability page).
  2. Near real-time updates of counts for the Security dashboard.
  3. Support other charts that depend on aggregated queries. For example: time spent in each state (detected, confirmed, resolved), average time to resolve, and counts by fields other than severity (currently the only counts in the Security dashboard).
  4. Improve performance of the vulnerability report for large projects/groups/namespaces.
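As an illustration of the aggregated queries behind use case 3, the sketch below computes an average time to resolve and a count by an arbitrary field over in-memory records. The record shape (detected_at, resolved_at, state) and both function names are assumptions made for this example. A real implementation would push these aggregations down into whichever data store is chosen.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Optional


def average_time_to_resolve(records: list[dict]) -> Optional[timedelta]:
    """Mean of (resolved_at - detected_at) over resolved records, or None if there are none."""
    durations = [
        r["resolved_at"] - r["detected_at"]
        for r in records
        if r.get("resolved_at") is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def count_by(records: list[dict], field: str) -> Counter:
    """Count records grouped by the value of `field` (e.g. state or severity)."""
    return Counter(r[field] for r in records)


if __name__ == "__main__":
    records = [
        {"state": "resolved", "detected_at": datetime(2022, 1, 1), "resolved_at": datetime(2022, 1, 3)},
        {"state": "resolved", "detected_at": datetime(2022, 1, 1), "resolved_at": datetime(2022, 1, 5)},
        {"state": "detected", "detected_at": datetime(2022, 1, 2), "resolved_at": None},
    ]
    print(average_time_to_resolve(records))
    print(count_by(records, "state"))
```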

Expected Outcomes

  • Identify feasible options for in-product vulnerability search, evaluating each for:
    • Whether it requires customers to install additional (non-Omnibus) components such as Elasticsearch
    • Any potential ongoing maintenance required of customers
  • Assess each feasible option for:
    • Ability for group::threat insights to implement with only minimal guidance from other teams
    • Rough level of effort to implement
    • Performance limitations

Investigation Outcome and Recommendations

In summary, the outcome of the investigation is that GitLab should investigate the use of Postgres Full Text Search with partitioning and GIN/GiST indexes. This approach has already proven effective as a solution for Issue search within GitLab, which should make it the simplest possible minimum viable solution for use case 1.
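For readers unfamiliar with why a GIN index suits full text search: GIN is an inverted index, mapping each token to the set of rows that contain it, so a query only touches the rows for the tokens it names. The toy in-memory sketch below illustrates that idea only. It is not how Postgres implements GIN, it omits stemming, stop words and ranking, and the class and function names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens (no stemming)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class InvertedIndex:
    """Maps each token to the ids of the documents that contain it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> set[int]:
        """Return ids of documents containing every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = None
        for token in tokens:
            ids = self._postings.get(token, set())
            result = ids if result is None else result & ids
        return set(result)


if __name__ == "__main__":
    index = InvertedIndex()
    index.add(1, "Apache Log4j remote code execution")
    index.add(2, "SQL injection in login form")
    index.add(3, "Log4j JNDI lookup allows remote class loading")
    print(sorted(index.search("log4j")))
    print(sorted(index.search("log4j jndi")))
```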

If Postgres Full Text Search fails to scale as far as needed, Elasticsearch has been identified as the most applicable alternative. It already has established precedent within GitLab as the "Advanced Full Text Search" solution for issues, which provides boilerplate and expertise that could be leveraged to fulfil use case 1 at scale. It may additionally fulfil use case 4, as it may handle the substantial data quantities more effectively, but further investigation is required here.

Use cases 3 and 4 are metric-driven, and Elasticsearch can be an effective solution for them, as its industry-standard use in log analysis and metric aggregation shows. As a result, ingesting vulnerabilities into it may prove effective in solving all 4 use cases, pending further feasibility investigation. This versatility, combined with existing precedent and organisational experience, makes it a natural point of progression.

Use cases 3 and 4 could possibly be achieved without Elasticsearch by performing and caching statistical aggregation outside of the standard request-response cycle (in asynchronous jobs), with the results consumed on request and potentially pushed to the user via a WebSocket. Further investigation is required to assess the complexity and cost of this style of implementation.
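The shape of that asynchronous approach can be sketched as follows: a background job recomputes the aggregate and stores it, while the request path only reads the stored value. This is a minimal in-memory stand-in under stated assumptions. The class name, the severity field, and the use of a plain attribute instead of a real job queue and cache are all hypothetical.

```python
import time
from collections import Counter
from typing import Optional


class SeverityCountCache:
    """Holds precomputed counts so that requests never aggregate on demand."""

    def __init__(self) -> None:
        self._counts: dict[str, int] = {}
        self._refreshed_at: Optional[float] = None

    def refresh(self, records: list[dict]) -> None:
        """Recompute the counts. Intended to run in a background job, not in a request."""
        self._counts = dict(Counter(r["severity"] for r in records))
        self._refreshed_at = time.time()

    def read(self) -> dict:
        """Cheap read for the request path: the cached counts and when they were computed."""
        return {"counts": dict(self._counts), "refreshed_at": self._refreshed_at}


if __name__ == "__main__":
    cache = SeverityCountCache()
    cache.refresh([{"severity": "critical"}, {"severity": "high"}, {"severity": "high"}])
    print(cache.read()["counts"])
```

The trade-off is staleness: the counts are only as fresh as the last refresh, so the refresh interval or trigger decides how close to real time the dashboard can be.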

ClickHouse may be investigated further as an ancillary option for use cases 3 and 4, given the existing Working Group within GitLab. However, it would introduce an additional duplication of data, especially if Elasticsearch is already used for full text search, so it should be considered a backup option.

Refer to this thread for full investigation and findings: #352665 (comment 1243987880)

Further Investigation Required

  1. [SPIKE] Investigate best strategy for Postgres Full Text Search of vulnerability information
  2. [SPIKE] Investigate ingestion of significant vulnerability data into Elasticsearch
  3. [SPIKE] Investigate performance of Elasticsearch for:
    1. Vulnerability filter and query
    2. Vulnerability full text search
    3. Vulnerability metric aggregation and query
  4. [SPIKE] Investigate feasibility of asynchronous metric aggregation and caching (Sidekiq+Redis/Postgres)
  5. [SPIKE] Investigate reporting of near-real-time metrics to interface via WebSockets
  6. [SPIKE] Investigate ingestion of significant vulnerability data into ClickHouse
  7. [SPIKE] Investigate performance of ClickHouse for vulnerability metric aggregation and query