[SPIKE] Investigate ingestion of significant vulnerability data into ClickHouse
As a follow-up to #352665 (closed), we should investigate the engineering complexity of ingesting significant quantities of vulnerability data into ClickHouse, parsing it, and ensuring that ClickHouse remains functional and performant under this strategy, as a potential alternative to Elasticsearch for metric aggregation and near-real-time querying.
## Expected Outcomes
- Determine the engineering complexity involved in vulnerability ingestion into ClickHouse (see the ingestion sketch after this list). Some questions to consider:
  - Does ClickHouse have an API we can push vulnerability information into, either one record at a time or in bulk?
  - Do we need to write to a file and ingest that into ClickHouse in some other way?
  - How does GitLab currently ingest issue data into ClickHouse for the Advanced Search capability, and can we mimic that approach?
  - If the data needs to be ingested asynchronously, what kind of delays should we expect?
- Investigate the costs related to our use of ClickHouse (see the sizing query after this list).
  - What is our expected data domain size in ClickHouse? What kind of cost impact might this have on GitLab.com? (Bearing in mind we already have a ClickHouse deployment.)
- What will the impact be on our self-managed users?
  - What kind of resources will be needed to run a ClickHouse instance?
  - What sort of administrative complexity would this incur on our self-managed users?
  - What complexity would be involved in making this behaviour opt-in?
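
To ground the ingestion questions above, here is a minimal sketch of pushing vulnerability rows through ClickHouse's HTTP interface. It assumes a local instance on the default port 8123; the `vulnerability_occurrences` table name and its columns are illustrative only, not our actual schema. It contrasts one batched `INSERT` (the pattern MergeTree tables favour), streaming an existing newline-delimited JSON export, and per-record inserts buffered server-side with `async_insert` (available in recent ClickHouse releases).

```python
import json
import requests

CLICKHOUSE_URL = "http://localhost:8123/"  # assumed: local ClickHouse HTTP interface


def run(query, data=None, settings=None):
    """Send a query (plus optional request body and settings) to ClickHouse over HTTP."""
    params = {"query": query, **(settings or {})}
    resp = requests.post(CLICKHOUSE_URL, params=params, data=data)
    resp.raise_for_status()
    return resp


# Illustrative table for vulnerability records; the real schema would need its own design.
run("""
    CREATE TABLE IF NOT EXISTS vulnerability_occurrences
    (
        id          UInt64,
        project_id  UInt64,
        severity    LowCardinality(String),
        report_type LowCardinality(String),
        detected_at DateTime
    )
    ENGINE = MergeTree
    ORDER BY (project_id, detected_at)
""")

# Bulk ingestion: batch many rows into one INSERT. MergeTree tables prefer few large
# inserts over many tiny ones, since every insert creates a new part on disk.
rows = [
    {"id": 1, "project_id": 42, "severity": "high",
     "report_type": "sast", "detected_at": "2024-01-01 10:00:00"},
    {"id": 2, "project_id": 42, "severity": "critical",
     "report_type": "dependency_scanning", "detected_at": "2024-01-01 10:05:00"},
]
payload = "\n".join(json.dumps(r) for r in rows).encode()
run("INSERT INTO vulnerability_occurrences FORMAT JSONEachRow", data=payload)

# File-based ingestion: stream an existing newline-delimited JSON export through the
# same endpoint instead of building the payload in memory.
# with open("vulnerabilities.ndjson", "rb") as f:
#     run("INSERT INTO vulnerability_occurrences FORMAT JSONEachRow", data=f)

# One record at a time: let ClickHouse buffer small inserts server-side via
# async_insert, trading immediate visibility for fewer parts on disk.
single = {"id": 3, "project_id": 7, "severity": "low",
          "report_type": "sast", "detected_at": "2024-01-01 10:10:00"}
run(
    "INSERT INTO vulnerability_occurrences FORMAT JSONEachRow",
    data=json.dumps(single).encode(),
    settings={"async_insert": 1, "wait_for_async_insert": 1},
)
```

With `wait_for_async_insert` set to `0` the request returns before the buffered data is flushed, which is one concrete place where the "what kind of delays should we expect" question would need measurement.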
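For the cost question, ClickHouse exposes per-part storage metadata in `system.parts`, so the on-disk footprint of a trial ingestion can be measured directly. The sketch below assumes the same local instance and illustrative table name as above.

```python
import requests

CLICKHOUSE_URL = "http://localhost:8123/"  # assumed: local ClickHouse HTTP interface

# Row count plus compressed and uncompressed size of the illustrative table,
# taken from ClickHouse's own part metadata.
query = """
    SELECT
        table,
        sum(rows)                                        AS row_count,
        formatReadableSize(sum(data_compressed_bytes))   AS on_disk,
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed
    FROM system.parts
    WHERE active AND table = 'vulnerability_occurrences'
    GROUP BY table
    FORMAT PrettyCompact
"""
print(requests.post(CLICKHOUSE_URL, params={"query": query}).text)
```

Extrapolating the compressed-bytes-per-row figure from a trial ingestion to the expected number of vulnerability records on GitLab.com would give a first-order estimate of the storage cost.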