Determine and document a plan for UnReview open-source components that don't exist in GitLab infrastructure
We will not implement these plans until milestone 2 (past the PoC).
Apache Kafka
Link: https://kafka.apache.org/
Provided by: https://www.confluent.io/
Description: Distributed event streaming platform
Used for: Distributed event streaming and processing
Note: Initially, UnReview used Kafka and Kafka Streams/ksqlDB to process and send data from the "extract" stage to the "load" stage. Now, Kafka only sends raw data to Azure Blob storage, while Apache Hive + ADF performs the data preprocessing.
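The current Kafka-to-Blob handoff described above amounts to a batch-and-flush loop. Below is a minimal sketch of that pattern, not UnReview's actual code: the event shapes and the `flush_batch` helper are hypothetical, and a local directory stands in for Azure Blob storage so the sketch runs anywhere.

```python
import json
import tempfile
from pathlib import Path

def flush_batch(events, blob_dir, batch_id):
    """Write one batch of raw events as a JSON-lines 'blob' (hypothetical helper)."""
    blob = Path(blob_dir) / f"raw-{batch_id:06d}.jsonl"
    blob.write_text("\n".join(json.dumps(e) for e in events))
    return blob

# Stand-in for events consumed from the Kafka topic feeding the "extract" stage.
events = [
    {"repo": "gitlab-org/gitlab", "event": "merge_request_opened", "iid": 1},
    {"repo": "gitlab-org/gitlab", "event": "review_requested", "iid": 1},
]

with tempfile.TemporaryDirectory() as blob_dir:
    blob = flush_batch(events, blob_dir, batch_id=1)
    print(len(blob.read_text().splitlines()))  # 2
```

Any replacement (Pub/Sub, managed Kafka) only has to preserve this contract: ordered raw events in, immutable batch files out.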
Pros of implementing in the GitLab.com infrastructure:
- We need queues/streaming in various parts of the GitLab product (this came up in a recent discussion with @sgoldstein), not just potentially for this project.
- It is Apache 2.0 licensed (so no concerns from legal)
Cons of implementing in the GitLab.com infrastructure:
- One more moving part to maintain, secure, and scale
- Kafka requires Apache ZooKeeper, which also has to be maintained
Concerns with implementing in the GitLab self-hosted infrastructure:
- TBD
Alternative solutions:
- Google Pub/Sub (comparison)
- Kafka on GCP installed via the marketplace
ADF (Microsoft Azure Data Factory)
Link: https://azure.microsoft.com/en-us/services/data-factory/
Description: Serverless data integration and transformation service
Used for: Orchestrating the preprocessing pipelines, managing the Hive cluster, and moving the processed data to MongoDB
![](/-/project/278964/uploads/115cde8de37075403990d7621f84c20f/pipeline1.png)
![](/-/project/278964/uploads/d92765d4b0bbb9e0cbf8a1c45b3ade73/pipeline2.png)
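What ADF provides here is essentially the execution of a small dependency graph of stages. A minimal sketch of that pattern, using Python's standard-library `graphlib`, with hypothetical stage names based on the pipeline diagrams:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_pipeline(tasks, deps):
    """Execute tasks in dependency order, as an orchestrator like ADF would."""
    order, results = [], {}
    for name in TopologicalSorter(deps).static_order():
        results[name] = tasks[name]()
        order.append(name)
    return order, results

# Hypothetical stages: start Hive cluster -> preprocess -> load MongoDB -> stop cluster.
tasks = {
    "start_hive": lambda: "cluster-up",
    "preprocess": lambda: "datasets-built",
    "load_mongo": lambda: "documents-loaded",
    "stop_hive": lambda: "cluster-down",
}
deps = {
    "preprocess": {"start_hive"},
    "load_mongo": {"preprocess"},
    "stop_hive": {"load_mongo"},
}

order, _ = run_pipeline(tasks, deps)
print(order)  # ['start_hive', 'preprocess', 'load_mongo', 'stop_hive']
```

Evaluating a replacement is then a question of which service can express this graph plus the cluster lifecycle hooks, not of porting any Azure-specific logic.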
Pros of implementing in the GitLab.com infrastructure:
None
Cons of implementing in the GitLab.com infrastructure:
- This is an Azure-specific service
Concerns with implementing in the GitLab self-hosted infrastructure:
N/A - not possible
Alternative solutions:
- Google Dataflow
- Meltano?
- TBD
Apache Hive
Link: https://hive.apache.org/
Provided by: https://azure.microsoft.com/en-us/services/hdinsight/
Description: SQL-compatible data warehouse
Used for: Data preprocessing including building train/test datasets.
Note: The cluster is automatically started by ADF and terminates after the preprocessing scripts are executed.
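The train/test dataset construction mentioned above reduces to a deterministic partition of the preprocessed rows. A hedged sketch of that step (the row shape, split fraction, and seed are assumptions for illustration, not UnReview's actual logic):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Deterministically shuffle rows and split them into train/test sets."""
    rng = random.Random(seed)          # fixed seed => reproducible datasets
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical preprocessed rows (e.g. one per merge request).
rows = [{"mr_id": i, "label": i % 2} for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

Whether this runs in Hive, Postgres, or ClickHouse, reproducibility of the split is the property to preserve.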
Pros of implementing in the GitLab.com infrastructure:
- It is Apache 2.0 licensed (so no concerns from legal)
Cons of implementing in the GitLab.com infrastructure:
- One more moving part to maintain, secure, and scale
- Depends on Hadoop, which requires significant effort to maintain
Concerns with implementing in the GitLab self-hosted infrastructure:
- Do customers have the virtual/physical hardware allocated for our needs?
Alternative solutions:
- Postgres - It already exists in our infrastructure but is not designed for data-warehouse workloads. Would it be able to handle this minimally well?
- Clickhouse? https://clickhouse.tech/
MongoDB
Link: https://www.mongodb.com/
Description: Document based database
Pros of implementing in the GitLab.com infrastructure:
- TBD
Cons of implementing in the GitLab.com infrastructure:
- Licensing (MongoDB Server is SSPL-licensed) would need to be validated by legal: https://www.mongodb.com/community/licensing
Concerns with implementing in the GitLab self-hosted infrastructure:
- Do customers have the virtual/physical hardware allocated for our needs?
Alternative solutions:
- Postgres JSON/JSONB types https://www.postgresql.org/docs/9.5/datatype-json.html - would this meet our needs?
- Clickhouse? https://clickhouse.tech/
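To gauge whether Postgres could replace MongoDB here, the pattern to evaluate is storing whole documents in a column and filtering on fields inside them. A sketch of that pattern (sqlite3 stands in for Postgres so the snippet runs without a server - in Postgres the column would be `jsonb` and the query would use `->>`; the reviewer documents are hypothetical):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviewers (doc TEXT)")  # jsonb column in Postgres

docs = [
    {"username": "alice", "languages": ["go", "ruby"], "reviews": 42},
    {"username": "bob", "languages": ["python"], "reviews": 7},
]
conn.executemany("INSERT INTO reviewers VALUES (?)",
                 [(json.dumps(d),) for d in docs])

# Filter on a field inside the document, as MongoDB (or Postgres JSONB) allows.
rows = conn.execute(
    "SELECT json_extract(doc, '$.username') FROM reviewers "
    "WHERE json_extract(doc, '$.reviews') > 10"
).fetchall()
print(rows)  # [('alice',)]
```

If the workload is mostly lookups and filters like this (rather than MongoDB-specific aggregation pipelines), Postgres JSONB plus expression indexes is likely sufficient.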
Neo4J
No longer a dependency.