Determine and document a plan for UnReview open-source components that don't exist in GitLab infrastructure

Determine and document a plan for UnReview open-source components that don't exist in GitLab infrastructure. We will not implement these plans until milestone 2 (past the PoC).


Apache Kafka

Link: https://kafka.apache.org/

Provided by: https://www.confluent.io/

Description: Distributed event streaming platform

Used for: Distributed event streaming and processing

Note: Initially, UnReview used Kafka and Kafka Streams/ksqlDB to process and send data from the "extract" stage to the "load" stage. Now, Kafka only sends raw data to Azure Blob storage, while Apache Hive + ADF perform the data preprocessing.

Pros of implementing in the GitLab.com infrastructure:

  • We need queues/streaming in various parts of the GitLab product (this came up in a recent discussion with @sgoldstein), not just potentially for this project.
  • It is Apache 2.0 licensed (so no concerns from legal)

Cons of implementing in the GitLab.com infrastructure:

  • One more moving part to maintain, secure, and scale
  • Kafka requires Apache ZooKeeper, which also has to be maintained

Concerns with implementing in the GitLab self-hosted infrastructure:

  • TBD

Alternative solutions:

ADF (Microsoft Azure Data Factory)

Link: https://azure.microsoft.com/en-us/services/data-factory/

Description: Serverless data integration and transformation service

Used to: Orchestrate the preprocessing pipelines, manage the Hive cluster, and move the processed data to MongoDB

Pros of implementing in the GitLab.com infrastructure:

None

Cons of implementing in the GitLab.com infrastructure:

  • This is an Azure-specific service

Concerns with implementing in the GitLab self-hosted infrastructure:

N/A - not possible

Alternative solutions:

Apache Hive

Link: https://hive.apache.org/

Provided by: https://azure.microsoft.com/en-us/services/hdinsight/

Description: SQL-compatible data warehouse

Used for: Data preprocessing including building train/test datasets.

Note: The cluster is automatically started by ADF and terminates after the preprocessing scripts are executed.
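As an illustration of the "building train/test datasets" step, here is a minimal sketch of a deterministic, hash-based split. This is an assumption for discussion only: the issue does not describe how the Hive scripts actually split the data, and the `id` field and the `assign_split` / `split_records` helpers are hypothetical names.

```python
import hashlib


def assign_split(record_id: str, test_fraction: float = 0.2) -> str:
    """Assign a record to 'train' or 'test' by hashing its ID.

    Hypothetical helper: the same ID always lands in the same split,
    so reruns of the preprocessing produce identical datasets.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return "test" if bucket < test_fraction * 2**64 else "train"


def split_records(records, test_fraction: float = 0.2):
    """Split a list of dicts (each with an assumed 'id' key) into train/test lists."""
    train, test = [], []
    for rec in records:
        if assign_split(rec["id"], test_fraction) == "test":
            test.append(rec)
        else:
            train.append(rec)
    return train, test


if __name__ == "__main__":
    sample = [{"id": f"mr-{i}"} for i in range(10)]
    train, test = split_records(sample)
    print(len(train), len(test))
```

A split like this needs no cluster state, which matters when weighing whether a full Hive/Hadoop deployment is justified for the preprocessing workload.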

Pros of implementing in the GitLab.com infrastructure:

  • It is Apache 2.0 licensed (so no concerns from legal)

Cons of implementing in the GitLab.com infrastructure:

  • One more moving part to maintain, secure, and scale
  • Depends on Hadoop, which requires significant effort to maintain

Concerns with implementing in the GitLab self-hosted infrastructure:

  • Do customers have the virtual/physical hardware allocated for our needs?

Alternative solutions:

  • Postgres - It already exists in our infrastructure but is not designed for data warehouse (D/W) workloads. Would it handle this workload acceptably?
  • ClickHouse? https://clickhouse.tech/

MongoDB

Link: https://www.mongodb.com/

Description: Document-based database

Pros of implementing in the GitLab.com infrastructure:

Cons of implementing in the GitLab.com infrastructure:

Concerns with implementing in the GitLab self-hosted infrastructure:

  • Do customers have the virtual/physical hardware allocated for our needs?

Alternative solutions:

Neo4j

No longer a dependency.

Edited by Alexander Chueshev