Determine and document a plan for UnReview open-source components that don't exist in GitLab infrastructure
We will not implement these plans until milestone 2 (past the PoC).
Apache Kafka
Link: https://kafka.apache.org/
Provided by: https://www.confluent.io/
Description: Distributed event streaming platform
Used for: Distributed event streaming and processing
Note: Initially, UnReview used Kafka and Kafka Streams/ksqlDB to process and send data from the "extract" stage to the "load" stage. Now, Kafka only sends raw data to Azure Blob storage, while Apache Hive + ADF performs the data preprocessing.
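The current Kafka-to-Blob handoff described above amounts to a batch-and-flush loop. Below is a minimal sketch of that pattern, not UnReview's actual code: the event shapes and the `flush_batch` helper are hypothetical, and a local directory stands in for Azure Blob storage so the sketch runs anywhere.

```python
import json
import tempfile
from pathlib import Path

def flush_batch(events, blob_dir, batch_id):
    """Write one batch of raw events as a JSON-lines 'blob' (hypothetical helper)."""
    blob = Path(blob_dir) / f"raw-{batch_id:06d}.jsonl"
    blob.write_text("\n".join(json.dumps(e) for e in events))
    return blob

# Stand-in for events consumed from the Kafka topic feeding the "extract" stage.
events = [
    {"repo": "gitlab-org/gitlab", "event": "merge_request_opened", "iid": 1},
    {"repo": "gitlab-org/gitlab", "event": "review_requested", "iid": 1},
]

with tempfile.TemporaryDirectory() as blob_dir:
    blob = flush_batch(events, blob_dir, batch_id=1)
    print(len(blob.read_text().splitlines()))  # 2
```

Any replacement (Pub/Sub, managed Kafka) only has to preserve this contract: ordered raw events in, immutable batch files out.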
Pros of implementing in the GitLab.com infrastructure:
- We need queues/streaming in various parts of the GitLab product (this came up in a recent discussion with @sgoldstein), not just potentially for this project.
- It is Apache 2.0 licensed (so no concerns from legal)
Cons of implementing in the GitLab.com infrastructure:
- One more moving part to maintain, secure, and scale
- Kafka requires Apache ZooKeeper, which also has to be maintained
Concerns with implementing in the GitLab self-hosted infrastructure:
- TBD
Alternative solutions:
- Google Pub/Sub (comparison)
- Kafka on GCP installed via the marketplace
ADF (Microsoft Azure Data Factory)
Link: https://azure.microsoft.com/en-us/services/data-factory/
Description: Serverless data integration and transformation service
Used for: Orchestrating the preprocessing pipelines, managing the Hive cluster, and moving the processed data to MongoDB
![](/-/project/278964/uploads/115cde8de37075403990d7621f84c20f/pipeline1.png)
![](/-/project/278964/uploads/d92765d4b0bbb9e0cbf8a1c45b3ade73/pipeline2.png)
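What ADF provides here is essentially the execution of a small dependency graph of stages. A minimal sketch of that pattern, using Python's standard-library `graphlib`, with hypothetical stage names based on the pipeline diagrams:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def run_pipeline(tasks, deps):
    """Execute tasks in dependency order, as an orchestrator like ADF would."""
    order, results = [], {}
    for name in TopologicalSorter(deps).static_order():
        results[name] = tasks[name]()
        order.append(name)
    return order, results

# Hypothetical stages: start Hive cluster -> preprocess -> load MongoDB -> stop cluster.
tasks = {
    "start_hive": lambda: "cluster-up",
    "preprocess": lambda: "datasets-built",
    "load_mongo": lambda: "documents-loaded",
    "stop_hive": lambda: "cluster-down",
}
deps = {
    "preprocess": {"start_hive"},
    "load_mongo": {"preprocess"},
    "stop_hive": {"load_mongo"},
}

order, _ = run_pipeline(tasks, deps)
print(order)  # ['start_hive', 'preprocess', 'load_mongo', 'stop_hive']
```

Evaluating a replacement is then a question of which service can express this graph plus the cluster lifecycle hooks, not of porting any Azure-specific logic.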
Pros of implementing in the GitLab.com infrastructure:
None
Cons of implementing in the GitLab.com infrastructure:
- This is an Azure-specific service
Concerns with implementing in the GitLab self-hosted infrastructure:
N/A - not possible
Alternative solutions:
- Google Dataflow
- Meltano?
- TBD
Apache Hive
Link: https://hive.apache.org/
Provided by: https://azure.microsoft.com/en-us/services/hdinsight/
Description: SQL-compatible data warehouse
Used for: Data preprocessing including building train/test datasets.
Note: The cluster is automatically started by ADF and terminates after the preprocessing scripts are executed.
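The train/test dataset construction mentioned above reduces to a deterministic partition of the preprocessed rows. A hedged sketch of that step (the row shape, split fraction, and seed are assumptions for illustration, not UnReview's actual logic):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Deterministically shuffle rows and split them into train/test sets."""
    rng = random.Random(seed)          # fixed seed => reproducible datasets
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical preprocessed rows (e.g. one per merge request).
rows = [{"mr_id": i, "label": i % 2} for i in range(10)]
train, test = train_test_split(rows)
print(len(train), len(test))  # 8 2
```

Whether this runs in Hive, Postgres, or ClickHouse, reproducibility of the split is the property to preserve.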
Pros of implementing in the GitLab.com infrastructure:
- It is Apache 2.0 licensed (so no concerns from legal)
Cons of implementing in the GitLab.com infrastructure:
- One more moving part to maintain, secure, and scale
- Depends on Hadoop, which requires significant effort to maintain
Concerns with implementing in the GitLab self-hosted infrastructure:
- Do customers have the virtual/physical hardware allocated for our needs?
Alternative solutions:
- Postgres - It already exists in our infrastructure but is not designed for data-warehouse workloads. Would it be able to handle this minimally well?
- Clickhouse? https://clickhouse.tech/
MongoDB
Link: https://www.mongodb.com/
Description: Document based database
Pros of implementing in the GitLab.com infrastructure:
- TBD
Cons of implementing in the GitLab.com infrastructure:
- Licensing (MongoDB Server is SSPL-licensed) would need to be validated by legal: https://www.mongodb.com/community/licensing
Concerns with implementing in the GitLab self-hosted infrastructure:
- Do customers have the virtual/physical hardware allocated for our needs?
Alternative solutions:
- Postgres JSON/JSONB types https://www.postgresql.org/docs/9.5/datatype-json.html - would this meet our needs?
- Clickhouse? https://clickhouse.tech/
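To gauge whether Postgres could replace MongoDB here, the pattern to evaluate is storing whole documents in a column and filtering on fields inside them. A sketch of that pattern (sqlite3 stands in for Postgres so the snippet runs without a server - in Postgres the column would be `jsonb` and the query would use `->>`; the reviewer documents are hypothetical):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviewers (doc TEXT)")  # jsonb column in Postgres

docs = [
    {"username": "alice", "languages": ["go", "ruby"], "reviews": 42},
    {"username": "bob", "languages": ["python"], "reviews": 7},
]
conn.executemany("INSERT INTO reviewers VALUES (?)",
                 [(json.dumps(d),) for d in docs])

# Filter on a field inside the document, as MongoDB (or Postgres JSONB) allows.
rows = conn.execute(
    "SELECT json_extract(doc, '$.username') FROM reviewers "
    "WHERE json_extract(doc, '$.reviews') > 10"
).fetchall()
print(rows)  # [('alice',)]
```

If the workload is mostly lookups and filters like this (rather than MongoDB-specific aggregation pipelines), Postgres JSONB plus expression indexes is likely sufficient.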
Neo4J
No longer a dependency.