Discuss long-term architectural considerations for analytics

We are effectively building an analytics solution on top of GitLab, so there are some long-term considerations we need to start discussing if we want this to be viable:

  • GraphQL (some teams are already moving here)
  • Job scheduling
  • Sidekiq & Kafka: https://blog.appsignal.com/2019/04/23/kafka-sidekiq-ruby.html
  • Druid (https://druid.apache.org)
  • Spark vs Flink
  • Airflow

pshutsin [1:43 PM] Currently all analytics are bundled with the GitLab Rails app. Analytics share the same database, the same server, and the same process as the main Rails app. By its nature, analytics works with large sets of historical data, so it is usually designed as an ETL flow, with a lot of data prepared for display and stored in the database. As our functionality grows, we store more and more data in the database and calculate a lot of metrics for ALL EE users, regardless of whether they use analytics. E.g. the https://gitlab.com/gitlab-org/gitlab-ee/issues/12683 issue will add millions of records to the database.

So it’s very natural for analytics to run as a completely separate application, with read-only access to a replica of the main database and full access to its own separate database. This would allow users to set up analytics only if they want to use it, configure and control server load, and reduce overhead on the main app. The price is that it’s harder to build an MVP as a separate app, and it requires more infrastructure setup for those who want to use analytics.

So my question is: are there any plans to move all analytics out to a separate app? Do we want to go this way, or do we want to stick with the all-in-one Rails app? WDYT?
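
To make the proposed split concrete, here is a minimal sketch of one ETL step that reads from a read-only replica and writes prepared data to a separate analytics database. It is only an illustration under stated assumptions: SQLite stands in for the real databases, and the `merge_requests` and `merge_request_counts` tables, their columns, and the function names are hypothetical, not GitLab's actual schema or code.

```python
import sqlite3


def seed_main_db() -> sqlite3.Connection:
    """Build a stand-in for the main app's replica (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE merge_requests (project_id INTEGER, state TEXT)")
    conn.executemany(
        "INSERT INTO merge_requests (project_id, state) VALUES (?, ?)",
        [(1, "merged"), (1, "merged"), (1, "opened"), (2, "merged")],
    )
    conn.commit()
    # From here on the analytics app may only read from this connection.
    conn.execute("PRAGMA query_only = ON")
    return conn


def run_etl(replica: sqlite3.Connection, analytics: sqlite3.Connection) -> int:
    """Aggregate on the replica, store the prepared rows in the analytics DB."""
    analytics.execute(
        "CREATE TABLE IF NOT EXISTS merge_request_counts ("
        "project_id INTEGER PRIMARY KEY, merged_count INTEGER NOT NULL)"
    )
    # Extract + transform: one aggregate query against the read-only replica.
    rows = replica.execute(
        "SELECT project_id, COUNT(*) FROM merge_requests "
        "WHERE state = 'merged' GROUP BY project_id"
    ).fetchall()
    # Load: upsert so that re-running the job does not duplicate rows.
    with analytics:
        analytics.executemany(
            "INSERT OR REPLACE INTO merge_request_counts "
            "(project_id, merged_count) VALUES (?, ?)",
            rows,
        )
    return len(rows)


if __name__ == "__main__":
    replica = seed_main_db()
    analytics = sqlite3.connect(":memory:")
    print(run_etl(replica, analytics), "projects aggregated")
```

The point of the sketch is the access pattern: the analytics process never writes to the main database, and everything it precomputes lives in storage that only users who enable analytics have to provision.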

Examples: https://medium.com/netflix-techblog/keystone-real-time-stream-processing-platform-a3ee651812a

Export metrics that can be used by Grafana: https://play.grafana.org/d/000000056/graphite-templated-nested?orgId=1, https://gitlab.com/gitlab-org/grafana-dashboards/tree/master/dashboards
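
As a rough illustration of what exporting could look like, the sketch below formats metrics in Graphite's plaintext protocol (`<metric path> <value> <timestamp>`, one metric per line), which a Graphite-backed Grafana dashboard could then chart. It assumes Graphite is the data source; the `gitlab.analytics` prefix, the metric names, and the function name are hypothetical.

```python
from typing import Mapping


def format_graphite_lines(
    metrics: Mapping[str, float],
    timestamp: int,
    prefix: str = "gitlab.analytics",  # hypothetical metric namespace
) -> str:
    """Render metrics as Graphite plaintext lines: 'path value timestamp'."""
    lines = [
        f"{prefix}.{name} {value} {timestamp}"
        for name, value in sorted(metrics.items())
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric names, for illustration only.
    print(format_graphite_lines({"merged_mrs": 3, "open_issues": 12}, 1700000000), end="")
```

Sending the resulting text to a Graphite (Carbon) listener is left out here, since the host, port, and transport depend on the deployment.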

Edited Dec 10, 2024 by Brandon Labuschagne