
Product Analytics MVP: Basic database schema

Andreas Brandl requested to merge ab/product-analytics into master

What does this MR do?

We want to explore and validate the Product Analytics feature introduced in !27730 (closed). In this MR, we tackle the database side of it and ship a single table to store Product Analytics events.

MVP - Experimental

The goal is to ship a real MVP here to validate the feature itself. Therefore, we mark this with a large EXPERIMENTAL sticker. This has the following implications:

  1. We don't invest too much time into proper relational/dimensional modeling at this point.
  2. Instead, we ship it as a flat table.
  3. We know the schema shipped here is going to be problematic once we store a significant amount of data in it.
  4. Optionally, we apply hash partitioning to it (a slight piggy-back sidetrack from groupdatabase, but it'll be useful to include it here - see below)

We are going to be explicit about the experimental state of the feature when we ship it:

  1. It's behind a feature flag
  2. We mark it as "experimental" in documentation
  3. We are explicit about the fact that we will remove all data once feature validation is complete.

The last point (3) means we are going to drop this table at some point, including all data, and start over. All data will be lost, on GitLab.com as well as on self-hosted installations.

Read more about the MVP approach and how we think the architecture is going to evolve (if successful) here:

  1. !33960 (comment 356224300)
  2. !27730 (comment 366807020)
  3. !27730 (comment 357614107)

Partitioning

For Product Analytics, events are typically gathered and analyzed by project. We use project_id to scope analytical queries down. Furthermore, typical queries in this context contain a time range - for example, to build a graph over 30 days. Currently, this is based on collector_tstamp BETWEEN ? AND ?.

In order to build graphs, we expect queries similar to this:

SELECT
  xxx,
  COUNT(*)
FROM product_analytics_events_experimental
WHERE project_id = ?
AND   collector_tstamp BETWEEN ? AND ?
GROUP BY xxx

or

SELECT
  DATE_TRUNC('day', some_timestamp),
  COUNT(*)
FROM product_analytics_events_experimental
WHERE project_id = ?
AND   collector_tstamp BETWEEN ? AND ?
GROUP BY DATE_TRUNC('day', some_timestamp)

So we basically always have a project_id = ? and a collector_tstamp BETWEEN ? AND ? filter.

This lends itself to

  1. Hash partitioning based on project_id
  2. Time-space partitioning based on collector_tstamp
  3. A combination: First level (1), second level (2)

We are currently still working on implementing time-space partitioning, so this is WIP and not yet available. Hash partitioning is only a small change in comparison: !35268 (merged)
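For illustration, here is a minimal sketch of what the hash-partitioned table could look like as a plain migration. The column list, the partition count of 64, and the migration class name are assumptions for this sketch, not the exact code shipped in this MR (which builds on the helpers from the dependencies listed below):

# Hypothetical sketch only; columns, partition count and class name
# are illustrative, not the exact migration shipped in this MR.
class CreateProductAnalyticsEventsExperimental < ActiveRecord::Migration[6.0]
  def up
    # PostgreSQL requires the partition key to be part of the primary key,
    # hence the composite (id, project_id) key discussed further below.
    execute <<~SQL
      CREATE TABLE product_analytics_events_experimental (
        id bigserial NOT NULL,
        project_id bigint NOT NULL,
        collector_tstamp timestamp with time zone,
        -- ... remaining event columns ...
        PRIMARY KEY (id, project_id)
      ) PARTITION BY HASH (project_id)
    SQL

    # Rows are routed to one of the partitions by hashing project_id.
    64.times do |i|
      execute <<~SQL
        CREATE TABLE product_analytics_events_experimental_#{i}
        PARTITION OF product_analytics_events_experimental
        FOR VALUES WITH (MODULUS 64, REMAINDER #{i})
      SQL
    end
  end

  def down
    execute 'DROP TABLE product_analytics_events_experimental'
  end
end

With a layout like this, the project_id = ? filter lets the planner prune down to a single partition, and the collector_tstamp range filter then only scans data within that partition.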

From a groupdatabase perspective, it would be interesting to see hash partitioning applied here, because this would help us learn something in that area as well. It is low-risk since the feature is behind a feature flag and marked experimental anyway. Partitioning can help us grow feature usage beyond what would be possible with a non-partitioned table.

Applying only hash partitioning here is not ideal, particularly since we're only going to enable a few projects to generate data for this, so the data will concentrate in a handful of partitions. This is known and we don't mind for now.

OTOH, this should not hold up MVP development of Product Analytics - so if that turns out to be the case, we shouldn't do it and should simply go with a non-partitioned table for now. We decided to keep it.

Dependencies

There are two dependencies for the hash-partitioning side (currently under review):

  1. !35137 (merged)
  2. !35268 (merged)

Using the partitioned table from ActiveRecord

This is pretty straightforward, but we have to tweak the primary key. The formal PK includes the partitioning key, which makes it a composite key. ActiveRecord doesn't support composite primary keys (yet; there is a gem for it). We don't need that anyway, because id is a unique key on its own.

class ProductAnalyticsEvent < ApplicationRecord
  self.table_name = 'product_analytics_events_experimental'

  # Ignore that the partition key :project_id is part of the formal primary key
  self.primary_key = :id
end

From the querying side, we can use the model without any restrictions. In order to benefit from the partitioning, though, we should always make sure to include a where(project_id: ?) filter.
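For example, a minimal sketch mirroring the SQL shape above (project and the 30-day window are placeholders for this sketch):

# Prunes to a single hash partition via project_id, then narrows by time
# range and buckets event counts per day:
events_per_day = ProductAnalyticsEvent
  .where(project_id: project.id)
  .where(collector_tstamp: 30.days.ago..Time.current)
  .group("DATE_TRUNC('day', collector_tstamp)")
  .count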

We will provide tooling to detect queries without a partitioning filter later, see #216654.

Todos

  • Validate that using a standard AR model with a composite key is not a problem, or introduce the composite-keys gem @abrandl

Does this MR meet the acceptance criteria?

Conformity
