Product Analytics MVP: Basic database schema
What does this MR do?
We want to explore and validate the Product Analytics feature introduced in !27730 (closed). In this MR, we tackle the database-side of it and ship a single table to store product analytics events.
MVP - Experimental
The goal is to ship a real MVP here to validate the feature itself. Therefore, we mark this with a large EXPERIMENTAL
sticker. This has the following implications:
- We don't invest too much time into proper relational/dimensional modeling at this point in time.
- Instead, we ship it as a flat table.
- We know the schema shipped here is going to be problematic once we store a significant amount of data in it.
- Optionally, we apply hash partitioning to it (a slight piggy-back-sidetrack from groupdatabase, but it'll be useful to include this here - see below)
We are going to be explicit about the experimental state of the feature when we ship it:
- It's behind a feature flag
- We mark it as "experimental" in documentation
- We are explicit about the fact that we will remove all data once feature validation is complete.
The last point (3) means: We are going to drop this table at some point including all data and start over. All data will be lost, including on GitLab.com and self-hosted installations.
Read more about MVP-approach and how we think the architecture is going to evolve (if successful) here:
Partitioning
For Product Analytics, events are typically gathered and analyzed by project. We use project_id
to scope analytical queries down. Furthermore, typical queries in this context contain a timerange - for example, to build a graph over 30 days. This is based on collector_tstamp BETWEEN ? AND ?
currently.
In order to build graphs, we expect queries similar to this:
SELECT
xxx,
COUNT(*)
FROM product_analytics_events_experimental
WHERE project_id = ?
AND collector_tstamp BETWEEN ? AND ?
GROUP BY xxx
or
SELECT
DATE_TRUNC('day', some_timestamp)
COUNT(*)
FROM product_analytics_events_experimental
WHERE project_id = ?
AND collector_tstamp BETWEEN ? AND ?
GROUP BY DATE_TRUNC('day', some_timestamp)
So we basically always have a project_id = ?
and a collector_tstamp BETWEEN ? AND ?
filter.
This lends itself to
- Hash partitioning based on
project_id
- Time-space partitioning based on
collector_tstamp
- A combination: First level (1), second level (2)
We are currently still working on implementing time-space partitioning, so this is WIP and not yet available. Hash partitioning is only a small change in comparison: !35268 (merged)
From a groupdatabase perspective, it would be interesting to see hash-partitioning applied here - because this would help us learn something in that area as well. It is low-risk since the feature is behind a feature flag and marked experimental anyways. Partitioning can help us to grow the feature usage beyond what would be possible with a non-partitioned table.
Applying only hash-partitioning here is not ideal, particularly since we're only going to enable a few projects to generate data for this. This is known and we don't mind for now.
OTOH, this should not hold-up MVP development of Product Analytics - so if that turns out to be the case, we shouldn't do it and simply go with a non-partitioned table for now. We decided to keep it.
Dependencies
There are two dependencies for the hash-partitioning side (currently under review):
Using partitioned table from ActiveRecord
This is pretty straight forward, but we have to tweak the primary key. The formal PK includes the partitioning key, which makes it a composite key. ActiveRecord doesn't support this (yet, and there is a gem for it). We don't need it anyways, because id
is a key on its own, too.
class ProductAnalyticsEvent < ApplicationRecord
self.table_name = 'product_analytics_events_experimental'
# Ignore that the partition key :project_id is part of the formal primary key
self.primary_key = :id
end
From the querying side, we can use the model without any restrictions. In order to benefit from the partitioning though, we should always make sure to include a where(project_id: ?)
filter.
We will provide tooling to detect queries without partitioning filter later, see #216654.
Todos
-
Validate using a standard AR model with a composite key is not a problem or introduce composite keys gem @abrandl