Cube does not return zero values for a given date range
Summary
Datasets do not return zero values for dates that have no events data.
Details
See this staging project. Note that we're querying the last 30 days but data is only returned for the two days that actually have data:
In Clickhouse:
SELECT
DATE(collector_tstamp) AS Date,
count(*) AS events
FROM snowplow_events
GROUP BY Date
┌───────Date─┬─events─┐
│ 2023-06-02 │ 32 │
│ 2023-06-05 │ 6 │
└────────────┴────────┘
2 rows in set. Elapsed: 0.009 sec.
The response from Cube (trimmed for brevity):
...
"data": [
{
"SnowplowTrackedEvents.derivedTstamp.day": "2023-06-02T00:00:00.000",
"SnowplowTrackedEvents.derivedTstamp": "2023-06-02T00:00:00.000",
"SnowplowTrackedEvents.pageViewsCount": "8"
},
{
"SnowplowTrackedEvents.derivedTstamp.day": "2023-06-05T00:00:00.000",
"SnowplowTrackedEvents.derivedTstamp": "2023-06-05T00:00:00.000",
"SnowplowTrackedEvents.pageViewsCount": "2"
}
],
...
Proposal
Return all days being queried but report zero values when there is no event data.
Technical Implementation
Update Snowplow schemas to return all dates in the time range, but COALESCE values such that 0 is returned if the sum of all events for a given day in the requested date range is zero.
In other words, return 0 when there's no events received that day.
Implementation research findings
I started at the closest point to the date by looking at whether Clickhouse could handle this for us. Which it could do indeed! Clickhouse has a WITH_FILL modifier for GROUP/ORDER_BY. Unfortunately because we query Cube and not Clickhouse directly, we aren't able to set this property.
I then looked to see if Cube could automatically solve this and found that Cube has a fillMissingDates option. Unfortunately it looks like it has some limitations which make it not viable for us right now:
- You can only have a single
timeDimension(which is a problem for certain types of multi-axis charts). - It will override the ordering.
- It fills data between two dates with data, it doesn't seem to fill the prior or post data for a range where the data starts halfway through the range.
Both of the above solutions are also limited by the fact that we allow users to define their own visualization queries. This means that we don't control how a query is written and therefore can't make assumptions on the intent of the query by appending additional commands to them.
Therefore, that really leaves two possible options:
- Fill in the missing dates via the Cube proxy on the backend
- Fill in the missing dates via the Cube data source on the frontend
Filling in through the frontend would solve the issue for the UI. But it would mean that there is a disparity between the data being shown in the UI and the data someone might get from querying our Cube proxy via the GraphQL API. This would also have a disparity when we start allowing the export of data too.
Therefore, I suggest we fill in the missing dates on the backend to keep that as the single source of truth and keep the API and UI in sync in terms of the data it receives.
Implementation plan
backend -
- Add a transformer to https://gitlab.com/gitlab-org/gitlab/-/blob/master/ee/app/services/product_analytics/cube_data_query_service.rb#L43 which takes the JSON response body; along with the queries date range, and adds
0to dates which haven't been returned for that response, so that the data returned fills the entire date range provided. - Add specs to validate the transformer works.