Raising query limits applied by the product analytics schemas
Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.
Problem to solve
Once we integrate the WITH FILL modifier with our Cube implementation we will increase the number of rows returned whenever data is requested. This is because WITH FILL will fill the entirety of the date range provided with 0 values in the event that we don't have data for particular date.
Our current query limits are applied by the defined visualizations and are all currently set to a limit of 100. This limit is arbitrary and was the limit set up when developing the product analytics PoC rather than anything set by performance calculations.
The issue is that the limit is very low, especially if a user opts to receive data for an entire year (returning 365 rows). This could increase as we integrate smaller granularities like seconds or hours.
Proposal
I propose that we remove the limit from our built-in visualization schema definitions entirely. Cube.js has a default limit of 10,000 rows; which is documented for the export process. Cube.js has a maximum limit of 50,000 rows and so does Clickhouse by default.
In its place, I suggest the following process:
- If a user sets a
limitin their custom visualization, always respect it and us that for the query. - If the query doesn't define a
limitand the query defines a date range to search:- Take the date range and calculate the number of days in the range.
- Apply this
limitto the query before sending the request to Cube. - This can be applied through the Cube proxy and will be applied to the export functionality since it uses the same endpoint.
- If the query doesn't define a
limitand the query does not define a date range:- Use the default
10,000limitand return the full range. - Export users can use the
offsetproperty to paginate through their data every10,000rows.
- Use the default
In addition, we should add the same limit/offset text from the exports documentation and add it the wider product analytics documentation.
As a follow-up, we could also add an alert to the dashboard letting users know that the visualizations they are seeing have hit the limit, and they should reduce the date range.
Open-questions/concerns
- What performance impact does this have on different visualizations?
- Query times
-
10,000rows is very minor for Clickhouse, might need to test by setting a granularity tosecondsover a long date range. - Our least performant queries are already “Sessions over time” and “Sessions per browser”. These will probably need rewriting/pre-aggregation anyway.
-
- Rendering times
-
10,000rows needs to be looked at for rendering. E-charts will reduce the number of points on a chart to make sure the points fit and group points to help visually.
-
- Query times