Add WITH FILL ORDER BY modifier to Cube Clickhouse driver
Problem to solve
In Cube does not return zero values for a given da... (#415687 - closed) • Robert Hunt • 16.3 we added a transformer to our Cube query service to fill data for dates in the provided date range that don't contain any data. This means that when a user selects a specific date range, the visualization will show that full range on its axis.
The problem is that it gets particularly difficult to support all the possible Cube time dimension granulites. We would ideally be doing this as close to the data as possible (i.e. in Clickhouse) as it would be far more performant.
Proposed solution
The Clickhouse driver for Cube is no longer actively worked on by the Cube core team but is instead supported by community contributions. We should look at adding the WITH FILL
modifier to the driver either in https://github.com/cube-js/cube/blob/master/packages/cubejs-schema-compiler/src/adapter/ClickHouseQuery.js or https://github.com/cube-js/cube/blob/master/packages/cubejs-clickhouse-driver/src/ClickHouseDriver.ts. This could be toggled by the fillMissingDates
pivot config option as that is essentially what we are doing but to fill the entire range not just the bits between two points. This may alternatively need a new option to toggle this on and off.
Making this change should mean that we can remove the transformer once we update Cube.
Alternate approaches
In the event that the PR doesn't get merged by Cube, we have a few potential alternatives:
transformer implementation
Keep the existingWeight: 1 – Super easy right now, longer term it's going to be a problem.
This approach is the most straightforward in the short-term. We have a working implementation that hasn't fallen down yet, even with GitLab-levels of data. We can continue to push the core Cube.js team to work with us on finding a solution in the meantime.
The downsides here are more mid-to-long term:
- The transformer will get increasingly more complicated as we begin to support additional time dimensions (minutes, hours, years etc…).
- The transformer ignores any query limits sent and appends the missing data to cover the entire date range. The query limit set may be there for a specific reason we shouldn't/can't ignore.
- For smaller time dimensions, the amount of data the transformer has to fill will cause API timeouts eventually. So we'll need to put hard-limits into the API/UI to account for these like we have done in the past with Audit Events and other endpoints.
Fork Cube.js
Weight: 8 – Setting up deployment processes for new repo, plus minor work to update retrieval of the compiled image to our own. The bigger cost is longer term.
By forking Cube.js (their library is open-source, whilst their cloud offering is not), we can apply fixes and improvements without needing to wait a long time for traction on their side. We can upstream these fixes to their core and continue to work with them if we think it will be beneficial to the wider community. We also get full control of what does or doesn't go into our fork.
However, this would mean continued maintenance, support, and bug fixing for our fork, which could add a considerable burden to our small-ish team. It may also end up being tricky to reconcile between Cube.js's repo and our own fork as time goes on. Plus, although the repo licences-Apache 2 and MIT-allow us to fork and do what we want, we will need to maintain a list of what changes we have made to comply with Apache 2.
Find an alternative provider
Weight: 13 – Possible, but a lot of work, with plenty of room for scope creep.
Obviously the more drastic option since we have put in so much work already to integrate and work with Cube.js. But in the event that we're finding Cube.js difficult to contribute to and not meeting our needs any more, we could begin the process of finding an alternative solution.
This would be a big endeavour to migrate the devkit, analytics stack, API's, codebase, schemas, and existing visualization definitions (custom and inbuilt). And there are no guarantees that we'll find another query interface that meets all our latest needs anyway.
Create our own
Weight: 34 – Completely undefinable right now and would need considerable upfront work to figure out what a custom-made solution might look like.
The most drastic option so far. Super unlikely that we'd ever do this, but it would mean we could create a tailored API specific for our needs, use cases, and infrastructure requirements. Of course, the biggest downside is the time required to build and maintain our own solution.
Implementation plan
frontend - 3
GitHub PR -- Update the Cube Clickhouse driver to support the Clickhouse
WITH FILL
modifier for time related dimensions and filters. - Add tests to validate the
WITH FILL
modifier is applied.
documentation - 1
Add Cube.js setup guide- Add a guide to explain how to set up Cube.js locally and integrate it with the DevKit.
infrastructure - 1
DevKit/Analytics stack -- Update the Cube container to use the new release with the
WITH FILL
modifier
backend - 1
Remove transformer MR -- Remove transfomer + specs created in Cube does not return zero values for a given da... (!125136 - merged)
- Update documentation to reflect the improved functionality