The goal of this issue is to track the usage of ClickHouse throughout product development teams, and to identify possible points of collaboration or knowledge sharing. We are kicking off a ClickHouse working group and understanding these use cases better is important to achieving the working group's goals.
Please copy the template below for your respective group if you have an interest in or an active effort around the usage of ClickHouse:
Your group
Why are you interested in using ClickHouse?
...
How are you currently using it? In what features?
...
Are you currently facing any challenges or have any questions around using ClickHouse?
...
Do you have any relevant epic/issue links to share or additional comments?
Observability features have big-data, insert-heavy requirements that are not a good fit for Postgres or Redis. ClickHouse was selected by groupobservability as a good fit for these features' requirements. ClickHouse is an open-source, column-oriented database management system. It is attractive for these use cases because it can efficiently filter, aggregate, and sum across large numbers of rows.
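For illustration, here is a minimal sketch of the kind of filter-and-aggregate workload ClickHouse is designed for. It assumes a ClickHouse server reachable on localhost, the `clickhouse-driver` Python package, and a made-up `error_events` table; none of these names come from our actual setup.

```python
# Minimal sketch: the kind of filter/aggregate query ClickHouse handles well.
# Assumes a local ClickHouse server and `pip install clickhouse-driver`;
# the table and column names are illustrative only.
from clickhouse_driver import Client

client = Client(host="localhost")

# A simple MergeTree table ordered by the columns we filter and group on.
client.execute("""
    CREATE TABLE IF NOT EXISTS error_events (
        project_id  UInt64,
        fingerprint String,
        occurred_at DateTime
    ) ENGINE = MergeTree
    ORDER BY (project_id, occurred_at)
""")

# Columnar storage keeps scans like this cheap even over very large tables:
# only the referenced columns are read, and the aggregation is vectorised.
rows = client.execute("""
    SELECT fingerprint, count() AS occurrences
    FROM error_events
    WHERE project_id = 42
      AND occurred_at >= now() - INTERVAL 7 DAY
    GROUP BY fingerprint
    ORDER BY occurrences DESC
    LIMIT 10
""")
print(rows)
```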
How are you currently using it? In what features?
We have deployed a production ClickHouse database to GKE and manage the infrastructure ourselves. We do not currently support these features on self-managed GitLab instances, only on GitLab.com.
Our Integrated Error Tracking feature stores all errors in a ClickHouse database. This feature is in Open Beta in production.
ClickHouse will be the core data storage solution for our new Metrics feature. Once the architectural blueprint is approved, we will begin working on this feature in the subsequent milestones.
Are you currently facing any challenges or have any questions around using ClickHouse?
For our use cases we need to leverage the S3-backed disk for high scalability via the S3-backed MergeTree table engine; long term we have identified this as a must. S3 storage is currently only available on AWS. Although ClickHouse Inc. had scheduled availability within GCP several months ago, the timelines have now changed significantly, which requires us to migrate to AWS.
ClickHouse requires a large amount of memory to run: the recommendation is a minimum of 16-32 GB for a small development instance and 64-128 GB for a production instance. This raises interesting questions about our goal of a self-hosted Observability solution.
ClickHouse requires a highly efficient, scalable, and robust ingestion process to handle high volumes of inserts. As a result, we are proposing to refactor our ingestion process, which may add complexity to the system.
Do you have any relevant epic/issue links to share or additional comments?
We plan to build features that revolve around analysing historical test data. For example, we want to measure how unit tests perform over a period of time to identify flaky tests. We expect heavy writes, as each CI pipeline run would insert new data points.
Since we don't have an alternative time-series DB, I'd like to know if ClickHouse would be a more suitable data store than Postgres. If so, how could we leverage ClickHouse to analyse historical test data?
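As a purely hypothetical sketch of what this could look like, the following stores per-pipeline test results in ClickHouse and surfaces flakiness candidates (tests that both passed and failed in the same window). It assumes a local server and the `clickhouse-driver` package; the schema and names are invented for illustration.

```python
# Hypothetical schema and query for flaky-test analysis in ClickHouse.
# Assumes `pip install clickhouse-driver` and a local ClickHouse server;
# table and column names are placeholders, not an agreed design.
from clickhouse_driver import Client

client = Client(host="localhost")

client.execute("""
    CREATE TABLE IF NOT EXISTS unit_test_results (
        project_id  UInt64,
        test_name   String,
        pipeline_id UInt64,
        status      Enum8('passed' = 1, 'failed' = 2),
        finished_at DateTime
    ) ENGINE = MergeTree
    ORDER BY (project_id, test_name, finished_at)
""")

# A test that both passed and failed in the last 30 days is a flakiness candidate.
flaky = client.execute("""
    SELECT
        test_name,
        countIf(status = 'failed') AS failures,
        countIf(status = 'passed') AS passes
    FROM unit_test_results
    WHERE project_id = 42
      AND finished_at >= now() - INTERVAL 30 DAY
    GROUP BY test_name
    HAVING failures > 0 AND passes > 0
    ORDER BY failures DESC
""")
print(flaky)
```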
How are you currently using it? In what features?
Not currently using it.
Are you currently facing any challenges or have any questions around using ClickHouse?
An overview of what ClickHouse is suitable for would be great, as well as guidelines around how to model data in a ClickHouse database.
Do you have any relevant epic/issue links to share or additional comments?
We are currently doing calculations for analytics and reporting purposes (e.g. how many merge requests were created in the last 28 days) on the production database of an instance, or alternatively tracking counters in Redis (see the ServicePing documentation).
Both solutions are born out of necessity, since we do not currently have a DBMS better suited for analytics workloads available in self-managed instances. We are running into problems with scalability (calculations on PostgreSQL) and with the type of data we can store (Redis).
Having a DBMS that can efficiently do calculations across large numbers of rows available in all GitLab instances would allow us to rethink the ServicePing architecture.
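As a rough illustration only, the 28-day example above could become a single aggregation against a replicated table instead of a heavy query on the production database or a Redis counter. The `merge_request_events` table and its columns are assumptions, not an existing schema; the sketch uses the `clickhouse-driver` package against a local server.

```python
# Illustrative ServicePing-style metric computed in ClickHouse instead of
# the production PostgreSQL database. Assumes a hypothetical
# `merge_request_events` table and `pip install clickhouse-driver`.
from clickhouse_driver import Client

client = Client(host="localhost")

# "How many merge requests were created in the last 28 days?"
rows = client.execute("""
    SELECT count()
    FROM merge_request_events
    WHERE action = 'created'
      AND created_at >= now() - INTERVAL 28 DAY
""")
merge_requests_last_28_days = rows[0][0]
print(merge_requests_last_28_days)
```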
How are you currently using it? In what features?
Not currently using it, and we can only consider it if it's available in self-managed instances.
Are you currently facing any challenges or have any questions around using ClickHouse?
How likely is it that we can get ClickHouse as a required part of a self-managed GitLab instance?
We need a scalable, open-source, and packageable DBMS for collecting and querying analytical data with regard to product analytics, experimentation, A/B testing, and other use cases.
How are you currently using it? In what features?
We're currently bundling it as part of our Analytics Stack, a deployable cluster for self-managed and SaaS to facilitate the usage of Product Analytics features.
Are you currently facing any challenges or have any questions around using ClickHouse?
Currently we're looking at how to manage it from an operator and proxy standpoint. We're aware of Altinity's operator, as well as one that Observability has built, but we need to investigate how that would work inside our Stack cluster.
We also need to implement Chproxy as there are additional use cases we need to meet beyond the general interactions of collecting and querying data within our Stack services.
Compliance audit events currently have the biggest database table and are running into query optimisation issues due to its size. We are investigating using ClickHouse to improve query speed, but also to provide better reporting with aggregation and filtering of results. ClickHouse would also offer us longer-term scalability.
How are you currently using it? In what features?
We are not currently using ClickHouse, but are investigating how we can fit it into our stack and solve the problem above.
Are you currently facing any challenges or have any questions around using ClickHouse?
Not yet, but I'm sure there will be 😉
Do you have any relevant epic/issue links to share or additional comments?
@nrosandich are the optimisation issues strictly related to SaaS, or are you also experiencing them on self-managed instances? I was wondering how similar your use case is to ours (~"group::product intelligence"), since we're both essentially collecting events from GitLab usage.
@bastirehm the issues we are seeing are on SaaS, but I don't think it is exclusive, and we would be looking at rolling this out for self-managed as well.
We're starting to see some scalability challenges around measuring storage usage on the GitLab.com Container Registry due to the huge number of image layers (5+ million) on the largest namespaces. We're currently using the PostgreSQL backend for these queries, but the ability to aggregate millions of rows is not one of its main strengths.
Having a subset of the registry data replicated to ClickHouse and using that for usage queries was recently suggested as a better alternative (&9415 (comment 1198168887)).
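To make the idea concrete, a rough sketch of the usage aggregation is below, assuming layer metadata were replicated into a hypothetical `container_registry_layers` ClickHouse table; the table, columns, and connection details are placeholders, and the `clickhouse-driver` package is assumed.

```python
# Rough sketch of a registry storage-usage aggregation over a hypothetical
# replicated table. Assumes a local ClickHouse server and clickhouse-driver.
from clickhouse_driver import Client

client = Client(host="localhost")

# Total stored bytes per namespace across millions of layer rows -- the kind
# of aggregation that strains PostgreSQL but is a natural fit for ClickHouse.
usage = client.execute("""
    SELECT namespace_id, sum(layer_size_bytes) AS total_bytes
    FROM container_registry_layers
    GROUP BY namespace_id
    ORDER BY total_bytes DESC
""")
print(usage[:10])
```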
How are you currently using it? In what features?
Not currently using it.
Are you currently facing any challenges or have any questions around using ClickHouse?
Not yet, but we have an issue to investigate the feasibility of using ClickHouse for the registry usage data/queries in the short/mid-term (container-registry#844).
Do you have any relevant epic/issue links to share or additional comments?
Issue where ClickHouse was suggested as an alternative for the registry usage queries: &9415 (comment 1198168887)
We're facing challenges running analytical queries against the PostgreSQL database. Some of our features are already timing out due to the large data volume.
Replicating part of the core data from the PG database into ClickHouse would help us improve the performance of our features and would enable us to build more advanced analytical features.
How are you currently using it? In what features?
Not currently using it. We have several ideas and have created PoCs for possible CH DB schemas and integrations, mostly tested locally.
We're interested in CH purely from a GitLab product support standpoint for the moment. We'd like to be engaged in the process early on to anticipate upcoming changes, etc.
How are you currently using it? In what features?
none
Are you currently facing any challenges or have any questions around using ClickHouse?
The main challenge is understanding how it will impact the general GitLab setup and which components need to be altered to accommodate its needs.
Do you have any relevant epic/issue links to share or additional comments?
Unfortunately I'm not sure what the appropriate label for Infrastructure would be here.
Why are you interested in using ClickHouse?
We ourselves are not interested, and we only have it deployed because it's a required component of Sentry, which the dev and support teams use. The current Sentry instance for GitLab internal use https://sentry.gitlab.net/gitlab/ is very outdated, basically unmanageable and struggling with the load, which is why we're moving it to Kubernetes.
How are you currently using it? In what features?
As part of Sentry, which has an event-searching component called Snuba that uses ClickHouse as a backing datastore. Transactions and errors (Sentry terminology) are stored and queried by Snuba.
Are you currently facing any challenges or have any questions around using ClickHouse?
We use a Helm chart to install this ClickHouse Kubernetes operator on our GKE cluster in the ops environment. The process to install it is quite convoluted, as you can see from this.
I know your team has forked the very same operator here, but it appears to be heavily customised for Opstrace and therefore unsuitable for Sentry's needs. We have some of our own customisations in this chart, but otherwise the configuration of the cluster itself for Sentry's use is here.
Do you have any relevant epic/issue links to share or additional comments?
Our interest in ClickHouse is limited to the setup/deployment perspective, in case a new ClickHouse database component will need to be set up and deployed to support gitlab.com.
How are you currently using it? In what features?
Not using it.
Are you currently facing any challenges or have any questions around using ClickHouse?
No
Do you have any relevant epic/issue links to share or additional comments?
We're facing challenges running queries against the PostgreSQL database. Some of our features are already timing out due to the large data volume. In particular, I am thinking about the events table that is used for users' activity and calendar actions.
Replicating part of the core data from the PG database into ClickHouse would help us improve the performance of our features.
How are you currently using it? In what features?
n/a
Are you currently facing any challenges or have any questions around using ClickHouse?
We are not working actively on it yet.
Do you have any relevant epic/issue links to share or additional comments?
Packagecloud (the software that powers packages.gitlab.com) stores a row in MySQL for every package that gets downloaded. It contains the user's IP address, user agent, and the package ID. The table has approximately 260 million rows.
We are not doing anything with this data. Ideally this data would live in ClickHouse and not MySQL, as it's a massive table that just keeps on growing forever (nothing in Packagecloud keeps it under control).
With the data in ClickHouse, we could create dashboards that show, for example, the increase in package downloads over time, or how many packages are downloaded per day/month/year, etc.
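Purely as an illustration of those dashboard queries, the sketch below assumes the download log were copied into an invented `package_downloads` ClickHouse table; names, hosts, and the `clickhouse-driver` package are all assumptions.

```python
# Hypothetical dashboard query once the Packagecloud download log lives in
# ClickHouse. Assumes a local server and `pip install clickhouse-driver`;
# the table and columns are invented for this example.
from clickhouse_driver import Client

client = Client(host="localhost")

# Downloads per month, e.g. to chart growth of package downloads over time.
downloads_per_month = client.execute("""
    SELECT toStartOfMonth(downloaded_at) AS month, count() AS downloads
    FROM package_downloads
    GROUP BY month
    ORDER BY month
""")
print(downloads_per_month)
```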
How are you currently using it? In what features?
We're not using it for anything related to Packagecloud.
Are you currently facing any challenges or have any questions around using ClickHouse?
Does it make sense to store this data in ClickHouse?
Since Packagecloud is a commercial third-party package, any thoughts around how we would regularly migrate rows from the aforementioned MySQL table to ClickHouse? Maybe using something like this?
Do you have any relevant epic/issue links to share or additional comments?
In case you don't want the CH DB to access other services, you could do periodic inserts: for some GitLab features we implemented a boring solution where we periodically check for new records in PG, export a batch to CSV (1-2 MB files), and then send the CSV to CH. There is some persistence logic involved to keep track of the progress so we won't re-insert the same data.
This could be as simple as a cron job. Note: querying CH is generally fast, but the schema should be defined in a way that supports the queries; without that, we might face slow queries and large memory usage (failed queries).
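A minimal sketch of that boring batch export, assuming a hypothetical `events` table on both sides, a watermark file for progress tracking, and the `psycopg2` and `requests` packages; connection strings and hosts are placeholders.

```python
# Sketch of the periodic PG -> CSV -> ClickHouse export described above.
# All table names, hosts, and credentials are placeholders.
import csv
import io
import pathlib

import psycopg2
import requests

WATERMARK_FILE = pathlib.Path("last_exported_id")  # simple progress tracking
BATCH_SIZE = 10_000

def last_exported_id() -> int:
    return int(WATERMARK_FILE.read_text()) if WATERMARK_FILE.exists() else 0

def export_batch() -> None:
    last_id = last_exported_id()

    # Fetch the next batch of rows past the watermark; format the timestamp
    # in SQL so the CSV is directly parseable as a ClickHouse DateTime.
    with psycopg2.connect("dbname=gitlabhq_production") as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, project_id, action, "
            "       to_char(created_at, 'YYYY-MM-DD HH24:MI:SS') "
            "FROM events WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, BATCH_SIZE),
        )
        rows = cur.fetchall()

    if not rows:
        return

    # Serialise the batch as CSV in memory (real exports could be ~1-2 MB files).
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)

    # ClickHouse accepts CSV bodies on its HTTP interface (default port 8123).
    response = requests.post(
        "http://clickhouse:8123",
        params={"query": "INSERT INTO events (id, project_id, action, created_at) FORMAT CSV"},
        data=buffer.getvalue().encode("utf-8"),
    )
    response.raise_for_status()

    # Only advance the watermark once ClickHouse confirmed the insert,
    # so a failed run is retried rather than silently skipped.
    WATERMARK_FILE.write_text(str(rows[-1][0]))

if __name__ == "__main__":
    export_batch()  # run periodically, e.g. from a cron job
```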
Good to know that CH can read from MySQL databases; however, I think the boring solution might be best for two main reasons:
1. Our Cloud SQL MySQL DB is not public-facing (unsure if we can access it over PSC somehow ... I think they support it when they're the producer and we're the consumers -- not the inverse).
2. The table in question needs to be denormalized, as it has a column for package_id that we'd need to join with the packages table.
So we could run a cron job in k8s that queries MySQL, denormalizes the data, exports it as CSV, sends it to CH, and deletes the rows from MySQL. Do you have a link to that script you mentioned, @ahegyi, that might help here?
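This is not the script referenced above, just a rough illustration of the steps under assumed names: table/column names, hosts, and credentials are made up, and the `pymysql` and `requests` packages are assumed.

```python
# Hypothetical sketch of the MySQL -> denormalize -> CSV -> ClickHouse cron job.
# Not the existing script; every identifier here is a placeholder.
import csv
import io

import pymysql
import requests

BATCH_SIZE = 10_000

def migrate_batch() -> None:
    conn = pymysql.connect(host="packagecloud-db", user="exporter",
                           password="secret", database="packagecloud")
    try:
        with conn.cursor() as cur:
            # Denormalise: join the download log with the packages table so the
            # ClickHouse copy doesn't need the join at query time.
            cur.execute("""
                SELECT d.id, d.ip_address, d.user_agent, p.name, d.downloaded_at
                FROM downloads d
                JOIN packages p ON p.id = d.package_id
                ORDER BY d.id
                LIMIT %s
            """, (BATCH_SIZE,))
            rows = cur.fetchall()

        if not rows:
            return

        buffer = io.StringIO()
        csv.writer(buffer).writerows(rows)

        # Push the batch to ClickHouse over its HTTP interface.
        requests.post(
            "http://clickhouse:8123",
            params={"query": "INSERT INTO package_downloads FORMAT CSV"},
            data=buffer.getvalue().encode("utf-8"),
        ).raise_for_status()

        # Only delete from MySQL after ClickHouse accepted the batch.
        last_id = rows[-1][0]
        with conn.cursor() as cur:
            cur.execute("DELETE FROM downloads WHERE id <= %s", (last_id,))
        conn.commit()
    finally:
        conn.close()

if __name__ == "__main__":
    migrate_batch()  # schedule as a Kubernetes CronJob
```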