Proposal for an Error Tracking system inside GitLab
We have built a Sentry integration in the past, tied directly to the project. Because of Sentry's license change, we can't keep using it with the soon-to-be-released version 10.
As part of version 10, they switched from PostgreSQL to ClickHouse (a columnar database developed by Yandex). ClickHouse still uses SQL as the query language, but has much better performance for this type of data. This is also part of what allows them to build their analytics features, simplifying their queries and letting them scale.
Building an error tracking solution has a few challenges:
1. You need to build libraries/integrations for the variety of languages/frameworks you want to support.
2. You need to handle varying/bursty traffic on your ingestion endpoint (with tight, controlled API rate limiting).
3. You need good data store/data management policies (data retention policies, etc.).
We can get concern 1 for free by using any of the existing open-source integrations, whether Sentry's own client libraries or Airbrake's.
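As a rough illustration of what that reuse could look like (a sketch, not a decision): an application would keep using the official sentry-go SDK as-is and only point the DSN at our own ingestion endpoint. The hostname and DSN below are placeholders.

```go
package main

import (
	"errors"
	"log"
	"time"

	"github.com/getsentry/sentry-go"
)

func main() {
	// Hypothetical DSN pointing at a GitLab-hosted ingestion endpoint instead of
	// sentry.io. The exact DSN layout (public key + project id) would need to
	// match what the SDK expects.
	err := sentry.Init(sentry.ClientOptions{
		Dsn: "https://publickey@errors.gitlab.example.com/1",
	})
	if err != nil {
		log.Fatalf("sentry.Init: %v", err)
	}
	defer sentry.Flush(2 * time.Second)

	// The SDK serializes and POSTs the event in the standard Sentry format,
	// which our endpoint can accept unchanged.
	sentry.CaptureException(errors.New("something went wrong"))
}
```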
Concern 2 requires the ingestion endpoint to be backed by a distributed queueing system, like Kafka (which Sentry uses) or, better, RabbitMQ or NATS (so we don't have to maintain a Java ecosystem).
Part of the challenge of implementing concern 2 is tied to concern 3, the data store/data management choice. Luckily we can simply reuse Sentry's own implementation there, so there is no need to figure out schema decisions or how to store this type of data. They have been doing this for a long time and we can learn from their experience.
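A minimal sketch of the ingestion side with Go + NATS follows. It accepts an event over HTTP and hands it straight to the queue so bursts are absorbed there rather than at the writer. The subject name `errors.raw`, the port, and the route are assumptions, not a design.

```go
package main

import (
	"io"
	"log"
	"net/http"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the NATS server that buffers bursts between ingestion and storage.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("nats connect: %v", err)
	}
	defer nc.Close()

	// Accept Sentry-style event payloads and publish them to the queue.
	// Authentication and rate limiting would sit in front of (or inside) this handler.
	http.HandleFunc("/api/1/store/", func(w http.ResponseWriter, r *http.Request) {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if err := nc.Publish("errors.raw", body); err != nil {
			http.Error(w, "queue unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```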
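To make concern 3 concrete: ClickHouse gives us retention largely for free via table TTLs. The table below is purely illustrative, not Sentry's actual schema (which is what we would reuse); it only shows the shape of the idea, with driver and DSN details depending on the client library we pick.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/ClickHouse/clickhouse-go" // registers the "clickhouse" database/sql driver
)

// Illustrative only: NOT Sentry's real schema. A columnar error-events table
// with a built-in 90-day retention policy expressed as a TTL.
const createEventsTable = `
CREATE TABLE IF NOT EXISTS error_events (
    project_id  UInt64,
    event_id    String,
    timestamp   DateTime,
    level       LowCardinality(String),
    message     String,
    payload     String
) ENGINE = MergeTree
ORDER BY (project_id, timestamp)
TTL timestamp + INTERVAL 90 DAY
`

func main() {
	// DSN format depends on the ClickHouse client library and version in use.
	db, err := sql.Open("clickhouse", "tcp://127.0.0.1:9000?database=default")
	if err != nil {
		log.Fatalf("open clickhouse: %v", err)
	}
	defer db.Close()

	if _, err := db.Exec(createEventsTable); err != nil {
		log.Fatalf("create table: %v", err)
	}
}
```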
Why should we do this instead of forking Sentry?
Sentry is a big codebase, but most of it reimplements things we already have inside GitLab, like authentication, the users/roles/permissions system, syncing and reading repository data, etc.
Sentry is also a Python codebase, and Python is a language we don't use inside the GitLab project (the only exception is Meltano).
How feasible is this (what would a POC look like)?
Considering we can reuse the existing client libraries and the decisions they have already made, we could reimplement the ingestion with Go + NATS and insert the data into their ClickHouse schema.
With that in place, we can reuse the error-tracking frontend we have today for the Sentry integration, but instead of consuming the data from their API, we query the ClickHouse database directly.
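The consumer side of that POC could be as small as the sketch below: drain the NATS subject and write each event into ClickHouse. Again, the subject name, table, and columns are the illustrative ones from earlier, not the Sentry schema we would actually reuse; a real consumer would also batch inserts, since ClickHouse prefers large batched writes.

```go
package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go"
	"github.com/nats-io/nats.go"
)

// Minimal shape of an incoming event; a real Sentry payload carries far more fields.
type errorEvent struct {
	ProjectID uint64 `json:"project_id"`
	EventID   string `json:"event_id"`
	Level     string `json:"level"`
	Message   string `json:"message"`
}

// insertEvent writes a single event using the transaction + prepared statement
// pattern the clickhouse-go driver expects for inserts. A production consumer
// would commit many events per transaction instead of one at a time.
func insertEvent(db *sql.DB, ev errorEvent, raw []byte) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	stmt, err := tx.Prepare(`INSERT INTO error_events
		(project_id, event_id, timestamp, level, message, payload)
		VALUES (?, ?, ?, ?, ?, ?)`)
	if err != nil {
		tx.Rollback()
		return err
	}
	defer stmt.Close()
	if _, err := stmt.Exec(ev.ProjectID, ev.EventID, time.Now().UTC(), ev.Level, ev.Message, string(raw)); err != nil {
		tx.Rollback()
		return err
	}
	return tx.Commit()
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatalf("nats connect: %v", err)
	}
	defer nc.Close()

	db, err := sql.Open("clickhouse", "tcp://127.0.0.1:9000?database=default")
	if err != nil {
		log.Fatalf("open clickhouse: %v", err)
	}
	defer db.Close()

	// Drain the queue and persist each event.
	_, err = nc.Subscribe("errors.raw", func(msg *nats.Msg) {
		var ev errorEvent
		if err := json.Unmarshal(msg.Data, &ev); err != nil {
			log.Printf("skipping malformed event: %v", err)
			return
		}
		if err := insertEvent(db, ev, msg.Data); err != nil {
			log.Printf("insert failed: %v", err)
		}
	})
	if err != nil {
		log.Fatalf("subscribe: %v", err)
	}

	select {} // block forever; the subscription callback does the work
}
```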
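For the frontend, "going direct to ClickHouse" would mean the Rails API layer (or a small Go service) running aggregations like the one sketched below: top error messages per project with counts and last-seen timestamps. The table and columns are the hypothetical ones from the earlier sketches, not the real schema.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go"
)

func main() {
	db, err := sql.Open("clickhouse", "tcp://127.0.0.1:9000?database=default")
	if err != nil {
		log.Fatalf("open clickhouse: %v", err)
	}
	defer db.Close()

	// The kind of aggregation the error list page needs: most frequent messages
	// for one project over the last day.
	rows, err := db.Query(`
		SELECT message, count() AS occurrences, max(timestamp) AS last_seen
		FROM error_events
		WHERE project_id = ? AND timestamp > now() - INTERVAL 1 DAY
		GROUP BY message
		ORDER BY occurrences DESC
		LIMIT 20`, uint64(1))
	if err != nil {
		log.Fatalf("query: %v", err)
	}
	defer rows.Close()

	for rows.Next() {
		var message string
		var occurrences uint64
		var lastSeen time.Time
		if err := rows.Scan(&message, &occurrences, &lastSeen); err != nil {
			log.Fatalf("scan: %v", err)
		}
		fmt.Printf("%6d  %s  (last seen %s)\n", occurrences, message, lastSeen)
	}
}
```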
This will give us an MVP for error tracking, and Product can evolve the vision from that point.
It also moves us one step closer to our main goal of an integrated solution for the entire DevOps ecosystem, and this is a good excuse to do so.
I also suggest we make the basic functionality open source and build EE-only features on top of it.
Other comments/considerations:
Sentry is missing a big opportunity to follow a model like ours: be open-core, keep the basic/useful functionality free, and charge for more advanced features (aimed at big corporations). That would have allowed them to use GitLab as an entry point for their paid tier, instead of forcing us to fork in order to provide that functionality to our customers.
I'm making this confidential because I think it lets us be more open in the discussion without causing another unintentional PR incident.