Store GitLab's big data in Elasticsearch or PostgreSQL with extensions?

Update: It seems we'll go with ES:

  1. Only viable open source solution for logs currently on the market (see the indexing sketch below).
  2. Already using it for advanced code search.
  3. Also using it for storing traces via Jaeger https://gitlab.com/gitlab-org/gitlab-ee/issues/5694

And then for Prometheus we'll use Thanos.
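
To make the logging use case (point 1 above) concrete, here is a minimal sketch of shipping a structured log entry into Elasticsearch over its plain REST document API. The index name, field names, and endpoint are assumptions for illustration, not a decided schema.

```python
# Minimal sketch: index one structured log entry into Elasticsearch via its
# REST document API. Index name, fields, and endpoint are assumed for
# illustration only; the real schema would be decided per use case.
import json
from datetime import datetime, timezone

import requests

ES_URL = "http://localhost:9200"      # assumed local ES instance
INDEX = "gitlab-logs-2018.09"         # hypothetical time-based index

log_entry = {
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "project_id": 42,                 # hypothetical project
    "severity": "error",
    "message": "worker timed out after 60s",
}

# POST /<index>/_doc creates a document with an auto-generated ID.
resp = requests.post(
    f"{ES_URL}/{INDEX}/_doc",
    data=json.dumps(log_entry),
    headers={"Content-Type": "application/json"},
    timeout=5,
)
resp.raise_for_status()
print(resp.json()["_id"])
```

Reading the logs back is then a POST to `/<index>/_search` with a JSON query body, which is also what Kibana uses under the hood.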


Where to store GitLab's big data? We can use Elasticsearch or separate PostgreSQL databases that we scale with extensions.

Types of big data in GitLab:

  1. Vulnerability information: PostgreSQL
  2. Logs: PostgreSQL with the Timescale extension https://github.com/timescale/timescaledb (see the hypertable sketch after this list)
  3. Tracing: PostgreSQL with Timescale extension https://github.com/timescale/timescaledb
  4. APM: Elasticsearch => PostgreSQL with Timescale extension https://github.com/timescale/timescaledb
  5. Metrics: Prometheus => PostgreSQL with Prometheus extension https://github.com/timescale/pg_prometheus
  6. Code search: distributed git / ES => PostgreSQL with Citus extension https://www.citusdata.com/blog/2016/04/28/scalable-product-search/
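
To make the Timescale option for logs, tracing, and APM (items 2-4) concrete, here is a minimal sketch of creating a hypertable for log lines. The table layout and connection string are assumptions for illustration; `create_hypertable` is the standard TimescaleDB call that turns a regular table into a time-partitioned one.

```python
# Minimal sketch, assuming a PostgreSQL instance with the TimescaleDB
# extension available. Table name, columns, and DSN are illustrative only.
import psycopg2

conn = psycopg2.connect("dbname=gitlab_logs user=gitlab")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

# Enable the extension (no-op if already installed).
cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb CASCADE;")

# A plain relational table for log lines...
cur.execute("""
    CREATE TABLE IF NOT EXISTS logs (
        time        TIMESTAMPTZ NOT NULL,
        project_id  BIGINT      NOT NULL,
        severity    TEXT,
        message     TEXT
    );
""")

# ...turned into a hypertable so TimescaleDB chunks it by time under the hood.
cur.execute("SELECT create_hypertable('logs', 'time', if_not_exists => TRUE);")

# Writes and reads stay ordinary SQL, which is what keeps the Rails
# integration simple.
cur.execute(
    "INSERT INTO logs (time, project_id, severity, message) "
    "VALUES (now(), %s, %s, %s);",
    (42, "error", "worker timed out after 60s"),
)
cur.close()
conn.close()
```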

Advantages of Elasticsearch:

  1. Scales better
  2. Best tool for the job
  3. Popular for that use case
  4. Supported as a backend by the tools we want to integrate (Jaeger, for example, can use it for storage)

Advantages of PostgreSQL:

  1. We already ship it in Omnibus
  2. Getting more popular for big data
  3. We can standardize on it
  4. Easy to integrate with Rails
  5. Flexible queries
  6. No JVM

I know from other startups that they spend a lot of time developing a seamlessly scaling system for their product category. GitLab is integrating so many product categories that this isn't an option for us. We tried CephFS to scale file storage but settled on sharding per project. Even Google has a hard time replacing Borgmon (the inspiration for Prometheus) with a seamlessly scaling system. We should embrace the sharding per project that we started. Auto DevOps should deploy a separate PostgreSQL database per project for logs, tracing, APM, and metrics (imported from Prometheus for longer-term storage). Auto DevOps already deploys Prometheus per project, so this is in line with that.
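
A hedged sketch of what sharding per project could look like at the application layer: each project gets its own observability database, and lookups resolve the connection string from a naming convention. The naming convention and helper below are hypothetical, purely to illustrate the routing idea; the real wiring would live in Rails and Auto DevOps.

```python
# Hypothetical sketch of per-project routing to a dedicated observability
# database. The DSN naming convention and query are illustrative only.
import psycopg2


def observability_dsn(project_id: int) -> str:
    # Assumption: Auto DevOps provisions one database per project and names
    # it predictably, e.g. gitlab_observability_<project_id>.
    return f"dbname=gitlab_observability_{project_id} user=gitlab"


def recent_errors(project_id: int, limit: int = 20):
    """Read the latest error log lines from this project's own database."""
    with psycopg2.connect(observability_dsn(project_id)) as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT time, message FROM logs "
                "WHERE severity = 'error' "
                "ORDER BY time DESC LIMIT %s;",
                (limit,),
            )
            return cur.fetchall()


if __name__ == "__main__":
    for ts, message in recent_errors(42):
        print(ts, message)
```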

Code search is a hard one. We keep having problems with Elasticsearch, and ES requires you to store everything twice. We have a boring solution for searching within a group by doing a git grep across its projects (sketched below). Right now I never use the ability to search across all projects in GitLab or GitHub https://github.com/search?q=module+Project&type=Code&utf8=%E2%9C%93 but I see the use case when you host lots of open source. Since I'm undecided, we should probably keep what we have now.
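
A minimal sketch of that boring group-search approach: run `git grep` in each project's repository and merge the hits. The repository layout and helper names are assumptions for illustration, not how GitLab actually lays out its storage.

```python
# Minimal sketch, assuming bare project repositories live under a common
# directory per group (layout and paths are hypothetical).
import subprocess
from pathlib import Path

GROUP_DIR = Path("/var/opt/gitlab/git-data/repositories/my-group")  # assumed


def group_grep(pattern: str):
    """Run `git grep` in every project of the group and yield matches."""
    for repo in sorted(GROUP_DIR.glob("*.git")):
        # `git grep` in a bare repo needs a tree-ish; search the default branch.
        result = subprocess.run(
            ["git", "--git-dir", str(repo), "grep", "-n", pattern, "HEAD"],
            capture_output=True,
            text=True,
        )
        # Exit code 1 just means "no matches"; anything else is a real error.
        if result.returncode not in (0, 1):
            raise RuntimeError(result.stderr.strip())
        for line in result.stdout.splitlines():
            yield f"{repo.name}: {line}"


if __name__ == "__main__":
    for hit in group_grep("module Project"):
        print(hit)
```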

/cc @dzaporozhets @markpundsack
