[Discovery] Use InfluxDB for some Analytics features?

Problem to solve

For data-intensive Analytics features like Productivity Analytics and Code Hotspots, Postgres is proving to be a bottleneck for both reads and writes. InfluxDB has been proposed as an alternative data store. Does InfluxDB suit our needs for these features?

Related issues

Discovery: Use Elasticsearch for some Analytics features?

Some advantages of InfluxDB

Designed specifically for time series data, which is mainly what these data-intensive Analytics features consume.
In use since 2017 for storing Metrics data. (See the installation instructions.)
Customers can use Grafana to build custom dashboards on the data that is collected.

Some drawbacks of InfluxDB

Not all self-managed instances use InfluxDB. Those instances would not have access to any Analytics features that rely on it.
"SQL JOINs aren’t available for InfluxDB measurements" (source)
"InfluxDB is not a full CRUD database but more like a CR-ud, prioritizing the performance of creating and reading data over update and destroy, and preventing some update and destroy behaviors to make create and read more performant: To update a point, insert one with the same measurement, tag set, and timestamp. ... You can’t update or rename tags yet ... You can’t delete tags by tag key (as opposed to value)" (source)
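The update-by-overwrite behavior quoted above can be modeled with a small in-memory sketch. This is not InfluxDB client code: PointStore, the file_touches measurement, and the tag names are hypothetical, made up for illustration. The sketch assumes a point is identified by its measurement, tag set, and timestamp, and that writing to an existing identity merges the new fields over the old ones.

```ruby
# Hypothetical in-memory model of InfluxDB point identity; not a real client.
class PointStore
  def initialize
    @points = {}
  end

  # A point is identified by measurement, tag set, and timestamp.
  # Writing to an existing identity merges fields, with new values winning.
  def write(measurement, tags, fields, timestamp)
    key = [measurement, tags.sort, timestamp]
    @points[key] = (@points[key] || {}).merge(fields)
  end

  def read(measurement, tags, timestamp)
    @points[[measurement, tags.sort, timestamp]]
  end

  def size
    @points.size
  end
end

store = PointStore.new
tags = { project_id: '42', path: 'app/models/user.rb' }

store.write('file_touches', tags, { count: 3 }, 1_572_480_000)
# Same measurement, tag set, and timestamp: this replaces the count above.
store.write('file_touches', tags, { count: 5 }, 1_572_480_000)
# A different tag value is a different point, not an update.
store.write('file_touches', tags.merge(path: 'README.md'), { count: 1 }, 1_572_480_000)
```

One consequence for Analytics: if a file path or project identifier is stored as a tag, a rename cannot be applied to existing points, so a backfill would have to rewrite them under the new tag value.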

In-house expertise

It looks like Pawel Chojnacki pawel@chojnacki.ws did most of the implementation in 2017 (see, for example, b668aaf4), but he is no longer with the company. @djensen is currently requesting more information from the Monitor team and will update this section.

Relevant database fields

"The following measurements are currently stored in InfluxDB:

PROCESS_file_descriptors
PROCESS_gc_statistics
PROCESS_memory_usage
PROCESS_method_calls
PROCESS_object_counts
PROCESS_transactions
PROCESS_views
events

Here, PROCESS is replaced with either rails or sidekiq depending on the process type. ... [event] is used to store generic events such as the number of Git pushes, Emails sent, etc. Each point in this measurement has a single value field called count. The value of this field is simply set to 1." (source) (An "event" is a time series data point that occurs at irregular intervals; for Analytics we may be more interested in "metrics", which are recorded at regular intervals.)
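To make the shape of an events point concrete, here is a minimal sketch that formats one as InfluxDB line protocol (measurement, tag set, field set, timestamp). The helper names escape_tag and event_line are hypothetical, and so are the tag named event, the integer encoding of count (the trailing i), and the nanosecond timestamp; the quoted schema only states that count is set to 1.

```ruby
# Escape the characters that line protocol treats as separators inside tags.
def escape_tag(value)
  value.to_s.gsub(/[,= ]/) { |char| "\\#{char}" }
end

# Build one line-protocol entry for the events measurement.
# Assumption: the event name is stored in a tag called "event".
def event_line(event_name, time, extra_tags = {})
  tags = { event: event_name }.merge(extra_tags)
  tag_set = tags.map { |key, value| [key.to_s, value.to_s] }
                .sort
                .map { |key, value| "#{escape_tag(key)}=#{escape_tag(value)}" }
                .join(',')
  nanoseconds = time.to_i * 1_000_000_000 + time.nsec

  # count is always 1; the "i" suffix marks an integer field.
  "events,#{tag_set} count=1i #{nanoseconds}"
end

puts event_line('push', Time.at(1_572_480_000).utc)
# prints: events,event=push count=1i 1572480000000000000
```

Because every point carries count=1, totals such as "pushes per hour" would come from summing count over a time window rather than from reading a stored aggregate.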

Relevant files

lib/gitlab/metrics/influx_db.rb shows how InfluxDB::Client can be used
lib/gitlab/metrics/metric.rb has a to_hash method with an important comment about concurrency
lib/gitlab/metrics/samplers/influx_sampler.rb is a "Class that sends certain metrics to InfluxDB at a specific interval"

What does success look like, and how can we measure that?

Success is an InfluxDB proof of concept that confirms:

Extremely fast writes (for backfilling data at the scale of gitlab.com)
Extremely fast querying (for quick page load times)
A manageable increase in complexity
Analytics data fits InfluxDB's time series or event structure requirements

This proof of concept should emulate Production volume for a typical day. The most straightforward approach is probably to generate Code Hotspots data: specifically, to calculate and store the total number of times each file in each Project was touched by a commit on that day.
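The calculation for the proof of concept could look roughly like the sketch below. It is plain Ruby with no database involved; the Commit struct, its fields, and daily_file_touches are hypothetical names chosen for illustration, and a real implementation would read commits from the repository rather than from an array.

```ruby
require 'date'

# Hypothetical commit record: project, commit time, and the paths it changed.
Commit = Struct.new(:project_id, :committed_at, :paths)

# For one calendar day, count how many commits touched each file in each project.
# Returns a hash keyed by [project_id, path].
def daily_file_touches(commits, day)
  commits.each_with_object(Hash.new(0)) do |commit, counts|
    next unless commit.committed_at.to_date == day

    # A commit counts at most once per file, even if a path is listed twice.
    commit.paths.uniq.each do |path|
      counts[[commit.project_id, path]] += 1
    end
  end
end

commits = [
  Commit.new(1, Time.utc(2019, 10, 30, 9), ['app/a.rb', 'README.md']),
  Commit.new(1, Time.utc(2019, 10, 30, 17), ['app/a.rb']),
  Commit.new(2, Time.utc(2019, 10, 30, 12), ['app/a.rb']),
  Commit.new(1, Time.utc(2019, 10, 31, 8), ['app/a.rb']) # next day, ignored
]

touches = daily_file_touches(commits, Date.new(2019, 10, 30))
touches.each { |(project_id, path), count| puts "#{project_id} #{path} #{count}" }
```

Each entry in the result maps naturally onto one time series point per project, file, and day, which is the volume the proof of concept would need to write and query.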

Edited Oct 31, 2019 by Dan Jensen