[Discovery] Use InfluxDB for some Analytics features?
Problem to solve
For data-intensive Analytics features like Productivity Analytics and Code Hotspots, Postgres is proving to be a limitation both for reading and writing. InfluxDB has been proposed as an alternative data store. Does InfluxDB suit our needs for these features?
Related issues
Some advantages of InfluxDB
- Designed specifically for time series data, which is mainly what these data-intensive Analytics features consume.
- In use since 2017 for storing Metrics data. (See the installation instructions.)
- Customers can use Grafana to build custom dashboards on the data that is collected.
Some drawbacks of InfluxDB
- Not all self-managed instances use InfluxDB. Those instances would not have access to any Analytics features that rely on it.
- "SQL JOINs aren’t available for InfluxDB measurements" (source)
- "InfluxDB is not a full CRUD database but more like a CR-ud, prioritizing the performance of creating and reading data over update and destroy, and preventing some update and destroy behaviors to make create and read more performant: To update a point, insert one with the same measurement, tag set, and timestamp. ... You can’t update or rename tags yet ... You can’t delete tags by tag key (as opposed to value)" (source)
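The "update by re-inserting" behavior quoted above can be illustrated with a toy in-memory model. This is only a sketch of the documented point identity (measurement, tag set, timestamp), not InfluxDB's storage engine or client API. The `PointStore` class and the `file_touches` measurement, with its `project_id` and `path` tags, are hypothetical names chosen for illustration.

```ruby
# Toy in-memory model of InfluxDB point identity. Hypothetical helper,
# not part of InfluxDB or of the influxdb-ruby client.
class PointStore
  def initialize
    @points = {}
  end

  # A point is identified by measurement + tag set + timestamp.
  # Writing the same identity again merges the new fields over the old ones.
  def write(measurement, tags, timestamp, fields)
    key = [measurement, tags.sort, timestamp]
    @points[key] = (@points[key] || {}).merge(fields)
  end

  def read(measurement, tags, timestamp)
    @points[[measurement, tags.sort, timestamp]]
  end

  def size
    @points.size
  end
end

store = PointStore.new
ts = 1_546_300_800 # 2019-01-01T00:00:00Z, in seconds

user_tags = { project_id: '1', path: 'app/models/user.rb' }

store.write('file_touches', user_tags, ts, { count: 3 })
# "Update": the same measurement, tag set, and timestamp replaces the field value.
store.write('file_touches', user_tags, ts, { count: 5 })
# A different tag value is a different series, so this adds a second point
# instead of renaming the first one.
store.write('file_touches', { project_id: '1', path: 'app/models/users.rb' }, ts, { count: 1 })

puts store.read('file_touches', user_tags, ts).inspect
puts store.size
```

One consequence for Analytics: a mistake in a tag value (for example, a wrong file path) cannot be corrected in place. The affected points would have to be rewritten under the corrected tag set.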
In-house expertise
It looks like Pawel Chojnacki pawel@chojnacki.ws did most of the implementation in 2017 (for example, see b668aaf4), but he is no longer with the company. @djensen is currently requesting more information from the Monitor team and will update this section.
Relevant database fields
"The following measurements are currently stored in InfluxDB:
- PROCESS_file_descriptors
- PROCESS_gc_statistics
- PROCESS_memory_usage
- PROCESS_method_calls
- PROCESS_object_counts
- PROCESS_transactions
- PROCESS_views
- events
Here, PROCESS is replaced with either rails or sidekiq depending on the process type. ... [events] is used to store generic events such as the number of Git pushes, Emails sent, etc. Each point in this measurement has a single value field called count. The value of this field is simply set to 1." (source) (An "event" is a time series data point that occurs at irregular intervals. For Analytics we may be more interested in "metrics", which are sampled at regular intervals.)
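To make the shape of an `events` point concrete, here is a minimal sketch that serializes one point to InfluxDB line protocol (`measurement,tags fields timestamp`) using only the Ruby standard library. The helper names and the `event` tag key with its `git_push` value are assumptions for illustration. The quoted documentation only confirms the `events` measurement and the `count` field set to 1.

```ruby
# Escape characters that are special in tag keys, tag values, and field keys.
def escape_key(value)
  value.to_s.gsub(/([,= ])/) { "\\#{$1}" }
end

# Measurement names only need commas and spaces escaped.
def escape_measurement(name)
  name.to_s.gsub(/([, ])/) { "\\#{$1}" }
end

# Integers carry an "i" suffix in line protocol; floats are written as-is.
def format_field(value)
  case value
  when Integer then "#{value}i"
  when Float then value.to_s
  else raise ArgumentError, "unsupported field type: #{value.class}"
  end
end

# Builds one line: measurement[,tag=value...] field=value[,field=value...] timestamp
def to_line(measurement, tags, fields, timestamp_ns)
  tag_part = tags.sort_by { |key, _| key.to_s }
                 .map { |key, value| ",#{escape_key(key)}=#{escape_key(value)}" }
                 .join
  field_part = fields.map { |key, value| "#{escape_key(key)}=#{format_field(value)}" }.join(',')
  "#{escape_measurement(measurement)}#{tag_part} #{field_part} #{timestamp_ns}"
end

# Hypothetical event point: one Git push at 2019-01-01T00:00:00Z (nanoseconds).
line = to_line('events', { event: 'git_push' }, { count: 1 }, 1_546_300_800_000_000_000)
puts line
```

Because every event point carries `count=1`, totals per period would come from summing `count` at query time rather than from reading a stored aggregate.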
Relevant files
- lib/gitlab/metrics/influx_db.rb shows how InfluxDB::Client can be used
- lib/gitlab/metrics/metric.rb has a to_hash method with an important comment about concurrency
- lib/gitlab/metrics/samplers/influx_sampler.rb is a "Class that sends certain metrics to InfluxDB at a specific interval"
What does success look like, and how can we measure that?
Success is an InfluxDB proof of concept that confirms:
- Extremely fast create (for backfilling data on the scale of gitlab.com)
- Extremely fast querying (for quick page load times)
- A manageable increase in complexity
- Analytics data conforms to its time series or event structure requirements
This proof of concept should emulate Production volume for a typical day. Probably the most straightforward approach is to generate Code Hotspots data: specifically, to calculate and store the total number of times each file in each Project was touched by a commit on that day.
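The daily Code Hotspots calculation could be sketched roughly as follows. This assumes commits are available as simple records holding a project ID, a commit time, and the list of touched paths. The `Commit` struct and the `daily_file_touches` method are hypothetical names, not existing GitLab code, and bucketing by UTC day is also an assumption.

```ruby
require 'date'

# Hypothetical minimal commit record for the proof of concept.
Commit = Struct.new(:project_id, :committed_at, :paths)

# Counts how many commits touched each file, per project, per UTC day.
# Returns a hash keyed by [project_id, path, day].
def daily_file_touches(commits)
  counts = Hash.new(0)
  commits.each do |commit|
    day = commit.committed_at.getutc.to_date
    # uniq guards against a path being listed twice within one commit.
    commit.paths.uniq.each do |path|
      counts[[commit.project_id, path, day]] += 1
    end
  end
  counts
end

commits = [
  Commit.new(1, Time.utc(2019, 1, 1, 9, 0), ['app/models/user.rb', 'README.md']),
  Commit.new(1, Time.utc(2019, 1, 1, 17, 30), ['app/models/user.rb']),
  Commit.new(2, Time.utc(2019, 1, 1, 12, 0), ['README.md']),
  Commit.new(1, Time.utc(2019, 1, 2, 8, 0), ['app/models/user.rb'])
]

touches = daily_file_touches(commits)
touches.each do |(project_id, path, day), count|
  puts "project=#{project_id} path=#{path} day=#{day} count=#{count}"
end
```

Each resulting entry maps naturally onto one time series point: the project and path would become tags, the count a field, and the day the timestamp. Timing the write and read of a day's worth of such points would speak to the first two success criteria.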