Tag job-count and lease-count metrics with platform requirements

Adam Coldrick requested to merge sotk/metrics/platform-property-tags into master

Description

This MR updates the Scheduler's metrics-gathering code to group the count metrics by both stage/state and unique platform requirement set. The metrics can then be tagged with platform properties, which allows more detailed introspection of the state of the job queue.

To make the gathered platform information human-readable, the platform requirements are now stored in the database as a JSON string rather than as the SHA1 hash of that string. For Jobs with large platform requirement dictionaries this will increase the database size, but now that Job cleanup is implemented this shouldn't be a big issue.
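
As a rough illustration of the storage change, the sketch below contrasts an ordered JSON string with its SHA1 hash. The function names and the shape of the requirements dictionary (property name mapped to a list of values) are assumptions made for this example, not BuildGrid's actual code.

```python
import hashlib
import json


def platform_to_json(requirements):
    """Serialize requirements deterministically (hypothetical helper).

    Keys and values are sorted so that equal requirement sets always
    produce the same string, whatever order they were supplied in.
    """
    normalized = {key: sorted(values) for key, values in requirements.items()}
    return json.dumps(normalized, sort_keys=True)


def platform_to_sha1(requirements):
    """Hash the ordered JSON string: fixed size, but not human-readable."""
    return hashlib.sha1(platform_to_json(requirements).encode()).hexdigest()


requirements = {"OSFamily": ["linux"], "ISA": ["x64", "x32"]}
readable = platform_to_json(requirements)   # grows with the dictionary
opaque = platform_to_sha1(requirements)     # always 40 hex characters
```

The JSON form can be parsed back into property names and values when tagging metrics, which the hash cannot; the cost is a column whose size grows with the requirements dictionary.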

Changes proposed in this merge request:

  • Stop hashing the platform requirements when storing them in the database (using the ordered JSON string instead)
  • Gather Job/Lease count metrics per-platform
  • Tag buildgrid.job-count and buildgrid.lease-count metrics with platform requirements
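
The grouping and tagging described above might look roughly like the following sketch. It is illustrative only: the job dictionaries, the helper names, and the choice to join multiple property values with a comma are assumptions, and the real Scheduler gathers these counts from the database. The `name;key=value` layout follows Graphite's tagged series naming.

```python
import json
from collections import Counter


def count_by_stage_and_platform(jobs):
    """Count jobs per (stage, platform JSON string) pair."""
    return Counter((job["stage"], job["platform"]) for job in jobs)


def graphite_tagged_name(name, tags):
    """Append tags in Graphite's `name;key=value` form, sorted by key."""
    return name + "".join(f";{key}={value}" for key, value in sorted(tags.items()))


def tagged_counts(jobs, metric_name="buildgrid.job-count"):
    """Map each tagged metric name to its job count."""
    result = {}
    for (stage, platform), count in count_by_stage_and_platform(jobs).items():
        tags = {"operation-stage": stage}
        # One tag per platform property; joining values with "," is an
        # arbitrary choice made for this example.
        for key, values in json.loads(platform).items():
            tags[key] = ",".join(values)
        result[graphite_tagged_name(metric_name, tags)] = count
    return result


# Hypothetical jobs; `platform` holds the ordered JSON string from the database.
jobs = [
    {"stage": "QUEUED", "platform": '{"OSFamily": ["linux"]}'},
    {"stage": "QUEUED", "platform": '{"OSFamily": ["linux"]}'},
    {"stage": "EXECUTING", "platform": "{}"},
]
metrics = tagged_counts(jobs)
```

Because the stage becomes a tag rather than part of the metric path, dashboard queries can filter on it with `seriesByTag`.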

Validation

Run a BuildGrid with monitoring enabled and a tag-format specified, then send some jobs with platform requirements and view the resulting metrics.

For example, apply this patch to get the example Scheduler dashboard displaying tagged metric names in the "Incomplete Jobs" panel.

diff --git a/buildgrid/settings.py b/buildgrid/settings.py
index 1d339d78..095b3d09 100644
--- a/buildgrid/settings.py
+++ b/buildgrid/settings.py
@@ -132,7 +132,7 @@ MIN_TIME_BETWEEN_SQL_POOL_DISPOSE_MINUTES = 15
 COOLDOWN_TIME_AFTER_POOL_DISPOSE_SECONDS = 75
 
 # SQL Scheduler
-SQL_SCHEDULER_METRICS_PUBLISH_INTERVAL_SECONDS = 300
+SQL_SCHEDULER_METRICS_PUBLISH_INTERVAL_SECONDS = 5  # 300
 
 # Number of times to retry creation of a Pika publisher
 RABBITMQ_PUBLISHER_CREATION_RETRIES = 5
diff --git a/data/config/grafana/dashboards/buildgrid/scheduler.json b/data/config/grafana/dashboards/buildgrid/scheduler.json
index 1026a116..52ea76cb 100644
--- a/data/config/grafana/dashboards/buildgrid/scheduler.json
+++ b/data/config/grafana/dashboards/buildgrid/scheduler.json
@@ -60,7 +60,8 @@
       "targets": [
         {
           "refId": "A",
-          "target": "aliasByMetric(exclude(stats.gauges.buildgrid.instance.job-count.*, 'COMPLETED'))"
+          "target": "seriesByTag(\"name=~stats.gauges.buildgrid.instance.job-count.*\", \"operation-stage!=COMPLETED\")",
+          "textEditor": true
         }
       ],
       "thresholds": [],
@@ -147,7 +148,8 @@
       "targets": [
         {
           "refId": "A",
-          "target": "exclude(aliasByMetric(stats.gauges.buildgrid.instance.operation-count.*), 'COMPLETED')"
+          "target": "seriesByTag(\"name=~stats.gauges.buildgrid.instance.operation-count.*\", \"operation-stage!=COMPLETED\") | aliasByTags(\"operation-stage\")",
+          "textEditor": true
         }
       ],
       "thresholds": [],
@@ -418,7 +420,8 @@
       "targets": [
         {
           "refId": "A",
-          "target": "stats.gauges.buildgrid.instance.job-count.COMPLETED"
+          "target": "sumSeries(seriesByTag(\"name=stats.gauges.buildgrid.instance.job-count.COMPLETED\"))",
+          "textEditor": true
         }
       ],
       "thresholds": "",
diff --git a/data/config/monitoring-controller.yml b/data/config/monitoring-controller.yml
index adabec2e..6c8142b1 100644
--- a/data/config/monitoring-controller.yml
+++ b/data/config/monitoring-controller.yml
@@ -20,6 +20,7 @@ monitoring:
   endpoint-location: statsd:8125
   serialization-format: statsd
   metric-prefix: buildgrid
+  tag-format: graphite
 
 instances:
   - name: ''

Run the monitoring example BuildGrid, and send some jobs. The Grafana interface for this example is available at http://localhost:3000/.

docker-compose -f docker-compose.monitoring.yml up --build --detach
tox -e venv -- bgd execute --remote-cas http://localhost:50052 command ./buildgrid echo "example job"
tox -e venv -- bgd execute --remote-cas http://localhost:50052 command -p OSFamily linux ./buildgrid ls
tox -e venv -- bgd execute --remote-cas http://localhost:50052 command -p ISA x64 -p ISA x32 -p OSFamily linux ./buildgrid echo "hello world"