Tag job-count and lease-count metrics with platform requirements
Description
This MR updates the Scheduler metrics-gathering code to group the count metrics by unique platform requirement set as well as by stage/state. The metrics can then be tagged with platform properties, which allows more detailed introspection of the state of the job queue.
To make the gathered platform information human-readable, the platform requirements are now stored in the database as a JSON string rather than as the SHA-1 hash of that string. For Jobs with large platform requirement dictionaries this will increase the database size, but now that Job cleanup is implemented this shouldn't be a big issue.
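As a rough illustration of the storage change, the sketch below contrasts an ordered JSON string with its SHA-1 hash. The function names and the shape of the requirements mapping are assumptions made for this example, not the actual BuildGrid implementation.

```python
import hashlib
import json
from typing import Mapping, Sequence


def serialize_platform_requirements(requirements: Mapping[str, Sequence[str]]) -> str:
    """Return a deterministic JSON string for a platform requirements mapping.

    Hypothetical helper: keys and values are sorted so that equal requirement
    sets always produce the same string.
    """
    normalized = {key: sorted(values) for key, values in requirements.items()}
    return json.dumps(normalized, sort_keys=True)


def hash_platform_requirements(requirements: Mapping[str, Sequence[str]]) -> str:
    """Return the SHA-1 hex digest of the serialized requirements (the old, opaque form)."""
    serialized = serialize_platform_requirements(requirements)
    return hashlib.sha1(serialized.encode("utf-8")).hexdigest()


requirements = {"OSFamily": ["linux"], "ISA": ["x64", "x32"]}
readable = serialize_platform_requirements(requirements)
opaque = hash_platform_requirements(requirements)

print(readable)  # {"ISA": ["x32", "x64"], "OSFamily": ["linux"]}
print(opaque)    # 40 hex characters; the properties cannot be recovered from it
```

The JSON form grows with the number of properties, whereas the hash is always 40 characters, which is where the database size trade-off comes from.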
Changes proposed in this merge request:
- Stop hashing the platform requirements when storing them in the database, and store the ordered JSON string instead
- Gather Job/Lease count metrics per-platform
- Tag buildgrid.job-count and buildgrid.lease-count metrics with platform requirements
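The per-platform grouping can be pictured with a minimal sketch like the one below. It counts in-memory rows with a `Counter`; the row layout and the function name are hypothetical, and the real Scheduler presumably does the grouping in SQL.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical rows: (operation stage, serialized platform requirements).
JobRow = Tuple[str, str]

jobs = [
    ("QUEUED", '{"OSFamily": ["linux"]}'),
    ("QUEUED", '{"OSFamily": ["linux"]}'),
    ("EXECUTING", '{"OSFamily": ["linux"]}'),
    ("QUEUED", "{}"),
]


def count_jobs_by_stage_and_platform(rows: Iterable[JobRow]) -> Counter:
    """Count jobs per unique (stage, platform requirements) pair."""
    return Counter(rows)


counts = count_jobs_by_stage_and_platform(jobs)
for (stage, platform), count in sorted(counts.items()):
    # Each pair would become one metric sample, tagged with its stage and platform.
    print(stage, platform, count)
```

Grouping on the pair rather than on the stage alone is what makes it possible to attach platform tags to each published count.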
Validation
Run a BuildGrid with monitoring enabled and a tag-format specified, then send some jobs with platform requirements and view the resulting metrics.
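For orientation, a tagged gauge in the Graphite-style StatsD tag syntax looks roughly like the output of the sketch below. The helper name, the metric value, and the exact tag names are assumptions for illustration; the tags BuildGrid actually emits may differ.

```python
from typing import Mapping


def format_graphite_gauge(name: str, value: int, tags: Mapping[str, str]) -> str:
    """Format a StatsD gauge line with Graphite-style tags: name;tag=value:count|g.

    Hypothetical helper for illustration only.
    """
    tag_suffix = "".join(f";{key}={val}" for key, val in sorted(tags.items()))
    return f"{name}{tag_suffix}:{value}|g"


line = format_graphite_gauge(
    "buildgrid.job-count",
    3,
    {"operation-stage": "QUEUED", "OSFamily": "linux"},
)
print(line)  # buildgrid.job-count;OSFamily=linux;operation-stage=QUEUED:3|g
```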
For example, apply this patch to get the example Scheduler dashboard displaying tagged metric names in the "Incomplete Jobs" panel.
diff --git a/buildgrid/settings.py b/buildgrid/settings.py
index 1d339d78..095b3d09 100644
--- a/buildgrid/settings.py
+++ b/buildgrid/settings.py
@@ -132,7 +132,7 @@ MIN_TIME_BETWEEN_SQL_POOL_DISPOSE_MINUTES = 15
COOLDOWN_TIME_AFTER_POOL_DISPOSE_SECONDS = 75
# SQL Scheduler
-SQL_SCHEDULER_METRICS_PUBLISH_INTERVAL_SECONDS = 300
+SQL_SCHEDULER_METRICS_PUBLISH_INTERVAL_SECONDS = 5 # 300
# Number of times to retry creation of a Pika publisher
RABBITMQ_PUBLISHER_CREATION_RETRIES = 5
diff --git a/data/config/grafana/dashboards/buildgrid/scheduler.json b/data/config/grafana/dashboards/buildgrid/scheduler.json
index 1026a116..52ea76cb 100644
--- a/data/config/grafana/dashboards/buildgrid/scheduler.json
+++ b/data/config/grafana/dashboards/buildgrid/scheduler.json
@@ -60,7 +60,8 @@
"targets": [
{
"refId": "A",
- "target": "aliasByMetric(exclude(stats.gauges.buildgrid.instance.job-count.*, 'COMPLETED'))"
+ "target": "seriesByTag(\"name=~stats.gauges.buildgrid.instance.job-count.*\", \"operation-stage!=COMPLETED\")",
+ "textEditor": true
}
],
"thresholds": [],
@@ -147,7 +148,8 @@
"targets": [
{
"refId": "A",
- "target": "exclude(aliasByMetric(stats.gauges.buildgrid.instance.operation-count.*), 'COMPLETED')"
+ "target": "seriesByTag(\"name=~stats.gauges.buildgrid.instance.operation-count.*\", \"operation-stage!=COMPLETED\") | aliasByTags(\"operation-stage\")",
+ "textEditor": true
}
],
"thresholds": [],
@@ -418,7 +420,8 @@
"targets": [
{
"refId": "A",
- "target": "stats.gauges.buildgrid.instance.job-count.COMPLETED"
+ "target": "sumSeries(seriesByTag(\"name=stats.gauges.buildgrid.instance.job-count.COMPLETED\"))",
+ "textEditor": true
}
],
"thresholds": "",
diff --git a/data/config/monitoring-controller.yml b/data/config/monitoring-controller.yml
index adabec2e..6c8142b1 100644
--- a/data/config/monitoring-controller.yml
+++ b/data/config/monitoring-controller.yml
@@ -20,6 +20,7 @@ monitoring:
endpoint-location: statsd:8125
serialization-format: statsd
metric-prefix: buildgrid
+ tag-format: graphite
instances:
- name: ''
Run the monitoring example BuildGrid, and send some jobs. The Grafana interface for this example is available at http://localhost:3000/.
docker-compose -f docker-compose.monitoring.yml up --build --detach
tox -e venv -- bgd execute --remote-cas http://localhost:50052 command ./buildgrid echo "example job"
tox -e venv -- bgd execute --remote-cas http://localhost:50052 command -p OSFamily linux ./buildgrid ls
tox -e venv -- bgd execute --remote-cas http://localhost:50052 command -p ISA x64 -p ISA x32 -p OSFamily linux ./buildgrid echo "hello world"