Add Kibana fields for Postgres autovacuum auto-analyze log messages

Summary

Observability improvement:

When autovacuum runs an automatic ANALYZE on a large table, this can take many minutes to complete, during which other VACUUM tasks cannot clean up recently dead tuples.

Currently when ingesting postgres CSV logs into Elasticsearch, we parse the fields from autovacuum's messages related to VACUUM but not for ANALYZE. We can use the same parsing approach. There are fewer fields available, but at least having the elapsed time would have been helpful for finding which auto-analyze task was the relevant one.

Related Incident(s)

Originating issue(s): production#7297

Specifically see this comment: production#7297 (comment 999968340)

Desired Outcome/Acceptance Criteria

This is an example of the raw log event in the postgres CSV file:

2022-06-21 17:15:25.061 GMT,,,79623,,62b1f3df.13707,1,,2022-06-21 16:37:51 GMT,31/525464278,531974591,LOG,00000,"automatic analyze of table ""gitlabhq_production.public.issues"" system usage: CPU: user: 862.79 s, system: 129.90 s, elapsed: 2253.18 s",,,,,,,,,""

Currently fluentd applies the generic parsing of postgres CSV, so the message field is:

automatic analyze of table "gitlabhq_production.public.issues" system usage: CPU: user: 862.79 s, system: 129.90 s, elapsed: 2253.18 s

From that message, we want to extract additional fields, with consistent field names and types as we use for parsing the corresponding vacuum log events.

For reference when parsing, here is the Postgres code that emits that log message. From this we can see there are no optional substrings that our regexp would need to accommodate, so parsing this log event should be straight-forward.

            ereport(LOG,
                    (errmsg("automatic analyze of table \"%s.%s.%s\" system usage: %s",
                            get_database_name(MyDatabaseId),
                            get_namespace_name(RelationGetNamespace(onerel)),
                            RelationGetRelationName(onerel),
                            pg_rusage_show(&ru0))));

To add these fields, we need to extend the fluentd (td-agent) log parsing with an additional transformation filter for ingesting postgres CSV log entries. That configuration is managed here: https://gitlab.com/gitlab-cookbooks/gitlab_fluentd/-/blob/master/templates/default/postgres.conf.erb

As an initial implementation sketch, I think something like this would probably work. It preserves the same prefix for the new field names as we use for autovacuum's vacuum tasks. This is desirable because the same workers do both tasks, and we would typically want to see both activities (particularly since they can have interaction effects).

$ git diff -U0
diff --git a/templates/default/postgres.conf.erb b/templates/default/postgres.conf.erb
index d1cf5f0..0cd1e6b 100644
--- a/templates/default/postgres.conf.erb
+++ b/templates/default/postgres.conf.erb
@@ -87,0 +88,19 @@
+
+# Attempt to extract auto-analyze statistics from the CSV message field
+<filter postgres.postgres_csv>
+  @type parser
+  key_name message
+  inject_key_prefix auto_vacuum_      # Enriched fields will use this prefix
+  remove_key_name_field false         # Keep original message
+  reserve_data true                   # Keep original fields
+  reserve_time true                   # Keep original time
+  emit_invalid_record_to_error false  # Ignore fields that don't match the parser regexp
+
+  <parse>
+    @type regexp
+    types cpu_user_s:float,cpu_system_s:float,elapsed_s:float
+
+    # parse auto-analyze log lines
+    expression /^automatic (?<analyze>analyze) of table "(?<database>\w+).(?<schema>\w+).(?<tablename>\w+)":\s*system usage: CPU: user: (?<cpu_user_s>[\d\.]+) s, system: (?<cpu_system_s>[\d\.]+) s, elapsed: (?<elapsed_s>[\d\.]+) s/
+  </parse>
+</filter>

Associated Services

This change affects the logging service (specifically the fluentd agent): ServiceLogging in Production Engineering

The logs being consumed are Postgres logs, but nothing about the ServicePostgres in Production Engineering itself is affected by this change.

Corrective Action Issue Checklist

Link the incident(s) this corrective action arose out of
Give context for what problem this corrective action is trying to prevent from re-occurring
Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
Assign a priority (this will default to 'priority::4')

Edited Jun 22, 2022 by Matt Smiley