Enable structured logging for taskscaler and fleeting
Overview
In production, we see the following errors when trying to ship gitlab-runner
logs. This means we do not ship these logs to ElasticSearch and lose observability.
steve@runners-manager-shared-gitlab-org-blue-5.c.gitlab-ci-155816.internal:~$ tail -f /var/log/td-agent/td-agent.log
2024-04-17 08:49:47 +0000 [warn]: #0 dump an error event: error_class=Fluent::Plugin::Parser::ParserError error="pattern not matched with data '2024-04-17T08:49:46.861Z [INFO] required scaling change: required=-209 capacity-info=\"{\\\"InstanceCount\\\":475,\\\"MaxInstanceCount\\\":900,\\\"Acquired\\\":261,\\\"UnavailableCapacity\\\":0,\\\"Pending\\\":0,\\\"Reserved\\\":0,\\\"IdleCount\\\":5,\\\"ScaleFactor\\\":0.9,\\\"ScaleFactorLimit\\\":50,\\\"CapacityPerInstance\\\":1}\"'" location=nil tag="runner" time=2024-04-17 08:49:46.861817000 +0000 record={"transport"=>"stdout", "stream_id"=>"71155123694a4a43a7f7475b9eba1e3d", "priority"=>"6", "syslog_facility"=>"3", "syslog_identifier"=>"gitlab-runner", "pid"=>"649402", "uid"=>"0", "gid"=>"0", "comm"=>"gitlab-runner", "exe"=>"/usr/bin/gitlab-runner", "cmdline"=>"/usr/bin/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitlab-runner/config.toml --service gitlab-runner --user gitlab-runner", "cap_effective"=>"1ffffffffff", "selinux_context"=>"unconfined\n", "systemd_cgroup"=>"/system.slice/gitlab-runner.service", "systemd_unit"=>"gitlab-runner.service", "systemd_slice"=>"system.slice", "systemd_invocation_id"=>"41957e76a65940b99f73ecb372c3a00a", "boot_id"=>"xxx", "machine_id"=>"xxx", "hostname"=>"runners-manager-shared-gitlab-org-blue-5", "message"=>"2024-04-17T08:49:46.861Z [INFO] required scaling change: required=-209 capacity-info=\"{\\\"InstanceCount\\\":475,\\\"MaxInstanceCount\\\":900,\\\"Acquired\\\":261,\\\"UnavailableCapacity\\\":0,\\\"Pending\\\":0,\\\"Reserved\\\":0,\\\"IdleCount\\\":5,\\\"ScaleFactor\\\":0.9,\\\"ScaleFactorLimit\\\":50,\\\"CapacityPerInstance\\\":1}\""}
2024-04-17 08:49:56.375676645 +0000 fluent.warn: {"error":"#<Fluent::Plugin::Parser::ParserError: pattern not matched with data '2024-04-17T08:49:56.151Z [INFO] decreasing instances: amount=1 group=gce/gitlab-r-saas-l-m-amd64-org-5/us-east1-c/blue-default'>","location":null,"tag":"fluent.warn","time":1713343796,"record":{"transport":"stdout","stream_id":"71155123694a4a43a7f7475b9eba1e3d","priority":"6","syslog_facility":"3","syslog_identifier":"gitlab-runner","pid":"649402","uid":"0","gid":"0","comm":"gitlab-runner","exe":"/usr/bin/gitlab-runner","cmdline":"/usr/bin/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitlab-runner/config.toml --service gitlab-runner --user gitlab-runner","cap_effective":"1ffffffffff","selinux_context":"unconfined\n","systemd_cgroup":"/system.slice/gitlab-runner.service","systemd_unit":"gitlab-runner.service","systemd_slice":"system.slice","systemd_invocation_id":"xx","boot_id":"xxxx","machine_id":"81de70f2ce23e896c06688ac48aa2561","hostname":"runners-manager-shared-gitlab-org-blue-5","message":"2024-04-17T08:49:56.151Z [INFO] decreasing instances: amount=1 group=gce/gitlab-r-saas-l-m-amd64-org-5/us-east1-c/blue-default"},"message":"dump an error event: error_class=Fluent::Plugin::Parser::ParserError error=\"pattern not matched with data '2024-04-17T08:49:56.151Z [INFO] decreasing instances: amount=1 group=gce/gitlab-r-saas-l-m-amd64-org-5/us-east1-c/blue-default'\" location=nil tag=\"runner\" time=2024-04-17 08:49:56.152049000 +0000 record={\"transport\"=>\"stdout\", \"stream_id\"=>\"xxx\", \"priority\"=>\"6\", \"syslog_facility\"=>\"3\", \"syslog_identifier\"=>\"gitlab-runner\", \"pid\"=>\"649402\", \"uid\"=>\"0\", \"gid\"=>\"0\", \"comm\"=>\"gitlab-runner\", \"exe\"=>\"/usr/bin/gitlab-runner\", \"cmdline\"=>\"/usr/bin/gitlab-runner run --working-directory /home/gitlab-runner --config /etc/gitlab-runner/config.toml --service gitlab-runner --user gitlab-runner\", \"cap_effective\"=>\"1ffffffffff\", \"selinux_context\"=>\"unconfined\\n\", \"systemd_cgroup\"=>\"/system.slice/gitlab-runner.service\", \"systemd_unit\"=>\"gitlab-runner.service\", \"systemd_slice\"=>\"system.slice\", \"systemd_invocation_id\"=>\"41957e76a65940b99f73ecb372c3a00a\", \"boot_id\"=>\"xxx\", \"machine_id\"=>\"81de70f2ce23e896c06688ac48aa2561\", \"hostname\"=>\"runners-manager-shared-gitlab-org-blue-5\", \"message\"=>\"2024-04-17T08:49:56.151Z [INFO] decreasing instances: amount=1 group=gce/gitlab-r-saas-l-m-amd64-org-5/us-east1-c/blue-default\"}","environment":"ci-prd","hostname":"runners-manager-shared-gitlab-org-blue-5","fqdn":"runners-manager-shared-gitlab-org-blue-5.c.gitlab-ci-155816.internal","stage":"main","shard":"shared-gitlab-org","tier":"runners","type":"ci-runners"}
This is because the logs are not JSON structured logging:
Apr 17 07:16:43 runners-manager-shared-gitlab-org-blue-5 gitlab-runner[649402]: {"job":6645098281,"level":"debug","msg":"Starting container 62cfc26849587777d1535da604d7e73a43b8c72146bc4439d349f93bf4889e01 ...","project":15642544,"runner":"ynF9_MVqW","time":"2024-04-17T07:16:43Z"}
Apr 17 07:16:43 runners-manager-shared-gitlab-org-blue-5 gitlab-runner[649402]: 2024-04-17T07:16:43.251Z [INFO] required scaling change: required=0 capacity-info="{\"InstanceCount\":444,\"MaxInstanceCount\":900,\"Acquired\":439,\"UnavailableCapacity\":0,\"Pending\":0,\"Reserved\":3,\"IdleCount\":5,\"ScaleFactor\":> # <--- Not structured logged.
Apr 17 07:16:43 runners-manager-shared-gitlab-org-blue-5 gitlab-runner[649402]: {"code":202,"job":6645098448,"job-log":"0-2244","job-status":"running","level":"info","msg":"Appending trace to coordinator...ok","runner":"ynF9_MVqW","sent-log":"0-2243","status":"202 Accepted","time":"2024-04-17T07:16:43Z","update-int>
These logs seem to be coming from taskscaler
and fleeting
.
Action Item
Logs from taskscaler
and fleeting
should follow the same format specified by gitlab-runner
, so that in production we always have JSON structured logs.