Add a structured "job finished" log line with failure_reason and exit_code
What does this MR do?
Unifies the "job finished" message and adds metadata that is not currently logged, such as job-status, exit_code and failure_reason.
Before:
Job succeeded duration_s=35.836495667 gitlab_user_id=1 job=526 namespace_id=0 organization_id=0 project=19 project_full_path=root/gdk-ci-test root_namespace_id=0 runner=Wg8IWvTxZ runner_name=GDK local runner
ERROR: Job failed (system failure): prepare environment: setting up build pod: provided host alias mysql_1 for (...) duration_s=1.337469125 gitlab_user_id=1 job=529 namespace_id=1 organization_id=1 project=20 project_full_path=root/tm-services-compatibility-between-docker-and-k8s root_namespace_id=1 runner=Wg8IWvTxZ runner_name=GDK local runner
After:
Job succeeded duration_s=36.388285542 gitlab_user_id=1 job=539 job-status=success namespace_id=1 organization_id=1 project=19 project_full_path=root/gdk-ci-test root_namespace_id=1 runner=Wg8IWvTxZ
WARNING: Job failed: command terminated with exit code 42 duration_s=9.154000584 error=command terminated with exit code 42 exit_code=42 failure_reason= gitlab_user_id=1 job=553 job-status=failed namespace_id=1 organization_id=1 project=19 project_full_path=root/gdk-ci-test root_namespace_id=1 runner=Wg8IWvTxZ
Why was this MR needed?
This allows us to more easily see job success vs failure rates. It also tells us more about why a job failed.
This is particularly useful during incidents, where we may want to see whether system failures are increasing. Currently that requires fuzzy matching on the msg field, which is not very user friendly.
The end goal is for "job finished" to canonically represent all job completions and to include all relevant dimensions, so that we can easily assess the overall health of the system and dig into systemic failures.
EDIT: We keep the old messages for backward compatibility, but job-status: [success, failed] can now be used as a filter.
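To illustrate the job-status filter, here is a minimal sketch of how a log consumer might count job outcomes without fuzzy matching on msg. It assumes the runner logs are emitted as one JSON object per line; the sample lines and the helper names (finished_jobs, status_counts, failure_rate) are hypothetical, and only the field names are taken from the "After" examples above.

```python
import json
from collections import Counter

# Hypothetical JSON log lines. Field names follow the "After" examples;
# the values are illustrative only.
SAMPLE_LOGS = [
    '{"msg": "Job succeeded", "job": 539, "job-status": "success", "duration_s": 36.388285542}',
    '{"msg": "Job failed: command terminated with exit code 42", "job": 553, '
    '"job-status": "failed", "exit_code": 42, "failure_reason": ""}',
    '{"msg": "an unrelated log line", "runner": "Wg8IWvTxZ"}',
    "not json at all",
]


def finished_jobs(lines):
    """Yield parsed entries that carry a job-status of success or failed."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and entry.get("job-status") in ("success", "failed"):
            yield entry


def status_counts(lines):
    """Count finished jobs per job-status value."""
    return Counter(entry["job-status"] for entry in finished_jobs(lines))


def failure_rate(lines):
    """Return failed / (success + failed), or 0.0 when no job finished."""
    counts = status_counts(lines)
    total = counts["success"] + counts["failed"]
    return counts["failed"] / total if total else 0.0


if __name__ == "__main__":
    print(status_counts(SAMPLE_LOGS))
    print(failure_rate(SAMPLE_LOGS))
```

The same filter could be narrowed further, for example by grouping failed entries on failure_reason or exit_code, once those fields are populated for the failure types of interest.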
What's the best way to test this MR?
I tested it locally.