Include more CI Job details in Slack notification when pipeline fails
Problem to solve
GitLab can send a message to Slack when a pipeline fails, which is incredibly useful. However, it's difficult to determine which step/task of a pipeline failed without opening the pipeline in GitLab. The notification includes the pipeline, branch, and originator, but it lacks the context of why the pipeline failed.
Including more information (such as job name, stage name, job log, etc.) in the notification would help diagnose the problem without requiring a context switch to GitLab.
Intended users
- Developers
- Build Engineers
Further details
The primary use case is diagnosing a failed build when notified via Slack.
The goal is to reduce the friction involved in diagnosing a failed build. Eliminating the need to log into or access GitLab to assess the failure greatly reduces that friction. Additionally, there's potential to resolve the failed build without logging into GitLab at all.
Proposal
Add the following to the Slack webhook payload:
- Job # (with a link to the job)
- Job Name
- Stage name
- Job log
Extending the Slack webhook payload with these pipeline/job details should be a relatively minor change, but it would greatly improve the usability of the failed pipeline Slack notification.
The log can be included as an attachment to the webhook payload, similar to how newly opened issues are handled.
Here's an example message:
<project>: Job #28145 (<job name>) of Stage <stage name> of pipeline #7076 of branch <branch name> by <developer name> (<developer user name>) failed in 01:29
[contents of job log
added as an
attachment]
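The message above could be assembled roughly as follows. This is only a sketch, not GitLab's implementation: the function name, its parameters, and the `MAX_LOG_CHARS` cap are hypothetical, and it assumes a Slack incoming webhook that accepts a JSON body with `text` and legacy `attachments` fields.

```python
MAX_LOG_CHARS = 3000  # assumed cap so the attachment stays reasonably small


def build_failed_job_payload(project, job_id, job_name, stage, pipeline_id,
                             branch, author_name, author_username, duration,
                             job_url, job_log):
    """Build a Slack incoming-webhook payload for a failed job (hypothetical helper)."""
    # <url|label> is Slack's link syntax; it makes the job number clickable.
    text = (
        f"{project}: Job <{job_url}|#{job_id}> ({job_name}) of Stage {stage} "
        f"of pipeline #{pipeline_id} of branch {branch} "
        f"by {author_name} ({author_username}) failed in {duration}"
    )
    # Keep only the tail of the log, since the failure is usually near the end.
    log_tail = job_log[-MAX_LOG_CHARS:]
    return {
        "text": text,
        "attachments": [
            {
                "fallback": f"Log of job #{job_id}",
                "color": "danger",
                "text": log_tail,
            }
        ],
    }
```

The returned dictionary would be serialized with `json.dumps` and sent as the body of the webhook POST. Truncating to the tail of the log is a design choice of this sketch, not part of the proposal.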
Current Journey
- A developer pushes a commit which results in a failed pipeline
- GitLab sends a notification to Slack with the pipeline info
<project>: Pipeline #7076 of branch <branch name> by <developer name> (<developer user name>) failed in 01:29
- A build engineer receives the notification and wants to determine where/why the build failed, so they must access GitLab to diagnose the fault
- Variation 1: User is not connected to their VPN and must log in before they can access their local GitLab
- Variation 2: User is not logged into GitLab and must log in before they can access the pipeline details (further compounded by the usage of a password manager and multi-factor auth)
- The build engineer clicks/accesses the failed job in the GitLab UI
- The build engineer analyzes the log to determine the failure
- The build engineer notifies the appropriate party of the failure so that it can be resolved
- Variation 1: The problem wasn't the result of a failure introduced by the initial commit (such as a network hiccup on deploy) and the pipeline can be rerun without further resolution
Improved User Journey
- A developer pushes a commit which results in a failed pipeline
- GitLab sends a notification to Slack with the pipeline, including more robust information (job info, job log, stage)
<project>: Job #28145 (<job name>) of Stage <stage name> of pipeline #7076 of branch <branch name> by <developer name> (<developer user name>) failed in 01:29 [contents of job log added as an attachment]
- An engineer receives the notification in Slack and analyzes the attached log to determine the failure
- The engineer notifies the appropriate party of the failure so that it can be resolved
- Variation 1: The problem wasn't the result of a failure introduced by the initial commit (such as a network hiccup on deploy), and the engineer is able to issue a slash command or API request to rerun the job
In this particular user story, the issue was resolved without needing to access GitLab directly.
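As an illustration of the API request in Variation 1, the sketch below builds a call to GitLab's job retry endpoint (`POST /projects/:id/jobs/:job_id/retry`). The helper name, base URL, and token handling are assumptions made for the example, and the request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request


def build_retry_request(base_url, project_id, job_id, token):
    """Build (but do not send) a request that retries a job (hypothetical helper)."""
    # The project may be a numeric ID or a path such as "group/project";
    # a path must be URL-encoded, including the slash.
    project = urllib.parse.quote(str(project_id), safe="")
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project}/jobs/{job_id}/retry"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"PRIVATE-TOKEN": token},  # personal access token (assumed auth method)
    )
```

Passing the result to `urllib.request.urlopen` would perform the retry. A Slack slash command could wrap the same call so the engineer never leaves Slack.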
Permissions and Security
The current permissions scheme should be maintainable as is. However, since the job log might contain sensitive data, there might be concerns about adding it to an outgoing webhook. Adding an option to enable log inclusion, with a default value of off, could mitigate potential leakage of sensitive information. This would require additional development work, though, as it would impact the UI/API.
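A minimal sketch of the opt-in behavior, assuming a hypothetical per-integration setting named `include_job_log` (neither the setting nor the helper exists in GitLab today):

```python
from dataclasses import dataclass


@dataclass
class SlackNotificationSettings:
    """Hypothetical integration settings; log inclusion is off by default."""
    include_job_log: bool = False


def log_attachments(settings, job_log):
    """Return the attachments to add to the payload, or none when disabled."""
    if not settings.include_job_log:
        # Default path: the log never leaves GitLab.
        return []
    return [{"fallback": "Job log", "text": job_log}]
```

With the default settings the payload carries only the job metadata. The log is attached only after an administrator explicitly turns the option on.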
Documentation
If only the webhook payload is modified, there should be no documentation changes.
If an option to include the build log in the outgoing payload is added, the Slack integration docs will need an updated screenshot, and any API/UI changes will need to be documented.
Testing
A solution that changes only the payload content would use existing services/capabilities. Therefore, the only tests that should be impacted are those validating the contents of the outgoing failed pipeline webhook.
If log inclusion is made configurable, the resulting changes to the UI and API would need to be tested.
What does success look like, and how can we measure that?
Success is being able to diagnose the reason a build failed more quickly and easily, especially without having to access GitLab. This can be measured by the amount of time or the number of steps required to diagnose the failed build.