Pipeline Efficiency: Provide AI assisted help to fix pipelines fast
Release notes
Problem to solve
I've learned to maintain my own CI/CD pipelines, writing code, deploying the app, and throughout the process, installing dependencies needed inside CI/CD jobs.
Sometimes, there is a failure within the job logs. I need to read and parse the error log, often a stacktrace from a script or command that does not run properly. Or it is a single logged error line that requires research and analysis.
In a recent example, the deployment of my opsindev.news newsletter to GitLab Pages failed, because of a regression in the MkDocs Material Theme being installed with pip. https://gitlab.com/dnsmichi/opsindev.news/-/jobs/3544293987
The Python module is installed from a private repository, using an insiders subscription that enables social plugins and GDPR cookie banners. https://gitlab.com/dnsmichi/opsindev.news/-/blob/main/.gitlab-ci.yml
# base installs the requirements
.mkdocs-base:
  script:
    - pip install -r requirements.txt
# build generates static site, throws errors with wrong config, etc.
.mkdocs-build:
  script:
    - mkdocs build --site-dir public
# insiders uses the MkDocs Material insiders version (sponsoring by dnsmichi, token in CI/CD variables)
.mkdocs-insiders:
  script:
    - !reference [.mkdocs-base, script]
    - |
      if [ -z "${GITHUB_TOKEN_MKDOCS}" ]; then
        echo "Using MkDocs Material default"
      else
        echo "Using MkDocs Material insiders"
        pip install git+https://${GITHUB_TOKEN_MKDOCS}@github.com/squidfunk/mkdocs-material-insiders.git
      fi
    - !reference [.mkdocs-build, script]
# lint and generate site on main branch
pages:
  extends: .mkdocs-insiders
  stage: deploy
  only:
    - main
  artifacts:
    paths:
      - public
The procedure to fix the pipeline is:
- Copy the error message into a web search: `AttributeError: 'SocialPlugin' object has no attribute '_image_promises'`
- Additionally, open the project's bug tracker and search for `SocialPlugin`.
- Found the problem in https://github.com/squidfunk/mkdocs-material/issues/4819
- Maintainer commented with a fix, linking to a commit/PR, labeled `fix available`.
- Solution is to wait for a new nightly build of the MkDocs Material PyPI package and later re-run the CI/CD deployment. This needs an issue to track its due date, and increase visibility (if there were more team members).
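The first step of the procedure above (finding the words to copy into a web search) could be sketched as follows. This is only an illustration, not an existing GitLab feature: the function name `extract_search_query`, the regular expression, and the shortened sample log are assumptions. Only the `AttributeError` line is taken from the failed job.

```python
import re
from typing import Optional

# Matches the final line of a Python traceback, e.g. "AttributeError: message".
# Assumption: the exception class name ends in "Error" or "Exception".
EXCEPTION_LINE = re.compile(
    r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<message>.+)$"
)


def extract_search_query(job_log: str) -> Optional[str]:
    """Return the last exception line found in the job log, or None."""
    query = None
    for line in job_log.splitlines():
        match = EXCEPTION_LINE.match(line.strip())
        if match:
            # Keep overwriting so the last match in the log wins.
            query = f"{match.group('type')}: {match.group('message')}"
    return query


# Hypothetical, shortened job log. Only the AttributeError line is from the issue.
sample_log = """\
$ mkdocs build --site-dir public
Traceback (most recent call last):
  File "example.py", line 1, in <module>
AttributeError: 'SocialPlugin' object has no attribute '_image_promises'
ERROR: Job failed: exit code 1
"""

print(extract_search_query(sample_log))
```

The sketch only handles Python-style tracebacks. Other languages would need their own patterns, which is part of what makes the manual search hard for beginners.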
To analyse the problem, I need to leave the GitLab UI and job log and open several new tabs. The strategy for searching for an error in CI/CD can be overwhelming. With long experience as a developer, it is easier to identify the words to copy into a web search, but a beginner may not know what a stacktrace is.
Suppose the maintainer had not already pushed a fix. In that case, I'd need to create an upstream issue myself, and also an issue/incident in GitLab stating that the CI/CD pipelines are currently failing and that the fix is pending an upstream release, with the upstream issue URL tracked in the incident.
Another use case is errors on www-gitlab-com with failed blog post deployments (YAML frontmatter errors) or other linting errors that may require team members to ask for help.
Intended users
Different personas, from experienced CI/CD pipeline builders to users who run into blog deployment troubles.
- Delaney (Development Team Lead)
- Sasha (Software Developer)
- Priyanka (Platform Engineer)
- Sidney (Systems Administrator)
- Rachel (Release Manager)
- Alex (Security Operations Engineer)
- Simone (Software Engineer in Test)
- Allison (Application Ops)
- Ingrid (Infrastructure Operator)
- Dana (Data Analyst)
- Eddie (Content Editor)
User experience goal
The user should be able to fix a failing CI/CD pipeline and job fast. Adding documentation on how to fix the problem should be easy from the UI, with additional AI help that knows how to troubleshoot more efficiently or already provides a possible solution.
Proposal
This is a proposal with multiple ideas:
- Being able to comment on a job log line (similar to code reviews), and document how to fix the problem.
- The error may come again, which can help create a knowledge base if this becomes a well established team workflow.
- Maybe this comment can be copied/persisted somewhere else (a documentation issue), because job logs can get huge/slow/deleted
- Or create an alert/incident from the job UI, so that the analysis and learning process is transparent for everyone.
- GitLab CI/CD learns from past pipeline runs that these errors frequently recur, and offers a help blurb in the job log UI on how to fix the problem.
- Thinking of MLOps/AIOps here, similar to the detection of flaky tests, and more Observability data available.
- For example, if the error is related to network timeouts, or generic HTTP errors, additional metrics from CI/CD environments could help increase pipeline efficiency.
- If a code stacktrace is rendered, the AI can detect the language, used package, bug tracker, and offer immediate research help (and maybe a code change to fix). The first iteration can be better rendering of the error message, i.e. marking it bold.
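As a rough illustration of how recurring errors could be detected across past pipeline runs, the sketch below normalizes error lines into a fingerprint and counts repeats. It is a simple heuristic, not an ML approach and not a GitLab API. The names `fingerprint` and `recurring_errors` are hypothetical, and the timeout lines in the sample history are made up.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def fingerprint(error_line: str) -> str:
    """Normalize an error line so that repeated failures compare equal."""
    # Replace hex addresses first, then any remaining digit runs.
    normalized = re.sub(r"0x[0-9a-fA-F]+", "<hex>", error_line)
    normalized = re.sub(r"\d+", "<n>", normalized)
    return normalized.strip().lower()


def recurring_errors(error_lines: Iterable[str], min_count: int = 2) -> Dict[str, int]:
    """Return fingerprints seen at least min_count times, with their counts."""
    counts = Counter(fingerprint(line) for line in error_lines)
    return {fp: n for fp, n in counts.items() if n >= min_count}


# Hypothetical error lines collected from past failed jobs.
history = [
    "AttributeError: 'SocialPlugin' object has no attribute '_image_promises'",
    "Connection timed out after 30 seconds",
    "Connection timed out after 45 seconds",
    "AttributeError: 'SocialPlugin' object has no attribute '_image_promises'",
    "YAML front matter error in line 7",
]

for fp, count in recurring_errors(history).items():
    print(count, fp)
```

A fingerprint that recurs could be the trigger for showing a help blurb, or a previously written job log comment, next to the matching line.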
Further details
"Fix failed pipelines fast" is a common requirement for efficient pipelines. Avoiding errors, and detecting them early enough, with fast help to fix them, can help developers, DevOps engineers, Ops, etc. to become more efficient.
In the future, AI may help solve these problems. This requires more pipeline insights (OpenTelemetry data, and additional instrumentation of job script steps), and training history of the user's projects to learn specific problem types.
- CI/CD Observability: Tracing with OpenTelemetry (#338943)
- Implement tracing for `CreatePipelineService` (#373143)
- GitLab pipeline instrumentation (gitlab-org/opstrace&41)
Providing the option to document, learn and "auto fix" pipelines also enables beginners and first-time contributors, who are familiar with neither the code base nor the architecture and its dependencies that may cause errors in complex CI/CD pipelines.
Last but not least, this could also become a way to better troubleshoot CI/CD pipelines, and create a collection of known CI/CD problems (needs research on data processing, user agreements, etc.). Other possibilities are better documentation for pipeline efficiency, and learning complex pipelines.
- Add more config, resource, infra efficiency tip... (#367062)
- Examples of complex pipelines (verify-stage#360)
Permissions and Security
The ability to comment on job logs or get AI tips should be limited to logged-in users with project access.
Documentation
- How to create an issue/incident from commenting on a job log (depends how this is implemented)
- How to navigate AI assisted tips in CI/CD - use case examples.
Availability & Testing
- Evaluate whether commenting on job logs and persisting the data "somewhere" affects performance. Linking issues with job log lines can affect the frontend rendering, for example. (Context: I remember that we added a regex to render URLs in job logs that caused trouble.)
- Define how much AI assisted help is possible, and clarify it in the documentation.
Available Tier
- Free: Ability to comment on job logs and create issues/alerts.
- Ultimate: Advanced AI assisted tips for more efficient pipelines, "auto fix", etc.
- Should also be available to contributors that fork an Ultimate project, and create an upstream MR that runs the pipelines.
Feature Usage Metrics
Failed jobs and pipelines, and actions taken after viewing the job log. Is a new issue/alert created? Was the AI assist help blurb used to create a follow-up MR to fix the pipeline?
What does success look like, and how can we measure that?
Month-over-month reduction of failing pipelines, and patterns of failure being reduced.
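A minimal sketch of how the month-over-month measurement could be computed from a list of finished pipelines. The input shape (date string plus status), the function names, and the sample data are all assumptions for illustration. Real data would come from pipeline records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def monthly_failure_rates(pipelines: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map 'YYYY-MM' to the share of failed pipelines in that month.

    Each item is (finished date as 'YYYY-MM-DD', status).
    """
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for finished_date, status in pipelines:
        month = finished_date[:7]  # 'YYYY-MM'
        totals[month] += 1
        if status == "failed":
            failures[month] += 1
    return {month: failures[month] / totals[month] for month in sorted(totals)}


def month_over_month_change(rates: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (month, change in failure rate versus the previous month)."""
    months = sorted(rates)
    return [(cur, rates[cur] - rates[prev]) for prev, cur in zip(months, months[1:])]


# Hypothetical sample data: 2 of 4 pipelines fail in January, 1 of 4 in February.
sample = [
    ("2023-01-05", "failed"), ("2023-01-12", "success"),
    ("2023-01-20", "failed"), ("2023-01-28", "success"),
    ("2023-02-03", "failed"), ("2023-02-10", "success"),
    ("2023-02-17", "success"), ("2023-02-24", "success"),
]

rates = monthly_failure_rates(sample)
print(rates)
print(month_over_month_change(rates))
```

With the sample data, the failure rate drops from 0.5 in January to 0.25 in February, so the reported change for February is -0.25. A negative change is the reduction this proposal aims for.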
Dogfooding on https://gitlab.com/gitlab-com/www-gitlab-com with the handbook, website and blog deployments provides a great measurement for failing pipelines. Common failures are linter errors (handbook links, codeowners, redirects, blog front matter YAML). Examples in these issues:
- Marketing handbook: Restructuring and corporate... (gitlab-com/www-gitlab-com#13991 - closed)
- Evaluate (continuous) URL linting and external ... (gitlab-com/www-gitlab-com#13980 - closed)
- CI linting: Detect wrong relative URLs with `si... (gitlab-com/www-gitlab-com#10326)
https://gitlab.com/gitlab-org/gitlab provides complex pipelines and potential failure reduction too.
Another measurement could be:
- First-time contributors who fork the project and create an MR: the number of CI/CD pipeline failures, and how they are reduced with additional help. This could also help increase the number of contributors, when waiting for a pipeline and fixing things become easier and faster.
What is the type of buyer?
Is this a cross-stage feature?
Yes. Needs discussion on the feature scope and iterations. There might be overlaps with AI Assist, Observability and Pipeline Efficiency initiatives already planned. I think that the verify stage in the ops section can be the DRI, cc @jreporter
- ~"group::pipeline execution" - ~"section::ops" @jheimbuck_gl
- ~"group::runner" - ~"section::ops" @DarrenEastman
- ~"group::pipeline insights" - ~"section::ops" @jocelynjane
- ~"group::observability" - ~"section::ops" @kbychu
- ~"group::mlops" - ~"section::data-science" @tmccaslin
Quality engineering and contributor success teams may also be interested. cc @meks @nick_vh
What is the competitive advantage or differentiation for this feature?
Pipeline Efficiency on one platform: Actions on job logs (document, alert) and AI assisted tips to fix pipelines fast.
Pipeline Efficiency is a complex topic, with different possibilities to analyse, research, learn and solve. After giving talks around this topic in 2020 and 2021, I recognized a growing interest in 2022 at KubeCon EU and NA in lightning talks at the GitLab booth. I will continue researching and creating proposals and content in FY24 (2023).
Links / references
- CI/CD Observability: Tracing with OpenTelemetry (#338943)
- Implement tracing for `CreatePipelineService` (#373143)
- GitLab pipeline instrumentation (gitlab-org/opstrace&41)
- Marketing handbook: Restructuring and corporate... (gitlab-com/www-gitlab-com#13991 - closed)
- Evaluate (continuous) URL linting and external ... (gitlab-com/www-gitlab-com#13980 - closed)
- CI linting: Detect wrong relative URLs with `si... (gitlab-com/www-gitlab-com#10326)
- KubeCon NA Lightning Talk "Efficient DevSecOps Pipelines in a Cloud Native World" at GitLab booth by @dnsmichi: https://docs.google.com/presentation/d/1k4PWKJR9O1jEGxKblSQtjDsloQ95uvu6Ty9Pjpmin7E/edit