When an environment cannot be created, it should fail the whole pipeline creation and surface the errors to the user. Right now they see failed jobs with no information about what went wrong (e.g. gitlab-ce#43196).
Proposal
Force the failure of the pipeline job that creates the environment if the environment name is invalid, and display an error in the job log page:
This job could not be executed because it would create an environment with an invalid name. See documentation
For example, one way to create a job with an invalid environment is to give it a name that is longer than 255 characters, like this build. Ideally this should not even run the build (or create a pipeline); it should just show an error, as though the user had an invalid .gitlab-ci.yml. Otherwise it can cause confusing downstream errors: as you can see in the build output, CI_ENVIRONMENT_SLUG is not assigned a value when it should be.
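A minimal sketch of the kind of pre-flight check described above, assuming the parsed .gitlab-ci.yml is available as a plain dictionary of jobs. The function name `check_environment_names` and the error wording are hypothetical; only the 255-character limit comes from this issue.

```python
MAX_ENVIRONMENT_NAME_LENGTH = 255  # limit mentioned in this issue


def check_environment_names(jobs):
    """Return a list of error strings; an empty list means no name is too long.

    `jobs` is assumed to be the parsed CI config: {job_name: job_definition}.
    """
    errors = []
    for job_name, job in jobs.items():
        environment = job.get("environment")
        if environment is None:
            continue  # this job does not deploy to an environment
        # `environment` can be a bare string or a mapping with a `name` key.
        name = environment if isinstance(environment, str) else environment.get("name", "")
        if len(name) > MAX_ENVIRONMENT_NAME_LENGTH:
            errors.append(
                f"jobs:{job_name}:environment name is too long "
                f"({len(name)} > {MAX_ENVIRONMENT_NAME_LENGTH} characters)"
            )
    return errors


if __name__ == "__main__":
    config = {
        "deploy": {"script": ["./deploy.sh"], "environment": {"name": "x" * 256}},
        "test": {"script": ["./test.sh"]},
    }
    for error in check_environment_names(config):
        print(error)
```

Returning a list rather than raising on the first problem lets the caller report every invalid job at once, the same way a config lint error would.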
Should this be a pipeline failure or a job failure? I think this failure happens when creating the pipeline, so we may want to stop the whole process, but I'm not totally sure.
This item has been automatically determined to have 4 or fewer upvotes and is, for now, being marked as awaiting further demand. If you feel this is incorrect, please comment on the issue and I'll be happy to take a look.
However, the default behavior should be to abort pipeline creation if the environment will not be created properly. Today, some users are still using their broken .gitlab-ci.yml without knowing it. They should be notified somehow.
Currently, an invalid environment URL (e.g. hp://hogehoge) is silently ignored even if the user has specified it in .gitlab-ci.yml. We should probably introduce an environments.error_messages column, or something similar, to indicate that some of the environment functionality failed and that users should adjust their script.
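As an illustration of the URL case, here is a small sketch that rejects schemes other than http and https instead of silently ignoring them. The helper name `environment_url_error` and the returned wording are assumptions for illustration, not GitLab's implementation.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def environment_url_error(url):
    """Return an error string for a disallowed URL scheme, or None if it is allowed."""
    scheme = urlsplit(url).scheme.lower()
    if scheme not in ALLOWED_SCHEMES:
        return (
            "environment external url is blocked: "
            f"only allowed schemes are http, https (got {scheme!r})"
        )
    return None


if __name__ == "__main__":
    print(environment_url_error("hp://hogehoge"))
    print(environment_url_error("https://example.com"))
```

An error string like this is what a column such as the proposed environments.error_messages could store, so that the failure becomes visible to the user.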
We are basically seeing more than 2000 errors like this a day, and these errors are hidden from a user, as far as I know. This is one of the most active exceptions in Sentry.
@grzesiek Yes, this is one of the long-standing bugs that should be resolved. Users often ask me why an environment was not created in a particular case.
Thanks @grzesiek and @shinya.maeda for the feedback. Once we have some clarity on what our capacity is following the team realignment, I'll work to get this scheduled.
Right now they see failed jobs with no information about what went wrong
@kbychu @shinya.maeda Yep, this is definitely worth looking into sooner rather than later.
The new environment page already stops the user from creating this env with a long name. Is .gitlab-ci.yml the only other interface where the environment name can be set?
@shinya.maeda @kbychu I think the easiest and most future-proof way to ensure users know about the error is to notify them via email when it happens. This way we won't need to cover every single method of creating environments.
Essentially, generate an email notification for (every?) Sentry error, explaining the failure, the name of the failed environment, and which configuration generated it.
The downside to this approach is that we wouldn't be validating the environment name on creation, but I think that's an optimization we can do later, e.g. displaying a warning when the user tries to save the .gitlab-ci.yml with an invalid environment name.
@dfosco I am not convinced of the utility of email on every error.
Two completely different reasons why.
To me, a smaller iteration is that we can and should learn over time what the error conditions for creating an environment are. We can address them via linting, via the pipeline editor informing the user that something is invalid, via surfacing the error in the pipeline logs, or by catching the error condition so the pipeline is not successful.
Emails are often ignored. While it may be better than silent errors that are not noticeable to users altogether, I am not sure sending users an email is that much better.
When environment creation fails during pipeline creation, the pipeline creation also fails and shows the message "Failed to create an environment by a validation failure". Example of pipeline creation failure. Here is the list of failure reasons.
When an environment update fails after a deployment job has succeeded, the deploy job is marked as failed and shows the message "Failed to update an environment by a validation failure". Example of job execution failure. Here is the list of failure reasons.
A smaller iteration to me is we can and should learn over time what are the error conditions for creating an environment [...]
Emails are often ignored. While it may be better than silent errors that are not noticeable to users altogether, I am not sure sending users an email is that much better.
@shinya.maeda Thanks for the explanation, that makes sense. My only suggestion here would be to point out the reason for the validation failure, if possible, and otherwise just link to the docs with all the failure reasons.
And involve @axil to make sure the copy in the log response is user friendly
@fabiopitino I have a question on the Ci::PipelineMessage usage.
As this issue describes, we currently skip environment creation silently when a parameter given for it via .gitlab-ci.yml is invalid. This should result in a pipeline creation failure. However, as you can see in https://sentry.gitlab.net/gitlab/gitlabcom/issues/1375317/?query=EnvironmentCreationFailure, there are many invalid cases today, so if we suddenly started failing pipeline creation, it could lead to a production incident.
I wonder if we can start surfacing the environment creation error as a warning message on the pipeline. Something like "Your environment was not correctly created because of an invalid parameter". In this case, users would be aware that something went wrong in the pipeline while the other jobs keep working correctly. Does that make sense?
Also, where would the warning/error message be visible in the UI?
@shinya.maeda I think this is a good use case for warning messages. Currently we display warnings in:
CI Lint output
Redirect of Run pipeline button
Pipeline Editor
We decided to remove warnings from all pipeline pages because users who triggered pipelines but weren't the authors of the CI configs were getting annoyed by recurring warnings. Instead, for now the targeted persona is the author of changes to CI configs.
This config works when the CI config author tests it on the default branch (e.g. the environment name will be review/main; this is the default behavior in the CI Linter). However, it doesn't work if a developer creates a merge request from a branch named test-branch-! (i.e. ! is a character allowed in Git, but NOT in the Environment model). We'd like to show the message to that particular developer so they can take action, so I was wondering if we can start displaying this type of error on the pipeline details page, e.g. https://gitlab.com/gitlab-org/gitlab/-/pipelines/335194588.
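To make the dynamic case concrete, here is a rough sketch of why the same template can pass on one branch and fail on another. The template `review/$CI_COMMIT_REF_NAME` is an assumption (the original config is not shown in this thread), and the regular expression is an ASCII-only paraphrase of the naming rule discussed in this issue (letters, digits, '-', '_', '/', '$', '{', '}', '.', and spaces, not starting or ending with '/'), not GitLab's actual validator.

```python
import re

# ASCII-only approximation of the allowed character set for environment names.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9\-_/${}. ]+")


def expand(template, variables):
    """Replace $VAR references with their values; unknown variables are left as-is."""
    return re.sub(r"\$(\w+)", lambda m: variables.get(m.group(1), m.group(0)), template)


def is_valid_environment_name(name):
    """Check the expanded name against the approximate naming rule."""
    if name.startswith("/") or name.endswith("/"):
        return False
    return ALLOWED_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    template = "review/$CI_COMMIT_REF_NAME"  # assumed template
    for branch in ("main", "test-branch-!"):
        name = expand(template, {"CI_COMMIT_REF_NAME": branch})
        print(name, is_valid_environment_name(name))
```

Because the name is only known after variable expansion, a static lint of the template alone cannot catch the second case, which is why a run-time message is being discussed here.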
However, it sounds like we recently removed the messages from the pipeline details page. This gives me the impression that the pipeline warning feature is limited to static checks in the CI Linter. If so, maybe we should introduce a job-level message.
I think that if we want people to actually notice this message without making it a hard failure, we should surface it as a flash to anyone with commit rights on the project page itself.
Since the problem can occur when a pipeline runs, looping in @v_mishra for a Pipeline Execution opinion.
When a problem can be identified before the pipeline run, we should surface it in the CI Lint page and the Pipeline Editor Lint tab and the pipeline status indicator in the Edit tab.
For the case where an invalid definition is created dynamically, the job page could be a good place to surface the warning, since it's the page you go to when investigating a specific job.
Agree with @nadia_sotnikova on this: since the problem can be of a dynamic nature, warning users during pipeline editing wouldn't catch all instances *
Since warnings were removed from the pipeline details page, does it make sense to add this warning message as a one-off alert above the pipeline graph?
* With that said, it would also be nice to alert users before the error happens, in case they write an environment name like test-environment-! on their YAML file.
I believe the question is: can we identify the error beforehand? If yes, then it makes sense to add it to the linter. If it's something dynamic, we need to think about the right location to surface it.
Maybe in the logs or on the pipeline page (where we display the errors with labels).
Only afterward can we think about some form of warning in the linter. We had a similar discussion about a job that needs a job that was not created.
@dfosco @dhershkovitch Agreed that if we can identify the error beforehand, we need to show it in the Linter in all of its locations (the CI Lint page and the pipeline editor + status validation area).
For the running pipeline, the goal should be to show a notification that is helpful and actionable in the right moment.
Is the problem always caused by a specific problematic job? Can we identify it?
From our research, we see that users don't sit and watch their running pipeline in the pipeline detail page. Most developers use the mini pipeline graph in the MR widget or the pipelines page to jump directly to the logs of the job that interests them, usually when a job has failed.
If a specific job is to blame, I'd show the warning in the job logs page of the job that caused the problem, which reduces the feedback loop for getting the alert, troubleshooting the logs, etc.
@nadia_sotnikova If a specific job is to blame, I'd show the warning in the job logs page of the job that caused the problem, which reduces the feedback loop for getting the alert, troubleshooting the logs, etc.
@shinya.maeda Could we force the job that creates the environment to pass with a warning (in case the environment creation fails) and add the warning message on the job log, as Nadia proposes?
@dfosco OK, so it seems we're almost aligned on the job warning proposal. Would you mind updating the UX proposal in the issue description? We'd probably need a mockup for the frontend work. CCing @kbychu for a PM check. I'll follow up with a technical proposal later; we probably need a new database table for storing the message.
@shinya.maeda Done! Let me know if that works. I'm not sure if we'll force an error or just a warning, let me know and I can adjust the design & copy accordingly.
@shinya.maeda it looks like if you set callout_message on the job, the frontend is already set up to display that. It doesn't look like it is set up to display links though
We may be able to add something to job.deployment_status as well? This would take a little bit of frontend work, but either works for me.
@dfosco Thanks for creating the mock! We'd need a few adjustments:
We'd want to differentiate the job failure message (the existing feature) and the job warning message (a new feature), instead of mixing them in the same section on the UI. For example, a job could get multiple messages across errors and warnings, meaning the message rows could also be multiple. Maybe, we'd want to create a new yellow section under the failure/red message section.
We'd want to support a collapsible message UI in case there are multiple warnings. This could be a next iteration.
We do NOT fail pipeline jobs by environment creation/update error as this is quite impactful on production. We just show a warning message in the job page. Thus the job status can be .
Here are the example cases:
| Job Status | Warning message | Failure message |
| --- | --- | --- |
| Passed | "The system failed to create an environment for this job because environment name can contain only letters, digits, '-', '_', '/', '$', '{', '}', '.', and spaces, but it cannot start or end with '/'. See documentation." | |
| Passed | "The system failed to update an environment by this job because environment external url is blocked: Only allowed schemes are http, https. See documentation." | |
| Failed | "The system failed to create an environment for this job because . See documentation." | "The script exceeded the maximum execution time set for the job" |
| Failed | | "The deployment job is older than the previously succeeded deployment job, and therefore cannot be run" |
The warning message templates would be:
"The system failed to create an environment for this job because #{validation_error_details}. See documentation."
"The system failed to update an environment for this job because #{validation_error_details}. See documentation."
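The two warning templates proposed above could be filled in roughly as follows. This is only a sketch: the function name `environment_warning` and the `action` keys are hypothetical, and `validation_error_details` stands in for whatever the model validation reports.

```python
WARNING_TEMPLATES = {
    "create": "The system failed to create an environment for this job because {details}. See documentation.",
    "update": "The system failed to update an environment for this job because {details}. See documentation.",
}


def environment_warning(action, validation_error_details):
    """Build the job-level warning for a failed environment 'create' or 'update'."""
    return WARNING_TEMPLATES[action].format(details=validation_error_details)


if __name__ == "__main__":
    print(
        environment_warning(
            "update",
            "environment external url is blocked: Only allowed schemes are http, https",
        )
    )
```

Keeping the templates in one mapping makes it easy to review the user-facing copy in a single place.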
Sorry to jump back into the conversation so late, but if the job shows up as passed, chances are the engineer won't check its logs page, which means they won't see the warning. @dfosco
@nadia_sotnikova @afontaine Thanks for pointing that out, Nadia! Yes, forcing the job into warning status seems like a good idea -- users might still ignore it, but if they realize there's something wrong, that's how they'll find the information they need to fix it.
@dfosco In that case, we should differentiate it from the existing "successful with warning" status (a job script failed but allow_failure is true). The two meanings are quite different, and mixing them would cause further confusion.
Based on your previous comment, do you think the passed with warnings status is not ideal for this scenario?
@dfosco Yes, we're mixing different states in one status. This definitely looks like a bad design to me. And, again, the job status can be failed instead of passed. Actually, even the hard failure idea is better than this proposal.
I am not sure I follow all the variations, so I created the following table.
| Job Status | Warning? | allow_failure | pipeline page | Job Log Page |
| --- | --- | --- | --- | --- |
| Passed | no | false | Everything green | No warning or error |
| Passed | yes | false | something new | warning message |
| Passed | yes | true | something new | warning message |
| Failed | no | false | failed | error message |
| Failed | no | true | passed with error? | error message |
| Failed | yes | false | failed | warning and error message |
| Failed | yes | true | Passed with error? | warning and error message |
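The "Job Log Page" column of the table above follows a simple rule that can be written down as a sketch. The function name `job_log_messages` and the string labels are hypothetical. Note that allow_failure does not change what the job log shows in any row of the table, so it is not a parameter here; the pipeline page column is left out because several of its cells are still open questions.

```python
def job_log_messages(job_status, has_warning):
    """Return the message kinds shown on the job log page, in display order.

    job_status is "passed" or "failed"; has_warning is True when environment
    creation or update failed validation.
    """
    messages = []
    if has_warning:
        messages.append("warning message")
    if job_status == "failed":
        messages.append("error message")
    return messages


if __name__ == "__main__":
    for status in ("passed", "failed"):
        for warning in (False, True):
            print(status, warning, job_log_messages(status, warning))
```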
Does this capture all the variations, @shinya.maeda @dfosco @nadia_sotnikova? I think we need to create a new case for when something passes but has a warning. And we might want to rename passed with warnings to passed with errors, because we are now distinguishing errors from warnings. Warnings are not a hard failure.
I am going to move this to workflowdesign and probably move it out of 14.2 until we're clear on what we should do from a UX perspective. Is that OK with you, Daniel?
@kbychu Thanks for the table, it's very useful! Indeed, it seems like this is a case for passed with errors, but I'm not sure if that might confuse our users. Curious what @nadia_sotnikova thinks of this!
Also, forgot to update here in the thread, but @shinya.maeda and I will have a sync tomorrow to discuss this issue and find a way forward. Will add you and Nadia as optional in the invite, but will record anyway