Invalid environment external_url disturbs the entire environment update process
Release notes
Problem to solve
Invalid environment configurations can interrupt the environment update process after a deployment succeeds. For example, the job in the following screenshot is reported as a successful deployment, but it actually failed to update the environment status.
https://gitlab.com/shinya.maeda/pipeline-playground/-/jobs/2364758866
The deployment job should update the environment URL to www.google.com, but because this is a malformed URL, the system can't update it.
Customer impact
This silent failure often makes it hard for support engineers and customers to debug the problem.
We've found the problem: the environment tier and the deployment-related merge request metrics are not updated because the http or https URL prefix is missing under the environment.url YAML key. The validation on the Environment record silently fails here: https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/deployments/update_environment_service.rb#L34
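The failure mode can be illustrated with a minimal, self-contained sketch. Everything here is an assumption made for illustration: SketchEnvironment, its update method, and the error text are hypothetical stand-ins, not GitLab's actual model or service code. The sketch only shows how a non-raising update whose return value is ignored leaves the URL unchanged without surfacing anything to the caller.

```ruby
require 'uri'

# Hypothetical stand-in for the Environment record; NOT GitLab's actual model.
class SketchEnvironment
  attr_reader :external_url, :errors

  def initialize
    @external_url = nil
    @errors = []
  end

  # Mimics a non-raising update: returns false and records an error
  # instead of raising an exception.
  def update(external_url:)
    @errors = []
    unless http_url?(external_url)
      # Illustrative message only, not the real validator output.
      @errors << 'external_url must start with http:// or https://'
      return false
    end
    @external_url = external_url
    true
  end

  private

  # True only when the value parses and carries an http or https scheme.
  def http_url?(value)
    %w[http https].include?(URI.parse(value).scheme)
  rescue URI::InvalidURIError
    false
  end
end

env = SketchEnvironment.new
env.update(external_url: 'www.google.com') # return value ignored: the failure is silent
puts env.external_url.inspect # the URL was never persisted
puts env.errors.inspect       # the reason exists, but nobody looks at it
```

In this sketch the caller would have to check the return value or the errors list to notice the problem, which is the kind of feedback this issue asks to surface to users.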
Errors on SaaS
You can see how often this environment update failure happens on SaaS:
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/2678585/?query=EnvironmentUpdateFailure
- https://log.gprd.gitlab.net/goto/f08bab70-c220-11ec-b73f-692cc1ae8214
At the moment, roughly 4,000 deployments encounter this failure every day. There is no feedback mechanism to make users and customers aware of the problem.
Here is the frequency per error message: https://log.gprd.gitlab.net/goto/b4f34b10-c223-11ec-afaf-2bca15dfbf33
We can see that 100% of error messages are related to the environment.url keyword.
Intended users
Metrics
Tracking the absolute number of failed pipelines due to invalid environments.
User experience goal
The user should be able to see why their environment failed to update
Proposal: Soft validation on External URL
- Since external_url is used by users to access the website, and not for internal server requests, we can persist a URL without AddressableUrlValidator.
- Since we expose the external_url as a button on some pages (environment page, MR page, etc.), we sanitize the URL so that it cannot include JavaScript code.
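The soft validation described above could look roughly like the following sketch. It is an assumption, not the implementation: safe_external_url? and UNSAFE_SCHEMES are hypothetical names, and the only rule taken from the proposal is that a URL carrying JavaScript must be rejected while a URL without an http or https prefix may be persisted.

```ruby
# Hypothetical helper sketching the "soft validation" idea; NOT GitLab's implementation.
# The proposal only names JavaScript, so that is the only scheme rejected here.
UNSAFE_SCHEMES = %w[javascript].freeze

def safe_external_url?(value)
  # Remove whitespace and control characters so "java\tscript:" cannot slip through.
  compact = value.to_s.gsub(/[[:space:][:cntrl:]]/, '')
  # Extract a leading "scheme:" if there is one; nil when the value has no scheme.
  scheme = compact[/\A([a-z][a-z0-9+.\-]*):/i, 1]
  !UNSAFE_SCHEMES.include?(scheme.to_s.downcase)
end

puts safe_external_url?('www.google.com')       # no scheme: accepted under soft validation
puts safe_external_url?('https://example.com')  # ordinary URL: accepted
puts safe_external_url?('JavaScript:alert(1)')  # script URL: rejected
```

A denylist like this is the simplest reading of the proposal. Whether other schemes should also be rejected is left open here.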
Previous proposal (turned down due to drawbacks)
Proposal: Validate Environment URL at pipeline creation
When a new pipeline is created, we additionally validate the Environment URL on each job. We expand the Environment URL (e.g. url: appname-$CI_COMMIT_REF_SLUG) based on the CI/CD variables. If it's invalid, we mark the job as failed, similar to what we did in the previous issue.
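The expand-then-validate step could be sketched as follows. This is a simplified illustration under stated assumptions: expand_variables and valid_environment_url? are hypothetical helpers, unknown variables expand to an empty string, the example.com host is made up, and the real CI/CD variable expansion in GitLab is more involved.

```ruby
require 'uri'

# Hypothetical helper: expands $VAR and ${VAR} references from a variables hash.
# Assumption: unknown variables expand to an empty string.
def expand_variables(template, variables)
  template.gsub(/\$(?:\{(\w+)\}|(\w+))/) { variables.fetch($1 || $2, '') }
end

# Hypothetical check: the expanded value must be an http(s) URL with a host.
def valid_environment_url?(url)
  uri = URI.parse(url)
  %w[http https].include?(uri.scheme) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

variables = { 'CI_COMMIT_REF_SLUG' => 'my-feature' }

# The example from the proposal has no scheme, so the job would be marked as failed.
bad = expand_variables('appname-$CI_COMMIT_REF_SLUG', variables)
puts "#{bad}: #{valid_environment_url?(bad)}"

# The same pattern with a scheme and a made-up host passes the check.
good = expand_variables('https://appname-$CI_COMMIT_REF_SLUG.example.com', variables)
puts "#{good}: #{valid_environment_url?(good)}"
```

Because the check runs on the expanded value at pipeline creation, it cannot see URLs that are only known after the job runs.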
A few notes:
- Because the job fails, users will almost certainly notice that something went wrong.
- Easy to implement. The weight would be 1-2.
- This is a breaking change that could disturb users' CI/CD workflows. (This might not even be a con in %15.0, because we allow breaking changes in major releases.) We should communicate with affected customers in advance to mitigate the impact.
- There is an edge case: users can set dynamic environment URLs after a job finishes. This approach can't detect errors in those URLs.
UI/UX
Surface the following error message on the pipeline job page if an environment update fails:
SSOT UI text*
This job could not be executed because it would update the environment with an invalid URL. Learn more.
Documentation link: https://docs.gitlab.com/ee/ci/yaml/index.html#environmenturl
Similarly, the pipeline might fail because both the name and the URL are invalid. In that case, the message can refer to both, and the documentation link takes the user to the parent section:
SSOT UI text*
This job could not be executed because it would update the environment with an invalid URL and name. Learn more.
Documentation link: https://docs.gitlab.com/ee/ci/yaml/index.html#environment
Further details
Permissions and Security
Documentation
No expected documentation change, other than pointing to https://docs.gitlab.com/ee/ci/yaml/index.html#environmenturl
Availability & Testing
Available Tier
What does success look like, and how can we measure that?
Customers are able to fix this problem after their first encounter with the error message, and the absolute number of failed pipelines due to invalid environments drops.
What is the type of buyer?
- Casey - the Release and Change Management Director
- Dakota - the Application Development Director
- Kennedy - the Infrastructure Engineering Director
Is this a cross-stage feature?
Links / references
- This is a follow-up of #21182 (closed)