Define undocumented behavior for `=~` operator in CI/CD variable expressions for cases where RHS variable is an invalid regex
Summary
I couldn't find any reliable documentation on how the regex matching operator `=~` in CI/CD variable expressions is supposed to behave for variables that contain invalid regexes.
I determined the actual behavior by experiment and by reading the source code that evaluates this operator. Although the results surprised me, the current behavior could be useful. But it is unclear whether we can rely on it. If the surprising behavior is considered a bug and is "fixed" later, our CI will completely change its behavior, because we built it on that bug.
Steps to reproduce
Create a `.gitlab-ci.yml` with the following content:
match:
  image: alpine
  parallel:
    matrix:
      - LEFT: [1234, 23, /23/]
        RIGHT: [1234, 23, /23/]
  rules:
    - if: $LEFT =~ $RIGHT
      variables:
        RESULT: 0
    - variables:
        RESULT: 1
  script:
    - exit "$RESULT"
Run the pipeline. This creates 3×3 = 9 jobs with the following results (tested on self-hosted 15.11-ee and 16.0.8-ee, and on GitLab.com 16.5.0-pre 43a5e402). Rows are values of LEFT, columns are values of RIGHT:
=~   | 1234 | 23  | /23/
-----+------+-----+------
1234 | ✔    | ×   | ✔
23   | ✔    | ✔   | ✔
/23/ | ×    | ×   | ✔
How to read this? Example: the × in the first table body row indicates that GitLab evaluated `$LEFT =~ $RIGHT` as false for `LEFT=1234` and `RIGHT=23`. Below I write this as "`1234 =~ 23` is false". Here, `1234` and `23` are the literal contents of variables and are not hard-coded into the expression itself!
What is the current surprising (not necessarily bug!) behavior?
I was surprised that `23 =~ 1234` is true.
The intended and documented use case is `string =~ /regex/`. But if we leave out the `/`s, GitLab seems to interpret `=~` as `.isSubstringOf(`.
What is the expected intuitive behavior?
I'd have assumed that `string =~ invalidRegex` would result in a YAML syntax error when starting the pipeline, since running `"string" =~ "anotherString"` in Ruby gives `type mismatch: String given (TypeError)`.
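The Ruby behavior mentioned above can be checked with a short snippet. This is only a sketch of plain Ruby's `String#=~` and says nothing about GitLab's expression evaluator; the variable names are made up for illustration.

```ruby
# Plain Ruby: String#=~ rejects a String on the right-hand side.
left  = "string"
right = "anotherString"

outcome =
  begin
    left =~ right
    :no_error
  rescue TypeError => e
    # Report the exception class and message instead of aborting.
    puts "#{e.class}: #{e.message}"
    :type_error
  end

puts outcome
```

So in plain Ruby there is no silent fallback to a substring search; any such fallback has to come from GitLab's own evaluation code.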
Alternatively, I would have understood if `=~` was being overloaded as `.contains(` for right-hand sides of type string.
I know... Substring search and regex matching are not really comparable. In the case of a regex match, we cannot say that the regex is longer or shorter than the matched string; either is possible. Nevertheless, I cannot shake the feeling that GitLab somehow reverses the direction of the `=~` operator if its right-hand side is an invalid regex.
By default, partial matches are accepted. E.g. `prefixmiddlepostfix =~ /middle/` is true.
Looking at this behavior, I would have expected that `prefixmiddlepostfix =~ middle` (without `/`s!) was true too (since there is no syntax error), and `middle =~ prefixmiddlepostfix` was false. But it is the other way around.
Current implementation
From the parser for CI/CD expressions, I can see that parsing errors are actively ignored, and invalid regexes are stored as plain strings internally.
After parsing, the CI operator `=~` is evaluated as `right.scan(left)` (yes, that's the actual order).
Note that `right.scan(` might be subject to polymorphism. I did not check this, because I didn't have the time to set up a complete Ruby debugging environment just for this. But my impression is that ...
- If `right` is a valid regex, the parser stores it as `Lexeme::Pattern`, which results in `Lexeme.Base.scan(` being used, effectively calling `"leftHandSideString".scan(/rightHandSideRegex/)`.
- But if `right` is an invalid regex, and the parser stores it as a plain string, then we call `"rightHandSideString".scan("leftHandSideString")` directly, because the `Lexeme.Base.scan(` (which normally reverses the order back to normal) isn't even called this time.
If that is the case, then I would say the current behavior is indeed a bug and not a feature. But it is really hard to tell, because neither the documentation nor the code gives any indication of what is expected here.
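To check whether this reading is at least consistent with the experiment, here is a minimal Ruby model of the two code paths described above. It is a sketch and not GitLab's source: `parse_rhs` and `ci_match?` are hypothetical names, the `/…/` detection is simplified (no regex flags), and Ruby's own regex engine stands in for whatever GitLab uses internally.

```ruby
# Hypothetical model: a right-hand side of the form /.../ that compiles
# becomes a Regexp; anything else silently stays a plain String.
def parse_rhs(value)
  match = %r{\A/(.+)/\z}.match(value)
  return value unless match

  Regexp.new(match[1])
rescue RegexpError
  value # parse errors are ignored, the raw string is kept
end

# Hypothetical model of the `=~` evaluation.
def ci_match?(left, right)
  rhs = parse_rhs(right)
  if rhs.is_a?(Regexp)
    !left.scan(rhs).empty? # "left".scan(/right/)
  else
    !rhs.scan(left).empty? # "right".scan("left"): operands end up swapped
  end
end

values = ["1234", "23", "/23/"]

# Rows are LEFT, columns are RIGHT, as in the table from the experiment.
table = values.map { |left| values.map { |right| ci_match?(left, right) } }

values.zip(table).each do |left, row|
  puts "#{left.ljust(4)} | #{row.map { |ok| ok ? '✔' : '×' }.join(' | ')}"
end
```

Under these assumptions the model reproduces all nine results from the table in "Steps to reproduce", including `23 =~ 1234` being true and `1234 =~ 23` being false. That supports the explanation but does not prove it, since the model was written to follow my reading of the code.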
Possible fix 1 -- Syntax Error
Forbid invalid regexes and stop the entire pipeline (not just a single job) with a syntax error.
The usual YAML syntax error would work. But if the variable was supplied from outside when starting the pipeline, an extra error would be better: one pointing out that the problem is not (entirely) in the `.gitlab-ci.yml` file, but in the value of a variable.
Possible fix 2 -- It's not a bug, it's a feature!
Document an expected behavior for invalid regexes, e.g. as a feature "Overload `=~` operator for right-hand side of type string".
Personally, I'd prefer if we could change the order for that overloaded operator (i.e. implement it as `left.contains(right)` instead of `left.isSubstringOf(right)`) but I understand that the current order makes more sense if you want to be backward compatible for an undocumented bug feature.
Having this additional operator would be much appreciated (no matter in which direction it is implemented).
For now, it allows users to work around #209904 (closed). But even if that issue is fixed, it would still be useful when you want to do plain substring matches, so that special symbols like `.*[(` in such variables don't have to be escaped.
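The escaping problem can be illustrated in plain Ruby. This is only a sketch: the `needle` and `haystack` values are made up, and Ruby's regex engine is used as a stand-in, so the exact set of characters that need escaping in GitLab may differ.

```ruby
needle   = ".*[("
haystack = "prefix.*[(postfix"

# As a regex the needle is invalid: "[(" opens a character class
# that is never closed.
regex_error =
  begin
    Regexp.new(needle)
    nil
  rescue RegexpError => e
    e
  end

# A plain substring test needs no escaping at all.
plain_match = haystack.include?(needle)

# The regex route only works after escaping every special character.
escaped_match = haystack.match?(Regexp.new(Regexp.escape(needle)))

puts "regex error: #{regex_error.class}"
puts "plain substring match: #{plain_match}"
puts "escaped regex match: #{escaped_match}"
```

A documented substring operator would give pipeline authors the second variant directly, without having to pre-escape variable values that come from outside.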
Possible fix 3 (=1+2) -- Have both options
Add a third variable type "regex" (in addition to regular variables and file variables). If the variable is of the new type "regex", fix 1 applies. If the variable is of the default type, the overloaded behavior from fix 2 applies -- always (!), even if the variable is the string (not regex!) `/23/`.