After spending hours trying to understand what was going wrong, I concluded that I had found a specific edge case where GitLab does not throw any error message and simply fails silently.
When these conditions are met:
There is a needs relation between two jobs
The parent job has a matrix
The total number of characters in the matrix variable values is greater than 114
...no pipelines are created, without any notification of what could have gone wrong.
Steps to reproduce
Create a project with this minimal .gitlab-ci.yml:
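A minimal sketch of such a configuration (not necessarily the exact file from the original report; the job names and the long value are illustrative, and any setup meeting the three conditions above should reproduce it):

```yaml
stages:
  - build
  - test

# Parent job with a matrix whose combined variable values exceed 114 characters.
Parent job:
  stage: build
  parallel:
    matrix:
      - LONG_VALUE: "foooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo"
  script:
    - echo "parent"

# Child job with a needs relation to the matrixed parent. The need is expanded
# to the generated name, e.g. 'Parent job: [foo...]', which is what hits the limit.
Child job:
  stage: test
  needs: ["Parent job"]
  script:
    - echo "child"
```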
I don't have a fix for this issue. The best workaround I found is to remove the needs keyword while waiting for a fix.
Surface the error when no pipeline is created when matrix reaches limit
Match job name chars limit
Note
We are a 150-seat GitLab Premium customer. Support on this issue would be much appreciated.
Proposal
| Iteration | Description | Limitations |
| --- | --- | --- |
| Step 1: Surface the error | Ensure that an error message is surfaced when character limits are exceeded. | Does not solve the problem of allowing long matrix sections. Users still need to use the identified workaround of variable indirection. |
| Step 2: Make the character limits match | | Does not solve the problem of allowing long matrix sections. Users still need to use the identified workaround of variable indirection. |
| Step 3: Update UI to show variables | Add UI elements to expose the variables used, rather than relying on the name. This may require updating the API to include the information if it is not already available. | Does not solve the problem of allowing long matrix sections. Users still need to use the identified workaround of variable indirection. |
| Step 4: Allow Custom Matrix Names | Exact details to be determined. Allow the user to assign names to the various configurations through the config. This allows for names that are meaningful. When names aren't provided, default to names based on the variables (same behaviour as today). | Requires the UI update from step 3 to maintain parity with the existing experience of viewing variables, but does fix the issue of being limited in matrix variables. |
Thanks @splattael for the ping here. I agree that having notifications to create that visibility will, in turn, provide a better way to see when these scenarios occur.
While I agree pipeline error notifications could be a nice-to-have feature, I would rather not wait for #36806 to be resolved before treating this ticket.
This ticket is about a bug that is affecting us, premium customers, in our work. I am pretty sure this bug, which looks like a buffer overflow, can be resolved by looking at GitLab logs. There is no need for a visual notification of the bug here.
Should I reach out to the premium support helpdesk to escalate the issue?
thanks for the ping @vdsbenoit. Very odd that this particular set of circumstances (variable length + needs) creates the bug!!
A first fix for this might be throwing an error in the pipeline editor when the total variable length exceeds 114 but I don't think that's the outcome you want.
Heads up for @marknuzzo - we might investigate if there's a hard limit we should document here. In the meantime we can document the limit when mixing usage of parallel:matrix and needs here, maybe? cc @marcel.amirault
Let me emphasize that the 114-character limit is not per variable but for all the variables of a matrix item together. The issue becomes very cumbersome when we have multiple variables, each with few characters, but whose sum exceeds 114. For instance:
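(The original example isn't preserved here; the made-up values below illustrate the point: each value is short on its own, but together they add up to roughly 120 characters, which is over the limit.)

```yaml
Parent job:
  parallel:
    matrix:
      # No single value is long, but together these values total ~120
      # characters, which is over the 114-character budget.
      - REGION: "europe-west1-production-cluster"
        ACCOUNT: "platform-engineering-shared-services"
        ENVIRONMENT: "pre-production-integration-testing"
        COMPONENT: "payments-gateway"
  script:
    - echo "deploy"

Child job:
  needs: ["Parent job"]
  script:
    - echo "verify"
```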
@vdsbenoit - thanks for clarifying, I saw the same thing.
@marknuzzo that works for me, can you create the investigation issue? Note this behavior only shows up when both a long combined matrix variable length and needs are present.
Hi @jheimbuck_gl - For now, I marked this issue as ~"workflow::blocked" only to enforce that the #369894 (closed) research should take place first to help inform the direction here. Please let me know if I interpreted that flow correctly.
Huh, this is very strange. I'd like to understand this a bit more to know how to write up the docs for it. Let me ping @mbobin, who I think knows this code well.
@marcel.amirault it's related to how we compute the names for matrix jobs and needs limits. The name for Parent job job becomes Parent job: [fooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo] and because it is a need for Child job, the need name is expanded to use the variable, like needs: ["Parent job: [foooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo]"], but we do have a name length validation for needs, which is 128 characters:
```ruby
[14] pry(#<Gitlab::Ci::Pipeline::Chain::Populate>)> pipeline.stages.last.statuses.last.errors.messages
=> {:"needs.name"=>["is too long (maximum is 128 characters)"]}
```
But it doesn't bubble up as a pipeline error:
And it's interesting that it doesn't match the 255-character limit on job names:
```
gitlabhq_dblab=# \d ci_builds
                              Table "public.ci_builds"
         Column          |            Type             | Collation | Nullable | Default
-------------------------+-----------------------------+-----------+----------+---------------------------------------
 name                    | character varying(255)      |           |          |
```
Was it a bad idea to use the variable values in the job's name?
@mbobin @lauraX Sorry, I'm not following how this limitation works... I tried taking the details in the reply above and documenting it, but each time I wrote it out it was clear I have no idea what I'm talking about.
If you had to add it to the docs, how would you explain it?
Hi @marcel.amirault - I think I would say something along the lines of:
Aggregated matrix variable values cannot be longer than 114 characters. - Do you think we should mention why this is the case?
@marknuzzo I think we could potentially unblock this issue by breaking up a solution in three issues:
Make the error message actually say what the error is
Figure out why the job name doesn't match the chars limit and fix that (perhaps the validation too)
As a longer term solution - use something else as a job name that is not the variable value <- if we do this, we might not need to do any of the above, but everything depends on timing
So we can either truncate this job name OR use something else as a job name ...? Truncating seems like the best solution.
Truncating is not a good option because we can end up with jobs that have the same name in case of multiple variables. We could use the x/x format from the parallel: int use case, but in this case the user will have no idea what variables the job is using without doing some echo statements in the script. Should we add a description field to jobs to hold the information about parallelization?
Figure out why the job name doesn't match the chars limit and fix that (perhaps the validation too)
Truncating is not a good option because we can end up with jobs that have the same name in case of multiple variables.
ah, yes, this makes sense.
Should we add a description field to jobs to hold the information about parallelization?
We could do this, and additionally fix the limit to make them match. Maybe this will be enough for now.
@marknuzzo - I think this is ready to be worked on, with the proposal being to fix the limits to make them match. Marius suggested also adding a description field to jobs, which can be done either as part of this issue or another one.
@marknuzzo - I think this is ready to be worked on, with the proposal being to fix the limits to make them match. Marius suggested also adding a description field to jobs, which can be done either as part of this issue or another one.
I think we can leave the description addition up to whoever picks up this issue - it can be done in the same MR or as a follow-up, depending on how the MR goes.
Thanks @lauraX - @dhershkovitch - with this weighted now and being ready, I think we need to compare this issue against our next prioritization ~"type::bug" board to determine the appropriate timing in an upcoming iteration/milestone.
/cc @treagitlab for awareness due to the ~"type::bug" discussion and priority.
This ~"group::pipeline execution" bug has at most 25% of the SLO duration remaining and is an ~"approaching-SLO" breach. Please consider taking action before this becomes a ~"missed-SLO" in 14 days (2022-09-03).
Hi @jheimbuck_gl @richard.chong @carolinesimpson - though this will become a ~"missed-SLO", looking at the next prioritization ~"type::bug" board, it would seem that the best timing for this, based on other prioritized bugs right now, is %15.6 at the earliest. WIP limits may change, but I just wanted to start a conversation here to start thinking about where it can best be slotted in. WDYT?
@jheimbuck_gl may we have an explanation of why this missed-SLO issue suddenly went to the backlog? In my experience, GitLab backlog == won't do...
GitLab developers identified the bug in GitLab and multiple Premium customers mentioned they are impacted by this issue. This bug prevents us from using the parallel:matrix feature and drove us crazy because no errors are displayed in the GitLab GUI. It will likely impact more people because the threshold is quite low.
It is a little frustrating to see how such an issue is handled after I spent hours preparing and writing the report.
@vdsbenoit sure thing. First off, thanks for the writeup; having those really makes these easier to replicate and fix.
The team is currently focused on a number of issues that are high severity and priority (you can see the bugs scheduled in 15.6 here) for the next couple milestones. I update what the team is planning in our letter from the editor and direction pages if you'd like to take a look.
Instead of pushing issues out from milestone to milestone as we plan, we are reviewing what's already ~"workflow::ready for development" and pulling in issues for the capacity we have within a milestone. I'll be reviewing that list (this issue included) against other issues in the upcoming milestones.
Just chiming in here that we're hitting this bug too, and it's pretty annoying. It really limits how effective the matrix keyword is, because you can't really have any meaningful matrix setup without hitting this variable limit, and if you have any long value you're in for a bad time.
It'd be far better to suffer some information loss in the job name (i.e., truncate the characters in the name to the 255-character max limit) than to just have the job die because of the naming convention. It's also not great from a UX perspective, because it manifests as just a generic pipeline error, so our engineers will often reach out to our platform team unsure of how to actually resolve it, and the platform team then has to research what changes were made recently to help the team understand.
Hey @jheimbuck_gl - do you know if there is a clear direction for how to resolve this issue? It's obviously something we'd love to see fixed, and I'm willing to look into contributing an MR to fix it since it keeps getting pushed out; there doesn't seem to be a consensus on how it would be resolved, though.
I suspect we have a couple requirements here:
Limiting the matrix keywords to < 128 characters is not a great solution, as it:
is not intuitive
severely limits the use of parallel and matrix together
Increasing the database limit is not likely something we want to do. 255 is a really large limit for a manually created job, but really easy to hit with generated jobs, and increasing it to > 500 would not really fix the issue, as much as it would hide it.
Maybe this is something that would be entertained, since the initial issue that talks about the name deprecation in 15.0 doesn't note the why behind it other than standardization. It still doesn't feel wise to me from a performance/cost perspective, though.
We need to avoid duplicate names existing within the same pipeline.
To that end, I'd suggest maybe adding some logic to the matrix naming strategy (which I think(?) is here) where it performs the following actions:
If the current length of "#{job_name}: [#{vars}]" is < 128 characters, then use the value as it currently exists
If the current length of "#{job_name}: [#{vars}]" is > 128 characters, then run an md5 hash of #{vars}, and use that in place of the vars.
While md5 has some risk of collision, it's fast, the risk of a collision in this space should be very minimal, and it results in a nice small value (32-character hash + 4 for the ": []" characters = 36 characters). That leaves 92 characters for the manually entered job name, even with an unlimited number of variables within the job. Yes, there is some loss of information about which job is running which variables, but this seems better than the whole pipeline dying. This could also be mitigated in the short term by users printing out variable values in the before_script if they need to.
@PatrickRice it looks like @lauraX and @mbobin had a discussion about how to fix this and applied a weight for what it would entail, but the proposal did not get updated with that outcome.
Would one of you please update the proposal to what you settled on for a fix, so that if a community contributor picks it up the implementation steps are clear? Thanks!
@jheimbuck_gl - I saw that discussion, but it seems like the only proposal is to make the lengths match (change the needs validation to 255). While that does make sense, it doesn't really fix the issue that the pipeline fails without an error (I assume that'd be a separate issue) or that the variable-based job naming causes parallel:matrix jobs to easily exceed that limit (which is what I assume is being fixed here). It feels like there is another fix needed here too, otherwise the parallel:matrix approach is severely limited. For example, we use parallel:matrix jobs for copying docker images between ECR repositories, with two variables: SOURCE_IMAGE and DESTINATION_IMAGE, and one ECR URL is 48 characters long (not counting the image name), which means our job name is a minimum of ~200 characters with just two variables plus the image name. Does it fit? Yeah, but only with 2 variables. Adding a third or fourth basically instantly breaks our pipeline. Setting the limit to 255 is basically doing what I mentioned in the "increasing limit" bullet - it hides the issue, but it doesn't really fix it.
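To make that concrete, here is a sketch of the ECR copy use case (the account ID, regions, repository, and tag are placeholders; the point is that two full image URLs per matrix entry push the generated job name past the limit):

```yaml
copy-image:
  parallel:
    matrix:
      # Each value is ~65 characters, so the generated name
      # "copy-image: [<source>, <destination>]" is already ~145 characters.
      - SOURCE_IMAGE: "123456789012.dkr.ecr.us-east-1.amazonaws.com/team-a/service:1.2.3"
        DESTINATION_IMAGE: "123456789012.dkr.ecr.us-west-2.amazonaws.com/team-a/service:1.2.3"
  script:
    - echo "copy ${SOURCE_IMAGE} to ${DESTINATION_IMAGE}"  # actual copy tooling omitted
```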
Maybe I'm reading something wrong and @lauraX or @mbobin could help me out!
I have updated the proposal above with some slight fixes that Marius and I discussed, which can probably be done in separate MRs.
@PatrickRice You are correct about several different issues needing to be fixed :)
The proposal that Marius and I discussed was to fix the current situation of silent erroring, and then make the character limits match. This would probably be a good idea since this is a bug anyway. This is what I updated the proposal with.
This does not fix the broader issue that the variable job naming causes parallel:matrix to easily exceed the limit, though.
I wonder if bumping up the limit to 512 would be reasonable? I don't really have data to corroborate that this would indeed fix the problem, but I assume fewer users would be affected with a higher limit. This would still have to be validated though, since I'm just brainstorming here.
We could also do naming the same way we do for parallel: int, and add a description field so that on the job show page the user would know the information about the parallelization.
It seems like we need a more friendly way to name the jobs that doesn't rely on the variable content, but where the user is still able to easily determine which variable values were used in a particular job run. Is that correct? It seems to me we could just use a randomly generated string if we had a way to tie it back to the variables and display them on the UI.
Perhaps @v_mishra might have some ideas since it crosses the line into UX/design?
I wonder if bumping up the limit to 512 would be reasonable? I don't really have data to corroborate that this would indeed fix the problem, but I assume fewer users would be affected with a higher limit. This would still have to be validated though, since I'm just brainstorming here.
I agree that this likely helps, but in the above example it allows maybe another 3-4 long variables? It also causes a weird developer experience where developers may want to give their variables non-descriptive names in order to reduce the number of characters in the job names. I like your thought about the description field, and I wondered if a good boring solution would be to have a JOB_NAME special variable that overrode what showed up in the brackets. The only issue I've thought of with that is that it really only works when you have a 1xN matrix (this works with parallel:int because it's always 1xN). For example, in the following setup, how would we know where to put the JOB_NAME (or description keyword):
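(The configuration that originally accompanied this comment isn't preserved; the sketch below is a reconstruction consistent with the diagram further down: a pre-job followed by my_job with a two-dimensional matrix. Variable names are illustrative.)

```yaml
pre-job:
  script:
    - echo "prepare"

my_job:
  needs: ["pre-job"]
  parallel:
    matrix:
      # A 2x2 matrix: four jobs, one per combination of the two dimensions.
      - DIMENSION_ONE: ["1", "2"]
        DIMENSION_TWO: ["3", "4"]
  script:
    - echo "run ${DIMENSION_ONE} x ${DIMENSION_TWO}"
    # Where would a single JOB_NAME (or description) go, given that each
    # generated job is a different combination of the two dimensions?
```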
It seems like we need a more friendly way to name the jobs that doesn't rely on the variable content, but that the user is still able to easily determine which variables values were used in a particular job run.
I agree completely. The md5 string I proposed is essentially that; it's a pseudo-random string from a user perspective when the name is too long, and I figured there would need to be some iteration on how to make the presentation back to the users friendly.
@lauraX Can you have a look at the Proposal I added for steps to address this based on this thread and ensure that I have captured it all correctly? I'm a little unsure if we can actually proceed with the truncation because of technical requirements to have unique names, or if it is just an inconvenience to the user to have duplicate names. Perhaps jumping straight to a unique random name without the truncation step is the way to go? WDYT?
@PatrickRice I thought as much regarding upping the limit.
a good boring solution would be to have a JOB_NAME special variable that overrode what showed up in the brackets.
This is a VERY interesting solution; I wonder if we could work with this. Although it's a more complex solution, it feels like the best long-term thing to do, since it allows infinite variables and even more flexibility on the job name that is meaningful to the user. @mbobin - what do you think of this?
@carolinesimpson we do not have a technical restriction on unique names. I'm not sure about using a unique random name, as it would be almost the same as truncation but without any information. WDYT about Patrick's suggestion above?
@carolinesimpson we can't truncate the names because we need unique job names in a pipeline (i.e. for needs).
Perhaps jumping straight to a unique random name without the truncation step is the way to go?
This could be an option, but as a user I would not know what I'd be looking at on the pipeline page and even in the job page. This would also make debugging harder when the YAML is invalid (i.e. missing needs/dependencies jobs).
a good boring solution would be to have a JOB_NAME special variable that overrode what showed up in the brackets.
maybe with a different name, but I think it will still need to be based on something meaningful.
I think the steps needed here are:
surface the error from Ci::BuildNeed
make the limits match
explore different solutions to bypass the limit. I think this will require UX.
I wonder if bumping up the limit to 512 would be reasonable
That's hard to do because the database type for this column on .com is varchar(255) and we'd need to convert it to text to bump the limit, which would require careful planning (and around 5 releases to execute). There are also a few indexes on this column which would have to be recreated. And having longer job names will definitely increase the database size.
For example, we use parallel:matrix jobs for copying docker images between ECR repositories, with two variables: SOURCE_IMAGE and DESTINATION_IMAGE, and one ECR URL is 48 characters long
@PatrickRice as a workaround until we figure this out, introducing some indirection might work:
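(The snippet that accompanied this suggestion isn't preserved; the sketch below shows the kind of indirection meant: keep the matrix values short so the generated job names stay small, and resolve them to the long values inside the job. The registry URLs, variable names, and the bash-specific indirect expansion are illustrative assumptions.)

```yaml
copy-image:
  variables:
    REGISTRY_US_EAST: "123456789012.dkr.ecr.us-east-1.amazonaws.com"
    REGISTRY_US_WEST: "123456789012.dkr.ecr.us-west-2.amazonaws.com"
  parallel:
    matrix:
      # Only these short values end up in the generated job name.
      - SOURCE: US_EAST
        DESTINATION: US_WEST
  script:
    # Resolve the short keys back to the full registry URLs (bash indirect expansion).
    - SOURCE_REGISTRY_VAR="REGISTRY_${SOURCE}"
    - DESTINATION_REGISTRY_VAR="REGISTRY_${DESTINATION}"
    - SOURCE_IMAGE="${!SOURCE_REGISTRY_VAR}/team-a/service:1.2.3"
    - DESTINATION_IMAGE="${!DESTINATION_REGISTRY_VAR}/team-a/service:1.2.3"
    - echo "copy ${SOURCE_IMAGE} to ${DESTINATION_IMAGE}"
```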
maybe with a different name, but I think it will still need to be based on something meaningful.
Could you describe what you mean when you say it would need to be based on something meaningful? My thought was that we'd let the devs name the individual jobs explicitly via a keyword/variable, thus letting them assign their own meaning. If they want to use "A", at least that has meaning to them. I.e., in the example above, it would result in an execution that looked and was named like this:
```mermaid
graph LR
  pre("pre-job")
  pre --> A("my_job [1x3]")
  pre --> B("my_job [1x4]")
  pre --> C("my_job [2x3]")
  pre --> D("my_job [2x4]")
```
Honestly, while I like this because it allows devs to assign their own meaning, I'm still not sure how it works well with a non-1xN matrix, and that's what causes me pause with it right now.
This could be an option, but as a user I would not know what I'd be looking at on the pipeline page and even in the job page.
This is why I'm a bit of a fan of the md5 approach; it's not truly random. It doesn't have as great a UX as the current solution because you can't see, just with a hover, which variables are in play, but importantly, as long as the variable names/values are the same, the md5 sum would be the same every time and the pipelines would work for effectively infinite variables. Applying this only when the job name length is > 255 (after the needs fix; 128 right now) would mean that the UX would continue to work as-is for most users of GitLab, and this solution would only apply when using large jobs. If a given job failed, it would be very simple to add an echo ${variable} setup to the before_script, like you noted in your workaround, to determine exactly which combo was causing issues, and again - the job name would be the same with every invocation.
as a workaround until we figure this out, introducing some indirection might work:
We actually do something slightly different from this as a workaround: we store the variables in a dotenv file in source, in a distinct folder per matrix job, and retrieve them from there based on names in a before_script to minimize the length of the variables. It's a pretty bad dev experience, and it tends to break the pipeline when people try to add new variables or matrix permutations; that happening again is what caused me to come back here and offer to help solve the issue, because I don't like losing velocity to broken pipelines and confusion. Your approach is more DRY than ours so I like it more, but if I'm going to refactor I'd rather go to a more permanent solution.
@v_mishra I haven't reviewed the proposed solutions; do they all require UX changes? I think once we settle on one path forward we can move this to your epic.
cc @carolinesimpson for awareness. I changed this to ~"workflow::planning breakdown" since there are multiple proposals and this could need design. If we have a preferred path forward, can we say as much in the issue description and then work with Veethika on design?
@jheimbuck_gl It's more that there are multiple steps to get to the full solution, rather than multiple proposals.
Displaying the variables in a matrix pipeline run once we stop having them in the name is the part that will require some UI/UX work.
The final step of having the user assign custom names to the matrix jobs could also use some UX support, although a developer could potentially propose an updated YAML structure. Ideally I'd like some input on the UX of the structure, though, to make sure it makes sense.
Thanks for clarifying @carolinesimpson. Can you or an engineer update the proposal to reflect that those are steps and not possible solutions, and add any specific questions for @v_mishra to answer before we move forward?
Sure @jheimbuck_gl, I have attempted to make it clearer.
@lauraX or @mbobin Would you mind filling in a little more detail in the proposal (particularly any details about the character limits matching) to make it more obvious to a developer picking this up what that refers to exactly? Thanks!
We've encountered this issue too, both in a FOSS namespace and in our SaaS Ultimate namespace.
Adding FOO: bar to this matrix causes the error (it can be reproduced easily using CI Lint by checking "validate").
We've limited the number of variables as a workaround, but it will become blocking soon.
We're a Premium [SaaS & Self-Hosted] customer, and we've run into this issue too. I've reported it to support, who linked this ticket as I did not find it myself. We have spent a significant amount of time trying to diagnose this issue. Seeing better error messages and/or any of the proposed/suggested fixes outlined above when the name length exceeds the limit would be awesome!
Looking forward to seeing the solution on this one - apologies if this is not the right area to add this comment.
Just want to note that we have also come across this when we want to use different docker images for each matrix element and specify the images via their hashes (which are typically quite long), e.g.
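(The original snippet isn't preserved here; the image paths and digests below are fabricated, but a digest-pinned reference is easily 110+ characters on its own, so a single matrix value already exceeds the 114-character budget.)

```yaml
integration-test:
  parallel:
    matrix:
      # "sha256:" plus 64 hex characters is 71 characters before the image path.
      - IMAGE: "registry.example.com/platform/tooling/python@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"
      - IMAGE: "registry.example.com/platform/tooling/golang@sha256:fedcba9876543210fedcba9876543210fedcba9876543210fedcba9876543210"
  image: $IMAGE
  script:
    - echo "running against ${IMAGE}"
```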
Is the error message surfaced in the pipeline yet? IMO that's the minimum critical/immediate fix if this won't be closed soon. (see "Step 1: Surface the error" above).
Why? Twice our small team (within a larger org that I assume has also faced this) has lost more than a full day trying to isolate this (despite making an effort to understand, document, and share knowledge on the problem when it was first encountered over a year ago). The stated "255 character variable limit" is misleading when debugging, the surprising effect of needs makes it difficult to A/B test when attempting to root-cause, and the lack of any failure-cause output is obviously problematic.
I do appreciate that the issue is linked within the parallel:matrix documentation now.
Mark Nuzzo changed title from No pipelines created when matrix var is longer than 114 chars to Backend: No pipelines created when matrix var is longer than 114 chars
This issue is very critical for us (Premium customer).
I don't understand why we cannot set a custom job name, which should be unique (this could be verified with an error message).
The custom job name could use some matrix variables.
This is very simple and pragmatic.
@mgibsongl The concern in this issue was addressed. A validation error appears when the parallel job name is too long. The next step in #420669 (closed) will allow the needed parallel job name to be longer. The ask to set a custom name for a parallel job should be proposed as a separate feature request issue instead of re-opening this one.
I like the solution for the array with the JOB_NAME example.
Here is a use case to put this issue in perspective.
We are running GitLab Ultimate and implementing Renovate Bot in our GitLab instance.
Background:
Renovate is an alternative to GitHub's Dependabot.
Our GitLab instance holds over 2,000 repositories.
We have a repository that runs the Renovate pipeline.
Due to the number of repositories, we have to split the Renovate job; otherwise, if it fails on a single repository, all following repositories do not get their dependency PRs created.
Issues:
By default, if you take all the repositories discovered by Renovate and give them to a child pipeline, you run into the first problem: only a maximum of 200 jobs are allowed to be created. So by default, we could only have 200 repositories enabled in Renovate.
The solution was to split all the discovered repositories into groups of 5 and then give them to the parallel matrix. As a result, we now have a maximum of 1000 repositories enabled with Renovate, because each Renovate child job processes 5 repositories.
Next problem (unsolved): the matrix job name!
We need to be able to use a distributed cache. However, the parallel matrix does not work well with a good cache key that can be reused between pipelines. We hoped to use ${CI_PROJECT_NAME}-${CI_JOB_NAME}. The problem here is that the name of the job is not only long but also contains spaces, due to the arguments given to Renovate to process only 5 repositories.
Furthermore, it is essential that these jobs are generated and hard-coded into the .gitlab-ci.yml.
In my opinion, a JOB_NAME variable would solve many of our problems. We could then have the cache key `renovate-${JOB_NAME}`, which means the cache between Renovate jobs could be successfully loaded from the distributed cache. Yes, I know these names and jobs can shift around when a new repository is created and then added to Renovate; however, that is an acceptable tradeoff.
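For illustration only, assuming a hypothetical JOB_NAME (it is not an existing GitLab keyword today), the idea would look roughly like this; the repository batch and cache paths are placeholders:

```yaml
renovate:
  parallel:
    matrix:
      - RENOVATE_REPOS: "group-1/repo-a group-1/repo-b group-1/repo-c group-1/repo-d group-1/repo-e"
        JOB_NAME: "batch-01"  # hypothetical: a short, stable name for this matrix entry
  cache:
    # A short, user-assigned name keeps the cache key reusable across pipelines.
    key: "renovate-${JOB_NAME}"
    paths:
      - renovate-cache/
  script:
    - renovate $RENOVATE_REPOS  # pass the batch of repositories as arguments
```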
Hi @gjrtimmer, thank you for sharing your context and thoughts here! We currently have an epic to address the concerns around matrix job names here: &11791. It looks like the issue most related to your concern would be #285853.