Error messages do not contain helpful statements about what or where the error was in the gitlab-ci definition that was causing the job to fail
Since I upgraded our hosted GitLab EE and GitLab Runner to 10.8, I get job failures with the message There has been a missing dependency failure, check the job log for more information.
I can't find any additional information or guidance to understand what the problem is.
Solution
As an MVC solution, we are improving the callout error message here to show the list of dependency jobs that caused the missing dependencies error. The format of the message would be: This job depends on other jobs with expired/erased artifacts: job_name_1, job_name_2.
Remaining problems
TBD
Potential follow-ups
TRUE error trapping and diagnostics.
Look into providing a configuration option to not make previous jobs from previous stages an automatic dependency. See #6144 (comment 195987881)
I am getting this error on a job that doesn't require a dependency, so I don't have a dependencies: key. How does GitLab determine that dependencies are required?
I had the same problem - no defined dependencies or artifacts, yet some jobs within a stage were failing with There has been a missing dependency failure, check the job log for more information.
In this instance, there are actually jobs build-3 to build-10; I pressed the manual button for all of them at the same time, and two of them failed with There has been a missing dependency failure, check the job log for more information.
I'm now seeing There has been a missing dependency failure on subsequent jobs after adding artifacts:reports:junit to my main test job. The JUnit report files appear to be deleted too quickly.
Edit: Solved in this case by adding artifacts:expire_in
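For reference, a minimal sketch of that workaround (the job name, test command, and report path below are illustrative, not taken from my actual configuration):

```yaml
test:
  stage: test
  script:
    - npm test                # illustrative test command
  artifacts:
    # Keep the artifacts around long enough for later stages to download them.
    expire_in: 1 week
    reports:
      junit: report.xml       # illustrative report path
```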
Also experienced this issue. Digging into it a bit, artifacts were being created by the first phase but expired almost immediately and were therefore not available to subsequent phase(s). As @kevinbosman states, a workaround is to manually set the artifacts:expire_in value for your artifacts. This fixed the issue for us.
Not sure why the artifacts were expiring so fast... my understanding is that artifacts are retained for 30 days by default, so either there's a bug (note: we're using the somewhat older version 10.8, so this may be fixed by now) or we've misconfigured something in our setup.
Also, I've noticed that even though a job doesn't have a dependency defined, it still downloads artifacts produced several stages before. Consider the following pipeline:
build -> dockerize -> deploy
A "build" job produces an artifact and a "dockerize" job depends on it.
A "deploy" job doesn't have any dependency defined, but a)it fails when an artifact from stage 1 is expired; b)it downloads artifact and that takes some additional time.
Adding dependencies: [] fixes that weird behavior.
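A rough sketch of what that looks like in .gitlab-ci.yml (job names, scripts, and paths are illustrative):

```yaml
stages:
  - build
  - dockerize
  - deploy

build:
  stage: build
  script: make build              # illustrative
  artifacts:
    paths:
      - dist/

dockerize:
  stage: dockerize
  script: docker build -t app .   # illustrative
  dependencies:
    - build                       # explicitly needs the build artifact

deploy:
  stage: deploy
  script: make deploy             # illustrative
  dependencies: []                # opt out of downloading earlier artifacts
```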
Actually, what I described previously is mentioned in docs:
Note that artifacts from all previous stages are passed by default.
So, it seems to be expected behavior, a little bit odd though.
Yeah, that's very strange behaviour. I have to put an empty dependencies list on subsequent jobs, which is just dumb. I have a 15m expiry on artifacts because those artifacts get staged into a repository and don't need to be in GitLab for long periods of time.
I would like to work on fixing this issue. I would like to make dependencies: [] the default value; let me know if this solution looks good?
I have commented on the similar issue in gitlab-ce; please have a look:
https://gitlab.com/gitlab-org/gitlab-ce/issues/52100#note_135203485
I'm running into this issue even though I have dependencies: [] in my job; if I let artifacts from previous stages expire, I get the "There has been a missing dependency failure" error when I try to run the manual job that has dependencies: []. I would expect that since it explicitly has no dependencies, I would never see this error message.
This is in GitLab Enterprise Edition 11.3.5-ee 7b10203c.
Yeah, this thing where subsequent jobs need 'dependencies: []' should be considered a bug. I didn't declare jobs in my later stages as being dependent on the artifacts that expired, so why should they need them?
I just ran into the issue today that @mearns and @trenton.d.adams experienced. I had a job that specified dependencies, and THOSE dependencies had artifacts in them that had expired by the time the job ran. It errored with "There has been a dependency failure" and nothing else - no information about what dependency in which job. Proper behavior would be to tell you where the unmet dependency originated. @jlenny can we get this pulled into a sprint or looked at?
Please do not just make the message more clear... The default behavior here is not sane.
If GitLab by default considers all preceding jobs in a pipeline to be dependencies of a given job, jobs at the end of a pipeline could be downloading gigabytes worth of artifacts from previous jobs in the pipeline that they don't actually need.
What artifacts a job depends on should be defined explicitly, not assumed to be "everything". The default should be []
@jlenny there seem to be a couple of different problems being raised here. Just want to clarify: what is our initial target for %12.4? Is it the improved error message? If so, can I get a suggestion on what the error message(s) should be?
@iamricecake improving the error message is the minimum we should do here, but we should try to make the behavior sane. People with less complex pipelines rely on artifacts rolling forward, so I don't think we can/should change that. But we can consider a creative way to address @rolanp's comment above that the behavior is not sane for some cases. @ayufan may have interesting thoughts on how to make this more clear.
I beg to differ, greatly! Pointing to the vague documentation isn't what I want - I want TRUE error trapping and diagnostics. Tell me in the error message what stage of the pipeline caused the error and what configuration point was the source. I want to know “this job failed because it depends upon ‘section_foo’ of the pipeline, which has expired artifacts”, and then you can place a link to the documentation.
Option 1: Go with the callout_failure_message and modify it a bit to include which stages/jobs caused the error. But as the screenshot there shows, I don't know if it's possible to format the message to include even just a line break and a link. Will need some help from FE.
Option 2: Let the FE side decide how to display the error; the backend will just return the missing_dependencies in the API response. It will be an array of builds that caused an invalid dependency error, with details about the artifacts, whether expired or erased. The FE can decide how they want to display the info. This will need more FE work than the first option; I can't tell how much work it would require on the FE side.
Regarding @rolanp's feedback, I don't have a concrete idea yet. Maybe provide something like artifacts:dependents_only? Then succeeding stages won't download the artifact if it's not explicitly defined in their dependencies. This way, you won't have to define dependencies: [] in each independent stage. But this seems to be a good candidate for a separate issue/iteration?
@jlenny lemme know what you think. Also, @ayufan would be great to hear your thoughts too.
@iamricecake maybe @jhampton has some idea for the FE or can find someone who can help.
I do think @rolanp is requesting something different here. This issue is about an unclear error message. That's not to say it's not worth looking into but it's certainly a breaking change. I wouldn't block fixing the error message on something that might have to wait for 13.0 or later.
Yes, my request should probably be a separate issue.
I know some users probably find it convenient to not have to explicitly define the dependencies for a given job. However, once your pipeline gets larger than a handful of jobs, it would be preferable to not have to add dependencies: [] to every job that doesn't have any dependencies. This avoids each job having to download a large volume of stuff it doesn't need (several gigabytes in the case of some of our pipelines). I would also say the default is non-obvious to the user, so there are probably folks who aren't even aware that their pipelines are wasting time and space on downloading potentially useless artifacts.
@iamricecake, I chatted with @shampton, and he can advise from a frontend perspective. Also pinging @dimitrieh for any UX insight around message text, etc. I am also happy to help out. Thanks!
The problem with the message format is that HTML collapses white space (including new lines). In order for option 1 to work, we'll have to modify the frontend to take HTML formatted messages and then have the backend add HTML code so that it can create a link.
The backend would look something like:
def missing_dependencies_message
  job_failure_link_start = "<a href=\"https://docs.gitlab.com/ce/ci/yaml/README.html#when-a-dependent-job-will-fail\">"
  job_failure_link_end = "</a>" # closing tag for the documentation link

  failure_message = _("There were missing dependencies from the following stage(s): %{invalid_stages}") % { invalid_stages: invalid_stages }
  help_message = _("Please refer to %{job_failure_link_start}https://docs.gitlab.com/ce/ci/yaml/README.html#when-a-dependent-job-will-fail%{job_failure_link_end}") % { job_failure_link_start: job_failure_link_start, job_failure_link_end: job_failure_link_end }

  [failure_message.html_safe, help_message.html_safe].join("<br />")
end
Then on the frontend we would just need to edit the callout.vue file to use the v-html attribute, which shouldn't break any existing messages that are just plain strings.
The result would look like: [screenshot of the rendered callout message]
We could also go with option 2, it would just be a little bit more work on both sides. I'm open to either.
@dimitrieh could you take a look at this (including my screenshot) and see if this message is acceptable or if we should format it in some other way?
@shampton your solution looks good, but I want to be sure we are considering the following:
If I am correct, the error message provided in the description is one of, I believe, 5-6 standard job error messages (runner error, script error, etc.). These messages are shown in multiple places, such as the callout above the job log or tooltips within the mini pipeline graph.
This solution seems to extend those error messages by providing additional information to fix these errors. This seems like a good idea, though I want to be sure it doesn't affect those other places in unconsidered ways or make things inconsistent in terms of information provisioning.
Additionally, the callout should not become a big stretchable informational widget but should stay concise and refer to another place with more information if need be (it seems this is the case in the screenshot provided, though). For more info, see: https://design.gitlab.com/components/alert/
With this in mind can you let me know:
Does this affect any other error states/types? If so, how? If not, why not?
Does this error information show up in any other place? If so, how? If not, why not?
What are the extremes in terms of information that will be added?
To be clear, when you have a missing dependency failure there is no error message beyond the vague and unhelpful "there are unmet dependencies", which was the root of my frustration and my desire to see better error trapping and reporting. It is unfair and not accurate to say "these errors are displayed other places" when in fact that is not the heart of the issue.
@northrup I understand your point. My main take is not to block this, just to see if we are not causing unintended consequences. Let's do a check and afterward, I feel we should be good to go.
@dimitrieh great point about this message potentially showing up in other places. I've done some exploring and I haven't seen the message show up in places other than the job's page itself, where it shows the alert as in the screenshot. @shampton were you able to find something that I might have missed?
@dimitrieh also some clarification about the messaging: in that spike, I used the word stage but I'm not sure if that's entirely correct. Should we use stage(s) or should it be job(s)? Because technically we are showing a list of jobs there.
@dimitrieh @iamricecake sorry for the delayed response. I haven't found any other places where this message shows up either. I think this specific message is only shown in the callout above the job log. We should be good to move forward with option 1.
@iamricecake if you want to do the backend for this, I can probably just add the small frontend change to your merge request if that's alright with you.
also some clarification about the messaging: in that spike, I used the word stage but I'm not sure if that's entirely correct. Should we use stage(s) or should it be job(s)? Because technically we are showing a list of jobs there.
@iamricecake This is indeed the last part we need to get in the clear. This message is shown above the job log on an individual job detail page.
It reports, as you state, on multiple jobs. Is that correct in this context? Generally, error messages shown above the job log have details about that job only. Do we need to state a list of jobs, or can we pinpoint exactly which dependencies are missing inside this job, for example, and perhaps point to specific places inside the job log (given that we already have that capability)?
In terms of terminology, I would certainly not interchangeably use stage and job, as they mean different things. Job seems to be the right word indeed.
After we get this down, let's update the description with our final solution.
Generally, error messages shown above the job log have details about that job only.
So for example, job_3 has dependencies: [job_1, job_2], and artifacts from these 2 dependencies have expired. The callout message would state There were missing dependencies from the following job(s): job_2, job_1. I think the error is still about the current failed job, which is job_3, implying that its dependencies are missing?
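For illustration, a hypothetical configuration for that scenario (scripts, paths, and stage names are made up):

```yaml
job_1:
  stage: build
  script: ./build_one.sh          # illustrative
  artifacts:
    paths: [out1/]
    expire_in: 15 minutes

job_2:
  stage: build
  script: ./build_two.sh          # illustrative
  artifacts:
    paths: [out2/]
    expire_in: 15 minutes

job_3:
  stage: test
  script: ./test.sh               # illustrative
  # If both sets of artifacts have expired by the time job_3 starts,
  # job_3 fails with the missing dependencies callout described above.
  dependencies: [job_1, job_2]
```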
or can we pinpoint exactly what dependencies are missing inside of this job for example and perhaps point to specific places inside of the job log
When a job's artifacts expire or are erased, this happens to all of them at the same time. So when a job dependency is marked as invalid, it means either all of its artifacts have expired or all of them have been erased:
# app/models/ci/build.rb
def valid_dependency?
  return false if artifacts_expired?
  return false if erased?

  true
end
So I'm thinking that seeing the callout error message would be good enough to lead the developer to look at their .gitlab-ci.yml for the listed jobs and check the related artifacts, given that the message also includes a link to the documentation.
@iamricecake I might have been confused before. This fully makes sense to me, and indeed the callout is the correct way to go. Though I think we can do a small copy change. Let me know if the following makes sense:
There were missing dependencies from the following job(s) on which the current job depends: job_2, job_1
@dimitrieh I'm just not sure about on which the current job depends part. This is because even without dependencies: [] defined, as long as the artifacts from the previous jobs expire or are erased, this error will come up. So not sure if it's completely accurate to say the current job depends on those previous jobs if dependencies weren't defined.
... and yet that is how the code does evaluations, even if you haven’t explicitly declared it. I am more than moderately frustrated with the “good enough” approach logic that leads to “they should be able to guess what esoteric config line generates this vague message from our slightly improved vague error”. The whole point of this was that error messages did not contain helpful statements about what or where the error was in the gitlab-ci definition that was causing the job to fail; to my mind you haven’t addressed that at all and have only marginally improved this issue.
I get that it’s a complex request, and I get that you need to make incremental progress with MVC, but this isn’t addressing the issue at heart it’s just dancing around it.
Thanks for the suggestions @dimitrieh @eread. What do you think about this detailed message:
Lastly, this hints that regenerating the artifacts would fix the problem.
Hm, I'm not sure this would be the best solution. Did you mean to auto-regenerate the artifacts when we get the expired error? If yes, this may seem like a band-aid fix that may not actually address the real problem. Because the error could have been caused by a few things:
Misconfiguration from a user that may have caused artifacts to expire unexpectedly
Confusion about how the feature works and not expecting that all artifacts from previous stages are dependencies by default
By the way, the linked documentation doesn't mention anything about default dependencies. Should we update it to provide more hints to the user?
I like the word "unavailable", though the word "missing" kinda implies "I expect it to be there but I can't find it anymore". I am okay either way.
Your suggested solution could be possible, though I feel it is a complex feature to implement due to having to cover edge cases, like re-running a dependency job that itself depends on another job whose artifacts have also expired by that point; we might end up re-running everything or with infinite loops, so you would also have to track the number of retries and so on. But I am worried it may not actually address the cause of why they end up with expired dependencies in the first place, for example #6144 (comment 195987881). If we can actually solve the real problem, if there is one, then we won't have to further increase build time by retrying jobs.
Thanks for the link suggestion, I'll use that instead.
What do you think about the extra info about listing the artifacts? Do you find it useful or should we just go back to the shorter version which is just listing job names?
Could you see a situation where the error banner becomes way too long? What if there are 1000 artifact files, for example?
It would seem safer to just go with job names, notwithstanding that more information is better in an error.
Yeah, that's potentially a problem. Though I was thinking about how probable that is, like hundreds of artifacts causing an expiration error at the same time. So I was weighing the cost of that disadvantage against the potential benefit of helping the user spot their problem faster upon seeing the list of artifacts in the error message.
But I'm also fine with just listing the job names if there's not much benefit. I'll go ahead and update the message in this MR.
@northrup I reached out to you on Slack for a potential Zoom call. My hope is that a more synchronous conversation can increase my understanding of the pain points and how this can be thought out across iterations.
@jlenny I have verified this on a self-managed installation of the latest nightly build. The error message won't show up on GitLab.com though; it might be because the flag is switched off. I will confirm this once I get access to ChatOps.
Given this issue has potential follow-ups in its description, do we close this one now?
I have confirmed that ci_disable_validates_dependencies is enabled on GitLab.com, which is why the callout message won't show up. Do we intend to keep it this way?
If the potential follow-ups in #6144 (closed) are still needed, they should get new issues opened, and this one can be closed.
@dimitrieh should we open issues for the potential follow-ups?
@iamricecake @dimitrieh what did you end up deciding to do with this one? With %12.4 having now shipped, we should make a decision and close or move this.
I say close this one as this issue's main focus is improving the error message. We can just open separate issues for the potential follow-ups. WDYT @dimitrieh?
Hi, @northrup. From what I understand, this issue is meant to improve the error message which is what we did on !18219 (merged), which also made it to 12.4.
Look into providing a configuration option to not make previous jobs from previous stages an automatic dependency. See #6144 (closed) (comment 195987881)
So I was suggesting to close this issue given the error message has been improved. And open issues for each of these follow-up items because they don't directly relate to the callout error message.
@northrup As we didn't get to have our meeting, would you mind taking the first stab at creating the follow-up issue(s), as @iamricecake mentioned? It will renew our focus and intent!
You seem to have the clearest idea as to what still needs to be done. If need be we can then convert that issue into an epic and schedule out individual issues.
We have our meeting in a bit, let's move on from there.
@jlenny @iamricecake Just had my meeting with @northrup and got to the following conclusions:
Original problem:
Job log doesn't exist when the job artifacts the current job depends on do not exist. This is because the current job doesn't even begin. The current error state gives very little help towards finding out why this is the case and how to resolve it.
Current situation:
!18219 (diffs) goes a long way towards fixing/removing the pain points of hitting this error and not being able to find out how to fix it. This was also why it was labeled ~P2. It has been merged.
Nowhere is it described that the missing dependency is the result of a missing artifact from a previous job run (probably due to expiration).
@jlenny my recommendation here would be to close this issue and consider the follow-ups separately.
Follow up:
In order to prevent this error state/scenario from even happening in the first place, it would be desirable to have more granularity for making jobs dependent on previous jobs, in order to support DAGs.
Currently, you can make a job dependent on a previous job's status and on the availability of its artifact. It is, however, not possible to make a job depend on a previous job regardless of whether the artifact is available.
Use case
This is, for example, required when migrating a CI configuration from a competitor product (e.g. Jenkins) to GitLab CI. In that case there is a need for more granularity (if-this-then-that logic), so flexible dependency statements would make it possible to make things work with native GitLab primitives.
The current workaround consists of making the expiration time longer than necessary.
Make it possible to have artifact expiration time be related to pipeline duration (clock begins ticking when pipeline finishes).
Use case
Artifact expiration is currently counted from artifact creation time; however, there are other jobs within the same pipeline, or even downstream pipelines, that depend on those artifacts. More granularity in setting expiration might help here to remove unnecessary failures due to artifacts being removed too fast.
The current workaround consists of making the expiration time longer than necessary.
Thanks for this analysis @dimitrieh, this is wonderful. @iamricecake I agree with following the recommendation here and am closing this issue. Please let me know if you disagree.