Respondents primarily access job logs to debug failed jobs, with 83% indicating this reason. 54% of respondents require access to job logs for up to 30 days.
The most common challenge faced in managing artifacts is storage. Storage is affected by the number of artifacts generated from pipeline runs, not the individual artifact file sizes.
Users expressed a need for improved job log interfaces, including timestamps and auto-collapsing sections for better usability.
Suggestions
A mechanism to bulk delete job logs after a designated retention period would significantly free up storage space. Here are some ideas:
For job logs generated in the future - Automated process
Provide an option for users to opt into automated removal, running cleanup at specified intervals.
Add job logs to the existing expire_in keyword to set a designated retention period.
For job logs generated in the past - Bulk deletion and archive
Provide a "delete all" button for job logs from a specific day forward.
Enable bulk deletion through the API (a rough sketch follows this list).
Develop a better search and multi-select mechanism in the UI for easier management of job logs.
Implement auto-collapsing for script sections in the job log to help users navigate more easily when there is a lot of output.
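As a rough illustration of the API-based bulk deletion idea above, a script along the following lines could erase jobs older than a chosen retention period through the existing jobs and erase-a-job REST endpoints. This is only a sketch: the host, project ID, token, and 30-day cutoff are placeholders, it assumes GNU date, curl, and jq are available, and note that the erase endpoint removes a job's artifacts together with its log.

```shell
#!/usr/bin/env bash
# Sketch only: bulk-erase jobs older than a retention period via the existing REST API.
# Placeholders: GL_HOST, GL_PROJECT_ID, GL_TOKEN. Assumes GNU date, curl, and jq.
set -euo pipefail

GL_HOST="https://gitlab.example.com"
GL_PROJECT_ID=12345678
GL_TOKEN="<your_access_token>"
CUTOFF=$(date -u -d "30 days ago" +%Y-%m-%dT%H:%M:%SZ)   # designated retention period

page=1
while :; do
  jobs=$(curl --silent --header "PRIVATE-TOKEN: $GL_TOKEN" \
    "$GL_HOST/api/v4/projects/$GL_PROJECT_ID/jobs?per_page=100&page=$page")
  if [ "$(echo "$jobs" | jq 'length')" -eq 0 ]; then
    break
  fi

  # Erase each job created before the cutoff (this removes its log and any artifacts).
  echo "$jobs" | jq -r --arg cutoff "$CUTOFF" '.[] | select(.created_at < $cutoff) | .id' \
    | while read -r job_id; do
        curl --silent --output /dev/null --request POST \
          --header "PRIVATE-TOKEN: $GL_TOKEN" \
          "$GL_HOST/api/v4/projects/$GL_PROJECT_ID/jobs/$job_id/erase"
        echo "Erased job $job_id"
      done

  page=$((page + 1))
done
```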
Detailed Findings and Respondents' Background
Primary Workflows for Job Logs:
The primary reason for accessing job logs is to debug when jobs fail (10 responses).
Other reasons include seeing test results, checking build speeds, ensuring reproducibility, auditing specific tests, and assessing project health (1 response each).
Duration of Job Log Access:
54% of respondents (7 people) need access to job logs for 30 days.
Other durations include (1 response each):
Custom retention policy
One week
Indefinite for tagged releases, 30 days for per-commit builds
3 months
6 months
1 respondent was unsure about the required access duration.
Challenges in Managing Artifact Storage:
Storage is affected by many teams running numerous pipelines.
Challenges are more related to Docker images than to artifacts/logs.
Issues with cleanup policies from runners and specific retention policies were highlighted.
Additional Needs and Improvements:
Respondents look forward to having timestamps on job logs as a default feature.
The feature flag FF_SCRIPT_SECTIONS has significantly improved the user experience and is suggested to become default.
One respondent suggests an interface with auto-collapsing sections (e.g. script sections from the bottom) for better usability.
Compliance Requirements:
54% (7 respondents) work under internal requirements.
38% (5 respondents) need to comply with GDPR.
15% (2 respondents each) are not working with any of the listed requirements, are unsure, or comply with PCI-DSS or FDA standards.
Background:
This research is part of https://gitlab.com/gitlab-org/ux-research/-/issues/2971+, conducted in May 2024. It targeted Dev team leads, platform engineers and software developers from a mix of SMB and enterprise-size customers. The goal was to determine specific use cases for restricting download access to artifacts and to understand how job logs are used and how long they need to be retained.
The survey ran for a month, yielding 12 valid responses. Of these, 53% were from technology companies involved in creating pipelines, writing code, and running tests. Additionally, 54% of respondents use Self-Managed GitLab, while 23% use GitLab.com.
Job logs do not currently have a retention policy (no expiration) and are difficult to manage
As part of &8715 we need to add functionality that gives customers visibility and user-friendly options to view and delete these job artifacts and traces. We currently have one API allowing for deletion: https://docs.gitlab.com/ee/api/jobs.html#erase-a-job.
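For reference, erasing a single job with that documented endpoint (which removes its log and artifacts) looks roughly like this; the host, project ID, job ID, and token are placeholders:

```shell
# Erase the log and artifacts of a single job via the documented erase-a-job endpoint.
# Replace the host, project ID, job ID, and token placeholders with real values.
curl --request POST \
     --header "PRIVATE-TOKEN: <your_access_token>" \
     "https://gitlab.example.com/api/v4/projects/<project_id>/jobs/<job_id>/erase"
```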
There are multiple potential solutions. This issue addresses the first item only.
Introduce a retention policy which expires job logs after a certain period (e.g. 7 days) for auto deletion.
@jocelynjane I agree that we should introduce a retention policy and I think that's preferred over API or UI. This makes it easy for users to manage their job logs without having to work through API/UI to remove them.
Having said that, we need to be cognizant of the impact of adding a new retention policy on existing job logs. Some users may have taken for granted that job logs will remain forever and be caught by surprise when a retention and removal policy is introduced. So we might need to allow ample time between announcing that we will be introducing a retention policy and when expiration begins, so that users who need any information from job logs can do what is necessary to retrieve what they need.
There may also be other reasons why we have always kept job logs around without deleting them. I don't know what they are off the top of my head, because it has been that way since I joined the team. We need to validate if these reasons still apply. @fabiopitino do you know what other reasons could there be? If there is no strong reason to keep them forever, I think it's fair to start clearing them.
I found a doc that states it is safe to remove but will result in empty UI. It makes sense to me, so I'm not sure why we don't erase them automatically after a set retention period.
There are existing ways to remove job logs, but they are not ideal for regular maintenance of projects, especially on GitLab.com:
It only gave users 3 days to adapt, which seems very short. The blog post was released on June 18, 2020, and the changes were to take effect on June 22, 2020. We should also consider other communication media besides blog posts; I'm not sure how many users actually read blog posts religiously every month.
This default job expiration applies to all artifacts EXCEPT job logs. Also note that while the change took effect 3 days later, we set expiration dates to 1 year after the creation date, so we didn't start deleting artifacts immediately.
Do you have plans to do so? This will help, and we will not need to clean up ourselves via the API.
Introducing an expiration policy for job traces is a good idea, but I think 7 days is not a reasonable default value.
Most other CI services retain job logs for 90/180/360 days (e.g. GitHub Actions = 90 days). Anything you choose should be on the same scale, because users typically read documentation only when something unexpected happens. And "something unexpected" in this case would be an irrecoverable data loss.
I would agree a default expiration policy would be OK, and I agree with @sio that 7 days is way too short. I guess the best would be a default with an option to change it for either a project or a namespace (I would prefer the latter).
But don't you think it's kind of strange to introduce a default expiration policy for job traces when there is no such thing for job artifacts (on gitlab.com)?
Would a non-customizable 90 day retention policy for job traces be enough?
IMHO this should be configurable at the group level on GitLab.com SaaS, similar request for expire_in for job artifacts in #370552
There are various reasons why users may want to keep job logs for a longer period of time. I'll try to impersonate a few ideas:
Compliance and reporting in my company requires the logs to be kept and archived.
A 90-day default does not work when logs need to be kept for over a year to be able to debug/analyse older builds
Keeping logs forever does not hurt because they are only a few kilobytes of data
I'd suggest adding a config option that allows users to opt-out.
Is there a significant need for a separate project/namespace setting for job trace retention length?
Owners/maintainers of a namespace will see the log storage warning - my thinking is that they will want to take action. For that target persona, a namespace setting should be sufficient for housekeeping and for preventing the storage from growing over time.
Would it also be enough to just follow whatever expiration the job artifacts have?
That's 7 days on GitLab.com SaaS according to @joshlambert 's comment in #370552 (comment 1101372068), which can be too low if you are coming back to investigate a job log. It potentially makes sense not to follow 7 days, because the amount of data generated in artifacts is estimated to be higher than for a job log.
Priority: (Customer did not get back to me on this)
Why interested: We must retain all build logs for 6 months to meet FedRAMP High certification standards
Problem they are trying to solve: Allow our company to take on contracts that require FedRAMP High
Current solution for this problem: No current solution, keeping logs forever is probably ok though auto removal after 6 months would be ideal for us.
(I have asked the customer to elaborate on how having the auto removal after 6 months helps with their workflows, but they did not respond back to me.)
Impact to the customer of not having this: As mentioned above, there is the possibility that this (logs being able to be deleted before 6 months) may not allow us to meet new compliance rules, which either means losing contracts or moving off of GitLab CI
Thanks @kenneth - this is helpful information to have. We need more of this type of feedback to understand what an automated policy might look like in terms of expiry duration.
Is there any way that job log retention behavior can be changed so that it relies on the same artifacts:expire_in setting as artifacts do, or is this inadvisable?
I ask because I have a GitLab Premium customer report of a repository using over 100GB of storage, almost all of which is artifacts. While researching, I discovered on the Artifacts page many "artifacts" that are older than their artifacts:expire_in setting (1 day). When I expand the artifacts in the list, they are all job logs.
As in the reports above, the customer is concerned about the upcoming namespace storage limitations, and is also confused as to why the artifacts appear to remain beyond expiration.
Please let me know if I can provide any more information. Thank you!
@mgibsongl we currently do not have a way for customers to expire job logs as part of the artifacts:expire_in setting. Job logs are not considered the same type as our other build artifacts. I am curious if the customer would be willing to answer a few questions on how they use job logs and what their housekeeping needs are around these specific artifacts. My calendar is open to chat or I can send over questions for async discussion.
Another reason build artifacts might not be expiring would be tied to a known bug we have. Some artifacts, when generated with "keep latest artifacts", are not being properly unlocked. If you're seeing any of those items still lingering past their expiration period, it is likely due to this bug.
Thanks Jocelyn. The user seems to be unconcerned in general with job logs specifically; rather, they were surprised by the following:
The jobs the logs belonged to were deleted.
The job logs showed as artifacts.
The artifacts (which were actually job logs) weren't being removed as they expected based on the artifacts:expire_in setting.
Based on the conversation I had with the customer, those are my takeaways. It's my understanding that they do not require nor desire that job logs for deleted jobs remain. I hope that helps clarify it a bit more.
We've been using https://gitlab.com/gitlab-de/use-cases/gitlab-api/gitlab-storage-analyzer (and community scripts) to purge artifacts archive from a large project on gitlab.com that has an intensive pipelines usage.
Yet, the storage for trace logs and metadata remains high considering the volume of jobs, and we are struggling to reduce the storage any further.
Is there currently an alternative available to purge them (instead of manually deleting them through the UI, which is impossible to consider)?
@jocelynjane - can you comment on when this is being addressed? Having a mechanism to address job logs is one of the key requirements to enforce storage and is noted in: #375296
@joshlambert I think we had a miscommunication on this one. This was not part of the work I had planned on, as 1) we still don't have data to tell us what the policy should be and 2) we provided the UI to assist in this process.
If we need to do something separate, @shampton and team will need to deep dive a little more into the details and provide an estimate. I don't see this being trivial.
@alberts-gitlab Can you take a look at this sometime in the next week to give a surface level proposal and weight for this? Please timebox an hour for your investigation.
To verify, are the job.log files counted against total consumed storage for SaaS customers? If so, even though these are relatively small files, they do add up, especially for projects with large amounts of CI/CD activity. Not being able to easily clean these up, especially through the artifacts API, seems like a storage penalty (if it counts towards limits). There is the option to perform a "select all" on the Builds->Artifacts page, which is nice, thank you for this, but unfortunately the 20-items-per-page limit continues to be a bottleneck for projects with hundreds of pages of these files (or more).
If I am incorrect, and the API should work for the job.log files, please let me know.
Create a universal policy for job logs (i.e. all job logs will expire in 6 months as a first iteration; a custom setting would be a follow on based off demand) OR
Apply the same expire_in settings to job logs.
The second option would be more ideal versus just expiring all job logs at a specified time (and creating a separate setting). Is it possible to get a rough idea of what the implementation would take? Thanks!
@jocelynjane I think if we are able to add expire_in to job logs specifically that would work. The user may want to have the job logs take longer to expire than the other artifacts, so as long as we can account for that use case we should be okay.
I agree that a separate expiration setting for logs might be needed, maybe something like job_name:expire_logs_in. Then default to 6 months, and enforce the max to 6 months?
In terms of complexity, this also depends on what we want to do with existing logs that have no expiration set. Do we migrate them all to have an expiration of 6 months from now?
Do we migrate them all to have an expiration of 6 months from now?
@iamricecake @jocelynjane Since users have no way of setting the expiration of past logs to be longer than six months (e.g. "never"), I think we should default the existing logs to expire_logs_in: never. Users can go back and delete past logs if they want to clean up, but I don't think we should force their logs to expire if they don't want to.
@shampton Maybe an alternate idea to "force" default cleanup: if newly created pipelines after a specified date (e.g. the 16.5 release date) default to 6 months, and old pipelines keep the never behavior, no data will be harmed. Users can override the settings from the UI or config respectively, and this value overrides any other defaults.
@shampton @jocelynjane do we actually want to support never expiring job logs? I thought this is what the retention policy is trying to address. IMO, if we will support non-expiring job logs, we will have to implement account-type limits for this, like maybe only supporting this for Ultimate, to prevent abuse.
@iamricecake you are correct - we do not want to support never expiring job logs. I think our customers need to potentially set up their own expiration time period. We are implementing this issue because we need to have automated housekeeping.
@shampton I think we want to go and help remove the old job logs. We can have a "from this day forward" policy based on our release of this feature and provide a 6 month (? - or less) timeline where we let customers know either they need to take an action to save the older logs or we will do a batch deletion.
@jocelynjane - we may need to confirm if there are auditing and compliance requirements related to job logs that may require extensive retention of logs
From a customer perspective: I'd like to keep job logs for releases/tags forever, as this seems valuable from a compliance point of view. Probably the same for all default branch jobs
FYI, part of the new storage management automation docs also includes programmatic snippets for deleting job logs with the GitLab CLI, and python-gitlab API library scripts to delete job logs. A complete example matching age or size is also available.
@pmurray7 would your customer be interested in chatting with me for 15-30 minutes on their artifact usage and how they perform artifact housecleaning for research purposes? I'd like to understand what types of solution would work best for our customers, as we have a number of options (e.g. separate job log settings vs. using expire_in).
We are a GitLab Premium customer with 28 seats, and have a good number of projects with automated pipelines.
I can't recall us ever needing to refer to a job log older than a week. Having the ability to set an automatic job log expiry tailored to our expectations, just like other artefacts, would be perfect.
Just to give a better idea of the current end user experience from our perspective.
We have one project with 6.4 GiB of job logs, probably going back 2+ years at this point. As the UI allows me to select and delete only 20 items at a time, you can imagine how long it might take for thousands of logs. Needing to fall back on using the API directly is not a solution; it just reinforces the usability problem of this feature.
The end result is that we simply "let it be". Even a slight improvement to pagination in that UI would help (items per page, quicker page browsing).
Always interesting to see how diverse the requirements are.
In our case (Premium with currently 23 seats), we sometimes need to refer to job logs that are more than 1 year old. In our old engineering system, going back as far as 3 years happened frequently, and I don't see any reason why it would be different with GitLab now.
The scenario is that we're running a lot of end-to-end tests, and these are somewhat fragile by their very nature as we have a lot of dependencies on third-party services. We want to be resilient, but we do not have the bandwidth to investigate every single broken test run. So when something breaks we mostly just re-run the test job so we are unblocked and can move on. For those errors that occur frequently, we fix them asap. For those we have already seen but happen rarely, we grep through the e2e job logs to get an idea how often and when it already happened, then tackle them step by step.
In the old system, we accumulated 173 GiB over 7 years, which compresses down to 70.8 GiB on disk. Not an issue at all, because storage is cheap.
So if or when a retention policy is introduced, I think it makes sense to default to a relatively low expiration duration like 1 week or 1 month. However, it would be great if we could set an infinite or a pretty high duration for those other cases where we actually want to keep the logs around.
Why interested: Wanting to cut down on storage for older, obsolete pipelines.
Current solution for this problem: Manual solution to delete pipelines via API. Artifact retention can take care of this partly, but job logs still require manual intervention.
@manuelgrabowski @ovider can either of you please provide additional information on what the customer need is here, specifically - are they looking for job logs to have the same expire_in as their other job artifacts? Do job logs need a different expiration policy (e.g. perhaps most of the artifacts can be deleted after 3 months, but job logs need to be retained for 6 months), or do they need something entirely different? Thanks!
Hi @jocelynjane, in this case the customer was looking for a way to delete entire pipelines after a given time, to reduce storage usage from artifacts and job logs. Now that you mention it, automatically delete old pipelines (#338480 - closed) might have been a better fit for their ask. While deleting pipelines, and thus both artifacts and job logs without having to consider expire_in beforehand, is possible manually and scriptable via the API, they were looking to avoid having to implement something via the API.
@manuelgrabowski I am behind developing a solution for job logs independent of deleting the pipeline. If there is an interim solution here through this issue, I'm happy to discuss.
Self-hosted instance admin here, on GitLab-CE. And +1.
We're currently onboarding Renovate to get dependency updates automated (rather than doing it once/twice/whenever per year and basically missing important security updates just because nobody's looking/paying attention) and set up a dedicated project/pipeline as they suggested for Renovate Runner. Since we run this once every hour (with the default limit of 2 MRs max per run), we've accumulated over a thousand jobs in barely a month, each retaining its run log (which grows longer the more repositories we add to the cycle).
Those old logs don't serve any purpose once the next one (or maybe the next 3-5) have come around, so the best case for those would be to prune them after X runs (only keeping the last X runs/pipelines/jobs alive by default) or expiring them after X days so they're automatically deleted.
Currently, those are the only pipelines we create through GitLab Runner, but we plan on adding more for various other tasks; they have similarly diminishing usefulness after a certain point (mostly when the associated branch was deleted or the associated MR was merged, but also once X newer pipelines have come around and they've basically become obsolete).
With that, we also get a certain kind of pipeline where we want/need specific control, particularly the option to never reclaim it: releases (as well as patches/hotfixes etc.), where we want to keep the logs for audit/compliance purposes.
Fortunately, this isn't a problem yet for us in regards to size, resource usage etc.; but we can see this getting out of hand and requiring further customization on our end to keep things manageable (as many other comments already mentioned, by using the API/glab/etc. to delete them regularly).
And while this is certainly doable, the number of different people posting their own variants of such a script, as well as the capabilities of similar products (such as AzDO), makes me wonder how this isn't a thing in GitLab already.
To sum it up, I see the following things one might wanna do:
Decide whether Pipelines/Jobs and their associated Logs should expire or not (for auditing/compliance reasons, some projects might not want to expire them ever.)
Decide when to expire them (based on time/age, number of more recent ones for the same ref, or tie-in to a ref/branch/MR.)
Define this on instance level (globally, as a default; might as well be gitlab.rb) then override on Project (or maybe Group) level, down to Branch level (which likely goes into a .gitlab-ci.yml directive.)
Override persistence on the Pipeline/Job itself and mark them as persistent (for one-offs, or when branch-rules/etc. aren't feasible to catch them all.)
Note that most if not all of those are inspired by Azure DevOps Services, which we currently use for most of our CI needs (mainly because we had it set up before we switched to GitLab; and many of the pipelines we have over there do not really carry over to GitLab Runner yet.)
Overall, it's about storage quotas.
A job without artifacts might still make sense, but a job without a log?
We are in a similar situation (self-hosted instance) and focus solely on retention policies for whole pipelines (via API).
With such a policy, a separate policy for job logs would certainly be less important for other customers.
Since I stumbled into the quota section while looking for something else, that's what our Renovate runner repository looks like at the moment. It does nothing else than host the config file for Renovate and the pipeline YAML file.
4 KiB repository vs. 150 MiB job artifacts (which is just the job itself plus its logs) seems excessive, and anything older than, let's say, a week (honestly, anything older than a day) has no real value for us, since the only reason we'd look in there is to investigate issues while running the pipelines themselves (which are bound to fail then, so we will look as soon as practical).
And that already includes the following:
default:
  artifacts:
    expire_in: 3 days
...so there's not a whole lot else that I could put into the file that would help. I don't even think this does anything, since there are no actual artifact files, just the console log (stdout plus stderr, if any).
Just to follow up on my own screenshot: I've since run the script that loads a month's worth of builds through the Rails console and then erases the build. The repository is now down to low single-digit megabytes (because I didn't delete all the logs yet), but this is painful to do on a regular basis. Something that I can schedule (perhaps through a CI script) would be largely preferable in the interim (to echo Fabio's comment further down: bulk delete/expire first, retention strategy long term).
We cannot remove Artifacts from S3 without causing geo replication issues without this feature. S3 costs will continue to increase.
Problem they are trying to solve:
Removing objects from S3 storage causes Geo replication to fail as there are still existing artifact references for artifacts which are removed from S3
@pmurray7 is the customer willing to chat about more specifics on their artifact management processes, and how they use job logs to help with the design here? We have a number of potential options for implementation and are looking for customer feedback. Thanks!
Implementing a mechanism to delete job logs after a designated retention period would significantly free up storage space for the customer's needs, specifically in regard to the budget for storage costs.
@jocelynjane - can you share the latest status for this issue regarding priority? If you need any more context from the customer, please let us know.
@manuel.kraft we have a number of potential solutions, but we don't have enough data yet to select and implement one. We have also been focusing on resolving bigger Build Artifact issues with improper unlocking of "keep latest". I would like to understand 1) how the customer is using job logs and 2) how the customer currently manages artifact storage in general (e.g. what is their current artifacts policy)?
@bonnie-tsang - let's prioritize this with your other Build Artifacts validation work. This will make a big impact for our users (as it relates to storage cost).
Customer would like to configure the time on how long the job logs should be kept. It should be done similar to other retention configurations for storing GitLab data.
The current retention policy for other data is normally set to 30 days.
Job logs are consuming a lot of storage and customer wants to reduce storage costs.
@jocelynjane - Can you give an estimate for this work? As you mentioned, it is related to costs, which is a priority 1 topic nowadays.
Thanks for the follow up @manuel.kraft! I understand this is a big pain point for our customers, and important as we look to enforce storage limits.
@bonnie-tsang is looking at the data we have collected to determine our next steps in the design process (solution validation) in %17.0. Once the best fit solution is identified, we can assess the effort for implementation. We will do what we can to get this into our plans!
Thanks @jocelynjane - I think from a UX perspective, we can see how other parts of GitLab implemented such policies for the corresponding data/storage and try to make it consistent with those; if it's a reusable UI component, it will make it even easier. Of course, from a backend perspective it may be totally different, but I guess the groundwork is available from the other parts of our product.
@manuel.kraft we have definitely considered various options here (whether it is a separate policy, or we add job logs to the existing expire_in, and what the UI may look like) - it ultimately depends on the best fit, as we want to be careful of introducing too many settings/complexities for artifact management. The implementation is reusable!
@manuel.kraft @ddornseiff can you find out if this really comes in the near future, or whether we need to build another workaround to deal with the deletion of job logs...
Hey, we are premium subscribers and would like to see this since you will be imposing quotas soon and some of our larger projects have gigabytes of job logs.
We are also finding this to be a problem. We are approaching storage limits, and after clearing all old artifacts we have GBs of job logs left. It would be good to have a simple way to manage this.
A lot of manual work was required for this customer to be able to understand why artifact cleanup efforts were not impacting the overall storage. By investigating objects via the rails console we were able to understand that a lot of the artifacts were in fact large job log traces that have to be erased individually via the API.
Since traces do impact storage consumption, we should definitely have a place where a user is able to monitor large traces and manage their space.
I'm asking because I've been working on this and other job-log-related issues as part of Category:Build Artifacts, and I'm happy to reassign any and all of those issues.
Why interested: We identified that we have significant storage usage in some projects, which turns out to be just logs. Our database has almost 13M artifact entries which are entries about logs and not "real" artifacts.
Problem they are trying to solve: High storage usage. This impacts Geo replication of artifacts as well.
Current solution for this problem: None, as the documented API and instance-wide workarounds erase the entire job, including job artifacts. Customer noted that "Artifacts retention and logs retention should be separate. I can think of cases where artifacts are needed just for few days and logs for few months, and vice versa."
Impact to the customer of not having this: Artifact storage costs will continue to increase, Geo replication of artifacts continues to be impacted.
@smathur we're currently gathering input on this subject to build a retention policy/plan. Would the customer be interested in having a 15 min chat to talk about how they use their job logs and the workflow for storage management to help shape our solution? Thanks!
@jocelynjane I invited you to the call with the customer on Monday at 7 am PT. He is in Greece so I have to accommodate the earlier time. Let me know if you prefer a different day.
@jocelynjane I just had a call with a customer that is also experiencing issues with legacy job logs taking up a lot of space and being difficult to remove. Are you still looking for customer feedback?
@bonnie-tsang to help the transition of this issue, can you please summarize the findings from the interviews and survey for handoff? group::pipeline execution will handle solution validation (if required) and implementation.
If there is a clear recommendation from the findings, please note that as well. Thanks!
Usually goes back at most a few days, maybe a week or two. (link)
2 - Identify patterns of failure history
There are scenarios when we attempt to revisit much earlier jobs while identifying patterns of failure history.
Identifying patterns of job timeouts in the last month. This would need the job log to exist for at least a month for the best investigation outcomes. (link)
Situations where users needed to refer to logs from a year ago for debugging and audit purposes. If a retention policy is implemented, we would need to retain the deployment and release logs for at least one year. (link)
3 - Performance monitoring
Don't need to store these logs for a long time. (link)
We received 12 valid responses. Most respondents were from technology companies involved in creating pipelines, writing code, and running tests, primarily using Self-Managed GitLab.
Please see the results summary and suggestions in the issue description. Detailed findings and the background of the research can also be found in the details block.
Manually cleaning via Ruby, which is slow, requires anyone doing it to study the documentation, and is prone to human error. However, I cannot imagine I would clear job logs and artifacts instance-wide, or on a per-project basis on more than 2 projects.
@irisb the work in %17.2 is for UX research, not the implementation (per the workflow label). This category has now transitioned to group::pipeline execution, and @rutshah is the right PM to discuss timelines.
Do you think it makes sense to combine this issue with some of the CI data retention discussions/issues to date?
@cheryl.li While this may fall under the data retention strategy, it may take a while before we get to implement it as retention strategy across all CI data (e.g. dependency on partitioning, etc.).
I believe that a bulk deletion (actually, expiration) of artifacts can get us a long way and users could schedule that periodically as they wish. This may be a much smaller work than the retention strategy which requires a lot more considerations.
Trace deletion could be more complex because we also have to erase possible trace chunks.
Once we can bulk expire and delete trace artifacts (job logs), we have the underlying support for automatically expiring artifacts.
Separate issue: we can introduce an instance setting to set a default job log expiration date. We can have a strategy to introduce this on GitLab.com and also to apply it retroactively (e.g. 1 year in the future).
Retention policy (this issue) to align job logs to the pipeline archival.
Why interested: We have projects storing upwards of 320GB / 1,000,000 job logs (S3 object store) -- this affects backup and recovery timing and retention costs. Probably bloats the DB too.
Problem I am trying to solve: Facilitate customer led (project maintainer) cleanse / or automated expiry.
Current solution for this problem:
Through the Rails console (admin only), remove data month by month (this takes hours for heavy projects; smaller projects can use larger time filters or remove everything at once)
# 1. Set user to appear as the author of deletion
admin_user = User.find_by(username: '<userid>')

# 4. Remove each of the build logs (This takes time)
builds.each_batch do |batch|
  batch.each do |build|
    print "Ci::Build ID #{build.id}... "
    if build.erasable?
      Ci::BuildEraseService.new(build, admin_user).execute
      puts "Erased"
    else
      puts "Skipped (Nothing to erase or not erasable)"
    end
  end
end

# 5. Repeat 3-4 with a new time range until removed.
The problem with this approach is that it took roughly 48 hours to complete (~1 hour to retrieve a month's worth of builds, deletion of those builds at 5 requests per second).
Increasing the search range (through project.builds) beyond a month for projects of this size takes hours (days?) and increases the risk of failure.
I can see there's an experimental GraphQL bulkDestroyJobArtifacts mutation used by the project artifacts screen -- this is one interface available to customers, but it requires examples to programmatically walk through job logs and remove them
Impact to the customer of not having this: Potential storage quota issues, increased duration towards backup and recovery
We had another customer report some difficulties with managing artifacts for old projects. There is still a lot of confusion. For example, we calculate Repository Size & LFS, but a job log is considered an artifact and isn't counted towards repository limits. If you have tens of thousands or hundreds of thousands of jobs, it can be difficult to tell at first glance what type of artifacts are using up space. You could have a successful expire_in configured and still have no idea why artifact usage is still high, because it can be due to the sheer number of jobs. Having a retention policy would make this easier and not require scripting, which costs customers time.
In the case of this internal issue we suggested the use of the support glab tool:
## Requires installation of glab and jq

# Set project ID variable
GL_PROJECT_ID=12345678

# Grab all job IDs from the project and put them into a list.
# It uses the paginate flag to ensure we continue to get all results
# and uses JQ to only grab jobs older than the 1st of this year
glab api --paginate --method GET projects/$GL_PROJECT_ID/jobs \
  | jq --compact-output '.[] | select(.created_at < "2024-01-01") | .id' \
  > ${GL_PROJECT_ID}_jobs_list.txt

# Take input from the list and call the erase method on each of them
# This erases all job logs older than the set date and then displays the name of the job
# You can add a sleep if necessary due to rate limits
while IFS= read -r job_id; do
  glab api --method POST "projects/$GL_PROJECT_ID/jobs/$job_id/erase" \
    | jq --compact-output '.name'
done < ${GL_PROJECT_ID}_jobs_list.txt
Why interested: They have a repository that regularly runs a lot of tests and writes logs. In the case of green tests, the logs do not need to be kept; in the case of failed tests, the developers want to look at the log and do their debugging. There is currently a case where a repo - although only 4 months old - already has 10 GB of logs.
Impact to the customer of not having this: Potential storage issues.
@rutshah is there already a timeline for this feature?
hello @cheryl.li - Discussed this with another large enterprise here in DACH (2000 users, Premium, Self Managed) again this week, and they are also heavily demanding the implementation of this functionality to get control over the massive amount of storage required for job logs.
Can you already provide a potential timeline for this? thank you!
Thanks for the ping @manuel.kraft! This is on our radar, but given capacity constraints on the team and current assignments, we're likely not going to review this until FY25-Q4 at the earliest. I believe there are plans to review our data retention policies as a whole across the organization that @mjwood will be driving.
With that said, I wonder if there are quick wins we can be building for our customers, e.g. if they want to remove select types of job log data themselves, and need not wait for a global retention policy to be in place.
Thank you @cheryl.li for the quick update/context and your offer to potentially provide an interim solution which customers may use to remove select types of job logs themselves.
I have already reached out to the customer's team this morning to get their feedback on this approach/proposal. Currently waiting on a reply.
@alex-dess - in case this gets traction, please coordinate with the client team while I am off for the next few days.
Hello @cheryl.li - I talked with the customer team today and they would be happy to get help from your team on an interim solution. Can you share how to proceed from here and what they should test?
A valid workaround might be a scheduled pipeline that deletes old job logs. The gitlab-storage-analyzer tool, mentioned above by @dnsmichi, could potentially be wrapped in such a pipeline. IMHO it would only require a GitLab CI template/component to be used.
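To make the idea concrete, here is a minimal sketch of what the script of such a scheduled cleanup job could run, using glab instead of the analyzer tool. It assumes glab and jq are available in the job image and that glab authenticates via the GITLAB_TOKEN environment variable; CLEANUP_TOKEN and the cutoff date are placeholders, while CI_PROJECT_ID is a predefined CI/CD variable.

```shell
# Sketch of a cleanup step for a scheduled pipeline job (not an official template).
# Assumes glab and jq are installed in the job image.
# CLEANUP_TOKEN is a placeholder CI/CD variable holding a token with API scope;
# glab picks it up via GITLAB_TOKEN, and CI_PROJECT_ID is provided by GitLab CI.
# For self-managed instances, GITLAB_HOST may also need to be set.
export GITLAB_TOKEN="$CLEANUP_TOKEN"
CUTOFF_DATE="2024-01-01"   # erase jobs created before this date

glab api --paginate --method GET "projects/$CI_PROJECT_ID/jobs" \
  | jq -r --arg cutoff "$CUTOFF_DATE" '.[] | select(.created_at < $cutoff) | .id' \
  | while read -r job_id; do
      glab api --method POST "projects/$CI_PROJECT_ID/jobs/$job_id/erase" > /dev/null
      echo "Erased job $job_id"
    done
```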