We might be able to remove ci_builds.trace completely once we have stored this data in a safe place, if it is not in object storage already.
I did check on GitLab.com how much data is stored in the ci_builds.trace column. This column was used when GitLab CI was still a separate product; we then moved to storing traces on disk. According to my queries, it is:
@ayufan and @grzesiek I'd like to pick this up. And while I'm at it, move the traces that depend on projects.ci_id to the new location too.
My proposal for doing so will require quite some time during migrations, but eventually we'd end up with one way of storing traces. Let's start by listing the ways we store traces now (a rough lookup sketch follows the list):
The old, old way: in the ci_builds.trace column
The old way: on disk, in a path based on projects.ci_id
The current way: on disk, at File.join(Settings.gitlab_ci.builds_path, job.created_at.utc.strftime("%Y_%m"), job.project_id.to_s)
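For context, this is roughly the fallback order a reader has to walk through today. The method and file names below are illustrative, not the actual Gitlab::Ci::Trace code:

```ruby
# Illustrative sketch only: try the current on-disk layout first, then the
# legacy layout keyed on projects.ci_id, and finally the database column.
def read_trace(job)
  current_path = File.join(
    Settings.gitlab_ci.builds_path,
    job.created_at.utc.strftime("%Y_%m"),
    job.project_id.to_s,
    "#{job.id}.log" # file name is an assumption for this sketch
  )
  return File.read(current_path) if File.exist?(current_path)

  # The old way: a path keyed on the legacy projects.ci_id value.
  if job.project.ci_id
    old_path = File.join(
      Settings.gitlab_ci.builds_path,
      job.created_at.utc.strftime("%Y_%m"),
      job.project.ci_id.to_s,
      "#{job.id}.log"
    )
    return File.read(old_path) if File.exist?(old_path)
  end

  # The old, old way: the ci_builds.trace column itself.
  job.read_attribute(:trace)
end
```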
To start, in gitlab-ce%"10.0" we'd introduce the new format. Gitlab::Ci::Trace is already set up to allow adding new directories, so I'll just add one and make sure project removal works as expected. Also in this release we'd move ci_builds.trace to the new location.
%10.1 will include a migration to steal the background migration that migrates ci_builds.trace, and drop this column. References to Ci::Build#old_trace can be removed then too. Furthermore, we'll include the background migration to move the traces depending on ci_id to the current location. This is the last place in the code base that depends on that column.
%10.2 will steal the background migrations again, after which we drop the projects.ci_id column (if possible). Then we'd move on to the last move action: in %10.1 we can see the performance of the smaller move operation, and adjust to that in this release.
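To make the cleanup step concrete, the %10.1 post-deployment migration could look roughly like this. This is a sketch only; MigrateBuildTraceToFiles is a placeholder job name, and I'm assuming the steal helper behaves as it does today:

```ruby
# Post-deployment migration sketch: finish any remaining background jobs that
# move ci_builds.trace to files, then drop the column.
class CleanUpBuildTraceMigration < ActiveRecord::Migration
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false

  def up
    # Runs any scheduled-but-unfinished background migration jobs inline.
    Gitlab::BackgroundMigration.steal('MigrateBuildTraceToFiles')

    remove_column :ci_builds, :trace
  end

  def down
    add_column :ci_builds, :trace, :text
  end
end
```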
One note: I was thinking of introducing ad hoc moving too, so when we load a trace from an old location we do a move operation first, after which we serve it. This might be a good idea in terms of spreading the load and making the migrations easier to handle, but it would increase request timings and we'd be bound to operations taking a few seconds at most. Plus we might run into the Unicorn worker being killed. Probably not worth it.
Edit: we should move artifacts to the same directory when using local storage, so we can remove the whole directory when erasing a job.
Can you explain why we can't ship two migrations in 10.1 (the one that migrates ci_id and the one that migrates the trace from the database) and steal both in 10.2?
With the new format I mean File.join(Settings.gitlab_ci.builds_path, job.project_id.to_s, job.created_at.utc.strftime("%Y_%m")). So the project ID comes first, allowing batch deletion. Later we can add artifacts to the same directory.
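In other words, something like this (a sketch, helper names are made up):

```ruby
require 'fileutils'

# Proposed layout with the project id first:
#   old: builds_path/2017_08/<project_id>/<job_id>.log
#   new: builds_path/<project_id>/2017_08/<job_id>.log
def new_trace_path(job)
  File.join(
    Settings.gitlab_ci.builds_path,
    job.project_id.to_s,
    job.created_at.utc.strftime("%Y_%m"),
    "#{job.id}.log"
  )
end

# With the project id first, removing a project's traces becomes a single
# directory removal instead of a scan over every month directory.
def remove_project_traces(project)
  FileUtils.rm_rf(File.join(Settings.gitlab_ci.builds_path, project.id.to_s))
end
```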
That is an option too, I'm just a bit afraid of creating a lot of IOPS on that NFS mount. I read the code for these types of migrations and it seems I could schedule them in a way that they take enough time not to interfere with each other, but we cannot guarantee this for our customers. They might update from gitlab-ce%"10.0" to %10.3, which will steal these migrations right away. But if you and @ayufan think this is OK, I would like to include it in one release too.
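For the scheduling part, what I had in mind looks roughly like this. Batch size, delay interval and the job class name are made up, and I'm assuming the queue_background_migration_jobs_by_range_at_intervals helper from Gitlab::Database::MigrationHelpers is available:

```ruby
# Post-deployment migration sketch: schedule the trace moves in small batches
# with a delay between them, so the NFS mount is not hammered all at once.
class ScheduleTraceMigration < ActiveRecord::Migration
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false
  MIGRATION = 'MoveOldTracesToCurrentLocation'.freeze # placeholder job class
  BATCH_SIZE = 1_000
  DELAY_INTERVAL = 5.minutes

  disable_ddl_transaction!

  # Lightweight model so the migration does not depend on app code.
  class Build < ActiveRecord::Base
    include EachBatch
    self.table_name = 'ci_builds'
  end

  def up
    queue_background_migration_jobs_by_range_at_intervals(
      Build.where('trace IS NOT NULL'),
      MIGRATION,
      DELAY_INTERVAL,
      batch_size: BATCH_SIZE
    )
  end

  def down
    # no-op: the background jobs are idempotent moves
  end
end
```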
Once we do that, we'd migrate all traces (including the oldest ones) to behave as artifacts: one extra attached file with content.
This would unify how we handle the trace, which is effectively just another artifact. The trace would be stored on Object Storage, it would count against disk space, and it would be explicit where and how each trace is stored. Since files need to be deleted separately, we could have a routine that walks this table and deletes them one by one, thus solving the above problem too.
We have to do it, as we have to fix the old ci_id problem of traces; migrating them to be stored as artifacts makes it explicit and possible.
We need to write a migration to convert them to artifacts.
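Per build, the conversion could look roughly like this. Ci::JobArtifact with a :trace file type does not exist yet, so treat every name here as an assumption, including the read_legacy_trace_file helper:

```ruby
require 'tempfile'

# Hypothetical per-build conversion: read the legacy trace (database column or
# legacy on-disk path), attach it through the artifact machinery, then clear
# the legacy copy. The :trace file type and job_artifacts association are
# assumed, not existing code.
def convert_trace_to_artifact!(build)
  legacy_trace = build.read_attribute(:trace) || read_legacy_trace_file(build)
  return if legacy_trace.blank?

  Tempfile.create(['trace', '.log']) do |file|
    file.write(legacy_trace)
    file.rewind

    build.job_artifacts.create!(
      project: build.project,
      file_type: :trace, # assumed new file type
      file: file
    )
  end

  # Only clear the column once the artifact is safely stored.
  build.update_column(:trace, nil)
end
```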
I wonder if it would be better to do that in a production change, with a simple Rails script, because migrations are supposed to be isolated; in this case we might be unable to isolate the object storage communication sufficiently, and it seems that self-managed instances might not need it. Is that a valid assumption?
I also think a production change might be better because otherwise we would also need a partial index on ci_builds to efficiently identify the ones that still have a trace stored in the database.
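For reference, the index we'd otherwise need would be something along these lines (a sketch, assuming the usual add_concurrent_index helper; the index name is made up):

```ruby
# Migration sketch for the partial index mentioned above; only relevant if we
# go the background-migration route instead of a one-off production change.
class AddPartialIndexOnCiBuildsTrace < ActiveRecord::Migration
  include Gitlab::Database::MigrationHelpers

  DOWNTIME = false
  INDEX_NAME = 'index_ci_builds_on_id_partial_trace'.freeze

  disable_ddl_transaction!

  def up
    add_concurrent_index :ci_builds, :id, where: 'trace IS NOT NULL', name: INDEX_NAME
  end

  def down
    remove_concurrent_index :ci_builds, :id, name: INDEX_NAME
  end
end
```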
Maybe we should start by creating a thin clone and extracting the IDs that need to be migrated. While we're at it, we could also check the projects they belong to and their data, because there could be something wrong with them, since they failed the previous migration.
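Something like this on the thin clone would do, assuming the trace column is still readable through the model (otherwise the same thing via raw SQL):

```ruby
# Rails console on the thin clone: list the builds that still carry a
# database-stored trace, together with the project they belong to.
builds = Ci::Build.where('trace IS NOT NULL').select(:id, :project_id, :status, :created_at)

builds.find_each do |build|
  project = Project.find_by(id: build.project_id)
  puts [build.id, build.project_id, project&.full_path, build.status, build.created_at].join("\t")
end
```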
@morefice how can we quantify the impact of removing this column on the database? 40 rows with data probably do not store a lot; what is the total storage size consumed by this column?
Also a question: the issue says that we used to have 87636 builds like this, and only 40 remain as of now. Why is that? Did we do something to remove or migrate them already?
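To put a number on the storage question, something like this on a replica should be close enough (pg_column_size is standard PostgreSQL and should give a reasonable estimate even for TOASTed values):

```ruby
# Rough measurement of what the remaining trace data occupies on disk.
result = ActiveRecord::Base.connection.select_one(<<~SQL)
  SELECT count(*) AS rows_with_trace,
         pg_size_pretty(sum(pg_column_size(trace))) AS total_size
  FROM ci_builds
  WHERE trace IS NOT NULL
SQL

puts result
```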