The current backup process is painfully long thanks to the single-threaded gzip, especially when upgrading. I would like to bring gitlab-ce#20773 back to life because:
- More and more cores are becoming the standard.
- Parallel compression programs (like xz and pigz) allow users to specify a thread count.
- For on-premise instances, operators may schedule backups at low-traffic times (e.g. midnight), so there's no worry about eating up all system resources.
Proposal
Have a configuration option to let users choose their preferred compression program (and arguments), defaulting to gzip.
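A minimal sketch of what such an option could look like, assuming a plain settings hash; the `compression_command` key, the default constant, and the helper name are all hypothetical, not GitLab's actual configuration API:

```ruby
# Hypothetical sketch: pick the compression command from a settings hash,
# defaulting to single-threaded gzip. The 'compression_command' key is an
# assumption for illustration, not a real GitLab option.
DEFAULT_COMPRESSION = 'gzip'

def compression_command(settings)
  # Allow arguments too, e.g. "pigz -p 4", by splitting into argv form
  settings.fetch('compression_command', DEFAULT_COMPRESSION).split
end

compression_command({})                                    # => ["gzip"]
compression_command('compression_command' => 'pigz -p 4')  # => ["pigz", "-p", "4"]
```

Returning argv form rather than a shell string lets the backup task spawn the program without going through a shell.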
@markglenfletcher
We could probably reuse a large part of my old MR for this,
but before investing any time in it, we'd need to know whether GitLab is willing to merge something like that now.
Assuming not much has changed in that area in the past year, the only thing you'd need to change in my MR is this method:
```ruby
# if available use pigz (http://zlib.net/pigz/) as the compression command instead of gzip
# pigz is "A parallel implementation of gzip for modern multi-processor, multi-core machines"
# which makes backup creation and restoration on these machines a lot faster
def self.compression_command
  if self.which('pigz')
    'pigz'
  else
    'gzip'
  end
end
```
Instead of checking whether pigz is available and using it automatically, I'd probably use two ENV variables, something like `GL_ENABLE_PIGZ_BACKUPS` and `GL_SET_PIGZ_PROCESSES`.
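Under that scheme, the method above might read the two variables like this. The names come from the proposal, but treating the first as a `'true'` toggle and the second as a pigz `-p` process count is my assumption:

```ruby
# Hypothetical sketch of the two proposed ENV switches. The variable names
# are from the proposal above; treating one as a 'true'/'false' toggle and
# the other as a pigz -p process count is an assumption.
def compression_command(env = ENV)
  return ['gzip'] unless env['GL_ENABLE_PIGZ_BACKUPS'] == 'true'

  command = ['pigz']
  processes = env['GL_SET_PIGZ_PROCESSES']
  command += ['-p', processes] if processes # pigz -p N caps the thread count
  command
end
```

Passing the environment in as a parameter keeps the helper testable without mutating the real `ENV`.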
We are interested in this. I see it wasn't implemented because of CPU considerations, but I imagine this no longer applies, especially if GitLab is running in Kubernetes?
This still applies to me. I'm not running Kubernetes (and nor are a lot of people) — I'm running GitLab Omnibus in Docker.
My GitLab backups start at 19:30, and the backup that started last night is still hogging my entire server at 10:15 with a CPU-chewing gzip process. This is not a large GitLab instance — there are around 10 smallish projects on it, yet the backups take over FOURTEEN HOURS because gzip is used!
Frankly, I'd much rather disable backup compression entirely and let something more efficient handle it later. But support for pigz would be an acceptable fix, provided pigz is included by default as part of the Omnibus Docker image.
EDIT: While GitLab has not yet compressed the main backup, it has produced the following files:
```
# ls -1 /backups/
1656576028_2022_06_30_15.1.0_gitlab_backup.tar
1656825594_2022_07_03_15.1.1_gitlab_backup.tar
artifacts.tar.gz
builds.tar.gz
db/
gitlab-secrets.json
gitlab.rb
lfs.tar.gz
pages.tar.gz
registry.tar.gz
repositories/
terraform_state.tar.gz
uploads.tar.gz
```
Most of those .tar.gz files are fairly small, but registry.tar.gz is 280 GB (clearly, I should prune this! But still, it shouldn't be slowing the entire server by trying to compress this…)
It's a real pain. My backup takes ~12 hours, and for almost all of that time a CPU-chewing gzip process is running on one core of eight. Could you please let us either skip compression, choose an archiver and its options, or just use pigz with `-p <threads>`?
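The three modes asked for here could be sketched as the pipeline string a backup task would build; the helper name, file names, and keyword arguments are hypothetical:

```ruby
# Hypothetical sketch covering the three requested modes: skip compression,
# choose an archiver, or use pigz with an explicit thread count.
def archive_command(dir, compressor: 'gzip', threads: nil)
  return "tar -cf backup.tar #{dir}" if compressor.nil? # skip compression

  compressor = "pigz -p #{threads}" if threads # parallel gzip on N cores
  "tar -cf - #{dir} | #{compressor} > backup.tar.gz"
end

archive_command('data', compressor: nil)  # => "tar -cf backup.tar data"
archive_command('data', threads: 8)       # => "tar -cf - data | pigz -p 8 > backup.tar.gz"
```

With `-p 8` on an 8-core box, the compression step that currently pins one core could use all of them.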