Restore fails due to snapshot directories and leaves instance unusable
## Summary

While testing backup/restore we ran into an issue where the restore fails due to the presence of `.snapshot` directories (present in both the backup and the directory it is restoring to).

We use NetApp via Trident for storage provisioning, and the snapshots are created as part of the NetApp solution. As a client application, GitLab cannot modify or remove them due to permissions.

The worst part, of course, is that the restore fails part way through and leaves the instance unusable (the web GUI works, but you cannot commit or make any configuration changes). Fortunately this is a test instance, and this is why we test before release! :)
The restore fails at this point:

```plaintext
...
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
GRANT
[DONE]
2019-03-24 12:31:13 +0000 -- done
WARNING: This version of GitLab depends on gitlab-shell 8.7.1, but you're running Unknown. Please update gitlab-shell.
2019-03-24 12:31:28 +0000 -- Restoring repositories ...
rake aborted!
GRPC::Internal: 13:move dir: rename /home/git/repositories/.snapshot /home/git/repositories/+gitaly/tmp/default-repositories.old.1553430688.087008660/.snapshot: file exists
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:31:in `check_status'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:181:in `attach_status_results_and_complete_call'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:377:in `request_response'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:178:in `block in request_response'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/interceptors.rb:170:in `intercept!'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:177:in `request_response'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/service.rb:170:in `block (3 levels) in rpc_stub_class'
/srv/gitlab/lib/gitlab/gitaly_client.rb:166:in `call'
/srv/gitlab/lib/gitlab/gitaly_client/storage_service.rb:21:in `delete_all_repositories'
/srv/gitlab/lib/backup/repository.rb:46:in `block in prepare_directories'
/srv/gitlab/lib/backup/repository.rb:45:in `each'
/srv/gitlab/lib/backup/repository.rb:45:in `prepare_directories'
/srv/gitlab/lib/backup/repository.rb:78:in `restore'
/srv/gitlab/lib/tasks/gitlab/backup.rake:87:in `block (4 levels) in <top (required)>'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/rake-12.3.2/exe/rake:27:in `<top (required)>'
/srv/gitlab/bin/bundle:3:in `load'
/srv/gitlab/bin/bundle:3:in `<main>'
Tasks: TOP => gitlab:backup:repo:restore
(See full trace by running task with --trace)
command terminated with exit code 1
```
## Steps to reproduce

It would be difficult for you to fully implement NetApp or similar dynamic storage provisioning, but the behaviour could be simulated by adding a few root-owned `.snapshot` directories to the repository filesystems.
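For a rough simulation of the NetApp behaviour, something like the following could seed a storage root with a snapshot directory the git user cannot remove (hypothetical helper; `REPO_ROOT` is a stand-in for the Gitaly storage mount):

```shell
# Hypothetical simulation of a NetApp .snapshot directory.
# REPO_ROOT stands in for the Gitaly storage mount (e.g. /home/git/repositories).
REPO_ROOT="${REPO_ROOT:-$(mktemp -d)}"
mkdir -p "$REPO_ROOT/.snapshot"
# On a real system, run the following as root so the git user cannot
# move or delete the entry, as is the case with NetApp snapshots:
#   chown root:root "$REPO_ROOT/.snapshot"
#   chmod 555 "$REPO_ROOT/.snapshot"
echo "seeded $REPO_ROOT/.snapshot"
```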
Once that is in place:
- Take a backup using `kubectl exec -it <task-runner-pod-name> -- backup-utility --skip registry`
- Attempt a restore using `kubectl exec <task-runner-pod-name> -it -- backup-utility --restore -t <backup_timestamp>`
## Configuration used

```yaml
global:
  # Add the <REDACTED> CA
  certificates:
    customCAs:
    - secret: <REDACTED>
  # Community Edition
  edition: ce
  time_zone: UTC
  # Set the wildcard domain/LB IP and override some sub-domains (will be used in Helm's ingress creation)
  hosts:
    https: true
    externalIP: 10.232.11.176
    domain: <REDACTED>.com
    gitlab:
      name: git.<REDACTED>.com
      https: true
  ingress:
    enabled: true
    tls:
      enabled: true
      secretName: <REDACTED>-san-bundle
    configureCertmanager: false
    class: ''
  # Disable minio - we will use GCS Storage
  minio:
    enabled: false
  # Email Config
  smtp:
    enabled: true
    address: smtp-gw.<REDACTED>.com
    port: '25'
    authentication: ''
  email:
    from: '<REDACTED>'
    display_name: '<REDACTED>'
    reply_to: 'noreply@<REDACTED>.com'
  # General app settings
  appConfig:
    enableUsagePing: false
    enableImpersonation: false
    defaultCanCreateGroup: true
    usernameChangingEnabled: false
    defaultProjectsFeatures:
      issues: true
      mergeRequests: true
      wiki: true
      snippets: true
      builds: true
    # LDAP Configuration
    ldap:
      servers:
        main:
          label: 'LDAP'
          host: '<REDACTED>.com'
          port: 636
          uid: '<REDACTED>'
          encryption: 'simple_tls'
          ssl_version: 'TLSv1_2'
          ca_file: '/etc/ssl/certs/ca-cert-<REDACTED>-ca-bundle-prod.pem'
          verify_certificates: false
          active_directory: false
          allow_username_or_email_login: false
          block_auto_created_users: false
          base: 'o=<REDACTED>.com'
          user_filter: ''
          attributes:
            username: '<REDACTED>uid'
            email: 'mail'
            name: 'cn'
    # Configure the storage-heavy things to use GCS buckets instead
    lfs:
      bucket: <REDACTED>-gitlab-lfs
      connection:
        secret: gitlab-rails-storage
        key: connection
    artifacts:
      bucket: <REDACTED>-gitlab-artifacts
      connection:
        secret: gitlab-rails-storage
        key: connection
    uploads:
      bucket: <REDACTED>-gitlab-uploads
      connection:
        secret: gitlab-rails-storage
        key: connection
    packages:
      bucket: <REDACTED>-gitlab-packages
      connection:
        secret: gitlab-rails-storage
        key: connection
    backups:
      bucket: <REDACTED>-gitlab-backup
      tmpBucket: <REDACTED>-gitlab-tmpbackup
# Don't install nginx ingress - we already have one
nginx-ingress:
  enabled: false
# Don't install cert-manager - we already have one
certmanager:
  install: false
# Don't install runner as part of chart - will be deployed in namespaces that require it
gitlab-runner:
  install: false
# Don't install prometheus - we already have one
prometheus:
  install: false
# Don't install registry - we already have one
registry:
  enabled: false
# Extra required config for backups to use a GCS bucket
gitlab:
  task-runner:
    backups:
      objectStorage:
        config:
          secret: gitlab-backup-conf
          key: config
  # Disabling the registry isn't global - we must switch it off in unicorn / sidekiq too
  unicorn:
    registry:
      enabled: false
  sidekiq:
    registry:
      enabled: false
```
## Current behavior

Restoration fails and leaves the instance unusable.
## Expected behavior

Restoration should succeed, and if it fails it should leave the instance in a usable state. Perhaps it would be good to add a dry-run mode or simple pre-flight checks to ensure the restore will succeed before anything destructive happens.
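A minimal sketch of what such a pre-flight check could look like (hypothetical helper, not part of `backup-utility`): it scans each storage path for a `.snapshot` entry and refuses to proceed if one is found, so the destructive `prepare_directories` step is never reached:

```shell
# Hypothetical pre-flight check: refuse to restore when a storage path
# contains a .snapshot entry that the rename in prepare_directories
# would trip over. Returns non-zero if any offending entry is found.
preflight_snapshots() {
  rc=0
  for dir in "$@"; do
    if [ -e "$dir/.snapshot" ]; then
      echo "refusing to restore: $dir/.snapshot exists" >&2
      rc=1
    fi
  done
  return $rc
}
```

For example: `preflight_snapshots /home/git/repositories && backup-utility --restore -t <backup_timestamp>`.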
## Versions

- Chart: 1.7.0
- Platform:
  - Self-hosted: IBM Cloud Private
- Kubernetes (`kubectl version`):
  - Client: 1.11.1
  - Server: 1.12.4
- Helm (`helm version`):
  - Client: 2.9.1
  - Server: not sure
## Relevant logs

See the trace in the Summary above.