Restore fails due to snapshot directories and leaves instance unusable
## Summary

While testing backup/restore we ran into an issue where the restore fails due to the presence of `.snapshot` directories (present in both the backup and the directory it is restoring to).

We use NetApp via Trident for storage provisioning, and the snapshots are created as part of the NetApp solution. As a client application, GitLab cannot modify or remove them due to permissions.

The worst part, of course, is that the restore fails part way through and leaves the instance unusable (the web GUI works, but you cannot commit or make any configuration changes). Fortunately this is a test instance, and this is why we test before release! :)
The restore fails at this point:

```plaintext
...
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
GRANT
[DONE]
2019-03-24 12:31:13 +0000 -- done
WARNING: This version of GitLab depends on gitlab-shell 8.7.1, but you're running Unknown. Please update gitlab-shell.
2019-03-24 12:31:28 +0000 -- Restoring repositories ...
rake aborted!
GRPC::Internal: 13:move dir: rename /home/git/repositories/.snapshot /home/git/repositories/+gitaly/tmp/default-repositories.old.1553430688.087008660/.snapshot: file exists
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:31:in `check_status'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:181:in `attach_status_results_and_complete_call'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/active_call.rb:377:in `request_response'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:178:in `block in request_response'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/interceptors.rb:170:in `intercept!'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/client_stub.rb:177:in `request_response'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/grpc-1.15.0-x86_64-linux/src/ruby/lib/grpc/generic/service.rb:170:in `block (3 levels) in rpc_stub_class'
/srv/gitlab/lib/gitlab/gitaly_client.rb:166:in `call'
/srv/gitlab/lib/gitlab/gitaly_client/storage_service.rb:21:in `delete_all_repositories'
/srv/gitlab/lib/backup/repository.rb:46:in `block in prepare_directories'
/srv/gitlab/lib/backup/repository.rb:45:in `each'
/srv/gitlab/lib/backup/repository.rb:45:in `prepare_directories'
/srv/gitlab/lib/backup/repository.rb:78:in `restore'
/srv/gitlab/lib/tasks/gitlab/backup.rake:87:in `block (4 levels) in <top (required)>'
/srv/gitlab/vendor/bundle/ruby/2.5.0/gems/rake-12.3.2/exe/rake:27:in `<top (required)>'
/srv/gitlab/bin/bundle:3:in `load'
/srv/gitlab/bin/bundle:3:in `<main>'
Tasks: TOP => gitlab:backup:repo:restore
(See full trace by running task with --trace)
command terminated with exit code 1
```
## Steps to reproduce

It would be difficult for you to fully implement NetApp or similar dynamic storage provisioning, but the behaviour could be simulated by adding a few root-owned `.snapshot` directories to the repository filesystems.
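For a rough simulation of the NetApp behaviour, something like the following could seed a storage root with a snapshot directory the git user cannot remove (hypothetical helper; `REPO_ROOT` is a stand-in for the Gitaly storage mount):

```shell
# Hypothetical simulation of a NetApp .snapshot directory.
# REPO_ROOT stands in for the Gitaly storage mount (e.g. /home/git/repositories).
REPO_ROOT="${REPO_ROOT:-$(mktemp -d)}"
mkdir -p "$REPO_ROOT/.snapshot"
# On a real system, run the following as root so the git user cannot
# move or delete the entry, as is the case with NetApp snapshots:
#   chown root:root "$REPO_ROOT/.snapshot"
#   chmod 555 "$REPO_ROOT/.snapshot"
echo "seeded $REPO_ROOT/.snapshot"
```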
Once that is in place:
- Take a backup using `kubectl exec -it <task-runner-pod-name> -- backup-utility --skip registry`
- Attempt a restore using `kubectl exec <task-runner-pod-name> -it -- backup-utility --restore -t <backup_timestamp>`
## Configuration used

```yaml
global:
  # Add the <REDACTED> CA
  certificates:
    customCAs:
    - secret: <REDACTED>
  # Community Edition
  edition: ce
  time_zone: UTC
  # Set the wildcard domain/LB IP and override some sub-domains (will be used in Helm's ingress creation)
  hosts:
    https: true
    externalIP: 10.232.11.176
    domain: <REDACTED>.com
    gitlab:
      name: git.<REDACTED>.com
      https: true
  ingress:
    enabled: true
    tls:
      enabled: true
      secretName: <REDACTED>-san-bundle
    configureCertmanager: false
    class: ''
  # Disable minio - we will use GCS Storage
  minio:
    enabled: false
  # Email Config
  smtp:
    enabled: true
    address: smtp-gw.<REDACTED>.com
    port: '25'
    authentication: ''
  email:
    from: '<REDACTED>'
    display_name: '<REDACTED>'
    reply_to: 'noreply@<REDACTED>.com'
  # General app settings
  appConfig:
    enableUsagePing: false
    enableImpersonation: false
    defaultCanCreateGroup: true
    usernameChangingEnabled: false
    defaultProjectsFeatures:
      issues: true
      mergeRequests: true
      wiki: true
      snippets: true
      builds: true
    # LDAP Configuration
    ldap:
      servers:
        main:
          label: 'LDAP'
          host: '<REDACTED>.com'
          port: 636
          uid: '<REDACTED>'
          encryption: 'simple_tls'
          ssl_version: 'TLSv1_2'
          ca_file: '/etc/ssl/certs/ca-cert-<REDACTED>-ca-bundle-prod.pem'
          verify_certificates: false
          active_directory: false
          allow_username_or_email_login: false
          block_auto_created_users: false
          base: 'o=<REDACTED>.com'
          user_filter: ''
          attributes:
            username: '<REDACTED>uid'
            email: 'mail'
            name: 'cn'
    # Configure the storage-heavy things to use GCS buckets instead
    lfs:
      bucket: <REDACTED>-gitlab-lfs
      connection:
        secret: gitlab-rails-storage
        key: connection
    artifacts:
      bucket: <REDACTED>-gitlab-artifacts
      connection:
        secret: gitlab-rails-storage
        key: connection
    uploads:
      bucket: <REDACTED>-gitlab-uploads
      connection:
        secret: gitlab-rails-storage
        key: connection
    packages:
      bucket: <REDACTED>-gitlab-packages
      connection:
        secret: gitlab-rails-storage
        key: connection
    backups:
      bucket: <REDACTED>-gitlab-backup
      tmpBucket: <REDACTED>-gitlab-tmpbackup
# Don't install nginx ingress - we already have one
nginx-ingress:
  enabled: false
# Don't install cert-manager - we already have one
certmanager:
  install: false
# Don't install runner as part of chart - will be deployed in namespaces that require it
gitlab-runner:
  install: false
# Don't install prometheus - we already have one
prometheus:
  install: false
# Don't install registry - we already have one
registry:
  enabled: false
# Extra required config for backups to use a GCS bucket
gitlab:
  task-runner:
    backups:
      objectStorage:
        config:
          secret: gitlab-backup-conf
          key: config
  # Disabling the registry isn't global - we must switch it off in unicorn / sidekiq too
  unicorn:
    registry:
      enabled: false
  sidekiq:
    registry:
      enabled: false
```
## Current behavior

Restoration fails and leaves the instance unusable.
## Expected behavior

Restoration should succeed, and if it fails it should leave the instance in a usable state. Perhaps it would be good to add a dry-run mode or simple pre-flight checks to ensure the restore will succeed before anything destructive happens.
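A minimal sketch of what such a pre-flight check could look like (hypothetical helper, not part of `backup-utility`): it scans each storage path for a `.snapshot` entry and refuses to proceed if one is found, so the destructive `prepare_directories` step is never reached:

```shell
# Hypothetical pre-flight check: refuse to restore when a storage path
# contains a .snapshot entry that the rename in prepare_directories
# would trip over. Returns non-zero if any offending entry is found.
preflight_snapshots() {
  rc=0
  for dir in "$@"; do
    if [ -e "$dir/.snapshot" ]; then
      echo "refusing to restore: $dir/.snapshot exists" >&2
      rc=1
    fi
  done
  return $rc
}
```

For example: `preflight_snapshots /home/git/repositories && backup-utility --restore -t <backup_timestamp>`.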
## Versions

- Chart: 1.7.0
- Platform:
  - Self-hosted: IBM Cloud Private
- Kubernetes (`kubectl version`):
  - Client: 1.11.1
  - Server: 1.12.4
- Helm (`helm version`):
  - Client: 2.9.1
  - Server: not sure
## Relevant logs

See the trace in the Summary above.