Offline transfer object storage POC
## What does this MR do and why?
This merge request demonstrates how object storage credentials can be used to build a pre-signed URL to fetch relation files from an external object storage provider. Using pre-signed URLs has a few advantages:
- Fetching compressed relations using pre-signed URLs fits easily within the current direct transfer process because it treats the object storage provider as an external server, similar to how source instances are handled in direct transfer.
- Using pre-signed URLs allows us to run validations on the URL and treat the external object storage location as an untrusted source (a rough sketch of this follows below).
- Generating a URL allows us to make requests to object storage using our own adapters rather than relying on Fog to make the requests, while still using Fog so customers can use their preferred provider.
However, it's unclear what the implications of this approach may be with regard to efficiency, security, impact on the offline transfer feature in general, etc. This MR is intended to solicit feedback on how we can fetch relations from object storage and to get a clearer idea of how this can be technically implemented.
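To illustrate the second bullet above, here is a minimal sketch of the kind of validation that could run against a generated URL before it's fetched. It assumes the URL blocker from the `gitlab-http` gem (`Gitlab::HTTP_V2::UrlBlocker`) and a hypothetical `resource_url` helper on the client; it is not code from this MR, and the options shown are only illustrative:

```ruby
# Hypothetical sketch: validate a pre-signed URL as untrusted input before fetching it.
# `object_storage_client` and `file_key` are placeholders; the options allowing
# localhost/local network only make sense for local GDK testing.
presigned_url = object_storage_client.resource_url(file_key)

Gitlab::HTTP_V2::UrlBlocker.validate!(
  presigned_url,
  schemes: %w[http https],   # MinIO in GDK serves plain HTTP
  allow_localhost: true,     # local testing only
  allow_local_network: true  # local testing only
)
```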
## Object storage flow
A user provides credentials to access their object storage provider where relation files can be stored. Those credentials are passed to `Fog::Storage` to generate pre-signed URLs to GET or PUT relation files. Offline migrations then use those URLs to interact with an object storage location instead of another GitLab instance, which may not be accessible from the source or destination.
Object storage will need to be accessible to the destination GitLab instance within the organization's network policies. The object storage policies must also be configured to allow pre-signed URLs with the given credentials.
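For reference, here is a minimal sketch (not this MR's client) of how Fog can build pre-signed GET and PUT URLs, assuming `fog-aws` and its `get_object_url`/`put_object_url` helpers. The bucket, key, and expiry values are placeholders, and the credentials match the GDK MinIO setup described later in this description:

```ruby
require 'fog/aws'

# Placeholder credentials matching the local GDK MinIO configuration.
connection = Fog::Storage.new(
  provider: 'AWS',
  aws_access_key_id: 'minio',
  aws_secret_access_key: 'gdk-minio',
  region: 'gdk',
  endpoint: 'http://127.0.0.1:9000',
  path_style: true
)

expires_at = Time.now.utc + 15 * 60 # arbitrary 15 minute expiry

# Pre-signed URLs for downloading and uploading a relation file.
download_url = connection.get_object_url('import-objects', 'export_xyz/group_24/labels.ndjson.gz', expires_at)
upload_url = connection.put_object_url('import-objects', 'export_xyz/group_24/labels.ndjson.gz', expires_at)
```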
```mermaid
sequenceDiagram
    participant PIPE as Pipeline
    participant DS as OfflineFileDownloadService
    participant CLIENT as BulkImports::Clients::ObjectStorage
    participant OS as Object Storage (Fog compatible, S3 shown in example)
    PIPE->>DS: new(configuration, file_key, filename, tmpdir).execute
    Note over DS: As part of the extract phase, NdjsonExtractor uses OfflineFileDownloadService<br/>if performing an offline migration.
    DS->>CLIENT: new(bucket, credentials)
    Note over CLIENT: Creates client with<br/>provider-specific settings
    DS->>CLIENT: stream(file_key)
    CLIENT->>CLIENT: resource_url(file_key)
    Note over CLIENT: Generates presigned URL to download file at file_key
    CLIENT->>OS: GET request with presigned URL
    OS-->>CLIENT: Stream file content
    CLIENT-->>DS: Yield file content stream
    Note over DS: Writes file contents to filename in tmpdir
    DS-->>PIPE: Return filename
    Note over PIPE: Decompress downloaded file and continue ETL as normal
```
Duo was used to help generate this diagram
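As a rough illustration of the download step in the diagram (not the MR's implementation, which would go through our own HTTP adapters), streaming a pre-signed URL to a file under `tmpdir` could look like this; `presigned_url`, `filename`, and `tmpdir` are placeholders:

```ruby
require 'net/http'

uri = URI(presigned_url)
filepath = File.join(tmpdir, filename)

Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    raise "Download failed with status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

    # Stream the body in chunks so large relation files aren't loaded into memory.
    File.open(filepath, 'wb') do |file|
      response.read_body { |chunk| file.write(chunk) }
    end
  end
end
```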
## What's not in this MR
- Support for reading from files on disk - I'd really appreciate others' input on this
- An offline export tool or backend changes in GitLab to export relations directly to object storage
- A full import structure to read all relations from object storage
- DRY code. `BulkImports::OfflineFileDownloadService` and `BulkImports::Clients::ObjectStorage` share a lot of code with their online counterparts, `BulkImports::FileDownloadService` and `BulkImports::Clients::Http` respectively. Much of this code could be extracted to shared modules, but for the sake of time and clarity in this MR, I haven't done that yet.
## References

- Offline transfer architecture design document (WIP)
- Support semi-automated migrations between offli... (&8985)
- Offline migrations: investigate object storage ... (#525564 - closed)
- Fog documentation
- GitLab object storage current documentation
- Configure MinIO object storage for GDK
## How to set up and validate locally

This is a bit tedious to test locally because there's no script or method to upload exported relations from the source to object storage yet. However, it can be done with API requests and in the console. The steps below export and import one group and all of its subgroups and projects, with the tree relation exports imported from object storage:
- Configure an object storage provider locally. Configuring GDK to use MinIO is simplest. If using MinIO configured by GDK, you will need to manually set the region in MinIO to `gdk`. Requests to pre-signed URLs are refused when the region doesn't match.
- Ensure a bucket exists to upload relation export files to. I recommend creating a new bucket (e.g. `import-objects`), but it can be a bucket already used by GitLab.
- Pick a group to export. I used `Gitlab Org` from GDK's seed in this example.
- On the source instance, export relations for all portables in the group (the top-level group, its subgroups, and all of its descendant projects) using the `POST /groups/:id/export_relations` and `POST /projects/:id/export_relations` endpoints. The source and destination instances can be the same GDK instance.
- Allow a minute or two for the relations to export. Use `GET /groups/:id/export_relations/status` or `GET /projects/:id/export_relations/status` to check the status, or query each portable's `bulk_import_exports` in the console to check their statuses.
- Once the exports have finished, upload the relation export files to object storage using the Rails console snippet below to show how `BulkImports::Clients::ObjectStorage#upload_stream` would work.
```ruby
bucket = 'import-objects' # whatever bucket you want to upload to
credentials = {
  provider: 'AWS',
  aws_access_key_id: 'minio', # default access key for gdk
  aws_secret_access_key: 'gdk-minio', # default secret key for gdk
  region: 'gdk',
  endpoint: 'http://127.0.0.1:9000', # default endpoint for gdk MinIO configuration
  path_style: true # this must be true for MinIO
}
object_storage_client = BulkImports::Clients::ObjectStorage.new(bucket, credentials)

export_key = "export_#{DateTime.current}"
group_portable_ids = []   # fill in with the group ids that were exported using export_relations
project_portable_ids = [] # fill in with the project ids that were exported using export_relations
portable_ids_by_type = { group: group_portable_ids, project: project_portable_ids }

portable_ids_by_type.each do |portable_type, portable_ids|
  portables = portable_type == :group ? Group.where(id: portable_ids) : Project.where(id: portable_ids)

  portables.each do |portable|
    portable.bulk_import_exports.each do |export|
      # Skip non-tree relations since that hasn't been handled on import
      next if BulkImports::FileTransfer.config_for(portable).file_relation?(export.relation)

      file_key_base = "#{export_key}/#{portable_type}_#{portable.id}/#{export.relation}"

      if export.batched?
        export.batches.each do |batch|
          file_key = "#{file_key_base}/batch_#{batch.batch_number}.ndjson.gz"
          object_storage_client.upload_stream(batch.upload.export_file, file_key)
        end
      else
        file_key = "#{file_key_base}.ndjson.gz"
        object_storage_client.upload_stream(export.upload.export_file, file_key)
      end
    end
  end
end
```
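As an optional sanity check (not part of this MR), you can list the keys in the bucket with the same Fog credentials to confirm the export files were uploaded:

```ruby
# List uploaded keys to confirm the relation export files landed in the bucket.
# Reuses the `bucket` and `credentials` variables from the snippet above.
connection = Fog::Storage.new(credentials)
directory = connection.directories.get(bucket)
directory.files.each { |file| puts file.key }
```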
- (Optional) On the source instance, update some records that have already been exported and uploaded to object storage. These updates won't appear on the destination, adding a little extra proof that the relations were taken from object storage.
- Begin a direct transfer using `POST /bulk_imports/offline`. Right now, this runs a partially offline migration by fetching tree relations (pipelines that use `NdjsonExtractor`) from object storage using the credentials provided. For now, an `offline_entities_mapping` hash needs to be passed via the API. In the full implementation, this will be its own file on object storage that gets read and stored to `BulkImport::Configuration`.
```shell
curl --request POST --header "PRIVATE-TOKEN: <your private token>" --header "Content-Type: application/json" \
  --url "http://gdk.test:3000/api/v4/bulk_imports/offline" \
  --data '{
    "s3_configuration": {
      "aws_access_key_id": "minio",
      "aws_secret_access_key": "gdk-minio",
      "region": "gdk",
      "endpoint": "http://127.0.0.1:9000",
      "path_style": true
    },
    "configuration": {
      "bucket": "import-objects",
      "export_prefix": "<export prefix you used above>",
      "offline_entities_mapping": {
        "source/full/path": "<portable_type>_<portable_id>",
        "gitlab-org": "group_24",
        "gitlab-org/gitlab-test": "project_2",
        "gitlab-org/gitlab-shell": "project_3",
        "gitlab-org/org-gitlab-subgroup": "group_102"
      },
      "url": "http://gdk.test:3000",
      "access_token": "<your private token>"
    },
    "entities": [
      {
        "source_full_path": "gitlab-org",
        "source_type": "group_entity",
        "destination_slug": "gitlab-org-semi-offline-transfer",
        "destination_namespace": "<namespace of your choice>"
      }
    ]
  }'
```
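Once the request is accepted, a quick way to see whether entities are progressing (not part of this MR) is from the destination's Rails console. The raw status columns are shown here because the exact state values come from the existing BulkImports state machines:

```ruby
# Inspect the most recent migration and its entities from the Rails console.
bulk_import = BulkImport.last
puts bulk_import.attributes.slice('id', 'status', 'source_type')

# Raw status integers per entity; their meaning comes from the entity state machine.
puts bulk_import.entities.pluck(:source_full_path, :status).inspect
```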
## MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.