Offline transfer object storage POC
## What does this MR do and why?
This merge request demonstrates how object storage credentials can be used to build a pre-signed URL to fetch relation files from an external object storage provider. Using pre-signed URLs has a few advantages:
- Fetching compressed relations using pre-signed URLs fits easily within the current direct transfer process because it treats the object storage provider as an external server, similar to how source instances are handled in direct transfer.
- Using pre-signed URLs allows us to run validations on the URL and treat the external object storage location as an untrusted source (a rough sketch of this follows below).
- Generating a URL allows us to make requests to object storage using our own adapters rather than relying on Fog to make the requests, while still using Fog so customers can use their preferred provider.
However, it's unclear what the implications of this approach may be with regard to efficiency, security, impact on the offline transfer feature in general, etc. This MR is intended to solicit feedback on how we can fetch relations from object storage and to get a clearer idea of how this can be technically implemented.
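To illustrate the second bullet above, here is a minimal sketch of the kind of validation that could run against a generated URL before it's fetched. It assumes the URL blocker from the `gitlab-http` gem (`Gitlab::HTTP_V2::UrlBlocker`) and a hypothetical `resource_url` helper on the client; it is not code from this MR, and the options shown are only illustrative:

```ruby
# Hypothetical sketch: validate a pre-signed URL as untrusted input before fetching it.
# `object_storage_client` and `file_key` are placeholders; the options allowing
# localhost/local network only make sense for local GDK testing.
presigned_url = object_storage_client.resource_url(file_key)

Gitlab::HTTP_V2::UrlBlocker.validate!(
  presigned_url,
  schemes: %w[http https],   # MinIO in GDK serves plain HTTP
  allow_localhost: true,     # local testing only
  allow_local_network: true  # local testing only
)
```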
## Object storage flow
A user provides credentials to access their object storage provider where relation files can be stored. Those credentials are passed to `Fog::Storage` to generate pre-signed URLs to GET or PUT relation files. Offline migrations then use those URLs to interact with an object storage location instead of another GitLab instance, which may not be accessible from the source or destination.
Object storage will need to be accessible to the destination GitLab instance within the organization's network policies. The object storage policies must also be configured to allow pre-signed URLs with the given credentials.
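For reference, here is a minimal sketch (not this MR's client) of how Fog can build pre-signed GET and PUT URLs, assuming `fog-aws` and its `get_object_url`/`put_object_url` helpers. The bucket, key, and expiry values are placeholders, and the credentials match the GDK MinIO setup described later in this description:

```ruby
require 'fog/aws'

# Placeholder credentials matching the local GDK MinIO configuration.
connection = Fog::Storage.new(
  provider: 'AWS',
  aws_access_key_id: 'minio',
  aws_secret_access_key: 'gdk-minio',
  region: 'gdk',
  endpoint: 'http://127.0.0.1:9000',
  path_style: true
)

expires_at = Time.now.utc + 15 * 60 # arbitrary 15 minute expiry

# Pre-signed URLs for downloading and uploading a relation file.
download_url = connection.get_object_url('import-objects', 'export_xyz/group_24/labels.ndjson.gz', expires_at)
upload_url = connection.put_object_url('import-objects', 'export_xyz/group_24/labels.ndjson.gz', expires_at)
```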
```mermaid
sequenceDiagram
    participant PIPE as Pipeline
    participant DS as OfflineFileDownloadService
    participant CLIENT as BulkImports::Clients::ObjectStorage
    participant OS as Object Storage (Fog compatible, S3 shown in example)
    PIPE->>DS: new(configuration, file_key, filename, tmpdir).execute
    Note over DS: As part of the extract phase, NdjsonExtractor uses OfflineFileDownloadService<br/>if performing an offline migration.
    DS->>CLIENT: new(bucket, credentials)
    Note over CLIENT: Creates client with<br/>provider-specific settings
    DS->>CLIENT: stream(file_key)
    CLIENT->>CLIENT: resource_url(file_key)
    Note over CLIENT: Generates presigned URL to download file at file_key
    CLIENT->>OS: GET request with presigned URL
    OS-->>CLIENT: Stream file content
    CLIENT-->>DS: Yield file content stream
    Note over DS: Writes file contents to filename in tmpdir
    DS-->>PIPE: Return filename
    Note over PIPE: Decompress downloaded file and continue ETL as normal
```
Duo was used to help generate this diagram
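As a rough illustration of the download step in the diagram (not the MR's implementation, which would go through our own HTTP adapters), streaming a pre-signed URL to a file under `tmpdir` could look like this; `presigned_url`, `filename`, and `tmpdir` are placeholders:

```ruby
require 'net/http'

uri = URI(presigned_url)
filepath = File.join(tmpdir, filename)

Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    raise "Download failed with status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

    # Stream the body in chunks so large relation files aren't loaded into memory.
    File.open(filepath, 'wb') do |file|
      response.read_body { |chunk| file.write(chunk) }
    end
  end
end
```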
## What's not in this MR
- Support for reading from files on disk - I'd really appreciate others' input on this
- An offline export tool or backend changes in GitLab to export relations directly to object storage
- A full import structure to read all relations from object storage
- DRY code. `BulkImports::OfflineFileDownloadService` and `BulkImports::Clients::ObjectStorage` share a lot of code with their online counterparts, `BulkImports::FileDownloadService` and `BulkImports::Clients::Http` respectively. Much of this code could be extracted to shared modules, but for the sake of time and clarity in this MR, I haven't done that yet.
## References

- Offline transfer architecture design document (WIP)
- Support semi-automated migrations between offli... (&8985)
- Offline migrations: investigate object storage ... (#525564 - closed)
- Fog documentation
- GitLab object storage current documentation
- Configure MinIO object storage for GDK
## How to set up and validate locally

This is a bit tedious to test locally because there's no script or method to upload exported relations from the source to object storage yet. However, it can be done with API requests and in the console. The steps below export and import one group and all of its subgroups and projects, with the tree relation exports imported from object storage:
- Configure an object storage provider locally. Configuring GDK to use MinIO is simplest. If using MinIO configured by GDK, you will need to manually set the region in MinIO to `gdk`. Requests to pre-signed URLs are refused when the region doesn't match.
- Ensure a bucket exists to upload relation export files to. I recommend creating a new bucket (e.g. `import-objects`), but it can be a bucket already used by GitLab.
- Pick a group to export. I used `Gitlab Org` from GDK's seed in this example.
- On the source instance, export relations for all portables in the group (the top-level group, its subgroups, and all of its descendant projects) using the `POST /groups/:id/export_relations` and `POST /projects/:id/export_relations` endpoints. The source and destination instances can be the same GDK instance.
- Allow a minute or two for the relations to export. Use `GET /groups/:id/export_relations/status` or `GET /projects/:id/export_relations/status` to check the status, or query each portable's `bulk_import_exports` in the console to check their statuses.
- Once the exports have finished, upload the relation export files to object storage using the Rails console snippet below to show how `BulkImports::Clients::ObjectStorage#upload_stream` would work.
```ruby
bucket = 'import-objects' # whatever bucket you want to upload to
credentials = {
  provider: 'AWS',
  aws_access_key_id: 'minio', # default access key for gdk
  aws_secret_access_key: 'gdk-minio', # default secret key for gdk
  region: 'gdk',
  endpoint: 'http://127.0.0.1:9000', # default endpoint for gdk MinIO configuration
  path_style: true # this must be true for MinIO
}
object_storage_client = BulkImports::Clients::ObjectStorage.new(bucket, credentials)

export_key = "export_#{DateTime.current}"
group_portable_ids = []   # fill in with the group ids that were exported using export_relations
project_portable_ids = [] # fill in with the project ids that were exported using export_relations
portable_ids_by_type = { group: group_portable_ids, project: project_portable_ids }

portable_ids_by_type.each do |portable_type, portable_ids|
  portables = portable_type == :group ? Group.where(id: portable_ids) : Project.where(id: portable_ids)

  portables.each do |portable|
    portable.bulk_import_exports.each do |export|
      # Skip non-tree relations since that hasn't been handled on import
      next if BulkImports::FileTransfer.config_for(portable).file_relation?(export.relation)

      file_key_base = "#{export_key}/#{portable_type}_#{portable.id}/#{export.relation}"

      if export.batched?
        export.batches.each do |batch|
          file_key = "#{file_key_base}/batch_#{batch.batch_number}.ndjson.gz"
          object_storage_client.upload_stream(batch.upload.export_file, file_key)
        end
      else
        file_key = "#{file_key_base}.ndjson.gz"
        object_storage_client.upload_stream(export.upload.export_file, file_key)
      end
    end
  end
end
```
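As an optional sanity check (not part of this MR), you can list the keys in the bucket with the same Fog credentials to confirm the export files were uploaded:

```ruby
# List uploaded keys to confirm the relation export files landed in the bucket.
# Reuses the `bucket` and `credentials` variables from the snippet above.
connection = Fog::Storage.new(credentials)
directory = connection.directories.get(bucket)
directory.files.each { |file| puts file.key }
```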
- (Optional) On the source instance, update some records that have already been exported and uploaded to object storage. These updates won't appear on the destination, adding a little extra proof that the relations were taken from object storage.
- Begin a direct transfer using `POST /bulk_imports/offline`. Right now, this runs a partially offline migration by fetching tree relations (pipelines that use `NdjsonExtractor`) from object storage using the credentials provided. For now, an `offline_entities_mapping` hash needs to be passed via the API. In the full implementation, this will be its own file on object storage that gets read and stored to `BulkImport::Configuration`.
```shell
curl --request POST --header "PRIVATE-TOKEN: <your private token>" --header "Content-Type: application/json" \
  --url "http://gdk.test:3000/api/v4/bulk_imports/offline" \
  --data '{
    "s3_configuration": {
      "aws_access_key_id": "minio",
      "aws_secret_access_key": "gdk-minio",
      "region": "gdk",
      "endpoint": "http://127.0.0.1:9000",
      "path_style": true
    },
    "configuration": {
      "bucket": "import-objects",
      "export_prefix": "<export prefix you used above>",
      "offline_entities_mapping": {
        "source/full/path": "<portable_type>_<portable_id>",
        "gitlab-org": "group_24",
        "gitlab-org/gitlab-test": "project_2",
        "gitlab-org/gitlab-shell": "project_3",
        "gitlab-org/org-gitlab-subgroup": "group_102"
      },
      "url": "http://gdk.test:3000",
      "access_token": "<your private token>"
    },
    "entities": [
      {
        "source_full_path": "gitlab-org",
        "source_type": "group_entity",
        "destination_slug": "gitlab-org-semi-offline-transfer",
        "destination_namespace": "<namespace of your choice>"
      }
    ]
  }'
```
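Once the request is accepted, a quick way to see whether entities are progressing (not part of this MR) is from the destination's Rails console. The raw status columns are shown here because the exact state values come from the existing BulkImports state machines:

```ruby
# Inspect the most recent migration and its entities from the Rails console.
bulk_import = BulkImport.last
puts bulk_import.attributes.slice('id', 'status', 'source_type')

# Raw status integers per entity; their meaning comes from the entity state machine.
puts bulk_import.entities.pluck(:source_full_path, :status).inspect
```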
## MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.