Importing packages from Artifactory to GitLab. We often hear from customers that they would like an easy migration path and Artifactory comes up more often than Sonatype.
Outcome expected
The outcome of this investigation could be empowering professional services with a methodology or plan to help customers migrate. Then, once they've done that for 5+ customers, we can start to think about integrating it into the product.
Tasks to Evaluate
Determine feasibility of the feature
Create issue for implementation or update existing implementation issue description with implementation proposal
Set weight on implementation issue
If weight is greater than 5, break issue into smaller issues
Risks and Implementation Considerations
To import a registry, we would optimally leverage APIs available in the other registry systems:
Find the 3rd party registry on a list of registries we support import from.
Once selected, they would provide credentials for logging into the private registry.
Once logged in there may be a variety of options:
Import all packages to project
Select packages to import to project
Note: It might be best to wait until we have virtual repositories for this type of import feature because importing to a virtual repository may be what users want rather than being forced to import to specific projects.
Once selected, the import would run and the user would begin to see packages populated
It would be nice if there was some sort of progress indicator
From the tech perspective:
We need to authenticate via API somehow
We need an API endpoint to fetch a list of packages available for import
We need to take a list of those packages, fetch each, and for each package map a set of attributes to the corresponding tables in GitLab, creating each record as the package is stored.
We need to optionally kick off background jobs to fill in any metadata.
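To make this flow concrete, here is a minimal sketch of how those pieces could fit together. Every class and method name here (ArtifactoryClient, PackageImportService, PackageMetadataWorker) is a hypothetical placeholder, not existing GitLab code:

```ruby
# Hypothetical orchestration sketch -- every name here is an assumption.
class PackageImportService
  def initialize(base_url:, token:, project:)
    # 1. Authenticate via API: the client wraps the credentials for every call.
    @client  = ArtifactoryClient.new(base_url: base_url, token: token)
    @project = project
  end

  def execute
    # 2. Fetch the list of packages available for import.
    @client.each_package do |remote_package|
      # 3. Map the remote attributes to the corresponding GitLab tables,
      #    creating each record as the package is stored.
      package = @project.packages.create!(
        name:         remote_package[:name],
        version:      remote_package[:version],
        package_type: remote_package[:type]
      )

      # 4. Optionally kick off background jobs to fill in any metadata.
      PackageMetadataWorker.perform_async(package.id)
    end
  end
end
```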
Other ideas
We could set an option up where once connected to the registry, GitLab could "sync" with the registry, essentially creating a mirror, by a user setting a time period (sync daily/weekly/monthly).
Throwing a wild idea: could we use a package registry proxy "cache" to handle this import?
Let's say that a user has this configuration: user -> Artifactory, we update it to: user -> GitLab -> Artifactory.
GitLab acts as a proxy with some caching but on top of that we could have "rules" on what cache entries should be moved to a specific project. For example, we could have:
Maven.Package.A -> goes to Project A
Maven.Package.B -> goes to Project B
As users pull packages from the GitLab package registry, Project A and Project B are "filled" or imported.
Downside: this would only import packages that are pulled through GitLab.
Upside: the import is distributed among all users using the GitLab proxy.
Can you read through and contribute to the conversation?
Can you add what your thoughts are and how many packages you have?
Will this work for you?
List specifically what you're looking for GitLab's help on.
Chef - You had mentioned you had Jenkins jobs and chef run-books that will need to be updated. Regarding updating the run-books, what will this entail?
We would like to add this to GL 14.0. If we have these questions answered, @trizzi, @sabrams and his team can join a call soon and we can brainstorm on how we can help you resolve this.
A large Enterprise customer is interested in this feature
Why interested - Security is the primary concern. The customer has upwards of 20 tools to integrate into the pipeline in order to be compliant. Some of the tools in GitLab are not sufficient. So the challenge is really about figuring out how to integrate the tools into the pipeline in order to remain compliant.
@katiemacoy @michelletorres I wanted to tag you on this issue as it's something that has been coming up recently in Sales and early customer conversations. People want to move to GitLab but there is no easy way to migrate off of Artifactory. This issue is for the technical investigation, but once we know what's possible it would be great to have a front end for this as well.
If we can have a really easy migration path, I think this could significantly increase our user base. We could even start with just one format, Maven or npm.
We don't have the capacity to work on this right now, but I'm going to ambitiously schedule this for 15.3.
@trizzi thanks for the tag. Do you think this needs problem validation? I can totally see the usefulness of this feature but do we need to understand more of the details - what do they need to import, who's doing it, etc?
@katiemacoy I think that we should conduct a technical investigation and understand what's possible, then do some user research with some more grounded questions. I'm highly confident this is a problem, but once we know what's feasible we can do some user research
@trizzi I agree, this seems like a problem; any problem validation would be to discover the details of the problem. An approach could be to do problem validation independent of technical feasibility. We can first understand what users want and then figure out if it's possible to reach that vision. We can use solution validation to validate a more grounded/technically feasible approach.
I love ambitious goals! If in 15.3 we do the proper investigation to understand the feasibility and implementation plan, then we can increase customers by the end of the year!
That makes sense to me @10io. I'll update the issue description accordingly. My preference is to start with Artifactory, since it comes up much more often during customer calls.
@trizzi @sabrams Is it known which permission the import user would need? I'm assuming Developer, as from that role one can publish packages, but I would like to confirm.
@pprokic this is a detail we still need to work out. There will be a user that will be configuring the connection to Artifactory, that user may be a higher role than Developer, so then it becomes a question of is that person/role responsible for deciding which packages to import, or can other users import once a connection is established. We may not understand which makes the most sense until we start working through the investigation.
@katiemacoy may also have some thoughts on the required roles.
+1 to what Steve says above. We will need to do some UX research to help determine where in GitLab this functionality will live and which user role can action it.
I should add that a successful investigation may result in empowering GitLab professional services with a plan to help customers through the migration. Then after several iterations and lessons learned, we can move the feature into the application. That's how it's been done with other GitLab importer tools.
@trizzi Is it known whether the investigation is estimated for 15.2 or the feature delivery?
Also, after reading #298726 (comment 878845639) is it going to be a single package type import feature for MVC? Will it definitely be from Artifactory?
I've weighted as a 2 since this is purely an investigation it should be timeboxed. The outcome should be a set of issues within an epic and an implementation plan listing out any remaining unknowns. If the investigation is going to take longer, we should take a closer look at why.
@trizzi @michelletorres I think this will slip from %15.2. We are in the last few days of %15.2 development and while I hope to pick it up still, I don't think there will be enough time to complete a full investigation and implementation plan.
@trizzi - there are a few customers asking for this that GitLab Professional Services are engaged with. I agree with the short-term solution of providing the PS team with scripts to facilitate the data migration. Based on the container registry API and package API, I'm not sure we will be able to push these binaries to GitLab, even if we are able to GET them with the Artifactory API. I think the POST/PUT API endpoints to enable pushing of these binaries to GitLab are the highest priority to facilitate the transition of a customer from Artifactory to GitLab.
@bryan-may just wanted to let you know that Tim is on parental leave for the next 2 months, but we have this investigation happening in this milestone. Hopefully by the end of the milestone we'll have a clearer picture of what's possible.
This could build the base of a "general" import feature where we could import packages from elsewhere than Artifactory.
It would be powered mainly by background jobs.
The amount of work needed is quite chunky (several milestones needed) but an iterative approach is possible.
Context / Scope
Package imports can unlock several features but for this analysis, we're going to build scope specifically for the first iteration (MVC) that will provide the base that we can build upon.
The main goal is to provide a way to import or copy packages from Artifactory to the GitLab Package Registry. To provide a workable scope, here are the requirements we will use:
The import is user initiated. Its execution will be done once (one shot copy).
(Consider all packages existing in Artifactory.)
Only import packages whose formats are supported by GitLab.
Support any Artifactory setup: self-hosted or SaaS (jfrog.io).
Use the provided Artifactory API and/or GraphQL to read information. Never write any information.
Main idea
The main idea is the following:
Users give the base URL of the Artifactory instance and an admin-scoped token.
A background job generates the import plan details. Basically, it builds the list of packages to copy and puts the plan in a ready-to-execute state.
Users review the plan and confirm its execution.
A set of background jobs process (eg. copy) each package to the GitLab Package Registry.
At the end of importing a package, an optional post import step can occur which will be handled by, again, background workers.
As we can see, the whole process is mainly handled by background jobs. Why? The reasoning is the following:
I think that we need an approach that will split the work into small executable steps. It would be madness to run the import in one single giant process.
By having multiple small executable steps, we can better handle the scalability aspect of this feature. For example, we don't want a very large import plan to block or clog the backend. Instead, work steps are available somewhere and background jobs will work off them at their speed which will be adjustable at the instance level.
By splitting in steps, we increase the reliability of the import: a single step failure will not cancel all the subsequent steps.
By splitting in steps, we can present them to the user and we can even think about the possibility of retrying failed steps.
By splitting in steps and using background jobs, we can easily have and control a parallel execution.
Step 1: Create the import plan
Done by: user.
This step is to collect user inputs about the import. We could see this as configuring the import.
The two crucial pieces of information we need here are:
The base URL of the Artifactory setup. It has to be a valid URL.
The admin-scoped token.
Upon save, we can validate those two values by using this endpoint to verify that they are working.
Once persisted, the import plan is available for pick up by background workers for its generation.
We probably need some constraints here to make sure that we don't have 2 identical import plans (eg. targeting the same base url).
Step 2: Generate the plan steps
Done by: background job.
This step is basically creating all the steps necessary for the import plan. Each step will be an import step. Upon execution, a package (name+version+type) will be imported (eg. all the (recent) files for that package name+version).
An import step is basically a "package coordinate":
a package name.
a package version.
a package type.
Note that an import step doesn't go to the package file level (we don't consider a filename). For the first iteration, we made the decision that executing an import step will take care of listing the files and importing them. If that goes out of hand, we could further split the import step execution in as many file imports as necessary. This should be reasonable enough for most package formats except Generic where basically we can have an arbitrary amount of files for a single package.
To generate those, we will need to use the GraphQL API. That API allows us to query the packages directly no matter in which (Artifactory) repository the package exists.
Here is an example of such a query:
```graphql
query {
  packages(filter: { name: "*" }) {
    edges {
      node {
        name
        packageType
        versions {
          name
        }
      }
    }
  }
}
```
What is interesting here is that:
The API is properly paginated. We can thus walk through pages.
Technical detail: working with pages allows us to more easily use batched inserts, which is basically inserting database data in bulk in a single query (instead of n).
The endpoint provides a filter input that can be used to scope the query to specific package types.
Generating import steps could take a while as we could have thousands of versions to generate steps for. As described above, this will be handled by a background job where we can better handle this scale. Having said that, we still need to consider this step as a heavy one. This is why this step will be executed uniquely for a given Artifactory base url (eg. we will not generate multiple plans in parallel for the same base url).
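To make the page walk concrete, here is a rough sketch using a plain HTTP client. The endpoint path, the page size, and the Relay-style cursor arguments are assumptions about the Artifactory GraphQL API and would need to be verified:

```ruby
require 'httparty'
require 'json'

# Sketch only: the endpoint path and the cursor pagination details are assumptions.
PACKAGES_QUERY = <<~GRAPHQL
  query($after: String) {
    packages(filter: { name: "*" }, first: 100, after: $after) {
      edges { node { name packageType versions { name } } }
      pageInfo { hasNextPage endCursor }
    }
  }
GRAPHQL

def each_package_page(base_url, token)
  cursor = nil

  loop do
    response = HTTParty.post(
      "#{base_url}/metadata/api/v1/query", # assumed GraphQL endpoint path
      headers: { 'Authorization' => "Bearer #{token}", 'Content-Type' => 'application/json' },
      body: { query: PACKAGES_QUERY, variables: { after: cursor } }.to_json
    )

    page = response.parsed_response.dig('data', 'packages')
    yield page['edges'].map { |edge| edge['node'] } # one batch of package nodes per page

    break unless page.dig('pageInfo', 'hasNextPage')

    cursor = page.dig('pageInfo', 'endCursor')
  end
end
```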
Step 3: Execute the plan steps
Done by: background jobs.
Executing import steps
This is the smallest operation of the import feature: given a package name, version, type and import configuration (Artifactory url and credentials), the execution here will get all the related package files and import them in the GitLab package registry.
Looking at Artifactory APIs, the most straightforward way to do this operation is:
Get the list of files of a given package, version and type.
Query the storage api to get the download url for each file.
Download each file and import it in the GitLab Package Registry.
For (1.), we can re-use the GraphQL API using filters to scope down results to a single package.
As we can see, the API properly returns an array of files, as some package formats (such as Maven) can need multiple assets (.jar file and .pom file). As a bonus, we also get metadata about the file, such as fingerprints and its size. That information can be imported too. Lastly, this is a paginated result. We could handle an arbitrarily large number of files.
For (2.), we need to get the actual URL that will allow us to download the file. Unfortunately, such a URL is not returned by the GraphQL API endpoint. As such, we will need to use the Rest API for storage. More precisely, we will use the quick search to scope the query to the wanted package file.
As we can see, we can have multiple locations because we could have duplicated files across Artifactory repositories. From our testing, the most recent files go first. In this regard, we will only consider the first URL.
For (3.), things are quite straightforward: given the url, import the given file to the given package (creating it if necessary).
As pointed out above, here we could hit issues regarding the amount of files processed. This is even more true for the Generic Registry, where a single package can have an arbitrary amount of files.
Because the plan is split into steps, the steps are independent of each other. This means that a step can fail and that failure will not impact the outcome of the other steps.
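As an illustration, executing a single import step could look roughly like the sketch below. The client methods are hypothetical and the file creation is simplified compared to GitLab's real upload plumbing:

```ruby
require 'tempfile'

# Hypothetical execution of one import step (package name + version + type).
def execute_import_step(client, package)
  # (1.) List the package files through the GraphQL API, scoped to this single package.
  files = client.package_files(name: package.name,
                               version: package.version,
                               type: package.package_type)

  files.each do |remote_file|
    # (2.) Resolve the download URL through the storage Rest API (quick search).
    #      Only the first location is considered, as it is the most recent one.
    download_url = client.storage_locations(remote_file[:name]).first

    # (3.) Download the file to a temporary location and import it into the
    #      GitLab Package Registry, keeping the metadata returned by the API.
    Tempfile.create(remote_file[:name]) do |tmp|
      client.download(download_url, to: tmp)

      package.package_files.create!(
        file:      tmp,
        file_name: remote_file[:name],
        size:      remote_file[:size],
        file_sha1: remote_file[:sha1] # fingerprint returned alongside the file listing
      )
    end
  end
end
```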
In case of execution failure
I think it would be nice to have, in the first iteration, a way for users to "retry" a specific import step. An import step is a collection of network interactions with other services. A network failure could easily make the whole step execution fail.
We should have an approach similar to what we do with pipelines when only some jobs were executed. Users can:
either retry all failed jobs at once (one single Retry button) or
retry jobs individually.
Post import steps
Import steps will take care of importing files that are part of the package but, depending on the format, we will miss additional auxiliary files. One straightforward example is the maven-metadata.xml file for maven packages. That file is simply an XML file that, given a package name, lists all the available versions. In general, those auxiliary files are "metadata" files that describe the state of the Package registry. On top of that, the GitLab Package Registry can perform package extractions to read additional metadata of the package. It is the case for NuGet for example.
As such after executing the import step, we will need to proceed with a post import step.
Not all package formats need this step at the time of this writing, but I want to point out that currently some metadata endpoints generate their response on the fly and we're thinking of moving this generation to a static file. Formats that move to this metadata generation will need a post import step.
Here is the list of package formats that will need a post import step at the time of this writing. I took the liberty to also add the nature of the post import step.
| Package format | Post import step needed? | What to do? |
| --- | --- | --- |
| Composer | No | |
| Conan | No | |
| Debian | Yes | (re)generate the distribution file. |
| Go Proxy | No | |
| Helm | Yes | generate GitLab package metadata. |
| Maven | Yes | (re)generate GitLab package metadata and (re)generate maven-metadata.xml file(s). |
| NPM | No | |
| NuGet | Yes | (re)generate GitLab package metadata. |
| PyPI | No | |
| Rubygems | Yes | (re)generate GitLab package metadata and (re)generate the Gemspec. |
| Generic | No | |
These post import steps could take time and be complex. As such, they should be handled by a subsequent job enqueued after the execution of an import step. Obviously, a post import step execution job will be enqueued only if the import step execution is successful.
Technical considerations
Below is an early look at technical details that we will need to complete this feature.
None of the names are final; they could change during the actual implementation.
Database
At the database level, we can re-use the packages_packages table. Why? If we look above, we said that the import step is basically a package coordinate. Such coordinate modeling already exists today: the Packages::Package model.
We can re-use that model for import steps. Better said, they will be "packages to be imported". We even have a status column that we can upgrade with statuses for the import: waiting_import, importing, waiting_post_import, post_importing and import_failed.
Now that we know how to save steps of the import plan, we need a place for the import plan itself. For this, we will need a new table (packages_import_plans) with:
project_id.
required.
source_type. (perhaps not needed for the first iteration as it would always be set to :artifactory)
required.
based on an Enum.
source_credential_token. This will hold the admin-scoped token. This attribute MUST be encrypted.
required.
base_url.
required.
validated to be a URL.
state. Helps follow along with what is happening in the plan.
additional columns. We might need additional columns to log what happened, such as failure_reason, generation_time, execution_time.
This is great, but how do we link packages_import_plans and packages_packages? Having a foreign key on packages_packages would be a not-so-great idea as most of the rows there are real, valid, pullable packages that have nothing to do with the import feature.
No, it's much better to reverse the relationship and not touch the structure of packages_packages. For this, a tiny table would do the job (packages_import_plans_packages):
plan_id, the id of the import plan object.
package_id, the id of the package object.
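A rough migration sketch for the two new tables; column types, defaults, and index names are illustrative and not final:

```ruby
# Illustrative Rails migration -- names, types and defaults are not final.
class CreatePackagesImportPlans < ActiveRecord::Migration[6.1]
  def change
    create_table :packages_import_plans do |t|
      t.references :project, null: false, foreign_key: { on_delete: :cascade }
      t.integer :source_type, null: false, default: 0       # enum, only :artifactory for now
      t.text :encrypted_source_credential_token, null: false # admin-scoped token, MUST be encrypted
      t.text :encrypted_source_credential_token_iv
      t.text :base_url, null: false                          # validated to be a URL at the model level
      t.string :state, null: false, default: 'created'
      t.text :failure_reason
      t.timestamps
    end

    # Avoid two identical import plans targeting the same base url.
    add_index :packages_import_plans, [:project_id, :base_url], unique: true

    # Tiny join table linking plans to the packages ("steps") they will import.
    create_table :packages_import_plans_packages do |t|
      t.references :plan, null: false,
                   foreign_key: { to_table: :packages_import_plans, on_delete: :cascade }
      t.references :package, null: false,
                   foreign_key: { to_table: :packages_packages, on_delete: :cascade }
    end
  end
end
```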
Models
If we want to go all in, we could have a state machine for the plan states.
At the model (or service) level, we will need a validation that the <base_url> can be successfully contacted with the <source_credential_token>. Ideally, a plan with non-working values should not be created.
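If we go the state machine route, a minimal model sketch could look like this. The state names mirror the flow described above but are not final, and the connectivity check is shown as a creation validation, which is only one of several possible placements:

```ruby
# Sketch of the plan model -- state names, validators and the client are illustrative.
class Packages::ImportPlan < ApplicationRecord
  self.table_name = 'packages_import_plans'

  belongs_to :project

  validates :base_url, presence: true # plus a URL format validation
  validate :source_reachable, on: :create

  state_machine :state, initial: :created do
    event(:start_generation)  { transition created: :generating }
    event(:finish_generation) { transition generating: :waiting_execution_confirmation }
    event(:confirm_execution) { transition waiting_execution_confirmation: :executing }
    event(:finish_execution)  { transition executing: :executed }
    event(:stop)              { transition any => :stopped_import }
  end

  private

  # Ideally, a plan with non-working values is never created: probe the Artifactory
  # instance with the provided token before saving.
  def source_reachable
    errors.add(:base_url, 'cannot be reached with the given token') unless source_client.reachable?
  end

  def source_client
    ArtifactoryClient.new(base_url: base_url, token: source_credential_token) # hypothetical client
  end
end
```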
Background jobs
This is the very core of the feature, so we will need to have things ready for scalability.
The Plan Builder
Role: Take a plan object and build all the package rows (steps) linked to that plan.
Type: Standard job.
How is it enqueued: When a new plan is saved or by a user action (eg, a button).
Steps:
Put the plan in generating.
Take the plan <base_url> + <source_credential> and contact the GraphQL API to get all packages that need to be imported.
Walk through the page of results and for each package, find or create a package object with the proper name, version and type. That package must be created in status waiting_import.
Create the link between the package and the plan.
Redo (2.) - (3.) for each page.
Put the plan in waiting_execution_confirmation.
It's important that this job shields itself from duplicated executions. Two jobs on the same <base_url> should not run concurrently.
Important note for (2.): find or create. The GitLab Package Registry has hard constraints around Package objects uniqueness. As such, a given project can't have the same package name, version, type multiple times.
For this step, the worker will either find an existing package or create a new one. By re-using an existing package, it will "append" the new package files to the existing ones.
This is actually an existing feature of the GitLab Package Registry: allowing duplicates. When pulling files that are duplicated, the GitLab Package Registry will always return the most recent one.
Another side effect of this is that if an existing package gets "overwritten" by a package imported from Artifactory, that existing package will go through the importing status while we copy files from Artifactory. This status update will make the package temporarily unpullable from the GitLab Package Registry.
Given that this job is INSERT intensive, we should jump on the opportunity to bulk INSERT (eg. inserting many rows in a single SQL statement), as sketched below.
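Here is a sketch of what the Plan Builder could look like, showing the deduplication guard, the page walk, the find-or-create on package coordinates, and the per-page bulk insert of the plan links. The worker concerns (deduplicate, idempotent!) exist in GitLab today; the models and client are the illustrative ones used above:

```ruby
# Sketch only -- models, client and state transitions are illustrative.
class Packages::ImportPlans::BuilderWorker
  include ApplicationWorker

  # Shield against duplicated executions: two jobs for the same plan (and thus the
  # same base_url) should not run concurrently.
  deduplicate :until_executed
  idempotent!

  def perform(plan_id)
    plan = Packages::ImportPlan.find_by_id(plan_id)
    return unless plan

    plan.start_generation!

    client_for(plan).each_package_page do |nodes|
      rows = nodes.flat_map do |node|
        node['versions'].map do |version|
          # Find or create: (name, version, type) is unique per project, so an
          # existing package is re-used and its files will be appended to.
          # The real implementation would need a race-safe find or create.
          package = plan.project.packages.find_or_create_by!(
            name: node['name'],
            version: version['name'],
            package_type: node['packageType'].downcase
          ) { |pkg| pkg.status = :waiting_import }

          { plan_id: plan.id, package_id: package.id }
        end
      end

      # INSERT intensive: link the packages to the plan in bulk, one statement per page.
      Packages::ImportPlansPackage.insert_all(rows) unless rows.empty?
    end

    plan.finish_generation!
  end
end
```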
The Package Importer
Role: Pick any waiting_import package (+ plan in executing) and import it.
Type: Limited capacity (will re-enqueue itself while there are waiting_import package rows).
How is it enqueued: Cron job or by itself.
Steps:
Pick the next waiting_import package (ordered by created_at) and lock it for importing (eg. update the status to importing).
Build the list of linked filenames (+ metadata such as fingerprints) using the GraphQL API.
For each filename, query the quick search Rest API endpoint to get the storage URL. Consider the first URL as it is the most recent one.
Download the file itself to a temporary location.
Create the related Packages::PackageFile object and save it. This will trigger an upload to ObjectStorage.
Redo (3.) - (5.) for each filename.
If post import is necessary:
Prepare it. Depending on the format, this could mean creating additional auxiliary objects.
Enqueue the related job.
The limited capacity worker type fits the bill really well here (see the sketch below). The last thing we want is a giant import plan (with a large set of packages) eating up all the background resources and clogging the queues.
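A sketch of the Package Importer as a limited capacity worker. LimitedCapacity::Worker is the existing GitLab concern for this pattern; the rest (scopes, statuses, the step execution and post import routing) re-uses the illustrative names from the previous sketches:

```ruby
# Sketch only -- everything except the LimitedCapacity::Worker concern is illustrative.
class Packages::ImportPlans::PackageImporterWorker
  include ApplicationWorker
  include LimitedCapacity::Worker

  # How many importer jobs may run in parallel, read from the new application setting.
  def max_running_jobs
    ::Gitlab::CurrentSettings.packages_import_package_importer_capacity
  end

  def perform_work
    # Pick the next waiting_import package (oldest first) and lock it by updating its status.
    package = Packages::Package.where(status: :waiting_import).order(:created_at).first
    return unless package

    package.update!(status: :importing)

    execute_import_step(client_for(package), package) # see the step execution sketch above

    if post_import_needed?(package)
      package.update!(status: :waiting_post_import)
      post_import_worker_for(package).perform_async(package.id)
    else
      package.update!(status: :default)
    end
  rescue StandardError
    package&.update!(status: :import_failed)
    raise
  end

  def remaining_work_count
    Packages::Package.where(status: :waiting_import).limit(max_running_jobs + 1).count
  end
end
```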
The post import jobs
Role: Finish preparing an existing package in the GitLab Package Registry.
Type: standard job.
How is it enqueued: by the Package Importer.
The beauty here is that those jobs already exists for most of package formats that need this step. That's because it's a step done after a regular user uploads happens.
The only job that might be needed here is the one for Maven packages for the maven-metadata.xml files. Note that the related service already exist and it's already supporting: both xml (for standard version and snapshots) and maven plugins.
Application Settings
We will need a new application setting:
packages_import_package_importer_capacity - defines how many "Package Importer" jobs can run in parallel.
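A minimal sketch of what adding that setting could involve; the default value is a placeholder, not a recommendation:

```ruby
# Illustrative migration -- the default is a placeholder.
class AddPackagesImportCapacityToApplicationSettings < ActiveRecord::Migration[6.1]
  def change
    add_column :application_settings, :packages_import_package_importer_capacity,
               :integer, default: 2, null: false
  end
end

# Read wherever the worker needs it:
#   Gitlab::CurrentSettings.packages_import_package_importer_capacity
```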
API changes
We will need API endpoints (preferably GraphQL) for:
Create a new plan.
Destroy a plan.
Listing plans given a project.
Plan details with all the related steps (well, actually packages).
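For the GraphQL side, the create mutation could be shaped roughly like this, following graphql-ruby conventions. Field names, the argument types, and the authorization rules (which role can create a plan) are all open:

```ruby
# Sketch of a create mutation -- naming, types and authorization are not final.
module Mutations
  module Packages
    module ImportPlans
      class Create < BaseMutation
        graphql_name 'PackagesImportPlanCreate'

        argument :project_path, GraphQL::Types::ID, required: true
        argument :base_url, GraphQL::Types::String, required: true
        argument :source_credential_token, GraphQL::Types::String, required: true

        field :import_plan, Types::Packages::ImportPlans::PlanType, null: true # hypothetical type

        def resolve(project_path:, **args)
          project = Project.find_by_full_path(project_path) # authorization (which role?) to be decided

          plan = project.packages_import_plans.create(args)

          { import_plan: plan.persisted? ? plan : nil, errors: plan.errors.full_messages }
        end
      end
    end
  end
end
```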
For each package format that needs a post import step:

| MR | Weight |
| --- | --- |
| Post import step for package format X | 2 |
At the ~"group::package" capacity at the time of this writing, the implementation duration estimate would be around 3-4 milestones.
Implementation roadmap
Looking at the MR plan, there are quite a few unknowns around the post import steps for the package formats that need them. Also, we could have surprises from formats that we currently think will not need a post import step.
In addition, we could have a few surprises with some package formats. For example, in the GitLab Package Registry, Composer packages are the git archive of the target tag. Does Artifactory store Composer packages in a similar fashion? Can we re-use the stored file as is?
I think that tackling the implementation on a "let's support all package formats in a single iteration" basis is risky. After careful thinking here, I think it's better to expand the MVC scope a bit with plan filters (see below) so that users can specify package types on the plan. This allows us to tackle the import feature format by format. As we finish the support for a given format, we can make it available for filtering in the import plan creation.
Following the "go slowly, one package format at a time" approach, I would advise starting with NPM. The goal here is to build all the building blocks of the import feature with a really simple package format (with no post import step). Once NPM is supported, we can continue with Maven, which will expand the existing code with post import steps. The next formats could be selected by popularity or something else.
Given the amount of work, the risk of having efforts spanning more than a single milestone is high. As such, the use of a feature flag is strongly advised here. As seen in past experiences, this also helps in parallelizing implementation work.
Possible future iterations
Add filters to the plan so that users can scope the plan generation to:
A package type.
A package name.
A package version.
Here is a glimpse of all filters that are available in Artifactory APIs.
Dry run the plan generation to have a preview of which packages would be imported.
Upgrade the import feature to mirror mode.
In mirror mode, we compare the source Package Registry and the target GitLab Package Registry. We modify the GitLab Package Registry so that its state is strictly equivalent to the source Registry.
With the mirror mode comes the question of cadence. At which interval do you check the source Registry? Daily? Weekly?
Support different kinds of source Package Registry.
What about supporting a GitLab instance as a source? This way, we can "move" packages from one GitLab Package Registry to Another.
Let users decide what happens on conflict (eg. the package already exists on GitLab):
Either not allow this and stop the import for that package or
allow package files to be appended to the existing package.
What If? (Alternative solutions)
These are more food for thought. Perhaps something to consider as long-term evolutions.
A plan with steps that are executed in parallel by background jobs is really similar to another popular construct we have in GitLab: a CI pipeline.
What if an import is actually just a pipeline importing packages from a Package Registry?
Questions
I think my concerns are mainly the scalability of the system. Importing ten packages is not the same load as importing millions of packages. Here are my specific concerns:
Could "The Plan Builder" job have troubles because it tries to go through a very large set of packages?
I don't have an answer here. Perhaps we could set up a limit so that we keep things under control.
Could "The Package Importer" job have troubles because the target package has way too many files in the source registry?
Again, using a limit could help.
Could we flood Sidekiq with post import jobs (metadata extraction or others)?
This already exists.
I'm not super happy with the logic I found to go from a package name, version and type to a list of urls pointing at package files. It's really sad that the GraphQL API doesn't return download urls. The problem I see with the Quick search and Storage API is that (Artifactory) repositories can get in the way. From my testing, I could upload a NPM package foobar 1.3.6 in two different repositories. When looking for the download urls, the Quick search API returned two because I had the exact same package in multiple repositories.
I'm not sure that there is something we can do here.
What is the max page size of the GraphQL API? (eg. how many packages are returned in a single page?)
Revisions
r1 - initial version.
r2 - updates on the overall flow and plan's states based on feedback.
@sabrams Can you have a look at the investigation above and voice your opinion?
@rchanila Can you do the same? Please note that I didn't go far on the frontend side as I think the UX still needs some refinement and thought. Still, knowing how the backend will organize the code/data gives hints on how to organize the frontend.
@katiemacoy / @michelletorres With Tim being on parental leave, can you have a look at the analysis? In particular, the Implementation roadmap, where I suggested expanding the MVC scope a bit to allow us to iterate on each package format's support.
@katiemacoy I have some concerns around the scalability of the solution. As such, I'd like to request a scalability review of the analysis. Before doing that, I wanted to know if we have data on the scale we will need to work with? Do you have some data providing answers to:
How many packages need to be imported from Artifactory?
What are the main formats of users that need the Artifactory import?
If the Generic registry is used, it would be nice to have a sense of what is the max count of (uniquely named) files under a single package? (this might be hard to get).
Fantastic writeup as always @10io! I'm glad to see there is a fairly straightforward path in terms of the Artifactory APIs: authenticate -> fetch the list of packages -> fetch each package. It's a bit disappointing we have to mix GraphQL and REST calls to make it happen, but as you said, the background jobs will be the most interesting part.
I think the implementation plan looks great and don't have any major concerns. I jotted down a few notes and questions while reading (in no particular order). Many of these may fall into the future iterations category, so they aren't necessarily relevant or in need of a response right now, but just points to discuss or consider future issues for:
I quickly browsed and saw we do not have a GraphQL-specific client in the rails repo, but I don't see it being a problem to use HTTParty or Faraday. There is a graphql-client gem, but I don't think that is necessary.
What happens on a failed or partial import? Do we have the ability to retry a given plan and have it ignore the packages that successfully imported the first time?
Along with the idea of a failed or partial import, what do we do about packages that are created in the awaiting import state that never get imported? Maybe these get picked up by the cleanup policies, or maybe there's a separate cleanup job for this?
On the find or create note you added. Should we add a setting on the plan so users can decide what to do on conflicts (allow or deny)? Either way, we need to be very clear that users may overwrite existing packages.
From the frontend perspective, I think it would be great to have a UI that lists the packages to be imported so there's some sort of visual validation to the user before they click import. Depending on the effort of the rest of the frontend work, it seems to me this could be worked on in parallel while the backend development is happening for the background jobs after we set up the plan table and GraphQL endpoint. I think it would be a huge UX gain to have something where a user enters their Artifactory config into the form, clicks save, it shows the plan page, which has a list of packages from their Artifactory instance, and then there is a button to "Start Import".
Should we give users the ability to stop imports? I could imagine a case where a user with a large number of packages starts an import that may take a long time and then wants to cancel or stop it.
I like the idea of starting with a single format and then adding them one at a time. I had a few thoughts on this:
How difficult would it be to update the import_plan model to include a format or list of formats? Then a project could have many import_plans, but users could import one format at a time, or target specific formats. I could also see a case where users don't want to import all packages to a single project, but would be interested in importing each format to a different project.
Would we consider having an option to import any unsupported formats into the Generic registry?
packages_import_plans and some of the underlying functionality seems very similar to what we will need when we add remote repositories for the dependency proxy/virtual repository features. Are there any considerations we should have now so we can keep this model easily extendable when we look at those future projects?
One other thing I thought of since the self-managed container registry migration is on my mind: what do you think about creating a CI script/template option for package types that don't need post-import processing? I would think this would be relatively fast to create and could help unblock users that have smaller Artifactory repositories, or users that are willing to dedicate some time working through it on a semi-manual basis.
How do we decide which project/group the package from needs to be imported to?
Who is allowed access to view/initiate the import process?
From the frontend perspective, I think it would be great to have a UI that lists the packages to be imported so there's some sort of visual validation to the user before they click import. Depending on the effort of the rest of the frontend work, it seems to me this could be worked on in parallel while the backend development is happening for the background jobs after we set up the plan table and GraphQL endpoint. I think it would be a huge UX gain to have something where a user enters their Artifactory config into the form, clicks save, it shows the plan page, which has a list of packages from their Artifactory instance, and then there is a button to "Start Import".
Agree, @sabrams. This is similar to what I thought on the frontend; keen to hear Katie's opinion.
It could look similar to UX flow of import projects. I can also see us having a similar section in documentation in the future with ability to import from different registries.
@10io this write-up is so helpful. I have gone from having no idea how this feature would work to being able to picture it.
Answering your questions
In particular, the Implementation roadmap, where I suggested expanding the MVC scope a bit to allow us to iterate on each package format's support.
This is tricky. Customers are wanting to completely move off of Artifactory and they usually have multiple formats. It means the MVC you're suggesting would not solve their problem. However, I do understand that trying to tackle multiple formats at once is too large of scope. If we move forward with tackling one format at a time, we need to accept that there will probably be low uptake of this feature initially. Tackling the formats in terms of most popular -> least popular makes sense.
How many packages need to be imported from Artifactory? What are the main formats of users that need the Artifactory import?
I don't have this info but I can start reaching out to customers to ask.
If the Generic registry is used, it would be nice to have a sense of what is the max count of (uniquely named) files under a single package? (this might be hard to get).
When you say the Generic registry, do you mean migrating a generic package from Artifactory to GitLab?
Other thoughts
From the frontend perspective, I think it would be great to have a UI that lists the packages to be imported so there's some sort of visual validation to the user before they click import.
Agree.
It could look similar to UX flow of import projects.
Agree.
How do we decide which project/group the package from needs to be imported to?
I think the user would decide this and initiate the import from inside the project
Who is allowed access to view/initiate the import process?
Good question, we need to investigate this in user research
Next steps from UX
Tim wanted to wait on doing UX research on this until we figured out if it's technically feasible. It looks like it is technically feasible, so we'll need to do some research. cc @enf FYI - there isn't a UX research issue for this but I can create one. We'll also need to figure out the priority and plan it in.
Quick brain dump of what I'd like to learn from research:
Validate target audience for this feature (I'm assuming enterprise customers mostly)
Understand who would be doing the migration
What are their expectations of migration? (time it takes, workflow they need to follow, etc.)
Understand how they'd like to migrate (all packages to one project? to many projects? only import some packages?)
Which formats do they need to migrate?
How many packages?
What impact does it have on them if we can't migrate all of their formats?
tl;dr - Based on earlier research, we identified supports (e.g., a cross-walk of key terms in Artifactory and similar terms in GitLab) that we'd want to have in place to support the transition. And we want to have the ENG/Design resources to draft those materials before working on a study to see how we could improve them.
I think the expeditious way forward would be to find a few key customers (who have a large amount of packages to migrate) and hold their hands through the process to model out what support needs to look like.
I think one such customer tagged Tim a few days ago to check in, in his related issue.
We can pick up with the research discussion in #1663 or a new issue if you think we need a new one. Go team.
I think the expeditious way forward would be to find a few key customers (who have a large amount of packages to migrate) and hold their hands through the process to model out what support needs to look like.
I have an 8,000-seat Ultimate customer that is urgently looking for this capability and would, I imagine, be happy to assist.
@jfeeney + @kbockrath While we put a plan in place (no promises right now on what that will look like and/or the timeline), it'd help us greatly if you could populate this issue with some notes on what your customers are struggling with and/or what has been helpful for them as they think about the migration: #2053
Yeah, it seems that they have different products trying to work together and this is reflected in the mix of API types. From what I saw:
The Rest API is really there for the raw access, like accessing storage to get data. That's what we need for getting the package files. The problem we have there is that there is no concept of packages at all.
The GraphQL API is where the packages approach lives. You can reason about your Artifactory in package terms. Packages are the first-class citizen; repositories come after. This is super useful to build the "state" of the Package Registry, eg. the list of packages that are in the Package Registry. Downside: the GraphQL API doesn't allow you to navigate down to the file level and ultimately build the URL that is needed to download the file you want.
We could work with the Rest API only, but I'm a bit concerned that this would lead to too many assumptions on our part, which results in less reliable code (imagine that we "compute" the packages available out of the storage API by analysing URLs). The day that this Rest API evolves, there are higher chances that our implementation breaks.
Out of all the not-so-great approaches, mixing both access methods (Rest and GraphQL) seemed the best because we simply use what is available and we lower the number of assumptions we need to make on our side.
I quickly browsed and saw we do not have a GraphQL-specific client in the rails repo, but I don't see it being a problem to use HTTParty or Faraday. There is a graphql-client gem, but I don't think that is necessary.
Yes, I saw the graphql-client gem which, if I'm not wrong, comes from the same maintainers as the graphql gem. At this point, I don't know which path is best. I think we will have a better view once we need to implement this. Either way, we will need two clients: one for Rest, one for GraphQL. Do we pack everything together? Do we use a gem for the GraphQL part? I don't know that (yet).
What happens on a failed or partial import? Do we have the ability to retry a given plan and have it ignore the packages that successfully imported the first time?
I think we should have an approach similar to the one we have in pipelines. What happens when a pipeline has a partial execution? Users can retry all failed jobs in a single click (retry button) or they can retry them individually. We should have the same here.
Added a section for execution failures.
Along with the idea of a failed or partial import, what do we do about packages that are created in the awaiting import state that never get imported? Maybe these get picked up by the cleanup policies, or maybe there's a separate cleanup job for this?
I think after some time, those should be considered as "timed out" and should be cleaned up automatically.
This opens the question: what do we do with plans that are ready to be executed (all steps generated) but never got executed? Should we time them out too?
On the find or create note you added. Should we add a setting on the plan so users can decide what to do on conflicts (allow or deny)? Either way, we need to be very clear that users may overwrite existing packages.
That's a good idea
I think I would rather have it as a follow up. The amount of work is already quite large. Note that we could do follow ups in parallel with supporting new package types.
Follow up added
From the frontend perspective, I think it would be great to have a UI that lists the packages to be imported so there's some sort of visual validation to the user before they click import. Depending on the effort of the rest of the frontend work, it seems to me this could be worked on in parallel while the backend development is happening for the background jobs after we set up the plan table and GraphQL endpoint. I think it would be a huge UX gain to have something where a user enters their Artifactory config into the form, clicks save, it shows the plan page, which has a list of packages from their Artifactory instance, and then there is a button to "Start Import".
Yes, I think we will need that. However, note that it can't be as quick as: click save to create a new plan and the users are taken to the plan with all steps generated.
I think we need to be prepared for the fact that generating an import plan can take a non-trivial amount of time, and background jobs will need that time to generate all the steps.
Having said that, we can add a new state, ready_to_import, to the plan. Added to the analysis.
Should we give users the ability to stop imports? I could imagine a case where a user with a large number of packages starts an import that may take a long time and then wants to cancel or stop it.
Yes, we can. It shouldn't be too hard to have that.
stopped_import state added.
How difficult would it be to update the import_plan model to include a format or list of formats? Then a project could have many import_plans, but users could import one format at a time, or target specific formats. I could also see a case where users don't want to import all packages to a single project, but would be interested in importing each format to a different project.
Perhaps I explained it badly but this was the actual idea. The package type "filter" or "condition" is part of the plan so that the plan builder job knows which packages should be considered when contacting Artifactory.
I know that users want to have a plan that imports all supported types and that's it, but I think we should really work in iterations here and the package format provides a nice way to do so. For example: we release NPM support for Artifactory imports. Users can already use that. Then, we release Maven support; users that already used the NPM imports can use that too.
At some point, we will achieve the ultimate goal: users will be able to select all supported package types and watch the plan's execution.
Would we consider having an option to import any unsupported formats into the Generic registry?
I'm not sure about this one. For example, let's say that we import Swift packages into the Generic registry. What do users do then? They can't use the Swift package manager to pull them. If they can't pull these, why have them in the Generic registry in the first place?
My point is that, yes, that would work technically, but then users will need to make an extra effort (like having custom scripts) to pull those from GitLab. Is that worth it?
packages_import_plans and some of the underlying functionality seems very similar to what we will need when we add remote repositories for the dependency proxy/virtual repository features. Are there any considerations we should have now so we can keep this model easily extendable when we look at those future projects?
I recall seeing the dependency proxy as a way to import packages. The dependency proxy for packages could proxy the Artifactory repositories. Then, instead of caching packages as "plain blobs", we could cache them in the corresponding Package Registry of the project.
I'm not sure I see similarities with the virtual registries. Virtual registries are more about having a set of Package Registries that the backend needs to walk to find the requested package. If at some point we have caching, then we are in the same situation as above: the cache itself could be the Package Registry.
How do we decide which project/group the package from needs to be imported to?
Import plans are tied to a project id, meaning the import feature will work at the project level in the first iterations. This makes sense as all packages in the GitLab Package Registry are hosted on a project and never on a group. We do have group endpoints (and even instance ones) but those are for pulling a package, and they actually aggregate all packages of all included projects.
Who is allowed access to view/initiate the import process?
I'm not sure for now. I wanted to say the same users that have the create_package permission, but the minimum access level there is developer. That's a bit too low for my taste; we don't want dozens of developers creating many import plans. Perhaps we should start with at least maintainer.
It could look similar to UX flow of import projects. I can also see us having a similar section in documentation in the future with ability to import from different registries.
Yes, I think we have a mix of UIs here. I think we can re-use approaches done in "import projects" and "pipeline details" pages.
Again, the frontend side of the analysis is not pushed very far. To me, we still have UX discussions to hold before going into more detail.
This is tricky. Customers are wanting to completely move off of Artifactory and they usually have multiple formats. It means the MVC you're suggesting would not solve their problem. However, I do understand that trying to tackle multiple formats at once is too large of scope. If we move forward with tackling one format at a time, we need to accept that there will probably be low uptake of this feature initially. Tackling the formats in terms of most popular -> least popular makes sense.
I understand that but the "per package format" approach nicely fits the MVC definition where we will not wait to have all package formats supported to deliver something.
I don't have this info but I can start reaching out to customers to ask.
I think this part is the biggest question around the analysis. We need to have a sense of the scale at which this needs to work. Obviously, importing hundreds of packages is not the same as importing millions of packages. This will determine how much scalability work we need to do.
I'd like to have a scalability review of the above analysis, but before asking for that, we need some numbers.
When you say the Generic registry, do you mean migrating a generic package from Artifactory to GitLab?
Yes, that's correct.
Thanks again everyone for the great feedback here!
Feel free to have a quick re read and ping me if necessary.
Import plans are tied to a project id, meaning the import feature will work at the project level in the first iterations. This makes sense as all packages in the GitLab Package Registry are hosted on a project and never on a group. We do have group endpoints (and even instance ones) but those are for pulling a package, and they actually aggregate all packages of all included projects.
That makes sense. Ah yes, I got confused with container registry where images can be pushed at group level. UX question @katiemacoy: if someone who has no projects on GitLab wants to move from Artifactory, what would that experience look like? Would we create an empty project & import all packages there? Would be happy to help explore those nuances.
Perhaps, we should start with at least maintainer.
Ah yes, I got confused with container registry where images can be pushed at group level.
Slight correction here: container registry images are also pushed at the project level.
For pulling, we have it at the project level too (obviously). We could say that we have it at the group level with the dependency proxy although the scenario there is different.
The UI is similar to packages. At the group level, it aggregates all the images from all included projects.
If someone who has no projects on GitLab wants to move from Artifactory, what would that experience look like?
Good question! The project is a requirement of the import. Even the UI itself should be available only within a project.
This could be an opportunity to introduce the "store all packages in one project" flow.
If someone who has no projects on GitLab wants to move from Artifactory, what would that experience look like? Would we create an empty project & import all packages there?
Good call out, we'd need to account for this.
I think before we get into those nuances it would be good to talk to some customers and understand what their expectations are and after that we can start to think about flows
Thanks for answering and adding some details @10io!
Would we consider having an option to import any unsupported formats into the Generic registry?
I'm not sure about this one. For example, let's say that we import Swift packages into the Generic registry. What do users do then? They can't use the Swift package manager to pull them. If they can't pull these, why have them in the Generic registry in the first place?
My point is that, yes, that would work technically, but then users will need to make an extra effort (like having custom scripts) to pull those from GitLab. Is that worth it?
To the question "Is that worth it?" we should defer to @katiemacoy and @trizzi to get more customer voice on this, but my take is users want to get off Artifactory and stop paying for it. To do that, they need to get all of their packages off the system. Some might be willing to create their own custom solution for the formats we don't currently support, but some might just be happy with the idea of "take all my stuff and move it over to GitLab so I can close my Artifactory account".
@sabrams yup that's my understanding, this feature is useful when they can get all of their packages off of Artifactory. We definitely need to have some conversations with customers (https://gitlab.com/gitlab-org/ux-research/-/issues/1663) but as a first step perhaps we could survey a larger group of customers to learn:
how many packages they are wanting to migrate?
what formats are they trying to migrate?
That information can help inform the technical path.
The qualitative research can fill in the gaps in terms of users' expectations, what they need in place to make the migration worth it, terminology, how much effort they are willing to put in, etc.
One word on the status of this analysis: before moving forward with this, we really need a group::scalability review on this. For that review, we will need to know the scale at which package imports will need to work. @katiemacoy / @enf are working on that part.
@michelletorres Not sure how you want to handle this? To me, ideally, the analysis needs the group::scalability review. Until we have that, the analysis is not really done and we can't move forward with creating an epic and/or issues. Do we keep this issue open until it's done? Do we do it differently?
We can promote this to an epic, create a new issue for scalability, include in there the research Katie and Erika are doing and when we complete all that, we can create the issues inside the epic.
I'm still a bit bothered by the scalability review being in its own issue. The scalability review should be kept close to the investigation. The scalability issue can reveal some downsides in the approach and this would make us revise the approach. So potentially, we would need to update the investigation in this issue, which doesn't make a lot of sense if this issue is closed.
The thing is that this is a Deliverable for 15.4, but, to be honest, I don't want to worry about skewing the metrics with this issue, and I prefer we do the right thing.
I took the liberty of moving this issue there and creating one for the scalability review.
@enf / @katiemacoy I took the liberty of moving your UX survey issue there too. Feel free to move things around and add anything related to the "Import packages from Artifactory" subject.