Support for mirroring of upstream sources
We have outlined an implementation plan in #328 (closed) for allowing BuildStream to download sources from multiple mirrors.
While interoperability with existing mirroring solutions is important, we have an opportunity to provide a seamless turnkey solution for mirroring in general; this is discussed in this email.
Using a separate mirroring solution presents the following problems:
- One requires explicit and verbose configuration in project.conf
- Every time that you want to start mirroring something new, an external moving part must be configured separately from the BuildStream project data
- Every time that you want to stop mirroring new versions of repositories, when for instance those repositories are no longer in use by more recent versions of your BuildStream project, you again need to configure your mirroring solution separately
By implementing a bst mirror command, with some corresponding additional support in the Source object, we are able to eliminate the above hassles by:
- Inferring the location of sources inside a given mirror, and treating a mirror as a single URL, eliminating the overhead of explicit configuration in project.conf for each source alias
- Making the mirroring solution project driven, such that:
  - We start mirroring upstream source repositories as soon as the requirement for such repositories appears in a project being mirrored
  - We stop mirroring new versions of those upstream source repositories as soon as they are no longer required by the project being mirrored
  - We never delete previously mirrored sources, meaning that you should be able to build your BuildStream project for every ref of every source ever seen in your project's history, so we make no sacrifices of repeatability here.
Some details and specifications follow
New Source API
The Source object gains a new Source.mirror() API, which raises ImplError by default.
In contrast with:
- Source.fetch(): which is only guaranteed to get the desired ref (i.e., a shallow clone is allowed)
- Source.track(): which is only guaranteed to look up the latest ref for a given symbolic track parameter (i.e., it need not ever clone a repository at all)
The Source.mirror() API must instead fetch the latest of everything in the given upstream repository.
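The contrast between the three methods can be sketched as follows. This is only an illustration under stated assumptions: the ImplError class defined here is a local stand-in, the Source class is a bare skeleton rather than BuildStream's real base class, and FullCloneSource is a hypothetical plugin.

```python
class ImplError(Exception):
    """Local stand-in for the error raised by unimplemented plugin methods."""


class Source:
    """Bare skeleton of the proposed API surface (not the real base class)."""

    def fetch(self):
        # Only guaranteed to obtain the desired ref; a shallow clone is allowed.
        raise ImplError("fetch() is not implemented")

    def track(self):
        # Only guaranteed to look up the latest ref; may never clone anything.
        raise ImplError("track() is not implemented")

    def mirror(self):
        # Proposed: fetch the latest of everything upstream.
        # Raises ImplError by default, until a source type opts in.
        raise ImplError("mirror() is not implemented")


class FullCloneSource(Source):
    """Hypothetical plugin which opts in to mirroring."""

    def __init__(self):
        self.mirrored = False

    def mirror(self):
        # A real plugin would update a complete copy of the upstream here.
        self.mirrored = True
```

A source type which has not been ported simply inherits the default and keeps raising ImplError.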
The Source object in this instance will use the same Source.get_mirror_directory() to store the result; however, there are some additional constraints, listed here.
Must retain original upstream format
For existing git and bzr sources, this should not be problematic, as the repository currently downloaded in fetch or track retains the original upstream format.
However, the tar, zip and deb "downloadable file" sources currently use a scheme where the downloaded file is renamed after its sha256sum.
Either:
- This has to change for _downloadablefilesource.py such that the filename is retained and a separate ${filename}.sha256sum file is created beside it, so that it always works in this fashion
- This can be implemented differently, specifically for the Source.mirror() API
The former is preferable, even if it might temporarily annoy some people by eating up some disk space.
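The first option could look roughly like the sketch below. The helper names store_download() and verify_stored() are hypothetical and are not taken from _downloadablefilesource.py; the exact contents of the .sha256sum file (a bare hex digest here) are also an assumption.

```python
import hashlib
import os
import shutil


def store_download(downloaded_path, filename, directory):
    """Store a download under its upstream filename with a checksum sidecar.

    Hypothetical helper: returns (stored path, hex digest).
    """
    os.makedirs(directory, exist_ok=True)

    # Hash the payload in chunks so large files are not read into memory.
    sha = hashlib.sha256()
    with open(downloaded_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            sha.update(chunk)
    digest = sha.hexdigest()

    # Keep the original filename, and record the digest beside it.
    target = os.path.join(directory, filename)
    shutil.copyfile(downloaded_path, target)
    with open(target + ".sha256sum", "w") as f:
        f.write(digest + "\n")
    return target, digest


def verify_stored(target):
    """Check a stored file against its ${filename}.sha256sum sidecar."""
    with open(target + ".sha256sum") as f:
        expected = f.read().strip()
    with open(target, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    return actual == expected
```

With this layout the mirror serves the file under the name upstream gave it, while the ref can still be checked locally through the sidecar file.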
Must support incremental namespacing
Normally, a Source does something like the following to decide the directory where it should store a payload:
base = self.get_mirror_directory()
subdir = utils.url_directory_name(upstream_url)
directory = os.path.join(base, subdir)
Instead, we need a numerical counter after base and subdir; the reason for this is explained in the next subsection of this document.
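One possible reading of this, assuming the counter becomes an extra path component below subdir, is sketched here. The url_directory_name() function is a simplified stand-in for utils.url_directory_name() and may not match its real output; namespaced_directory() is a hypothetical name.

```python
import os


def url_directory_name(url):
    # Simplified stand-in for utils.url_directory_name(): replace anything
    # that is not filename friendly with an underscore.
    return "".join(c if c.isalnum() or c in "-._" else "_" for c in url)


def namespaced_directory(base, upstream_url, index):
    """Return base/subdir/index, where index is the numerical counter."""
    subdir = url_directory_name(upstream_url)
    return os.path.join(base, subdir, str(index))
```

For example, namespaced_directory(base, url, 0) and namespaced_directory(base, url, 1) would hold two incompatible generations of the same upstream URL side by side.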
Support for incompatible changes in upstreams
Upstreams can introduce incompatible changes, which we need to handle so that a given ref remains obtainable permanently. Incompatible changes can occur when:
- CVS surgery has been performed on the upstream
- A git history rewrite has occurred
- A tarball is overwritten with a new tarball, without adding any post release suffix to the tarball name (thus the tarball remains accessible with the same name, but now produces a new sha256sum)
When mirroring, we work in a loop: if such an incompatible change is detected upstream, we import new data only into a compatible mirror, using the numeric namespace explained above, and create an entirely new numerically namespaced subdir if none of the existing mirrors is compatible with the upstream one.
For this, we probably want to add a Source level public API for iteration over these subdirectories and for creation of a new one.
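A minimal sketch of such iteration and creation helpers follows. All three function names are hypothetical, and the is_compatible callback is an assumption standing in for whatever check a given source type would use to decide that upstream still fits an existing mirror.

```python
import os


def list_namespaces(parent):
    """Return the existing numeric subdirectories of parent, in numeric order."""
    if not os.path.isdir(parent):
        return []
    names = [
        name for name in os.listdir(parent)
        if name.isascii() and name.isdigit()
        and os.path.isdir(os.path.join(parent, name))
    ]
    return [os.path.join(parent, name) for name in sorted(names, key=int)]


def create_namespace(parent):
    """Create and return the next numerically namespaced subdirectory."""
    existing = list_namespaces(parent)
    next_index = int(os.path.basename(existing[-1])) + 1 if existing else 0
    path = os.path.join(parent, str(next_index))
    os.makedirs(path)
    return path


def select_namespace(parent, is_compatible):
    """Pick the first compatible mirror, or create a new one if none is."""
    for path in list_namespaces(parent):
        if is_compatible(path):
            return path
    return create_namespace(parent)
```

Because existing subdirectories are never reused for incompatible data, refs mirrored before a history rewrite stay available in their original subdirectory.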
Internally calling Source.mirror()
This should be done with an internal private Source._mirror() wrapper which emits a warning in the case that the given source type does not (yet) support the Source.mirror() method.
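The wrapper could be sketched along these lines. This assumes a local ImplError stand-in and uses the standard library warnings module in place of BuildStream's own messaging, which is what a real implementation would more likely use; the boolean return value is also an assumption.

```python
import warnings


class ImplError(Exception):
    """Local stand-in for the error raised by unimplemented plugin methods."""


class Source:
    def mirror(self):
        # Default implementation: mirroring is not supported.
        raise ImplError("mirror() is not implemented")

    def _mirror(self):
        """Private wrapper: warn instead of failing for unported sources.

        Returns True if mirroring ran, False if it is unsupported.
        """
        try:
            self.mirror()
        except ImplError:
            warnings.warn(
                "{} does not (yet) support mirroring".format(type(self).__name__)
            )
            return False
        return True


class MirrorableSource(Source):
    """Hypothetical source type which implements mirror()."""

    def mirror(self):
        pass  # A real plugin would update its complete mirror here.
```

This keeps a mirroring run going across a project which mixes ported and unported source types.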
New MirrorQueue
Similar to the TrackQueue
and FetchQueue
, this is a simple component to drive Source._mirror()
New loading technique
For the sake of running bst mirror, it is more convenient to have a loading technique which loads every element found in the project directory, instead of following the specified targets.
This might prove trickier with project options in play, so let's call this optional and "nice to have".
New bst mirror command
Ideally does not have a TARGETS parameter and just loads everything, but plausibly needs to have a TARGETS parameter.
This just loads the pipeline, which in turn runs the new MirrorQueue.
Simplified configuration and client side additions
The client side story for downloading from multiple mirrors, as described in #328 (closed), needs some extensions:
- A "mirror" can now be defined as only a base URL with a mirror-name
- These can be mixed with other "mirror" definitions that are not bst mirror driven
- For a "mirror" which is configured for a bst driven mirror, we resolve Source.translate_url() differently, under the assumption that the payload will reside at the configured mirror url with a well known subdirectory (as we would have constructed it locally).
In addition to the simplified configuration, some blacklisting can be done on a per source alias basis. This allows an organization which hosts their own git repositories to exclude those repositories from the mirroring process, as it may be a popular choice to "Only mirror the third party sources which you do not already host yourself"
Iterating over "alias mappings"
As discussed in #328 (closed), there may be multiple alias mappings. When configuring for interoperability, these must all be listed explicitly; but when we expect a bst mirror driven mirror, these are traversed dynamically and in order of an incremental numeric namespace subdirectory.
This way we just try every possible repo for a given source at a given mirror, and stop iterating when one of the URLs is unreachable (the subdirectory does not exist on the server).
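The traversal described above might be sketched as follows. The URL layout (mirror base, then subdirectory, then counter), the function names, and the exists and has_ref callbacks are all assumptions made for illustration; a real client would probe the server and the repository instead.

```python
def candidate_urls(mirror_base, subdir):
    """Yield mirror URLs in order of the incremental numeric namespace."""
    index = 0
    while True:
        yield "{}/{}/{}".format(mirror_base.rstrip("/"), subdir, index)
        index += 1


def find_repository(mirror_base, subdir, exists, has_ref):
    """Return the first numbered repository which has the ref, or None.

    Iteration stops as soon as a candidate URL is unreachable, since the
    numbered subdirectories are assumed to be created without gaps.
    """
    for url in candidate_urls(mirror_base, subdir):
        if not exists(url):
            return None
        if has_ref(url):
            return url
```

Returning None lets the caller fall through to the next configured mirror or to the upstream itself.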
Documentation and setup for hosting a mirror
Hosting a mirror mostly consists of setting up a server to:
- Periodically run a task
- The task fetches the latest commits in the history of the BuildStream projects which it is configured to mirror
- The task proceeds to run bst mirror on the projects (or projects and target elements) which it is configured to mirror
Further, the mirror directory must be configured as the source cache in the user configuration used to launch bst mirror, so that the task running bst mirror is also allowed to write to the location where things will be hosted.
Finally, it is up to the project administrators to set up the host so that it can serve these payloads in the required formats, over the URI schemes that are used in the project.conf source aliases (this just makes the mirror accessible to build machines and users/developers).
This is to say:
- You need to serve http(s):// if you want to be mirroring tarballs or ostree repositories
- You probably want to serve git:// in order to host git repositories