
Enable indexing from S3 metadata bucket

Mark Woodhall requested to merge feature/index-from-s3 into develop

This MR makes it possible to index metadata from a metadata bucket stored in S3.

With this MR, the REST API can be run with a profile that causes it to ingest metadata from S3. For example:

AWS_PROFILE=crossref-staging AWS_REGION=eu-west-1 METADATA_BUCKET=crossref-metadata-bucket-temp lein run :nrepl :api :s3-ingest

When run with :s3-ingest, the REST API starts a task that pages through all data in METADATA_BUCKET and indexes it in Elasticsearch.

To index only a subset of the data in METADATA_BUCKET, set METADATA_DOI to a prefix, for example:

AWS_PROFILE=crossref-staging AWS_REGION=eu-west-1 METADATA_BUCKET=crossref-metadata-bucket-temp METADATA_DOI=10.1145/253228.253255 lein run :nrepl :api :s3-ingest
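As a rough illustration of the paging and prefix filtering described above, here is a minimal Python sketch. It is not the code in this MR (which is part of the Clojure REST API). The names `list_page`, `index_item` and `make_fake_lister` are hypothetical, and the sketch assumes that object keys begin with the DOI, which may not match the real bucket layout.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page shape: (keys, next_token). next_token is None on the last page.
Page = Tuple[List[str], Optional[str]]
ListPage = Callable[[str, Optional[str]], Page]


def iter_keys(list_page: ListPage, prefix: str = "") -> Iterator[str]:
    """Yield every key under `prefix`, following continuation tokens."""
    token: Optional[str] = None
    while True:
        keys, token = list_page(prefix, token)
        yield from keys
        if token is None:
            break


def ingest(list_page: ListPage, index_item: Callable[[str], None], prefix: str = "") -> int:
    """Call `index_item` for each key under `prefix` and return the count."""
    count = 0
    for key in iter_keys(list_page, prefix):
        index_item(key)
        count += 1
    return count


def make_fake_lister(all_keys: List[str], page_size: int = 2) -> ListPage:
    """In-memory stand-in for a paged bucket listing (assumption: keys start with the DOI)."""
    def list_page(prefix: str, token: Optional[str]) -> Page:
        matching = sorted(k for k in all_keys if k.startswith(prefix))
        start = int(token) if token is not None else 0
        end = start + page_size
        next_token = str(end) if end < len(matching) else None
        return matching[start:end], next_token
    return list_page
```

Passing an empty prefix walks the whole listing, which corresponds to running without METADATA_DOI. Injecting `list_page` keeps the loop independent of any particular S3 client.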

If you do not wish to rely on S3, you can point METADATA_BUCKET at a local directory instead:

METADATA_BUCKET=/location/to/local/metadata METADATA_LOCAL_STORAGE=1 lein run :nrepl :api :s3-ingest
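The way the three environment variables combine can be sketched as follows. This is a hypothetical Python illustration, not the actual configuration code in this MR. `IngestSource` and `source_from_env` are made-up names, and treating any non-empty METADATA_LOCAL_STORAGE value as "enabled" is an assumption (the example above only shows `1`).

```python
import os
from typing import Mapping, NamedTuple


class IngestSource(NamedTuple):
    kind: str      # "s3" or "local"
    location: str  # bucket name, or directory path when kind is "local"
    prefix: str    # DOI prefix to restrict ingestion, "" for everything


def source_from_env(env: Mapping[str, str] = os.environ) -> IngestSource:
    """Resolve where to ingest from, based on the variables shown above."""
    location = env.get("METADATA_BUCKET", "")
    if not location:
        raise ValueError("METADATA_BUCKET must be set")
    # Assumption: any non-empty value switches to local storage.
    kind = "local" if env.get("METADATA_LOCAL_STORAGE") else "s3"
    return IngestSource(kind, location, env.get("METADATA_DOI", ""))
```

Resolving the source once at startup keeps the ingest task itself unaware of whether it is reading from S3 or from disk.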

Note: METADATA_BUCKET must have been built using Metadata Bucket Builder or another tool that conforms to the spec.

I’ve not added much in the way of automated tests as part of this MR. Most of the functionality is about querying S3, and I don’t want to rely on that for testing. Using a mock felt futile, since all it would really test is indexing, and we already have many tests in that area.

I have updated the knowledge base to reflect the above, see here

I have done some benchmarking and the implementation included in this MR takes ~20 minutes to ingest 20,000 metadata items from S3.
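For reference, the benchmark figure above works out to roughly 1,000 items per minute. The short Python sketch below does that arithmetic. The helper names are hypothetical, and the extrapolation helper assumes throughput stays constant as the item count grows, which the benchmark does not establish.

```python
def items_per_second(items: int, minutes: float) -> float:
    """Average ingest rate implied by a benchmark run."""
    return items / (minutes * 60)


def estimated_minutes(items: int, rate_per_second: float) -> float:
    """Naive extrapolation; assumes the rate stays constant (an assumption)."""
    return items / rate_per_second / 60


rate = items_per_second(20_000, 20)
print(f"{rate:.1f} items/s")  # 16.7 items/s
```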

Edited by Mark Woodhall
