Enable indexing from S3 metadata bucket
This MR makes it possible to index metadata from a metadata bucket stored in S3.
The REST API can now be run with a profile that causes it to ingest metadata from S3, for example:
AWS_PROFILE=crossref-staging AWS_REGION=eu-west-1 METADATA_BUCKET=crossref-metadata-bucket-temp lein run :nrepl :api :s3-ingest
When run with s3-ingest, the REST API will start a task that pages through all data in METADATA_BUCKET and indexes it in Elasticsearch.
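For reviewers who want a mental model of the paging task, here is a rough Python sketch of the general pattern: request a page, hand each key to the indexer, then follow the continuation token until there are no more pages. It is not the code in this MR. `fake_list_page`, `iter_all_keys`, `ingest` and `PAGE_SIZE` are hypothetical names, and the in-memory list stands in for the bucket listing.

```python
from typing import Callable, Dict, Iterator, List, Optional

PAGE_SIZE = 2  # hypothetical, tiny on purpose so the example spans several pages


def fake_list_page(keys: List[str], token: Optional[int]) -> Dict:
    """Stand-in for a paged listing call: one page of keys plus a continuation token."""
    start = token or 0
    end = start + PAGE_SIZE
    # A token is only returned while there is more data to fetch.
    next_token = end if end < len(keys) else None
    return {"keys": keys[start:end], "next_token": next_token}


def iter_all_keys(list_page: Callable[[Optional[int]], Dict]) -> Iterator[str]:
    """Follow continuation tokens until the listing is exhausted."""
    token = None
    while True:
        page = list_page(token)
        yield from page["keys"]
        token = page["next_token"]
        if token is None:
            break


def ingest(keys: List[str], index_item: Callable[[str], None]) -> int:
    """Page through every key, pass each one to the indexer, and return the count."""
    count = 0
    for key in iter_all_keys(lambda token: fake_list_page(keys, token)):
        index_item(key)
        count += 1
    return count


# Usage: "index" five keys by appending them to a list.
indexed: List[str] = []
total = ingest(["a", "b", "c", "d", "e"], indexed.append)
```

In a real run the listing function would call S3 and `index_item` would write to Elasticsearch; both are replaced with in-memory stand-ins here so the sketch runs on its own.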
It is possible to index a subset of the data in METADATA_BUCKET by using METADATA_DOI. Here you can specify a prefix, for example:
AWS_PROFILE=crossref-staging AWS_REGION=eu-west-1 METADATA_BUCKET=crossref-metadata-bucket-temp METADATA_DOI=10.1145/253228.253255 lein run :nrepl :api :s3-ingest
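To illustrate what a prefix selects, here is a minimal Python sketch that assumes METADATA_DOI behaves as a plain string prefix on the DOI. That is my reading of the description above, not a statement of how keys are laid out in the bucket. `select_by_doi_prefix` is a hypothetical helper, and every DOI in `sample` other than the one from the command above is made up for illustration.

```python
from typing import Iterable, List


def select_by_doi_prefix(dois: Iterable[str], prefix: str) -> List[str]:
    """Keep only the DOIs that start with the prefix; an empty prefix keeps everything."""
    return [doi for doi in dois if doi.startswith(prefix)]


# Illustrative sample; only the first DOI appears in the example command.
sample = [
    "10.1145/253228.253255",
    "10.1145/253228.253256",
    "10.5555/12345678",
]

# A full DOI narrows the run to a single item.
exact = select_by_doi_prefix(sample, "10.1145/253228.253255")

# A shorter prefix selects everything under it.
by_prefix = select_by_doi_prefix(sample, "10.1145/")
```

Because this is a string prefix, "10.1145" without the trailing slash would also match a DOI beginning "10.11450/", so including the slash is the safer choice when targeting one DOI prefix.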
If you do not wish to rely on S3 then you can use a local directory, like so:
METADATA_BUCKET=/location/to/local/metadata METADATA_LOCAL_STORAGE=1 lein run :nrepl :api :s3-ingest
Note that METADATA_BUCKET must have been built using Metadata Bucket Builder or another tool that conforms to the spec.
I’ve not added much in the way of automated tests as part of this MR. Most of the functionality is about querying S3, and I don’t want to rely on that for testing. Using a mock felt futile, since all that would really be tested is indexing, and we already have many tests in that area.
I have updated the knowledge base to reflect the above, see here
I have done some benchmarking and the implementation included in this MR takes ~20 minutes to ingest 20,000 metadata items from S3.
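To put that figure in terms of throughput, here is a small Python calculation. The 1,000,000 item count is a hypothetical example, and the projection assumes the rate scales linearly, which I have not measured.

```python
def items_per_second(items: int, minutes: float) -> float:
    """Average ingest rate over a run."""
    return items / (minutes * 60)


def estimated_minutes(items: int, rate_per_second: float) -> float:
    """Projected run time at a constant rate (linear scaling is an assumption)."""
    return items / rate_per_second / 60


# Benchmark from this MR: ~20,000 items in ~20 minutes, roughly 16.7 items per second.
rate = items_per_second(20_000, 20)

# Hypothetical projection for a larger bucket at the same rate.
projected = estimated_minutes(1_000_000, rate)
```

At the benchmarked rate this works out to about 1,000 items per minute, so the hypothetical million-item bucket would take on the order of 1,000 minutes if nothing else changes.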