Move to ElasticSearch and drop Solr/MongoDB
WIP PR
Purpose
This pull request migrates away from Solr and MongoDB to ElasticSearch.
Highlights
- Solr and MongoDB have been removed in favour of ElasticSearch for all data storage. ElasticSearch indexes exist for all of the core data types:

      (def index-settings
        {"work"     {:number_of_shards 1 :number_of_replicas 3}
         "member"   {:number_of_shards 1 :number_of_replicas 3}
         "funder"   {:number_of_shards 1 :number_of_replicas 3}
         "subject"  {:number_of_shards 1 :number_of_replicas 3}
         "coverage" {:number_of_shards 1 :number_of_replicas 3}
         "journal"  {:number_of_shards 1 :number_of_replicas 3}})
- The configuration for `docker-compose` has been adjusted to start ElasticSearch, and all references to Solr and MongoDB have been removed.
- A new "corpus test" has been created; see `cayenne.corpus-test`. This test can run against a corpus of varying size and verifies that citation matching works within a known threshold. An almost identical version of this test is included in a PR against the Solr version so that a direct comparison can be made. I've attached a scoring comparison of citation matching below. A more complete comparison can be found here.

Original DOI | Matched DOI | Elastic | Solr |
---|---|---|---|
10.1002/erv.2485 | 10.1002/erv.2485 | 100.273125 | 90.04283 |
10.1002/jnr.23820 | 10.1002/jnr.23820 | 81.95269 | 61.502754 |
10.1002/jnr.23992 | 10.1002/jnr.23992 | 104.20477 | 88.88707 |
10.1002/nur.21773 | 10.1002/nur.21773 | 95.3643 | 85.251755 |
10.1007/s00125-016-4154-6 | 10.1007/s00125-016-4154-6 | 92.42219 | 83.01552 |
10.1007/s00213-016-4480-x | 10.1007/s00213-016-4480-x | 93.45497 | 76.72064 |
10.1007/s10964-016-0591-2 | 10.1007/s10964-016-0591-2 | 92.13747 | 82.10659 |
10.1007/s11302-016-9551-2 | 10.1007/s11302-016-9551-2 | 96.21273 | 75.00612 |
10.1007/s11682-016-9638-y | 10.1007/s11682-016-9638-y | 100.7581 | 86.417206 |
10.1007/s13318-016-0388-4 | 10.1007/s13318-016-0388-4 | 120.53948 | 105.28446 |
10.1016/j.alcohol.2016.08.008 | 10.1016/j.alcohol.2016.08.008 | 90.90022 | 78.548706 |
10.1016/j.bbi.2016.10.007 | 10.1016/j.bbi.2016.10.007 | 91.129654 | 84.56702 |
10.1016/j.bbr.2016.10.035 | 10.1016/j.bbr.2016.10.035 | 101.12494 | 90.45204 |
10.1016/j.biopsycho.2016.12.010 | 10.1016/j.biopsycho.2016.12.010 | 88.34703 | 75.23803 |
10.1016/j.bmc.2016.10.035 | 10.1016/j.bmc.2016.10.035 | 110.07812 | 94.04622 |
10.1016/j.explore.2016.10.009 | 10.1016/j.explore.2016.10.009 | 85.96247 | 69.95195 |
10.1016/j.infbeh.2016.09.006 | 10.1016/j.infbeh.2016.09.006 | 100.5378 | 86.74484 |
10.1016/j.jad.2016.10.035 | 10.1016/j.jad.2016.10.035 | 61.279423 | 53.282127 |
10.1016/j.jad.2016.11.036 | 10.1016/j.jad.2016.11.036 | 90.15741 | 81.84434 |
10.1016/j.jad.2016.11.046 | 10.1016/j.jad.2016.11.046 | 123.41971 | 103.81766 |
10.1016/j.joms.2016.10.033 | 10.1016/j.joms.2016.10.033 | 85.54165 | 76.20327 |
10.1016/j.neubiorev.2016.12.003 | 10.1016/j.neubiorev.2016.12.003 | 75.98824 | 72.51518 |
10.1016/j.neubiorev.2016.12.006 | 10.1016/j.neubiorev.2016.12.006 | 117.57448 | 91.75167 |
10.1016/j.neubiorev.2016.12.013 | 10.1016/j.neubiorev.2016.12.013 | 97.39974 | 87.65282 |
10.1016/j.neulet.2016.11.064 | 10.1016/j.neulet.2016.11.064 | 108.19604 | 93.43173 |
10.1016/j.neuro.2016.11.006 | 10.1016/j.neuro.2016.11.006 | 97.35805 | 82.715225 |
10.1016/j.neurobiolaging.2016.11.014 | 10.1016/j.neurobiolaging.2016.11.014 | 100.97907 | 89.42957 |
10.1016/j.neuroimage.2016.12.046 | 10.1016/j.neuroimage.2016.12.046 | 85.69545 | 74.715836 |
10.1016/j.neuron.2016.09.039 | 10.1016/j.neuron.2016.09.039 | 67.21429 | 61.09139 |
10.1016/j.nicl.2016.11.014 | 10.1016/j.nicl.2016.11.014 | 86.846924 | 76.55675 |
10.1016/j.nlm.2016.10.006 | 10.1016/j.nlm.2016.10.006 | 106.93617 | 95.132774 |
10.1016/j.nlm.2016.11.008 | 10.1016/j.nlm.2016.11.008 | 65.49764 | 58.382977 |
10.1016/j.peptides.2016.11.001 | 10.1016/j.peptides.2016.11.001 | 97.45492 | 82.56204 |
10.1016/j.physbeh.2016.10.010 | 10.1016/j.physbeh.2016.10.010 | 87.95236 | 70.85148 |
10.1016/j.physbeh.2016.11.030 | 10.1016/j.physbeh.2016.11.030 | 99.73418 | 85.511086 |
10.1016/j.physbeh.2016.12.004 | 10.1016/j.physbeh.2016.12.004 | 116.00987 | 93.99558 |
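
As a rough illustration of the kind of check a corpus test performs, the sketch below computes a match rate over citation-matching results and compares it with a threshold. It is not the code in `cayenne.corpus-test`: the data shape, the function names `matched?`, `match-rate` and `within-threshold?`, the sample miss and the thresholds are all assumptions made for the example.

```clojure
(require '[clojure.string :as str])

;; Hypothetical result shape: the DOI a citation was generated from, paired
;; with the DOI the matcher returned (nil when nothing was matched).
(def sample-results
  [{:original "10.1002/erv.2485"         :matched "10.1002/erv.2485"}
   {:original "10.1002/jnr.23820"        :matched "10.1002/jnr.23820"}
   {:original "10.5555/hypothetical-miss" :matched nil}]) ; made-up miss

(defn matched?
  "True when a DOI was returned and it equals the original, ignoring case."
  [{:keys [original matched]}]
  (and (some? matched)
       (= (str/lower-case original) (str/lower-case matched))))

(defn match-rate
  "Fraction of results that were matched back to their original DOI."
  [results]
  (if (empty? results)
    0.0
    (/ (count (filter matched? results))
       (double (count results)))))

(defn within-threshold?
  "True when the match rate is at least `threshold` (a number in 0..1)."
  [results threshold]
  (>= (match-rate results) threshold))
```

Running the same check against both backends with the same corpus and threshold, for example `(within-threshold? sample-results 0.6)`, is one way a like-for-like comparison could be made.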
- There has been a lot of code clean-up, most of it done in the early phase of the elastic branch. A few removals worth mentioning:
  - OAI harvester
  - Datomic-backed graph API
  - HTML landing page interrogation
  - Datacite XML parser
  - DOI metadata quality checker
  - Web of Knowledge parser
  - Resolution URL checker
  - Citation analysis
  - DOAJ code
  - Old patent deposit code (now handled by event data)
  - Deposits API
  - /licenses route (in favour of license facet)
  - Old code for citation checking
- Index settings are configured to closely match Solr. In particular, the number of shards used by the work index matches the Solr production deployment:

      (def index-settings {"work" {:number_of_shards 1 :number_of_replicas 3} ...})

  There is scope to change this in the future, but it is worth keeping in mind that scoring is shard-local, so the number of shards directly affects scoring. In theory this should even out over a large enough corpus.
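
To make the shard-local scoring point concrete, here is a minimal sketch of how the inverse document frequency (IDF) part of a score can differ between shards. It assumes a BM25-style IDF formula; which similarity this deployment actually uses is an assumption, and the term statistics and the names `idf` and `shard-idf` are invented for the example.

```clojure
(defn idf
  "BM25-style inverse document frequency for a term that appears in
  `doc-freq` of `doc-count` documents."
  [doc-count doc-freq]
  (Math/log (+ 1.0 (/ (+ (- doc-count doc-freq) 0.5)
                      (+ doc-freq 0.5)))))

;; Hypothetical statistics for one term: the same 1000 documents held in a
;; single shard, or split across two shards with the term unevenly spread.
(def one-shard  {:doc-count 1000 :doc-freq 10})
(def two-shards [{:doc-count 500 :doc-freq 9}
                 {:doc-count 500 :doc-freq 1}])

(defn shard-idf
  "IDF as computed from one shard's local term statistics."
  [{:keys [doc-count doc-freq]}]
  (idf doc-count doc-freq))

(shard-idf one-shard)             ; one weight for every document
(mapv shard-idf two-shards)       ; a different weight on each shard
```

With a single shard every document is scored against the same statistics, which is why matching the shard count of the Solr deployment keeps the two sets of scores easier to compare. As the corpus grows and terms spread more evenly across shards, the per-shard values would be expected to converge.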
Index Structures
Much of the underlying index structure was already in place in the elastic branch. I have only changed this structure where doing so fixed an issue.
- Change `year` to be non-numeric here. The reasons for this are explained in the commit message.
- I also ported the mappings required for new master features, e.g. peer reviews and ISBN types.
Concerns
The changes in this PR are somewhat wider-ranging than just swapping in ElasticSearch. As the highlights above show, there has been a general clean-up and removal of "old code". A large portion of the functionality is proven by the existing high-level automated tests passing; however, there may be untested areas that will require testing after deployment.