
Move to ElasticSearch and drop Solr/MongoDB

Mark Woodhall requested to merge elastic into develop

WIP PR

Purpose

This pull request migrates away from Solr and MongoDB to ElasticSearch.

Highlights

  1. Solr and MongoDB have been removed in favour of ElasticSearch for all data storage. ElasticSearch indexes exist for all of the core data types:

    (def index-settings
      {"work"     {:number_of_shards 1  :number_of_replicas 3}
       "member"   {:number_of_shards 1  :number_of_replicas 3}
       "funder"   {:number_of_shards 1  :number_of_replicas 3}
       "subject"  {:number_of_shards 1  :number_of_replicas 3}
       "coverage" {:number_of_shards 1  :number_of_replicas 3}
       "journal"  {:number_of_shards 1  :number_of_replicas 3}})
  2. The docker-compose configuration now starts ElasticSearch, and all references to Solr and MongoDB have been removed.

  3. A new "corpus test" has been added; see cayenne.corpus-test. It can run against a corpus of varying size and verifies that citation matching works within a known threshold. An almost identical version of this test is included in a PR against the Solr version, so the two can be compared directly. A scoring comparison of citation matching is shown below. A more complete comparison can be found here

| Original DOI | Matched DOI | Elastic score | Solr score |
|---|---|---|---|
| 10.1002/erv.2485 | 10.1002/erv.2485 | 100.273125 | 90.04283 |
| 10.1002/jnr.23820 | 10.1002/jnr.23820 | 81.95269 | 61.502754 |
| 10.1002/jnr.23992 | 10.1002/jnr.23992 | 104.20477 | 88.88707 |
| 10.1002/nur.21773 | 10.1002/nur.21773 | 95.3643 | 85.251755 |
| 10.1007/s00125-016-4154-6 | 10.1007/s00125-016-4154-6 | 92.42219 | 83.01552 |
| 10.1007/s00213-016-4480-x | 10.1007/s00213-016-4480-x | 93.45497 | 76.72064 |
| 10.1007/s10964-016-0591-2 | 10.1007/s10964-016-0591-2 | 92.13747 | 82.10659 |
| 10.1007/s11302-016-9551-2 | 10.1007/s11302-016-9551-2 | 96.21273 | 75.00612 |
| 10.1007/s11682-016-9638-y | 10.1007/s11682-016-9638-y | 100.7581 | 86.417206 |
| 10.1007/s13318-016-0388-4 | 10.1007/s13318-016-0388-4 | 120.53948 | 105.28446 |
| 10.1016/j.alcohol.2016.08.008 | 10.1016/j.alcohol.2016.08.008 | 90.90022 | 78.548706 |
| 10.1016/j.bbi.2016.10.007 | 10.1016/j.bbi.2016.10.007 | 91.129654 | 84.56702 |
| 10.1016/j.bbr.2016.10.035 | 10.1016/j.bbr.2016.10.035 | 101.12494 | 90.45204 |
| 10.1016/j.biopsycho.2016.12.010 | 10.1016/j.biopsycho.2016.12.010 | 88.34703 | 75.23803 |
| 10.1016/j.bmc.2016.10.035 | 10.1016/j.bmc.2016.10.035 | 110.07812 | 94.04622 |
| 10.1016/j.explore.2016.10.009 | 10.1016/j.explore.2016.10.009 | 85.96247 | 69.95195 |
| 10.1016/j.infbeh.2016.09.006 | 10.1016/j.infbeh.2016.09.006 | 100.5378 | 86.74484 |
| 10.1016/j.jad.2016.10.035 | 10.1016/j.jad.2016.10.035 | 61.279423 | 53.282127 |
| 10.1016/j.jad.2016.11.036 | 10.1016/j.jad.2016.11.036 | 90.15741 | 81.84434 |
| 10.1016/j.jad.2016.11.046 | 10.1016/j.jad.2016.11.046 | 123.41971 | 103.81766 |
| 10.1016/j.joms.2016.10.033 | 10.1016/j.joms.2016.10.033 | 85.54165 | 76.20327 |
| 10.1016/j.neubiorev.2016.12.003 | 10.1016/j.neubiorev.2016.12.003 | 75.98824 | 72.51518 |
| 10.1016/j.neubiorev.2016.12.006 | 10.1016/j.neubiorev.2016.12.006 | 117.57448 | 91.75167 |
| 10.1016/j.neubiorev.2016.12.013 | 10.1016/j.neubiorev.2016.12.013 | 97.39974 | 87.65282 |
| 10.1016/j.neulet.2016.11.064 | 10.1016/j.neulet.2016.11.064 | 108.19604 | 93.43173 |
| 10.1016/j.neuro.2016.11.006 | 10.1016/j.neuro.2016.11.006 | 97.35805 | 82.715225 |
| 10.1016/j.neurobiolaging.2016.11.014 | 10.1016/j.neurobiolaging.2016.11.014 | 100.97907 | 89.42957 |
| 10.1016/j.neuroimage.2016.12.046 | 10.1016/j.neuroimage.2016.12.046 | 85.69545 | 74.715836 |
| 10.1016/j.neuron.2016.09.039 | 10.1016/j.neuron.2016.09.039 | 67.21429 | 61.09139 |
| 10.1016/j.nicl.2016.11.014 | 10.1016/j.nicl.2016.11.014 | 86.846924 | 76.55675 |
| 10.1016/j.nlm.2016.10.006 | 10.1016/j.nlm.2016.10.006 | 106.93617 | 95.132774 |
| 10.1016/j.nlm.2016.11.008 | 10.1016/j.nlm.2016.11.008 | 65.49764 | 58.382977 |
| 10.1016/j.peptides.2016.11.001 | 10.1016/j.peptides.2016.11.001 | 97.45492 | 82.56204 |
| 10.1016/j.physbeh.2016.10.010 | 10.1016/j.physbeh.2016.10.010 | 87.95236 | 70.85148 |
| 10.1016/j.physbeh.2016.11.030 | 10.1016/j.physbeh.2016.11.030 | 99.73418 | 85.511086 |
| 10.1016/j.physbeh.2016.12.004 | 10.1016/j.physbeh.2016.12.004 | 116.00987 | 93.99558 |

  4. A lot of code clean-up has taken place, most of it in the early phase of the elastic branch. Notable removals:
  • OAI harvester
  • Datomic-backed graph API
  • HTML landing page interrogation
  • Datacite XML parser
  • DOI metadata quality checker
  • Web of Knowledge parser
  • Resolution URL checker
  • Citation analysis
  • DOAJ code
  • Old patent deposit code (now handled by event data)
  • Deposits API
  • /licenses route (in favour of license facet)
  • Old code for citation checking
  5. Index settings are configured to closely match Solr. In particular, the number of shards used by the work index matches the Solr production deployment:

    (def index-settings
      {"work"     {:number_of_shards 1  :number_of_replicas 3}
       ...})

    There is scope to change this in the future, but keep in mind that scoring is shard-local, so the number of shards directly affects scores. In theory this should even out over a large enough corpus.
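To make the shard-local point concrete, here is a minimal sketch. It assumes the BM25 similarity (Lucene's IDF formula) and uses invented document counts; none of the names below come from the cayenne code base. The same term gets a different IDF weight depending on how its occurrences happen to fall across shards:

```clojure
(defn bm25-idf
  "Lucene-style BM25 inverse document frequency for a term that occurs in
  doc-freq of the doc-count documents held by a single shard."
  [doc-count doc-freq]
  (Math/log (+ 1.0 (/ (+ (- doc-count doc-freq) 0.5)
                      (+ doc-freq 0.5)))))

;; Hypothetical term statistics; the numbers are made up for illustration.
(def single-shard {:doc-count 1000 :doc-freq 10}) ; everything on one shard
(def shard-a      {:doc-count 500  :doc-freq 9})  ; term is common on this shard
(def shard-b      {:doc-count 500  :doc-freq 1})  ; term is rare on this shard

(defn shard-idf
  "IDF as computed from one shard's local statistics."
  [{:keys [doc-count doc-freq]}]
  (bm25-idf doc-count doc-freq))

(def idfs
  {:single-shard (shard-idf single-shard)
   :shard-a      (shard-idf shard-a)
   :shard-b      (shard-idf shard-b)})
```

With these numbers the term weighs least on shard-a, most on shard-b, and in between on the single shard. The skew comes from how evenly the term is distributed across shards, which is why a large corpus tends to smooth it out.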

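The threshold check performed by the corpus test in highlight 3 can be pictured roughly as follows. This is a sketch only: the real cayenne.corpus-test is not shown in this description, so the function names, the row shape and the 0.9 threshold are all assumptions. The sample rows are the first three rows of the comparison table, with the Elastic score.

```clojure
(defn match-rate
  "Fraction of rows whose matched DOI equals the original DOI."
  [rows]
  (if (empty? rows)
    0.0
    (/ (double (count (filter #(= (:original %) (:matched %)) rows)))
       (count rows))))

(defn within-threshold?
  "True when at least `threshold` (a fraction between 0 and 1) of the
  citations were matched back to their original DOI."
  [rows threshold]
  (>= (match-rate rows) threshold))

;; First three rows of the comparison table; :score is the Elastic score.
(def sample-rows
  [{:original "10.1002/erv.2485"  :matched "10.1002/erv.2485"  :score 100.273125}
   {:original "10.1002/jnr.23820" :matched "10.1002/jnr.23820" :score 81.95269}
   {:original "10.1002/jnr.23992" :matched "10.1002/jnr.23992" :score 104.20477}])

;; 0.9 is a hypothetical threshold, not the value used by the real test.
(def sample-ok? (within-threshold? sample-rows 0.9))
```

Running the same check against rows produced by the Solr version would give the direct comparison mentioned above.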
Index Structures

Much of the underlying index structure was already in place in the elastic branch. I have only changed it where doing so fixed an issue.

  1. Changed year to be non-numeric here. The reasons are explained in the commit message.
  2. Ported the mappings required for new master features, e.g. peer reviews and ISBN types.
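As an illustration only: the mapping itself is not shown in this description, so the field types below are assumptions rather than what the commit actually uses. A numeric and a non-numeric mapping for year could differ like this:

```clojure
;; Hypothetical mapping fragments; the field types are assumptions and are
;; not taken from the PR or its commit message.
(def numeric-year-mapping
  {:properties {:year {:type "integer"}}})

(def non-numeric-year-mapping
  {:properties {:year {:type "keyword"}}})

(defn numeric-field?
  "True when `field` is mapped to one of the basic ElasticSearch numeric types."
  [mapping field]
  (contains? #{"byte" "short" "integer" "long" "float" "double"}
             (get-in mapping [:properties field :type])))
```

A helper like numeric-field? could be used in a test to guard against the field drifting back to a numeric type.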

Concerns

The changes in this PR range somewhat wider than just swapping in ElasticSearch: as the highlights above show, there has been a general clean-up and removal of "old code". Much of the functionality is covered by the existing high-level automated tests, which pass. However, there may be untested areas that will require testing after deployment.
