Refactor REST API sample feature
Background
The current sampling approach in the REST API has Elasticsearch (ES) assign a random score to each work (DOI record), sort the works by that score, and return the top N. This is repeated for every sampling request.
This approach is highly inefficient, with sampling requests taking as long as 8-11 seconds in production and 15-16 seconds in staging.
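For reference, a random-score sample in ES is typically expressed with a `function_score` query. The sketch below builds such a request body; it is an illustration only, since the exact query the REST API sends is not shown in this ticket, and the helper name `random_score_sample_body` is hypothetical.

```python
from typing import Any, Dict, Optional


def random_score_sample_body(
    n: int, user_query: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build an ES search body that returns n randomly scored works.

    Assumption: this mirrors the general shape of random-score sampling,
    not necessarily the query the REST API uses today.
    """
    return {
        "size": n,  # keep only the top n hits after scoring
        "query": {
            "function_score": {
                # Restrict to the user's query, or match everything.
                "query": user_query or {"match_all": {}},
                # ES assigns a random score to every matching document.
                "random_score": {},
                # Use the random score alone, ignoring query relevance.
                "boost_mode": "replace",
            }
        },
    }
```

Because every matching document has to be scored before the top N can be picked, the cost of a query like this grows with the size of the matching set, which is consistent with the slow request times reported above.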
Dominika has proposed replacing the current approach with a different one, based on sampling citation-ids. Citation-ids are natural numbers assigned to DOIs (all those in /works) in CS. In the proposed approach, we first sample citation-ids using a random number generator, and then retrieve the works from ES using the sampled citation-ids.
The proposed approach to handling a request /works?sample=N&... is as follows:
1. Get the upper bound of the citation-id range (from cache or ES).
2. Sample max(5, 2*N) numbers from the range 1 to the upper bound (twice as many as the user asked for, but no fewer than 5).
3. Retrieve from ES all works that match the sampled citation-ids and pass any user-defined filters or queries (this can be done in a single request).
4. If the retrieved set contains fewer than N works, repeat steps 2 and 3 two more times, and add the results to the set.
5. If the work set retrieved so far:
   - contains fewer than N works, revert to the original sampling using ES random scoring.
   - contains more than N works, sample N of them and return.
   - contains exactly N works, return it.
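The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: `fetch_by_ids` and `fallback` are hypothetical callables standing in for the ES requests, the field name `citation-id` is assumed, ids are sampled without replacement, works are de-duplicated by citation-id across rounds, and the loop stops early once N works have been collected.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Sequence

Work = Dict[str, Any]
ID_FIELD = "citation-id"  # assumed field name; the real mapping may differ


def ids_query_body(ids: Sequence[int], user_filters: Sequence[dict] = ()) -> dict:
    """One ES request: works matching the sampled ids AND the user's filters."""
    return {
        "size": len(ids),
        "query": {"bool": {"filter": [{"terms": {ID_FIELD: list(ids)}}, *user_filters]}},
    }


def sample_works(
    n: int,
    upper_bound: int,
    fetch_by_ids: Callable[[Sequence[int]], List[Work]],  # hypothetical ES lookup
    fallback: Callable[[int], List[Work]],  # hypothetical random-score sampling
    rng: Optional[random.Random] = None,
    max_rounds: int = 3,  # the first attempt plus two repeats
) -> List[Work]:
    rng = rng or random.Random()
    found: Dict[int, Work] = {}  # keyed by citation-id so repeats are dropped
    for _ in range(max_rounds):
        # Twice as many ids as requested, but no fewer than 5
        # (capped at the size of the id range).
        k = min(max(5, 2 * n), upper_bound)
        ids = rng.sample(range(1, upper_bound + 1), k)
        for work in fetch_by_ids(ids):
            found[work[ID_FIELD]] = work
        if len(found) >= n:
            break  # enough works collected, no further rounds needed
    works = list(found.values())
    if len(works) < n:
        return fallback(n)  # revert to the original ES random scoring
    if len(works) > n:
        return rng.sample(works, n)  # trim the surplus at random
    return works
```

With restrictive filters most sampled ids will not match, so the fallback path is expected to be taken more often for narrow queries; the timing measurements listed in the acceptance criteria should show how often that happens in practice.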
More details here: https://docs.google.com/document/d/1bwmW4chAneRcVRWiiNYzLM-DY_GrPmzTRvcIcd8Q0ks/edit#
How urgent
High priority given risks to REST API stability.
Definition of ready
- Product owner: @ppolischuk1
- Tech lead: @dtkaczyk
- Service:: label applied
- Definition of done updated
- Acceptance testing plan:
- Weight applied
Definition of done
- Unit tests identified, implemented, and passing
- Code reviewed
- Available for acceptance testing via a staging URL, or otherwise
- Consider any impacts to current or future architecture/infrastructure, and update specifications and documentation as needed
- Knowledge base reviewed and updated
- Public documentation reviewed and updated
- Acceptance criteria met
  - Proposed sampling approach implemented
  - Measure and report the new request times for sampling requests after the change, in staging and production
- Acceptance testing passed
- Deployed to production