npm feeder job exceeds job timeout

Problem

The npm feeder is hitting the 3-hour job timeout, so we're currently unable to update our license database. This happens because every run of the job starts from the beginning and fetches all the required splits again. By contrast, the Go feeder uses a cursor persisted in cloud storage to carry on from where it left off.

Failing Job

run feeder

Proposed Solutions

Persist the last round's splits

Implementation Plan 1

Save the splits from the latest round to the cursor
Instead of starting from scratch, start from the last saved splits. Their offsets need to be fetched again, as they might have moved
Sub-split as needed
Persist the results of the new sub-splits
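
The plan above could be sketched roughly as follows. This is only an illustration: the Split and Cursor types, their field names, and the fetch callback are assumptions, not the feeder's actual types, and an in-memory buffer stands in for the cloud storage object.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// Split is a hypothetical key range of the registry together with the
// offset last observed for its start key.
type Split struct {
	StartKey string `json:"start_key"`
	EndKey   string `json:"end_key"`
	Offset   int    `json:"offset"`
}

// Cursor is the hypothetical state persisted between job runs.
type Cursor struct {
	Splits []Split `json:"splits"`
}

// SaveCursor writes the cursor as JSON to w.
func SaveCursor(w io.Writer, c Cursor) error {
	return json.NewEncoder(w).Encode(c)
}

// LoadCursor reads a cursor previously written by SaveCursor.
func LoadCursor(r io.Reader) (Cursor, error) {
	var c Cursor
	err := json.NewDecoder(r).Decode(&c)
	return c, err
}

// roundTrip saves and reloads a cursor through an in-memory buffer,
// which stands in for the cloud storage object.
func roundTrip(c Cursor) (Cursor, error) {
	var store bytes.Buffer
	if err := SaveCursor(&store, c); err != nil {
		return Cursor{}, err
	}
	return LoadCursor(&store)
}

// RefreshOffsets re-fetches the offset of every saved split, because
// offsets may have moved since the cursor was written.
func RefreshOffsets(splits []Split, fetch func(startKey string) (int, error)) ([]Split, error) {
	out := make([]Split, len(splits))
	for i, s := range splits {
		off, err := fetch(s.StartKey)
		if err != nil {
			return nil, fmt.Errorf("refetch offset for %q: %w", s.StartKey, err)
		}
		s.Offset = off
		out[i] = s
	}
	return out, nil
}

func main() {
	saved := Cursor{Splits: []Split{
		{StartKey: "a", EndKey: "m", Offset: 0},
		{StartKey: "m", EndKey: "", Offset: 100},
	}}
	loaded, err := roundTrip(saved)
	if err != nil {
		panic(err)
	}
	// A fake lookup that pretends every offset moved forward by 5.
	known := map[string]int{"a": 5, "m": 105}
	refreshed, err := RefreshOffsets(loaded.Splits, func(k string) (int, error) {
		return known[k], nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(refreshed)
}
```

Sub-splitting is left out of the sketch; any new sub-splits would be appended to the cursor and saved the same way.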

Please note: we worked through this solution but ran into an issue with rate-limiting. An extension to the solution was needed.

Implementation Plan 2

Iteration 1

Create a new compute instance in ext-license-db-dev-d6ba6f35 in the us-east1 region
Create an administrator account and set a password
Set up CouchDB
Create a new database called license-db-npm-mirror
Start the replication process using https://replicate.npmjs.com/registry as the source and license-db-npm-mirror as the target. Select continuous as the type of replication
Monitor progress and ensure it completes
Create a disk image backup
~~Set custom hostname for the instance to avoid relying on an ephemeral ip address~~
Perform some manual QA to validate that removing rate-limiting moves us closer to our goal
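
For reference, the replication step can also be started over CouchDB's HTTP API instead of the web UI. The sketch below only builds the request that would create a continuous replication document in the _replicator database; the instance address, admin credentials, and function names are placeholders, and depending on the CouchDB version the target URL may need credentials of its own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// ReplicationDoc holds the replication document fields used in this sketch.
type ReplicationDoc struct {
	Source     string `json:"source"`
	Target     string `json:"target"`
	Continuous bool   `json:"continuous"`
}

// NewReplicationDoc describes a continuous replication from the public npm
// replica into the license-db-npm-mirror database on the given CouchDB.
func NewReplicationDoc(couchURL string) ReplicationDoc {
	return ReplicationDoc{
		Source:     "https://replicate.npmjs.com/registry",
		Target:     strings.TrimRight(couchURL, "/") + "/license-db-npm-mirror",
		Continuous: true,
	}
}

// NewReplicationRequest builds a POST to the _replicator database,
// authenticated as the administrator account with basic auth.
func NewReplicationRequest(couchURL, user, password string) (*http.Request, error) {
	body, err := json.Marshal(NewReplicationDoc(couchURL))
	if err != nil {
		return nil, err
	}
	endpoint := strings.TrimRight(couchURL, "/") + "/_replicator"
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.SetBasicAuth(user, password)
	return req, nil
}

func main() {
	// Placeholder address and credentials; 5984 is CouchDB's default port.
	req, err := NewReplicationRequest("http://127.0.0.1:5984", "admin", "change-me")
	if err != nil {
		panic(err)
	}
	// The request is only printed here; sending it would be
	// http.DefaultClient.Do(req).
	fmt.Println(req.Method, req.URL)
}
```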

Iteration 2

Submit "Use errgroup to re-fetch NPM split offsets" for review
Make the npm registry URL configurable in feeder and add support for basic authentication
Make the npm registry URL configurable in interfacer and add support for basic authentication
Release feeder and interfacer
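
A minimal sketch of what a configurable registry URL with basic authentication might look like. The environment variable names, the RegistryConfig type, and the default URL are assumptions made for illustration; they are not taken from the feeder or interfacer code.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

// defaultRegistryURL is an assumed fallback for when no override is set.
const defaultRegistryURL = "https://replicate.npmjs.com/registry"

// RegistryConfig is a hypothetical settings struct for the npm registry.
type RegistryConfig struct {
	BaseURL  string
	Username string
	Password string
}

// ConfigFromEnv reads the hypothetical NPM_REGISTRY_* variables through
// getenv (os.Getenv in production, a fake in tests).
func ConfigFromEnv(getenv func(string) string) RegistryConfig {
	cfg := RegistryConfig{
		BaseURL:  getenv("NPM_REGISTRY_URL"),
		Username: getenv("NPM_REGISTRY_USERNAME"),
		Password: getenv("NPM_REGISTRY_PASSWORD"),
	}
	if cfg.BaseURL == "" {
		cfg.BaseURL = defaultRegistryURL
	}
	return cfg
}

// NewRequest builds a GET request for path under the configured registry,
// adding basic authentication only when a username is configured.
func (c RegistryConfig) NewRequest(path string) (*http.Request, error) {
	url := strings.TrimRight(c.BaseURL, "/") + "/" + strings.TrimLeft(path, "/")
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if c.Username != "" {
		req.SetBasicAuth(c.Username, c.Password)
	}
	return req, nil
}

func main() {
	cfg := ConfigFromEnv(os.Getenv)
	req, err := cfg.NewRequest("_all_docs")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL)
}
```

Pointing the hypothetical NPM_REGISTRY_URL at the license-db-npm-mirror database would route requests to the mirror, while leaving it unset keeps the default.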

Iteration 3

Review infrastructure changes
Align with existing approaches to security
~~Add Terraform code to provision CouchDB and associated infrastructure~~

Edited May 22, 2023 by Philip Cunningham
