Page Indexing using scraping
We have a new way of indexing the municipality pages instead of using the browser registration. The new scrapy component starts a web crawler that indexes an entire website and can send out events for all pages, or only for the pages modified since the last indexing run (delta).
The crawl results should create the registrations through the current API (page-metadata), and the result should be displayed on the settings page in the CMS:
It looks like the page-metadata currently supports:
- listing: `<slug:gemeente_slug>/page-metadata/`
- creation: `<slug:gemeente_slug>/page-metadata/create/`
- read: `<slug:gemeente_slug>/page-metadata/<int:page_metadata_pk>/`
- update: `<slug:gemeente_slug>/page-metadata/<int:page_metadata_pk>/update`
- delete: `<slug:gemeente_slug>/page-metadata/<int:page_metadata_pk>/delete`
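To make the route scheme above concrete, here is a small stand-alone helper that fills in the path templates. The dict keys, function name, and the plain-`{}` placeholders are illustrative only; the real routes live in the CMS's Django URL configuration, not in this sketch.

```python
# Hypothetical helper mirroring the page-metadata routes listed above.
# Placeholders use str.format() syntax instead of Django's <slug:...>/<int:...>
# converters so the example runs without Django.
PAGE_METADATA_ROUTES = {
    "list":   "{gemeente_slug}/page-metadata/",
    "create": "{gemeente_slug}/page-metadata/create/",
    "read":   "{gemeente_slug}/page-metadata/{page_metadata_pk}/",
    "update": "{gemeente_slug}/page-metadata/{page_metadata_pk}/update",
    "delete": "{gemeente_slug}/page-metadata/{page_metadata_pk}/delete",
}

def page_metadata_url(action, gemeente_slug, page_metadata_pk=None):
    """Fill in the route template for one municipality and, where needed, a pk."""
    return PAGE_METADATA_ROUTES[action].format(
        gemeente_slug=gemeente_slug, page_metadata_pk=page_metadata_pk
    )
```

For example, `page_metadata_url("update", "meierijstad", 7)` yields `meierijstad/page-metadata/7/update`.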
However, this might not be reflected in the API, which possibly only supports create, as defined in `antwoorden_cms/utters/api/urls.py`.
Since our starting position is an index of pages based on the widget registry, the following behavior should be realized:
After a full index (mode=full) the following should take place:
- all pages that are currently registered but were not found in the index should be removed, except those with specific settings, i.e. settings not equal to the defaults for that organisation. These should instead be marked as 'deleted' so that we can still display them: the icon in front of the page title is replaced by a clickable exclamation mark, and only the delete button is shown (no edit button)
- all pages that were found and already exist should have only their page title updated (no changes to the settings)
- all pages that were found and did not already exist should be added as new
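The full-index reconciliation above can be sketched as a pure function. The data shapes (URL-keyed dicts, a `has_custom_settings` predicate) are assumptions for illustration; the real CMS works with model instances.

```python
def reconcile_full_index(crawled, registered, has_custom_settings):
    """Reconcile a full crawl with the currently registered pages.

    crawled:    dict url -> page title found by the crawler
    registered: dict url -> registration record (here a dict with 'title')
    has_custom_settings: predicate: does a registration deviate from the
                         organisation's default settings?
    All names are illustrative; the real CMS models will differ.
    """
    to_delete, to_mark_deleted, to_update, to_create = [], [], {}, {}

    for url, record in registered.items():
        if url not in crawled:
            if has_custom_settings(record):
                to_mark_deleted.append(url)  # keep, but flag as 'deleted' in the UI
            else:
                to_delete.append(url)        # default settings: safe to remove

    for url, title in crawled.items():
        if url in registered:
            if registered[url]["title"] != title:
                to_update[url] = title       # title only; settings untouched
        else:
            to_create[url] = title           # newly discovered page

    return to_delete, to_mark_deleted, to_update, to_create
```

Keeping this step side-effect free makes the three rules above easy to unit-test before wiring them to the page-metadata API.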
With a partial index (mode=delta) the following should take place:
- any page that is removed and has default settings is deleted; if the page has specific settings it should be marked as 'deleted'
- any page that is added is added as new page
- any page that is updated should have its page title updated.
Additionally, a cron job should be created that schedules a CMS task which calls the scrapy component for all organisations and all domains, with the following parameters in an HTTP form POST (content-type `application/x-www-form-urlencoded`):
- project=default
- spider=domain
- domain=[domain e.g. www.meierijstad.nl]
- organisation=[organisation slug as specified in CMS e.g. meierijstad]
- environment=[test | production]
A remark on the last parameter: we need to add an extra field to the domains used for an organisation/municipality, so that we can distinguish production website domains from non-production website domains.
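As a sketch of what the scheduled task would send per organisation/domain pair, the snippet below builds the form-encoded POST with the parameters listed above. The endpoint URL is an assumption (the `project`/`spider` parameters resemble scrapyd's `schedule.json`, but the spec does not name the endpoint), so it is passed in by the caller.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_schedule_request(endpoint_url, domain, organisation, environment):
    """Build the form POST for the scrapy component (illustrative sketch).

    endpoint_url is an assumption: the actual scrapy component URL is not
    specified here. The payload fields come directly from the spec above.
    """
    payload = urlencode({
        "project": "default",
        "spider": "domain",
        "domain": domain,              # e.g. www.meierijstad.nl
        "organisation": organisation,  # organisation slug from the CMS
        "environment": environment,    # "test" or "production"
    })
    return Request(
        endpoint_url,
        data=payload.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

The cron task would loop over all organisations and their domains, call `build_schedule_request(...)` for each, and send it with `urllib.request.urlopen` (or an HTTP client of choice), using the new production/non-production field to fill in `environment`.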