Worker Design & Scaling

Workers listen on a single Cloud PubSub topic. When a message arrives, it is dispatched (routed) to a handler. Each handler performs a specific work item, such as ingesting the vuln-list data source or reviewing new CVEs to determine whether we have a mapping.

The PubSub topic acts as the work queue. A handler acks a message when it can process the message or has already processed it.

One worker instance is designated as the scheduler, in addition to being a worker. See the Scheduler topic for more information.

For the initial implementation, Pull Subscriptions are recommended because they are simpler to implement and deploy. In the longer term, moving to Push Subscriptions is recommended. Push Subscriptions are more complicated: they require accepting HTTPS connections, JWT authentication, and valid TLS certificates.

Example high-level program flow:

Initialize and register all handlers
If designated scheduler, start the scheduler
While True
- Poll PubSub
- If Message
  - Route to handler
  - Handler processes and acks message
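
The flow above can be sketched in Python as follows. This is a minimal illustration, not the actual worker code: an in-memory queue stands in for the PubSub pull subscription, and the names register, HANDLERS, run_worker and max_idle_polls are hypothetical. A real worker would loop forever; the idle limit exists only so the sketch terminates.

```python
import queue
from typing import Callable, Dict, List, Optional

# A handler receives the decoded message and an ack callback (assumed shape).
Handler = Callable[[dict, Callable[[], None]], None]

HANDLERS: Dict[str, Handler] = {}


def register(name: str, handler: Handler) -> None:
    """Register a handler under the name used in the message's "to" field."""
    HANDLERS[name] = handler


def run_worker(
    source: "queue.Queue[dict]",
    is_scheduler: bool = False,
    start_scheduler: Optional[Callable[[], None]] = None,
    max_idle_polls: int = 1,
) -> List[str]:
    """Poll `source`, route each message, and return the ids that were acked."""
    if is_scheduler and start_scheduler is not None:
        start_scheduler()

    acked: List[str] = []
    idle = 0
    while idle < max_idle_polls:
        try:
            message = source.get(timeout=0.1)  # stand-in for polling PubSub
        except queue.Empty:
            idle += 1
            continue
        idle = 0
        handler = HANDLERS.get(message["to"])
        if handler is None:
            continue  # no handler registered: leave the message unacked
        # The handler decides when to ack; here acking just records the id.
        handler(message, lambda m=message: acked.append(m["id"]))
    return acked
```

Passing the ack callback into the handler keeps the ack decision with the handler, as described above, rather than with the polling loop.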

Message routing

Messages are routed to registered handlers through the "to" field in the message. Every message also includes an identifier used to link traces and log messages for observability purposes.

Example PubSub Message:

{
  "v": 1,
  "to": "REGISTERED_HANDLER_NAME",
  "id": "EVENT_ID",
  "data": ...
}

v -- Message version major number. Increments on breaking changes.
to -- Which handler the message should be routed to
id -- An ID (uuid) used to link traces and logs between callers
data -- Data/message for the handler. Format defined by handlers.
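
As an illustration of the message format above, the following sketch builds and validates an envelope. The names Envelope, encode, decode and SUPPORTED_VERSION are hypothetical, and rejecting any major version other than 1 is an assumption about how a worker might treat breaking changes.

```python
import json
import uuid
from dataclasses import dataclass
from typing import Any, Optional

SUPPORTED_VERSION = 1  # assumption: this worker understands major version 1 only


@dataclass(frozen=True)
class Envelope:
    v: int
    to: str
    id: str
    data: Any


def encode(to: str, data: Any, event_id: Optional[str] = None) -> bytes:
    """Serialize a message; a fresh UUID is generated when no id is given."""
    envelope = {
        "v": SUPPORTED_VERSION,
        "to": to,
        "id": event_id or str(uuid.uuid4()),
        "data": data,
    }
    return json.dumps(envelope).encode("utf-8")


def decode(payload: bytes) -> Envelope:
    """Parse a message and reject missing fields or unsupported versions."""
    raw = json.loads(payload)
    if not isinstance(raw, dict):
        raise ValueError("message must be a JSON object")
    missing = {"v", "to", "id", "data"} - raw.keys()
    if missing:
        raise ValueError(f"message is missing fields: {sorted(missing)}")
    if raw["v"] != SUPPORTED_VERSION:
        raise ValueError(f"unsupported message version: {raw['v']}")
    return Envelope(v=raw["v"], to=raw["to"], id=raw["id"], data=raw["data"])
```

A caller that forwards work to another handler would reuse the incoming id when calling encode, so that traces and logs stay linked across the chain.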

Scaling & High Availability (HA)

Scaling

Because every worker can handle every message, scaling is done by adding worker instances, whether the goal is faster processing of existing data sources or capacity for new ones.

High Availability (HA)

The limiting factor in HA for this system is the single designated scheduler instance. If that instance goes down, no scheduled ingestions occur. Moving to a leader election algorithm would allow recovery and multi-region failover support.

Once leader elections are implemented, a multi-region deployment is possible.
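
One common way to elect a leader is a time-limited lease that the scheduler must keep renewing. The sketch below is only an illustration of that idea and not a chosen design: LeaseStore and try_acquire are hypothetical names, and the in-memory object stands in for a shared store that would need an atomic compare-and-set in a real deployment.

```python
import time
from typing import Callable, Optional


class LeaseStore:
    """In-memory stand-in for a shared store holding the scheduler lease."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire(self, worker_id: str, ttl: float) -> bool:
        """Take or renew the lease if it is free, expired, or already ours."""
        now = self.clock()
        if self.holder in (None, worker_id) or now >= self.expires_at:
            self.holder = worker_id
            self.expires_at = now + ttl
            return True
        return False
```

Each worker would call try_acquire periodically and run the scheduler only while the call returns True. If the current scheduler stops renewing, another worker takes over once the lease expires.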

Handler State

State for handlers/components of the workers is stored in the license db in a state table. The table stores JSON documents by key name.
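
A key-to-JSON state table could look like the following sketch. SQLite is used here only so the example is self-contained; the table name, column names, and the StateStore class are assumptions, not the actual license db schema.

```python
import json
import sqlite3
from typing import Any


class StateStore:
    """Stores one JSON document per key (hypothetical schema)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS state (key TEXT PRIMARY KEY, doc TEXT NOT NULL)"
        )

    def put(self, key: str, doc: Any) -> None:
        """Insert the document, replacing any existing one for the same key."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO state (key, doc) VALUES (?, ?)",
                (key, json.dumps(doc)),
            )

    def get(self, key: str, default: Any = None) -> Any:
        """Return the decoded document for `key`, or `default` if absent."""
        row = self.conn.execute(
            "SELECT doc FROM state WHERE key = ?", (key,)
        ).fetchone()
        return default if row is None else json.loads(row[0])
```

For example, an ingestion handler might record a checkpoint with store.put("vuln-list", {"last_commit": "..."}) and read it back on the next run; the key and document shape here are illustrative only.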

Edited Mar 20, 2023 by Michael Eddington