Discuss vision-lidar cluster fusion in the object tracker
Description
The purpose of this issue is to discuss the rationale behind having both image-track association and image-clusters association in the tracking architecture. Here is the current tracking architecture diagram for reference:
Rationale: (Will be updated after discussions in the comments)
This section explains why the architecture has two distinct blocks (i.e., Image detection to track association and Cluster <--> Image association) instead of a single Cluster <--> Image association block.
Assumption:
Vision detections generally arrive at a lower frequency than lidar clusters.
Lidar clusters are usually published at 10 Hz, while vision detections from surround cameras are typically published at a lower rate.
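A quick numerical illustration of this assumption (the exact camera rate is hypothetical; the point is only that two streams with different periods rarely share timestamps):

```python
# 10 Hz lidar clusters vs. a hypothetical ~7 Hz surround-camera stream.
# Over one second the two streams align only at t = 0.0 s.
lidar_stamps = [round(i * 0.10, 3) for i in range(10)]    # 0.0, 0.1, ..., 0.9
camera_stamps = [round(i * 0.143, 3) for i in range(7)]   # 0.0, 0.143, ..., 0.858

overlap = set(lidar_stamps) & set(camera_stamps)          # {0.0}
```

So for almost every vision detection there is no lidar cluster with the exact same timestamp, which motivates the time-aware association below.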
Reason:
Based on this assumption, lidar clusters and vision detections will generally not observe the same objects at the exact same timestamp.
Tracks carry state information that makes it straightforward to predict a track's position at any given timestamp. This makes associating tracks with vision detections easier and more accurate.
However, not all tracks can be associated with vision detections. This is especially true for new objects that have just entered the field of view.
This means we cannot ignore the unassociated vision detections. The best way to use them is to associate them with lidar clusters.
The lidar clusters will not share the exact timestamp of the vision detections, but we can increase the association tolerances to compensate. This enables us to create accurate and stable tracks at much longer range than if we used lidar clusters alone.
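The two-stage flow above can be sketched as follows. This is a minimal 1-D illustration, not the tracker's actual implementation: all names (`Track`, `Detection`, `associate`, `associate_with_clusters`), the greedy nearest-neighbour matching, the constant-velocity prediction, and the gate/tolerance values are assumptions chosen for brevity.

```python
from dataclasses import dataclass

@dataclass
class Track:
    x: float      # position along one axis (m); 1-D for brevity
    vx: float     # velocity (m/s)
    stamp: float  # time of last update (s)

    def predict(self, t: float) -> float:
        # Constant-velocity prediction to the vision detection's timestamp.
        # Having velocity state is what lets tracks bridge the time gap.
        return self.x + self.vx * (t - self.stamp)

@dataclass
class Detection:
    x: float
    stamp: float

def associate(tracks, detections, gate=1.0):
    """Stage 1: greedy nearest-neighbour association of vision detections to
    tracks, after predicting each track to the detection's timestamp.
    Returns (list of (track_index, detection) pairs, unassociated detections)."""
    pairs, leftover, used = [], [], set()
    for d in detections:
        best, best_dist = None, gate
        for i, tr in enumerate(tracks):
            if i in used:
                continue
            dist = abs(tr.predict(d.stamp) - d.x)
            if dist < best_dist:
                best, best_dist = i, dist
        if best is None:
            leftover.append(d)
        else:
            used.add(best)
            pairs.append((best, d))
    return pairs, leftover

def associate_with_clusters(detections, clusters, gate=1.0, time_tol=0.15):
    """Stage 2: match leftover vision detections to raw lidar clusters.
    Clusters carry no velocity state, so we cannot predict them forward;
    instead we accept a timestamp mismatch up to `time_tol` seconds
    (the 'increased tolerance' mentioned above)."""
    matched = []
    for d in detections:
        for c in clusters:
            if abs(c.stamp - d.stamp) <= time_tol and abs(c.x - d.x) <= gate:
                matched.append((c, d))
                break
    return matched
```

For example, a detection near an existing track is consumed in stage 1, while a detection of a brand-new object falls through to stage 2 and is matched against a cluster from a slightly different timestamp, which is how unassociated vision detections can still seed tracks.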
Definition of Done
Have agreement on the architecture and move the rationale to the design doc