Specifying a new matching algorithm (mask) should not require a new class or redefinition of a method
Right now, changing the join algorithm (or the thresholds associated with it) means redefining either a class or a method. This is clunky. Consider the following:
class NameZipJoin(Join):
    """
    Join class that matches on entity name and zip code
    """

    def _validate_sides(self):
        _validate_zip(self.join_sides)

    def join(self, **kwargs):
        match_df = self.get_matches_df(**kwargs)
        left_side_zip = f"{self.join_sides[0].zip_field}_{self.join_sides[0].source}"
        right_side_zip = f"{self.join_sides[1].zip_field}_{self.join_sides[1].source}"
        mask = (match_df["similarity"] >= 0.75) & (
            match_df[left_side_zip] == match_df[right_side_zip]
        )
        return match_df[mask]
A better approach would be to make the matching algorithm (mask) injectable, perhaps with a default:
class NameZipJoin(Join):
    ...

    def join(self, mask=None, **kwargs):
        ...
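To make the proposal concrete, here is a rough standalone sketch of one way the injection could work. It assumes the mask is passed as a callable that takes the matches DataFrame and returns a boolean Series. The names `MaskFn`, `name_zip_mask`, `apply_mask`, and the toy columns `zip_a` / `zip_b` are made up for illustration and are not part of the existing code.

```python
from typing import Callable, Optional

import pandas as pd

# Hypothetical alias: a mask function maps the matches frame to a boolean Series.
MaskFn = Callable[[pd.DataFrame], pd.Series]


def name_zip_mask(left_zip: str, right_zip: str, threshold: float = 0.75) -> MaskFn:
    """Build the rule currently hard-coded in join(): similarity threshold plus equal zips."""
    def mask(match_df: pd.DataFrame) -> pd.Series:
        return (match_df["similarity"] >= threshold) & (
            match_df[left_zip] == match_df[right_zip]
        )
    return mask


def apply_mask(
    match_df: pd.DataFrame, mask: Optional[MaskFn], default: MaskFn
) -> pd.DataFrame:
    """Filter with the injected mask, falling back to the default when mask is None."""
    chosen = mask if mask is not None else default
    return match_df[chosen(match_df)]


# Toy data standing in for the output of get_matches_df (column names are assumptions).
matches = pd.DataFrame({
    "similarity": [0.9, 0.8, 0.6],
    "zip_a": ["10001", "10002", "10003"],
    "zip_b": ["10001", "99999", "10003"],
})
default = name_zip_mask("zip_a", "zip_b")

# No mask supplied: the default name + zip rule applies.
default_rows = apply_mask(matches, None, default)

# Same rule with a lower threshold, no subclass or method override needed.
relaxed_rows = apply_mask(matches, name_zip_mask("zip_a", "zip_b", threshold=0.5), default)

# A completely different rule, injected as a lambda.
name_only_rows = apply_mask(matches, lambda df: df["similarity"] >= 0.5, default)
```

With something like this, the body of `join(self, mask=None, **kwargs)` could shrink to building the default from `self.join_sides` and returning the filtered frame, so changing a threshold or swapping the rule becomes a call-site argument rather than a new class.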