MDM Interview Challenge
Authored by Daniel West

Data Engineering Challenge: Golden Customer Record Builder


Scenario

You are working on a data platform that collects customer information from various departments. Each group maintains its own version of the customer records, leading to duplicates and inconsistencies. Your task is to create a single golden customer record for each unique customer.

Input Files

Download the files below and assume they are ingested from different teams:
- customers_a.csv
- customers_b.csv

Your Task

- Load both CSVs using Python (Pandas or PySpark).
- Standardize schemas:
  - customers_a.csv has first_name and last_name
  - customers_b.csv has a full_name field; split it into first_name and last_name
- Identify matching customers using the email field (assume email is a unique identifier)
- Merge customer records into a single "golden record", keeping:
  - The most recent value for each field based on updated_at
  - Any non-conflicting fields (e.g., include phone number if available)
- Output the final golden record table as either:
  - A Pandas DataFrame, or
  - A new CSV file
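
One possible shape for the steps above, as a minimal Pandas sketch rather than a reference solution. The inline CSV text is made-up sample data standing in for the real files. The phone column name, the email lowercasing, and the helper names standardize and build_golden are assumptions made for illustration. "Most recent value for each field" is read here as "the latest non-null value per column after sorting by updated_at".

```python
import io

import pandas as pd

# Made-up stand-ins for customers_a.csv and customers_b.csv.
# The real files may have different columns and values.
CSV_A = """email,first_name,last_name,phone,updated_at
ann@example.com,Ann,Lee,,2024-01-10
bob@example.com,Bob,Ray,555-0101,2024-02-01
"""
CSV_B = """email,full_name,phone,updated_at
ANN@example.com,Ann Leigh,555-0199,2024-03-05
cy@example.com,Cy Moss,,2024-01-20
"""


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Align one source to the shared schema (hypothetical helper)."""
    df = df.copy()
    if "full_name" in df.columns:
        # Split on the first whitespace only, so any remaining words
        # stay together in last_name.
        parts = df["full_name"].str.strip().str.split(n=1, expand=True)
        df["first_name"] = parts[0]
        df["last_name"] = parts[1] if parts.shape[1] > 1 else pd.NA
        df = df.drop(columns="full_name")
    # Assumption: emails should match regardless of case or padding.
    df["email"] = df["email"].str.strip().str.lower()
    df["updated_at"] = pd.to_datetime(df["updated_at"])
    return df


def build_golden(frames: list) -> pd.DataFrame:
    """Merge standardized sources into one row per email (hypothetical helper)."""
    combined = pd.concat(frames, ignore_index=True).sort_values("updated_at")
    # GroupBy.last() skips nulls by default, so each column keeps its most
    # recent non-null value, and older values fill fields the newest row lacks.
    return combined.groupby("email", as_index=False).last()


frames = [
    standardize(pd.read_csv(io.StringIO(CSV_A))),
    standardize(pd.read_csv(io.StringIO(CSV_B))),
]
golden = build_golden(frames)
print(golden)
# To write a file instead (the output name is an assumption):
# golden.to_csv("golden_customers.csv", index=False)
```

With the real files, the io.StringIO wrappers would be replaced by the paths to customers_a.csv and customers_b.csv. Because updated_at is a record-level timestamp, this approach cannot tell which individual field changed last; it only prefers values from the most recently updated row.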

Bonus (Optional)

- Add a confidence score per record based on how many sources contributed
- Discuss how you would productionize this using:
  - Orchestration (e.g., Airflow, Dagster)
  - Data modeling (e.g., dbt)
  - Scalable execution (e.g., Spark)
- Handle fuzzy matches for names or emails (e.g., small typos)
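
Two of the bonus items can be sketched briefly. This is one hedged interpretation, not a prescribed answer. The confidence score is assumed to be the share of sources that contributed a record. The fuzzy matching uses the standard library's difflib.SequenceMatcher with an arbitrary 0.9 threshold. The sample rows, the source column, and the function names are made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd

# Made-up rows tagged with the source they came from, for illustration only.
records = pd.DataFrame({
    "email": ["ann@example.com", "ann@example.com", "bob@example.com"],
    "source": ["a", "b", "a"],
})
TOTAL_SOURCES = 2  # customers_a.csv and customers_b.csv

# Assumed definition: confidence = distinct contributing sources / all sources.
confidence = (
    records.groupby("email")["source"]
    .nunique()
    .div(TOTAL_SOURCES)
    .rename("confidence")
    .reset_index()
)
print(confidence)


def similarity(left: str, right: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def fuzzy_pairs(emails, threshold: float = 0.9) -> list:
    """Return pairs of distinct emails whose similarity meets the threshold.

    Compares every pair, which is quadratic: acceptable for small files,
    but larger data would need blocking (e.g., comparing within a domain).
    """
    unique = sorted(set(emails))
    return [
        (a, b)
        for i, a in enumerate(unique)
        for b in unique[i + 1:]
        if similarity(a, b) >= threshold
    ]


candidates = fuzzy_pairs(["ann@example.com", "ann@exmaple.com", "bob@example.com"])
print(candidates)
```

Pairs returned by fuzzy_pairs are only candidates. Since the brief treats email as a unique identifier, merging near-matches automatically risks combining different customers, so a review step or a second signal such as name similarity is a reasonable safeguard.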

Skills Tested

- Data wrangling and transformation
- Schema alignment and normalization
- Deduplication and conflict resolution
- Master data management (MDM) thinking
customers_a.csv 151 B
customers_b.csv 163 B