
MDM Interview Challenge

    Authored by Daniel West

    :test_tube: Data Engineering Challenge: Golden Customer Record Builder

    :blue_book: Scenario

    You are working on a data platform that collects customer information from several departments. Each department maintains its own version of the customer records, which leads to duplicates and inconsistencies. Your task is to create a single golden customer record for each unique customer.


    :open_file_folder: Input Files

    Download the files below and assume they are ingested from different teams:

    • customers_a.csv
    • customers_b.csv

    :white_check_mark: Your Task

    1. Load both CSVs using Python (Pandas or PySpark).
    2. Standardize schemas:
      • customers_a.csv has first_name and last_name
      • customers_b.csv has a full_name field — split it into first_name and last_name
    3. Identify matching customers using the email field
      (assume email is a unique identifier)
    4. Merge customer records into a single "golden record", keeping:
      • The most recent value for each field based on updated_at
      • Any non-conflicting fields (e.g., include phone number if available)
    5. Output the final golden record table as either:
      • A Pandas DataFrame, or
      • A new CSV file
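One way steps 1 to 5 could be approached in Pandas is sketched below. It is a minimal illustration, not a reference solution: the sample rows are hypothetical stand-ins for the real files, the `phone` column name is an assumption, and splitting `full_name` on the first space is a simplification that will mishandle multi-word first names. The helper names `load_sources` and `build_golden` are made up for this sketch.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real customers_a.csv / customers_b.csv
# may have different columns and values.
CSV_A = """email,first_name,last_name,phone,updated_at
ann@example.com,Ann,Lee,,2024-01-05
bob@example.com,Bob,Ray,555-0100,2024-02-01
"""
CSV_B = """email,full_name,phone,updated_at
Ann@Example.com,Ann Leigh,555-0199,2024-03-01
cat@example.com,Cat Stone,,2024-01-20
"""


def load_sources(file_a, file_b):
    """Load both sources and align them to one schema."""
    df_a = pd.read_csv(file_a, dtype=str)
    df_b = pd.read_csv(file_b, dtype=str)

    # Split full_name on the first run of whitespace (a simplification).
    parts = df_b["full_name"].str.strip().str.split(n=1)
    df_b["first_name"] = parts.str[0]
    df_b["last_name"] = parts.str[1]
    df_b = df_b.drop(columns="full_name")

    combined = pd.concat([df_a, df_b], ignore_index=True)
    # Normalize the match key so case and stray spaces do not split a customer.
    combined["email"] = combined["email"].str.strip().str.lower()
    combined["updated_at"] = pd.to_datetime(combined["updated_at"])
    return combined


def build_golden(combined):
    """Keep, per email, the most recent non-null value of every field."""
    ordered = combined.sort_values("updated_at", kind="stable")
    # GroupBy.last() takes the last non-null entry of each column, so an
    # older phone number survives when the newest record has none.
    return ordered.groupby("email", as_index=False).last()


golden = build_golden(load_sources(io.StringIO(CSV_A), io.StringIO(CSV_B)))
print(golden)
```

To write the result out instead of returning a DataFrame, `golden.to_csv("golden_customers.csv", index=False)` would do; the output filename is arbitrary.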

    :fire: Bonus (Optional)

    • Add a confidence score per record based on how many sources contributed
    • Discuss how you would productionize this using:
      • Orchestration (e.g., Airflow, Dagster)
      • Data modeling (e.g., dbt)
      • Scalable execution (e.g., Spark)
    • Handle fuzzy matches for names or emails (e.g., small typos)
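The confidence score and fuzzy-matching bonuses could be prototyped along these lines. This is a rough sketch under stated assumptions: the `source` column, the sample rows, the 0.9 similarity threshold, and the helper names are all hypothetical, and the score is defined here simply as the share of sources that contributed a record. The standard library's `difflib.SequenceMatcher` stands in for a dedicated fuzzy-matching library.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical combined rows, one per (source, record); names are assumptions.
rows = pd.DataFrame(
    {
        "email": [
            "ann@example.com",
            "ann@example.com",
            "bob@example.com",
            "bob@exmaple.com",  # deliberate typo in the domain
        ],
        "source": ["a", "b", "a", "b"],
    }
)


def confidence_by_email(df, total_sources):
    """Score each email by the fraction of sources that contributed to it."""
    return df.groupby("email")["source"].nunique() / total_sources


def similar_emails(emails, threshold=0.9):
    """Return pairs of distinct emails whose similarity ratio meets the threshold."""
    unique = sorted(set(emails))
    pairs = []
    # Pairwise comparison is quadratic; fine for a sketch, not for large data.
    for i, left in enumerate(unique):
        for right in unique[i + 1:]:
            ratio = SequenceMatcher(None, left, right).ratio()
            if ratio >= threshold:
                pairs.append((left, right, ratio))
    return pairs


scores = confidence_by_email(rows, total_sources=2)
candidates = similar_emails(rows["email"])
print(scores)
print(candidates)
```

Flagged pairs are only candidates for review; merging them automatically would contradict the assumption that email is a unique identifier, so a real pipeline would likely need a manual or rule-based confirmation step.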

    :brain: Skills Tested

    • Data wrangling and transformation
    • Schema alignment and normalization
    • Deduplication and conflict resolution
    • Master data management (MDM) thinking
    customers_a.csv 151 B
    customers_b.csv 163 B