Platform: Add support for exporting Snowplow data to S3

This issue tracks building support for exporting data ingested within the Data Insights Platform to S3. Landing the data in S3 makes it available to downstream consumers such as Snowflake.

_Attachment: DIP_S3ExporterHighlighted (architecture diagram with the S3 exporter highlighted)_

The key deliverables for this implementation are:

  • Ability to partition data by the Snowplow app_id and the tenant ID. The layout can follow either of the formats below.

    graph TD
        S3["S3 Data Lake"]

        S3 --> app1["app_id=1/"]
        S3 --> appN["app_id=N/"]

        app1 --> app1ten1["tenant_id=1/"]
        app1 --> app1tenN["tenant_id=N/"]

        appN --> appNten1["tenant_id=1/"]
        appN --> appNtenN["tenant_id=N/"]

        app1ten1 --> app1ten1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
        app1tenN --> app1tenNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]

        appNten1 --> appNten1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
        appNtenN --> appNtenNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]

        classDef root fill:#f96,stroke:#333,stroke-width:2px
        classDef level1 fill:#9af,stroke:#333,stroke-width:1px
        classDef level2 fill:#9bf,stroke:#333,stroke-width:1px
        classDef data fill:#bfb,stroke:#333,stroke-width:1px

        class S3 root
        class app1,appN level1
        class app1ten1,app1tenN,appNten1,appNtenN level2
        class app1ten1data,app1tenNdata,appNten1data,appNtenNdata data

    Or the hierarchy can be inverted, i.e. tenant_id at the top level and app_id as the subfolder, as shown below.

    graph TD
        S3["S3 Data Lake"]

        S3 --> ten1["tenant_id=1/"]
        S3 --> tenN["tenant_id=N/"]

        ten1 --> ten1app1["app_id=1/"]
        ten1 --> ten1appN["app_id=N/"]

        tenN --> tenNapp1["app_id=1/"]
        tenN --> tenNappN["app_id=N/"]

        ten1app1 --> ten1app1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
        ten1appN --> ten1appNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]

        tenNapp1 --> tenNapp1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
        tenNappN --> tenNappNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]

        classDef root fill:#f96,stroke:#333,stroke-width:2px
        classDef level1 fill:#9af,stroke:#333,stroke-width:1px
        classDef level2 fill:#9bf,stroke:#333,stroke-width:1px
        classDef data fill:#bfb,stroke:#333,stroke-width:1px

        class S3 root
        class ten1,tenN level1
        class ten1app1,ten1appN,tenNapp1,tenNappN level2
        class ten1app1data,ten1appNdata,tenNapp1data,tenNappNdata data

In option 2 the data is separated per tenant, which helps maintain the data isolation requirement for each customer. This also simplifies data cleansing and provides a path for data sharing in the future.
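
As a minimal sketch, this is how the exporter could build a partitioned object key for the option-2 layout (tenant_id at the top level). The function name and signature are illustrative only, not an agreed interface:

```python
from datetime import datetime, timezone

def build_object_key(tenant_id: str, app_id: str, event_time: datetime) -> str:
    """Illustrative only: object key for the option-2 layout
    (tenant_id at the top level, app_id as the subfolder)."""
    ts = event_time.astimezone(timezone.utc)
    return (
        f"tenant_id={tenant_id}/"
        f"app_id={app_id}/"
        f"{ts:%Y/%m/%d/%H}/"
        f"data_{ts:%Y_%m_%d_%H_%M_%S}.parquet"
    )

# e.g. tenant_id=42/app_id=7/2024/05/31/14/data_2024_05_31_14_07_09.parquet
print(build_object_key("42", "7", datetime.now(timezone.utc)))
```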

  • Ability to write data to a different bucket based on the type of data received, i.e. separate buckets for OTel events, Snowplow events, CloudEvents, etc. Which object storage destination each incoming data type goes to should be easily configurable (see the routing sketch after this list).
  • Ability to configure data retention policies for archiving data, with seamless retrieval of archived data for inactive tenants (see the lifecycle sketch after this list).
  • When a new tenant is added, bucket and path creation should be handled automatically by the workflow.
  • Mechanism to monitor fair-usage storage quotas per tenant (see the usage sketch after this list).
  • Support for tenant-specific encryption keys (see the SSE-KMS sketch after this list).
  • Ability to route data to specific regions based on tenant data-residency requirements.
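
For the per-data-type bucket routing, a hedged sketch of what the configuration could look like; the bucket names and data-type labels below are placeholders, not actual resources:

```python
# Hypothetical routing table: which bucket each incoming data type lands in.
# Bucket names and data-type labels are placeholders, not real resources.
BUCKET_BY_DATA_TYPE = {
    "snowplow":    "dip-snowplow-events",
    "otel":        "dip-otel-events",
    "cloudevents": "dip-cloud-events",
}

def resolve_bucket(data_type: str) -> str:
    """Return the destination bucket for an incoming data type."""
    try:
        return BUCKET_BY_DATA_TYPE[data_type]
    except KeyError:
        raise ValueError(f"No bucket configured for data type {data_type!r}")
```

The same lookup could gain a region dimension to cover the data-residency routing requirement.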
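
For retention and archival, one option (an assumption, not a decision) is an S3 lifecycle rule scoped to a tenant prefix; the bucket name, prefix, and day count below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Illustrative lifecycle rule: transition an inactive tenant's prefix to
# Glacier after 90 days. Bucket name, prefix, and day count are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="dip-snowplow-events",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-inactive-tenant-42",
                "Filter": {"Prefix": "tenant_id=42/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```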
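
For per-tenant storage monitoring, a rough sketch that sums object sizes under a tenant prefix (assumes the option-2 layout; for very large buckets, S3 Inventory or Storage Lens would scale better than listing objects):

```python
import boto3

def tenant_storage_bytes(bucket: str, tenant_id: str) -> int:
    """Sum object sizes under a tenant's prefix (option-2 layout assumed)."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=f"tenant_id={tenant_id}/"):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total
```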
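
For tenant-specific encryption keys, one way this could work is SSE-KMS with one KMS key per tenant; the key alias, bucket, and object key below are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Illustrative write encrypted with a tenant-specific KMS key (SSE-KMS).
# The KMS key alias, bucket, and object key are placeholders.
s3.put_object(
    Bucket="dip-snowplow-events",
    Key="tenant_id=42/app_id=7/2024/05/31/14/data_2024_05_31_14_07_09.parquet",
    Body=b"<parquet bytes>",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/dip-tenant-42",
)
```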