Platform: Add support for exporting Snowplow data to GCS
This issue tracks adding support for exporting data ingested by the Data Insights Platform to GCS. Landing the data in GCS makes it available to downstream consumers such as Snowflake.
The key deliverables for this implementation are:
- Ability to partition data by the Snowplow-formatted `app_id` and tenant ID. The layout can follow either of the formats below.

```mermaid
graph TD
    GCS["GCS Data Lake"]
    GCS --> app1["app_id=1/"]
    GCS --> appN["app_id=N/"]
    app1 --> app1ten1["tenant_id=1/"]
    app1 --> app1tenN["tenant_id=N/"]
    appN --> appNten1["tenant_id=1/"]
    appN --> appNtenN["tenant_id=N/"]
    app1ten1 --> app1ten1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    app1tenN --> app1tenNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    appNten1 --> appNten1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    appNtenN --> appNtenNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    classDef root fill:#f96,stroke:#333,stroke-width:2px
    classDef level1 fill:#9af,stroke:#333,stroke-width:1px
    classDef level2 fill:#9bf,stroke:#333,stroke-width:1px
    classDef data fill:#bfb,stroke:#333,stroke-width:1px
    class GCS root
    class app1,appN level1
    class app1ten1,app1tenN,appNten1,appNtenN level2
    class app1ten1data,app1tenNdata,appNten1data,appNtenNdata data
```

Or the hierarchy can be inverted, i.e. `tenant_id` at the top level and `app_id` as the sub-folder, as shown below.
```mermaid
graph TD
    GCS["GCS Data Lake"]
    GCS --> app1["tenant_id=1/"]
    GCS --> appN["tenant_id=N/"]
    app1 --> app1ten1["app_id=1/"]
    app1 --> app1tenN["app_id=N/"]
    appN --> appNten1["app_id=1/"]
    appN --> appNtenN["app_id=N/"]
    app1ten1 --> app1ten1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    app1tenN --> app1tenNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    appNten1 --> appNten1data["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    appNtenN --> appNtenNdata["Partitioned YYYY/MM/DD/HH data_YYYY_MM_DD_HH_MM_SEC.parquet"]
    classDef root fill:#f96,stroke:#333,stroke-width:2px
    classDef level1 fill:#9af,stroke:#333,stroke-width:1px
    classDef level2 fill:#9bf,stroke:#333,stroke-width:1px
    classDef data fill:#bfb,stroke:#333,stroke-width:1px
    class GCS root
    class app1,appN level1
    class app1ten1,app1tenN,appNten1,appNtenN level2
    class app1ten1data,app1tenNdata,appNten1data,appNtenNdata data
```
In option 2 the data is separated per tenant, which helps maintain the data isolation requirement for each customer. It also simplifies data cleansing and provides a path for data sharing in the future (see the path-building sketch below).
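For illustration, the sketch below shows how an object key could be built for the option-2 layout: `tenant_id` on top, `app_id` underneath, then an hourly partition and a timestamped Parquet file. The helper name, bucket-relative key format, and example values are hypothetical; this is only meant to make the layout concrete, not to describe existing platform code.

```python
from datetime import datetime, timezone

def build_object_key(tenant_id: str, app_id: str, event_time: datetime) -> str:
    """Build an option-2 style object key: tenant_id on top, app_id below,
    then an hourly partition and a timestamped Parquet file name.
    Hypothetical helper for illustration only."""
    hour = event_time.strftime("%Y/%m/%d/%H")
    file_name = event_time.strftime("data_%Y_%m_%d_%H_%M_%S.parquet")
    return f"tenant_id={tenant_id}/app_id={app_id}/{hour}/{file_name}"

# Example: an event ingested at 2024-05-01 13:07:42 UTC for tenant 42, app 7
key = build_object_key("42", "7", datetime(2024, 5, 1, 13, 7, 42, tzinfo=timezone.utc))
print(key)
# tenant_id=42/app_id=7/2024/05/01/13/data_2024_05_01_13_07_42.parquet
```

Putting `tenant_id` as the leading path segment means per-tenant deletion, archival, or sharing becomes a single prefix-level operation, which is the isolation benefit described above.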
- We should be able to write data to different buckets based on the type of data being received, i.e. separate buckets for OTel events, Snowplow events, CloudEvents, etc. Which object-storage bucket each type of incoming data goes to should be easily configurable (see the routing sketch after this list).
- Ability to configure data retention policies for archiving data, with seamless retrieval of archived data for inactive tenants (see the lifecycle-policy sketch after this list).
- When a new tenant is added, bucket creation or path creation should be handled automatically by the workflow (see the tenant-configuration sketch after this list).
- A mechanism to monitor fair-usage storage quotas per tenant (see the usage-monitoring sketch after this list).
- Support for tenant-specific encryption keys.
- Route data to specific regions based on tenant data-residency requirements. (The tenant-configuration sketch after this list also touches on encryption keys and region routing.)
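A minimal sketch of the configurable data-type-to-bucket routing. The type labels, bucket names, and the `resolve_bucket` helper are placeholders, not agreed names.

```python
# Hypothetical routing table: incoming data type -> destination GCS bucket.
BUCKET_ROUTING = {
    "snowplow": "dip-export-snowplow",
    "otel": "dip-export-otel",
    "cloudevents": "dip-export-cloudevents",
}
DEFAULT_BUCKET = "dip-export-misc"

def resolve_bucket(event_type: str) -> str:
    """Return the destination bucket for a given incoming data type."""
    return BUCKET_ROUTING.get(event_type.lower(), DEFAULT_BUCKET)

assert resolve_bucket("Snowplow") == "dip-export-snowplow"
assert resolve_bucket("unknown") == "dip-export-misc"
```

Keeping the mapping in configuration (rather than code) lets new data types be routed to new buckets without a deploy.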
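One possible way to meet the retention deliverable (an assumption, not a decided design) is to express it as a GCS bucket lifecycle policy. The sketch below builds the rules as plain dictionaries in the shape of the GCS JSON lifecycle configuration; the ages and storage class are placeholders, not agreed retention terms.

```python
# Hypothetical retention policy, expressed as GCS-style lifecycle rules:
# move objects to ARCHIVE storage after 90 days, delete after 365 days.
RETENTION_LIFECYCLE = {
    "rule": [
        {
            "action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
            "condition": {"age": 90},
        },
        {
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}
```

Since GCS ARCHIVE objects remain directly readable (at higher access cost), archived data for inactive tenants can still be retrieved without a separate restore step, which addresses the "seamless retrieval" requirement.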
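To make the onboarding, tenant-specific encryption, and residency deliverables concrete, the sketch below registers a tenant with a destination region and a customer-managed KMS key, and derives the bucket and prefix its exports should land in. Every name here (the dataclass, the bucket naming convention, the KMS key format) is a hypothetical illustration, not the platform's actual API.

```python
from dataclasses import dataclass

@dataclass
class TenantConfig:
    """Hypothetical per-tenant export configuration."""
    tenant_id: str
    region: str        # e.g. "europe-west3" for an EU-resident tenant
    kms_key_name: str  # tenant-specific CMEK resource name (placeholder format)

def provision_tenant(cfg: TenantConfig) -> str:
    """Return the regional bucket this tenant's exports should land in.

    GCS has no real directories, so 'path creation' for a new tenant amounts to
    registering the tenant prefix in config; regional buckets can be created
    ahead of time or on demand by the onboarding workflow.
    """
    return f"dip-export-{cfg.region}"

eu_tenant = TenantConfig(
    tenant_id="42",
    region="europe-west3",
    kms_key_name="projects/dip/locations/europe-west3/keyRings/export/cryptoKeys/tenant-42",
)
bucket = provision_tenant(eu_tenant)
print(f"{bucket}/tenant_id={eu_tenant.tenant_id}/")  # dip-export-europe-west3/tenant_id=42/
```

With this shape, residency is enforced by the bucket choice and isolation by the per-tenant key, while onboarding reduces to adding one config entry.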
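A sketch of per-tenant storage-usage monitoring, assuming the option-2 prefix layout and the `google-cloud-storage` Python client. The bucket name and quota figure are placeholders, and a production implementation would more likely rely on storage metrics or inventory reports than on a listing loop.

```python
from google.cloud import storage

def tenant_usage_bytes(bucket_name: str, tenant_id: str) -> int:
    """Sum object sizes under a tenant's prefix (option-2 layout assumed)."""
    client = storage.Client()
    blobs = client.list_blobs(bucket_name, prefix=f"tenant_id={tenant_id}/")
    return sum(blob.size or 0 for blob in blobs)

# Placeholder fair-usage check; the 500 GiB figure is illustrative only.
QUOTA_BYTES = 500 * 1024 ** 3
usage = tenant_usage_bytes("dip-export-snowplow", "42")
if usage > QUOTA_BYTES:
    print(f"tenant 42 over quota: {usage} bytes")
```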
