Generic File Tap/Target
A generic file-based tap seems appropriate. I'm going to call this tap-file
for now to give it a name. Pardon the words I chose here; I'd be happy to update the ontology used :D
Today there are a bunch of ideas that overlap with the core idea of File. I haven't seen a "connector" tool handle this very well before; this is a common issue (Fivetran, Workato, etc.).
Where we are today:
- https://hub.meltano.com/taps/s3-csv
- https://hub.meltano.com/taps/csv
- https://hub.meltano.com/taps/gmail-csv (questionable if this should be on the list)
- https://hub.meltano.com/taps/parquet
- https://hub.meltano.com/taps/sftp
- https://hub.meltano.com/targets/jsonl
- https://hub.meltano.com/targets/csv
- https://hub.meltano.com/targets/azureblobstorage
To make this generic, we'd need some standard features.
File Tap/Target needed Features
- Transport Layer (I think that's the right term, maybe something else) - How do we get the file: FTP, SFTP, S3, POST (curl), etc.
- File Types - CSV, Parquet, JSONL, etc.
- Compression - gzip, zip, 7zip, tar, etc.
- Encodings should be handled in some way that makes sense
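To make the layering concrete, here is a rough sketch of how compression, encoding, and file type could each be a separate pluggable step. Everything here is an assumption for illustration: the registry names (DECOMPRESSORS, PARSERS) and the read_records function are hypothetical, only gzip, CSV, and JSONL are covered, and the transport layer is stood in for by bytes already in memory.

```python
import csv
import gzip
import io
import json
from typing import Callable, Dict, Iterator

# Hypothetical registry: one entry per supported compression option.
DECOMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "none": lambda raw: raw,
    "gzip": gzip.decompress,
}


def parse_csv(text: str) -> Iterator[dict]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


def parse_jsonl(text: str) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


# Hypothetical registry: one entry per supported file type.
PARSERS: Dict[str, Callable[[str], Iterator[dict]]] = {
    "csv": parse_csv,
    "jsonl": parse_jsonl,
}


def read_records(raw: bytes, compression: str, encoding: str, file_type: str) -> Iterator[dict]:
    """Run the layers in order: decompress, decode, then parse."""
    data = DECOMPRESSORS[compression](raw)
    text = data.decode(encoding)
    return PARSERS[file_type](text)


# A transport (SFTP, S3, ...) would normally supply these bytes.
raw = gzip.compress("id,name\n1,a\n2,b\n".encode("utf-8"))
records = list(read_records(raw, "gzip", "utf-8", "csv"))
```

Keeping each layer behind its own lookup means adding a new compression or file type would not touch the others, which is roughly what "generic" would need to mean here.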
Other Ideas that could be guiding principles for the design of this
- Identity functionality should work, i.e. meltano elt tap-file target-file should output the same file with no changes (cksumfilein == chksumfileout)
- https://github.com/edgarrmondragon/tap-dbf/pull/1/files from @edgarrmondragon uses pyfilesystem2, which seems like a nice abstraction layer for all of this.
- URIs for everything
uri: datalake://{username}:{password}@{storage_name}?tenant_id={tenant_id}
- Example configuration with sensible defaults
- Default to pulling all files from the base directory
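As a rough sketch of two of the principles above, the snippet below splits a connection URI into the parts a transport would need and computes a checksum that an identity test could compare between input and output files. The function names (parse_file_uri, file_checksum), the returned dictionary keys, the sample credentials, and the choice of SHA-256 are all assumptions for illustration, not an existing tap-file API.

```python
import hashlib
from urllib.parse import parse_qs, urlsplit


def parse_file_uri(uri: str) -> dict:
    """Split a connection URI into the pieces a transport would need."""
    parts = urlsplit(uri)
    return {
        "transport": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        # Hypothetical default: no path means "start at the base directory".
        "path": parts.path or "/",
        # Keep the first value of each query parameter.
        "options": {key: values[0] for key, values in parse_qs(parts.query).items()},
    }


def file_checksum(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Sample values filled into the URI template from the list above.
config = parse_file_uri("datalake://alice:s3cret@mystorage?tenant_id=abc123")
```

An identity check would then be file_checksum(file_in) == file_checksum(file_out) after a tap-file to target-file run. Note that urlsplit does not percent-decode the username or password, so credentials with special characters would need urllib.parse.unquote applied.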
Questions
- Does a single tap-file make the most sense?
- Does a library with helpful functionality make sense instead, and just have everyone implement
This comes up in different ways, and I'd like to collect ideas in one place (I keep posting the same information, and maybe it's just a bad idea, so let's have one place to hash out the idea :D)
Edited by Derek Visch