Resolve "Add CLI option like `--test` to test whether config is valid and a connection can be made"
Closes #14
This connection test feature (`--test` in the CLI) is implemented by adding a `MAX_RECORDS_LIMIT` variable within the base `Stream` class, and then adding a new `Tap.run_connection_test()` method which is identical to `Tap.sync_all()` except that it injects `MAX_RECORDS_LIMIT=0` into each stream before running `sync()`.
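The pattern looks roughly like the following minimal sketch (illustrative only; the real implementation lives in private base-class methods, and the `streams` mapping and `sync()` calls shown here are simplifying assumptions):

```python
# Minimal sketch of the pattern described above -- not the SDK's actual code.
# `streams`, `MAX_RECORDS_LIMIT`, and `sync()` are simplified stand-ins for the
# private base-class members this MR touches.
class Tap:
    def __init__(self, streams: dict):
        self.streams = streams  # mapping of stream name -> Stream-like object

    def sync_all(self) -> None:
        for stream in self.streams.values():
            stream.sync()

    def run_connection_test(self) -> bool:
        # Identical to sync_all(), except each stream is capped at zero emitted
        # records, so config and connectivity are exercised without actually
        # moving any data.
        for stream in self.streams.values():
            stream.MAX_RECORDS_LIMIT = 0
            stream.sync()
        return True
```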
Notes:
- Each stream is initialized as usual and a "real" data sync is started.
- The `MAX_RECORDS_LIMIT` is applied per stream or (if applicable) per stream partition.
- Because of the nature of this "LIMIT", we always drop at least one record in this process. Meaning, a record is queued up which is not ultimately emitted. This is intentional: for security and other reasons, we want to ensure this is treated as a hard limit that will not be exceeded. We also do not want to short-circuit the process before a row is received, since that could miss a large class of potential errors.
- `SCHEMA` messages will still be emitted to `STDOUT`, per usual, for each stream.
- `STATE` messages may or may not be emitted, depending upon implementation.
- The `CATALOG` can be read from input or discovered in combination with `--test`, as per usual.
  - I considered deliberately skipping discovery and other catalog processing steps, but some stream types rely on this process in order to identify which endpoints to hit or which tables to query from.
  - Without supporting discovery and/or input catalogs, certain types of taps would simply have an empty catalog with no streams and therefore no way to assert connectivity.
- The notice `Stream prematurely aborted due to the stream's max record limit (0) being reached.` will be emitted only if one or more records would have been emitted. If zero records are found, the notice will not print and the process will run identically to any other zero-record stream sync. (A minimal sketch of this limit check follows this list.)
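For illustration, the limit check behaves roughly like the sketch below (the function name, signature, and logging setup are assumptions made for this example, not the SDK's API; it only demonstrates the drop-at-least-one-record and notice semantics described above):

```python
# Illustrative sketch only -- a simplified stand-in for the private base-class
# sync logic; names and signatures here are assumptions, not the SDK's API.
import logging
from typing import Iterable, Optional

logger = logging.getLogger("connection_test_sketch")


def sync_with_limit(records: Iterable[dict], max_records_limit: Optional[int]) -> int:
    """Emit records until the (optional) hard limit is reached.

    With max_records_limit=0 (the connection test case), the first record is
    still pulled from the source but never emitted, which exercises the full
    request path while guaranteeing the limit is never exceeded.
    """
    emitted = 0
    for record in records:
        if max_records_limit is not None and emitted >= max_records_limit:
            # Only reached if the source produced at least one record; a
            # zero-record stream finishes without printing the abort notice.
            logger.info(
                "Stream prematurely aborted due to the stream's max record "
                "limit (%d) being reached.",
                max_records_limit,
            )
            break
        print(record)  # stand-in for writing a Singer RECORD message
        emitted += 1
    return emitted
```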
Other applications of `MAX_RECORDS_LIMIT` (to be addressed in follow-on dev stories):
- For testing use cases:
  - A "dry run" might run 100 or 1000 rows per stream just to ensure all streams are working correctly.
- For first run use cases:
  - Similar to the above, a developer might want to execute 1,000 or 100,000 rows per stream just to ensure all streams are working correctly and landing in their target database. Rather than wait several hours for all tables to sync, they can review the output on the same business day and then schedule the remaining records to be synced after hours.
- For scenarios where very long contiguous runtimes are not desirable:
  - A very high limit can be set (say, 100 million).
  - After this limit is reached, the tap will send its `INFO` message `Stream prematurely aborted due to the stream's max record limit (100000000) being reached.` and then will shut down.
  - Because state messages are always emitted if >0 records have been read, the process is resumable directly from that point, assuming the stream is incremental or log_based. (A hypothetical chunked-run sketch follows this list.)
Note to developers:
- Because this feature is implemented entirely with private base class methods, there are no breaking changes, and we expect that most tap developers will be able to take advantage of the new feature immediately, simply by updating to the latest version of the SDK.