Exploration into broken pipe errors and single point of failure for ELT
This is for next week's milestone, and will be exploration only (not implementation of a fix).
Task List
- @iroussos fill out this issue with more detail
- @iroussos update the documentation with instructions for users on how to add credentials
- @iroussos come up with a 1-week scope improvement proposal to do some of the hypothesized highest-ROI things
- @dmor have a go/no-go conversation before any implementation takes place (and figure out which milestone)
Problem description
At the core of Meltano ELT is the Extract > Load pipeline. Meltano uses Singer.io Taps as Extractors and Singer.io Targets as Loaders.
Both Taps and Targets are loaded as plugins and are treated as black boxes by Meltano: we expect that all Taps and Targets follow the Singer.io Specification.
When `meltano elt` runs, the tap is connected to the target through a pipe and data is sent through that pipe.
We try to handle simple errors, for example a tap not supporting the discovery command, but not all issues can be handled gracefully by Meltano.
The most important issue we are facing is a Tap or a Target failing while the Extract > Load pipeline runs: if an error occurs that the Tap or Target can't handle properly and it exits, the pipe that connects the Tap with the Target can break and close.
The process that failed writes an error to the output, but the other process may take some time to realize that the pipe broke; it continues its execution and then fails with a cryptic `Broken pipe` error as the last error presented to the user.
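To make this concrete, here is a minimal sketch (not Meltano's actual implementation) of two subprocesses wired together the way `meltano elt` wires a tap to a target. The fake tap emits records forever; the fake target exits immediately, as it would with wrong database credentials:

```python
import subprocess
import sys

# Fake "tap": writes records to stdout forever.
tap = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import sys\nwhile True:\n    sys.stdout.write('RECORD ...\\n')"],
    stdout=subprocess.PIPE,
)
# Fake "target": fails right away, e.g. due to bad credentials.
target = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.exit('CRITICAL: wrong password')"],
    stdin=tap.stdout,
)
tap.stdout.close()  # the target is now the pipe's only reader
target.wait()       # the real error surfaces here, early in the run
tap.wait()          # the tap only dies once the OS pipe buffer fills:
                    # its next write raises BrokenPipeError, long after
                    # the target's error message has scrolled away
print("tap exit:", tap.returncode, "target exit:", target.returncode)
```

The tap's broken-pipe traceback is the last thing printed, which is exactly what users report.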
We have identified two major use cases where the above process causes confusion to the user:
- The Target fails and exits the first time it tries to load data.

  The proper error is written to the output log (e.g. a wrong password or a user role without adequate access to the DB), but the Tap keeps on running uninterrupted. The Tap keeps extracting data from the API and writing successful log messages. At some point down the road it tries to send the extracted data to the Target, but the pipe has closed due to the Target failing and exiting, so the Tap fails while trying to flush the data to its output (the pipe in this case). Hence the `Broken pipe` error in this case.

  As an example, assume a user extracting from the Gitlab API and loading the extracted data into her local Postgres. For whatever reason, the wrong Postgres credentials have been set up in the environment. It could be a wrong password, a wrong database name or even a wrong port for the local Postgres:

  - The Gitlab extractor (`tap-gitlab`) runs for a while before sending any data to the loader, as it tries to group relevant data together before sending them on.
  - When the first batch of data from Gitlab is ready, it is sent to `target-postgres`, which fails, writes an error to the log, exits and breaks the pipe.
  - `tap-gitlab` continues its execution, happy as a sunflower in the summer, as it does not know that the pipe broke. It keeps trying to fetch more data, while producing loads of log messages.
  - At some point, `tap-gitlab` has enough data to send back, tries to write to its output and fails with a critical error.

  The worst part is that, depending on the extractor (how much time it takes to send data to the pipe and how many log entries it generates in between) and the scrollback buffer of the terminal running `meltano elt`, the original error may not even be visible anymore when the final `Broken pipe` error is produced.

  Examples of this happening inside our team and causing confusion: #502 (closed), #341 (closed), #450 (closed), etc.
- One of the threads of a multi-threaded Tap fails.

  Using multiple threads to extract in parallel from multiple endpoints of an API is required for APIs with lots of data (like Salesforce, NetSuite or Zuora). Not all Taps support multi-threaded extraction, but those that do can cause this really tricky error:

  - One of the threads fails, whether because of a bug for that specific endpoint, an unsupported Entity or whatever else.
  - The moment the thread fails and exits, the pipe (which all threads share in order to send data to the Loader) breaks.
  - If there is no proper thread management in the Tap, the rest of the threads keep on living their lives, extracting data and producing log entries.
  - As the other threads finish their execution one by one and try to send data to the pipe, they break with a nice critical `Broken pipe` error.
  - That's the previous issue multiplied by however many threads may be running (8+), or even more if the thread execution is scheduled beforehand and all the threads are spawned just to fail afterwards.

  The result is an amazing log with critical error after critical error from failed jobs, while other jobs are extracting at the same time and logging successful API extraction messages. It may require going back 100s or 1000s of log entries to find the original error that caused the process to fail. That's the case with some of the most wonderful errors we have had with `tap-salesforce`; a minimal simulation of this failure mode follows below.
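As a rough sketch (again, not Meltano or tap code), the snippet below simulates worker threads that share a single pipe to the loader. Once the read end is gone, each thread that later tries to write dies with its own broken-pipe error, burying the original failure under one critical message per thread:

```python
import os
import threading
import time

# Shared pipe to the "loader"; closing the read end simulates the
# loader (or a failed sibling thread) tearing the pipe down.
read_fd, write_fd = os.pipe()
os.close(read_fd)

def extract(endpoint: str) -> None:
    time.sleep(0.1)  # pretend to extract from the API for a while
    try:
        os.write(write_fd, f"records from {endpoint}\n".encode())
    except BrokenPipeError:
        # Each surviving thread emits its own critical error.
        print(f"CRITICAL [{endpoint}]: Broken pipe")

threads = [threading.Thread(target=extract, args=(f"endpoint-{i}",))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```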
Problem Severity
This is an issue we are facing more and more as new people try to use Meltano. At the moment we get weekly reports of it from inside the team alone, so I expect this to multiply as more people start to use Meltano.
From my personal (@iroussos) experience, most of the cases are due to wrong Target credentials or a wrong Target setup:
- A user makes a mistake while setting up a Target, either because how to set up the proper Postgres credentials is not explained adequately in our documentation, because there is confusion between using the Docker image vs using a local Postgres, or just because a Target is not set up correctly (in the case of extracting to CSV).
- The Target fails after a while (during the initial schema generation phase), but the tap keeps on generating log entries and fails as described in the previous section.
Either way, those errors will make users think that it's Meltano's fault, while most of the time it's just missing credentials or a DB that is not properly set up. The proper user experience would instead be to notify the user that the credentials are wrong as soon as possible.
On a related note, this whole situation is made worse by the vast number of log messages `meltano elt` generates. We should maybe raise the log level to error or something similar.
Options to solve this problem
If we want to completely solve this issue, we have to find a way to stop execution when an error arises in a Tap or a Target and present a meaningful error to the user. We already wrap the piped execution, so maybe we could handle those errors gracefully there. A rough sketch of the idea follows.
@mbergeron has implemented this very delicate and tricky part of Meltano, so he should be able to add more in-depth comments on how difficult such an effort would be.
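For illustration only, here is a minimal sketch of that first option, assuming we shell out to the tap and target directly (the commands and config paths are placeholders, not Meltano's actual invocation): watch the target and kill the tap as soon as the target dies, instead of letting the tap run on until it hits a broken pipe.

```python
import subprocess

tap = subprocess.Popen(
    ["tap-gitlab", "--config", "tap.json"],  # placeholder invocation
    stdout=subprocess.PIPE,
)
target = subprocess.Popen(
    ["target-postgres", "--config", "target.json"],  # placeholder
    stdin=tap.stdout,
    stderr=subprocess.PIPE,
)
tap.stdout.close()

# Wait on the target: in the failure scenarios above it exits first.
_, target_err = target.communicate()
if target.returncode != 0:
    tap.terminate()  # stop the tap before it floods the log
    tap.wait()
    raise RuntimeError(
        f"Loader failed (exit {target.returncode}): {target_err.decode().strip()}"
    )
tap.wait()
if tap.returncode != 0:
    raise RuntimeError(f"Extractor failed (exit {tap.returncode})")
```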
A second option would be to focus on the root cause of most problems: wrong Tap/Target setup and credentials.
That would require us to test whether the Tap and the Target work properly without running the full Extract > Load process. The tricky part is how to do that while still treating taps and targets as black boxes that follow the Singer.io Specification:
- There is no `test configuration` command at the moment for Taps and Targets. If we had such a command available, we could start by checking that the Tap can connect to the API and that the Target can connect to its Database and write data.
- For taps supporting discovery, generating the catalog could be a proper test (see the first sketch after this list), but not all APIs require proper credentials to be set in order to get the catalog, and not all taps support the discover option. Other than that, I cannot see any other way to test that a tap has been set up properly; I have seen Taps that fail the moment you start them and others that may take a while.
- Regarding the Target credentials, which are our major pain point, we could maybe write a very simple test table schema to test the Target setup. As an example, we could require that a `meltano_log` table is created in the Target Schema to hold all the logs (see the second sketch below). Just creating that table when the execution begins makes sure that we can properly connect to the Target Database, and only then would we start the proper `tap | target` execution. We could also use that table to write the execution log messages, which is something that may be required in a future release anyway.
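A hypothetical sketch of the catalog-based check for taps that support discovery; `test_tap_config` is a made-up helper, not an existing Meltano or Singer command, and it only uses the standard `--config` and `--discover` flags from the Singer.io Specification:

```python
import json
import subprocess

def test_tap_config(tap: str, config_path: str) -> bool:
    """Treat a successful `--discover` run that prints a valid catalog
    as evidence that the tap's config and credentials are usable."""
    result = subprocess.run(
        [tap, "--config", config_path, "--discover"],
        capture_output=True, text=True, timeout=60,
    )
    if result.returncode != 0:
        print(f"{tap} failed discovery:\n{result.stderr.strip()}")
        return False
    try:
        catalog = json.loads(result.stdout)
    except json.JSONDecodeError:
        print(f"{tap} did not emit a valid JSON catalog")
        return False
    return "streams" in catalog

# e.g. test_tap_config("tap-gitlab", "tap.json")
```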
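And a hypothetical sketch of the `meltano_log` idea: before the real run, feed the target a minimal Singer SCHEMA and RECORD for a `meltano_log` stream and check that it exits cleanly. A target with bad credentials should fail here, with its own readable error, instead of mid-pipeline (the command and config path are placeholders):

```python
import json
import subprocess

# A minimal Singer message stream for a `meltano_log` probe table.
messages = [
    {"type": "SCHEMA", "stream": "meltano_log",
     "schema": {"type": "object",
                "properties": {"message": {"type": "string"}}},
     "key_properties": []},
    {"type": "RECORD", "stream": "meltano_log",
     "record": {"message": "meltano connectivity check"}},
]
probe = "\n".join(json.dumps(m) for m in messages) + "\n"

result = subprocess.run(
    ["target-postgres", "--config", "target.json"],  # placeholders
    input=probe, capture_output=True, text=True,
)
if result.returncode != 0:
    raise RuntimeError(
        f"Target setup check failed:\n{result.stderr.strip()}"
    )
```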
Technical feasibility and scope of solutions
To Be Discussed