Error for designs with no direct connection between consequent join tables
What is the current bug behavior?
If a design is provided with the join tables not directly connected to each other, an error is raised while the join graph is computed from the design definition.
Assume the following design:
{
version = 1
name = gitlab
connection = postgres_db
label = Gitlab
designs {
gitlab_issues_per_label {
label = Gitlab Issues per Label
from = gitlab_issues_per_label
description = Gitlab issues per Label
joins {
gitlab_projects {
label = Projects
sql_on = "gitlab_issues_per_label.project_id = gitlab_projects.project_id"
relationship = many_to_one
}
gitlab_project_milestones {
label = Milestones
sql_on = "gitlab_issues_per_label.milestone_id = gitlab_project_milestones.milestone_id"
relationship = many_to_one
}
}
}
}
}
The following error is raised by Meltano UI:
File ".../unweighted.py", line 294, in _bidirectional_pred_succ
raise nx.NetworkXNoPath("No path between %s and %s." % (source, target))
networkx.exception.NetworkXNoPath: No path between gitlab_projects and gitlab_project_milestones.
In this design, gitlab_projects
and gitlab_project_milestones
are joined with the base table gitlab_issues_per_label
and not between the two of them.
This is not only a valid schema definition for analytical purposes, but a prominent one, as most analytical schemas follow a star or a snowflake schema with multiple dimensional tables (that store the dimensions or categories to group by) joined to a base denormalized table (that stores the values to be aggregated).
Please note that there is no error during compilation of the .m5o
files, but at run time, when a user:
- Selects the design with this issue (
Gitlab > gitlab_issues_per_label
in this case) - Selects an aggregate from the base table and at least one categorical column from each table
What is the expected correct behavior?
Designs with join tables not directly connected to each other should be available.
Extra care should be taken care for snowflake schemas, i.e. schemas where a dimensional table is split in more than one tables in order to properly define hierarchies. Like, for example, when defining date or location hierarchies.
We should not assume that tables will be joined to the base table or the previous one. Tables should be able to join to any other table previously defined. As long as it is joined to a table that comes before it, the design is a valid one.
This is a valid schema that we should be able to define in a design and use:
Steps to reproduce
You can use the topic and models defined in the model-gitlab project to test and reproduce this issue. The gitlab_issues_per_label
design I used as an example is part of the topic defined for the gitlab model.
Checkout the branch I work on model-gitlab!1 (merged) as this depends on fixing this issue.
If you want to check it with real data, then create a new project and add the gitlab token in our Vault to extract data from the Gitlab API and check the transformed results in the analytics
schema:
meltano init tap-gitlab-project --no_usage_stats
cd tap-gitlab-project
# update .env
source .env
meltano elt tap-gitlab target-postgres --transform run
# Copy the models from the model-gitlab project to the /model/ directory of your project
meltano ui
.env required:
export FLASK_ENV=development
export SQLITE_DATABASE=meltano
export PG_DATABASE=warehouse
export PG_PASSWORD=
export PG_USERNAME=
export PG_ADDRESS=localhost
export PG_PORT=5432
export PG_SCHEMA='tap_gitlab'
export GITLAB_API_TOKEN='OUR GITLAB TOKEN'
export GITLAB_API_GROUPS='meltano'
export GITLAB_API_PROJECTS=''
export GITLAB_API_START_DATE='2018-01-01T00:00:00Z'
Relevant logs and/or screenshots
Full log:
[ Thread-60][INFO][2019-03-26 16:09:28,676] 127.0.0.1 - - [26/Mar/2019 16:09:28] "POST /api/v1/sql/get/gitlab/gitlab_issues_per_label HTTP/1.1" 500 -
Traceback (most recent call last):
... ...
File "/home/iroussos/work/meltano/src/meltano/api/controllers/sql.py", line 125, in get_sql
File "/home/iroussos/work/meltano/src/meltano/core/sql/sql_utils.py", line 108, in get_sql
hda_helper = HyperDimensionalAggregatesHelper(design, incoming_json, schema)
File "/home/iroussos/work/meltano/src/meltano/core/sql/hyper_dimensional_aggregates_helper.py", line 16, in __init__
definition=incoming_json, design_helper=design_helper, schema=schema
File "/home/iroussos/work/meltano/src/meltano/core/sql/base.py", line 337, in __init__
self.parse_definition(definition)
File "/home/iroussos/work/meltano/src/meltano/core/sql/base.py", line 428, in parse_definition
design_graph, last_table_added, table.name
File "/home/iroussos/work/meltano/src/meltano/core/sql/base.py", line 459, in joins_for_table
return nx.shortest_path(design_graph, source=source_table, target=target_table)
File "/home/iroussos/work/meltano/venv/lib/python3.6/site-packages/networkx/algorithms/shortest_paths/generic.py", line 170, in shortest_path
paths = nx.bidirectional_shortest_path(G, source, target)
File "/home/iroussos/work/meltano/venv/lib/python3.6/site-packages/networkx/algorithms/shortest_paths/unweighted.py", line 226, in bidirectional_shortest_path
results = _bidirectional_pred_succ(G, source, target)
File "/home/iroussos/work/meltano/venv/lib/python3.6/site-packages/networkx/algorithms/shortest_paths/unweighted.py", line 294, in _bidirectional_pred_succ
raise nx.NetworkXNoPath("No path between %s and %s." % (source, target))
networkx.exception.NetworkXNoPath: No path between gitlab_projects and gitlab_project_milestones.
Possible fixes
(If you can, link to the line of code that might be responsible for the problem or suggest a fix)
Further regression test
(Ensure we automatically catch similar issues in the future)
-
Write additional adequate test cases and submit test results -
Test results should be reviewed by a person from the team
/cc @valexieva