Fix missing entity check bug
Merge Request Checklist
-
Link to an issue -
Include the proposed fix or feature -
Include and update tests for the modified code -
Include a documentation change -
Add an entry in the Unreleased section of the CHANGELOG
Fixes the issue documented in #138 (closed) wherein loading the spec results in some incorrectly parsed table names, leading to Missing Entity Errors that prevent Permifrost from running.
The first commit implements a test on the method that contains the bug (group_table_by_database). Here's the test failure before implementing a fix:
_________________________________________________________________________ TestEntityGenerator.test_tables_by_database _________________________________________________________________________
self = <test_entities.TestEntityGenerator object at 0xffffa8d43af0>
entities = {'database_refs': {'demodb', 'demodb2', 'demodb3', 'demodb4', 'demodb5', 'demodb6'}, 'databases': {'demo', 'shared_demo'}, 'integration_refs': {'demo'}, 'integrations': {'demo'}, ...}
def test_tables_by_database(self, entities):
"""
Expect all <database>.<schema>.<table> references from the roles section in
snowflake_spec_reference_roles.yml spec
"""
expected = {
"demodb6": ["demodb6.demo_schema.demo_table"],
"demodb": ["demodb.*.*"],
"demodb5": [
"demodb5.demo_schema.demo_table",
"demodb5.demo_schema.demo_table_2",
],
"demodb2": ["demodb2.*.*"],
}
> assert entities["tables_by_database"] == expected
E AssertionError: assert {'demodb': ['....demo_table']} == {'demodb': ['....demo_table']}
E Omitting 3 identical items, use -vv to show
E Differing items:
E {'demodb5': ['demodb5.demo_schema.demo_table_2', 'd', 'e', 'm', 'o', 'd', ...]} != {'demodb5': ['demodb5.demo_schema.demo_table', 'demodb5.demo_schema.demo_table_2']}
E Full diff:
E {
E 'demodb': ['demodb.*.*'],
E 'demodb2': ['demodb2.*.*'],...
E
E ...Full output truncated (37 lines hidden), use '-vv' to show
tests/permifrost/test_entities.py:90: AssertionError
Note that this bug is particularly sticky because it only appears when more than one table is defined in the spec for a given database. If only a single table is declared for each database, the code never gets to the line (under Else:) where the bug is present.
Regarding the fix:
In group_table_by_database, we construct a dictionary where the keys are the database names and the values are lists of table names in that database. This is where extend is garbling the table names. Instead of switching extend to append, however, I have the function group the tables into sets, not lists. Using lists there is likely fragile because they preserve order.
Last thing: I may yet come back to test_load_spec_with_edge_case_tables in test_snowflake_spec_loader.py because I'm still not quite convinced it's doing what we want it to, but that might not be necessary to get this through.
- closes #138 (closed)