......@@ -34,6 +34,33 @@ tests:
- python3 setup.py sdist
- pip3 install $(ls dist/*)
- rm -rf d3m d3m.*
- pip3 freeze
- pip3 check
- python3 ./run_tests.py
tests_oldest_dependencies:
stage: build
image: registry.gitlab.com/datadrivendiscovery/images/testing:ubuntu-bionic-python36
services:
- docker:dind
before_script:
- docker info
variables:
GIT_SUBMODULE_STRATEGY: recursive
DOCKER_HOST: tcp://docker:2375
script:
- sed -i "s/__version__ = 'devel'/__version__ = '1970.1.1'/" d3m/__init__.py
- python3 setup.py sdist
- pip3 install $(ls dist/*)
- rm -rf d3m d3m.*
- ./oldest_dependencies.py | pip3 install --upgrade --upgrade-strategy only-if-needed --exists-action w --requirement /dev/stdin
- pip3 freeze
- pip3 check
- python3 ./run_tests.py
benchmarks:
......@@ -68,6 +95,7 @@ test_datasets:
before_script:
- docker info
- git -C /data/datasets show -s
- git -C /data/datasets_public show -s
variables:
GIT_SUBMODULE_STRATEGY: recursive
......@@ -78,7 +106,10 @@ test_datasets:
- python3 setup.py sdist
- pip3 install $(ls dist/*)
- rm -rf d3m
- find /data/datasets -name datasetDoc.json -print0 | xargs -0 python3 -m d3m dataset describe --continue --list --time
- find /data/datasets -name datasetDoc.json -print0 | xargs -r -0 python3 -m d3m dataset describe --continue --list --time
- find /data/datasets_public -name datasetDoc.json -print0 | xargs -r -0 python3 -m d3m dataset describe --continue --list --time
- find /data/datasets -name problemDoc.json -print0 | xargs -r -0 python3 -m d3m problem describe --continue --list --no-print
- find /data/datasets_public -name problemDoc.json -print0 | xargs -r -0 python3 -m d3m problem describe --continue --list --no-print
run_check:
stage: build
......
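The `oldest_dependencies.py` script invoked by the `tests_oldest_dependencies` job above is not part of this diff. As a hedged illustration only, here is a minimal sketch of what such a script might do, assuming it pins every `>=` lower bound of the installed `d3m` distribution to an exact version so that `pip3 install --requirement /dev/stdin` installs the oldest supported dependencies; the real script may work differently.

```python
#!/usr/bin/env python3
# Hypothetical sketch only: emit a requirements list that pins each dependency
# of the installed "d3m" package to its lowest allowed version.

import pkg_resources

for requirement in pkg_resources.get_distribution('d3m').requires():
    for operator, version in requirement.specs:
        if operator == '>=':
            # Pin the dependency to its declared lower bound, e.g. "numpy==1.15.4".
            print('{name}=={version}'.format(name=requirement.project_name, version=version))
```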
## v2020.1.9
### Enhancements
* Support for D3M datasets with minimal metadata.
[#429](https://gitlab.com/datadrivendiscovery/d3m/issues/429)
[!327](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/327)
* Pipeline runs (and in fact many other input documents) can now be passed gzipped
to all CLI commands. The filename has to end with `.gz` for decompression to happen
automatically (see the sketch after this list).
[#420](https://gitlab.com/datadrivendiscovery/d3m/issues/420)
[!317](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/317)
* Made problem descriptions more readable again when converted to JSON.
[!316](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/316)
* Improved YAML handling to use the faster C implementation (libyaml) when it is available.
[#416](https://gitlab.com/datadrivendiscovery/d3m/issues/416)
[!313](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/313)
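A minimal sketch of the gzip support described above (see the bullet on gzipped input documents), using the `utils.open` helper and `Pipeline.from_yaml` that appear later in this diff; the pipeline path below is hypothetical.

```python
from d3m import utils
from d3m.metadata import pipeline as pipeline_module

# Because the filename ends with ".gz", "utils.open" decompresses it transparently
# (see the "FileType"/"open" changes to "d3m/utils.py" further down in this diff).
with utils.open('pipelines/example-pipeline.yml.gz', 'r', encoding='utf8') as pipeline_file:
    pipeline = pipeline_module.Pipeline.from_yaml(pipeline_file)

print(pipeline.id)
```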
### Bugfixes
* Fixed the error message shown when not all required CLI arguments are passed to the runtime.
[#411](https://gitlab.com/datadrivendiscovery/d3m/issues/411)
[!319](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/319)
* Removed the assumption that all successful pipeline run steps have method calls.
[#422](https://gitlab.com/datadrivendiscovery/d3m/issues/422)
[!320](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/320)
* Fixed "Duplicate problem ID" warnings when multiple problem descriptions
have the same problem ID, but in fact they are the same problem description.
No warning is made in this case anymore.
[#417](https://gitlab.com/datadrivendiscovery/d3m/issues/417)
[!321](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/321)
* Fixed the use of D3M container types in recent versions of Keras and TensorFlow.
[#426](https://gitlab.com/datadrivendiscovery/d3m/issues/426)
[!322](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/322)
* Fixed `validate` CLI commands to work on YAML files.
### Other
* Updated upper bounds of core dependencies to latest available versions.
[#427](https://gitlab.com/datadrivendiscovery/d3m/issues/427)
[!325](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/325)
* Refactored default pipeline run parser implementation to make it
easier to provide alternative dataset and problem resolvers.
[!314](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/314)
* Moved local test primitives out into the [`tests/data` git submodule](https://gitlab.com/datadrivendiscovery/tests-data).
Now all test primitives are in one place.
[#254](https://gitlab.com/datadrivendiscovery/d3m/issues/254)
[!312](https://gitlab.com/datadrivendiscovery/d3m/merge_requests/312)
## v2019.11.10
* Support for version 4.0.0 of the D3M dataset schema has been added.
......
......@@ -27,7 +27,7 @@
* Commit with message `Version bump for development.`
* `git push`
* After a release:
* Create a new [`core` Docker image](https://gitlab.com/datadrivendiscovery/images) for the release.
* Create new [`core` and `primitives` Docker images](https://gitlab.com/datadrivendiscovery/images) for the release.
* Add new release to the [primitives index repository](https://gitlab.com/datadrivendiscovery/primitives/blob/master/HOW_TO_MANAGE.md).
If there is a need for a patch version to fix a released version on the same day,
......
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
include README.md
include LICENSE.txt
......@@ -9,6 +9,7 @@ This package works with Python 3.6+ and pip 19+. You need to have the following
* `libssl-dev`
* `libcurl4-openssl-dev`
* `libyaml-dev`
You can install the latest stable version from [PyPI](https://pypi.org/):
......
__version__ = '2019.11.10'
__version__ = '2020.1.9'
__description__ = 'Common code for D3M project'
__author__ = 'DARPA D3M Program'
......
......@@ -212,6 +212,11 @@ def problem_configure_parser(parser: argparse.ArgumentParser, *, skip_arguments:
'-o', '--output', type=utils.FileType('w', encoding='utf8'), default='-', action='store',
help="save output to a file, default stdout",
)
if 'no_print' not in skip_arguments:
describe_parser.add_argument(
'--no-print', default=False, action='store_true',
help="do not print JSON",
)
if 'problems' not in skip_arguments:
describe_parser.add_argument(
'problems', metavar='PROBLEM', nargs='+',
......@@ -477,7 +482,7 @@ def runtime_handler(
'{command} requires either -u/--input-run or the following arguments: {manual_arguments}'.format(
command=arguments.runtime_command,
manual_arguments=', '.join(
name for (name, dest) in manual_config if getattr(arguments, dest, None) is not None
name for (name, dest) in manual_config
),
)
)
......
......@@ -1335,13 +1335,6 @@ class D3MDatasetLoader(Loader):
column_names = None
data_path = os.path.join(dataset_path, data_resource['resPath'])
expected_names = None
if data_resource.get('columns', None):
expected_names = []
for i, column in enumerate(data_resource['columns']):
assert i == column['colIndex'], (i, column['colIndex'])
expected_names.append(column['colName'])
if utils.is_sequence(data_resource['resFormat']) and len(data_resource['resFormat']) == 1:
resource_format = data_resource['resFormat'][0]
elif isinstance(data_resource['resFormat'], typing.Mapping) and len(data_resource['resFormat']) == 1:
......@@ -1352,7 +1345,6 @@ class D3MDatasetLoader(Loader):
if resource_format in ['text/csv', 'text/csv+gzip']:
data = pandas.read_csv(
data_path,
usecols=expected_names,
# We do not want to do any conversion of values at this point.
# This should be done by primitives later on.
dtype=str,
......@@ -1368,12 +1360,6 @@ class D3MDatasetLoader(Loader):
column_names = list(data.columns)
if expected_names is not None and expected_names != column_names:
raise ValueError("Mismatch between column names in data {column_names} and expected names {expected_names}.".format(
column_names=column_names,
expected_names=expected_names,
))
if hash is not None:
# We include both the filename and the content.
# TODO: Currently we read the file twice, once for reading and once to compute digest. Could we do it in one pass? Would it make it faster?
......@@ -1424,60 +1410,76 @@ class D3MDatasetLoader(Loader):
'structural_type': str,
})
if expected_names is not None:
for i, column in enumerate(data_resource['columns']):
column_metadata = self._get_column_metadata(column)
if 'https://metadata.datadrivendiscovery.org/types/Boundary' in column_metadata['semantic_types'] and 'boundary_for' not in column_metadata:
# Let's reconstruct for which column this is a boundary: currently
# this seems to be the first non-boundary column before this one.
for column_index in range(i - 1, 0, -1):
column_semantic_types = metadata.query((data_resource['resID'], metadata_base.ALL_ELEMENTS, column_index)).get('semantic_types', ())
if 'https://metadata.datadrivendiscovery.org/types/Boundary' not in column_semantic_types:
column_metadata['boundary_for'] = {
'resource_id': data_resource['resID'],
'column_index': column_index,
}
break
metadata_columns = {}
for column in data_resource.get('columns', []):
metadata_columns[column['colIndex']] = column
for i in range(len(column_names)):
if i in metadata_columns:
if column_names[i] != metadata_columns[i]['colName']:
raise ValueError("Mismatch between column name in data '{data_name}' and column name in metadata '{metadata_name}'.".format(
data_name=column_names[i],
metadata_name=metadata_columns[i]['colName'],
))
column_metadata = self._get_column_metadata(metadata_columns[i])
else:
column_metadata = {
'semantic_types': [
D3M_COLUMN_TYPE_CONSTANTS_TO_SEMANTIC_TYPES['unknown'],
],
}
metadata = metadata.update((data_resource['resID'], metadata_base.ALL_ELEMENTS, i), column_metadata)
current_boundary_start = None
current_boundary_list: typing.Tuple[str, ...] = None
column_index = 0
while column_index < len(data_resource['columns']):
column_semantic_types = metadata.query((data_resource['resID'], metadata_base.ALL_ELEMENTS, column_index)).get('semantic_types', ())
if is_simple_boundary(column_semantic_types):
# Let's reconstruct which type of a boundary this is. Heuristic is simple.
# If there are two boundary columns next to each other, it is an interval.
if current_boundary_start is None:
assert current_boundary_list is None
count = 1
for next_column_index in range(column_index + 1, len(data_resource['columns'])):
if is_simple_boundary(metadata.query((data_resource['resID'], metadata_base.ALL_ELEMENTS, next_column_index)).get('semantic_types', ())):
count += 1
else:
break
if count == 2:
current_boundary_start = column_index
current_boundary_list = INTERVAL_SEMANTIC_TYPES
if 'https://metadata.datadrivendiscovery.org/types/Boundary' in column_metadata['semantic_types'] and 'boundary_for' not in column_metadata:
# Let's reconstruct for which column this is a boundary: currently
# this seems to be the first non-boundary column before this one.
for column_index in range(i - 1, 0, -1):
column_semantic_types = metadata.query((data_resource['resID'], metadata_base.ALL_ELEMENTS, column_index)).get('semantic_types', ())
if 'https://metadata.datadrivendiscovery.org/types/Boundary' not in column_semantic_types:
column_metadata['boundary_for'] = {
'resource_id': data_resource['resID'],
'column_index': column_index,
}
break
metadata = metadata.update((data_resource['resID'], metadata_base.ALL_ELEMENTS, i), column_metadata)
current_boundary_start = None
current_boundary_list: typing.Tuple[str, ...] = None
column_index = 0
while column_index < len(column_names):
column_semantic_types = metadata.query((data_resource['resID'], metadata_base.ALL_ELEMENTS, column_index)).get('semantic_types', ())
if is_simple_boundary(column_semantic_types):
# Let's reconstruct which type of a boundary this is. Heuristic is simple.
# If there are two boundary columns next to each other, it is an interval.
if current_boundary_start is None:
assert current_boundary_list is None
count = 1
for next_column_index in range(column_index + 1, len(column_names)):
if is_simple_boundary(metadata.query((data_resource['resID'], metadata_base.ALL_ELEMENTS, next_column_index)).get('semantic_types', ())):
count += 1
else:
# Unsupported group of boundary columns, let's skip them all.
column_index += count
continue
break
column_semantic_types = column_semantic_types + (current_boundary_list[column_index - current_boundary_start],)
metadata = metadata.update((data_resource['resID'], metadata_base.ALL_ELEMENTS, column_index), {
'semantic_types': column_semantic_types,
})
if count == 2:
current_boundary_start = column_index
current_boundary_list = INTERVAL_SEMANTIC_TYPES
else:
# Unsupported group of boundary columns, let's skip them all.
column_index += count
continue
column_semantic_types = column_semantic_types + (current_boundary_list[column_index - current_boundary_start],)
metadata = metadata.update((data_resource['resID'], metadata_base.ALL_ELEMENTS, column_index), {
'semantic_types': column_semantic_types,
})
if column_index - current_boundary_start + 1 == len(current_boundary_list):
current_boundary_start = None
current_boundary_list = None
if column_index - current_boundary_start + 1 == len(current_boundary_list):
current_boundary_start = None
current_boundary_list = None
column_index += 1
column_index += 1
return data, metadata
......@@ -3142,7 +3144,17 @@ if pyarrow_lib is not None:
)
def get_dataset(dataset_uri: str, *, compute_digest: ComputeDigest = ComputeDigest.ONLY_IF_MISSING, strict_digest: bool = False, lazy: bool = False) -> Dataset:
def get_dataset(
dataset_uri: str, *, compute_digest: ComputeDigest = ComputeDigest.ONLY_IF_MISSING,
strict_digest: bool = False, lazy: bool = False,
datasets_dir: str = None, handle_score_split: bool = True,
) -> Dataset:
if datasets_dir is not None:
datasets, problem_descriptions = utils.get_datasets_and_problems(datasets_dir, handle_score_split)
if dataset_uri in datasets:
dataset_uri = datasets[dataset_uri]
dataset_uri = utils.fix_uri(dataset_uri)
return Dataset.load(dataset_uri, compute_digest=compute_digest, strict_digest=strict_digest, lazy=lazy)
......
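A hedged usage sketch for the extended `get_dataset` above: when `datasets_dir` is given, a plain dataset ID can be passed instead of a URI and is resolved by walking that directory for `datasetDoc.json` files. The directory path and dataset ID below are hypothetical.

```python
from d3m.container.dataset import get_dataset

# Resolve a dataset by ID against a local datasets directory
# (hypothetical path and ID; any tree containing "datasetDoc.json" files works).
dataset = get_dataset('iris_dataset_TRAIN', datasets_dir='/data/datasets')

print(dataset.metadata.query(()).get('id'))
```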
......@@ -151,9 +151,6 @@ class matrix(numpy.matrix, ndarray):
ndarray.__array_finalize__(self, obj)
typing.Sequence.register(numpy.ndarray) # type: ignore
def ndarray_serializer(obj: ndarray) -> dict:
data = {
'metadata': obj.metadata,
......
......@@ -161,7 +161,7 @@ class DataFrame(pandas.DataFrame):
self.metadata = state['metadata']
def to_csv(self, path_or_buf: typing.Union[typing.TextIO, str] = None, sep: str = ',', na_rep: str = '',
def to_csv(self, path_or_buf: typing.Union[typing.IO[typing.Any], str] = None, sep: str = ',', na_rep: str = '',
float_format: str = None, columns: typing.Sequence = None, header: typing.Union[bool, typing.Sequence[str]] = True,
index: bool = False, **kwargs: typing.Any) -> typing.Optional[str]:
"""
......@@ -458,9 +458,6 @@ class DataFrame(pandas.DataFrame):
return self.append_columns(right, use_right_metadata=use_right_metadata)
typing.Sequence.register(pandas.DataFrame) # type: ignore
def dataframe_serializer(obj: DataFrame) -> dict:
data = {
'metadata': obj.metadata,
......
......@@ -1441,7 +1441,7 @@ class Metadata:
return output
def pretty_print(self, selector: Selector = None, handle: typing.TextIO = None, _level: int = 0) -> None:
def pretty_print(self, selector: Selector = None, handle: typing.IO[typing.Any] = None, _level: int = 0) -> None:
"""
Pretty-prints metadata to ``handle``, or `sys.stdout` if not specified.
......
......@@ -135,17 +135,18 @@ class Resolver:
def _from_file(self, pipeline_description: typing.Dict) -> 'typing.Optional[Pipeline]':
for path in self.pipeline_search_paths:
pipeline_path = os.path.join(path, '{pipeline_id}.json'.format(pipeline_id=pipeline_description['id']))
try:
with open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
return self.get_pipeline_class().from_json(pipeline_file, resolver=self, strict_digest=self.strict_digest)
except FileNotFoundError:
pass
for extension in ['json', 'json.gz']:
pipeline_path = os.path.join(path, '{pipeline_id}.{extension}'.format(pipeline_id=pipeline_description['id'], extension=extension))
try:
with utils.open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
return self.get_pipeline_class().from_json(pipeline_file, resolver=self, strict_digest=self.strict_digest)
except FileNotFoundError:
pass
for extension in ['yml', 'yaml']:
for extension in ['yml', 'yaml', 'yml.gz', 'yaml.gz']:
pipeline_path = os.path.join(path, '{pipeline_id}.{extension}'.format(pipeline_id=pipeline_description['id'], extension=extension))
try:
with open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
with utils.open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
return self.get_pipeline_class().from_yaml(pipeline_file, resolver=self, strict_digest=self.strict_digest)
except FileNotFoundError:
pass
......@@ -1770,14 +1771,14 @@ class Pipeline:
return outputs_types
@classmethod
def from_yaml(cls: typing.Type[P], string_or_file: typing.Union[str, typing.TextIO], *, resolver: typing.Optional[Resolver] = None,
def from_yaml(cls: typing.Type[P], string_or_file: typing.Union[str, typing.IO[typing.Any]], *, resolver: typing.Optional[Resolver] = None,
strict_digest: bool = False) -> P:
description = yaml.safe_load(string_or_file)
description = utils.yaml_load(string_or_file)
return cls.from_json_structure(description, resolver=resolver, strict_digest=strict_digest)
@classmethod
def from_json(cls: typing.Type[P], string_or_file: typing.Union[str, typing.TextIO], *, resolver: typing.Optional[Resolver] = None,
def from_json(cls: typing.Type[P], string_or_file: typing.Union[str, typing.IO[typing.Any]], *, resolver: typing.Optional[Resolver] = None,
strict_digest: bool = False) -> P:
if isinstance(string_or_file, str):
description = json.loads(string_or_file)
......@@ -1947,7 +1948,7 @@ class Pipeline:
return pipeline_description
def to_json(self, file: typing.TextIO = None, *, nest_subpipelines: bool = False, canonical: bool = False, **kwargs: typing.Any) -> typing.Optional[str]:
def to_json(self, file: typing.IO[typing.Any] = None, *, nest_subpipelines: bool = False, canonical: bool = False, **kwargs: typing.Any) -> typing.Optional[str]:
obj = self.to_json_structure(nest_subpipelines=nest_subpipelines, canonical=canonical)
if 'allow_nan' not in kwargs:
......@@ -1959,10 +1960,10 @@ class Pipeline:
json.dump(obj, file, **kwargs)
return None
def to_yaml(self, file: typing.TextIO = None, *, nest_subpipelines: bool = False, canonical: bool = False, **kwargs: typing.Any) -> typing.Optional[str]:
def to_yaml(self, file: typing.IO[typing.Any] = None, *, nest_subpipelines: bool = False, canonical: bool = False, **kwargs: typing.Any) -> typing.Optional[str]:
obj = self.to_json_structure(nest_subpipelines=nest_subpipelines, canonical=canonical)
return yaml.safe_dump(obj, stream=file, **kwargs)
return utils.yaml_dump(obj, stream=file, **kwargs)
def equals(self, pipeline: P, *, strict_order: bool = False, only_control_hyperparams: bool = False) -> bool:
"""
......@@ -2793,7 +2794,7 @@ def get_pipeline(
)
if os.path.exists(pipeline_path):
with open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
with utils.open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
if pipeline_path.endswith('.yml') or pipeline_path.endswith('.yaml'):
return pipeline_class.from_yaml(pipeline_file, resolver=resolver, strict_digest=strict_digest)
elif pipeline_path.endswith('.json'):
......@@ -2833,14 +2834,14 @@ def describe_handler(
print(pipeline_path, file=output_stream)
try:
with open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
if pipeline_path.endswith('.yml') or pipeline_path.endswith('.yaml'):
with utils.open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
if pipeline_path.endswith('.yml') or pipeline_path.endswith('.yaml') or pipeline_path.endswith('.yml.gz') or pipeline_path.endswith('.yaml.gz'):
pipeline = pipeline_class.from_yaml(
pipeline_file,
resolver=resolver,
strict_digest=getattr(arguments, 'strict_digest', False),
)
elif pipeline_path.endswith('.json'):
elif pipeline_path.endswith('.json') or pipeline_path.endswith('.json.gz'):
pipeline = pipeline_class.from_json(
pipeline_file,
resolver=resolver,
......
......@@ -68,8 +68,7 @@ class User(dict):
return dumper.represent_dict(data)
yaml.Dumper.add_representer(User, User._yaml_representer)
yaml.SafeDumper.add_representer(User, User._yaml_representer)
utils.yaml_add_representer(User, User._yaml_representer)
class PipelineRunStep:
......@@ -412,13 +411,13 @@ class PipelineRun:
return json_structure
def to_yaml(self, file: typing.TextIO, *, appending: bool = False, **kwargs: typing.Any) -> typing.Optional[str]:
def to_yaml(self, file: typing.IO[typing.Any], *, appending: bool = False, **kwargs: typing.Any) -> typing.Optional[str]:
obj = self.to_json_structure()
if appending and 'explicit_start' not in kwargs:
kwargs['explicit_start'] = True
return yaml.safe_dump(obj, stream=file, **kwargs)
return utils.yaml_dump(obj, stream=file, **kwargs)
def add_input_dataset(self, dataset: container.Dataset) -> None:
metadata = dataset.metadata.query(())
......@@ -1160,8 +1159,7 @@ class RuntimeEnvironment(dict):
return dumper.represent_dict(data)
yaml.Dumper.add_representer(RuntimeEnvironment, RuntimeEnvironment._yaml_representer)
yaml.SafeDumper.add_representer(RuntimeEnvironment, RuntimeEnvironment._yaml_representer)
utils.yaml_add_representer(RuntimeEnvironment, RuntimeEnvironment._yaml_representer)
def _validate_pipeline_run_random_seeds(pipeline_run: typing.Dict) -> None:
......@@ -1215,10 +1213,7 @@ def _validate_pipeline_run_timestamps(pipeline_run: typing.Dict, parent_start: d
def _validate_success_step(step: typing.Dict) -> None:
if step['type'] == metadata_base.PipelineStepType.PRIMITIVE:
if 'method_calls' not in step:
raise exceptions.InvalidPipelineRunError("Successful primitive step with missing method calls.")
for method_call in step['method_calls']:
for method_call in step.get('method_calls', []):
if method_call['status']['state'] != metadata_base.PipelineRunStatusState.SUCCESS:
raise exceptions.InvalidPipelineRunError(
"Step with '{expected_status}' status has a method call with '{status}' status".format(
......@@ -1523,14 +1518,16 @@ def pipeline_run_handler(arguments: argparse.Namespace) -> None:
print(pipeline_run_path)
try:
with open(pipeline_run_path, 'r', encoding='utf8') as pipeline_run_file:
if pipeline_run_path.endswith('.yml') or pipeline_run_path.endswith('.yaml'):
pipeline_runs: typing.Iterator[typing.Dict] = yaml.safe_load_all(pipeline_run_file)
with utils.open(pipeline_run_path, 'r', encoding='utf8') as pipeline_run_file:
if pipeline_run_path.endswith('.yml') or pipeline_run_path.endswith('.yaml') or pipeline_run_path.endswith('.yml.gz') or pipeline_run_path.endswith('.yaml.gz'):
pipeline_runs: typing.Iterator[typing.Dict] = utils.yaml_load_all(pipeline_run_file)
else:
pipeline_runs = typing.cast(typing.Iterator[typing.Dict], [json.load(pipeline_run_file)])
for pipeline_run in pipeline_runs:
validate_pipeline_run(pipeline_run)
# It has to be inside context manager because YAML loader returns a lazy iterator
# which requires an open file while iterating.
for pipeline_run in pipeline_runs:
validate_pipeline_run(pipeline_run)
except Exception as error:
if getattr(arguments, 'continue', False):
traceback.print_exc(file=sys.stdout)
......@@ -1555,14 +1552,14 @@ def pipeline_handler(
print(pipeline_path)
try:
with open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
if pipeline_path.endswith('.yml') or pipeline_path.endswith('.yaml'):
pipelines: typing.Iterator[typing.Dict] = yaml.safe_load_all(pipeline_file)
with utils.open(pipeline_path, 'r', encoding='utf8') as pipeline_file:
if pipeline_path.endswith('.yml') or pipeline_path.endswith('.yaml') or pipeline_path.endswith('.yml.gz') or pipeline_path.endswith('.yaml.gz'):
pipelines: typing.Iterator[typing.Dict] = utils.yaml_load_all(pipeline_file)
else:
pipelines = typing.cast(typing.Iterator[typing.Dict], [json.load(pipeline_file)])
for pipeline in pipelines:
validate_pipeline(pipeline)
for pipeline in pipelines:
validate_pipeline(pipeline)
except Exception as error:
if getattr(arguments, 'continue', False):
traceback.print_exc(file=sys.stdout)
......@@ -1584,14 +1581,14 @@ def problem_handler(arguments: argparse.Namespace, *, problem_resolver: typing.C
print(problem_path)
try:
with open(problem_path, 'r', encoding='utf8') as problem_file:
if problem_path.endswith('.yml') or problem_path.endswith('.yaml'):
problems: typing.Iterator[typing.Dict] = yaml.safe_load_all(problem_file)
with utils.open(problem_path, 'r', encoding='utf8') as problem_file:
if problem_path.endswith('.yml') or problem_path.endswith('.yaml') or problem_path.endswith('.yml.gz') or problem_path.endswith('.yaml.gz'):
problems: typing.Iterator[typing.Dict] = utils.yaml_load_all(problem_file)
else:
problems = typing.cast(typing.Iterator[typing.Dict], [json.load(problem_file)])
for problem in problems:
validate_problem(problem)
for problem in problems:
validate_problem(problem)
except Exception as error:
if getattr(arguments, 'continue', False):
traceback.print_exc(file=sys.stdout)
......@@ -1613,14 +1610,14 @@ def dataset_handler(arguments: argparse.Namespace, *, dataset_resolver: typing.C
print(dataset_path)
try:
with open(dataset_path, 'r', encoding='utf8') as dataset_file:
with utils.open(dataset_path, 'r', encoding='utf8') as dataset_file:
if dataset_path.endswith('.yml') or dataset_path.endswith('.yaml'):
datasets: typing.Iterator[typing.Dict] = yaml.safe_load_all(dataset_file)
datasets: typing.Iterator[typing.Dict] = utils.yaml_load_all(dataset_file)
else:
datasets = typing.cast(typing.Iterator[typing.Dict], [json.load(dataset_file)])
for dataset in datasets:
validate_dataset(dataset)
for dataset in datasets:
validate_dataset(dataset)
except Exception as error:
if getattr(arguments, 'continue', False):
traceback.print_exc(file=sys.stdout)
......
......@@ -51,6 +51,7 @@ PRIMITIVE_NAMES = [
'cluster_curve_fitting_kmeans',
'collaborative_filtering_link_prediction',
'collaborative_filtering_parser',
'collaborative_filtering_recommender_system',
'column_fold',
'column_map',
'column_parser',
......@@ -61,10 +62,10 @@ PRIMITIVE_NAMES = [
'concat',
'conditioner',
'construct_predictions',
'convolutional_neural_net',
'convolution_1d',
'convolution_2d',
'convolution_3d',
'convolutional_neural_net',
'corex_continuous',
'corex_supervised',
'corex_text',
......@@ -96,18 +97,20 @@ PRIMITIVE_NAMES = [
'diagonal_mvn',
'dict_vectorizer',
'dimension_selection',
'doc_2_vec',
'discriminative_structured_classifier',
'do_nothing',
'do_nothing_for_dataset',
'doc_2_vec',
'dropout',
'dummy',
'discriminative_structured_classifier',
'echo_ib',
'echo_linear',
'edge_list_to_graph',
'ekss',
'elastic_net',
'encoder',
'enrich_dates',
'ensemble_forest',
'ensemble_voting',
'extra_trees',
'extract_columns',
......@@ -127,12 +130,14 @@ PRIMITIVE_NAMES = [
'gaussian_random_projection',
'gcn_mixhop',
'general_relational_dataset',
'generative_structured_classifier',
'generic_univariate_select',
'geocoding',
'glda',
'global_average_pooling_1d',
'global_average_pooling_2d',
'global_average_pooling_3d',
'global_structure_imputer',
'gmm',
'go_dec',
'goturn',
......@@ -146,8 +151,6 @@ PRIMITIVE_NAMES = [
'grasta_masked',
'greedy_imputation',
'grouse',
'global_structure_imputer',
'generative_structured_classifier',
'hdbscan',
'hdp',
'hinge',
......@@ -157,6 +160,7 @@ PRIMITIVE_NAMES = [
'ibex',
'identity_parentchildren_markov_blanket',
'image_reader',
'image_transfer',
'image_transfer_learning_transformer',
'imputer',
'inceptionV3_image_feature',
......@@ -165,14 +169,14 @@ PRIMITIVE_NAMES = [
'iterative_labeling',
'iterative_regression_imputation',
'joint_mutual_information',
'k_means',
'k_neighbors',
'kernel_pca',
'kernel_ridge',
'kfold_dataset_split',
'kfold_timeseries_split',
'kullback_leibler_divergence',
'k_means',
'kss',
'kullback_leibler_divergence',
'l1_low_rank',
'label_decoder',
'label_encoder',
......@@ -193,23 +197,26 @@ PRIMITIVE_NAMES = [
'link_prediction',
'list_to_dataframe',
'list_to_ndarray',
'logcosh',
'load_edgelist',
'load_graphs',
'load_single_graph',
'local_structure_imputer',
'log_mel_spectrogram',
'logcosh',
'logistic_regression',
'loss',
'lstm',
'lupi_svm',
'local_structure_imputer',
'max_abs_scaler',
'max_pooling_1d',
'max_pooling_2d',
'max_pooling_3d',
'max_abs_scaler',
'mean_absolute_error',
'mean_absolute_percentage_error',
'mean_squared_error',
'mean_squared_logarithmic_error',
'mean_baseline',
'mean_imputation',
'mean_squared_error',
'mean_squared_logarithmic_error',
'metafeature_extractor',
'mice_imputation',
'min_max_scaler',
......@@ -239,6 +246,7 @@ PRIMITIVE_NAMES = [
'ordinal_encoder',
'out_of_sample_adjacency_spectral_embedding',
'out_of_sample_laplacian_spectral_embedding',
'output_dataframe',
'owl_regression',
'pass_to_ranks',
'passive_aggressive',
......@@ -271,6 +279,7 @@ PRIMITIVE_NAMES = [
'remove_semantic_types',
'rename_duplicate_name',
'replace_semantic_types',
'replace_singletons',
'resnet50_image_feature',
'resnext101_kinetics_video_features',
'retina_net',
......@@ -327,17 +336,19 @@ PRIMITIVE_NAMES = [
'tensor_machines_binary_classification',
'tensor_machines_regularized_least_squares',
'term_filter',
'text_classifier',
'text_encoder',
'text_reader',
'text_summarization',
'text_to_vocabulary',
'text_tokenizer',
'tfidf_vectorizer',
'time_series_to_list',
'time_series_formatter',
'time_series_neighbours',
'time_series_reshaper',
'topic_vectorizer',
'time_series_to_list',
'to_numeric',
'topic_vectorizer',
'train_score_dataset_split',
'trecs',
'tree_augmented_naive_bayes',
......@@ -356,8 +367,8 @@ PRIMITIVE_NAMES = [
'vertical_concatenate',
'vgg16',
'vgg16_image_feature',
'voter',
'video_reader',
'voter',
'voting',
'wikifier',
'word_2_vec',
......
......@@ -416,6 +416,29 @@ TASK_TYPE_TO_KEYWORDS_MAP: typing.Dict[typing.Optional[str], typing.List] = {
'overlapping': [TaskKeyword.OVERLAPPING], # type: ignore
'nonOverlapping': [TaskKeyword.NONOVERLAPPING], # type: ignore
}
JSON_TASK_TYPE_TO_KEYWORDS_MAP: typing.Dict[typing.Optional[str], typing.List] = {
None: [],
'CLASSIFICATION': [TaskKeyword.CLASSIFICATION], # type: ignore
'REGRESSION': [TaskKeyword.REGRESSION], # type: ignore
'CLUSTERING': [TaskKeyword.CLUSTERING], # type: ignore
'LINK_PREDICTION': [TaskKeyword.LINK_PREDICTION], # type: ignore
'VERTEX_CLASSIFICATION': [TaskKeyword.VERTEX_CLASSIFICATION], # type: ignore
'VERTEX_NOMINATION': [TaskKeyword.VERTEX_NOMINATION], # type: ignore
'COMMUNITY_DETECTION': [TaskKeyword.COMMUNITY_DETECTION], # type: ignore
'GRAPH_MATCHING': [TaskKeyword.GRAPH_MATCHING], # type: ignore
'TIME_SERIES_FORECASTING': [TaskKeyword.TIME_SERIES, TaskKeyword.FORECASTING], # type: ignore
'COLLABORATIVE_FILTERING': [TaskKeyword.COLLABORATIVE_FILTERING], # type: ignore
'OBJECT_DETECTION': [TaskKeyword.OBJECT_DETECTION], # type: ignore
'SEMISUPERVISED_CLASSIFICATION': [TaskKeyword.SEMISUPERVISED, TaskKeyword.CLASSIFICATION], # type: ignore
'SEMISUPERVISED_REGRESSION': [TaskKeyword.SEMISUPERVISED, TaskKeyword.REGRESSION], # type: ignore
'BINARY': [TaskKeyword.BINARY], # type: ignore
'MULTICLASS': [TaskKeyword.MULTICLASS], # type: ignore
'MULTILABEL': [TaskKeyword.MULTILABEL], # type: ignore
'UNIVARIATE': [TaskKeyword.UNIVARIATE], # type: ignore
'MULTIVARIATE': [TaskKeyword.MULTIVARIATE], # type: ignore
'OVERLAPPING': [TaskKeyword.OVERLAPPING], # type: ignore
'nonoverlapping': [TaskKeyword.NONOVERLAPPING], # type: ignore
}
class Loader(metaclass=utils.AbstractMetaclass):
......@@ -812,13 +835,55 @@ class Problem(dict):
return cls(structure, strict_digest=strict_digest)
def to_json_structure(self, *, canonical: bool = False) -> typing.Dict:
"""
For standard enumerations we map them to strings. For non-standard problem
description fields we use a reversible conversion.
"""
PROBLEM_SCHEMA_VALIDATOR.validate(self)
return utils.to_reversible_json_structure(self.to_simple_structure(canonical=canonical))
simple_structure = copy.deepcopy(self.to_simple_structure(canonical=canonical))
if simple_structure.get('problem', {}).get('task_keywords', []):
simple_structure['problem']['task_keywords'] = [task_keyword.name for task_keyword in simple_structure['problem']['task_keywords']]
if simple_structure.get('problem', {}).get('performance_metrics', []):
for metric in simple_structure['problem']['performance_metrics']:
metric['metric'] = metric['metric'].name
return utils.to_reversible_json_structure(simple_structure)
@classmethod
def from_json_structure(cls: typing.Type[P], structure: typing.Dict, *, strict_digest: bool = False) -> P:
return cls.from_simple_structure(utils.from_reversible_json_structure(structure), strict_digest=strict_digest)
"""
For standard enumerations we map them from strings. For non-standard problem
description fields we use a reversible conversion.
"""
simple_structure = utils.from_reversible_json_structure(structure)
# Legacy (before v4.0.0).
legacy_task_keywords: typing.List[TaskKeyword] = [] # type: ignore
legacy_task_keywords += JSON_TASK_TYPE_TO_KEYWORDS_MAP[simple_structure.get('problem', {}).get('task_type', None)]
legacy_task_keywords += JSON_TASK_TYPE_TO_KEYWORDS_MAP[simple_structure.get('problem', {}).get('task_subtype', None)]
if legacy_task_keywords:
# We know "problem" field exists.
simple_structure['problem']['task_keywords'] = simple_structure['problem'].get('task_keywords', []) + legacy_task_keywords
if simple_structure.get('problem', {}).get('task_keywords', []):
mapped_task_keywords = []
for task_keyword in simple_structure['problem']['task_keywords']:
if isinstance(task_keyword, str):
mapped_task_keywords.append(TaskKeyword[task_keyword])
else:
mapped_task_keywords.append(task_keyword)
simple_structure['problem']['task_keywords'] = mapped_task_keywords
if simple_structure.get('problem', {}).get('performance_metrics', []):
for metric in simple_structure['problem']['performance_metrics']:
if isinstance(metric['metric'], str):
metric['metric'] = PerformanceMetric[metric['metric']]
return cls.from_simple_structure(simple_structure, strict_digest=strict_digest)
@deprecate.function(message="use Problem.load class method instead")
......@@ -866,7 +931,13 @@ if pyarrow_lib is not None:
)
def get_problem(problem_uri: str, *, strict_digest: bool = False) -> Problem:
def get_problem(problem_uri: str, *, strict_digest: bool = False, datasets_dir: str = None, handle_score_split: bool = True) -> Problem:
if datasets_dir is not None:
datasets, problem_descriptions = utils.get_datasets_and_problems(datasets_dir, handle_score_split)
if problem_uri in problem_descriptions:
problem_uri = problem_descriptions[problem_uri]
problem_uri = utils.fix_uri(problem_uri)
return Problem.load(problem_uri, strict_digest=strict_digest)
......@@ -902,7 +973,7 @@ def describe_handler(
if getattr(arguments, 'print', False):
pprint.pprint(problem_description, stream=output_stream)
else:
elif not getattr(arguments, 'no_print', False):
json.dump(
problem_description,
output_stream,
......
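An illustrative sketch of the legacy handling added to `Problem.from_json_structure` above: pre-v4.0.0 JSON structures carry `task_type`/`task_subtype` fields, which are translated into task keywords through `JSON_TASK_TYPE_TO_KEYWORDS_MAP`. It assumes the map and `TaskKeyword` are importable from `d3m.metadata.problem` as added in this diff; the legacy values are hypothetical.

```python
from d3m.metadata.problem import JSON_TASK_TYPE_TO_KEYWORDS_MAP, TaskKeyword

# Hypothetical legacy (pre-v4.0.0) problem fields.
legacy_problem = {'task_type': 'TIME_SERIES_FORECASTING', 'task_subtype': 'UNIVARIATE'}

# The same concatenation as in "from_json_structure" above.
task_keywords = (
    JSON_TASK_TYPE_TO_KEYWORDS_MAP[legacy_problem.get('task_type', None)]
    + JSON_TASK_TYPE_TO_KEYWORDS_MAP[legacy_problem.get('task_subtype', None)]
)

assert task_keywords == [TaskKeyword.TIME_SERIES, TaskKeyword.FORECASTING, TaskKeyword.UNIVARIATE]
```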
import argparse
import inspect
import json
import functools
import logging
import os
import os.path
......@@ -15,7 +16,6 @@ import uuid
import jsonschema # type: ignore
import frozendict # type: ignore
import pandas # type: ignore
import yaml # type: ignore
from d3m import container, deprecate, exceptions, types, utils
from d3m.container import dataset as dataset_module
......@@ -1525,106 +1525,24 @@ get_problem = deprecate.function(message="use d3m.metadata.problem.get_problem i
get_pipeline = deprecate.function(message="use d3m.metadata.pipeline.get_pipeline instead")(pipeline_module.get_pipeline)
# TODO: Do not traverse the datasets directory every time.
@deprecate.function(message="use d3m.utils.get_datasets_and_problems instead")
def _get_datasets_and_problems(
datasets_dir: str, handle_score_split: bool = True,
) -> typing.Tuple[typing.Dict[str, str], typing.Dict[str, str]]:
datasets: typing.Dict[str, str] = {}
problem_descriptions: typing.Dict[str, str] = {}
for dirpath, dirnames, filenames in os.walk(datasets_dir, followlinks=True):
if 'datasetDoc.json' in filenames:
# Do not traverse further (to not parse "datasetDoc.json" or "problemDoc.json" if they
# exists in raw data filename).
dirnames[:] = []
dataset_path = os.path.join(os.path.abspath(dirpath), 'datasetDoc.json')
try:
with open(dataset_path, 'r', encoding='utf8') as dataset_file:
dataset_doc = json.load(dataset_file)
dataset_id = dataset_doc['about']['datasetID']
# Handle a special case for SCORE dataset splits (those which have "targets.csv" file).
# They are the same as TEST dataset splits, but we present them differently, so that
# SCORE dataset splits have targets as part of data. Because of this we also update
# corresponding dataset ID.
# See: https://gitlab.com/datadrivendiscovery/d3m/issues/176
if handle_score_split and os.path.exists(os.path.join(dirpath, '..', 'targets.csv')) and dataset_id.endswith('_TEST'):
dataset_id = dataset_id[:-5] + '_SCORE'
if dataset_id in datasets:
logger.warning(
"Duplicate dataset ID '%(dataset_id)s': '%(old_dataset)s' and '%(dataset)s'", {
'dataset_id': dataset_id,
'dataset': dataset_path,
'old_dataset': datasets[dataset_id],
},
)
else:
datasets[dataset_id] = dataset_path
except (ValueError, KeyError):
logger.exception(
"Unable to read dataset '%(dataset)s'.", {
'dataset': dataset_path,
},
)
if 'problemDoc.json' in filenames:
# We continue traversing further in this case.
problem_path = os.path.join(os.path.abspath(dirpath), 'problemDoc.json')
try:
with open(problem_path, 'r', encoding='utf8') as problem_file:
problem_doc = json.load(problem_file)
problem_id = problem_doc['about']['problemID']
# Handle a special case for SCORE dataset splits (those which have "targets.csv" file).
# They are the same as TEST dataset splits, but we present them differently, so that
# SCORE dataset splits have targets as part of data. Because of this we also update
# corresponding problem ID.
# See: https://gitlab.com/datadrivendiscovery/d3m/issues/176
if handle_score_split and os.path.exists(os.path.join(dirpath, '..', 'targets.csv')) and problem_id.endswith('_TEST'):
problem_id = problem_id[:-5] + '_SCORE'
# Also update dataset references.
for data in problem_doc.get('inputs', {}).get('data', []):
if data['datasetID'].endswith('_TEST'):
data['datasetID'] = data['datasetID'][:-5] + '_SCORE'
if problem_id in problem_descriptions:
logger.warning(
"Duplicate problem ID '%(problem_id)s': '%(old_problem)s' and '%(problem)s'", {
'problem_id': problem_id,
'problem': problem_path,
'old_problem': problem_descriptions[problem_id],
},
)
else:
problem_descriptions[problem_id] = problem_path
except (ValueError, KeyError):
logger.exception(
"Unable to read problem description '%(problem)s'.", {
'problem': problem_path,
},
)
return datasets, problem_descriptions
return utils.get_datasets_and_problems(datasets_dir, handle_score_split)
def _resolve_pipeline_run_datasets(
datasets: typing.Dict[str, str], pipeline_run_datasets: typing.Sequence[typing.Dict[str, str]], *,
pipeline_run_datasets: typing.Sequence[typing.Dict[str, str]], *,
dataset_resolver: typing.Callable, compute_digest: dataset_module.ComputeDigest, strict_digest: bool,
strict_resolving: bool = False,
strict_resolving: bool, datasets_dir: typing.Optional[str], handle_score_split: bool,
) -> typing.Sequence[container.Dataset]:
resolved_datasets = []
for dataset_reference in pipeline_run_datasets:
resolved_dataset = dataset_resolver(
datasets[dataset_reference['id']], compute_digest=compute_digest, strict_digest=strict_digest,
dataset_reference['id'], compute_digest=compute_digest, strict_digest=strict_digest,
datasets_dir=datasets_dir, handle_score_split=handle_score_split,
)
resolved_dataset_digest = resolved_dataset.metadata.query(()).get('digest', None)
......@@ -1656,15 +1574,12 @@ def _resolve_pipeline_run_datasets(
def parse_pipeline_run(
pipeline_run_file: typing.TextIO, pipeline_search_paths: typing.Sequence[str], datasets_dir: str, *,
pipeline_run_file: typing.IO[typing.Any], pipeline_search_paths: typing.Sequence[str], datasets_dir: typing.Optional[str], *,
pipeline_resolver: typing.Callable = None, dataset_resolver: typing.Callable = None,
problem_resolver: typing.Callable = None, strict_resolving: bool = False,
compute_digest: dataset_module.ComputeDigest = dataset_module.ComputeDigest.ONLY_IF_MISSING,
strict_digest: bool = False, handle_score_split: bool = True,
) -> typing.Sequence[typing.Dict[str, typing.Any]]:
if datasets_dir is None:
raise exceptions.InvalidArgumentValueError("Datasets directory has to be provided to resolve pipeline run files.")
if pipeline_resolver is None:
pipeline_resolver = pipeline_module.get_pipeline
if dataset_resolver is None:
......@@ -1672,13 +1587,11 @@ def parse_pipeline_run(
if problem_resolver is None:
problem_resolver = problem.get_problem
pipeline_runs = list(yaml.safe_load_all(pipeline_run_file))
pipeline_runs = list(utils.yaml_load_all(pipeline_run_file))
if not pipeline_runs:
raise exceptions.InvalidArgumentValueError("Pipeline run file must contain at least one pipeline run document.")
datasets, problem_descriptions = _get_datasets_and_problems(datasets_dir, handle_score_split)
for pipeline_run in pipeline_runs:
try:
pipeline_run_module.validate_pipeline_run(pipeline_run)
......@@ -1686,12 +1599,19 @@ def parse_pipeline_run(
raise exceptions.InvalidArgumentValueError("Provided pipeline run document is not valid.") from error
pipeline_run['datasets'] = _resolve_pipeline_run_datasets(
datasets, pipeline_run['datasets'], dataset_resolver=dataset_resolver,
compute_digest=compute_digest, strict_digest=strict_digest, strict_resolving=strict_resolving,
pipeline_run['datasets'], dataset_resolver=dataset_resolver,
compute_digest=compute_digest, strict_digest=strict_digest,
strict_resolving=strict_resolving, datasets_dir=datasets_dir,
handle_score_split=handle_score_split,
)
if 'problem' in pipeline_run:
pipeline_run['problem'] = problem_resolver(problem_descriptions[pipeline_run['problem']['id']], strict_digest=strict_digest)
pipeline_run['problem'] = problem_resolver(
pipeline_run['problem']['id'],
strict_digest=strict_digest,
datasets_dir=datasets_dir,
handle_score_split=handle_score_split,
)
pipeline_run['pipeline'] = pipeline_resolver(
pipeline_run['pipeline']['id'],
......@@ -1712,8 +1632,9 @@ def parse_pipeline_run(
if 'datasets' in pipeline_run['run']['scoring']:
assert 'data_preparation' not in pipeline_run['run']
pipeline_run['run']['scoring']['datasets'] = _resolve_pipeline_run_datasets(
datasets, pipeline_run['run']['scoring']['datasets'], dataset_resolver=dataset_resolver,
pipeline_run['run']['scoring']['datasets'], dataset_resolver=dataset_resolver,
compute_digest=compute_digest, strict_digest=strict_digest, strict_resolving=strict_resolving,
datasets_dir=datasets_dir, handle_score_split=handle_score_split,
)
if pipeline_run['run']['scoring']['pipeline']['id'] == DEFAULT_SCORING_PIPELINE_ID:
......@@ -1884,7 +1805,7 @@ def combine_pipeline_runs(
@deprecate.function(message="use extended DataFrame.to_csv method instead")
def export_dataframe(dataframe: container.DataFrame, output_file: typing.TextIO = None) -> typing.Optional[str]:
def export_dataframe(dataframe: container.DataFrame, output_file: typing.IO[typing.Any] = None) -> typing.Optional[str]:
return dataframe.to_csv(output_file)
......
......@@ -7,6 +7,8 @@ import copy
import datetime
import decimal
import enum
import functools
import gzip
import hashlib
import inspect
import json
......@@ -38,6 +40,13 @@ from pytypes import type_util # type: ignore
import d3m
from d3m import deprecate, exceptions
if yaml.__with_libyaml__:
from yaml import CSafeLoader as SafeLoader, CSafeDumper as SafeDumper # type: ignore
else:
from yaml import SafeLoader, SafeDumper
logger = logging.getLogger(__name__)
NONE_TYPE: typing.Type = type(None)
# Only types without elements can be listed here. If they are elements, we have to
......@@ -233,6 +242,58 @@ def type_to_str(obj: type) -> str:
return type_util.type_str(obj, assumed_globals={}, update_assumed_globals=False)
yaml_warning_issued = False
def yaml_dump_all(documents: typing.Sequence[typing.Any], stream: typing.IO[typing.Any] = None, **kwds: typing.Any) -> typing.Any:
global yaml_warning_issued
if not yaml.__with_libyaml__ and not yaml_warning_issued:
yaml_warning_issued = True
logger.warning("cyaml not found, using a slower pure Python YAML implementation.")
return yaml.dump_all(documents, stream, Dumper=SafeDumper, **kwds)
def yaml_dump(data: typing.Any, stream: typing.IO[typing.Any] = None, **kwds: typing.Any) -> typing.Any:
global yaml_warning_issued
if not yaml.__with_libyaml__ and not yaml_warning_issued:
yaml_warning_issued = True
logger.warning("cyaml not found, using a slower pure Python YAML implementation.")
return yaml.dump_all([data], stream, Dumper=SafeDumper, **kwds)
def yaml_load_all(stream: typing.Union[str, typing.IO[typing.Any]]) -> typing.Any:
global yaml_warning_issued
if not yaml.__with_libyaml__ and not yaml_warning_issued:
yaml_warning_issued = True
logger.warning("cyaml not found, using a slower pure Python YAML implementation.")
return yaml.load_all(stream, SafeLoader)
def yaml_load(stream: typing.Union[str, typing.IO[typing.Any]]) -> typing.Any:
global yaml_warning_issued
if not yaml.__with_libyaml__ and not yaml_warning_issued:
yaml_warning_issued = True
logger.warning("cyaml not found, using a slower pure Python YAML implementation.")
return yaml.load(stream, SafeLoader)
def yaml_add_representer(value_type: typing.Type, represented: typing.Callable) -> None:
yaml.Dumper.add_representer(value_type, represented)
yaml.SafeDumper.add_representer(value_type, represented)
if yaml.__with_libyaml__:
yaml.CDumper.add_representer(value_type, represented) # type: ignore
yaml.CSafeDumper.add_representer(value_type, represented) # type: ignore
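A small usage sketch of the YAML helpers above; they use the libyaml-backed `CSafeLoader`/`CSafeDumper` when available and fall back to the pure Python implementation with a one-time warning. The document contents are arbitrary.

```python
from d3m import utils

document = {'id': 'example', 'values': [1, 2, 3]}

# Round-trip a single document through the safe (C-backed, when available) implementation.
serialized = utils.yaml_dump(document)
assert utils.yaml_load(serialized) == document

# Multiple documents in one stream, as used for pipeline run files.
serialized_all = utils.yaml_dump_all([document, document])
assert list(utils.yaml_load_all(serialized_all)) == [document, document]
```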
class EnumMeta(enum.EnumMeta):
def __new__(mcls, class_name, bases, namespace, **kwargs): # type: ignore
def __reduce_ex__(self: typing.Any, proto: int) -> typing.Any:
......@@ -246,7 +307,7 @@ class EnumMeta(enum.EnumMeta):
def yaml_representer(dumper, data): # type: ignore
return yaml.ScalarNode('tag:yaml.org,2002:str', data.name)
yaml.add_representer(cls, yaml_representer)
yaml_add_representer(cls, yaml_representer)
return cls
......@@ -346,7 +407,7 @@ def make_immutable_copy(obj: typing.Any) -> typing.Any:
return type(obj)(make_immutable_copy(o) for o in obj)
if isinstance(obj, pandas.DataFrame):
return tuple(make_immutable_copy(o) for o in obj.itertuples(index=False, name=None))
if isinstance(obj, typing.Sequence):
if isinstance(obj, (typing.Sequence, numpy.ndarray)):
return tuple(make_immutable_copy(o) for o in obj)
raise TypeError("{obj} is not known to be immutable.".format(obj=obj))
......@@ -560,7 +621,7 @@ class JsonEncoder(json.JSONEncoder):
return sorted(o, key=str)
if isinstance(o, pandas.DataFrame):
return list(o.itertuples(index=False, name=None))
if isinstance(o, typing.Sequence):
if isinstance(o, (typing.Sequence, numpy.ndarray)):
return list(o)
if isinstance(o, decimal.Decimal):
return float(o)
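# A hedged usage sketch ("json" and "decimal" are already imported by this module):
# passing this class as the encoder converts values such as "decimal.Decimal" as shown above.
json.dumps({'score': decimal.Decimal('0.75')}, cls=JsonEncoder)  # '{"score": 0.75}'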
......@@ -1095,8 +1156,7 @@ def register_yaml_representers() -> None:
]
for representer in representers:
yaml.Dumper.add_representer(representer['type'], representer['representer'])
yaml.SafeDumper.add_representer(representer['type'], representer['representer'])
yaml_add_representer(representer['type'], representer['representer'])
def matches_structural_type(source_structural_type: type, target_structural_type: typing.Union[str, type]) -> bool:
......@@ -1262,8 +1322,23 @@ def log_once(logger: logging.Logger, level: int, msg: str, *args: typing.Any, **
# A workaround to also handle binary stdin/stdout.
# See: https://gitlab.com/datadrivendiscovery/d3m/issues/353
# See: https://bugs.python.org/issue14156
# Moreover, if the filename ends in ".gz", the file is decompressed as well.
class FileType(argparse.FileType):
def __call__(self, string: str) -> typing.IO[typing.Any]:
if string.endswith('.gz'):
# "gzip.open" has as a default binary mode,
# but we want text mode as a default.
if 't' not in self._mode and 'b' not in self._mode: # type: ignore
mode = self._mode + 't' # type: ignore
else:
mode = self._mode # type: ignore
try:
return gzip.open(string, mode=mode, encoding=self._encoding, errors=self._errors) # type: ignore
except OSError as error:
message = argparse._("can't open '%s': %s") # type: ignore
raise argparse.ArgumentTypeError(message % (string, error))
handle = super().__call__(string)
if string == '-' and 'b' in self._mode: # type: ignore
......@@ -1272,6 +1347,16 @@ class FileType(argparse.FileType):
return handle
def open(file: str, mode: str = 'r', buffering: int = -1, encoding: str = None, errors: str = None) -> typing.IO[typing.Any]:
try:
return FileType(mode=mode, bufsize=buffering, encoding=encoding, errors=errors)(file)
except argparse.ArgumentTypeError as error:
original_error = error.__context__
# Raised outside of the except clause so that the ArgumentTypeError is not
# attached as context to the original error.
raise original_error
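# A minimal usage sketch (the path below is hypothetical): "open" transparently
# decompresses files ending in ".gz" and otherwise behaves like the built-in open,
# so the same code can read plain and gzipped documents.
with open('pipeline_run.yml.gz', 'r', encoding='utf8') as run_file:
    for document in yaml_load_all(run_file):
        print(document.get('id'))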
def filter_local_location_uris(doc: typing.Dict, *, empty_value: typing.Any = None) -> None:
if 'location_uris' in doc:
location_uris = []
......@@ -1365,3 +1450,101 @@ def json_structure_equals(
else:
return obj1 == obj2
@functools.lru_cache()
def get_datasets_and_problems(
datasets_dir: str, handle_score_split: bool = True,
) -> typing.Tuple[typing.Dict[str, str], typing.Dict[str, str]]:
if datasets_dir is None:
raise exceptions.InvalidArgumentValueError("Datasets directory has to be provided.")
datasets: typing.Dict[str, str] = {}
problem_descriptions: typing.Dict[str, str] = {}
problem_description_contents: typing.Dict[str, typing.Dict] = {}
for dirpath, dirnames, filenames in os.walk(datasets_dir, followlinks=True):
if 'datasetDoc.json' in filenames:
# Do not traverse further (so that we do not parse "datasetDoc.json" or "problemDoc.json"
# if they exist among raw data files).
dirnames[:] = []
dataset_path = os.path.join(os.path.abspath(dirpath), 'datasetDoc.json')
try:
with open(dataset_path, 'r', encoding='utf8') as dataset_file:
dataset_doc = json.load(dataset_file)
dataset_id = dataset_doc['about']['datasetID']
# Handle a special case for SCORE dataset splits (those which have a "targets.csv" file).
# They are the same as TEST dataset splits, but we present them differently, so that
# SCORE dataset splits have targets as part of the data. Because of this we also update
# the corresponding dataset ID.
# See: https://gitlab.com/datadrivendiscovery/d3m/issues/176
if handle_score_split and os.path.exists(os.path.join(dirpath, '..', 'targets.csv')) and dataset_id.endswith('_TEST'):
dataset_id = dataset_id[:-5] + '_SCORE'
if dataset_id in datasets:
logger.warning(
"Duplicate dataset ID '%(dataset_id)s': '%(old_dataset)s' and '%(dataset)s'", {
'dataset_id': dataset_id,
'dataset': dataset_path,
'old_dataset': datasets[dataset_id],
},
)
else:
datasets[dataset_id] = dataset_path
except (ValueError, KeyError):
logger.exception(
"Unable to read dataset '%(dataset)s'.", {
'dataset': dataset_path,
},
)
if 'problemDoc.json' in filenames:
# We continue traversing further in this case.
problem_path = os.path.join(os.path.abspath(dirpath), 'problemDoc.json')
try:
with open(problem_path, 'r', encoding='utf8') as problem_file:
problem_doc = json.load(problem_file)
problem_id = problem_doc['about']['problemID']
# Handle a special case for SCORE dataset splits (those which have a "targets.csv" file).
# They are the same as TEST dataset splits, but we present them differently, so that
# SCORE dataset splits have targets as part of the data. Because of this we also update
# the corresponding problem ID.
# See: https://gitlab.com/datadrivendiscovery/d3m/issues/176
if handle_score_split and os.path.exists(os.path.join(dirpath, '..', 'targets.csv')) and problem_id.endswith('_TEST'):
problem_id = problem_id[:-5] + '_SCORE'
# Also update dataset references.
for data in problem_doc.get('inputs', {}).get('data', []):
if data['datasetID'].endswith('_TEST'):
data['datasetID'] = data['datasetID'][:-5] + '_SCORE'
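# Load the problem description again, unmodified by the SCORE renaming above,
# so that duplicate problem IDs can be compared by their original contents.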
with open(problem_path, 'r', encoding='utf8') as problem_file:
problem_description = json.load(problem_file)
if problem_id in problem_descriptions and problem_description != problem_description_contents[problem_id]:
logger.warning(
"Duplicate problem ID '%(problem_id)s': '%(old_problem)s' and '%(problem)s'", {
'problem_id': problem_id,
'problem': problem_path,
'old_problem': problem_descriptions[problem_id],
},
)
else:
problem_descriptions[problem_id] = problem_path
problem_description_contents[problem_id] = problem_description
except (ValueError, KeyError):
logger.exception(
"Unable to read problem description '%(problem)s'.", {
'problem': problem_path,
},
)
return datasets, problem_descriptions
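# A minimal usage sketch (the directory and dataset ID below are hypothetical):
# the returned dicts map dataset and problem IDs to paths of their
# "datasetDoc.json" and "problemDoc.json" files, respectively.
datasets, problem_descriptions = get_datasets_and_problems('/data/datasets')
print(len(datasets), len(problem_descriptions))
print(datasets.get('some_dataset_TRAIN'))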
......@@ -181,7 +181,15 @@ texinfo_documents = [
# -- Options for intersphinx extension ---------------------------------------
# Example configuration for intersphinx: refer to the Python standard library.
intersphinx_mapping = {'https://docs.python.org/': None}
intersphinx_mapping = {
'https://docs.python.org/': None,
'pandas': ('https://pandas.pydata.org/pandas-docs/stable/', None),
'numpy': ('https://docs.scipy.org/doc/numpy/', None),
#'numpy': ('https://numpydoc.readthedocs.io/en/latest/', None),
'scikit-learn': ('https://scikit-learn.org/stable/', None),
'mypy': ('https://mypy.readthedocs.io/en/stable/', None),
'setuptools': ('https://setuptools.readthedocs.io/en/latest/', None),
}
# -- Options for todo extension ----------------------------------------------
......
......@@ -4,11 +4,10 @@ Primitives discovery
Primitives D3M namespace
------------------------
The ``d3m.primitives`` module exposes all primitives under the same
The :mod:`d3m.primitives` module exposes all primitives under the same
``d3m.primitives`` namespace.
This is achieved using `Python entry
points <https://setuptools.readthedocs.io/en/latest/setuptools.html#dynamic-discovery-of-services-and-plugins>`__.
This is achieved using :ref:`Python entry points <setuptools:Entry Points>`.
Python packages containing primitives should register them and expose
them under the common namespace by adding an entry like the following to
the package's ``setup.py``:
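A minimal sketch of such an entry (``my_package`` and the
``primitive_namespace.PrimitiveName`` path are placeholders;
``PrimitiveClassName`` in ``my_module`` is the primitive class being exposed)::

    entry_points = {
        'd3m.primitives': [
            'primitive_namespace.PrimitiveName = my_package.my_module:PrimitiveClassName',
        ],
    },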
......@@ -31,8 +30,8 @@ your primitives on the system. Then your package with primitives just
has to be installed on the system and can be automatically discovered
and used by any other Python code.
**Note:** Only primitive classes are available thorough the
``d3m.primitives`` namespace, not other symbols from a source
**Note:** Only primitive classes are available through the
``d3m.primitives`` namespace, no other symbols from a source
module. In the example above, only ``PrimitiveClassName`` is
available, not other symbols inside ``my_module`` (unless they
are other classes also added to entry points).
......@@ -65,7 +64,7 @@ compatible Python Package Index), publish a package with a keyword
d3m.index API
--------------------------
The ``d3m.index`` module exposes the following Python utility functions.
The :mod:`d3m.index` module exposes the following Python utility functions.
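For example, a minimal sketch (assuming ``search`` returns primitive paths under
the ``d3m.primitives`` namespace, as described below)::

    import d3m.index

    for primitive_path in d3m.index.search():
        print(primitive_path)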
``search``
~~~~~~~~~~
......@@ -116,7 +115,7 @@ keywords.
Command line
------------
The ``d3m.index`` module also provides a command line interface by
The :mod:`d3m.index` module also provides a command line interface by
running ``python3 -m d3m index``. The following commands are currently
available.
......