d3m.primitives.regression.decision_tree.SKlearn needs to be updated to reflect inclusive/exclusive lower and upper bounds
Specifying a value of 0.0 for the min_samples_split hyper-parameter on the d3m.primitives.regression.decision_tree.SKlearn primitive, which the hyper-parameter definition treats as legitimate, yields an exception during fit. (Note: this was discovered during hyper-parameter value sweeps generated with the hyper-parameter class's sample() method.)
Here are the limitations on the min_samples_split hyper-parameter for d3m.primitives.regression.decision_tree.SKlearn:
min_samples_split = hyperparams.Union(
    configuration=OrderedDict({
        'float': hyperparams.Bounded[float](
            lower=0,
            upper=1,
            default=1.0,
            description='It\'s a percentage and `ceil(min_samples_split * n_samples)` is the minimum number of samples for each split.',
            semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'],
        ),
        'int': hyperparams.Bounded[int](
            lower=0,
            upper=None,
            default=2,
            description='Minimum number.',
            semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter'],
        )
    }),
    default='int',
    description='The minimum number of samples required to split an internal node: - If int, then consider `min_samples_split` as the minimum number. - If float, then `min_samples_split` is a percentage and `ceil(min_samples_split * n_samples)` are the minimum number of samples for each split. .. versionchanged:: 0.18 Added float values for percentages.',
    semantic_types=['https://metadata.datadrivendiscovery.org/types/TuningParameter']
)
When this primitive is added to a pipeline (d3m.primitives.regression.decision_tree.SKlearn.json) with min_samples_split set to 0.0 and fit is called, it yields the following exception:
  File "/src/d3m/d3m/runtime.py", line 941, in _do_run_step
    self._run_step(step)
  File "/src/d3m/d3m/runtime.py", line 931, in _run_step
    self._run_primitive(step)
  File "/src/d3m/d3m/runtime.py", line 839, in _run_primitive
    multi_call_result = self._call_primitive_method(primitive.fit_multi_produce, fit_multi_produce_arguments)
  File "/src/d3m/d3m/runtime.py", line 914, in _call_primitive_method
    raise error
  File "/src/d3m/d3m/runtime.py", line 910, in _call_primitive_method
    result = method(**arguments)
  File "/src/d3m/d3m/primitive_interfaces/base.py", line 529, in fit_multi_produce
    return self._fit_multi_produce(produce_methods=produce_methods, timeout=timeout, iterations=iterations, inputs=inputs, outputs=outputs)
  File "/src/d3m/d3m/primitive_interfaces/base.py", line 556, in _fit_multi_produce
    fit_result = self.fit(timeout=timeout, iterations=iterations)
  File "/src/sklearn-wrap/sklearn_wrap/SKDecisionTreeRegressor.py", line 324, in fit
    self._clf.fit(self._training_inputs, sk_training_output)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 1157, in fit
    X_idx_sorted=X_idx_sorted)
  File "/usr/local/lib/python3.6/dist-packages/sklearn/tree/tree.py", line 215, in fit
    % self.min_samples_split)
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the float 0.0
Examining sklearn's underlying implementation (tree.py, lines 210-217):
else:  # float
    if not 0. < self.min_samples_split <= 1.:
        raise ValueError("min_samples_split must be an integer "
                         "greater than 1 or a float in (0.0, 1.0]; "
                         "got the float %s"
                         % self.min_samples_split)
    min_samples_split = int(ceil(self.min_samples_split * n_samples))
    min_samples_split = max(2, min_samples_split)
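The float branch quoted above can be reproduced in isolation to see which values pass. The following is only a minimal sketch: the function name `resolve_min_samples_split` is hypothetical and not part of sklearn, and the integer branch is inferred from the wording of the error message ("an integer greater than 1") rather than taken from the excerpt.

```python
import numbers
from math import ceil


def resolve_min_samples_split(min_samples_split, n_samples):
    """Hypothetical stand-in for the validation shown in the tree.py excerpt."""
    if isinstance(min_samples_split, numbers.Integral):
        # Assumption: integers must be greater than 1, as the error message states.
        if not 2 <= min_samples_split:
            raise ValueError("min_samples_split must be an integer "
                             "greater than 1 or a float in (0.0, 1.0]; "
                             "got the integer %s" % min_samples_split)
        return min_samples_split

    # Float branch, as in the excerpt: exclusive lower bound, inclusive upper bound.
    if not 0. < min_samples_split <= 1.:
        raise ValueError("min_samples_split must be an integer "
                         "greater than 1 or a float in (0.0, 1.0]; "
                         "got the float %s" % min_samples_split)
    # A fraction is converted to a sample count, never lower than 2.
    return max(2, int(ceil(min_samples_split * n_samples)))
```

Under these assumptions, `resolve_min_samples_split(0.0, 100)` raises the same ValueError as in the traceback, while `resolve_min_samples_split(0.5, 10)` returns 5 and `resolve_min_samples_split(0.01, 10)` is clamped to 2.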
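It seems that the sklearn_wrap implementation allows values that are out of range for the underlying sklearn class. sklearn accepts floats only in (0.0, 1.0], so the lower bound of the 'float' hyper-parameter should be exclusive: 0.0 should not be allowed as a lower value. The same error message states that an integer value must be greater than 1, while the 'int' hyper-parameter declares lower=0, so that bound may need the same review.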
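To illustrate what an exclusive lower bound would mean for validation and sampling, here is a minimal sketch. It does not use the d3m API: the class `BoundedFloat` and its `lower_inclusive` / `upper_inclusive` flags are hypothetical names chosen for this example, and whether `hyperparams.Bounded` exposes equivalent arguments should be checked against the installed d3m version.

```python
import random


class BoundedFloat:
    """Hypothetical bounded float with per-endpoint inclusivity (not d3m code)."""

    def __init__(self, lower, upper, lower_inclusive=True, upper_inclusive=True):
        self.lower = lower
        self.upper = upper
        self.lower_inclusive = lower_inclusive
        self.upper_inclusive = upper_inclusive

    def contains(self, value):
        # Compare against each endpoint with or without equality, per flag.
        above = value >= self.lower if self.lower_inclusive else value > self.lower
        below = value <= self.upper if self.upper_inclusive else value < self.upper
        return above and below

    def interval(self):
        # Render the range in interval notation, e.g. "(0.0, 1.0]".
        left = "[" if self.lower_inclusive else "("
        right = "]" if self.upper_inclusive else ")"
        return "%s%s, %s%s" % (left, self.lower, self.upper, right)

    def sample(self, rng):
        # Rejection sampling: redraw if an excluded endpoint is hit.
        while True:
            value = rng.uniform(self.lower, self.upper)
            if self.contains(value):
                return value


# The float range sklearn accepts for min_samples_split: (0.0, 1.0]
min_samples_split_float = BoundedFloat(0.0, 1.0, lower_inclusive=False)
```

With this definition, `min_samples_split_float.contains(0.0)` is false and `min_samples_split_float.contains(1.0)` is true, so a sweep driven by `sample()` would never hand 0.0 to the underlying estimator.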