Commit 7d476736 authored by Mike Ryan

#11: Add Filter interface, with Select filter implementation.

parent c105a3e5
Pipeline #58005975 passed with stage in 2 minutes and 19 seconds
......@@ -41,9 +41,9 @@ If you discover any security related issues, please email `[email protected]
To set up for demoing Soong ETL:
1. Create an empty database for testing.
1. Import data/extractsource.sql to the database (table to be populated by the first demo).
1. Import data/beer.sql to the database (tables to be populated for the second demo).
1. Edit each of the `.yml` files in `config/` - where indicated, replace the sample credentials with those for the test database.
1. Import `data/extractsource.sql` to the database (table to be populated by the first demo).
1. Import `data/beer.sql` to the database (tables to be populated for the second demo).
1. Edit each of the files in `config/` - where indicated, replace the sample credentials with those for the test database.
Demo 1:
......@@ -86,5 +86,5 @@ The MIT License (MIT). Please see [License File](LICENSE.md) for more informatio
[link-scrutinizer]: https://scrutinizer-ci.com/g/soong/soong/code-structure
[link-code-quality]: https://scrutinizer-ci.com/g/soong/soong
[link-downloads]: https://packagist.org/packages/soong/soong
[link-author]: https://gitlab.com/mikeryan776
[link-author]: https://gitlab.com/mikeryan
[link-contributors]: ../../contributors
......@@ -38,6 +38,22 @@ arraytosql:
bar: description of third record
num: 38
related: 1
-
id: 12
foo: bogus
bar: we should skip this
num: -5
# Filters can be used to narrow down the raw data. In this example, we
# use a Select filter to skip bogus records.
filters:
-
class: Soong\Filter\Select
configuration:
criteria:
-
- foo
- <>
- bogus
transform:
# @todo: have 'configuration' and 'property' keys.
# Configuration option: automatically copy source record to data record so
......
......@@ -8,6 +8,11 @@ This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.htm
## [Unreleased]
### Added
- The `Filter` interface has been added, to determine whether a DataRecord should be processed.
- The `Select` filter has been added, allowing for filtering by comparing DataRecord properties to values using PHP comparison operators.
- The `--select` option has been added to the `migrate` command, allowing ad-hoc filtering of extracted data at runtime.
## [0.5.3] - 2019-04-12
### Changed
......
# Contributing
Contributions are **welcome** and will be fully **credited**. There's still a lot of refinement to be done to Soong - this is your opportunity to get involved with a new framework (and community) on the ground floor! As mentioned above, the plan is ultimately to break out components into small well-contained libraries - these will be excellent opportunities to get your feet wet maintaining your own open-source project.
Contributions are **welcome** and will be fully **credited**. There's still a lot of refinement to be done to Soong - this is your opportunity to get involved with a new framework (and community) on the ground floor! As mentioned above, the plan is ultimately to break out components into small well-contained libraries - these will be excellent opportunities to get your feet wet maintaining your own open-source project. Mike Ryan will be happy to help mentor new contributors.
There's plenty of work already identified in [the GitLab issue queue](https://gitlab.com/soongetl/soong/issues). Feel free to browse, ask questions, and offer your own insights - or, if you have a migration itch you'd like to scratch and don't see an existing issue, open a new one.
......@@ -12,13 +12,13 @@ There's plenty of work already identified in [the Gitlab issue queue](https://gi
1. Develop your solution locally. Be sure to:
- Make sure your changes are fully tested (see below).
- Make sure your changes are fully documented (see below).
- Follow the [PSR-2 Coding Standard](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md). Check the code style with `$ composer check-style` and fix it with `$ composer fix-style`.
- Follow the [PSR-2 Coding Standard](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md). Check the code style with `$ composer check-style` - many issues can be automatically fixed with `$ composer fix-style`. The only complaints you should see from `check-style` are long lines in tests.
1. Make sure each individual commit in your pull request is meaningful. If you had to make multiple intermediate commits while developing, please [squash them](http://www.git-scm.com/book/en/v2/Git-Tools-Rewriting-History#Changing-Multiple-Commit-Messages) before submitting.
1. Commits should reference the issue number - e.g., a commit for [Add community docs up front](https://gitlab.com/soongetl/soong/issues/51) might have the commit message "#51: Expand community documentation and move to docs directory.".
1. On GitLab, create a merge request and submit it.
## Tests
Automated tests are critical, especially when code is changing rapidly. They help ensure that any changes made don't produce any unexpected consequences, and give confidence that a new piece of code does what it's expected to do. In the Soong `tests` directory, you'll find existing tests laid out in parallel with the `src` directory. Of particular note is `tests/Contracts` - while interfaces can't be tested (since they don't do anything worth testing), we do provide base classes here which you should extend for the tests of your components - these will give you testing that your components meet the documented expectations of the interfaces, so in writing tests you can focus on the specific features added by your own code.
Automated tests are critical, especially when code is changing rapidly. They help ensure that changes don't produce unexpected consequences, and give confidence that a new piece of code does what it's expected to do. In the Soong `tests` directory, you'll find existing tests laid out in parallel with the `src` directory. Of particular note is `tests/Contracts` - while interfaces can't be tested (since they don't do anything to test), we do provide base classes here which you should extend for the tests of your components. These base classes verify that your components meet the documented expectations of the interfaces, so in writing tests you can focus on the specific features added by your own code.
To run the test suite locally:
......@@ -29,5 +29,5 @@ $ composer test
## Documentation
* Classes and methods are to be fully documented in comment blocks - these are used to automatically generate the API Reference section of the [online documentation](https://soong-etl.readthedocs.io/).
* Add any non-trivial changes you've made to docs/CHANGELOG.md.
* Review `README.md` and any `.md` files under `docs` to see if any changes are missing.
* Add any non-trivial changes you've made to the [CHANGELOG](CHANGELOG.md).
* Review `README.md` and any `.md` files under `docs` to see if any changes need to be made there.
......@@ -2,22 +2,27 @@
This is the API reference documentation for Soong, generated from the code using [Doxygen](http://doxygen.nl/).
At the moment, configuration is represented as a simple keyed array. We anticipate [using an outside library][44ab6025] to provide configuration handling services before long.
All components other than `DataProperty` and `DataRecord` take a keyed configuration array as their single constructor argument. The component interfaces all inherit from [ConfigurableComponent][d086dd58], and at the moment all concrete configurable component classes inherit from [OptionsResolverComponent][f40a8c73] which is based on [Symfony OptionsResolver][ca87104b]. All components using this base class must implement [optionDefinitions()][43def4f7] to define the configuration options they accept.
[d086dd58]: interface_soong_1_1_contracts_1_1_configuration_1_1_configurable_component.html "ConfigurableComponent"
[f40a8c73]: class_soong_1_1_configuration_1_1_options_resolver_component.html "OptionsResolverComponent"
[ca87104b]: https://symfony.com/doc/current/components/options_resolver.html "Symfony OptionsResolver"
[43def4f7]: class_soong_1_1_configuration_1_1_options_resolver_component.html#aa3271447a235ee4e580066710a6f35f9 "optionDefinitions"
As an ETL framework, the key components of Soong are of course:
- [Extractors][1b68bb10]: Extractors read data from a source data store and via `extract*()` methods produce iterators to deliver one record at a time as a `DataRecord` instance. They accept static configuration to determine where and how to access the source data, and runtime options to control what records to process on a given invocation. Being able to tell how many source records are available for migration is very helpful, although on occasion there may be data sources where this is impossible (or at least very slow) - therefore, countability is not required by `ExtractorInterface`. Most extractors will want to implement `\Countable` (a `CountableExtractorBase` class is provided which should be a good starting point for most extractors).
- [Extractors][1b68bb10]: Extractors read data from a source data store and via `extract*()` methods produce iterators to deliver one record at a time as a `DataRecord` instance. They accept configuration to determine where and how to access the source data, including filters (see below) to control what records to process on a given invocation. Being able to tell how many source records are available for migration is very helpful, although on occasion there may be data sources where this is impossible (or at least very slow) - therefore, countability is not required by `Extractor`. Most extractors will want to implement `\Countable` (a `CountableExtractorBase` class is provided which should be a good starting point for most extractors).
- [Transformers][f8e7b6dc]: A Transformer class accepts a value (usually a property from an extractor-produced record) and produces a new value.
- [Loaders][d4c501b1]: Loaders accept one `DataRecord` instance at a time and load the data it contains into a destination as configured. Note that not all destinations may permit deleting loaded data (e.g., a loader could be used to output a CSV file). The deletion capability (used by rollback operations) should be moved to a separate interface.
[1b68bb10]: interface_soong_1_1_contracts_1_1_extractor_1_1_extractor.html "Extractors"
[f8e7b6dc]: interface_soong_1_1_contracts_1_1_transformer_1_1_transformer.html "Transformers"
[d4c501b1]: interface_soong_1_1_contracts_1_1_loader_1_1_loader.html "Loaders"
[1b68bb10]: interface_soong_1_1_contracts_1_1_extractor_1_1_extractor.html "Extractor"
[f8e7b6dc]: interface_soong_1_1_contracts_1_1_transformer_1_1_transformer.html "Transformer"
[d4c501b1]: interface_soong_1_1_contracts_1_1_loader_1_1_loader.html "Loader"
The ETL pipeline components need to communicate the data they handle with each other (extractor outputs need to pass through a series of transformers and ultimately into a loader). The canonical representation of such data would be an associative array of arbitrarily-typed values, but rather than require a specific representation it is more flexible to abstract the data.
The ETL pipeline components need to communicate the data they handle with each other - extractor outputs need to pass through a series of transformers and ultimately into a loader. The canonical representation of such data would be an associative array of arbitrarily-typed values, but rather than require a specific representation it is more flexible to abstract the data.
- [DataProperty][81696853]: Represents a value (which could be a scalar, an array, or an object). Implementations of DataProperty should be immutable - the value should be set at construction time and may not subsequently be changed. The value may be any scalar, array, or object type - including `DataPropertyInterface`.
- [DataRecord][ba5fb4bd]: A data record (a set of named `DataProperty` instances) is represented by `DataRecordInterface`. In the context of an ETL pipeline, an extractor will output a `DataRecordInterface` to input to transformers, and the transformation process will populate another instance of `DataRecordInterface` one property at a time to ultimately pass to a loader.
- [DataProperty][81696853]: Represents a value (which could be a scalar, an array, or an object). Implementations of DataProperty should be immutable - the value should be set at construction time and may not subsequently be changed. The value may be any scalar, array, or object type - including `DataProperty`.
- [DataRecord][ba5fb4bd]: A data record (a set of named `DataProperty` instances) is represented by `DataRecord`. In the context of an ETL pipeline, an extractor will output a `DataRecord` to input to transformers, and the transformation process will populate another instance of `DataRecord` one property at a time to ultimately pass to a loader.
[81696853]: interface_soong_1_1_contracts_1_1_data_1_1_data_property.html "DataProperty"
[ba5fb4bd]: interface_soong_1_1_contracts_1_1_data_1_1_data_record.html "DataRecord"
......@@ -25,7 +30,7 @@ The ETL pipeline components need to communicate the data they handle with each o
To manage the migration process, we have:
- [Task][845d1aeb]: A named object controlling the execution of operations according to a set of configuration. Most tasks will be ETL tasks, designed to migrate data, but the overall migration process may require some non-ETL housekeeping tasks (like moving files around) - classes derived from `Task` rather than `EtlTask` can be used to incorporate these operations.
- [EtlTask][fd591c8f]: A Task specifically designed to perform an ETL operation in the following manner:
- [EtlTask][fd591c8f]: A Task specifically designed to perform operations on data using extractors, transformers, and loaders. The most important operation is `migrate`, which will:
1. Invoke an `Extractor` instance and iterate over its data set, retrieving one source `DataRecord` at a time.
2. Create a destination `DataRecord`, and for each property to be stored in this record, execute one or more `Transformer` instances to derive the destination property from source properties and configuration.
3. Pass the destination `DataRecord` to a `Loader` instance for final disposition.
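
The three `migrate` steps above can be sketched in plain PHP. This is a minimal illustration only, not Soong's implementation: `migrateSketch()` is a hypothetical name, plain associative arrays stand in for `DataRecord` instances, and closures stand in for the `Extractor`, `Transformer`, and `Loader` components.

```php
<?php
declare(strict_types=1);

/**
 * Run a minimal extract-transform-load loop (hypothetical helper).
 *
 * @param iterable $source
 *   Source records, each an associative array standing in for a DataRecord.
 * @param array $transformMap
 *   Destination property name => callable receiving the source record.
 * @param callable $load
 *   Receives each finished destination record.
 *
 * @return int
 *   Number of records passed to the loader.
 */
function migrateSketch(iterable $source, array $transformMap, callable $load): int
{
    $count = 0;
    // 1. Iterate over the extracted data one source record at a time.
    foreach ($source as $sourceRecord) {
        // 2. Build the destination record one property at a time.
        $destinationRecord = [];
        foreach ($transformMap as $propertyName => $transformer) {
            $destinationRecord[$propertyName] = $transformer($sourceRecord);
        }
        // 3. Hand the finished record to the loader.
        $load($destinationRecord);
        $count++;
    }
    return $count;
}

$loaded = [];
$count = migrateSketch(
    [['foo' => 'first', 'num' => 1], ['foo' => 'second', 'num' => 2]],
    [
        'title' => function (array $record) {
            return strtoupper($record['foo']);
        },
        'double' => function (array $record) {
            return $record['num'] * 2;
        },
    ],
    function (array $record) use (&$loaded): void {
        $loaded[] = $record;
    }
);
```

Keying the transform map by destination property name mirrors the way the demo configuration keys its `transform` section.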
......@@ -38,8 +43,7 @@ To manage the migration process, we have:
Finally, we have:
- [KeyMap][8129d923]: Storage of the relationships between extracted and loaded records (based on the designated unique keys for each). This enables maintaining relationships between keyed records when the keys change during migration (as when loading into an auto-increment SQL table), as well as providing rollback and auditing capabilities. This component is optional - you may implement ETL processes without tracking the keys being processed.
- [Filter][1ea4455f]: A filter simply accepts a `DataRecord` and based on the record's property values and its own configuration, decides whether the record should be further processed. Filters may be configured in the base configuration of an extractor (to help define the canonical source data to be migrated), or injected at run time (to, say, process a single specific record for debugging).
[8129d923]: interface_soong_1_1_contracts_1_1_key_map_1_1_key_map.html "KeyMap"
[162ae00a]: https://gitlab.com/soongetl/soong/issues "Soong ETL issue queue"
[44ab6025]: https://gitlab.com/soongetl/soong/issues/14 "Configuration/option handling"
[1ea4455f]: interface_soong_1_1_contracts_1_1_filter_1_1_filter.html "Filter"
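
To make the filter idea concrete, here is a small self-contained sketch. It assumes stand-ins throughout: `RecordFilter`, `AllowedValuesFilter`, and `extractFilteredSketch()` are hypothetical names, and records are plain arrays rather than `DataRecord` instances. The real contract is `Soong\Contracts\Filter\Filter`, which takes a `DataRecord` and extends `ConfigurableComponent`.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the Filter contract, operating on plain arrays.
interface RecordFilter
{
    public function filter(array $record): bool;
}

// Hypothetical filter accepting only records whose property value is listed.
final class AllowedValuesFilter implements RecordFilter
{
    private $propertyName;
    private $allowed;

    public function __construct(string $propertyName, array $allowed)
    {
        $this->propertyName = $propertyName;
        $this->allowed = $allowed;
    }

    public function filter(array $record): bool
    {
        return in_array($record[$this->propertyName] ?? null, $this->allowed, true);
    }
}

/**
 * Yield only the records accepted by every filter.
 */
function extractFilteredSketch(iterable $records, array $filters): iterable
{
    foreach ($records as $record) {
        foreach ($filters as $filter) {
            if (!$filter->filter($record)) {
                continue 2; // Rejected: move on to the next record.
            }
        }
        yield $record;
    }
}

$records = [
    ['id' => 1, 'foo' => 'first'],
    ['id' => 12, 'foo' => 'bogus'],
    ['id' => 5, 'foo' => 'second'],
];
$filters = [new AllowedValuesFilter('foo', ['first', 'second'])];
$kept = iterator_to_array(extractFilteredSketch($records, $filters), false);
```

Because each filter only answers yes or no for a single record, filters from the base configuration and filters injected at run time can sit in the same list and be applied in turn.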
......@@ -60,6 +60,22 @@ arraytosql:
bar: description of third record
num: 38
related: 1
-
id: 12
foo: bogus
bar: we should skip this
num: -5
# Filters can be used to narrow down the raw data. In this example, we
# use a Select filter to skip bogus records.
filters:
-
class: Soong\Filter\Select
configuration:
criteria:
-
- foo
- <>
- bogus
# The transform stage is keyed by the destination property names to be
# populated.
transform:
......
......@@ -68,7 +68,7 @@ abstract class OptionsResolverComponent implements ConfigurableComponent
*/
public function getConfigurationValue(string $optionName)
{
return $this->configuration[$optionName];
return $this->configuration[$optionName] ?? null;
}
/**
......
......@@ -58,6 +58,22 @@ class EtlCommand extends Command
);
}
/**
* Configure the "select" option.
*
* @return InputOption
* The configured command option.
*/
protected function selectOption() : InputOption
{
return new InputOption(
'select',
null,
InputOption::VALUE_IS_ARRAY | InputOption::VALUE_OPTIONAL,
'List of property name=value criteria'
);
}
/**
* Obtain all task configuration contained in the specified directories.
*
......
......@@ -18,13 +18,14 @@ class MigrateCommand extends EtlCommand
protected function configure()
{
$this->setName("migrate")
->setDescription("Migrate data from one place to another")
->setDefinition([
$this->tasksArgument(),
$this->directoryOption(),
])
->setHelp(<<<EOT
The <info>migrate</info> command does things and stuff
->setDescription("Migrate data from one place to another")
->setDefinition([
$this->tasksArgument(),
$this->directoryOption(),
$this->selectOption(),
])
->setHelp(<<<EOT
The <info>migrate</info> command executes a Soong migration task
EOT
);
}
......@@ -35,11 +36,12 @@ EOT
protected function execute(InputInterface $input, OutputInterface $output)
{
$directoryNames = $input->getOption('directory');
$options = ['select' => $input->getOption('select')];
$this->loadConfiguration($directoryNames);
foreach ($input->getArgument('tasks') as $id) {
if ($task = $this->pipeline->getTask($id)) {
$output->writeln("<info>Executing $id</info>");
$task->execute('migrate');
$task->execute('migrate', $options);
} else {
$output->writeln("<error>$id not found</error>");
}
......
......@@ -6,7 +6,7 @@ namespace Soong\Contracts\Exception;
/**
* Thrown when attempting to add a task with an already existing ID.
*/
class DuplicateTask extends \RuntimeException implements SoongException
class DuplicateTask extends TaskException
{
}
<?php
declare(strict_types=1);
namespace Soong\Contracts\Exception;
/**
* Base for all exceptions thrown by/about filters.
*/
abstract class FilterException extends \RuntimeException implements SoongException
{
}
<?php
declare(strict_types=1);
namespace Soong\Contracts\Exception;
/**
* Base for all exceptions thrown by/about tasks.
*/
abstract class TaskException extends \RuntimeException implements SoongException
{
}
<?php
declare(strict_types=1);
namespace Soong\Contracts\Exception;
/**
* Thrown when an unrecognized operator is passed to the Select filter.
*/
class UnrecognizedOperator extends FilterException
{
}
<?php
declare(strict_types=1);
namespace Soong\Contracts\Filter;
use Soong\Contracts\Configuration\ConfigurableComponent;
use Soong\Contracts\Data\DataRecord;
/**
* Filters decide whether a DataRecord should or should not be processed.
*/
interface Filter extends ConfigurableComponent
{
/**
* Decide whether a data record should be processed.
*
* @param \Soong\Contracts\Data\DataRecord $dataRecord
* Record to examine.
*
* @return bool
* TRUE if the record should be processed, FALSE if it should be skipped.
*/
public function filter(DataRecord $dataRecord) : bool;
}
......@@ -16,10 +16,13 @@ interface EtlTask extends Task
/**
* Retrieves the configured extractor for this task, if any.
*
* @param array $options
* List of configuration options to set on the extractor.
*
* @return Extractor
* The extractor, or NULL if none.
*/
public function getExtractor() : ?Extractor;
public function getExtractor(array $options = []) : ?Extractor;
/**
* Retrieves the configured loader for this task, if any.
......
......@@ -8,7 +8,7 @@ use Soong\Data\Record;
/**
* Extractor for in-memory arrays.
*/
class ArrayExtractor extends ExtractorBase
class ArrayExtractor extends CountableExtractorBase
{
/**
......
......@@ -28,6 +28,11 @@ abstract class ExtractorBase extends OptionsResolverComponent implements Extract
'default_value' => [],
'allowed_types' => 'array',
];
$options['filters'] = [
'required' => false,
'default_value' => [],
'allowed_types' => 'Soong\Contracts\Filter\Filter[]',
];
return $options;
}
......@@ -52,6 +57,18 @@ abstract class ExtractorBase extends OptionsResolverComponent implements Extract
*/
public function extractFiltered() : iterable
{
return $this->extractAll();
foreach ($this->extractAll() as $dataRecord) {
$yield = true;
/** @var \Soong\Contracts\Filter\Filter $filter */
foreach ($this->getConfigurationValue('filters') as $filter) {
if (!$filter->filter($dataRecord)) {
$yield = false;
break;
}
}
if ($yield) {
yield $dataRecord;
}
}
}
}
<?php
declare(strict_types=1);
namespace Soong\Filter;
use Soong\Configuration\OptionsResolverComponent;
use Soong\Contracts\Data\DataRecord;
use Soong\Contracts\Exception\UnrecognizedOperator;
use Soong\Contracts\Filter\Filter;
/**
* Filter out records based on specific property values.
*
* One configuration option is supported, 'criteria', which consists of an array
* of criterion arrays, each containing three values: the name of the property
* whose value will be the first operand, one of the comparative operators
* listed in Select::OPERATORS, and the value for the second operand. Some
* examples:
*
* @code
* // This filter accepts only data records whose id property is equal to 5.
* $filter = new Select([
* 'criteria' => [
* ['id', '==', 5],
* ],
* ]);
* @endcode
*
* @code
* // This filter accepts only data records whose status property is equal to 1
* // and whose last_name value is in the last half of the alphabet.
* $filter = new Select([
* 'criteria' => [
* ['status', '==', 1],
* ['last_name', '>=', 'N'],
* ],
* ]);
* @endcode
*/
class Select extends OptionsResolverComponent implements Filter
{
/**
* The comparison operators supported by this filter.
*/
public const OPERATORS = [
'=', // Mapped to ==
'==',
'===',
'!=',
'<>',
'!==',
'<',
'>',
'<=',
'>=',
];
/**
* @inheritdoc
*/
public function filter(DataRecord $dataRecord): bool
{
$criteria = $this->getConfigurationValue('criteria');
foreach ($criteria as $criterion) {
[$propertyName, $operator, $testValue] = $criterion;
$propertyValue = $dataRecord->getProperty($propertyName)->getValue();
switch ($operator) {
case '=':
case '==':
if (!($propertyValue == $testValue)) {
return false;
}
break;
case '===':
if (!($propertyValue === $testValue)) {
return false;
}
break;
case '!=':
if (!($propertyValue != $testValue)) {
return false;
}
break;
case '<>':
if (!($propertyValue <> $testValue)) {
return false;
}
break;
case '!==':
if (!($propertyValue !== $testValue)) {
return false;
}
break;
case '<':
if (!($propertyValue < $testValue)) {
return false;
}
break;
case '>':
if (!($propertyValue > $testValue)) {
return false;
}
break;
case '<=':
if (!($propertyValue <= $testValue)) {
return false;
}
break;
case '>=':
if (!($propertyValue >= $testValue)) {
return false;
}
break;
default:
throw new UnrecognizedOperator("Select filter: unrecognized operator $operator");
}
}
return true;
}
/**
* @inheritdoc
*/
protected function optionDefinitions(): array
{
$options = parent::optionDefinitions();
$options['criteria'] = [
'required' => true,
'allowed_types' => 'array[]',
];
return $options;
}
}
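
The criteria semantics described in the class comment (every criterion must hold for a record to be accepted) can also be written in a more compact, table-like form. The following is a sketch only: `compareValues()` and `matchesAll()` are hypothetical helpers, records are plain arrays, and a standard `InvalidArgumentException` stands in for `UnrecognizedOperator`.

```php
<?php
declare(strict_types=1);

/**
 * Apply one comparison operator; hypothetical helper, not part of Soong.
 *
 * @throws InvalidArgumentException
 *   For an operator outside the supported list.
 */
function compareValues($left, string $operator, $right): bool
{
    switch ($operator) {
        case '=': // Treated the same as ==
        case '==':
            return $left == $right;
        case '===':
            return $left === $right;
        case '!=':
        case '<>':
            return $left != $right;
        case '!==':
            return $left !== $right;
        case '<':
            return $left < $right;
        case '>':
            return $left > $right;
        case '<=':
            return $left <= $right;
        case '>=':
            return $left >= $right;
    }
    throw new InvalidArgumentException("Unrecognized operator $operator");
}

/**
 * Return true only if the record satisfies every criterion.
 */
function matchesAll(array $record, array $criteria): bool
{
    foreach ($criteria as [$propertyName, $operator, $testValue]) {
        if (!compareValues($record[$propertyName] ?? null, $operator, $testValue)) {
            return false;
        }
    }
    return true;
}

$criteria = [
    ['status', '==', 1],
    ['last_name', '>=', 'N'],
];
$accepted = matchesAll(['status' => 1, 'last_name' => 'Ryan'], $criteria);
$rejected = matchesAll(['status' => 1, 'last_name' => 'Adams'], $criteria);
```

Returning the comparison result directly keeps each operator to a single line, at the cost of moving the early exit into the calling loop.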
......@@ -8,6 +8,7 @@ use Soong\Contracts\KeyMap\KeyMap;
use Soong\Contracts\Loader\Loader;
use Soong\Contracts\Task\EtlTask as EtlTaskInterface;
use Soong\Contracts\Transformer\Transformer;
use Soong\Filter\Select;
use Symfony\Component\OptionsResolver\OptionsResolver;
/**
......@@ -57,7 +58,7 @@ class EtlTask extends Task implements EtlTaskInterface
/**
* @inheritdoc
*/
public function getExtractor(): ?Extractor
public function getExtractor(array $options = []): ?Extractor
{
$taskConfiguration = $this->getAllConfigurationValues();
if (empty($taskConfiguration['extract'])) {
......@@ -65,7 +66,36 @@ class EtlTask extends Task implements EtlTaskInterface
}
/** @var \Soong\Contracts\Extractor\Extractor $extractorClass */
$extractorClass = $taskConfiguration['extract']['class'];
$extractor = new $extractorClass($taskConfiguration['extract']['configuration']);
$extractorConfiguration = $taskConfiguration['extract']['configuration'];
// @todo Belongs in the Command rather than the Task. Not really
// practical until we make the Command rather than the Task responsible
// for instantiating everything.
if (!empty($options['select'])) {
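// Each expression arrives in the form "$name$op$value" - we need to
// turn that into an array [$name, $op, $value].
$criteria = [];
// Try longer operators first, so that '>=' is not matched as '>'.
$operators = Select::OPERATORS;
usort($operators, function (string $a, string $b) : int {
return strlen($b) <=> strlen($a);
});
$operatorList = implode('|', $operators);
foreach ($options['select'] as $expression) {
if (!preg_match("/(.*?)($operatorList)(.*)/", $expression, $matches)) {
// @todo Throw exception - should be Command exception.
// Until then, skip the expression rather than read unset matches.
continue;
}
$criteria[] = [$matches[1], $matches[2], $matches[3]];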
}
$extractorConfiguration['filters'][] = [
'class' => 'Soong\Filter\Select',
'configuration' => [
'criteria' => $criteria,
]
];
}
// Replace filter configuration with actual instances.
if (!empty($extractorConfiguration['filters'])) {
foreach ($extractorConfiguration['filters'] as $key => $filter) {
$extractorConfiguration['filters'][$key] =
new $filter['class']($filter['configuration']);
}
}
$extractor = new $extractorClass($extractorConfiguration);
return $extractor;
}
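
The parsing of `--select` expressions into `[$name, $op, $value]` triples can be sketched in isolation. This is an illustration under assumptions, not the code in this commit: `parseSelectExpression()` is a hypothetical helper, the operator list is copied from `Select::OPERATORS`, and a standard `InvalidArgumentException` stands in for the Command exception mentioned in the `@todo`. Operators are tried longest first so that an expression such as `num>=5` is not split at `>`.

```php
<?php
declare(strict_types=1);

// Operators copied from Select::OPERATORS in this commit.
const OPERATORS = ['=', '==', '===', '!=', '<>', '!==', '<', '>', '<=', '>='];

/**
 * Split "name<op>value" into [name, op, value]; hypothetical helper.
 *
 * @throws InvalidArgumentException
 *   If no supported operator is found in the expression.
 */
function parseSelectExpression(string $expression): array
{
    $operators = OPERATORS;
    // Longest first, so '>=' is tried before '>' and '!==' before '!='.
    usort($operators, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });
    // Escape each operator for use inside a '/'-delimited pattern.
    $quoted = array_map(function (string $operator): string {
        return preg_quote($operator, '/');
    }, $operators);
    $pattern = '/^(.*?)(' . implode('|', $quoted) . ')(.*)$/';
    if (!preg_match($pattern, $expression, $matches)) {
        throw new InvalidArgumentException("Invalid select expression: $expression");
    }
    return [$matches[1], $matches[2], $matches[3]];
}
```

Note that values parsed this way are always strings, so a strict comparison such as `id===5` against an integer property would not match; the loose operators are likely the more useful ones on the command line.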
......@@ -151,7 +181,7 @@ class EtlTask extends Task implements EtlTaskInterface
return;
}
$extractor = $this->getExtractor();
$extractor = $this->getExtractor($options);
$loader = $this->getLoader();
$keyMap = $this->getKeyMap();
/** @var \Soong\Contracts\Data\DataRecord $recordClass */
......
......@@ -72,6 +72,7 @@ abstract class ExtractorTestBase extends TestCase
$extractor = new $this->extractorClass($configuration);
reset($expected);
$count = 0;
/** @var \Soong\Contracts\Extractor\Extractor $extractor */
foreach ($extractor->extractAll() as $record) {
$count++;
$this->assertEquals(current($expected), $record->toArray());
......@@ -81,6 +82,32 @@ abstract class ExtractorTestBase extends TestCase
$this->assertEquals(count($expected), $count);
}
/**
* Test extractFiltered().
*
* @dataProvider extractFilteredDataProvider
*
* @param array $configuration
* Extractor configuration.
* @param array $expected
* Expected set of data records returned.
*/
public function testExtractFiltered(array $configuration, array $expected)
{
$extractor = new $this->extractorClass($configuration);
reset($expected);
$count = 0;
/** @var \Soong\Contracts\Extractor\Extractor $extractor */
foreach ($extractor->extractFiltered() as $record) {
$count++;
/** @var \Soong\Contracts\Data\DataRecord $record */
$this->assertEquals(current($expected), $record->toArray());
next($expected);
}
// Validate the expected number of records were processed.
$this->assertEquals(count($expected), $count);
}
/**
* Test getProperties() and getKeyProperties().