Commit bdf696fe authored by Chris Coughlin's avatar Chris Coughlin

Updated for new normalization feature

parent a7cdfd1f
@@ -86,9 +86,9 @@ The Data Ingestion stage is responsible for reading the input data. Prior to st
* Bitmaps such as GIF, JPEG or PNG
* DICOM / DICONDE
- Other than the standard stage configuration options for remote/local operation and the number of workers, Data Ingestion has no configuration options.
+ In addition to the standard stage configuration options for remote/local operation and the number of workers, Data Ingestion has an option for automatically normalizing data. When enabled, data are [normalized](http://myrdocs.azurewebsites.net/api/com/emphysic/myriad/core/data/ops/NormalizeSignalOperation.html) such that all values are between 0 and 1. Normalization is particularly useful when data sources may have different scales, e.g. when analyzing data saved as images.
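The rescaling described above can be sketched as a simple min-max operation. This is a minimal illustration of the idea, not Myriad's actual `NormalizeSignalOperation` API; the class and method names below are hypothetical:

```java
// Illustrative sketch of min-max normalization: rescale all values into [0, 1].
// The real NormalizeSignalOperation implementation may differ.
public class NormalizeSketch {
    static double[] normalize(double[] data) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : data) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] out = new double[data.length];
        for (int i = 0; i < data.length; i++) {
            // Map each value into [0, 1]; a constant signal maps to all zeros.
            out[i] = range == 0 ? 0 : (data[i] - min) / range;
        }
        return out;
    }
}
```

With this kind of rescaling, an 8-bit image band (values 0 to 255) and a floating-point sensor signal end up on the same 0 to 1 scale, which is why normalization helps when mixing data sources of different scales.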
- When the pipeline is executed, the Data Ingestion router builds a list of the data files to be read. When a worker is ready, it takes the next job off the queue. The worker uses Myriad’s “file sniffer” functionality to attempt to automatically determine the type of file and to read its contents, e.g. files that end in ``.txt`` are first examined as possible text files and Myriad attempts to determine the most likely delimiter (comma, whitespace, etc.). If successful the file is read and passed to the next stage. For multi-frame file formats such as GIF, TIFF, or DICOM/DICONDE, the worker will attempt to read each frame and send to the next stage.
+ When the pipeline is executed, the Data Ingestion router builds a list of the data files to be read. When a worker is ready, it takes the next job off the queue. The worker uses Myriad’s “file sniffer” functionality to attempt to automatically determine the type of file and to read its contents, e.g. files that end in ``.txt`` are first examined as possible text files and Myriad attempts to determine the most likely delimiter (comma, whitespace, etc.). If successful, data from the file are read, normalized if so configured, and passed to the next stage. For multi-frame file formats such as GIF, TIFF, or DICOM/DICONDE, the worker will attempt to read each frame and send it to the next stage.
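Delimiter detection of the kind described above can be sketched by counting candidate separators in a sample line and picking the most frequent one. This is an illustrative approximation only; Myriad's actual file-sniffer logic is not shown here:

```java
// Hypothetical sketch of delimiter "sniffing" for text files: count each
// candidate separator in a sample line and return the most common one.
public class DelimiterSniffer {
    static final char[] CANDIDATES = {',', '\t', ';', ' '};

    static char guess(String line) {
        char best = ',';
        int bestCount = -1;
        for (char c : CANDIDATES) {
            int count = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == c) count++;
            }
            if (count > bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return best;
    }
}
```

A production sniffer would typically examine several lines and verify that the chosen delimiter yields a consistent column count before committing to it.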
If the file could not be read, the worker makes a log entry that includes the name of the file that could not be read and the nature of the error encountered.
docs/img/ingestion.png: 19.5 KB → 17 KB (image updated)
docs/img/trainer.png: 104 KB → 96.2 KB (image updated)
@@ -103,7 +103,9 @@ Myriad uses [Smile](http://haifengl.github.io/smile/) and [Apache Mahout](http:/
Each algorithm has its strengths and weaknesses; Emphysic recommends that each be evaluated during an initial experiment. Note that for SGD in particular, Myriad Trainer may report that no useful results were returned; this indicates that the model was unable to learn any difference between the positive and negative samples. If you are using separate train and test sets, additional training attempts may resolve the issue.
## Training
To create a new model, select an algorithm by choosing the appropriate tab in the Myriad Trainer interface, e.g. create a new Passive Aggressive model by making the Passive Aggressive tab visible. Adjust the train:test ratios, sample balancing, and preprocessing operations (if any) as desired.
Trainer also provides an option for automatically [normalizing](http://myrdocs.azurewebsites.net/api/com/emphysic/myriad/core/data/ops/NormalizeSignalOperation.html) the samples such that each value in each sample is scaled to lie between 0 and 1. If a machine learning algorithm is sensitive to feature scaling (as is the case with [SGD algorithms](http://scikit-learn.org/stable/modules/sgd.html#tips-on-practical-use)), normalizing the data can improve the model's performance. As with preprocessing operations in general, we recommend conducting experiments with your data to determine whether accuracies improve with normalization.
When the `Train` button is pressed, Myriad Trainer will attempt to read each of the folders specified. For each file in each of the folders, the trainer will attempt to load the contents as a Myriad dataset and, if successful, will assign the appropriate label to the dataset. When all the available data have been read, a random subset is set aside for testing and the remainder is used to train the selected model. At the same time, the trainer will produce a plot that visualizes a projection of the entire dataset in three dimensions, with red markers indicating positive samples and blue indicating negative. The projection is based on [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis).
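Setting aside a random test subset can be sketched as a shuffle-and-cut over the loaded samples. The class and method names below are illustrative, not the Trainer's internal API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random train/test split: shuffle the samples,
// then cut so that (1 - testFraction) of them form the training set.
public class TrainTestSplit {
    static List<List<Integer>> split(List<Integer> samples, double testFraction, long seed) {
        List<Integer> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * (1 - testFraction));
        List<List<Integer>> result = new ArrayList<>();
        result.add(new ArrayList<>(shuffled.subList(0, cut)));               // training set
        result.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // test set
        return result;
    }
}
```

A fixed seed makes the split reproducible across runs, which helps when comparing algorithm configurations against each other.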
@@ -148,9 +150,11 @@ If a model appears to be making too many false ROI calls, try adjusting the conf
Consider putting the model into service and saving the data. In particular, a procedure known as [hard negative mining](https://www.reddit.com/r/computervision/comments/2ggc5l/what_is_hard_negative_mining_and_how_is_it/) takes the data that were incorrectly labelled as positive samples and uses them in the next round of training for the model. In some applications a model can learn to distinguish positive and negative samples on the periphery but may have trouble differentiating closer to the decision boundary (the region where labels are ambiguous). Hard negative mining can help improve the accuracy of a model not just by correctly labelling mis-labelled samples but also by helping to more narrowly define the decision boundary.
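The hard negative mining procedure described above can be sketched as a filtering pass: run the current model over known-negative samples and collect the ones it wrongly calls positive. The names below are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of hard negative mining: collect the known negatives
// that the current model incorrectly flags as positive, so they can be fed
// back as labelled negatives in the next round of training.
public class HardNegativeMining {
    static List<double[]> mineHardNegatives(List<double[]> negatives,
                                            Predicate<double[]> model) {
        List<double[]> hard = new ArrayList<>();
        for (double[] sample : negatives) {
            if (model.test(sample)) { // model says "positive" — a false call
                hard.add(sample);
            }
        }
        return hard;
    }
}
```

Because these samples sit near the decision boundary, retraining with them explicitly labelled as negative tends to sharpen the boundary more than adding easy negatives would.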
#### Preprocessing
Myriad has multiple preprocessing operations that can be applied to training data. As mentioned previously, edge detection algorithms including Sobel, Scharr, Prewitt, and Canny can often make implicit features in data explicit. Canny edge detection in particular is a good place to start because it has the side effect of normalizing the data between 0 and 1, which can help to remove variations in amplitude that might otherwise confuse an ROI detector.
Normalizing can also be performed in combination with other preprocessing operations by enabling the `Normalize Data?` checkbox and selecting the preprocessing operation of choice. The data will be normalized prior to performing the preprocessing operation.
- If edge / blob detection algorithms do not improve results, a more elaborate algorithm such as Histogram of Oriented Gradients (HOG) is worth considering. The HOG algorithm was written specifically to encode details of gradients or edge directions in ROI detection. Although computationally it is a more expensive operation than edge detection, it may ultimately provide better results.
+ If edge / blob detection algorithms do not improve results, a more elaborate algorithm such as Histogram of Oriented Gradients (HOG) is worth considering. The HOG algorithm was written specifically to encode details of gradients or edge directions in ROI detection. Although it is a more computationally expensive operation than edge detection, it may ultimately provide better results.
#### Change Algorithm Parameters
Some algorithms present one or more options for changing their behavior, from the penalty they incur for a wrong answer to the maximum number of iterations in their training. Although a thorough examination of each algorithm is beyond the scope of this document, a good starting point is to change a single parameter and evaluate its effect on the resultant model’s accuracy. Keeping good records of the configurations investigated and their results is crucial.
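The one-parameter-at-a-time approach above can be sketched as a small sweep that records each parameter value with the accuracy it produced. `trainAndScore` below is a hypothetical stand-in for a full train/evaluate cycle:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch of a single-parameter sweep: train and score the model
// once per candidate value, keeping an ordered log of parameter -> accuracy.
public class ParameterSweep {
    static Map<Double, Double> sweep(double[] values, DoubleUnaryOperator trainAndScore) {
        Map<Double, Double> log = new LinkedHashMap<>();
        for (double v : values) {
            log.put(v, trainAndScore.applyAsDouble(v));
        }
        return log;
    }
}
```

Keeping the log ordered by insertion makes it easy to see, after the fact, which single change moved the accuracy and in which direction.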