- Ideally, a good DL model has already been developed on a randomly sampled subset of the dataset and does not need further changes for the full dataset
- Training the DL model on the full dataset takes too long on a single node
- You have access to a computing cluster/supercomputer with enough compute power
## Data parallel or model parallel decision-making
- If the DL model fits on one node, choose data parallelism (e.g., Horovod [[2](#references)]); see the sketch after this list
- If the DL model is too large for one node but the dataset fits on one node, choose model parallelism; Mesh-TensorFlow [[3](#references)] is one option
- If both the DL model and the dataset are too large for one node, choose model-data parallelism [[10](#references)]; Mesh-TensorFlow [[3](#references)] can also help here
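As a minimal illustration of the data-parallel option, the sketch below uses Horovod [[2](#references)] with TensorFlow/Keras. The toy model, synthetic data, learning rate, and launch command are illustrative assumptions, not part of the original guidelines.

```python
# Minimal data-parallel sketch with Horovod + TensorFlow/Keras.
# Launch with e.g. `horovodrun -np 4 python train.py` (or mpirun/srun).
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one Horovod rank per process

# Toy model and synthetic data, only to keep the sketch self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))

# Scale the learning rate with the number of workers (linear scaling rule),
# then wrap the optimizer so gradients are averaged across all ranks.
opt = tf.keras.optimizers.SGD(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Each rank processes its own shard of the data with the same replicated model, which is why this pattern only works when the model fits on a single node.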
## How large a dataset is required for scaling/training
- Dataset size: different problems/datasets require different dataset sizes. In general, the larger the dataset, the better. Some examples are given in the table below:
| Project name | Model type | Number of model parameters | Dataset size | Comment |
| --- | --- | --- | --- | --- |
| ImageNet classification | ResNet-50 | 25 million | 1.28 million training images (ImageNet 2012 classification dataset, 1000 classes) | Can converge well |
## How to increase dataset size
- Run more simulations
- Data augmentation; this largely depends on the invariances in your data. For example, some common augmentation transformations for image and object recognition tasks (see the sketch after this list):
    - Horizontal flips, random crops and scales, color jitter
    - Random mixes/combinations of: translation, rotation, stretching, shearing, lens distortions, ...
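As a hedged example of on-the-fly augmentation with `tf.image`, the sketch below assumes 32×32 RGB images; the padding size and jitter ranges are illustrative choices, not prescribed by these guidelines.

```python
import tensorflow as tf

def augment(image):
    """Random horizontal flip, random crop after padding, and color jitter."""
    image = tf.image.random_flip_left_right(image)
    image = tf.image.resize_with_crop_or_pad(image, 40, 40)   # pad to 40x40 ...
    image = tf.image.random_crop(image, size=[32, 32, 3])     # ... then crop back
    image = tf.image.random_brightness(image, max_delta=0.2)  # color jitter
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_saturation(image, 0.8, 1.2)
    return image

# Applied per sample in the input pipeline, so every epoch sees new variants:
# dataset = dataset.map(lambda x, y: (augment(x), y),
#                       num_parallel_calls=tf.data.AUTOTUNE)
```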
## Optimizer selection
- Continue to use the same optimizer as in the single-node case for multi-node scaling, as long as the global batch size is not scaled too large
- For extremely large global batch sizes (model and dataset dependent): consider combining LARS [[7](#references)]/LARC with the base optimizer (e.g., Adam, SGD)
- For best accuracy: SGD with momentum (though the hyper-parameters may be difficult to tune)
- Adam [[11](#references)] is the most popular per-parameter adaptive learning rate optimizer; it works very well in most use cases without difficult learning-rate tuning, on both single-node and multi-node runs. We recommend giving it a try (see the sketch below)
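The Keras calls below show the two base choices mentioned above; the learning rates are placeholder values. LARS/LARC is not part of core Keras, so an implementation from your framework or vendor stack would be needed for the extreme-batch-size case.

```python
import tensorflow as tf

# Adaptive default: Adam usually works well without heavy learning-rate tuning,
# on a single node and, with appropriate LR scaling, on multiple nodes.
adam = tf.keras.optimizers.Adam(learning_rate=1e-3)

# Often the best final accuracy, at the cost of more hyper-parameter tuning.
sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True)
```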
## Learning rate scheduling
- Apply learning rate scaling for weak scaling with a large batch size, e.g., linear scaling [[9](#references)] or sqrt scaling [[7](#references)]
- Use learning rate warm-up [[9](#references)] when scaling DL training to multiple nodes with a larger global batch size: start with the single-worker/rank/node LR and scale up to the desired value linearly over a few epochs
- Consider a learning rate decay schedule, e.g., exponential decay, 1/t decay, polynomial decay, cosine decay, etc. (a combined warm-up/decay sketch follows below)
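A hedged sketch combining linear warm-up, linear LR scaling, and cosine decay with a Keras `LearningRateScheduler` callback; the base LR, worker count, and epoch counts are illustrative assumptions.

```python
import math
import tensorflow as tf

BASE_LR = 0.1        # single-worker learning rate (assumed)
NUM_WORKERS = 8      # e.g. hvd.size() in a Horovod job
WARMUP_EPOCHS = 5
TOTAL_EPOCHS = 90

def lr_schedule(epoch, lr):
    """Linear warm-up from the single-worker LR to the scaled LR, then cosine decay."""
    peak = BASE_LR * NUM_WORKERS  # linear scaling rule [9]
    if epoch < WARMUP_EPOCHS:
        return BASE_LR + (peak - BASE_LR) * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (TOTAL_EPOCHS - WARMUP_EPOCHS)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

lr_callback = tf.keras.callbacks.LearningRateScheduler(lr_schedule)
# model.fit(..., epochs=TOTAL_EPOCHS, callbacks=[lr_callback, ...])
```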
## Synchronous SGD or Asynchronous SGD or hybrid SGD
- Sync SGD for proof of concept
- Async SGD for well-studied algorithms, where you want to further improve scaling efficiency
- Consider gradient lag-sync [[8](#references)] (also called stale-synchronous or pipelining)
- Hybrid SGD (not straightforward with most frameworks)
- Recommendation: use synchronous SGD for reproducibility and easier convergence
## Distributed training framework
- Horovod-MPI, Cray ML Plugin, Horovod-MLSL, etc.
- gRPC is not recommended on HPC systems
- Recommendation: use Horovod-MPI unless you have access to a Cray machine
## Batch size selection and node number selection
- Different workloads (model, training algorithm, dataset) allow different *maximum useful batch sizes*, which are related to the gradient noise scale [[4](#references)]
- More complex datasets/tasks have higher gradient noise and thus can benefit from training with larger batch sizes [[4](#references)]
- For dataset size N, the maximal global batch size is usually ≤ sqrt(N) (see the sketch after this list)
- Make sure the local batch size is not too small, for computational efficiency
- Up to 64 nodes are recommended, for shorter queue wait times
- More nodes are not necessarily better
- For weak scaling, learning rate and global batch size need to be scaled at the same time
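A small worked example of the sizing rules above, using the ImageNet numbers from the table as an assumption; the node count and local batch size are illustrative.

```python
import math

dataset_size = 1_280_000      # ~1.28 million ImageNet training images (table above)
num_nodes = 64                # suggested upper bound for node count
local_batch_size = 16         # per-node batch; keep large enough for efficiency

global_batch_size = num_nodes * local_batch_size        # 1024
sqrt_n_guideline = int(math.sqrt(dataset_size))         # ~1131

print(f"global batch size : {global_batch_size}")
print(f"sqrt(N) guideline : {sqrt_n_guideline}")
if global_batch_size > sqrt_n_guideline:
    print("Global batch exceeds sqrt(N); reduce nodes or local batch size, "
          "or verify convergence empirically (gradient noise scale [4]).")
```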
**Figure 2.** The “simple noise scale” roughly predicts the maximum useful batch size for many ML tasks [[4](#references)]
**Figure 4.** The tradeoff between time and compute resources spent to train a model to a given level of performance takes the form of a Pareto frontier (left). (Right) a concrete example of the Pareto frontiers obtained from training a model to solve the Atari Breakout game to different levels of performance [[4](#references)]
4. McCandlish, Kaplan and Amodei, An Empirical Model of Large-Batch Training, [arXiv:1812.06162](http://arxiv.org/abs/1812.06162)
5. Shallue, Lee, Antognini, Sohl-Dickstein, Frostig, Dahl, Measuring the Effects of Data Parallelism on Neural Network Training, [arXiv:1811.03600](https://arxiv.org/abs/1811.03600)
* [Using deep networks for HEP physics analyses](hep-cnn.md)
* [Using deep networks for neutrino telescopes (Ice Cube)](https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2018/icecube-research-garners-best-paper-award-at-ieee-machine-learning-conference/)
* [CosmoGAN: Deep networks for generating cosmology mass maps](https://www.nersc.gov/news-publications/nersc-news/science-news/2019/cosmogan-training-a-neural-network-to-study-dark-matter/)
* A use of SciKitLearn by Juliette Ugirumurera can be found in [this iPython notebook](https://github.com/NERSC/data-day-examples/blob/master/SLURM_challenge.ipynb). The code uses SciKitLearn to construct, train and evaluate the network, and was the winning code for the SLURM log data challenge in the [2017 Data Day Competition](https://www.nersc.gov/users/NUG/annual-meetings/nersc-data-day-and-nug2017/data-competition/).
* The winning code for the Astronomy challenge in the [2017 Data Day Competition](https://www.nersc.gov/users/NUG/annual-meetings/nersc-data-day-and-nug2017/data-competition/) by Yisha Sun uses TensorFlow to set up and train the network. The code can be found in [this github repository](https://github.com/miaoshasha/Astronomical_Classification).