Commit e12fe22d authored by Lisa Gerhardt

Fixed lint warnings on line length, blank lines around lists, and fixed broken link to BB

parent 487ffe6a
@@ -25,7 +25,7 @@ We compare PyTorch software installations, hardware, and analyze scaling
performance using the PyTorch distributed library with MPI. See the notebooks
in the links below for numbers and plots.
#### Software versions
Results for a handful of software versions available on Cori are in this
notebook:
@@ -35,13 +35,13 @@ https://github.com/sparticlesteve/pytorch-benchmarks/blob/master/notebooks/Softw
Training throughput results:
![Training results](images/pytorch_training_benchmarks.png)
#### Hardware comparisons
Results comparing training throughput on Cori Haswell, KNL, and GPU are here:
https://github.com/sparticlesteve/pytorch-benchmarks/blob/master/notebooks/HardwareAnalysis.ipynb
#### Scaling analysis
Throughput scaling results on Cori Haswell with Intel PyTorch v1.0.0 are
available here:
# Distributed Training
Guidelines prepared by **Lei Shao**, **Victor Lee** (Intel) and
**Thorsten Kurth**, **Prabhat** (NERSC) under the Big Data Center
collaboration.
## Motivation
- Scientific datasets can be large in volume and complex
(multivariate, high dimensional)
- Models get bigger and more compute intensive as they tackle more
complex tasks
- Scaling deep learning (DL) applications enables rapid prototyping /
model evaluation
[![Figure 1. AlexNet to AlphaGo Zero: A 300,000x Increase in
Compute](figures/openai-compute-diagram.png)](#references)
## Assumptions
- Ideally, you have already developed a good DL model on a randomly
  sampled subset of the dataset and do not need to change the model
  further when training on the full dataset
- Training the DL model on the full dataset takes too long on a
  single node
- You have access to a computing cluster/supercomputer with enough
  compute power
## Data parallel or model parallel decision-making
- If the DL model fits on one node, choose data parallelism (e.g.,
  Horovod [[2](#references)]); a minimal data-parallel sketch follows
  this list
- If the DL model is too large to run on one node but the dataset fits
  on one node, choose model parallelism; Mesh-TensorFlow
  [[3](#references)] is one option
- If both the DL model and the dataset are too large for one node,
  choose model-data parallelism [[10](#references)]; Mesh-TensorFlow
  [[3](#references)] can also be used here
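
For the data-parallel case, here is a minimal sketch using Horovod with
PyTorch; the model, data, and learning rate are placeholders, and in a
real job each rank would read its own shard of the dataset.

```python
# Minimal data-parallel training sketch with Horovod + PyTorch.
# Model, data, and hyper-parameters are placeholders.
import torch
import horovod.torch as hvd

hvd.init()

model = torch.nn.Linear(128, 10)                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01 * hvd.size(),    # linear LR scaling with worker count
                            momentum=0.9)

# Wrap the optimizer so gradients are averaged across ranks with allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start all ranks from the same initial state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = torch.nn.CrossEntropyLoss()
for step in range(10):                               # placeholder training loop
    x = torch.randn(32, 128)                         # each rank reads its own shard in practice
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```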
## How large a dataset is required for scaling/training
- Dataset size: different problems/datasets require different dataset
  sizes; in general, the larger the dataset, the better. Some examples
  are given in the table below:
| Project name | Model type | Number of model parameters | Dataset size | Comment |
|--------------|------------|----------------------------|--------------|---------|
| ImageNet classification | ResNet-50 | 25 million | 1.28 million training images (ImageNet 2012 classification dataset, 1000 classes) | Can converge well |
## How to increase dataset size
- Run more simulations
- Data augmentation; this largely depends on the invariances in your
  data. For example, some common augmentation transformations for
  image and object recognition tasks (see the sketch after this list):
    - Horizontal flips, random crops and scales, color jitter
    - Random mixes/combinations of: translation, rotation, stretching,
      shearing, lens distortions, ...
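
As an illustration of the augmentations above, here is a sketch using
torchvision transforms; the particular transforms and parameter values
are examples only and should be chosen to match the invariances of your
data.

```python
# Illustrative image-augmentation pipeline with torchvision; the specific
# transforms and parameter values are examples, not a recommendation.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),             # random crop and rescale
    transforms.RandomHorizontalFlip(),             # horizontal flip with p=0.5
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), shear=5),
    transforms.ToTensor(),
])

# Pass train_transform to your Dataset (e.g., ImageFolder(root, transform=train_transform))
# so each epoch sees a differently transformed copy of every image.
```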
## Optimizer selection
- Continue to use the same optimizer as in the single-node case when
  the global batch size is not scaled up too far
- For extremely large global batch sizes (model and dataset dependent),
  consider combining LARS [[7](#references)]/LARC with the base
  optimizer (e.g., Adam, SGD)
- For best accuracy: SGD with momentum (but the hyper-parameters may be
  difficult to tune)
- The Adam [[11](#references)] optimizer is the most popular
  per-parameter adaptive learning rate optimizer. It works well in most
  use cases without difficult learning rate tuning, on a single node as
  well as on multiple nodes, and we recommend giving it a try. A short
  sketch of these options follows this list.
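
A rough PyTorch sketch of the options above; the learning rates,
momentum, and weight decay values are placeholders that must be tuned
per model and dataset, and the Apex LARC wrapper mentioned in the
comments is one possible LARS-style implementation (an assumption, not
a requirement).

```python
# Optimizer options sketched in PyTorch; all hyper-parameter values below are
# placeholders and must be tuned for the model, dataset, and global batch size.
import torch

model = torch.nn.Linear(128, 10)   # placeholder model

# Often the best final accuracy, but usually needs more hyper-parameter tuning:
sgd = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Robust default that typically needs little learning-rate tuning, on one node or many:
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

# For extremely large global batch sizes, a LARS/LARC-style wrapper can be layered
# on top of the base optimizer. One implementation (assumed available) is NVIDIA Apex:
#   from apex.parallel.LARC import LARC
#   optimizer = LARC(sgd)
```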
## Learning rate scheduling
- Apply learning rate scaling for weak scaling with large batch sizes,
  e.g., linear scaling [[9](#references)] or sqrt scaling
  [[7](#references)]
- Use learning rate warm-up [[9](#references)] when scaling DL training
  to multiple nodes with a larger global batch size: start from the
  single-worker/rank/node learning rate and increase it linearly to the
  target value over a couple of epochs
- Consider adding a learning rate decay schedule: try step decay,
  exponential decay, 1/t decay, polynomial decay, cosine decay, etc.
  (a combined warm-up and decay sketch follows this list)
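
The sketch below combines the linear scaling rule, a linear warm-up
from the single-worker learning rate, and a cosine decay, using
PyTorch's `LambdaLR`; the worker count, base learning rate, and epoch
counts are placeholder values.

```python
# Sketch of linear LR scaling + warm-up + cosine decay.
# num_workers, base_lr, warmup_epochs, and total_epochs are placeholders.
import math
import torch

model = torch.nn.Linear(128, 10)           # placeholder model
num_workers = 64                           # e.g., hvd.size() in a Horovod job
base_lr = 0.01                             # LR that worked on a single worker
scaled_lr = base_lr * num_workers          # linear scaling rule [9]

optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)

warmup_epochs, total_epochs = 5, 90

def lr_factor(epoch):
    if epoch < warmup_epochs:
        # Ramp linearly from the single-worker LR up to the scaled LR.
        start = 1.0 / num_workers
        return start + (1.0 - start) * epoch / warmup_epochs
    # Cosine decay from the scaled LR down to ~zero afterwards.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(total_epochs):
    # ... one training epoch ...
    scheduler.step()
```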
## Synchronous, asynchronous, or hybrid SGD
- Synchronous SGD for proof of concept
- Asynchronous SGD is a well-studied approach for further improving
  scaling efficiency
    - Consider gradient lag-sync [[8](#references)] (also known as
      stale-synchronous or pipelining)
- Hybrid SGD (not straightforward with most frameworks)
- Recommendation: use synchronous SGD for reproducibility and easier
  convergence; a minimal synchronous-averaging sketch follows this list
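
Under the hood, synchronous SGD averages the per-rank gradients before
every update. A minimal sketch with `torch.distributed` is shown below;
the MPI backend and the toy model/data are assumptions, and in practice
Horovod's `DistributedOptimizer` or `DistributedDataParallel` performs
this averaging for you.

```python
# Minimal synchronous-SGD gradient averaging with torch.distributed.
# Backend choice, model, and data are placeholders.
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")     # assumes PyTorch was built with MPI support

model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):
    x = torch.randn(32, 128)               # each rank would read its own data shard
    y = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # Average gradients across all ranks before the update (synchronous SGD).
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()

    optimizer.step()
```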
## Distributed training framework
- Horovod-MPI, Cray ML Plugin, Horovod-MLSL, etc.
- gRPC is not recommended on HPC systems
- Recommendation: use Horovod-MPI unless you have access to a Cray
  machine
## Batch size and node count selection
- Different workloads (model, training algorithm, dataset) allow
  different *maximum useful batch sizes*, which are related to the
  gradient noise scale [[4](#references)]
- More complex datasets/tasks have higher gradient noise and can
  therefore benefit from training with larger batch sizes
  [[4](#references)]
- For a dataset of size N, the maximal global batch size is usually
  kept ≤ sqrt(N)
- Make sure the local batch size is not too small, for computational
  efficiency
- Up to 64 nodes is recommended for shorter queue wait times
    - More nodes are not necessarily better
- For weak scaling, the learning rate and the global batch size need to
  be scaled at the same time (a worked example follows this list)
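
A worked example of these rules of thumb; the dataset size, node count,
and base learning rate are illustrative (an ImageNet-scale dataset on
64 nodes with one data-parallel rank per node).

```python
# Worked example of the batch-size and learning-rate rules of thumb above.
import math

dataset_size = 1_280_000                    # N: number of training samples
num_nodes = 64                              # one data-parallel rank per node assumed

max_global_batch = int(math.sqrt(dataset_size))   # ~1131; cap the global batch near sqrt(N)
global_batch = 1024                               # round down to a convenient power of two
local_batch = global_batch // num_nodes           # 16 samples per node per step
                                                  # (check this is not too small for efficiency)

base_lr = 0.1                                     # single-node learning rate
scaled_lr = base_lr * num_nodes                   # linear scaling for weak scaling

print(max_global_batch, global_batch, local_batch, scaled_lr)
```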
**Figure 2.** The “simple noise scale” roughly predicts the maximum
useful batch size for many ML tasks [[4](#references)]
[![](figures/gradient-noise-scale.png)](#references)
**Figure 3.** The relationship between steps to result and batch size
has the same characteristic form for all problems [[5](#references)]
[![](figures/google-paper-scaling.png)](#references)
**Figure 4.** The tradeoff between time and compute resources spent to
train a model to a given level of performance takes the form of a
Pareto frontier (left). (Right) a concrete example of the Pareto
frontiers obtained from training a model to solve the Atari Breakout
game to different levels of performance [[4](#references)]
[![](figures/gradient-noise-scale-2.png)](#references)
**Figure 5.** Effect of larger batch size on estimated gradients and
training speed [[4](#references)]
[![](figures/batch-size.png)](#references)
## References
1. AI and Compute, OpenAI blog,
   [https://openai.com/blog/ai-and-compute/](https://openai.com/blog/ai-and-compute/)
2. [Horovod](https://github.com/horovod/horovod)
3. [Mesh-TensorFlow](https://github.com/tensorflow/mesh)
4. McCandlish, Kaplan and Amodei, An Empirical Model of Large-Batch
Training, [arXiv:1812.06162](http://arxiv.org/abs/1812.06162)
5. Shallue, Lee, Antognini, Sohl-Dickstein, Frostig, Dahl, Measuring
   the Effects of Data Parallelism on Neural Network Training,
   [arXiv:1811.03600](https://arxiv.org/abs/1811.03600)
6. [Cray
HPO](https://pubs.cray.com/content/S-2589/1.2.UP00/xctm-series-urika-xc-analytic-applications-guide/hyperparameter-optimization-hpo-support)
7. You, Gitman, Ginsburg, Large Batch Training of Convolutional
Networks, [arXiv:1708.03888](https://arxiv.org/abs/1708.03888)
8. Kurth, et al, Exascale Deep Learning for Climate Analytics,
[arXiv:1810.01993](https://arxiv.org/abs/1810.01993)
9. Goyal, et al, Accurate, Large Minibatch SGD: Training ImageNet in 1
hour, [arXiv:1706.02677](https://arxiv.org/abs/1706.02677)
10. Kurth, Zhang, Satish, Mitliagkas, Racah, Patwary, Malas, Sundaram,
Bhimji, Smorkalov, Deslippe, Shiryaev, Sridharan, Prabhat, Dubey: Deep
Learning at 15PF: Supervised and Semi-supervised Classification for
Scientific Data, [arxiv:1708.05256](https://arxiv.org/abs/1708.05256)
11. Kingma and Ba, Adam: A method for Stochastic Optimization,
[arXiv:1412.6980](https://arxiv.org/abs/1412.6980)
12. [IBM AutoAI](https://www.ibm.com/cloud/watson-studio/autoai)
@@ -6,19 +6,19 @@ on our systems.
These docs include details about how to use our system optimized frameworks,
multi-node training libraries, and performance guidelines.
## Classical Machine Learning
Libraries like scikit-learn and other non-deep-learning libraries are supported
through our standard installations and environments for Python and R.
## Deep Learning Frameworks
We have prioritized support for the following Deep Learning frameworks on Cori:
* [TensorFlow](tensorflow.md)
* [PyTorch](pytorch.md)
## Deploying with Jupyter
Users can deploy distributed deep learning workloads to Cori from Jupyter
notebooks using parallel execution libraries such as IPyParallel. Jupyter
@@ -30,12 +30,12 @@ We have some examples for running multi-node training and distributed
hyper-parameter optimization jobs from notebooks in this github repository:
https://github.com/sparticlesteve/cori-intml-examples
## Benchmarks
We track general performance of Deep Learning frameworks as well as some
specific scientific applications. See the [benchmarks](benchmarks.md) for details.
## Science use-cases
Machine Learning and Deep Learning are increasingly used to analyze
scientific data, in diverse fields. We have gathered some examples of
@@ -17,6 +17,16 @@ showcase the use of Tensorflow optimized for the KNL architecture.
* [Using deep networks for HEP physics analyses](hep-cnn.md)
* [Using deep networks for neutrino telescopes (Ice Cube)](https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2018/icecube-research-garners-best-paper-award-at-ieee-machine-learning-conference/)
* [CosmoGAN: Deep networks for generating cosmology mass maps](https://www.nersc.gov/news-publications/nersc-news/science-news/2019/cosmogan-training-a-neural-network-to-study-dark-matter/)
* A use of scikit-learn by Juliette Ugirumurera can be found in [this
  Jupyter
  notebook](https://github.com/NERSC/data-day-examples/blob/master/SLURM_challenge.ipynb). The
  code uses scikit-learn to construct, train and evaluate the network,
  and was the winning code for the SLURM log data challenge in the
  [2017 Data Day
  Competition](https://www.nersc.gov/users/NUG/annual-meetings/nersc-data-day-and-nug2017/data-competition/).
* The winning code for the Astronomy challenge in the [2017 Data Day
  Competition](https://www.nersc.gov/users/NUG/annual-meetings/nersc-data-day-and-nug2017/data-competition/)
  by Yisha Sun uses TensorFlow to set up and train the network. The
  code can be found in [this GitHub
  repository](https://github.com/miaoshasha/Astronomical_Classification).
* Many other projects from LBNL can be found at https://ml4sci.lbl.gov
@@ -53,21 +53,27 @@ to install your own packages on top of our TensorFlow installations. For example,
module load tensorflow
pip install netCDF --user
```
2. **Install TensorFlow into your custom conda environments** - You can set up a
[conda](../machinelearning/tensorflow.md) environment as described in our
[Python documentation](../development/languages/python/nersc-python.md)
and install TensorFlow into it. You can choose builds that target either CPU or GPU.
For example, to choose the CPU-optimized version, first look up the available builds:
```bash
conda search tensorflow
```
This will output a long list like this one:
```bash
tensorflow 2.1.0 eigen_py37h1a52d58_0 pkgs/main
tensorflow 2.1.0 gpu_py37h7a4bb67_0 pkgs/main
tensorflow 2.1.0 mkl_py36h23468d9_0 pkgs/main
tensorflow 2.1.0 mkl_py37h80a91df_0 pkgs/main
```
The builds containing `mkl` are optimized for CPU. You can now choose one to install:
```bash
conda install tensorflow=2.1.0=mkl_py37h80a91df_0
```
@@ -75,6 +81,7 @@ conda install tensorflow=2.1.0=mkl_py37h80a91df_0
To install the GPU version, you need to load the module for the [CUDA
version](https://www.tensorflow.org/install/source#gpu) against which TensorFlow
was compiled. For example, to install TensorFlow `2.1.0`:
```bash
module load cuda/10.1.243
conda install tensorflow=2.1.0=gpu_py37h7a4bb67_0
@@ -232,20 +239,19 @@ manually and is application dependent.
For performance reasons, we recommend storing the data on the scratch
directory, accessible via the `SCRATCH` environment variable. At high
concurrency, i.e. when many nodes need to read the files, we recommend
[staging them into the burst
buffer](../../jobs/examples/#burst-buffer). For efficient data feeding
we recommend using the `TFRecord` data format and the [`dataset`
API](https://www.tensorflow.org/programmers_guide/datasets) to feed
data to the CPU. In particular, note that the `TFRecordDataset`
constructor takes `buffer_size` and `num_parallel_reads` options which
allow for prefetching and multi-threaded reads. These should be tuned
for good performance, but note that a thread is dispatched for every
independent read, so the number of inter-op threads needs to be
adjusted accordingly (see below). The `buffer_size` parameter is
specified in bytes and should be an integer multiple of the node-local
batch size for optimal performance.
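
In the sketch below, the file pattern, record schema, buffer size, and
node-local batch size are placeholders to tune for your own data and
hardware.

```python
# Sketch of a TFRecord input pipeline with prefetching and parallel reads.
# File pattern, parsing schema, buffer size, and batch size are placeholders.
import tensorflow as tf

filenames = tf.data.Dataset.list_files("/path/to/scratch/data/*.tfrecord")  # e.g. under $SCRATCH

dataset = tf.data.TFRecordDataset(
    filenames,
    buffer_size=8 * 1024 * 1024,   # read-ahead buffer, in bytes
    num_parallel_reads=4,          # one reader thread per independent read
)

def parse_example(serialized):
    # Placeholder schema; replace with the features actually stored in your records.
    features = {"image": tf.io.FixedLenFeature([], tf.string),
                "label": tf.io.FixedLenFeature([], tf.int64)}
    return tf.io.parse_single_example(serialized, features)

dataset = (dataset
           .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(buffer_size=10000)
           .batch(64)                                   # node-local batch size
           .prefetch(tf.data.experimental.AUTOTUNE))    # overlap I/O with compute
```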
### Potential Issues
......