Commit a155d42d authored by Annette Greiner

Merge branch 'master' into agreiner/add_matlab

parents 0fc6bfa8 0d14e608
*~
*.o
*.so
\#*\#
*.DS_Store
*__pycache__*
node_modules
package-lock.json
image: python:3.7

stages:
  - build
  - test
  - deploy
  - post-deploy

build:
  stage: build
  script:
    - pwd
    - find .
    - mkdir -p .build_cache
    - pip install --cache-dir=.build_cache -r requirements.txt
    - python --version
    - mkdocs --version
    - mkdocs --verbose build
    - ls -l public
  artifacts:
    paths:
      - public
    expire_in: 1 week
  cache:
    paths:
      - .build_cache
    key: "build-$CI_COMMIT_REF_SLUG"

check_links:
  stage: test
  script:
    - mkdir -p .links_cache
    - pip install --cache-dir=.links_cache -r util/requirements.txt
    - python util/scrape_urls.py public
  except:
    - master@NERSC/nersc.gitlab.io
  allow_failure: true
  cache:
    paths:
      - .links_cache
    key: "links-$CI_COMMIT_REF_SLUG"

markdown_lint:
  image: node
  stage: test
  script:
    - npm install -g markdownlint-cli
    - markdownlint docs
  allow_failure: true

pages:
  stage: deploy
  script:
    - pip install -r requirements.txt
    - mkdocs --verbose build
    - ls -l public
    - gzip -k -6 -r public/assets/stylesheets
    - gzip -k -6 -r public/assets/javascripts
  artifacts:
    paths:
      - public
  only:
    - master@NERSC/nersc.gitlab.io

update-search-api:
  stage: post-deploy
  image: ubuntu:18.04
  before_script:
    - apt-get update; apt-get -y install curl
  script:
    - "curl --request POST \
       --url https://nersc-docs-search-app-214117.appspot.com/reindex \
       --header 'content-type: application/json' \
       --data '{\"source\":\"docs\"}'"
  only:
    - master@NERSC/nersc.gitlab.io
{
  "default": false,
  "MD003": { "style": "atx" },
  "MD009": true
}
1. Do not commit large files
(e.g. very high-res images, binary data, executables)
* [Image optimization](https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/image-optimization)
1. No commits directly to the master branch
## Setup
Markdown files can be edited directly on GitLab. This work should be in a
private fork or branch and submitted as a merge request. A new branch
can be created by clicking the "+" button next to the repository name.
### Add a new page
For a newly added page to appear in the navigation, edit the top-level
`mkdocs.yml` file.
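For example, a hypothetical new page `docs/mypage.md` would get an entry in the navigation section of `mkdocs.yml` (the entry title and file name here are placeholders; follow the structure already present in the file):

```yaml
nav:
  - Home: index.md
  - My New Page: mypage.md
```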
### Review a Merge Request from a private fork
1. Modify `.git/config` so merge requests are visible
## Writing Style
When adding a page, think about your audience.
* Are they new users or advanced experts?
* What is the goal of this content?
* [Grammatical Person](https://en.wikiversity.org/wiki/Technical_writing_style#Grammatical_person)
* [Active Voice](https://en.wikiversity.org/wiki/Technical_writing_style#Use_active_voice)
## Slurm options
* Show both long and short option when introducing an option in text
* Use the long version (where possible) in scripts
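For example, `sbatch`'s `--nodes` option has the short form `-N`: introduce it in text as `--nodes` (`-N`), and write batch scripts with the long form so they stay self-documenting. A generic illustration:

```bash
# Preferred in scripts: long options are self-documenting
#SBATCH --nodes=2
#SBATCH --time=01:00:00

# Equivalent short forms (fine interactively, avoid in scripts):
#   sbatch -N 2 -t 01:00:00 job.sh
```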
## Markdown lint
Install the markdown linter (requires node/npm) locally:
```shell
npm install markdownlint-cli
```
Run the linter from the base directory of the repository:
```shell
./node_modules/markdownlint-cli/markdownlint.js docs
```
## Password and Account Protection
A user is given a username (also known as a login name) and associated
password that permits her/him to access NERSC resources. This
username/password pair may be used by a single individual only:
*passwords must not be shared with any other person*. Users who share
their passwords will have their access to NERSC disabled.
Passwords must be changed as soon as possible after exposure or
suspected compromise. Exposure of passwords and suspected compromises
must immediately be reported to NERSC at security@nersc.gov or the
Account Support Group, accounts@nersc.gov.
## Forgotten Passwords
If you forget your password or if it has recently expired, and you
have previously answered your NIM security questions, you can reset
your password using the Self-Service Password reset link on the NIM
login page (https://nim.nersc.gov - Reset Your NIM Password?). See
"Managing Your User Account With NIM" in the NIM User's Guide. If you
haven't answered the Security Questions, you will need to call
Operations at 800-666-3772, menu option 1, or 510-486-6821 to get a
new temporary password (we do not send passwords via email). The
temporary password is good for only 24 hours. You should login to NIM
with this temporary password, and immediately choose a new password.
After about 10 minutes, this new password may be used to login to any
NERSC computer.
## Passwords for New Users
NERSC must have a Computer User Agreement (CUA) form on file before
activating a user's account and assigning a user a password in the NIM
system. This form can be submitted online.
Once we have received the form and attached it to your account
(assuming that your PI has already requested that you be added to
their project repository), you will receive an email with a link that
will allow you to set your initial password. This link will expire
after 72 hours. If it has expired, you will need to call Operations at
800-666-3772, menu option 1, or 510-486-6821 to get a temporary
password. The temporary password is good for only 24 hours. You
should immediately login to NIM with this password, and choose a new
password. After about 10 minutes, this new password may be used to
login to any NERSC computer.
## How To Change Your Password in NIM
All of NERSC's computational systems are managed by the LDAP protocol
and use the NIM password. Passwords cannot be changed directly on the
computational machines, but rather the NIM password itself must be
changed:
1. Point your browser to nim.nersc.gov and log in.
2. Click on the "Change My Password" link at the top left-hand corner
of the main page next to the Logout link, or select "Change NIM
Password" from the Actions pull-down list in the NIM main menu.
Passwords must be changed under any one of the following circumstances:
* At least every six months.
* Immediately after someone else has obtained your password (do *NOT*
give your password to anyone else).
* As soon as possible, but at least within one business day after a
password has been compromised or after you suspect that a password
has been compromised.
* On direction from NERSC staff.
Your new password must adhere to NERSC's password requirements.
## Password Requirements
As a Department of Energy facility, NERSC is required to adhere to
Department of Energy guidelines regarding passwords. The following
requirements conform to the Department of Energy guidelines regarding
passwords, namely DOE Order 205.3 and to Lawrence Berkeley National
Laboratory's
[RPM §9.02 Operational Procedures for Computing and Communications](https://www.lbl.gov/Workplace/RPM/R9.02.html).
When users are selecting their own passwords for use at NERSC, the
following requirements must be used.
* Passwords must contain at least eight nonblank characters.
* Passwords must contain a combination of upper and lowercase
letters, numbers, and at least one special character within the
first seven positions.
* Passwords must contain a nonnumeric letter or symbol in the first
and last positions.
* Passwords must not contain the user login name.
* Passwords must not include the user's own or (to the best of his or
her knowledge) a close friend's or relative's name, employee
number, Social Security or other Identification number, birth date,
telephone number, or any information about him or her that the user
believes could be readily learned or guessed.
* Passwords must not (to the best of the user's knowledge) include
common words from an English dictionary or a dictionary of another
language with which the user has familiarity.
* Passwords must not (to the best of the user's knowledge) contain
commonly used proper names, including the name of any fictional
character or place.
* Passwords must not contain any simple pattern of letters or numbers
such as "qwertyxx".
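The structural rules above can be checked mechanically. The sketch below is a hypothetical helper, not an official NERSC tool, and it cannot check the dictionary-word or personal-information rules:

```python
import string

def check_password(password, username=""):
    """Check the structural NERSC password rules listed above.

    Returns a list of violated rules; an empty list means the
    structural checks pass. Dictionary words, names, and other
    personal information cannot be checked mechanically here.
    """
    problems = []
    if len(password.replace(" ", "")) < 8:
        problems.append("fewer than eight nonblank characters")
    if not any(c.islower() for c in password):
        problems.append("no lowercase letter")
    if not any(c.isupper() for c in password):
        problems.append("no uppercase letter")
    if not any(c.isdigit() for c in password):
        problems.append("no number")
    # at least one special character within the first seven positions
    if not any(c in string.punctuation for c in password[:7]):
        problems.append("no special character in the first seven positions")
    # first and last positions must be nonnumeric
    if password and (password[0].isdigit() or password[-1].isdigit()):
        problems.append("first or last position is numeric")
    if username and username.lower() in password.lower():
        problems.append("contains the login name")
    return problems
```

For instance, `check_password("password1")` flags the missing uppercase letter and special character, while a password meeting every structural rule returns an empty list.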
## Login Failures
Your login privileges will be disabled if you have five login failures
while entering your password on a NERSC machine. You do not need a
new password in this situation. You can clear your login failures on
all systems by simply logging in to NIM. No additional actions are
necessary.
# Account Policies
There are a number of policies which apply to NERSC users. These
policies originate from a number of sources, such as DOE regulations
and DOE and NERSC management.
## User Account Ownership and Password Policies
A user is given a username (also known as a login name) and associated
password that permits her/him to access NERSC resources. This
username/password pair may be used by a single individual only.
*Passwords may not be shared and must be created or changed using
specified rules*. See [NERSC Passwords](passwords.md).
NERSC will disable a user who shares any one of her/his passwords with
another person. If a person using a username/password pair is not the
one who is officially registered with NERSC as the owner of that
username, then "sharing" has occurred and all usernames associated
with the owner of the shared username will be disabled.
If a user is disabled due to account sharing, the PI or an authorized
project manager must send a memo to consult@nersc.gov explaining why
the sharing occurred and assuring that it will not occur again. NERSC
will then determine if the user should be re-enabled.
The computer use policies and security rules that apply to all users
of NERSC resources are listed in the Computer User Agreement form.
## Security Incidents
If you think there has been a computer security incident you should
contact NERSC Security as soon as possible at security@nersc.gov. You
may also call the NERSC consultants (or NERSC Operations during
non-business hours) at 1-800-66-NERSC.
Please save any evidence of the break-in and include as many details
as possible in your communication with us.
## Acknowledge use of NERSC resources
Please acknowledge NERSC in your publications, for example:
>This research used resources of the National Energy Research
>Scientific Computing Center, which is supported by the Office of
>Science of the U.S. Department of Energy under Contract
>No. DE-AC02-05CH11231.
Placeholder text
# Analytics
Analytics is key to gaining insights from massive, complex datasets.
NERSC provides general-purpose analytics (IPython, Spark, MATLAB, IDL,
Mathematica, ROOT), statistics (R), machine learning, and imaging
tools.
# Machine Learning at NERSC
NERSC supports a variety of software for Machine Learning and Deep Learning
on our systems.
These docs pages are still in progress, but will include details about how
to use our system optimized frameworks, multi-node training libraries, and
performance guidelines.
# PyTorch
[PyTorch](https://pytorch.org/) is a high-productivity Deep Learning framework
based on dynamic computation graphs and automatic differentiation.
It is designed to be as close to native Python as possible for maximum
flexibility and expressivity.
## Availability on Cori
PyTorch can be picked up from the Anaconda python installations (e.g. via
`module load python`) or from dedicated modules with MPI enabled. You can
see which versions are available with `module avail pytorch-mpi`.
## Multi-node training
PyTorch makes it fairly easy to get up and running with multi-node training
via its included _distributed_ package. Refer to the distributed tutorial for
details: https://pytorch.org/tutorials/intermediate/dist_tuto.html
Note that the above tutorial doesn't document our currently recommended
approach of using `DistributedDataParallelCPU`. See the examples below.
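For orientation, the core of any data-parallel scheme (including `DistributedDataParallelCPU`) is averaging gradients across workers after each backward pass, so every replica applies the same update. A framework-free sketch of that allreduce step, with made-up gradient values:

```python
def allreduce_average(worker_grads):
    """Average per-parameter gradients across workers, as a
    data-parallel allreduce does; every worker then applies the
    same averaged gradient, keeping the model replicas in sync."""
    n = len(worker_grads)
    return [sum(per_param) / n for per_param in zip(*worker_grads)]

# Two workers, two parameters each: both workers would apply [2.0, 3.0].
avg = allreduce_average([[1.0, 2.0], [3.0, 4.0]])
```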
## Examples
We're putting together a coherent set of example problems, datasets, models,
and training code in this repository:
https://github.com/NERSC/pytorch-examples
This repository can serve as a template for your research projects with a
flexibly organized design for layout and code structure. The `template` branch
contains the core layout without all of the examples so you can build your
code on top of that minimal, fully functional setup. The code provided should
minimize your own boilerplate and let you get up and running in a distributed
fashion on Cori as quickly and seamlessly as possible.
The examples include:
* A simple hello-world example
* HEP-CNN classifier
* ResNet50 CIFAR10 image classification
* HEP-GAN for generation of RPV SUSY images.
The repository will also be used to benchmark our system for single and
multi-node training.
# TensorFlow
## Description
TensorFlow is a deep learning framework developed by Google in 2015. It is maintained and continuously updated to incorporate the results of recent deep learning research. TensorFlow therefore supports a large variety of state-of-the-art neural network layers, activation functions, optimizers, and tools for analyzing, profiling, and debugging deep neural networks. To deliver good performance, the TensorFlow installation at NERSC utilizes the optimized MKL-DNN library from Intel.
Explaining the full framework is beyond the scope of this website. For users who want to get started we recommend reading the TensorFlow [getting started page](https://www.tensorflow.org/get_started/). The TensorFlow page also provides a complete [API documentation](https://www.tensorflow.org/api_docs/).
## TensorFlow at NERSC
In order to use TensorFlow at NERSC, load the TensorFlow module via
```bash
module load tensorflow/intel-<version>
```
where `<version>` should be replaced with the version string you are trying to load. To see which ones are available use `module avail tensorflow`.
Running TensorFlow on a single node is the same as on a local machine; just invoke the script with
```bash
python my_tensorflow_program.py
```
## Distributed TensorFlow
By default, TensorFlow supports GRPC for distributed training. However, this framework is tedious to use and very slow on tightly coupled HPC systems. We therefore recommend using [Uber Horovod](https://github.com/uber/horovod) and package it with the TensorFlow module we provide. The version of Horovod we provide is compiled against the optimized Cray MPI and thus integrates well with SLURM. We give a brief overview of how to make an existing TensorFlow code multi-node ready below, but we recommend inspecting the [examples on the Horovod page](https://github.com/uber/horovod/tree/master/examples). We recommend using pure TensorFlow instead of Keras, as it shows better performance and the Horovod integration is smoother.
In order to use Horovod, one needs to import the module by doing
```python
import horovod.tensorflow as hvd
```
One of the first statements should then be
```python
hvd.init()
```
which initializes the MPI runtime. Then, the user needs to wrap the optimizer for distributed training using
```python
opt = hvd.DistributedOptimizer(opt)
```
To keep track of the global step, a global step object has to be created via `tf.train.get_or_create_global_step()` and passed to the `minimize` (or `apply_gradients`) member function of the optimizer instance.
Furthermore, to ensure model consistency on all nodes it is mandatory to register a broadcast hook via
```python
bcast_hook = [hvd.BroadcastGlobalVariablesHook(0)]
```
and pass it along with the other hooks to the `MonitoredTrainingSession` object. For example, it is beneficial to register a stop hook via
```python
stop_hook = [tf.train.StopAtStepHook(last_step=num_steps_total)]
```
For example, a training code like
```python
import tensorflow as tf

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)

# Make training operation
train_op = opt.minimize(loss)

# A plain session runs a fixed number of training steps.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps_total):
        # Perform synchronous training.
        sess.run(train_op)
```
should read as follows for distributed training:
```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Add session stop hook
global_step = tf.train.get_or_create_global_step()
hooks.append(tf.train.StopAtStepHook(last_step=num_steps_total))

# Make training operation
train_op = opt.minimize(loss, global_step=global_step)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
```
It is important to use `MonitoredTrainingSession` instead of the regular `Session` because it keeps track of the number of global steps and knows when to stop the training process when a corresponding hook is installed. For more fine-grained control over checkpointing, a [`CheckpointSaverHook`](https://www.tensorflow.org/api_docs/python/tf/train/CheckpointSaverHook) can be registered as well. Note that, unlike with a regular session object, the graph has to be finalized before the monitored training session context is entered; this limitation can cause some trouble with summary writers. Please see the [distributed training recommendations](https://www.tensorflow.org/deploy/distributed) for how to handle these cases.
## Splitting Data
It is important to note that splitting the data among the ranks is up to the user and needs to be done in addition to the modifications stated above. Utility functions can be used to determine the total number of ranks via `hvd.size()` and the rank id via `hvd.rank()`. If multiple ranks are employed per node, `hvd.local_rank()` and `hvd.local_size()` return the node-local rank id and the number of ranks per node. If the [dataset API](https://www.tensorflow.org/programmers_guide/datasets) is being used, we recommend the `dataset.shard` option to split the dataset. In other cases, the data sharding needs to be done manually and is application dependent.
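`dataset.shard(num_shards, index)` keeps every `num_shards`-th record starting at offset `index`, so each rank sees a disjoint subset of the data. The effect, sketched in plain Python (no TensorFlow needed):

```python
def shard(records, num_shards, index):
    """Mimic tf.data's dataset.shard: keep every num_shards-th
    element starting at position index."""
    return [r for i, r in enumerate(records) if i % num_shards == index]

# With four Horovod ranks, each rank would call
# dataset.shard(hvd.size(), hvd.rank()); here, rank 0 of 4 sees [0, 4, 8].
rank0_records = shard(list(range(10)), 4, 0)
```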
## Frequently Asked Questions
### I/O Performance and Data Feeding Pipeline
For performance reasons, we recommend storing the data in your scratch directory, accessible via the `SCRATCH` environment variable. At high concurrency, i.e. when many nodes need to read the files, we recommend staging them into the burst buffer. For efficient data feeding we recommend using the `TFRecord` data format and the [`dataset` API](https://www.tensorflow.org/programmers_guide/datasets) to feed data to the CPU. In particular, note that the `TFRecordDataset` constructor takes `buffer_size` and `num_parallel_reads` options which allow for prefetching and multi-threaded reads. These should be tuned for good performance, but note that a thread is dispatched for every independent read, so the number of inter-op threads needs to be adjusted accordingly (see below). The `buffer_size` parameter is in bytes and should be an integer multiple of the node-local batch size for optimal performance.
### Potential Issues
For best MKL-DNN performance, the module already sets a number of OpenMP environment variables, and we encourage users not to change them, especially `OMP_NUM_THREADS`. Setting this variable incorrectly can cause a resource starvation error, which manifests in TensorFlow reporting that too many threads have been spawned. If that happens, we encourage adjusting the inter- and intra-op parallelism by changing the `NUM_INTER_THREADS` and `NUM_INTRA_THREADS` environment variables. These parameters can also be changed in the TensorFlow python script by creating a session config object via
```python
sess_config = tf.ConfigProto(inter_op_parallelism_threads=num_inter_threads,
                             intra_op_parallelism_threads=num_intra_threads)
```
and passing it to the session manager
```python
with tf.train.MonitoredTrainingSession(config=sess_config, hooks=hooks) as sess:
    ...
```
Please note that `num_inter_threads * num_intra_threads <= num_total_threads`, where `num_total_threads` is 64 on Haswell or 272 on KNL.
# Abinit
ABINIT is a software suite to calculate the optical, mechanical,
vibrational, and other observable properties of materials. Starting
from the quantum equations of density functional theory, you can build
up to advanced applications with perturbation theories based on DFT,
and many-body Green's functions (GW and DMFT).
NERSC provides modules for [abinit](https://www.abinit.org).
Use the `module avail` command to see what versions are available:
```bash
nersc$ module avail abinit
```
## Example
See the [example jobs page](/jobs/examples/) for additional
examples and information about jobs.
### Edison
```bash
#!/bin/bash
#SBATCH --qos=regular
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --cpus-per-task=2
module load abinit
srun abinit < example.in
```
## Support
* [Forum](https://forum.abinit.org)
* [Wiki](https://wiki.abinit.org/doku.php)
* [Mailing List](https://sympa-2.sipr.ucl.ac.be/abinit.org)
!!! tip
    If *after* consulting with the above you believe there is an issue
    with the NERSC module, please file a
    [support ticket](https://help.nersc.gov).
# BerkeleyGW
The BerkeleyGW Package is a set of computer codes that calculates the quasiparticle
properties and the optical responses of a large variety of materials from bulk periodic
crystals to nanostructures such as slabs, wires and molecules. The package takes as
input the mean-field results from various electronic structure codes such as the
Kohn-Sham DFT eigenvalues and eigenvectors computed with Quantum ESPRESSO, PARATEC,
PARSEC, Octopus, Abinit, Siesta etc.
NERSC provides modules for [BerkeleyGW](https://www.berkeleygw.org).
Use the `module avail` command to see what versions are available:
```bash
nersc$ module avail berkeleygw