# Minimal overview of Slurm at PDSF

(see any generic Slurm tutorial for more details)

The Slurm equivalent of `qsub jobscript.sh` is:
```bash
# on your laptop
$ ssh -X pdsf.nersc.gov
# on the PDSF login node
$ sbatch -p shared-chos jobscript.sh
   Submitted batch job 102992
```

!!! note
	Your default chos will be used to run jobscript.sh. The
	default run time is 24 h and the default RAM is 4 GB.

The simplest Slurm job, charged to the LZ account, looks like this:

```shell
$ cat hello2.slr
--8<-- "docs/pdsf/slurm/hello2.slr"
```
Submit the Slurm job with the command below:

```shell
$ sbatch hello2.slr
```

It writes stdout and stderr to a single file, 'slurm-3726343.out', where
the large number is the job ID.
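
If a script needs the job ID, for example to locate the log file afterwards, it can be taken from the `sbatch` reply. The sketch below assumes the reply has the form `Submitted batch job <id>` shown earlier and that the job is not an array job; the helper names `extract_jobid` and `log_name` are made up for illustration. The reply is hard-coded here so the snippet runs without Slurm.

```bash
#!/bin/bash
# Print the last whitespace-separated field of the sbatch reply (the job ID).
extract_jobid() {
    printf '%s\n' "$1" | awk '{ print $NF }'
}

# Build the default Slurm output file name for a non-array job.
log_name() {
    printf 'slurm-%s.out\n' "$1"
}

# On PDSF the reply would come from: reply=$(sbatch hello2.slr)
reply="Submitted batch job 3726343"
jobid=$(extract_jobid "$reply")
echo "job $jobid writes to $(log_name "$jobid")"
```

`sbatch --parsable` prints the bare job ID, which avoids the parsing step when the human-readable reply is not needed.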

**Interactive session** using Shifter on PDSF in SL6.4

```bash
ssh -Y pdsf.nersc.gov

salloc -n 1 -p shared  -t 50:00 --image=custom:pdsf-chos-sl64:v4 --volume=/global/project:/project
shifter /bin/bash

echo inShifter:`env|grep  SHIFTER_RUNTIME`
export CHOS=sl64
source ~/.bash_profile.ext

cd abc/
```

!!! note
	You can't put all of the above into one script (you would
	need 2 scripts). See the majorana-shifter example for how to launch
	an arbitrary command inside Shifter from a PDSF login node.
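
One possible shape of the two-script split is sketched below; it is not taken from the majorana-shifter example. The outer batch script carries the Shifter options and the inner script holds the commands that must run inside the image. The file names `shifter_job.slr` and `inside_shifter.sh` are hypothetical, and the sketch assumes that `sbatch` accepts the same `--image` and `--volume` options as the `salloc` line above.

```bash
#!/bin/bash
# Write the outer batch script: Slurm directives plus the shifter launch.
cat > shifter_job.slr <<'EOF'
#!/bin/bash
#SBATCH -n 1
#SBATCH -p shared
#SBATCH -t 50:00
#SBATCH --image=custom:pdsf-chos-sl64:v4
#SBATCH --volume=/global/project:/project
# Everything in inside_shifter.sh runs inside the Shifter image.
shifter /bin/bash inside_shifter.sh
EOF

# Write the inner script: the commands typed by hand in the interactive session.
cat > inside_shifter.sh <<'EOF'
#!/bin/bash
echo inShifter:$(env | grep SHIFTER_RUNTIME)
export CHOS=sl64
source ~/.bash_profile.ext
cd abc/
EOF

echo "wrote shifter_job.slr and inside_shifter.sh; submit with: sbatch shifter_job.slr"
```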

Submit a job array of size 100:

```bash
$ sbatch -p shared-chos    --array=1-100 jobscript.sh
```

Submit a job array of size 100, but run at most 10 tasks at once:

```bash
$ sbatch -p shared-chos  -t 24:00:00  --array=1-100%10 jobscript.sh
```
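Inside an array job, Slurm sets `SLURM_ARRAY_TASK_ID` to the index of the current task, so one job script can pick a different input per task. The fragment below is a minimal sketch of such a job script body; the `input_NNN.dat` naming scheme and the helper `task_input` are assumptions for illustration. It falls back to index 1 when run outside Slurm.

```bash
#!/bin/bash
# Map an array index to a (hypothetical) input file name.
# Uses the first argument if given, else SLURM_ARRAY_TASK_ID, else 1.
task_input() {
    local task_id="${1:-${SLURM_ARRAY_TASK_ID:-1}}"
    printf 'input_%03d.dat\n' "$task_id"
}

echo "array task ${SLURM_ARRAY_TASK_ID:-1} would process $(task_input)"
```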

Submit one task that runs on 32 vCores and uses 50.1 GB of RAM:
```bash
$ sbatch -p shared-chos --mem 50100M -n32  jobscript.sh
```
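The `--mem` value is given in megabytes here, so a request expressed in GB has to be converted first. The helper below is a small sketch of that arithmetic; the name `gb_to_mem` is made up, and it assumes 1 GB = 1000 MB, which is the convention that turns 50.1 GB into `50100M` in the command above.

```bash
#!/bin/bash
# Convert a memory request in GB to the megabyte string passed to --mem.
# Assumption: 1 GB = 1000 MB, matching 50.1 GB -> 50100M above.
gb_to_mem() {
    awk -v gb="$1" 'BEGIN { printf "%.0fM\n", gb * 1000 }'
}

mem=$(gb_to_mem 50.1)
echo "sbatch -p shared-chos --mem $mem -n32 jobscript.sh"
```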
Start an **interactive session** on a Slurm worker node with:
```bash
$ salloc  -p shared-chos  -t 1:00:00
    salloc: Granted job allocation 93574
```

**Licenses**: an optional constraint that tells Slurm which resources your
job needs. If specified, it allows Slurm to protect your job (e.g. by not
starting it) when the given resource is not available. Typically, users
who need /project(a) should add the matching line to the Slurm job description:

```bash
#SBATCH -L project
#SBATCH -L projecta
```
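When a job needs more than one license, the `-L` (`--licenses`) option takes a comma-separated list, so both file systems can be requested in one directive. The header below is a sketch of a complete job script that does so; the partition and time limit are copied from earlier examples, and the payload (the `job_label` variable and the `echo`) is a placeholder.

```bash
#!/bin/bash
#SBATCH -p shared-chos
#SBATCH -t 1:00:00
#SBATCH -L project,projecta

# Placeholder payload: report which job is running and where.
# SLURM_JOB_ID is set by Slurm; the fallback applies when run by hand.
job_label="job ${SLURM_JOB_ID:-unknown}"
echo "$job_label started in $PWD"
```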

STAR-specific: request a license to access HPSS:

```bash
#SBATCH -L starhpssio
```

Check whether you can run jobs in PDSF Slurm:

```bash
$ sacctmgr show assoc where user=$USER
     Cluster    Account       User    Share
       pdsf1       lz       balewski    10
```

List all your queued and running jobs with `sqs`; no arguments are needed.

`sqs` can also list jobs for other users (see `sqs --help`), e.g.:

```bash
$ sqs -u dybspade
JOBID              ST   REASON       USER         NAME         NODES        USED         REQUESTED    SUBMIT                PARTITION    RANK_P       RANK_BF
20105              R    None         dybspade     rmq_pdsf_kup 1            19:51:39     24:00:00     2017-06-22T13:39:51   shared       N/A          N/A
20106              R    None         dybspade     rmq_pdsf_kup 1            19:50:02     24:00:00     2017-06-22T13:41:28   shared       N/A          N/A
```

To learn more about one job, you can use Rebecca's line:

```bash
$ sacct --format=job,user,submit,start,end,exitcode,nnodes,alloccpus,timelimit,cputime,state%20,maxvmsize,qos,maxrs -j 21115
       JobID      User              Submit               Start                 End ExitCode   NNodes  AllocCPUS  Timelimit    CPUTime      State  MaxVMSize        QOS
------------ --------- ------------------- ------------------- ------------------- -------- -------- ---------- ---------- ---------- ---------- ---------- ----------
21115          kkrizka 2017-06-23T13:23:53 2017-06-23T13:23:53 2017-06-23T13:32:54      0:0        1          1   00:25:00   00:09:01  COMPLETED                normal
21115.batch            2017-06-23T13:23:53 2017-06-23T13:23:53 2017-06-23T13:32:54      0:0        1          1              00:09:01  COMPLETED    130940K
```

Why is my job not starting?

```bash
$ scontrol show job 28547_300
   JobId=28547 ArrayJobId=28547 ArrayTaskId=300 JobName=atlas-chos
   Priority=1802 Nice=0 Account=atlas QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Partition=shared-chos AllocNode:Sid=pdsf8:15532
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=3008,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=3008M MinTmpDiskNode=0
```
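The field to look at in this output is `Reason=` on the `JobState` line: `Resources` means the job is waiting for resources to become free, and `Priority` means higher-priority jobs are ahead of it. The helper below is a sketch that pulls that single value out of the `scontrol show job` text; the name `job_reason` is made up, and the sample line is copied from the output above so the snippet runs without Slurm.

```bash
#!/bin/bash
# Read `scontrol show job` output on stdin and print the value of Reason=.
job_reason() {
    sed -n 's/.*Reason=\([^ ]*\).*/\1/p' | head -n 1
}

# On PDSF: scontrol show job 28547_300 | job_reason
sample='   JobState=PENDING Reason=Resources Dependency=(null)'
printf '%s\n' "$sample" | job_reason
```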

How many jobs of any kind are running in the shared-chos partition?

```bash
$ squeue -p shared-chos |nl|tail
     1	 JOBID     USER ACCOUNT           NAME  ST REASON          START_TIME                TIME  TIME_LEFT NODES CPUS  PARTITION   PRIORITY
     3	34047_  shapiro   atlas     atlas-chos   R None            2017-06-29T10:51:53    1:10:42    3:44:18     1    1 shared-cho        722
     4	34047_  shapiro   atlas     atlas-chos   R None            2017-06-29T10:51:53    1:10:42    3:44:18     1    1 shared-cho        722
```
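To get the count as a single number rather than reading it off the `nl` output, the rows can be filtered by state. The sketch below assumes the column layout shown above, where `ST` is the fifth column and the first line is a header; the helper name `count_state` is made up, and the sample rows are copied from the output above so the snippet runs without Slurm.

```bash
#!/bin/bash
# Count squeue rows (read from stdin) whose fifth column equals the given state.
# The first line is skipped because it is the column header.
count_state() {
    awk -v st="$1" 'NR > 1 && $5 == st { n++ } END { print n + 0 }'
}

# On PDSF: squeue -p shared-chos | count_state R
count_state R <<'EOF'
 JOBID     USER ACCOUNT           NAME  ST REASON          START_TIME                TIME  TIME_LEFT NODES CPUS  PARTITION   PRIORITY
34047_  shapiro   atlas     atlas-chos   R None            2017-06-29T10:51:53    1:10:42    3:44:18     1    1 shared-cho        722
34047_  shapiro   atlas     atlas-chos   R None            2017-06-29T10:51:53    1:10:42    3:44:18     1    1 shared-cho        722
EOF
```

With standard `squeue` options, `squeue -h -t RUNNING -p shared-chos | wc -l` gives the same count without depending on the column order.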

Show the number of job slots, the status, etc. of a partition:

```bash
$ scontrol show partition shared-chos
   PartitionName=shared-chos
   Nodes=mc15[28-34]
   State=UP TotalCPUs=420 TotalNodes=7 SelectTypeParameters=NONE
   DefMemPerCPU=1000 MaxMemPerCPU=2000
```

Who can run jobs on account=star?

```bash
$ sacctmgr list assoc account=star
   Cluster    Account       User  Partition     Share
---------- ---------- ---------- ---------- ---------
     pdsf1       star                             736
     pdsf1       star     aarose                   10
     pdsf1       star       abha                   10
     pdsf1       star   afleming                   10
     pdsf1       star   agafover                   10
```

List all Slurm jobs from all PDSF users (formerly `sgeusers`):

```bash
$ slusers

Current Slurm usage summed over all PDSF users
   Rjob     Rcpu   Rcpu*h    PDjob    PDcpu      user:account:partition
      5       15     17.3        0        0      balewski nstaff shared
     10       10     15.2        0        0      balewski nstaff shared-cho
     47       47    103.6       20       20      kkrizka atlas shared
      2        2      0.1        0        0      shapiro atlas shared

   Rjob     Rcpu   Rcpu*h    PDjob    PDcpu      account:partition
     49       49    103.7       20       27      atlas shared
      5       15     17.3        0        7      nstaff shared
     10       10     15.2        0        7      nstaff shared-cho

     64       74    136.3       20       27        TOTAL
```

Primary PDSF shares per experiment:

```bash
$ sshare -A alice,rhstar,dayabay,majorana,atlas,lz,lux,cuore,pdtheory
             Account       User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
alice                                  495    0.186160   567432591      0.311335
atlas                                  427    0.160587    12506197      0.006862
cuore                                    2    0.000752      377071      0.000207
dayabay                                265    0.099662   193828083      0.106348
lux                                     26    0.009778    24359728      0.013366
lz                                     400    0.150432   383734087      0.210545
majorana                                51    0.019180     8572583      0.004704
pdtheory                                 2    0.000752           0      0.000000
rhstar                                 736    0.276796   611072852      0.335279
```

Examples of interactive and Slurm batch jobs for all PDSF experiments,
updated June 2017.

**Slurm job script generator** (designed for Cori) seems like it can
answer most of your SBATCH questions:
https://my.nersc.gov/script_generator.php


### How can I get code for the examples?

```bash
ssh pdsf
git clone https://bitbucket.org/balewski/tutorNersc
cd tutorNersc/2017-05-pdsf3.0
ls
```

### Table 1

List of all Slurm+Shifter examples provided by PDSF users.

----------
| Experiment | Shifter image | Example code | Author | On Cori | Slurm+CHOS \[remarks] |
|------------|:-------------:|--------------|-------:|---------|-----------------------|
| LZ | custom:pdsf-chos-sl64:v4 | /lz-afan \[lz2] | Alden Fan | no CVMFS | yes |
| Majorana | custom:pdsf-chos-sl64:v4 | /majorana-mbuuck/ \[mj1] | Micah Buuck | yes | yes |
| Majorana | custom:pdsf-chos-sl64:v4 | /majorana-dave/ | David Tedeschi | yes | no |
| ATLAS | custom:pdsf-chos-sl64:v4 | /atlas-shapiro \[at1] | Haichen Wang | no CVMFS | yes |
| | | /atlas-kkrizka \[at2] | Karol Krizka | yes | yes, \[IOn10] |
| | | /atlas-spgriso \[at3] | Simone Griso | no CVMFS | yes, \[IOn10] |
| STAR | custom:pdsf-sl64-star:v6 | /star-balewski \[st1] | J.B. | - | yes |
| | | root4star BFC \[st2] | J.B. | yes | yes, \[IOy60] |
| DayaBay | docker:balewski/sl64-dayabay:c | /dayabay-balewski \[dyb1] | J.B. | YES (only!) | N/A |
| DayaBay | custom:pdsf-chos-sl53:v1 | /dayabay-hack \[dyb2] | Robert Hackenburg | no | yes |
| LUX | custom:pdsf-sl64-star:v6 | /lux-epease | Evan Pease | - | yes |

\[lz2] LZ reconstruction package, uses cvmfs, reads data from
/project. See /lz-afan/Readme.

\[mj1] Majorana data analysis, raw waveform classification, reads data
from /project.

\[at1] rootTask1.sh: copies a tar file of a compiled RootCore package,
untars it in the running directory, runs rcSetup, and runs the executable
of the package. Needs CVMFS, job-array aware, works on /project.

\[at2] launch.sh oneMG.slr: run MadGraph in local scratch, I/O only to
/project, job-array aware.

\[at3] athena_sim1.sh: athena job that runs simulation, uses
AtlasProduction releases, cvmfs, and a flag for large memory. Can be
run as a job array; each task sees a different subset of events.

\[st1] Detailed STAR interactive tutorial:
[ws2-interactive-starExp.md](https://bitbucket.org/balewski/tutornersc/src/master/2017-05-pdsf3.0/star-balewski/ws2-interactive-starExp.md)

\[st2] r4sTask_bfc.csh: makes a sandbox, sets starver, runs BFC on a
daq file from /project, uses DB:mstardbNN.nersc.gov, job-array aware,
writes to $SLURM_TMP, saves to /project.

\[dyb1] private Docker image, sl64, CERN libs and DYB software
compiled inside the image, works only for user=balewski

\[dyb2] generic sl53, interactive & batch. Does not work on Cori
because DYB bins from /common/dayabay/releases/ are needed.

\[IOy60] can run 60 processes on a single node when the system is
otherwise empty.

\[IOn10] can NOT run even 10 processes on a single node, even when the
system is otherwise empty, because of I/O contention.