# A Tutorial on the Basics of Using a Cluster, CARNiE, and Slurm
The purpose of this document is to introduce a basic workflow for using a cluster. This includes commentary on the technical knowledge a user should have for using a cluster, as well as the expectations for users of a shared cluster. This tutorial assumes no prior knowledge of the subject. The discussion is split into two parts: Cluster Basics and Cluster Operations.
# Part 1: Cluster Basics
Using a cluster can be quite the jump for someone just learning how to use the command line. However, this is where the majority of the work gets done, so this is most likely where most of your learning will happen. Here we will log onto another computer, located who-knows-where, and learn how to get the files we need onto the cluster.
Note: Clusters, like people, are not always built the same way. The instructions here aim to be general, but know that some clusters have their own workflows.
## Logging Into a Cluster
Although the majority of this tutorial is general for all Linux systems, this section is specific to CARNiE. Here we will go over specific commands to log into CARNiE, as well as send files to CARNiE. Before doing any of this, you need to have an account on CARNiE. If you have one already, feel free to move right along. Otherwise, send us an email and we will get you set up as soon as we can!
Clusters are remote resources that multiple users can benefit from. To access these resources, we need to be able to do a remote log in to the cluster's head/master node. A node is just an accessible compute resource, so you can imagine a node to be one single desktop computer. In this case, the head node is like a welcome desk for all users to check in to and then request resources for their work. To log into the head node, we use the command `ssh` (short for secure shell). The general structure of ssh is `ssh -flag1 -flag2 username@headnodename`. On CARNiE this would look something like the following (make sure you replace `bburnett` with your own username):
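```console
ssh bburnett@carnie.cscvr.umassd.edu
```
You will typically be prompted for your CARNiE password, and once you are in, your prompt will change to show that you are on the head node.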
From here, we operate just as if we were at the command line on our own computer!
## Using GNU Screen
"GNU Screen" is an incredibly useful tool when you will be logged into a cluster for any appreciable amount of time (for instance, troubleshooting jobs, installing software, or running interactive sessions).
Screen will maintain an active cluster session EVEN if your connection hiccups, or if you switch computers.
To get the most out of Screen, you should add one thing to the screenrc configuration file (which will allow you to scroll up). Navigate to the (root) /etc directory and copy the “screenrc” file (you may have to look closely, it’s easy to overlook!) to someplace handy like your Desktop. Open the copied screenrc file and add the following two lines to the end of the file:
```
# Allows scrolling with the mouse wheel:
termcapinfo xterm* ti@:te@
```
Next, upload the screenrc file to the cluster into your home directory (/autofs/nas1/home/username). Finally, rename the file to include a dot at the beginning (.screenrc). Screen will now read this .screenrc file on startup.
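If you are unsure how to do the upload and rename, one way (using the `scp` command covered in the "Copying Files Onto a Cluster" section below) would look something like this, assuming the edited screenrc file sits in your current directory on your own computer:
```console
scp screenrc bburnett@carnie.cscvr.umassd.edu:~/
ssh bburnett@carnie.cscvr.umassd.edu
mv ~/screenrc ~/.screenrc
```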
To use Screen, after you log into the cluster just enter `screen` and you will see the Terminal window refresh. The text in the title bar of the window will also be different. You will still be at the master node, and you can proceed as normal. The only difference is that if your connection to the cluster drops, you can get back into your active session!
If your connection to the cluster blips and you get a “connection reset” error, just log back into the cluster, then enter `screen -ls` to see whether your screen session is attached or detached. If the session is attached, enter `screen -x`; if it is detached, enter `screen -r` to reattach it. Either way, you should see the text from before the connection reset.
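For example, picking up a detached session after a dropped connection might look something like this (the session name and numbers are illustrative and will differ on your system):
```console
bburnett@cluster: $screen -ls
There is a screen on:
        12345.pts-0.cluster     (Detached)
1 Socket in /run/screen/S-bburnett.
bburnett@cluster: $screen -r
```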
IMPORTANT: To terminate the Screen session (when the task is complete), enter `exit` to close the Screen session. You can then enter `exit` repeatedly to log out of the cluster and then close the Terminal window.
NOTE: You may not be able to submit batch jobs within a Screen session! You may get an error such as "syntax error near unexpected token" or similar. This is because Screen sets some environment variables that include special syntax, such as a close parenthesis ")", which is incompatible with some scripts that read those variables. (If you are using Launcher and you get this error, see [https://github.com/TACC/launcher/pull/66](https://github.com/TACC/launcher/pull/66).)
Full info about GNU screen can be found here: [https://www.gnu.org/software/screen/](https://www.gnu.org/software/screen/)
## Copying Files Onto a Cluster
In order to do work on a cluster, we need to be able to transfer files to it. To do this we use the `scp` command (short for secure copy). This command functions very similarly to `cp` and has the form of `scp -flags source destination`. Either the source or the destination can be a remote resource, but I've found it to be easiest to assume the source file is always on our computer. With this assumption, we can refine the form of our command to be `scp -flags source username@headnodename:destination`. This refined form is just a mix of `cp` and `ssh` syntax. It is also possible to send multiple files to the cluster with `scp -flags source1 source2 source3 username@headnodename:destination`. This would look something like:
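```console
scp file.txt bburnett@carnie.cscvr.umassd.edu:~/
```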
Here, I sent the source `file.txt` to the user `bburnett` on the `carnie.cscvr.umassd.edu` cluster, with the destination `~/`. The destination `~/` is short for my user's home directory. At the command line, `~` always stands for home, which is typically `/home/username`.
Although it is possible to send directories with `scp`, it is not recommended. To save bandwidth, make sure you use some form of compression (zip, tar, etc.) before sending files to a cluster. Files can then be decompressed on the cluster with their associated commands. For zip files: `unzip filename`, for tarballs: `tar -xf filename`.
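For example, to send a whole directory of input files (here a hypothetical directory called `my_project`), you could create a tarball, copy it over, and then extract it on the cluster:
```console
tar -czf my_project.tar.gz my_project
scp my_project.tar.gz bburnett@carnie.cscvr.umassd.edu:~/
ssh bburnett@carnie.cscvr.umassd.edu
tar -xf my_project.tar.gz
```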
## Cluster Basics TLDR
To log into a remote computer: `ssh -flags username@hostname`
To copy files onto a remote computer: `scp -flags source1 source2 source3 username@hostname:destination`
To unzip a file: `unzip filename`
To extract a tarball: `tar -xf filename`
## Using FileZilla for Windows Users
Windows users can use FileZilla to move and manipulate files on the cluster. FileZilla provides a nice graphical user interface of files and folders on the cluster.
FileZilla can be found here: [https://filezilla-project.org/](https://filezilla-project.org/)
To set up FileZilla for CARNiE, use these settings:
- Protocol = SFTP
- Host = carnie.cscvr.umassd.edu
- Logon Type = Normal
- User = [your CARNiE username]
- Password = [your CARNiE password]
- All other settings can be default
# Part 2: Cluster Operations
Now that we can access the cluster, it is time to learn how to use it. In this section we will learn how to get information about CARNiE, how to submit jobs to the scheduler, and how to check the status of jobs. CARNiE uses the Slurm resource manager, and there exists a breadth of examples on how to use it. Here we will go over some of the basics of what Slurm provides.
## Cluster Information
When we first log into a cluster, it is useful to get an idea of what is available for us to use. The command to do this is `sinfo` (short for Slurm information). The output of this command looks something like:
```console
bburnett@cluster: $sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
..........
```
Here, the actual useful information has been left out because this output differs from cluster to cluster. The important point for this tutorial is what the columns actually mean. The first is the column `PARTITION`. A cluster can be split up into different partitions/queues for different types of jobs. For example, there could be a partition for users who only need one node for a long job called `long-single`, and another partition for users who use multiple nodes for jobs called `large-parallel`. This helps organize a cluster by the different use cases.
The second column `AVAIL` tells us the partition's availability. This column can take values of `up` or `down`.
The third column `TIMELIMIT` tells us how long a job may hold an allocation from that partition. These limits have the form `D-HH:MM:SS`; for example, `1-12:00:00` means one day and twelve hours. If a job goes over this time limit, it will be forced to terminate. This is to make sure all users have a chance to get their work done. It also means that users need to be aware of how long their jobs will need to run, and make sure their work is cut up in such a way that each job can complete in time.
The fourth column `NODES` gives the number of nodes in that partition that are in the listed state.
The fifth column `STATE` gives the state of those nodes. Note that a partition will show up multiple times in this list, once for each state its nodes are in. The states of the nodes can take values of `idle`, `down`, `drain`, `alloc`, and sometimes more depending on the cluster setup.
The sixth column `NODELIST` gives the names of the compute nodes in the form of a list. For example, if a partition contains 3 nodes named node1, node2, and node3, the node list would be `node[1-3]`.
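As a purely illustrative example (the partition names, time limits, and node names here are made up and will differ on your cluster), an `sinfo` listing might look like:
```console
PARTITION       AVAIL  TIMELIMIT  NODES  STATE  NODELIST
long-single        up 7-00:00:00      2   idle  node[1-2]
long-single        up 7-00:00:00      1  alloc  node3
large-parallel     up 1-00:00:00      3   idle  node[4-6]
```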
## Cluster Etiquette
A cluster is a shared resource. As such, a user needs to be respectful of others on the system. To be respectful in a multi-user environment, be considerate of how many jobs you submit to a queue. A scheduler will do its best to make sure that all users are treated fairly, but it is possible to bog down a scheduler. It is a good idea to submit your jobs in waves: submit enough to keep yourself busy rather than flooding the queue all at once. An analogy would be that a cluster is a watering hole where all animals come to drink. No animal likes the hog that pushes other animals away, keeping the water all to itself. All animals are there to drink, so drink your fill and be aware of how much is available for others.
Also, do your best to not run heavy processes on the head/master node. All users congregate on the head node and if a user runs an intense program on the head node, all other users will feel the impact. The general idea is to make the worker/compute nodes run things and the head node just orchestrate that work.
Note that you should only compile software on the head node, NOT a compute node!
To request an interactive session on a compute node (leaving the head/master node), see the "Submitting Interactive Jobs" section below.
## Submitting Interactive Jobs
Now that we know a little about the structure and etiquette of the cluster, it is time to learn how to do work on it. The first job submission technique is the interactive job, which is useful for testing that your job script will work. It can also be used for a quick task that you would like to run by hand. To request a node to log into, enter the following:
```console
bburnett@cluster: $hostname
cluster
bburnett@cluster: $srun -N 1 --pty /bin/bash
bburnett@node3: $hostname
node3
```
The command here that logs us into a node is `srun`. The flag `-N 1` requests one node (optional), and the flag `--pty` requests "pseudo terminal mode" which can be deciphered as running interactively. The final `/bin/bash` just tells `srun` to use bash for our command line. (By default, the allocation will be in the large-parallel queue.)
Note that you should only compile software on the head node, NOT a compute node!
Once you are on a compute node, you can run interactive jobs as you would like. NOTE that switching nodes purges your loaded modules, so you MUST re-load any modules you require after switching to a compute node. When you are finished, enter `exit` to return to the master/head node.
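As a sketch, a short interactive session might look like the following (the module name and program here are hypothetical examples; substitute whatever your own work needs):
```console
bburnett@cluster: $srun -N 1 --pty /bin/bash
bburnett@node3: $module load gcc      # re-load any modules you need ('gcc' is just an example)
bburnett@node3: $./my_program         # run your interactive work
bburnett@node3: $exit
bburnett@cluster: $
```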
## Submitting Batch Jobs
The second method of work we will be learning is batch job submission. This is the intended way to use a cluster. In this method, the user writes a script describing the work, and the scheduler runs it behind the scenes once an allocation is available. A batch job for Slurm is just a bash script with additional directives telling Slurm what allocation you would like. A sample job that simply sleeps for 2 minutes and prints the host (aka node) name would look like:
```bash
#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --partition=PARTITIONNAME
#SBATCH --nodes=1
#SBATCH --output=MyJob-%j.out
echo 'Sleeping........'
sleep 120
hostname
```
The directives here set the name of the job to `MyJob`, allocate one node from `PARTITIONNAME`, and put the output into a file `MyJob-%j.out`, where the `%j` will be replaced by the job ID provided by Slurm. There are many more directives that can go into a script. Just as an example, we can request one particular node by adding the `--nodelist` line:
```bash
#!/bin/bash
#SBATCH --job-name=MyJob
#SBATCH --partition=ShortSingle
#SBATCH --nodes=1
#SBATCH --nodelist=node[2]
#SBATCH --output=MyJob-%j.out
echo 'Sleeping........'
sleep 120
hostname
```
To submit this job to the scheduler, use the command `sbatch MyJobScript.sh`
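When the job is accepted, `sbatch` prints the job ID that Slurm assigned to it, for example:
```console
bburnett@cluster: $sbatch MyJobScript.sh
Submitted batch job 12345
```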
For another example Slurm batch job script, see the [Batch Job Scripts](/2_TUTORIALS/D. Batch Job Scripts) wiki subpage.
## Job Status
With a job now submitted for execution, it would be nice to know what the status of that job is. To do this, we use the command `squeue`. This will display a list of all of the jobs on the cluster and their status. To only display the status of your job, use `squeue -u username`. For example:
```console
bburnett@cluster: $squeue -u bburnett
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
............
```
The columns here are similar to those from `sinfo`, however the `ST` and `NODELIST(REASON)` columns deserve a little more commentary. The `ST` column displays the status of the job: `R` for running, `CG` for completing, `CA` for cancelled, and `PD` for pending. The `NODELIST(REASON)` column will either display the list of nodes that the job is running on, or the reason that the job is not running yet. The reasons a job could be pending include (but are not limited to):
- `Resources` - the scheduler is waiting for the requested resources to become free;
- `PartitionTimeLimit` - the user requested more time than the partition allows, and the job will not run;
- `PartitionNodeLimit` - the user requested more nodes than the partition allows, and the job will not run;
- `ReqNodeNotAvail` - the requested nodes are not available at this time, and the job will not run;
- `SystemFailure` - Slurm is having trouble, and you should wait until the system admins get things fixed.
Another useful thing here is the `JOBID`. With this job ID, you can cancel your job while it is running with the command `scancel JOBID`.
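For example, to cancel the job from the `sbatch` example above (the job ID `12345` there was just an illustration; use the ID that `squeue` reports for your job):
```console
bburnett@cluster: $scancel 12345
```
`scancel` prints nothing when it succeeds; a follow-up `squeue -u username` should show the job leaving the queue.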
## Cluster Operations TLDR
Get info on the cluster: `sinfo`
Submit a job to the queue: `sbatch scriptname`
Request an interactive allocation: `srun -N 1 --pty /bin/bash`
Request an interactive node on a partition: `srun -N 1 -p partitionName --pty /bin/bash`
Look at the queue status: `squeue`
Look at the queue status for username: `squeue -u username`
Cancel a job with id ID: `scancel ID`
For more cluster commands, see the [Commands Reference](/1_CARNiE/E. Commands Reference) wiki subpage.