# The HPSS Archive System

## Introduction

The High Performance Storage System (HPSS) is a modern, flexible,
performance-oriented mass storage system. It has been used at NERSC
for archival storage since 1998. HPSS is intended for long term
storage of data that is not frequently accessed.

HPSS is Hierarchical Storage Management (HSM) software developed by a
collaboration of DOE labs, of which NERSC is a participant, and
IBM. The HPSS system is a tape system that uses HSM software to ingest
data onto high performance disk arrays and automatically migrate it to
a very large enterprise tape subsystem for long-term retention. The
disk cache in HPSS is designed to retain many days' worth of new data
and the tape subsystem is designed to provide the most cost-effective
long-term scalable data storage available.

NERSC's HPSS system can be accessed at archive.nersc.gov through a
variety of clients such as `hsi`, `htar`, ftp, pftp, and Globus. By
default, every user has an HPSS account.

## Getting Started

### Accessing HPSS

You can access HPSS from any NERSC system. Inside of NERSC, files can
be archived to HPSS individually with the `hsi` command or in groups
with the `htar` command (similar to the way `tar` works). HPSS is also
accessible via Globus, gridFTP, ftp, and pftp. Please see the
[Accessing HPSS](archive_access.md) page for a list of all possible
ways to access HPSS and details on their use.
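
For a quick illustration, here is a minimal sketch of both clients (the
file and directory names are hypothetical):

```
# Store a single file in HPSS, then retrieve it to the current directory
hsi put myresults.dat
hsi get myresults.dat

# Bundle an entire directory into one archive in HPSS
htar -cvf mydir.tar mydir/
```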

HPSS uses NIM to create an "hpss token" for user authentication. On a
NERSC system, typing `hsi` or `htar` will usually be enough to create
this token. If you are accessing HPSS remotely (using ftp, pftp, or
gridFTP), you may need to manually generate a token. Please see the
[Accessing HPSS](archive_access.md) page for more details.

### Best Practices

HPSS is intended for long term storage of data that is not frequently
accessed.

The best guide for how files should be stored in HPSS is how you might
want to retrieve them. If you are backing up against accidental
directory deletion or failure, store your files in a structure where
you use htar to bundle up each directory separately. On the other
hand, if you are archiving data files, you might want to bundle things
up according to the month the data was taken, detector run
characteristics, etc. The optimal size for htar bundles is 100 - 500
GB, so you may need to create several htar bundles for each set
depending on the size of the data.
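
As a sketch of the per-directory approach, assuming a handful of
hypothetical run directories, you could create one bundle per directory
so each can be restored independently:

```
# Create one htar bundle per top-level directory
for dir in run_2017_01 run_2017_02 run_2017_03; do
    htar -cvf "${dir}.tar" "${dir}"
done
```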

#### Group Small Files Together

HPSS is optimized for file sizes of 100 - 500 GB. If you need to store
many files smaller than this, please use htar to bundle them together
before archiving. HPSS is a tape system and responds differently than
a typical file system. If you upload large numbers of small files, they
will be spread across dozens or hundreds of tapes, and retrieving them
will require loading and positioning each of those tapes. Storing many small
files in HPSS without bundling them together will result in extremely
long retrieval times for these files and will slow down the HPSS
system for all users.
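
For example, a minimal sketch (the archive and directory names are
hypothetical):

```
# Bundle a directory full of small files into a single archive in HPSS
htar -cvf small_files.tar ./small_files_dir

# List the archive contents to confirm what was stored
htar -tvf small_files.tar
```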

Please see the [Accessing HPSS](archive_access.md) page for more
details on how to use htar.

#### Order Large Retrievals

If you are retrieving many files (> 100) from HPSS, you need to order
your retrievals so that all files on a single tape will be retrieved
in a single pass, in the order they are stored on the tape. NERSC has
a script to help you generate an ordered list for retrieval called
`hpss_file_sorter.script`.

??? tip "Generating a sorted list for retrieval"
    To use the script, you first need a list of fully qualified
    file path names and/or directory path names. If you do not
    already have such a list, you can query HPSS using the
    following command:

    ```
    hsi -q 'ls -1 <HPSS_files_or_directories_you_want_to_retrieve>' 2> temp.txt
    ```

    (For csh, replace `2>` with `>&`.) Once you have the list of files, feed
    it to the sorting script:

    ```
    hpss_file_sorter.script temp.txt > retrieval_list.txt
    ```

    The best way to retrieve this list from HPSS is with the `cget`
    command, which will get the file from HPSS only if it isn't
    already in the output directory. You should also take advantage of
    the `hsi in <file_of_hsi_commands.txt>` command to run an entire set of
    HPSS commands in one HPSS session. This avoids a separate sign-in
    procedure for each file, which can add up to a significant
    amount of time if you are retrieving many files. To do this,
    you'll need to add a little something to the `retrieval_list.txt`
    file you already generated:

    ```
    awk '{print "cget",$1}' retrieval_list.txt > final_retrieval_list.txt
    ```

    Finally, you can retrieve the files from HPSS with:

    ```
    hsi "in final_retrieval_list.txt"
    ```

    This procedure will return all the files you're retrieving in a
    single directory. You may want to preserve some of the directory
    structure you have in HPSS. If so, you could automatically
    recreate HPSS subdirectories in your target directory with this
    command:

    ```
    sed 's:^'<your_hpss_directory>'/\(.*\):\1:' temp.txt | xargs -I {} dirname {} |
        sort | uniq | xargs -I {} mkdir -p {}
    ```

    where `<your_hpss_directory>` is the root directory you want to
    harvest subdirectories from, and `temp.txt` holds the output from
    your `ls -1` call.

#### Avoid Very Large Files

File sizes greater than 1 TB can be difficult for HPSS to work with
and lead to longer transfer times, increasing the possibility of
transfer interruptions. Generally it's best to aim for file sizes in
the 100 - 500 GB range. You can use `tar` and `split` to break up
large aggregates or large files into 500 GB sized chunks:

```
tar cvf - myfiles* | split -d --bytes=500G - my_output_tarname.tar.
```

This will generate a number of files with names like
`my_output_tarname.tar.00`, `my_output_tarname.tar.01`, etc., which
you can archive into HPSS with `hsi put`.
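
As a sketch, you can store all of the chunks in a single HPSS session
by feeding `hsi` a file of `put` commands, similar to the retrieval
approach described above (`put_list.txt` is just a hypothetical scratch
file name):

```
# Build a list of put commands, one per chunk, then run them in one session
ls my_output_tarname.tar.* | awk '{print "put", $1}' > put_list.txt
hsi "in put_list.txt"
```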

When you retrieve these files, you can recombine them with `cat`:

```
cat my_output_tarname.tar.* | tar xvf -
```

#### Accessing HPSS Data Remotely

We recommend a two-stage process to move data to / from HPSS and a
remote site. Use Globus to transfer the data between NERSC and the
remote site (your scratch directory makes a useful temporary staging
point), then use `hsi` or `htar` to move the data into HPSS.
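
As a rough sketch of the two stages using the Globus CLI, where the
endpoint variables, paths, and file names are all placeholders rather
than real values:

```
# Stage 1: transfer from the remote site to your NERSC scratch directory
globus transfer $REMOTE_ENDPOINT:/data/run42.tar $NERSC_DTN_ENDPOINT:/scratch/path/run42.tar

# Stage 2: once the transfer completes, archive from scratch into HPSS
hsi put /scratch/path/run42.tar : run42.tar
```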

When connecting to HPSS via ftp or pftp, it is not uncommon to
encounter problems due to firewalls at the client site. Often you will
have to configure your client firewall to allow connections to
HPSS. See the HPSS firewall page for more details.

#### Use the Xfer Queue

Use the dedicated [xfer queue](../../jobs/examples/#xfer-queue) for
long-running transfers to / from HPSS. You can also submit jobs to the
xfer queue once your computations are done. The xfer queue is
configured to limit the number of running jobs per user to the same
number as the HPSS session limit.
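
For illustration only, a minimal batch script sketch; the QOS name,
time limit, and file names here are assumptions, so check the xfer
queue page linked above for the authoritative settings:

```
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=hpss_backup

# Bundle a results directory into HPSS from the xfer queue
htar -cvf results_2018.tar ./results_2018
```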

### Session Limits

Each HPSS user is limited to no more than 15 concurrent HPSS sessions.

### HPSS Usage Charging

DOE's Office of Science awards an HPSS quota to each NERSC project
every year. Users charge their HPSS space usage to the HPSS repos of
which they are members.

Users can check their HPSS usage and quotas with the `hpssquota`
command on Cori or Edison. You can view usage at the user level:

```
cori03> hpssquota -u usgtest
HPSS Usage for User usgtest
REPO                          STORED [GB]      REPO QUOTA [GB]     PERCENT USED [%]
-----------------------------------------------------------------------------------
nstaff                             144.25              49500.0                  0.3
matcomp                              10.0                950.0                  1.1
-----------------------------------------------------------------------------------
TOTAL USAGE [GB]                   154.25
```

Here, "Stored" shows you how much data you have stored in HPSS. Data
stored in HPSS could potentially be charged to any repo that you are a
member of (see below for details). The "Repo Quota" shows you the
maximum amount your PI has allocated for you to store data, and the
"Percent Used" shows the percentage of the quota you've used.

You can also view usage at the repo level:

```
edison03> hpssquota -r ntrain
HPSS Usage for Repo ntrain

USER                          STORED [GB]           USER QUOTA [GB]          PERCENT USED [%]
---------------------------------------------------------------------------------------------
train1                             100.00                     500.0                      20.0
train2                               0.35                      50.0                       0.1
train47                              0.12                     500.0                       0.0
train28                              0.09                     500.0                       0.0
---------------------------------------------------------------------------------------------

TOTAL USAGE [GB]         TOTAL QUOTA [GB]              PERCENT USED
100.56                              500.0                     20.11
```

"Stored" shows how much data each user has in HPSS that is charged to
this repo. "User Quota" shows how much total space the PI has
allocated for that user (by default this is 100%; PIs may want to
adjust this for each user, see below for more info), and the "Percent
Used" is the percentage of allocated quota each user has used. The
totals at the bottom show the total usage and quota for the
whole repo.

You can also check the HPSS quota for a repo by logging in to NIM
and clicking on the "Account Usage" tab.

#### Apportioning User Charges to Repositories: Project Percents

If a user belongs to only one HPSS repo, all usage is charged to that
repo. If a user belongs to multiple repos, daily charges are
apportioned among the repos using the project percents for that login
name. Default project percents are assigned based on the size of each
repo's storage allocation. Users (only the user, not the project
managers) can change their project percents by selecting "Change SRU
Proj Pct" (this is a historic name based on the old charging model)
from the "Actions" pull-down list in the NIM main menu. Users should try
to set project percents to reflect their actual use of HPSS for each
of the projects of which they are a member. Note that this is quite
different from the way that computational resources are charged.

On each computational system each job is charged to a specific
repository. This is possible because the batch system has accounting
hooks that handle charging to repos. The HPSS system has no
notion of repo accounting, only of user accounting. Users must
specify "after the fact" how to distribute their HPSS usage charges to the
HPSS repos to which they belong. For a given repo the MPP
repository and the HPSS repository usually have the same name.

#### Adding or Removing Users

If a user is added to a new repo or removed from an existing repo, the
project percents for that user are adjusted based on the size of the
quotas of the repos to which the user currently belongs. However, if
the user has previously changed the default project percents, the
relative ratios of the previously set project percents are respected.

As an example, user u1 belongs to repos r1 and r2 and has changed the
project percents from the default of 50% for each repo to 40% for r1
and 60% for r2:

| Login | Repo | Allocation (GB) | Project % |
|-------|------|-----------------|-----------|
| u1    | r1   | 500             | 40        |
| u1    | r2   | 500             | 60        |

If u1 then becomes a member of repo r3, which has a storage
allocation of 1,000 GB, the project percents will be adjusted as
follows (preserving the old 40:60 ratio between r1 and r2 while adding
r3, which has the same storage allocation as r1 and r2 combined):

| Login | Repo | Allocation (GB) | Project % |
|-------|------|-----------------|-----------|
| u1    | r1   | 500             | 20        |
| u1    | r2   | 500             | 30        |
| u1    | r3   | 1,000           | 50        |

If a repo is retired, the percentage charged to that repo is spread
among the remaining repos while keeping their relative values the
same.


### HPSS Project Directories

A special "project directory" can be created in HPSS for groups of
researchers who wish to easily share files. The file in this directory
will be readable by all members of a particular unix file group. This
file group can have the same name as the repository (in which case all
members of the repository will have access to the project directory)
or a new name can be requested (in which case only those users added
to the new file group by the requester will have access to the project
directory.

HPSS project directories have the following properties:

  * located under `/home/projects`
  * owned by the PI, a PI Proxy, or a Project Manager of the associated repository
  * have the appropriate group attribute (including the setgid bit)

To request creation of an HPSS project directory, the PI, a PI Proxy, or
a Project Manager of the requesting repository should fill out the
[HPSS Project Directory Request
Form](https://www.nersc.gov/users/storage-and-file-systems/hpss/request-form).


### Troubleshooting

Some frequently encountered issues and how to solve them.

#### Trouble connecting

The first time you try to connect using a NERSC-provided client like
HSI, HTAR, or PFTP, you will be prompted for your NIM password +
one-time password, which will generate a token stored in
`$HOME/.netrc`. This token allows you to connect to HPSS without
typing a password. However, sometimes this file can become out of date
or otherwise corrupted, which generates errors that look like this:

```
% hsi
result = -11000, errno = 29
Unable to authenticate user with HPSS.
result = -11000, errno = 9
Unable to setup communication to HPSS...
*** HSI: error opening logging
Error - authentication/initialization failed
```

If this error occurs, try moving the `$HOME/.netrc` file to
`$HOME/.netrc_temp`. Then connect to HPSS again and enter
your NIM password + one-time password when prompted. A new
`$HOME/.netrc` file will be generated with a new entry/token. If the
problem persists, contact NERSC account support.
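
In other words, a minimal sketch of the fix:

```
# Move the stale token file aside, then reconnect; enter your NIM
# password + one-time password when prompted to generate a fresh token
mv $HOME/.netrc $HOME/.netrc_temp
hsi
```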

#### Cannot transfer files using htar

`htar` requires the node you're on to accept incoming connections from
its movers. This is not possible from a compute node at NERSC, so `htar`
transfers from compute nodes will fail. Instead, we recommend you use the
dedicated [xfer queue](../../jobs/examples/#xfer-queue) for data transfers.

#### Globus transfer errors

Globus transfers will fail if you don't have permission to read the
source directory or space to write in the target directory. One common
mistake is to make the files readable, but forget to make the
directory holding them readable. You can check directory permissions
with `ls -ld`. At NERSC you can make sure you have enough space to
write in a directory by using the `myquota` command.
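
For example (the directory path below is a placeholder):

```
# Check that the directory holding the files is itself readable
ls -ld /path/to/source_directory

# Check that you have space to write in the target file system
myquota
```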