00014-tar.md 19 KB
Newer Older
Martin Dørum's avatar
Martin Dørum committed
1
2
<!--date: 23 Jul 2022 21:00 +0200-->

Martin Dørum's avatar
Martin Dørum committed
3
# The tar archive format, its extensions, and why GNU tar extracts in quadratic time
Martin Dørum's avatar
Martin Dørum committed
4

5
6
7
<div class="social">
	<a href="https://lobste.rs/s/4vjkbb/tar_archive_format_its_extensions_why_gnu"><img src="/_/imgs/social-lobsters.png"></a>
	<a href="https://news.ycombinator.com/item?id=32206579"><img src="/_/imgs/social-hacker-news.png"></a>
8
	<a href="https://www.reddit.com/r/programming/comments/w6yfl6/the_tar_archive_format_its_extensions_and_why_gnu/"><img src="/_/imgs/social-reddit.png"></a>
9
10
</div>

Martin Dørum's avatar
Martin Dørum committed
11
Date: 2022-07-23 \
12
13
Git: <https://gitlab.com/mort96/blog/blob/published/content/00000-home/00014-tar.md>

Martin Dørum's avatar
Martin Dørum committed
14
15
(If you're here from Google and just need help with tar being slow:
If you trust the tar archive, extract with `-P` to make tar fast.)
Martin Dørum's avatar
Martin Dørum committed
16

Martin Dørum's avatar
Martin Dørum committed
17
18
19
A couple of days ago, I had a 518GiB tar.gz file (1.1 TiB uncompressed) that I had to extract.
At first, GNU tar was doing a great job, chewing through the tar.gz at around 100MiB/s.
But after a while, it slowed significantly; down to less than a kilobyte per second.
Martin Dørum's avatar
Martin Dørum committed
20
21
[pv](http://www.ivarch.com/programs/pv.shtml)'s time estimate went from a bit over an hour,
to multiple hours, to over a day, to almost a week.
Martin Dørum's avatar
Martin Dørum committed
22
23
24
After giving it some time, and after failing to find anything helpful through Google,
I decided that learning the tar file format and making my own tar extractor would probably be faster
than waiting for tar.
Martin Dørum's avatar
Martin Dørum committed
25
And I was right; before the day was over, I had a working tar extractor,
Martin Dørum's avatar
Martin Dørum committed
26
and I had successfully extracted my 1.1TiB tarball.
Martin Dørum's avatar
Martin Dørum committed
27
28

I will explain why GNU tar is so slow later in this post, but first, let's take a look at:
Martin Dørum's avatar
Martin Dørum committed
29

Martin Dørum's avatar
Martin Dørum committed
30
## The original tar file format
Martin Dørum's avatar
Martin Dørum committed
31

Martin Dørum's avatar
Martin Dørum committed
32
Tar is pretty unusual for an archive file format.
Martin Dørum's avatar
Martin Dørum committed
33
34
There's no archive header, no index of files to fascilitate seeking, no magic bytes to help `file`
and its ilk detect whether a file is a tar archive, no footer, no archive-wide metadata.
Martin Dørum's avatar
Martin Dørum committed
35
36
37
The only kind of thing in a tar file is a file object.

So, how do these file objects look?
Martin Dørum's avatar
Martin Dørum committed
38
Well, they start with a 512-byte file object header which looks like this:
Martin Dørum's avatar
Martin Dørum committed
39
40

```C
Martin Dørum's avatar
Martin Dørum committed
41
struct file_header {
42
	char file_path[100];
Martin Dørum's avatar
Martin Dørum committed
43
44
45
46
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
Martin Dørum's avatar
Martin Dørum committed
47
	char file_mtime[12];
Martin Dørum's avatar
Martin Dørum committed
48
49
	char header_checksum[8];
	char file_type;
50
	char link_path[100];
Martin Dørum's avatar
Martin Dørum committed
51

Martin Dørum's avatar
Martin Dørum committed
52
	char padding[255];
Martin Dørum's avatar
Martin Dørum committed
53
54
55
};
```

Martin Dørum's avatar
Martin Dørum committed
56
Followed by `ceil(file_size / 512)` 512-byte blocks of payload (i.e file contents).
Martin Dørum's avatar
Martin Dørum committed
57

Martin Dørum's avatar
Martin Dørum committed
58
We have most of the attributes we would expect a file object to have:
59
the file path, the mode, the modification time (mtime), the user/group ID, the file size,
Martin Dørum's avatar
Martin Dørum committed
60
and the file type.
61
To support symlinks and hard links, there's also a link path.
Martin Dørum's avatar
Martin Dørum committed
62

Martin Dørum's avatar
Martin Dørum committed
63
The original tar file format defines these possible values for the `file_type` field:
Martin Dørum's avatar
Martin Dørum committed
64

Martin Dørum's avatar
Martin Dørum committed
65
* `'0'` (or sometimes `'\0'`, the NUL character): Normal file
Martin Dørum's avatar
Martin Dørum committed
66
67
* `'1'`: Hard link
* `'2'`: Symbolic link
Martin Dørum's avatar
Martin Dørum committed
68

Martin Dørum's avatar
Martin Dørum committed
69
Future extensions to tar implements additional file types, among them `'5'`, which represents a directory.
Martin Dørum's avatar
Martin Dørum committed
70
71
Some old tar implementations apparently used a trailing slash `'/'` in a `'0'`-type file object
to represent directories, at least according to Wikipedia.
Martin Dørum's avatar
Martin Dørum committed
72

Martin Dørum's avatar
Martin Dørum committed
73
You may think that the numeric values (`file_mode`, `file_size`, `file_mtime`, ...) would be
Martin Dørum's avatar
Martin Dørum committed
74
75
76
encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256").
But no, they're actually encoded as octal strings (with a NUL terminator,
or sometimes a space terminator).
Martin Dørum's avatar
Martin Dørum committed
77
78
79
Tar is the only file format I know of which uses base 8 to encode numbers.
I don't quite understand why, since octal is neither space-efficient nor human-friendly.
When representing numbers in this post, I will write them in decimal (base 10).
Martin Dørum's avatar
Martin Dørum committed
80

Martin Dørum's avatar
Martin Dørum committed
81
To encode a tar archive with one file called `"hello.txt"` and the content `"Hello World"`,
Martin Dørum's avatar
Martin Dørum committed
82
we need two 512-byte blocks:
Martin Dørum's avatar
Martin Dørum committed
83

84
1. Bytes 0-511: Header, `type='0'`, `file_path="./hello.txt"`, `file_size=11`
Martin Dørum's avatar
Martin Dørum committed
85
2. Bytes 512-1023: `"Hello World"`, followed by 501 zero bytes
Martin Dørum's avatar
Martin Dørum committed
86

Martin Dørum's avatar
Martin Dørum committed
87
88
In addition, a tar file is supposed to end with 1024 zero-bytes to represent an end-of-file marker.

Martin Dørum's avatar
Martin Dørum committed
89
90
91
92
93
94
The two big limitations of the original tar format is that paths can't be longer than 100 characters,
and files can't be larger than 8GiB (8^11 bytes).
Otherwise though, I quite like the simplicity of the format.
We'll discuss how various extensions address the limitations later,
but first, let's try to implement an extractor:

Martin Dørum's avatar
Martin Dørum committed
95
96
(Feel free to skip this source code, but you should at least skim the comments)

Martin Dørum's avatar
Martin Dørum committed
97
98

```C
Martin Dørum's avatar
Martin Dørum committed
99
100
// tarex.c

Martin Dørum's avatar
Martin Dørum committed
101
102
103
104
105
106
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>

Martin Dørum's avatar
Martin Dørum committed
107
struct file_header {
108
	char file_path[100];
Martin Dørum's avatar
Martin Dørum committed
109
110
111
112
113
114
115
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
	char file_mtime[12];
	char header_checksum[8];
	char file_type;
116
	char link_path[100];
Martin Dørum's avatar
Martin Dørum committed
117

Martin Dørum's avatar
Martin Dørum committed
118
	char padding[255];
Martin Dørum's avatar
Martin Dørum committed
119
120
};

Martin Dørum's avatar
Martin Dørum committed
121
122
// We don't bother with great error reporting, just abort on error
#define check(x) if (!(x)) abort()
Martin Dørum's avatar
Martin Dørum committed
123
124
125
126
127

// Utilities to abort on short read/write
#define xfread(ptr, size, f) check(fread(ptr, 1, size, f) == size)
#define xfwrite(ptr, size, f) check(fwrite(ptr, 1, size, f) == size)

Martin Dørum's avatar
Martin Dørum committed
128
// Tar represents all its numbers as octal
Martin Dørum's avatar
Martin Dørum committed
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
size_t parse_octal(char *str, size_t maxlen) {
	size_t num = 0;
	for (size_t i = 0; i < maxlen && str[i] >= '0' && str[i] <= '7'; ++i) {
		num *= 8;
		num += str[i] - '0';
	}

	return num;
}

// Extract one file from the archive.
// Returns 1 if it extracted something, or 0 if it reached the end.
int extract(FILE *f) {
	unsigned char header_block[512];
	xfread(header_block, sizeof(header_block), f);
	struct file_header *header = (struct file_header *)header_block;

	// The end of the archive is represented with blocks of all-zero content.
147
	// For simplicity, assume that if the file path is empty, the block is all zero
Martin Dørum's avatar
Martin Dørum committed
148
	// and we reached the end.
149
	if (header->file_path[0] == '\0') {
Martin Dørum's avatar
Martin Dørum committed
150
151
152
		return 0;
	}

153
	// The file path and link path fields aren't always 0-terminated, so we need to copy them
Martin Dørum's avatar
Martin Dørum committed
154
	// into our own buffers, otherwise we break on files with exactly 100 character paths.
155
156
157
158
	char file_path[101] = {0};
	memcpy(file_path, header->file_path, 100);
	char link_path[101] = {0};
	memcpy(link_path, header->link_path, 100);
Martin Dørum's avatar
Martin Dørum committed
159
160
161
162
163
164

	// We need these for later
	size_t file_size = parse_octal(header->file_size, sizeof(header->file_size));
	FILE *out_file = NULL;

	if (header->file_type == '0' || header->file_type == '\0') {
Martin Dørum's avatar
Martin Dørum committed
165
		// A type of '0' means that this is a plain file.
Martin Dørum's avatar
Martin Dørum committed
166
		// Some early implementations also use a NUL character ('\0') instead of an ASCII zero.
Martin Dørum's avatar
Martin Dørum committed
167

Martin Dørum's avatar
Martin Dørum committed
168
		fprintf(stderr, "Regular file: %s\n", file_path);
169
		out_file = fopen(file_path, "w");
Martin Dørum's avatar
Martin Dørum committed
170
171
		check(out_file != NULL);

Martin Dørum's avatar
Martin Dørum committed
172
173
	} else if (header->file_type == '1') {
		// A type of '1' means that this is a hard link.
174
		// That means we create a hard link at 'file_path' which links to the file at 'link_path'.
Martin Dørum's avatar
Martin Dørum committed
175

176
177
		fprintf(stderr, "Hard link: %s -> %s\n", file_path, link_path);
		check(link(link_path, file_path) >= 0);
Martin Dørum's avatar
Martin Dørum committed
178

Martin Dørum's avatar
Martin Dørum committed
179
180
	} else if (header->file_type == '2') {
		// A type of '2' means that this is a symbolic link.
181
		// That means we create a symlink at 'file_path' which links to the file at 'link_path'.
Martin Dørum's avatar
Martin Dørum committed
182

183
184
		fprintf(stderr, "Symbolic link: %s -> %s\n", file_path, link_path);
		check(symlink(link_path, file_path) >= 0);
Martin Dørum's avatar
Martin Dørum committed
185

Martin Dørum's avatar
Martin Dørum committed
186
187
188
	} else if (header->file_type == '5') {
		// A type of '5' means that this is a directory.

189
190
		fprintf(stderr, "Directory: %s\n", file_path);
		check(mkdir(file_path, 0777) >= 0);
Martin Dørum's avatar
Martin Dørum committed
191
192
193
194

		// Directories sometimes use the size field, but they don't contain data blocks.
		// Zero out file_size to avoid skipping entries.
		file_size = 0;
Martin Dørum's avatar
Martin Dørum committed
195

Martin Dørum's avatar
Martin Dørum committed
196
197
198
	} else {
		// There are other possible fields added by various tar implementations and standards,
		// but we'll ignore those for this implementation.
199
		fprintf(stderr, "Unsupported file type %c: %s\n", header->file_type, file_path);
Martin Dørum's avatar
Martin Dørum committed
200
201
	}

Martin Dørum's avatar
Martin Dørum committed
202
	// We have read the header block, now we need to read the payload.
Martin Dørum's avatar
Martin Dørum committed
203
204
205
206
207
	// If we're reading a file (i.e if 'outfile' is non-NULL) we will also write the body,
	// but otherwise we'll just skip it.
	char block[512];
	while (file_size > 0) {
		xfread(block, sizeof(block), f);
Martin Dørum's avatar
Martin Dørum committed
208
209
210
211
212
		size_t n = file_size > 512 ? 512 : file_size;

		file_size -= n;
		if (out_file != NULL) {
			xfwrite(block, n, out_file);
Martin Dørum's avatar
Martin Dørum committed
213
214
215
		}
	}

Martin Dørum's avatar
Martin Dørum committed
216
217
218
219
	if (out_file != NULL) {
		check(fclose(out_file) >= 0);
	}

Martin Dørum's avatar
Martin Dørum committed
220
	// Indicate that we have successfully extracted a file object, and are ready to read the next
Martin Dørum's avatar
Martin Dørum committed
221
222
223
224
225
226
227
228
229
230
	return 1;
}

int main() {
	while (extract(stdin));
}
```

Let's see it in action:

Martin Dørum's avatar
Martin Dørum committed
231
```nohighlight
Martin Dørum's avatar
Martin Dørum committed
232
~/tarex $ ls
Martin Dørum's avatar
Martin Dørum committed
233
tarex.c testdir
Martin Dørum's avatar
Martin Dørum committed
234
235
~/tarex $ gcc -o tarex tarex.c
~/tarex $ tree
Martin Dørum's avatar
Martin Dørum committed
236
237
.
├── tarex.c
Martin Dørum's avatar
Martin Dørum committed
238
├── tarex
Martin Dørum's avatar
Martin Dørum committed
239
240
241
242
243
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt
Martin Dørum's avatar
Martin Dørum committed
244

Martin Dørum's avatar
Martin Dørum committed
245
246
~/tarex $ tar c testdir >testdir.tar
~/tarex $ mkdir extract && cd extract
Martin Dørum's avatar
Martin Dørum committed
247

Martin Dørum's avatar
Martin Dørum committed
248
~/tarex/extract $ ../tarex <../testdir.tar
Martin Dørum's avatar
Martin Dørum committed
249
250
251
252
253
Directory: testdir/
Symbolic link: testdir/hello-symlink -> hello.txt
Directory: testdir/subdir/
Regular file: testdir/hello.txt
Regular file: testdir/subdir/file.txt
Martin Dørum's avatar
Martin Dørum committed
254

Martin Dørum's avatar
Martin Dørum committed
255
~/tarex/extract $ tree
Martin Dørum's avatar
Martin Dørum committed
256
257
258
259
260
261
262
263
.
└── testdir
    ├── hello-symlink -> hello.txt
    ├── hello.txt
    └── subdir
        └── file.txt
```

Martin Dørum's avatar
Martin Dørum committed
264
265
## The UStar file format

Martin Dørum's avatar
typo    
Martin Dørum committed
266
The first major extension to the tar file format we will look at is the UStar format,
Martin Dørum's avatar
Martin Dørum committed
267
268
269
270
which increases the file length limit to 256 characters and adds some new file types.
The header is expanded to this:

```C
Martin Dørum's avatar
Martin Dørum committed
271
struct file_header {
Martin Dørum's avatar
Martin Dørum committed
272
	// Original tar header fields
273
	char file_path[100];
Martin Dørum's avatar
Martin Dørum committed
274
275
276
277
	char file_mode[8];
	char owner_user_id[8];
	char owner_group_id[8];
	char file_size[12];
Martin Dørum's avatar
Martin Dørum committed
278
	char file_mtime[12];
Martin Dørum's avatar
Martin Dørum committed
279
280
	char header_checksum[8];
	char file_type;
281
	char link_path[100];
Martin Dørum's avatar
Martin Dørum committed
282
283
284
285
286
287
288
289
290
291

	// New UStar fields
	char magic_bytes[6];
	char version[2];
	char owner_user_name[32];
	char owner_group_name[32];
	char device_major_number[8];
	char device_minor_number[8];
	char prefix[155];

Martin Dørum's avatar
Martin Dørum committed
292
	char padding[12];
Martin Dørum's avatar
Martin Dørum committed
293
294
295
296
};
```

We now have some magic bytes (defined to be `"ustar\0"` for the UStar format),
Martin Dørum's avatar
Martin Dørum committed
297
as well as the owner user/group names.
Martin Dørum's avatar
Martin Dørum committed
298
But most importantly, we have a `prefix` field, which allows up to 256 character file paths.
299
With UStar, instead of just extracting the bytes from `file_path` and `link_path` like before,
Martin Dørum's avatar
Martin Dørum committed
300
301
302
we must construct a file path like this:

```C
303
void read_path(char dest[257], char path[100], char prefix[100]) {
Martin Dørum's avatar
Martin Dørum committed
304
305
	// If there's no prefix, use name directly
	if (prefix[0] == '\0') {
306
		memcpy(dest, path, 100);
Martin Dørum's avatar
Martin Dørum committed
307
308
309
310
		dest[100] = '\0';
		return;
	}

311
	// If there is a prefix, the path is: <prefix> '/' <path>
312
	size_t prefix_len = strnlen(prefix, 155);
Martin Dørum's avatar
Martin Dørum committed
313
314
	memcpy(dest, prefix, prefix_len);
	dest[prefix_len] = '/';
315
	memcpy(&dest[prefix_len + 1], path, 100);
Martin Dørum's avatar
Martin Dørum committed
316
317
318
319
	dest[256] = '\0';
}

int extract(FILE *f) {
Martin Dørum's avatar
Martin Dørum committed
320
	/* ... */
Martin Dørum's avatar
Martin Dørum committed
321

322
323
324
325
	char file_path[257];
	read_path(file_path, header->file_path, header->prefix);
	char link_path[257];
	read_path(link_path, header->link_path, header->prefix);
Martin Dørum's avatar
Martin Dørum committed
326
327
328
329
330

	/* ... */
}
```

Martin Dørum's avatar
Martin Dørum committed
331
The original tar format had the file types `'0'` (or `'\0'`), `'1'` and `'2'`,
Martin Dørum's avatar
Martin Dørum committed
332
333
334
for regular files, hard links and symlinks.
UStar defines these additional file types:

Martin Dørum's avatar
Martin Dørum committed
335
* `'3'` and `'4'`: Character devices and block devices. These are the reason for the new
Martin Dørum's avatar
Martin Dørum committed
336
  `device_major_number` and `device_minor_number` fields.
Martin Dørum's avatar
Martin Dørum committed
337
338
339
* `'5'`: Directories.
* `'6'`: FIFO files.
* `'7'`: Contiguous files. This type isn't really used much these days, and most implementations
Martin Dørum's avatar
Martin Dørum committed
340
341
342
343
344
345
346
347
348
  just treat it as a regular file.

This is definitely an improvement, but we can still only encode up to 256 character long paths.
And that 8GiB file size limit still exists. Which leads us to:

## The pax file format

The POSIX.1-2001 standard introduced the `pax` command line tool, and with it, a new set of
extensions to the tar file format.
Martin Dørum's avatar
Martin Dørum committed
349
This format is identical to UStar, except that it adds two new file object types: `'x'` and `'g'`.
Martin Dørum's avatar
Martin Dørum committed
350
351
352
353
354
355
356
357
Both of these types let us define "extended header records", as the spec calls it.
Records set with `'x'` apply to only the next file, while records set with `'g'` apply to all
following files.

With this new extended header, we can encode the access and modification times with more precision,
user/group IDs above 8^7, file sizes over 8^11, file paths of arbitrary length, and a whole lot more.
The records are in the payload of the extended headr file object,
and use a simple length-prefixed key/value syntax.
Martin Dørum's avatar
Martin Dørum committed
358
To represent our `"hello.txt"` example file with an access time attribute,
Martin Dørum's avatar
Martin Dørum committed
359
we need these four 512-byte blocks:
Martin Dørum's avatar
Martin Dørum committed
360

Martin Dørum's avatar
Martin Dørum committed
361
1. Header, `type='x'`, `file_size=30`
Martin Dørum's avatar
Martin Dørum committed
362
2. `"30 atime=1658409251.551879906\n"`, followed by 482 zeroes
363
3. Header, `type='0'`, `file_path="hello.txt"`, `file_size=11`
Martin Dørum's avatar
Martin Dørum committed
364
4. `"Hello World"`, followed by 501 zero bytes
Martin Dørum's avatar
Martin Dørum committed
365
366
367
368

Interestingly, these extended header records all seem to use decimal (base 10).
On the one hand, using base 10 makes sense, but on the other hand, wouldn't it be nice to stick to
one way of representing numbers?
Martin Dørum's avatar
Martin Dørum committed
369

Martin Dørum's avatar
Martin Dørum committed
370
371
372
Anyways, we can see that the file format has become quite complex now.
Just the file path can be provided in any of four different ways:

373
374
* The full path might be in the `file_path` field.
* The path might be a combination of the `prefix` and the `file_path` fields.
Martin Dørum's avatar
Martin Dørum committed
375
376
* The previous file object might've been an `'x'` type record with set a `path` property.
* There might've been some `'g'` type file object earlier in the archive which set a `path` property.
Martin Dørum's avatar
Martin Dørum committed
377
378
379
380
381
382
383

## The GNU tar file format

GNU tar has its own file format, called `gnu`, which is different from the pax format.
Like pax, the `gnu` format is based on UStar, but it has a different way of encoding
arbitrary length paths and large file sizes:

384
* It introduces the `'L'` type, where the payload of the file object represents the `file_path`
Martin Dørum's avatar
Martin Dørum committed
385
  of the next file object.
386
* It introduces the `'K'` type, where the payload of the file object represents the `link_path`
Martin Dørum's avatar
Martin Dørum committed
387
  of the next file object.
388
* A link with both a long `file_path` and a long `link_path` is preceeded by both an `'L'` type
Martin Dørum's avatar
Martin Dørum committed
389
  file object and a `'K'` type file object. The order isn't specified from what I can tell.
Martin Dørum's avatar
Martin Dørum committed
390
* If a file is over 8GiB, it will set the high bit of the first character in `file_size`,
Martin Dørum's avatar
Martin Dørum committed
391
  and the rest of the string is parsed as base 256 (i.e it's treated as a 95-bit integer, big endian).
Martin Dørum's avatar
Martin Dørum committed
392
393

In some ways, I prefer this approach over the pax approach, since it's much simpler;
Martin Dørum's avatar
Martin Dørum committed
394
395
the pax format requires the extractor to parse the record grammar.
On the other hand, the pax format is both more space efficient and vastly more flexible.
Martin Dørum's avatar
Martin Dørum committed
396
397
398
399
400
401
402
403

In any case, the result is that a tar extractor which wants to support both pax tar files
and GNU tar files needs to support 5 different ways of reading the file path,
5 different ways of reading the link path,
and 3 different ways of reading the file size.

Whatever happened to the nice and simple format we started out with?

Martin Dørum's avatar
Martin Dørum committed
404
405
406
407
## Why GNU tar extracts in quadratic time

Our simple tar extraction implementation has what could be considered a quite serious security bug:
It allows people to put files outside the directory we're extracting to.
408
Nothing is stopping an evil arcive from containing a file object with `file_path="../hello.txt"`.
Martin Dørum's avatar
Martin Dørum committed
409
410
411
412
You might try to fix that by just disallowing file objects from using ".." as a path component,
but it's not that simple.
Consider the following sequence of file objects:

413
414
1. Symlink, `file_path="./foo"`, `link_path=".."`
2. Normal file, `file_path="./foo/hello.txt"`
Martin Dørum's avatar
Martin Dørum committed
415
416
417
418
419
420
421
422

We want to allow symlinks which point to their parent directory, since there are completely legitimate
use cases for that.
We could try to figure out whether a symlink will end up pointing to somewhere outside of the
extraction directory, but that gets complicated real fast when you have to consider symlinks
to symlinks and hard links to symlinks.
It might be possible to do correctly, but it's not the solution GNU tar goes for.

423
When GNU tar encounters a hard link or symlink with `".."` as a path component in its `link_path`,
Martin Dørum's avatar
Martin Dørum committed
424
tar will create a regular file in its place as a placeholder, and put a note about the delayed link
Martin Dørum's avatar
Martin Dørum committed
425
in a linked list datastructure.
Martin Dørum's avatar
Martin Dørum committed
426
When it's done extracting the entire archive, it will go through the whole list of delayed links
Martin Dørum's avatar
Martin Dørum committed
427
and replace the placeholders with proper links.
Martin Dørum's avatar
Martin Dørum committed
428
429
So far, so good.

Martin Dørum's avatar
Martin Dørum committed
430
The problem comes when trying to extract a hard link which _doesn't_ contain `".."` as a path component
431
in its `link_path`.
Martin Dørum's avatar
Martin Dørum committed
432
433
434
GNU tar wants to create such hard links immediately if it can.
But it can't create a hard link if the target is occupied by a placeholder file.
That means, every time GNU tar wants to create a hard link, it first has to walk the entire linked list
Martin Dørum's avatar
Martin Dørum committed
435
of delayed links and see if the target is a delayed link.
Martin Dørum's avatar
Martin Dørum committed
436
437
438
439
440
If the target is a delayed link, the new link must also be delayed.

Your time complexity alarm bells should be starting to ring now.
For every hard link, we walk the list of all delayed links.
But it actually gets worse; for reasons I don't quite understand yet, tar will actually go through
Martin Dørum's avatar
Martin Dørum committed
441
the entire list of delayed links _again_ if it found out that it can create the link immediately.
Martin Dørum's avatar
Martin Dørum committed
442
443
444
So for all "normal" hard links, it has to go through the entire linked list of delayed links _twice_.

If you're a bit crafty, you can construct a tar archive which GNU tar extracts in precisely O(n^2) time;
445
you just need to alternate between links whose `link_path` has `".."` as a path component
Martin Dørum's avatar
Martin Dørum committed
446
and thus get delayed, and "normal" hard links which don't get delayed.
Martin Dørum's avatar
Martin Dørum committed
447
448
449
If you're a bit unlucky, you might have a totally benign tarball which nevertheless happens to contain
a bunch of symlinks which refer to files in a parent directory, followed by a bunch of normal hard links.
This is what had happened to me.
Martin Dørum's avatar
Martin Dørum committed
450
My tarball happened to contain over 800 000 links with `".."` as a path component.
Martin Dørum's avatar
Martin Dørum committed
451
452
453
454
455
456
457
It also happened to contain over 5.4 million hard links.
Every one of those hard links had to go through the entire list of every hitherto deferred link.
No wonder tar got slow.

If you ever find yourself in this situation, pass the `--absolute-paths` (or `-P`) parameter to tar.
Tar's documentation says this about `--absolute-paths`:

Martin Dørum's avatar
Martin Dørum committed
458
> Preserve pathnames.  By default, absolute pathnames (those that begin with a `/`
Martin Dørum's avatar
Martin Dørum committed
459
> character) have the leading slash removed both when creating archives and extracting
Martin Dørum's avatar
Martin Dørum committed
460
> from them.  Also, tar will refuse to extract archive entries whose pathnames contain `..`
Martin Dørum's avatar
Martin Dørum committed
461
462
463
464
> or whose target directory would be altered by a symlink.  This option suppresses these
> behaviors.

You would never guess it from reading the documentation, but when you pass `--absolute-paths`
Martin Dørum's avatar
Martin Dørum committed
465
466
during extraction,
tar assumes that the archive is benign and the whole delayed linking mechanism is disabled.
Martin Dørum's avatar
Martin Dørum committed
467
468
469
Make sure you trust the tar archive though! When extracted with `--absolute-paths`,
a malicious archive will be able to put files anywhere it wants.

Martin Dørum's avatar
Martin Dørum committed
470
I'm absolutely certain that it's possible to make GNU tar extract in O(n) without `--absolute-paths`
Martin Dørum's avatar
Martin Dørum committed
471
472
by replacing the linked list with a hash map.
But that's an adventure for another time.
Martin Dørum's avatar
Martin Dørum committed
473
474
475
476
477
478

## References

These are the documents I've drawn information from when researching for my tar extractor
and this blog post:

Martin Dørum's avatar
Martin Dørum committed
479
* The Wikipedia article on tar: <https://en.wikipedia.org/wiki/Tar_(computing)#File_format>
Martin Dørum's avatar
Martin Dørum committed
480
* The POSIX spec for pax: <https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html#tag_20_92_13_01>
Martin Dørum's avatar
Martin Dørum committed
481
482
* The GNU tar file format docs: <https://www.gnu.org/software/tar/manual/html_node/Standard.html>
* The GNU tar source code: <https://git.savannah.gnu.org/cgit/tar.git/tree/>
Martin Dørum's avatar
Martin Dørum committed
483
484
485
  (in particular, [extract.c](https://git.savannah.gnu.org/cgit/tar.git/tree/src/extract.c))

If I have represented anything inaccurately in this post, please do correct me.