Commit 08a72e40 authored by Christopher Schinnerl

vendor godirwalk

parent 65bc6b3d
BSD 2-Clause License
Copyright (c) 2017, Karrick McDermott
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# godirwalk
`godirwalk` is a library for traversing a directory tree on a file
system.
In short, why do I use this library?
1. It's faster than `filepath.Walk`.
1. It's more correct on Windows than `filepath.Walk`.
1. It's easier to use than `filepath.Walk`.
1. It's more flexible than `filepath.Walk`.
## Usage Example
Additional examples are provided in the `examples/` subdirectory.
This library normalizes the provided top-level directory name to the
OS-specific path separator by calling `filepath.Clean` on its first
argument. As a result, every pathname it passes to the provided
callback function uses the correct OS-specific path separator.
```Go
dirname := "some/directory/root"
err := godirwalk.Walk(dirname, &godirwalk.Options{
    Callback: func(osPathname string, de *godirwalk.Dirent) error {
        fmt.Printf("%s %s\n", de.ModeType(), osPathname)
        return nil
    },
    Unsorted: true, // (optional) set true for faster yet non-deterministic enumeration (see godoc)
})
```
This library not only provides functions for traversing a file system
directory tree, but also for obtaining a list of immediate descendants
of a particular directory, typically much more quickly than using
`os.File.Readdir` or `os.File.Readdirnames`.
Documentation is available via
[![GoDoc](https://godoc.org/github.com/karrick/godirwalk?status.svg)](https://godoc.org/github.com/karrick/godirwalk).
## Description
Here's why I use `godirwalk` in preference to `filepath.Walk`,
`os.File.Readdir`, and `os.File.Readdirnames`.
### It's faster than `filepath.Walk`
When compared against `filepath.Walk` in benchmarks, it has been
observed to run between five and ten times the speed on darwin, at
speeds comparable to that of the unix `find` utility; about twice
the speed on linux; and about four times the speed on Windows.
How does it obtain this performance boost? It does less work to give
you nearly the same output. This library calls the same `syscall`
functions to do the work, but it makes fewer calls, does not throw
away information that it might need, and creates less memory churn
along the way by reusing the same scratch buffer rather than
reallocating a new buffer every time it reads data from the operating
system.
While traversing a file system directory tree, `filepath.Walk` obtains
the list of immediate descendants of a directory, and throws away the
file system node type information provided by the operating system
that comes with the node's name. Then, immediately prior to invoking
the callback function, `filepath.Walk` invokes `os.Stat` for each
node, and passes the returned `os.FileInfo` information to the
callback.
While the `os.FileInfo` information provided by `os.Stat` is extremely
helpful--and even includes the `os.FileMode` data--providing it
requires an additional system call for each node.
Because most callbacks only care about what the node type is, this
library does not throw the type information away, but rather provides
that information to the callback function in the form of an
`os.FileMode` value. Note that the `os.FileMode` value this library
provides has only the node type information, and does not have the
permission bits, sticky bits, or other information from the file's
mode. If the callback does care about a particular node's entire
`os.FileInfo` data structure, the callback can easily invoke
`os.Stat` when needed, and only when needed.
#### Benchmarks
##### macOS
```Bash
go test -bench=.
goos: darwin
goarch: amd64
pkg: github.com/karrick/godirwalk
BenchmarkFilepathWalk-8 1 3001274570 ns/op
BenchmarkGoDirWalk-8 3 465573172 ns/op
BenchmarkFlameGraphFilepathWalk-8 1 6957916936 ns/op
BenchmarkFlameGraphGoDirWalk-8 1 4210582571 ns/op
PASS
ok github.com/karrick/godirwalk 16.822s
```
##### Linux
```Bash
go test -bench=.
goos: linux
goarch: amd64
pkg: github.com/karrick/godirwalk
BenchmarkFilepathWalk-12 1 1609189170 ns/op
BenchmarkGoDirWalk-12 5 211336628 ns/op
BenchmarkFlameGraphFilepathWalk-12 1 3968119932 ns/op
BenchmarkFlameGraphGoDirWalk-12 1 2139598998 ns/op
PASS
ok github.com/karrick/godirwalk 9.007s
```
### It's more correct on Windows than `filepath.Walk`
I did not previously care about this either, but humor me. We all love
how we can write once and run everywhere. It is essential for the
language's adoption, growth, and success, that the software we create
can run unmodified on all architectures and operating systems
supported by Go.
When the traversed file system has a logical loop caused by symbolic
links to directories, on unix `filepath.Walk` ignores symbolic links
and traverses the entire directory tree without error. On Windows
however, `filepath.Walk` will continue following directory symbolic
links, even though it is not supposed to, eventually causing
`filepath.Walk` to terminate early and return an error when the
pathname gets too long from concatenating endless loops of symbolic
links onto the pathname. This error comes from Windows, passes through
`filepath.Walk`, and to the upstream client running `filepath.Walk`.
The takeaway is that behavior differs depending on which platform
`filepath.Walk` is running on. While this is clearly not intentional,
until it is fixed in the standard library, it presents a compatibility
problem.
This library correctly identifies symbolic links that point to
directories and will only follow them when `FollowSymbolicLinks` is
set to true. Behavior on Windows and other operating systems is
identical.
### It's easier to use than `filepath.Walk`
Since this library does not invoke `os.Stat` on every file system node
it encounters, there is no possible error event for the callback
function to filter on. The third argument of the `filepath.WalkFunc`
function signature, which passes the error from `os.Stat` to the
callback function, is no longer necessary, and is thus eliminated from
the signature of this library's callback function.
Also, `filepath.Walk` invokes the callback function with a solidus
delimited pathname regardless of the os-specific path separator. This
library invokes the callback function with the os-specific pathname
separator, obviating a call to `filepath.Clean` in the callback
function for each node prior to actually using the provided pathname.
In other words, even on Windows, `filepath.Walk` will invoke the
callback with `some/path/to/foo.txt`, requiring well written clients
to perform pathname normalization for every file prior to working with
the specified file. In truth, many clients developed on unix and not
tested on Windows neglect this subtlety, which results in software
bugs when running on Windows. This library would invoke the callback
function with `some\path\to\foo.txt` for the same file when running on
Windows, eliminating the need for the client to normalize the
pathname, and lessening the likelihood that a client will work on unix
but not on Windows.
### It's more flexible than `filepath.Walk`
#### Configurable Handling of Symbolic Links
The default behavior of this library is to ignore symbolic links to
directories when walking a directory tree, just like `filepath.Walk`
does. However, it does invoke the callback function with each node it
finds, including symbolic links. If a particular use case exists to
follow symbolic links when traversing a directory tree, this library
can be invoked in a manner to do so, by setting the
`FollowSymbolicLinks` parameter to true.
#### Configurable Sorting of Directory Children
The default behavior of this library is to always sort the immediate
descendants of a directory prior to visiting each node, just like
`filepath.Walk` does. This is usually the desired behavior. However,
sorting the names carries a performance penalty when a directory node
has many entries. If a particular use case does not require sorting
the directory's immediate descendants prior to visiting its nodes,
this library will skip the sorting step when the `Unsorted` parameter
is set to true.
#### Configurable Post Children Callback
This library provides upstream code with the ability to specify a
callback to be invoked for each directory after its children are
processed. This has been used to recursively delete empty directories
after traversing the file system in a more efficient manner. See the
`examples/clean-empties` directory for an example of this usage.
#### Configurable Error Callback
This library provides upstream code with the ability to specify a
callback to be invoked for errors that the operating system returns,
allowing the upstream code to determine the next course of action to
take, whether to halt walking the hierarchy, as it would do were no
error callback provided, or skip the node that caused the error. See
the `examples/walk-fast` directory for an example of this usage.
package godirwalk
import (
"os"
"path/filepath"
)
// Dirent stores the name and file system mode type of discovered file system
// entries.
type Dirent struct {
name string
modeType os.FileMode
}
// NewDirent returns a newly initialized Dirent structure, or an error. This
// function does not follow symbolic links.
//
// This function is rarely used, as Dirent structures are provided by other
// functions in this library that read and walk directories.
func NewDirent(osPathname string) (*Dirent, error) {
fi, err := os.Lstat(osPathname)
if err != nil {
return nil, err
}
return &Dirent{
name: filepath.Base(osPathname),
modeType: fi.Mode() & os.ModeType,
}, nil
}
// Name returns the basename of the file system entry.
func (de Dirent) Name() string { return de.name }
// ModeType returns the mode bits that specify the file system node type. We
// could make our own enum-like data type for encoding the file type, but Go's
// runtime already gives us architecture independent file modes, as discussed in
// `os/types.go`:
//
// Go's runtime FileMode type has same definition on all systems, so that
// information about files can be moved from one system to another portably.
func (de Dirent) ModeType() os.FileMode { return de.modeType }
// IsDir returns true if and only if the Dirent represents a file system
// directory. Note that on some operating systems, more than one file mode bit
// may be set for a node. For instance, on Windows, a symbolic link that points
// to a directory will have both the directory and the symbolic link bits set.
func (de Dirent) IsDir() bool { return de.modeType&os.ModeDir != 0 }
// IsRegular returns true if and only if the Dirent represents a regular
// file. That is, it ensures that no mode type bits are set.
func (de Dirent) IsRegular() bool { return de.modeType&os.ModeType == 0 }
// IsSymlink returns true if and only if the Dirent represents a file system
// symbolic link. Note that on some operating systems, more than one file mode
// bit may be set for a node. For instance, on Windows, a symbolic link that
// points to a directory will have both the directory and the symbolic link bits
// set.
func (de Dirent) IsSymlink() bool { return de.modeType&os.ModeSymlink != 0 }
// Dirents represents a slice of Dirent pointers, which are sortable by
// name. This type satisfies the `sort.Interface` interface.
type Dirents []*Dirent
// Len returns the count of Dirent structures in the slice.
func (l Dirents) Len() int { return len(l) }
// Less returns true if and only if the Name of the element specified by the
// first index is lexicographically less than that of the second index.
func (l Dirents) Less(i, j int) bool { return l[i].name < l[j].name }
// Swap exchanges the two Dirent entries specified by the two provided indexes.
func (l Dirents) Swap(i, j int) { l[i], l[j] = l[j], l[i] }
/*
Package godirwalk provides functions to read and traverse directory trees.
In short, why do I use this library?
* It's faster than `filepath.Walk`.
* It's more correct on Windows than `filepath.Walk`.
* It's easier to use than `filepath.Walk`.
* It's more flexible than `filepath.Walk`.
USAGE
This library will normalize the provided top level directory name based on the
os-specific path separator by calling `filepath.Clean` on its first
argument. However it always provides the pathname created by using the correct
os-specific path separator when invoking the provided callback function.
    dirname := "some/directory/root"
    err := godirwalk.Walk(dirname, &godirwalk.Options{
        Callback: func(osPathname string, de *godirwalk.Dirent) error {
            fmt.Printf("%s %s\n", de.ModeType(), osPathname)
            return nil
        },
    })
This library not only provides functions for traversing a file system directory
tree, but also for obtaining a list of immediate descendants of a particular
directory, typically much more quickly than using `os.File.Readdir` or
`os.File.Readdirnames`.
*/
package godirwalk
package main
import (
"fmt"
"os"
"path/filepath"
"github.com/karrick/godirwalk"
"github.com/pkg/errors"
)
func main() {
if len(os.Args) < 2 {
fmt.Fprintf(os.Stderr, "usage: %s dir1 [dir2 [dir3...]]\n", filepath.Base(os.Args[0]))
os.Exit(2)
}
scratchBuffer := make([]byte, 64*1024) // allocate once and re-use each time
var count, total int
var err error
for _, arg := range os.Args[1:] {
count, err = pruneEmptyDirectories(arg, scratchBuffer)
total += count
if err != nil {
break
}
}
fmt.Fprintf(os.Stderr, "Removed %d empty directories\n", total)
if err != nil {
fmt.Fprintf(os.Stderr, "ERROR: %s\n", err)
os.Exit(1)
}
}
func pruneEmptyDirectories(osDirname string, scratchBuffer []byte) (int, error) {
var count int
err := godirwalk.Walk(osDirname, &godirwalk.Options{
Unsorted: true,
ScratchBuffer: scratchBuffer,
Callback: func(_ string, _ *godirwalk.Dirent) error {
// no-op while diving in; all the fun happens in PostChildrenCallback
return nil
},
PostChildrenCallback: func(osPathname string, _ *godirwalk.Dirent) error {
deChildren, err := godirwalk.ReadDirents(osPathname, scratchBuffer)
if err != nil {
return errors.Wrap(err, "cannot ReadDirents")
}
// NOTE: ReadDirents skips "." and ".."
if len(deChildren) > 0 {
return nil // this directory has children; no additional work here
}
if osPathname == osDirname {
return nil // do not remove provided root directory
}
err = os.Remove(osPathname)
if err == nil {
count++
}
return err
},
})
return count, err
}
package main
import (
"fmt"
"os"
"github.com/karrick/godirwalk"
)
func main() {
dirname := "."
if len(os.Args) > 1 {
dirname = os.Args[1]
}
err := godirwalk.Walk(dirname, &godirwalk.Options{
Callback: func(osPathname string, de *godirwalk.Dirent) error {
// fmt.Printf("%s %s\n", de.ModeType(), osPathname)
return nil
},
ErrorCallback: func(osPathname string, err error) godirwalk.ErrorAction {
// Your program may want to log the error somehow.
// fmt.Fprintf(os.Stderr, "ERROR: %s\n", err)
// For the purposes of this example, a simple SkipNode will suffice,
// although in reality perhaps additional logic might be called for.
return godirwalk.SkipNode
},
Unsorted: true, // set true for faster yet non-deterministic enumeration (see godoc)
})
if err != nil {
fmt.Fprintf(os.Stderr, "%s\n", err)
os.Exit(1)
}
}
package main
import (
"fmt"
"os"
"path/filepath"
)
func main() {
dirname := "."
if len(os.Args) > 1 {
dirname = os.Args[1]
}
err := filepath.Walk(dirname, func(osPathname string, info os.FileInfo, err error) error {
if err != nil {
return err
}
// fmt.Printf("%s %s\n", info.Mode(), osPathname)
return nil
})
if err != nil {
fmt.Fprintf(os.Stderr, "%s\n", err)
os.Exit(1)
}
}
module github.com/karrick/godirwalk
package godirwalk
// ReadDirents returns a sortable slice of pointers to Dirent structures, each
// representing the file system name and mode type for one of the immediate
// descendants of the specified directory. If the specified directory is a
// symbolic link, it will be resolved.
//
// If an optional scratch buffer is provided that is at least one page of
// memory, it will be used when reading directory entries from the file system.
//
//    children, err := godirwalk.ReadDirents(osDirname, nil)
//    if err != nil {
//        return nil, errors.Wrap(err, "cannot get list of directory children")
//    }
//    sort.Sort(children)
//    for _, child := range children {
//        fmt.Printf("%s %s\n", child.ModeType(), child.Name())
//    }
func ReadDirents(osDirname string, scratchBuffer []byte) (Dirents, error) {
return readdirents(osDirname, scratchBuffer)
}
// ReadDirnames returns a slice of strings, representing the immediate
// descendants of the specified directory. If the specified directory is a
// symbolic link, it will be resolved.
//
// If an optional scratch buffer is provided that is at least one page of
// memory, it will be used when reading directory entries from the file system.
//
// Note that this function, depending on operating system, may or may not invoke
// the ReadDirents function, in order to prepare the list of immediate
// descendants. Therefore, if your program needs both the names and the file
// system mode types of descendants, it will always be faster to invoke
// ReadDirents directly, rather than calling this function, then looping over
// the results and calling os.Stat for each child.
//
//    children, err := godirwalk.ReadDirnames(osDirname, nil)
//    if err != nil {
//        return nil, errors.Wrap(err, "cannot get list of directory children")
//    }
//    sort.Strings(children)
//    for _, child := range children {
//        fmt.Printf("%s\n", child)
//    }
func ReadDirnames(osDirname string, scratchBuffer []byte) ([]string, error) {
return readdirnames(osDirname, scratchBuffer)
}
package godirwalk
import (
"os"
"path/filepath"
"sort"
"testing"
)
func TestReadDirents(t *testing.T) {
root := setup(t)
defer teardown(t, root)
entries, err := ReadDirents(root, nil)
if err != nil {
t.Fatal(err)
}
expected := Dirents{
&Dirent{
name: "dir1",
modeType: os.ModeDir,
},
&Dirent{
name: "dir2",
modeType: os.ModeDir,
},
&Dirent{
name: "dir3",
modeType: os.ModeDir,
},
&Dirent{
name: "dir4",
modeType: os.ModeDir,
},
&Dirent{
name: "dir5",
modeType: os.ModeDir,
},
&Dirent{
name: "dir6",
modeType: os.ModeDir,
},
&Dirent{
name: "file3",
modeType: 0,
},
&Dirent{
name: "symlinks",
modeType: os.ModeDir,
},
}
if got, want := len(entries), len(expected); got != want {
t.Fatalf("(GOT) %v; (WNT) %v", got, want)
}
sort.Sort(entries)
sort.Sort(expected)
for i := 0; i < len(entries); i++ {
if got, want := entries[i].name, expected[i].name; got != want {
t.Errorf("(GOT) %v; (WNT) %v", got, want)
}
if got, want := entries[i].modeType, expected[i].modeType; got != want {
t.Errorf("(GOT) %v; (WNT) %v", got, want)
}
}
}
func TestReadDirentsSymlinks(t *testing.T) {
root := setup(t)
defer teardown(t, root)
osDirname := filepath.Join(root, "symlinks")
// Because some platforms set multiple mode type bits, when we create the
// expected slice, we need to ensure the mode types are set appropriately.
var expected Dirents
for _, pathname := range []string{"dir-symlink", "file-symlink", "invalid-symlink"} {
info, err := os.Lstat(filepath.Join(osDirname, pathname))
if err != nil {
t.Fatal(err)
}
expected = append(expected, &Dirent{name: pathname, modeType: info.Mode() & os.ModeType})
}
entries, err := ReadDirents(osDirname, nil)
if err != nil {
t.Fatal(err)
}
if got, want := len(entries), len(expected); got != want {
t.Fatalf("(GOT) %v; (WNT) %v", got, want)
}
sort.Sort(entries)
sort.Sort(expected)
for i := 0; i < len(entries); i++ {
if got, want := entries[i].name, expected[i].name; got != want {
t.Errorf("(GOT) %v; (WNT) %v", got, want)
}
if got, want := entries[i].modeType, expected[i].modeType; got != want {
t.Errorf("(GOT) %v; (WNT) %v", got, want)
}
}
}
func TestReadDirnames(t *testing.T) {
root := setup(t)
defer teardown(t, root)
entries, err := ReadDirnames(root, nil)
if err != nil {
t.Fatal(err)
}
expected := []string{"dir1", "dir2", "dir3", "dir4", "dir5", "dir6", "file3", "symlinks"}
if got, want := len(entries), len(expected); got != want {
t.Fatalf("(GOT) %v; (WNT) %v", got, want)
}
sort.Strings(entries)
sort.Strings(expected)
for i := 0; i < len(entries); i++ {
if got, want := entries[i], expected[i]; got != want {
t.Errorf("(GOT) %v; (WNT) %v", got, want)
}
}
}
// +build darwin freebsd linux netbsd openbsd
package godirwalk
import (
"os"
"path/filepath"
"syscall"
"unsafe"
)
func readdirents(osDirname string, scratchBuffer []byte) (Dirents, error) {
dh, err := os.Open(osDirname)
if err != nil {
return nil, err
}
var entries Dirents
fd := int(dh.Fd())
if len(scratchBuffer) < MinimumScratchBufferSize {
scratchBuffer = make([]byte, DefaultScratchBufferSize)
}
var de *syscall.Dirent
for {
n, err := syscall.ReadDirent(fd, scratchBuffer)
if err != nil {
_ = dh.Close() // ignore potential error returned by Close
return nil, err
}
if n <= 0 {
break // end of directory reached
}
// Loop over the bytes returned by reading the directory entries.
buf := scratchBuffer[:n]
for len(buf) > 0 {
de = (*syscall.Dirent)(unsafe.Pointer(&buf[0])) // point entry to first syscall.Dirent in buffer
buf = buf[de.Reclen:] // advance buffer
if inoFromDirent(de) == 0 {
continue // this item has been deleted, but not yet removed from directory
}
nameSlice := nameFromDirent(de)
namlen := len(nameSlice)
if (namlen == 0) || (namlen == 1 && nameSlice[0] == '.') || (namlen == 2 && nameSlice[0] == '.' && nameSlice[1] == '.') {
continue // skip unimportant entries
}