• Linus Torvalds's avatar
    Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 7f427d3a
    Linus Torvalds authored
    Pull parallel filesystem directory handling update from Al Viro.
    This is the main parallel directory work by Al that makes the vfs layer
    able to do lookup and readdir in parallel within a single directory.
    That's a big change, since this used to be all protected by the
    directory inode mutex.
    The inode mutex is replaced by an rwsem, and serialization of lookups of
    a single name is done by a "in-progress" dentry marker.
    The series begins with xattr cleanups, and then ends with switching
    filesystems over to actually doing the readdir in parallel (switching to
    the "iterate_shared()" that only takes the read lock).
    A more detailed explanation of the process from Al Viro:
     "The xattr work starts with some acl fixes, then switches ->getxattr to
      passing inode and dentry separately.  This is the point where the
      things start to get tricky - that got merged into the very beginning
      of the -rc3-based #work.lookups, to allow untangling the
      security_d_instantiate() mess.  The xattr work itself proceeds to
      switch a lot of filesystems to generic_...xattr(); no complications
      After that initial xattr work, the series then does the following:
       - untangle security_d_instantiate()
       - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
         that thing; one such place (in overlayfs) actually yields a trivial
         conflict with overlayfs fixes later in the cycle - overlayfs ended
         up switching to a variant of lookup_one_len_unlocked() sans the
         permission checks.  I would've dropped that commit (it gets
         overridden on merge from #ovl-fixes in #for-next; proper resolution
         is to use the variant in mainline fs/overlayfs/super.c), but I
         didn't want to rebase the damn thing - it was fairly late in the
       - some filesystems had managed to depend on lookup/lookup exclusion
         for *fs-internal* data structures in a way that would break if we
         relaxed the VFS exclusion.  Fixing hadn't been hard, fortunately.
       - core of that series - parallel lookup machinery, replacing
         ->i_mutex with rwsem, making lookup_slow() take it only shared.  At
         that point lookups happen in parallel; lookups on the same name
         wait for the in-progress one to be done with that dentry.
         Surprisingly little code, at that - almost all of it is in
         fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
         making it use the new primitive and actually switching to locking
       - parallel readdir stuff - first of all, we provide the exclusion on
         per-struct file basis, same as we do for read() vs lseek() for
         regular files.  That takes care of most of the needed exclusion in
         readdir/readdir; however, these guys are trickier than lookups, so
         I went for switching them one-by-one.  To do that, a new method
         '->iterate_shared()' is added and filesystems are switched to it
         as they are either confirmed to be OK with shared lock on directory
         or fixed to be OK with that.  I hope to kill the original method
         come next cycle (almost all in-tree filesystems are switched
         already), but it's still not quite finished.
       - several filesystems get switched to parallel readdir.  The
         interesting part here is dealing with dcache preseeding by readdir;
         that needs minor adjustment to be safe with directory locked only
         Most of the filesystems doing that got switched to in those
         commits.  Important exception: NFS.  Turns out that NFS folks, with
         their, er, insistence on VFS getting the fuck out of the way of the
         Smart Filesystem Code That Knows How And What To Lock(tm) have
         grown the locking of their own.  They had their own homegrown
         rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
         is the reader there).  Of course, with VFS getting the fuck out of
         the way, as requested, the actual smarts of the smart filesystem
         code etc. had become exposed...
       - do_last/lookup_open/atomic_open cleanups.  As the result, open()
         without O_CREAT locks the directory only shared.  Including the
         ->atomic_open() case.  Backmerge from #for-linus in the middle of
         that - atomic_open() fix got brought in.
       - then comes NFS switch to saner (VFS-based ;-) locking, killing the
         homegrown "lookup and readdir are writers" kinda-sorta rwsem.  All
         exclusion for sillyunlink/lookup is done by the parallel lookups
         mechanism.  Exclusion between sillyunlink and rmdir is a real rwsem
         now - rmdir being the writer.
         Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
       - the rest of the series consists of switching a lot of filesystems
         to parallel readdir; in a lot of cases ->llseek() gets simplified
         as well.  One backmerge in there (again, #for-linus - rockridge
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
      ext4: switch to ->iterate_shared()
      hfs: switch to ->iterate_shared()
      hfsplus: switch to ->iterate_shared()
      hostfs: switch to ->iterate_shared()
      hpfs: switch to ->iterate_shared()
      hpfs: handle allocation failures in hpfs_add_pos()
      gfs2: switch to ->iterate_shared()
      f2fs: switch to ->iterate_shared()
      afs: switch to ->iterate_shared()
      befs: switch to ->iterate_shared()
      befs: constify stuff a bit
      isofs: switch to ->iterate_shared()
      get_acorn_filename(): deobfuscate a bit
      btrfs: switch to ->iterate_shared()
      logfs: no need to lock directory in lseek
      switch ecryptfs to ->iterate_shared
      9p: switch to ->iterate_shared()
      fat: switch to ->iterate_shared()
      romfs, squashfs: switch to ->iterate_shared()
      more trivial ->iterate_shared conversions
commoncap.c 31.3 KB