  1. Jul 19, 2002
    • [PATCH] minimal rmap · c48c43e6
      Andrew Morton authored
      This is the "minimal rmap" patch, written by Rik, ported to 2.5 by Craig
      Kulesa.
      
      Basically,
      
      before: When the page reclaim code decides that it has scanned too many
      unreclaimable pages on the LRU it does a scan of process virtual
      address spaces for pages to add to swapcache.  ptes pointing at the
      page are unmapped as the scan proceeds.  When all ptes referring to a
      page have been unmapped and it has been written to swap the page is
      reclaimable.
      
      after: When an anonymous page is encountered on the tail of the LRU we
      use the rmap to see whether it has been referenced lately.  If not,
      we add it to swapcache.  When the page is again encountered on the
      LRU, if it is still unreferenced then try to unmap all ptes which
      refer to it in one hit, and if it is clean (ie: on swap) then free it.
      
      The rest of the VM - list management, the classzone concept, etc
      remains unchanged.
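
      As a rough sketch (field and function names here are illustrative,
      not necessarily the ones the patch uses), the per-page pte chain and
      the reclaim-side "was it referenced?" walk look like:

      	struct pte_chain {
      		struct pte_chain *next;
      		pte_t *ptep;		/* one pte which maps the page */
      	};

      	/* visit every pte mapping the page in one hit */
      	static int page_referenced_rmap(struct page *page)
      	{
      		struct pte_chain *pc;
      		int referenced = 0;

      		for (pc = page->pte_chain; pc != NULL; pc = pc->next)
      			if (ptep_test_and_clear_young(pc->ptep))
      				referenced++;
      		return referenced;
      	}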
      
      There are a number of things which the per-page pte chain could be
      used for.  Bill Irwin has identified the following.
      
      
      (1)  page replacement no longer goes around randomly unmapping things
      
      (2)  referenced bits are more accurate because there aren't several ms
              or even seconds between finding the multiple ptes mapping a page
      
      (3)  reduces page replacement from O(total virtually mapped) to O(physical)
      
      (4)  enables defragmentation of physical memory
      
      (5)  enables cooperative offlining of memory for friendly guest instance
              behavior in UML and/or LPAR settings
      
      (6)  demonstrable benefit in performance of swapping which is common in
              end-user interactive workstation workloads (I don't like the word
              "desktop"). c.f. Craig Kulesa's post wrt. swapping performance
      
      (7)  evidence from 2.4-based rmap trees indicates approximate parity
              with mainline in kernel compiles with appropriate locking bits
      
      (8)  partitioning of physical memory can reduce the complexity of page
              replacement searches by scanning only the "interesting" zones
              implemented and merged in 2.4-based rmap
      
      (9)  partitioning of physical memory can increase the parallelism of page
              replacement searches by independently processing different zones
              implemented, but not merged in 2.4-based rmap
      
      (10) the reverse mappings may be used for efficiently keeping pte cache
              attributes coherent
      
      (11) they may be used for virtual cache invalidation (with changes)
      
      (12) the reverse mappings enable proper RSS limit enforcement
              implemented and merged in 2.4-based rmap
      
      
      
      The code adds a pointer to struct page, consumes additional storage for
      the pte chains and adds computational expense to the page reclaim code
      (I measured it at 3% additional load during streaming I/O).  The
      benefits which we get back for all this are, I must say, theoretical
      and unproven.  If it has real advantages (or, indeed, disadvantages)
      then why has nobody demonstrated them?
      
      
      
      There are a number of things remaining to be done:
      
      1: Demonstrate the above advantages.
      
      2: Make it work with pte-highmem  (Bill Irwin is signed up for this)
      
      3: Optimisation: don't add pte_chains to non-shared pages (Dave
         McCracken's patch does this)
      
      4: Move the pte_chains into highmem too (Bill, I guess)
      
      5: per-cpu pte_chain freelists (Rik?)
      
      6: maybe GC the pte_chain backing pages. (Seems unavoidable.  Rik?)
      
      7: multithread the page reclaim code.  (I have patches).
      
      8: clustered add-to-swap.  Not sure if I buy this.  anon pages are
         often well-ordered-by-virtual-address on the LRU, so it "just
         works" for benchmarky loads.  But there may be some other loads...
      
      9: Fix bad IO latency in page reclaim (I have lame patches)
      
      10: Develop tuning tools, use them.
      
      11: The nightly updatedb run is still evicting everything.
  2. Jul 04, 2002
    • [PATCH] suppress more allocation failure warnings · 193ae036
      Andrew Morton authored
      The `page allocation failure' warning in __alloc_pages() is being a
      pain.  But I'm persisting with it...
      
      The patch renames PF_RADIX_TREE to PF_NOWARN, and uses it in a few
      places where allocation failures are known to happen.  These code
      paths are well-tested now and suppressing the warning is OK.
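
      The usage pattern at such a call site is roughly this (a sketch only,
      assuming __alloc_pages() checks PF_NOWARN before printing):

      	unsigned long flags = current->flags;

      	current->flags |= PF_NOWARN;
      	page = alloc_page(gfp_mask);	/* failure here is expected */
      	current->flags = flags;
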
    • [PATCH] always update page->flags atomically · a2b41d23
      Andrew Morton authored
      move_from_swap_cache() and move_to_swap_cache() are playing with
      page->flags nonatomically.  The page is on the LRU at the time and
      another CPU could be altering page->flags concurrently.
      
      The patch converts those functions to use atomic operations.
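
      The difference, schematically (illustrative, not the patch itself):

      	/* nonatomic RMW: a concurrent bit update on another CPU
      	 * can be lost when this word is written back */
      	page->flags &= ~((1 << PG_uptodate) | (1 << PG_error));

      	/* atomic: each update is a locked bit operation */
      	ClearPageUptodate(page);
      	ClearPageError(page);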
      
      It also rationalises the number of bits which are cleared.  It's not
      really clear to me what page flags we really want to set to a known
      state in there.
      
      It had no right to go clearing PG_arch_1.  I'm now clearing PG_arch_1
      inside rmqueue() which is still a bit presumptuous.
      
      btw: shmem uses PAGE_CACHE_SIZE and swapper_space uses PAGE_SIZE.  I've
      been carefully maintaining the distinction, but it looks like shmem
      will break if we ever do make these values different.
      
      
      Also, __add_to_page_cache() was performing a non-atomic RMW against
      page->flags, under the assumption that it was a newly allocated page
      which no other CPU would look at.  Not true - this function is used for
      moving anon pages into swapcache.  Those anon pages are on the LRU -
      other CPUs can be performing operations against page->flags while
      __add_to_swap_cache is stomping on them.  This had me running around in
      circles for two days.
      
      So let's move the initialisation of the page state into rmqueue(),
      where the page really is new (could do it in page_cache_alloc,
      perhaps).
      
      The SetPageLocked() in __add_to_page_cache() is also rather curious.
      Seems OK for both pagecache and swapcache so I covered that with a
      comment.
      
      
      2.4 has the same problem.  Basically, add_to_swap_cache() can stomp on
      another CPU's manipulation of page->flags.  After a quick review of the
      code there, it is barely conceivable that a concurrent refill_inactive()
      could get its PG_referenced and PG_active bits scribbled on.  Rather
      unlikely because swap_out() will probably see PageActive() and bale
      out.
      
      Also, mark_dirty_kiobuf() could have its PG_dirty bit accidentally
      cleared (but try_to_swap_out() sets it again later).
      
      But there may be other code paths.  Really, I think this needs fixing
      in 2.4 - it's horrid.
    • [PATCH] resurrect __GFP_HIGH · 371151c9
      Andrew Morton authored
      This patch reinstates __GFP_HIGH functionality.
      
      __GFP_HIGH means "able to dip into the emergency pools".  However,
      somewhere along the line this got broken.  __GFP_HIGH ceased to do
      anything.  Instead, !__GFP_WAIT is used to tell the page allocator to
      try harder.
      
      __GFP_HIGH makes sense.  The concepts of "unable to sleep" and "should
      try harder" are quite separate, and overloading !__GFP_WAIT to mean
      "should access emergency pools" seems wrong.
      
      This patch fixes a problem in mempool_alloc().  mempool_alloc() tries
      the first allocation with __GFP_WAIT cleared.  If that fails, it tries
      again with __GFP_WAIT enabled (if the caller can support __GFP_WAIT).
      So it is currently performing an atomic allocation first, even though
      the caller said that they're prepared to go in and call the page
      stealer.
      
      I thought this was a mempool bug, but Ingo said:
      
      > no, it's not GFP_ATOMIC. The important difference is __GFP_HIGH, which
      > triggers the intrusive highprio allocation mode. Otherwise gfp_nowait is
      > just a nonblocking allocation of the same type as the original gfp_mask.
      > ...
      > what i've added is a bit more subtle allocation method, with both
      > performance and balancing-correctness in mind:
      >
      > 1. allocate via gfp_mask, but nonblocking
      > 2. if failure => try to get from the pool if the pool is 'full enough'.
      > 3. if failure => allocate with gfp_mask [which might block]
      >
      > there is performance data that this method improves bounce-IO performance
      > significantly, because even under VM pressure (when gfp_mask would block)
      > we can still use up to 50% of the memory pool without blocking (and
      > without endangering deadlock-free allocation). Ie. the memory pool is also
      > a fast 'frontside cache' of memory elements.
      
      Ingo was assuming that __GFP_HIGH was still functional.  It isn't, and the
      mempool design wants it.
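
      A simplified sketch of that ordering (locking and the wait loop are
      omitted; the 'full enough' threshold is the 50% from Ingo's
      description):

      	void *mempool_alloc(mempool_t *pool, int gfp_mask)
      	{
      		void *element;

      		/* 1: nonblocking attempt via the normal allocator */
      		element = pool->alloc(gfp_mask & ~__GFP_WAIT, pool->pool_data);
      		if (element)
      			return element;

      		/* 2: dip into the reserve while the pool is full enough */
      		if (pool->curr_nr > pool->min_nr / 2)
      			return remove_element(pool);	/* pool's own helper */

      		/* 3: blocking allocation, if the caller permits it */
      		if (gfp_mask & __GFP_WAIT)
      			return pool->alloc(gfp_mask, pool->pool_data);
      		return NULL;
      	}
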
  3. Jun 18, 2002
    • [PATCH] allow GFP_NOFS allocators to perform swapcache writeout · 493f4988
      Andrew Morton authored
      One weakness which was introduced when the buffer LRU went away was
      that GFP_NOFS allocations became equivalent to GFP_NOIO.  Because all
      writeback goes via writepage/writepages, which requires entry into the
      filesystem.
      
      However now that swapout no longer calls bmap(), we can honour
      GFP_NOFS's intent for swapcache pages.  So if the allocation request
      specifies __GFP_IO and !__GFP_FS, we can wait on swapcache pages and we
      can perform swapcache writeout.
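
      Schematically, the writeout decision in the reclaim path becomes
      (a sketch, not the literal patch):

      	int may_write = (gfp_mask & __GFP_FS) ||
      			(PageSwapCache(page) && (gfp_mask & __GFP_IO));

      	if (PageDirty(page) && may_write)
      		writepage(page);	/* swap_writepage() for swapcache */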
      
      This should strengthen the VM somewhat.
  4. Jun 02, 2002
    • [PATCH] fix swapcache packing in the radix tree · 02eaba7f
      Andrew Morton authored
      First some terminology: this patch introduces a kernel-wide `pgoff_t'
      type.  It is the index of a page into the pagecache.  The thing at
      page->index.  For most mappings it is also the offset of the page into
      that mapping.  This type has a very distinct function in the kernel and
      it needs a name.  I don't have any particular plans to go and migrate
      everything so we can support 64-bit pagecache indices on x86, but this
      would be the way to do it.
      
      This patch improves the packing density of swapcache pages in the radix
      tree.
      
      A swapcache page is identified by the `swap type' (indexes the swap
      device) and the `offset' (into that swap device).  These two numbers
      are encoded into a `swp_entry_t' machine word in arch-specific code
      because the resulting number is placed into pagetables in a form which
      will generate a fault.
      
      The kernel also needs to generate a pgoff_t for that page to index it
      into the swapper_space radix tree.  That pgoff_t is usually
      bitwise-identical to the swp_entry_t.  That worked OK when the
      pagecache was using a hash.  But with a radix tree, it produces
      catastrophically bad results.
      
      x86 (and many other architectures) place the `type' field into the
      low-order bits of the swp_entry_t.  So *all* swapcache pages are
      basically identical in the eight low-order bits.  This produces a very
      sparse radix tree for swapcache.  I'm observing packing densities of 1%
      to 2%: so the typical 128-slot radix tree node has only one or two
      pages in it.
      
      The end result is that the kernel needs to allocate approximately one
      new radix-tree node for each page which is added to the swapcache.  So
      no wonder we're having radix-tree node exhaustion during swapout!
      (It's actually quite encouraging that the kernel works as well as it
      does).
      
      The patch changes the encoding of the swp_entry_t so that its
      most-significant bits contain the `type' field and the
      least-significant bits contain the `offset' field, right-aligned.
      
      That is: the encoding in swp_entry_t is now arch-independent.  The new
      file <linux/swapops.h> has conversion functions which convert the
      swp_entry_t to and from its machine pte representation.
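
      The shape of the new encoding, sketched (close to, but not
      necessarily identical to, what <linux/swapops.h> contains):

      	#define SWP_TYPE_SHIFT	(BITS_PER_LONG - MAX_SWAPFILES_SHIFT)
      	#define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)

      	static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
      	{
      		swp_entry_t e = { .val = (type << SWP_TYPE_SHIFT) | offset };
      		return e;
      	}

      	static inline unsigned long swp_type(swp_entry_t e)
      	{
      		return e.val >> SWP_TYPE_SHIFT;
      	}

      	static inline pgoff_t swp_offset(swp_entry_t e)
      	{
      		return e.val & SWP_OFFSET_MASK;
      	}

      Consecutive swap offsets now give consecutive pagecache indices,
      which is exactly what the radix tree packs well.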
      
      Packing density in the swapper_space mapping goes up to around 90%
      (observed) and the kernel is tons happier under swap load.
      
      
      An alternative approach would be to create new conversion functions
      which convert an arch-specific swp_entry_t to and from a pgoff_t.  I
      tried that.  It worked, but I liked it less.
  5. May 29, 2002
    • [PATCH] swsusp: cleanup · d72fb463
      Pavel Machek authored
       - use list_for_each in head_of_free_region
       - cleanups from 2.4
       - fix for usb
       - kill broken queueing
  6. May 28, 2002
    • [PATCH] block plugging reworked · eba5b46c
      Jens Axboe authored
      This patch provides the ability for a block driver to signal it's too
      busy to receive more work and temporarily halt the request queue. In
      concept it's similar to the networking netif_{start,stop}_queue helpers.
      
      To do this cleanly, I've ripped out the old tq_disk task queue. Instead
      an internal list of plugged queues is maintained which will honor the
      current queue state (see QUEUE_FLAG_STOPPED bit). Execution of
      request_fn has been moved to tasklet context. blk_run_queues() provides
      similar functionality to the old run_task_queue(&tq_disk).
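
      In driver terms the helpers work like this (a sketch modelled on the
      netif pair; the exact restart mechanics live in the tasklet):

      	void blk_stop_queue(request_queue_t *q)
      	{
      		set_bit(QUEUE_FLAG_STOPPED, &q->queue_flags);
      	}

      	void blk_start_queue(request_queue_t *q)
      	{
      		clear_bit(QUEUE_FLAG_STOPPED, &q->queue_flags);
      		/* the plugged-queue tasklet will run ->request_fn
      		 * again the next time the queue is kicked */
      	}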
      
      Now, this only works at the request_fn level and not at the
      make_request_fn level. This is on purpose: drivers working at the
      make_request_fn level are essentially providing a piece of the block
      level infrastructure themselves. There are basically two reasons for
      doing make_request_fn style setups:
      
      o block remappers. start/stop functionality will be done at the target
        device in this case, which is the level that will signal hardware full
        (or continue) anyways.
      
      o drivers who wish to receive single entities of "buffers" and not
        merged requests etc. This could use the start/stop functionality. I'd
        suggest _still_ using a request_fn for these, but set the queue
        options so that no merging etc ever takes place. This has the added
        bonus of providing the usual request depletion throttling at the block
        level.
  9. May 21, 2002
    • [PATCH] suspend-to-{RAM,disk} · 542f96a5
      Pavel Machek authored
      Here's suspend-to-{RAM,disk} combined patch for
      2.5.17. Suspend-to-disk is pretty stable and was tested in
      2.4-ac. Suspend-to-RAM is a little more experimental, but works for
      me, and is certainly better than the disk-eating version currently in
      the kernel.
      
      Major parts are: process stopper, S3 specific code, S4 specific
      code.
  10. May 19, 2002
    • [PATCH] remove PG_launder · a2536452
      Andrew Morton authored
      Removal of PG_launder.
      
      It's not obvious (to me) why this ever existed.  If it's to prevent
      deadlocks then I'd like to know who was performing __GFP_FS allocations
      while holding a page lock?
      
      But in 2.5, the only memory allocations which are performed when the
      caller holds PG_writeback against an unsubmitted page are those which
      occur inside submit_bh().  There will be no __GFP_FS allocations in
      that call chain.
      
      Removing PG_launder means that memory allocators can block on any
      PageWriteback() page at all, which reduces the risk of very long list
      walks inside pagemap_lru_lock in shrink_cache().
    • [PATCH] writeback tuning · acb5f6f9
      Andrew Morton authored
      Tune up the VM-based writeback a bit.
      
      - Always use the multipage clustered-writeback function from within
        shrink_cache(), even if the page's mapping has a NULL ->vm_writeback().  So
        clustered writeback is turned on for all address_spaces, not just ext2.
      
        Subtle effect of this change: it is now the case that *all* writeback
        proceeds along the mapping->dirty_pages list.  The orderedness of the page
        LRUs no longer has an impact on disk scheduling.  So we only have one list
        to keep well-sorted rather than two, and churning pages around on the LRU
        will no longer damage write bandwidth - it's all up to the filesystem.
      
      - Decrease the clustered writeback from 1024 pages(!) to 32 pages.
      
        (1024 was a leftover from when this code was always dispatching writeback
        to a pdflush thread).
      
      - Fix wakeup_bdflush() so that it actually does write something (duh).
      
        do_wp_page() needs to call balance_dirty_pages_ratelimited(), so we
        throttle mmap page-dirtiers in the same way as write(2) page-dirtiers.
        This may make wakeup_bdflush() obsolete, but it doesn't hurt.
      
      - Converts generic_vm_writeback() to directly call ->writeback_mapping(),
    rather than going through writeback_single_inode().  This prevents memory
        allocators from blocking on the inode's I_LOCK.  But it does mean that two
        processes can be writing pages from the same mapping at the same time.  If
        filesystems care about this (for layout reasons) then they should serialise
        in their ->writeback_mapping a_op.
      
        This means that memory-allocators will writeback only pages, not pages
        and inodes.  There are no locks in that writeback path (except for request
        queue exhaustion).  Reduces memory allocation latency.
      
      - Implement new background_writeback function, which when kicked off
        will perform writeback until dirty memory falls below the background
        threshold.
      
      - Put written-back pages onto the remote end of the page LRU.  It
        does this in the slow-and-stupid way at present.  pagemap_lru_lock
        stress-relief is planned...
      
      - Remove the funny writeback_unused_inodes() stuff from prune_icache().
        Writeback from wakeup_bdflush() and the `kupdate' function now just
        naturally cleanses the oldest inodes so we don't need to do anything
        there.
      
      - Dirty memory balancing is still using magic numbers: "after you
        dirtied your 1,000th page, go write 1,500".  Obviously, this needs
        more work.
  11. May 05, 2002
    • [PATCH] suppress allocation warnings for radix-tree allocations · 038de6b6
      Andrew Morton authored
      The recently-added page allocation failure warning generates a lot of
      noise due to radix-tree node allocation failures.  Those messages are
      not interesting.
      
      But I think the warning is otherwise useful - "I got an allocation
      failure and then it crashed" is better than "it crashed".
      
      The patch suppresses the message for ratnode allocation failures.
  12. May 03, 2002
    • [PATCH] 2.5.13: remove VALID_PAGE · 5528f050
      Roman Zippel authored
      This patch removes VALID_PAGE(), as the test was always too late for
      discontiguous memory configurations. It is replaced with pfn_valid()/
      virt_addr_valid(), which are used to test the original input value.
      Other helper functions:
      pte_pfn() - extract the page number from a pte
      pfn_to_page()/page_to_pfn() - convert a page number to/from a page struct
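
      For the common contiguous-memory case these reduce to mem_map
      arithmetic (a sketch of the flat definitions):

      	#define pfn_valid(pfn)		((pfn) < max_mapnr)
      	#define pfn_to_page(pfn)	(mem_map + (pfn))
      	#define page_to_pfn(page)	((unsigned long)((page) - mem_map))
      	#define pte_pfn(pte)		(pte_val(pte) >> PAGE_SHIFT)
      	#define virt_addr_valid(kaddr)	pfn_valid(__pa(kaddr) >> PAGE_SHIFT)
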
  13. Apr 30, 2002
    • [PATCH] page writeback locking update · a2bcb3a0
      Andrew Morton authored
      - Fixes a performance problem - callers of
        prepare_write/commit_write, etc are locking pages, which synchronises
        them behind writeback, which also locks these pages.  Significant
        slowdowns for some workloads.
      
      - So pages are no longer locked while under writeout.  Introduce a
        new PG_writeback and associated infrastructure to support this design
        change.
      
      - Pages which are under read I/O still use PageLocked.  Pages which
        are under write I/O have PageWriteback() true.
      
        I considered creating Page_IO instead of PageWriteback, and marking
        both readin and writeout pages as PageIO().  So pages are unlocked
        during both read and write.  There just doesn't seem a need to do
        this - nobody ever needs unblocking access to a page which is under
        read I/O.
      
      - Pages under swapout (brw_page) are PageLocked, not PageWriteback.
  So their treatment is unchanged.
      
        It's not obvious that pages which are under swapout actually need
        the more asynchronous behaviour of PageWriteback.
      
        I was setting the swapout pages PageWriteback and unlocking them
        prior to submitting the buffers in brw_page().  This led to deadlocks
        on the exit_mmap->zap_page_range->free_swap_and_cache path.  These
        functions call block_flushpage under spinlock.  If the page is
        unlocked but has locked buffers, block_flushpage->discard_buffer()
        sleeps.  Under spinlock.  So that will need fixing if for some reason
        we want swapout to use PageWriteback.
      
  The kernel has called block_flushpage() under spinlock for a long time.
   It assumes that a locked page will never have locked buffers.
        This appears to be true, but it's ugly.
      
      - Adds new function wait_on_page_writeback().  Renames wait_on_page()
        to wait_on_page_locked() to remind people that they need to call the
        appropriate one.
      
      - Renames filemap_fdatasync() to filemap_fdatawrite().  It's more
  accurate - "sync" implies, if anything, writeout and wait (as in fsync
  and msync), or perhaps just writeout; it's not clear.
      
      - Subtly changes the filemap_fdatawrite() internals - this function
        used to do a lock_page() - it waited for any other user of the page
        to let go before submitting new I/O against a page.  It has been
        changed to simply skip over any pages which are currently under
        writeback.
      
        This is the right thing to do for memory-cleansing reasons.
      
        But it's the wrong thing to do for data consistency operations (eg,
        fsync()).  For those operations we must ensure that all data which
    was dirty *at the time of the system call* is tight on disk before
        the call returns.
      
        So all places which care about this have been converted to do:
      
      	filemap_fdatawait(mapping);	/* Wait for current writeback */
      	filemap_fdatawrite(mapping);	/* Write all dirty pages */
      	filemap_fdatawait(mapping);	/* Wait for I/O to complete */
      
      - Fixes a truncate_inode_pages problem - truncate currently will
        block when it hits a locked page, so it ends up getting into lockstep
        behind writeback and all of the file is pointlessly written back.
      
        One fix for this is for truncate to simply walk the page list in the
        opposite direction from writeback.
      
        I chose to use a separate cleansing pass.  It is more
        CPU-intensive, but it is surer and clearer.  This is because there is
        no reason why the per-address_space ->vm_writeback and
        ->writeback_mapping functions *have* to perform writeout in
        ->dirty_pages order.  They may choose to do something totally
        different.
      
        (set_page_dirty() is an a_op now, so address_spaces could almost
        privatise the whole dirty-page handling thing.  Except
        truncate_inode_pages and invalidate_inode_pages assume that the pages
        are on the address_space lists.  hmm.  So making truncate_inode_pages
        and invalidate_inode_pages a_ops would make some sense).
    • [PATCH] cleanup page flags · aa78091f
      Andrew Morton authored
      page->flags cleanup.
      
      Moves the definitions of the page->flags bits and all the PageFoo
      macros into linux/page-flags.h.  That file is currently included from
      mm.h, but the stage is set to remove that and include page-flags.h
      directly in all .c files which require it.  (120 of them).
      
      The patch also makes all the page flag macros and functions consistent:
      
      For PG_foo, the following functions are defined:
      
      	SetPageFoo
      	ClearPageFoo
      	TestSetPageFoo
      	TestClearPageFoo
      	PageFoo
      
      and that's it.
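
      Each family expands along one pattern (illustrative):

      	#define PageFoo(page)		test_bit(PG_foo, &(page)->flags)
      	#define SetPageFoo(page)	set_bit(PG_foo, &(page)->flags)
      	#define ClearPageFoo(page)	clear_bit(PG_foo, &(page)->flags)
      	#define TestSetPageFoo(page)	test_and_set_bit(PG_foo, &(page)->flags)
      	#define TestClearPageFoo(page)	test_and_clear_bit(PG_foo, &(page)->flags)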
      
      - Page_Uptodate is renamed to PageUptodate
      
      - LockPage is removed.  All users updated to use SetPageLocked
      
      - UnlockPage is removed.  All callers updated to use unlock_page().
    It's a real function - there's no need to hide that fact.
      
      - PageTestandClearReferenced renamed to TestClearPageReferenced
      
      - PageSetSlab renamed to SetPageSlab
      
      - __SetPageReserved is removed.  It's an infinitesimally small
         microoptimisation, and is inconsistent.
      
      - TryLockPage is renamed to TestSetPageLocked
      
      - PageSwapCache() is renamed to page_swap_cache(), so it doesn't
        pretend to be a page->flags bit test.
    • [PATCH] writeback from address spaces · 090da372
      Andrew Morton authored
      [ I reversed the order in which writeback walks the superblock's
        dirty inodes.  It sped up dbench's unlink phase greatly.  I'm
        such a sleaze ]
      
      The core writeback patch.  Switches file writeback from the dirty
      buffer LRU over to address_space.dirty_pages.
      
      - The buffer LRU is removed
      
      - The buffer hash is removed (uses blockdev pagecache lookups)
      
      - The bdflush and kupdate functions are implemented against
        address_spaces, via pdflush.
      
      - The relationship between pages and buffers is changed.
      
        - If a page has dirty buffers, it is marked dirty
        - If a page is marked dirty, it *may* have dirty buffers.
        - A dirty page may be "partially dirty".  block_write_full_page
          discovers this.
      
      - A bunch of consistency checks of the form
      
      	if (!something_which_should_be_true())
      		buffer_error();
      
        have been introduced.  These fog the code up but are important for
        ensuring that the new buffer/page code is working correctly.
      
      - New locking (inode.i_bufferlist_lock) is introduced for exclusion
        from try_to_free_buffers().  This is needed because set_page_dirty
        is called under spinlock, so it cannot lock the page.  But it
        needs access to page->buffers to set them all dirty.
      
        i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
      
      - fs/inode.c has been split: all the code related to file data writeback
        has been moved into fs/fs-writeback.c
      
      - Code related to file data writeback at the address_space level is in
        the new mm/page-writeback.c
      
      - try_to_free_buffers() is now non-blocking
      
      - Switches vmscan.c over to understand that all pages with dirty data
        are now marked dirty.
      
      - Introduces a new a_op for VM writeback:
      
      	->vm_writeback(struct page *page, int *nr_to_write)
      
        this is a bit half-baked at present.  The intent is that the address_space
        is given the opportunity to perform clustered writeback.  To allow it to
    opportunistically write out disk-contiguous dirty data which may be in
    other zones.
        To allow delayed-allocate filesystems to get good disk layout.
      
      - Added address_space.io_pages.  Pages which are being prepared for
        writeback.  This is here for two reasons:
      
        1: It will be needed later, when BIOs are assembled direct
           against pagecache, bypassing the buffer layer.  It avoids a
           deadlock which would occur if someone moved the page back onto the
           dirty_pages list after it was added to the BIO, but before it was
           submitted.  (hmm.  This may not be a problem with PG_writeback logic).
      
        2: Avoids a livelock which would occur if some other thread is continually
           redirtying pages.
      
      - There are two known performance problems in this code:
      
        1: Pages which are locked for writeback cause undesirable
           blocking when they are being overwritten.  A patch which leaves
           pages unlocked during writeback comes later in the series.
      
        2: While inodes are under writeback, they are locked.  This
           causes namespace lookups against the file to get unnecessarily
           blocked in wait_on_inode().  This is a fairly minor problem.
      
           I don't have a fix for this at present - I'll fix this when I
           attach dirty address_spaces direct to super_blocks.
      
      - The patch vastly increases the amount of dirty data which the
        kernel permits highmem machines to maintain.  This is because the
        balancing decisions are made against the amount of memory in the
        machine, not against the amount of buffercache-allocatable memory.
      
        This may be very wrong, although it works fine for me (2.5 gigs).
      
        We can trivially go back to the old-style throttling with
        s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
        balance_dirty_pages().  But better would be to allow blockdev
        mappings to use highmem (I'm thinking about this one, slowly).  And
        to move writer-throttling and writeback decisions into the VM (modulo
        the file-overwriting problem).
      
      - Drops 24 bytes from struct buffer_head.  More to come.
      
      - There's some gunk like super_block.flags:MS_FLUSHING which needs to
        be killed.  Need a better way of providing collision avoidance
        between pdflush threads, to prevent more than one pdflush thread
        working a disk at the same time.
      
        The correct way to do that is to put a flag in the request queue to
        say "there's a pdlfush thread working this disk".  This is easy to
        do: just generalise the "ra_pages" pointer to point at a struct which
        includes ra_pages and the new collision-avoidance flag.
    • [PATCH] page accounting · d878155c
      Andrew Morton authored
      This patch provides global accounting of locked and dirty pages.  It
      does this via lightweight per-CPU data structures.  The page_cache_size
      accounting has been changed to use this facility as well.
      
      Locked and dirty page accounting is needed for making writeback and
      throttling decisions.
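
      A sketch of the per-CPU scheme (names assumed, not necessarily the
      patch's): each CPU counts locally without locks, and readers sum
      across CPUs when a global figure is needed.

      	struct page_state {
      		unsigned long nr_dirty;
      		unsigned long nr_locked;
      		unsigned long nr_pagecache;
      	} ____cacheline_aligned;

      	static struct page_state page_states[NR_CPUS];

      	#define mod_page_state(member, delta)				\
      		(page_states[smp_processor_id()].member += (delta))

      	static unsigned long total_dirty_pages(void)
      	{
      		unsigned long total = 0;
      		int cpu;

      		for (cpu = 0; cpu < NR_CPUS; cpu++)
      			total += page_states[cpu].nr_dirty;
      		return total;
      	}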
      
      The patch also starts to move code which is related to page->flags
      out of linux/mm.h and into linux/page-flags.h
  14. Apr 15, 2002
    • [PATCH] don't allocate ratnodes under PF_MEMALLOC · 49c7ca7c
      Andrew Morton authored
      On the swap_out() path, the radix-tree pagecache is allocating its
      nodes with PF_MEMALLOC set, which allows it to completely exhaust the
      free page lists(*).  This is fairly easy to trigger with swap-intensive
      loads.
      
      It would be better to make those node allocations fail at an earlier
      time.  When this happens, the radix-tree can still obtain nodes from its
      mempool, and we leave some memory available for the I/O layer.
      (Assuming that the I/O is being performed under PF_MEMALLOC, which it
      is).
      
      So the patch simply drops PF_MEMALLOC while adding nodes to the
      swapcache's tree.
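
      The drop-and-restore is just a few lines (a sketch of the idea):

      	unsigned long flags = current->flags;

      	current->flags &= ~PF_MEMALLOC;
      	error = add_to_swap_cache(page, entry);	/* may allocate a ratnode */
      	current->flags = flags;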
      
      We're still performing atomic allocations, so the rat is still biting
      pretty deeply into the page reserves - under heavy load the amount of
      free memory is less than half of what it was pre-rat.
      
      It is unfortunate that the page allocator overloads !__GFP_WAIT to also
      mean "try harder".  It would be better to separate these concepts, and
      to allow the radix-tree code (at least) to perform atomic allocations,
      but to not go below pages_min.  It seems that __GFP_TRY_HARDER will be
      pretty straightforward to implement.  Later.
      
      The patch also implements a workaround for the mempool list_head
      problem, until that is sorted out.
      
      
      
      (*) The usual result is that the SCSI layer dies at scsi_merge.c:82.
      It would be nice to have a fix for that - it's going BUG if 1-order
      allocations fail at interrupt time.  That happens pretty easily.
  15. Apr 10, 2002
    • [PATCH] page->buffers abstraction · 9855b4a1
      Andrew Morton authored
      page->buffers is a bit of a layering violation.  Not all address_spaces
      have pages which are backed by buffers.
      
      The exclusive use of page->buffers for buffers means that a piece of
      prime real estate in struct page is unavailable to other forms of
      address_space.
      
      This patch turns page->buffers into `unsigned long page->private' and
      sets in place all the infrastructure which is needed to allow other
      address_spaces to use this storage.
      
      This change allows the multipage-bio-writeout patches to use
      page->private to cache the results of an earlier get_block(), so
      repeated calls into the filesystem are not needed in the case of file
      overwriting.
      
      Developers should think carefully before calling try_to_free_buffers()
      or block_flushpage() or writeout_one_page() or waitfor_one_page()
      against a page.  It's only legal to do this if you *know* that the page
      is buffer-backed.  And only the address_space knows that.
      Arguably, we need new a_ops for writeout_one_page() and
      waitfor_one_page().  But I have more patches on the boil which
      obsolete these functions in favour of ->writepage() and wait_on_page().
      
      The new PG_private page bit is used to indicate that there
      is something at page->private.  The core kernel does not
      know what that object actually is, just that it's there.
      The kernel must call a_ops->releasepage() to try to make
      page->private go away.  And a_ops->flushpage() at truncate
      time.
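
      A sketch of how buffer users are expected to get at the storage now
      (the accessor name is assumed, for illustration):

      	/* only valid when the address_space put buffers there */
      	static inline struct buffer_head *page_buffers(struct page *page)
      	{
      		BUG_ON(!test_bit(PG_private, &page->flags));
      		return (struct buffer_head *)page->private;
      	}
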
    • [PATCH] Velikov/Hellwig radix-tree pagecache · 3d30a6cc
      Andrew Morton authored
      Before the mempool was added, the VM was getting many, many
      0-order allocation failures due to the atomic ratnode
      allocations inside swap_out.  That monster mempool is
      doing its job - drove a 256meg machine a gigabyte into
      swap with no ratnode allocation failures at all.
      
      So we do need to trim that pool a bit, and also handle
      the case where swap_out fails, and not just keep
      pointlessly calling it.
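
      The pool behind this is the usual mempool_create() arrangement
      (a sketch; the "monster" reserve below is the number to trim):

      	static kmem_cache_t *ratnode_cachep;	/* slab of radix-tree nodes */
      	static mempool_t *ratnode_pool;

      	static void *ratnode_alloc(int gfp_mask, void *pool_data)
      	{
      		return kmem_cache_alloc(ratnode_cachep, gfp_mask);
      	}

      	static void ratnode_free(void *node, void *pool_data)
      	{
      		kmem_cache_free(ratnode_cachep, node);
      	}

      	void __init radix_tree_init(void)
      	{
      		ratnode_cachep = kmem_cache_create("radix_tree_node",
      				sizeof(struct radix_tree_node), 0, 0, NULL, NULL);
      		ratnode_pool = mempool_create(512, ratnode_alloc,
      						ratnode_free, NULL);
      	}
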
  19. Feb 05, 2002
    • v2.5.2.1 -> v2.5.2.1.1 · 468e6d17
      Linus Torvalds authored
      - David Howells: abstract out "current->need_resched" as "need_resched()"
      - Frank Davis: ide-tape update for bio
      - various: header file fixups
      - Jens Axboe: fix up bio/ide/highmem issues
      - Kai Germaschewski: ISDN update
      - Tim Waugh: parport update
      - Patrik Mochel: initcall update
      - Greg KH: USB and Compaq PCI hotplug updates
    • v2.5.1.3 -> v2.5.1.4 · d0415686
      Linus Torvalds authored
      - Jens Axboe: more bio updates, fix some request list bogosity under load
      - Al Viro: export seq_xxx functions
      - Manfred Spraul: include file cleanups, pc110pad compile fix
      - David Woodhouse: fix JFFS2 write error handling
      - Dave Jones: start merging up with 2.4.x patches
      - Manfred Spraul: coredump fixes, FS event counter cleanups
      - me: fix SCSI CD-ROM sectorsize BIO breakage
    • v2.4.14.1 -> v2.4.14.2 · a8a2069f
      Linus Torvalds authored
        - Ivan Kokshaysky: fix alpha dec_and_lock with modules, for alpha config entry
        - Kai Germaschewski: ISDN updates
        - Jeff Garzik: network driver updates, sysv fs update
        - Kai Mäkisara: SCSI tape update
        - Alan Cox: large drivers merge
        - Nikita Danilov: reiserfs procfs information
        - Andrew Morton: ext3 merge
        - Christoph Hellwig: vxfs livelock fix
        - Trond Myklebust: NFS updates
        - Jens Axboe: cpqarray + cciss dequeue fix
        - Tim Waugh: parport_serial base_baud setting
        - Matthew Dharm: usb-storage Freecom driver fixes
        - Dave McCracken: wait4() thread group race fix
    • v2.4.14 -> v2.4.14.1 · 5db5272c
      Linus Torvalds authored
        - me: fix page flags race condition Andrea found
        - David Miller: sparc and network updates
        - various: fix loop driver that thought it was part of the VM system
        - me: teach DRM about VM_RESERVED
        - Alan Cox: more merging
    • v2.4.13.8 -> v2.4.14 · aad40ef3
      Linus Torvalds authored
        - David Miller: sparc/scsi scatterlist fixes
        - Martin Mares: PCI ids, email address update
        - David Miller: revert TCP hash optimizations that need more checking
        - Ivan Kokshaysky/Richard Henderson: alpha update (atomic_dec_and_lock etc)
        - Peter Anvin: cramfs/zisofs missing pieces
    • v2.4.13.7 -> v2.4.13.8 · 3ea86172
      Linus Torvalds authored
        - Andrea: fix races in do_wp_page, free_swap_and_cache
        - me: clean up page dirty handling
        - Tim Waugh: parport IRQ probing and documentation fixes
        - Greg KH: USB updates
        - Michael Warfield: computone driver update
        - Randy Dunlap: add knowledge about some new io-apics
        - Richard Henderson: alpha updates
        - Trond Myklebust: make readdir xdr verify the reply packet
        - Paul Mackerras: PPC update
        - Jens Axboe: make cpqarray and cciss play nice with the request layer
        - Massimo Dal Zotto: SMM driver for Dell Inspiron 8000
        - Richard Gooch: devfs symlink deadlock fix
        - Anton Altaparmakov: make NTFS compile on sparc
    • v2.4.13.6 -> v2.4.13.7 · 595cf06f
      Linus Torvalds authored
        - me: reinstate "delete swap cache on low swap" code
        - David Miller: ksoftirqd startup race fix
        - Hugh Dickins: make tmpfs free swap cache entries proactively
    • v2.4.13.5 -> v2.4.13.6 · 857805c6
      Linus Torvalds authored
        - me: remember to bump the version number ;)
        - Hugh Dickins: export "free_lru_page()" for modules
        - Jeff Garzik: don't change nopage arguments, just make the last a dummy one
        - David Miller: sparc and net updates (netfilter, VLAN etc)
        - Nikita Danilov: reiserfs cleanups
        - Jan Kara: quota initialization race
        - Tigran Aivazian: make the x86 microcode update driver happy about
        hyperthreaded P4's
        - me: shrink dcache/icache more aggressively
        - me: fix up oom-killer so that it actually works
    • v2.4.13.3 -> v2.4.13.4 · f97f22cb
      Linus Torvalds authored
        - Mikael Pettersson: fix P4 boot with APIC enabled
        - me: fix device queuing thinko, clean up VM locking
    • v2.4.13.2 -> v2.4.13.3 · ff35c838
      Linus Torvalds authored
        - René Scharfe: random bugfix
        - me: block device queuing low-water-marks, VM mapped tweaking.
    • v2.4.13 -> v2.4.13.1 · 980adcb2
      Linus Torvalds authored
        - Michael Warfield: computone serial driver update
        - Alexander Viro: cdrom module race fixes
        - David Miller: Acenic driver fix
        - Andrew Grover: ACPI update
        - Kai Germaschewski: ISDN update
        - Tim Waugh: parport update
        - David Woodhouse: JFFS garbage collect sleep
    • v2.4.12.6 -> v2.4.13 · 9ff086a3
      Linus Torvalds authored
        - page write-out throttling
        - Pete Zaitcev: ymfpci sound driver update (make Civ:CTP happy with it)
        - Alan Cox: i2o sync-up
        - Andrea Arcangeli: revert broken x86 smp_call_function patch
        - me: handle VM write load more gracefully. Merge parts of -aa VM
    • v2.4.12.4 -> v2.4.12.5 · 2ef7e8ce
      Linus Torvalds authored
        - Greg KH: usbnet fix
        - Johannes Erdfelt: uhci.c bulk queueing fixes