1. 30 Jan, 2014 1 commit
  2. 13 Nov, 2013 2 commits
  3. 11 Sep, 2013 1 commit
    • readahead: make context readahead more conservative · 2cad4018
      Fengguang Wu authored
      
      This helps performance on moderately dense random reads on SSD.
      
      Transaction-Per-Second numbers provided by Taobao:
      
      		QPS	case
      		-------------------------------------------------------
      		7536	disable context readahead totally
      w/ patch:	7129	slower size rampup and start RA on the 3rd read
      		6717	slower size rampup
      w/o patch:	5581	unmodified context readahead
      
      Before, readahead would be started whenever page N+1 was read shortly
      after page N.  After the patch, readahead only starts when *three*
      consecutive reads happen to access pages N, N+1, N+2.  The probability
      of this happening is extremely low for purely random reads, unless they
      are very dense, in which case some readahead is actually deserved.
      
      Also start with a smaller readahead window.  The impact on interleaved
      sequential reads should be small, because for a long-running stream the
      small readahead window ramp-up phase is negligible.
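
      A minimal sketch of the new trigger in C, assuming the helper and field
      names of that era's mm/readahead.c (the exact check and thresholds are
      given from memory and are illustrative, not the verbatim patch):

      	size = count_history_pages(mapping, offset, max);

      	/*
      	 * Before: if (!size) return 0; -- one cached neighbour (page N)
      	 * was enough to start readahead on page N+1.
      	 * After: require more history than the current request, so a
      	 * 1-page read only triggers after N, N+1, N+2 are all accessed.
      	 */
      	if (size <= req_size)
      		return 0;

      	ra->start = offset;
      	/* smaller initial window: no aggressive size rampup */
      	ra->size = min(size + req_size, max);
      	ra->async_size = 1;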
      
      Context readahead actually benefits clustered random reads on HDDs,
      whose seek cost is pretty high.  However, as SSDs are increasingly used
      for random read workloads, it's better for context readahead to
      concentrate on interleaved sequential reads.
      
      Another SSD random read test from Miao:
      
              # file size:        2GB
              # read IO amount: 625MB
              sysbench --test=fileio          \
                      --max-requests=10000    \
                      --num-threads=1         \
                      --file-num=1            \
                      --file-block-size=64K   \
                      --file-test-mode=rndrd  \
                      --file-fsync-freq=0     \
                      --file-fsync-end=off    run
      
      shows btrfs throughput improving from 69MB/s to 121MB/s, and ext4 from
      104MB/s to 121MB/s.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Tested-by: Tao Ma <tm@tao.ma>
      Tested-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 22 May, 2013 1 commit
    • mm: change invalidatepage prototype to accept length · d47992f8
      Lukas Czerner authored
      
      Currently there is no way to truncate a partial page where the end
      truncation point is not at the end of the page.  This was never needed
      before, since the existing functionality was enough for file system
      truncate operations to work properly.  However, more file systems now
      support the punch hole feature, which can benefit from mm supporting
      truncation of a page just up to a certain point.
      
      Specifically, with this functionality truncate_inode_pages_range() can
      be changed so that it supports truncating a partial page at the end of
      the range (currently it will BUG_ON() if 'end' is not at the end of a
      page).
      
      This commit changes the invalidatepage() address space operation
      prototype to accept the range to be invalidated, and updates all of its
      instances accordingly.

      We also change block_invalidatepage() in the same way and actually make
      use of the new length argument to implement range invalidation.
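
      A rough sketch of the updated operation and its use of the length
      argument (simplified from the fs/buffer.c buffer walk; discard_buffer()
      is that file's internal helper, and error handling is omitted):

      	void block_invalidatepage(struct page *page, unsigned int offset,
      				  unsigned int length)
      	{
      		struct buffer_head *head, *bh, *next;
      		unsigned int curr_off = 0;
      		unsigned int stop = offset + length;	/* new: end of range */

      		if (!page_has_buffers(page))
      			return;

      		head = bh = page_buffers(page);
      		do {
      			unsigned int next_off = curr_off + bh->b_size;
      			next = bh->b_this_page;

      			/* new: stop once past the invalidation range */
      			if (next_off > stop)
      				break;

      			/* drop buffers that begin within the range */
      			if (offset <= curr_off)
      				discard_buffer(bh);
      			curr_off = next_off;
      			bh = next;
      		} while (bh != head);
      	}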
      
      Actual file system implementations will follow, except for those file
      systems where the change is really simple and should not alter
      behaviour in any way.  An implementation of truncate_page_range() able
      to accept page-unaligned ranges will follow as well.
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
  5. 04 Mar, 2013 1 commit
  6. 27 Sep, 2012 2 commits
  7. 29 May, 2012 1 commit
  8. 31 Oct, 2011 1 commit
  9. 25 May, 2011 1 commit
  10. 10 Mar, 2011 2 commits
  11. 25 May, 2010 1 commit
  12. 07 Apr, 2010 1 commit
  13. 30 Mar, 2010 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Tejun Heo authored
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      The percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming their availability.  As this
      conversion needs to touch a large number of source files, the following
      script is used as the basis of the conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there, i.e. if only gfp is used,
        gfp.h; if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        bloc...
  14. 06 Mar, 2010 1 commit
    • readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM · 0141450f
      Wu Fengguang authored
      This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
      
      POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
      a 16K read will be carried out in 4 _sync_ 1-page reads.
      
      In other places, ra_pages==0 means
      - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
      - some IO error happened
      where multi-page read IO won't help or should be avoided.
      
      POSIX_FADV_RANDOM actually wants different semantics: disable the
      *heuristic* readahead algorithm and use a dumb one that faithfully
      submits read IO for whatever the application requests.
      
      So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
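
      For reference, a minimal userspace program exercising the hint (the
      file name is hypothetical; posix_fadvise() is the standard interface
      that sets FMODE_RANDOM after this patch):

      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>

      	int main(void)
      	{
      		int fd = open("testfile", O_RDONLY);	/* hypothetical file */
      		if (fd < 0) {
      			perror("open");
      			return 1;
      		}

      		/* disable heuristic readahead for this fd */
      		if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM))
      			perror("posix_fadvise");

      		/* a 16K read is now one read IO, not 4 sync 1-page reads */
      		char buf[16384];
      		ssize_t n = pread(fd, buf, sizeof(buf), 0);
      		printf("read %zd bytes\n", n);

      		close(fd);
      		return 0;
      	}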
      
      Note that the random hint is not likely to help random read performance
      noticeably.  And it may be too permissive for huge request sizes (its
      IO size is not limited by read_ahead_kb).
      
      In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
      (NFS read) performance of the application increased by 313%!
      Tested-by: Quentin Barnes <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@kernel.org>			[2.6.33.x]
      Cc: <qbarnes+nfs@yahoo-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 17 Dec, 2009 1 commit
    • readahead: add blk_run_backing_dev · 65a80b4c
      Hisashi Hifumi authored
      I added blk_run_backing_dev in page_cache_async_readahead so readahead
      I/O is unplugged, to improve throughput especially on RAID environments.
      
      The normal case is: if page N becomes uptodate at time T(N), then
      T(N) <= T(N+1) holds.  With RAID (and NFS to some degree) there is no
      strict ordering; the data arrival time depends on the runtime status of
      the individual disks, which breaks that formula.  So in
      do_generic_file_read(), just after submitting the async readahead IO
      request, the current page may well be uptodate, so the page won't be
      locked, and the block device won't be implicitly unplugged:
      
      	if (PageReadahead(page))
      		page_cache_async_readahead();
      	if (!PageUptodate(page))
      		goto page_not_up_to_date;
      	/* ... */
      page_not_up_to_date:
      	lock_page_killable(page);
      
      Therefore explicit unplugging can help.
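
      The change itself is small; roughly (a sketch against the
      mm/readahead.c of that era, with the unrelated early bailouts elided):

      	void page_cache_async_readahead(struct address_space *mapping,
      					struct file_ra_state *ra,
      					struct file *filp, struct page *page,
      					pgoff_t offset, unsigned long req_size)
      	{
      		/* ... early bailouts elided ... */

      		/* do read-ahead */
      		ondemand_readahead(mapping, ra, filp, true, offset, req_size);

      	#ifdef CONFIG_BLOCK
      		/*
      		 * If the current page is already uptodate, lock_page() is
      		 * never called and the device is never implicitly
      		 * unplugged, so kick the queued IO off explicitly.
      		 */
      		blk_run_backing_dev(mapping->backing_dev_info, NULL);
      	#endif
      	}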
      
      Following are the test results with dd:
      
      #dd if=testdir/testfile of=/dev/null bs=16384
      
      -2.6.30-rc6
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 224.182 seconds, 76.6 MB/s
      
      -2.6.30-rc6-patched
      1048576+0 records in
      1048576+0 records out
      17179869184 bytes (17 GB) copied, 206.465 seconds, 83.2 MB/s
      
      (7Disks RAID-0 Array)
      
      -2.6.30-rc6
      1054976+0 records in
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 212.233 seconds, 81.4 MB/s
      
      -2.6.30-rc6-patched
      1054976+0 records in
      1054976+0 records out
      17284726784 bytes (17 GB) copied, 198.878 seconds, 86.9 MB/s
      
      (7Disks RAID-5 Array)
      
      The patch was found to improve performance with the SCST SCSI target
      driver.  See
      http://sourceforge.net/mailarchive/forum.php?thread_name=a0272b440906030714g67eabc5k8f847fb1e538cc62%40mail.gmail.com&forum_name=scst-devel

      [akpm@linux-foundation.org: unbust comment layout]
      [akpm@linux-foundation.org: "fix" CONFIG_BLOCK=n]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Acked-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: Ronald <intercommit@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@gmail.com>
      Cc: Vladislav Bolkhovitin <vst@vlnb.net>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 17 Jun, 2009 8 commits
  17. 03 Apr, 2009 2 commits
  18. 26 Mar, 2009 1 commit
  19. 20 Oct, 2008 1 commit
    • vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Rik van Riel authored
      
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.
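
      A sketch of the resulting list structure (simplified from the enum this
      series introduces; the kernel's actual definition composes the values
      from LRU_FILE/LRU_ACTIVE bits):

      	enum lru_list {
      		LRU_INACTIVE_ANON,	/* anon/swap-backed, incl. tmpfs */
      		LRU_ACTIVE_ANON,
      		LRU_INACTIVE_FILE,	/* backed by real file systems */
      		LRU_ACTIVE_FILE,
      		NR_LRU_LISTS
      	};

      	#define for_each_lru(l) for (l = 0; l < NR_LRU_LISTS; l++)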
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 16 Oct, 2008 1 commit
  21. 26 Jul, 2008 1 commit
  22. 30 Apr, 2008 1 commit
  23. 20 Mar, 2008 1 commit
  24. 17 Oct, 2007 1 commit
  25. 16 Oct, 2007 5 commits
    • mm: buffered write cleanup · eb2be189
      Nick Piggin authored
      
      Quite a bit of code is used in maintaining these "cached pages" that
      are pretty unlikely to ever get used.  Creating the spare page requires
      a narrow race where the page is inserted concurrently while this
      process is allocating a page; a multi-page write into an uncached part
      of the file would then be needed to make use of it.
      
      Next, the buffered write path (and others) uses its own LRU pagevec
      when it should just use the per-CPU LRU pagevec (which cuts down on
      both data and code size cacheline footprint).  Also, these private LRU
      pagevecs are emptied after just a very short time, in contrast to the
      per-CPU pagevecs, which are persistent.  Net result: 7.3 times fewer
      lru_lock acquisitions are required to add the pages to pagecache for a
      bulk write (in 4K chunks).
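
      The pagevec change, roughly (a sketch of the pattern; lru_cache_add()
      is the long-standing per-CPU helper, and the "before" fragment is a
      simplified rendering of the private-pagevec code being removed):

      	/* before: a private pagevec, drained after a handful of pages */
      	page_cache_get(page);
      	if (!pagevec_add(&lru_pvec, page))
      		__pagevec_lru_add(&lru_pvec);

      	/* after: the persistent per-CPU LRU pagevec via the helper */
      	lru_cache_add(page);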
      
      [this gets rid of some cond_resched() calls in readahead.c and mpage.c due
       to clashes in -mm. What put them there, and why? ]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use lockless radix-tree probe · 00128188
      Nick Piggin authored
      
      Probing pages and radix_tree_tagged are lockless operations with the lockless
      radix-tree.  Convert these users to RCU locking rather than using tree_lock.
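
      The conversion pattern, roughly (a sketch using the page_tree and
      tree_lock names of that era):

      	/* before: read-side probes took the mapping's tree_lock */
      	read_lock_irq(&mapping->tree_lock);
      	page = radix_tree_lookup(&mapping->page_tree, index);
      	read_unlock_irq(&mapping->tree_lock);

      	/* after: the lockless radix tree only needs an RCU section */
      	rcu_read_lock();
      	page = radix_tree_lookup(&mapping->page_tree, index);
      	rcu_read_unlock();
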
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: remove several readahead macros · 535443f5
      Fengguang Wu authored
      
      Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES.
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: basic support of interleaved reads · 6b10c6c9
      Fengguang Wu authored
      
      This is a simplified version of the pagecache context based readahead.
      It handles the case of multiple threads reading on the same fd and
      invalidating each other's readahead state.  It does the trick by
      scanning the pagecache and recovering the current read stream's
      readahead status.
      
      The algorithm works in an opportunistic way: it does not try to detect
      interleaved reads _actively_, which would require a probe into the page
      cache (meaning a little more overhead for random reads).  It only tries
      to handle a previously started sequential readahead whose state was
      overwritten by another concurrent stream, and it can do this job pretty
      well.
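
      A sketch of the recovery step, assuming the ondemand readahead
      structure of that era's mm/readahead.c (helper names such as
      radix_tree_next_hole() and get_next_ra_size() are given from memory and
      may not match the patch exactly):

      	if (hit_readahead_marker) {
      		pgoff_t start;

      		/* probe the pagecache for the end of the cached run */
      		rcu_read_lock();
      		start = radix_tree_next_hole(&mapping->page_tree,
      					     offset + 1, max + 1);
      		rcu_read_unlock();

      		if (!start || start - offset > max)
      			return 0;	/* no recoverable stream */

      		ra->start = start;
      		ra->size = start - offset;	/* old async_size */
      		ra->size = get_next_ra_size(ra, max);
      		ra->async_size = ra->size;
      		goto readit;
      	}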
      
      Negative and positive examples (or what you can expect from it):
      
      1) it cannot correctly detect and serve perfect request-by-request
         interleaved reads:
      	time	stream 1  stream 2
      	0 	1
      	1 	          1001
      	2 	2
      	3 	          1002
      	4 	3
      	5 	          1003
      	6 	4
      	7 	          1004
      	8 	5
      	9	          1005
      
      Here no single readahead will be carried out.
      
      2) However, if there are two concurrent reads by two threads, the
         chance that the initial sequential readahead gets started is huge.
         Once the first sequential readahead is started for a stream, this
         patch ensures that its readahead window continues to ramp up without
         being disturbed by other streams.
      
      	time	stream 1  stream 2
      	0 	1
      	1 	2
      	2 	          1001
      	3 	3
      	4 	          1002
      	5 	          1003
      	6 	4
      	7 	5
      	8 	          1004
      	9 	6
      	10	          1005
      	11	7
      	12	          1006
      	13	          1007
      
      Here stream 1 will start a readahead at page 2, and stream 2 will start
      its first readahead at page 1003.  From then on the two streams will be
      served correctly.
      
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • readahead: combine file_ra_state.prev_index/prev_offset into prev_pos · f4e6b498
      Fengguang Wu authored
      
      Combine the file_ra_state members
      				unsigned long prev_index
      				unsigned int prev_offset
      into
      				loff_t prev_pos
      
      It is more consistent and better supports huge files.
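
      The equivalence, as a sketch (PAGE_CACHE_SHIFT-era naming):

      	/* folding the two old fields into one 64-bit byte position */
      	loff_t prev_pos = ((loff_t)prev_index << PAGE_CACHE_SHIFT)
      						+ prev_offset;

      	/* recovering them, e.g. for the sequential-read check */
      	pgoff_t index = prev_pos >> PAGE_CACHE_SHIFT;
      	unsigned int offset = prev_pos & (PAGE_CACHE_SIZE - 1);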
      
      Thanks to Peter for the nice proposal!
      
      [akpm@linux-foundation.org: fix shift overflow]
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>