1. 31 Jan, 2014 1 commit
    • zsmalloc: move it under mm · bcf1647d
      Minchan Kim authored
      
      This patch moves zsmalloc under mm directory.
      
      Before getting to the move itself, this description explains why we
      needed a custom allocator.
      
      Zsmalloc is a new slab-based memory allocator for storing compressed
      pages.  It is designed for low fragmentation and a high allocation
      success rate for large, but <= PAGE_SIZE, allocations.
      
      zsmalloc differs from the kernel slab allocator in two primary ways to
      achieve these design goals.
      
      zsmalloc never requires high-order page allocations to back slabs, or
      "size classes" in zsmalloc terms.  Instead, it allows multiple
      single-order pages to be stitched together into a "zspage" which backs
      the slab.  This allows for a higher allocation success rate under
      memory pressure.
      
      Also, zsmalloc allows objects to span page boundaries within the
      zspage.  This allows for lower fragmentation than could be had with the
      kernel slab allocator for objects between PAGE_SIZE/2 and PAGE_SIZE.
      With the kernel slab allocator, if a page compresses to 60% of its
      original size, the memory savings gained through compression are lost
      to fragmentation, because another object of the same size can't be
      stored in the leftover space.
      
      This ability to span pages means that zsmalloc allocations are not
      directly addressable by the user.  The user is given a
      non-dereferenceable handle in response to an allocation request.  That
      handle must be mapped, using zs_map_object(), which returns a pointer
      to the mapped region that can be used.  The mapping is necessary since
      the object data may reside in two different noncontiguous pages.
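
      A minimal usage sketch of this handle-and-map flow (hedged: names
      follow the zsmalloc API of this era, the pool is assumed to come from
      zs_create_pool(), and the compressed length is illustrative):

        #include <linux/gfp.h>
        #include <linux/string.h>
        #include <linux/zsmalloc.h>

        /* Sketch only: allocate a handle, map it, copy data in, unmap. */
        static int store_compressed(struct zs_pool *pool, void *src, size_t clen)
        {
                unsigned long handle;
                void *dst;

                handle = zs_malloc(pool, clen);  /* opaque, non-dereferenceable */
                if (!handle)
                        return -ENOMEM;

                dst = zs_map_object(pool, handle, ZS_MM_WO);  /* map before use */
                memcpy(dst, src, clen);
                zs_unmap_object(pool, handle);   /* keep the mapping short-lived */
                return 0;
        }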
      
      zsmalloc fulfills the allocation needs of zram perfectly.
      
      [sjenning@linux.vnet.ibm.com: borrow Seth's quote]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 19 Dec, 2013 1 commit
  3. 15 Nov, 2013 3 commits
    • mm: dynamically allocate page->ptl if it cannot be embedded to struct page · 49076ec2
      Kirill A. Shutemov authored
      
      If the split page table lock is in use, we embed the lock into the
      struct page of the table's page.  We have to disable the split lock if
      spinlock_t is too big to be embedded, as when DEBUG_SPINLOCK or
      DEBUG_LOCK_ALLOC is enabled.
      
      This patch adds support for dynamically allocating the split page table
      lock when we can't embed it into struct page.
      
      page->ptl is now an unsigned long; we use it as a spinlock_t if
      sizeof(spinlock_t) <= sizeof(long), and otherwise as a pointer to a
      spinlock_t.

      The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and
      in pgtable_pmd_page_ctor() for PMD tables.  All other helpers are
      converted to support the dynamically allocated page->ptl.
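
      A hedged sketch of the resulting either/or dispatch (the real kernel
      code selects between the two cases at compile time; this runtime
      helper is purely illustrative):

        #include <linux/mm_types.h>
        #include <linux/spinlock.h>

        /* Sketch only: page->ptl is either the lock itself or a pointer to it. */
        static inline spinlock_t *ptl_of(struct page *page)
        {
                if (sizeof(spinlock_t) <= sizeof(page->ptl))
                        return (spinlock_t *)&page->ptl;  /* embedded in struct page */
                return (spinlock_t *)page->ptl;           /* dynamically allocated */
        }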
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: implement split page table lock for PMD level · e009bb30
      Kirill A. Shutemov authored
      
      The basic idea is the same as at the PTE level: the lock is embedded
      into the struct page of the table's page.

      We can't use mm->pmd_huge_pte to store pgtables for THP, since we don't
      take mm->page_table_lock anymore.  Let's reuse page->lru of the table's
      page for that.
      
      pgtable_pmd_page_ctor() returns true if initialization is successful,
      and false otherwise.  The current implementation never fails, but the
      assumption that the constructor can fail will help port this to -rt,
      where spinlock_t is rather huge and cannot be embedded into struct
      page, so dynamic allocation is required.
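
      A hedged sketch of the caller-side contract this establishes (the
      helper name and GFP flags are illustrative, not the exact patch):

        #include <linux/gfp.h>
        #include <linux/mm.h>

        /* Sketch only: table allocations must now tolerate constructor failure. */
        static struct page *alloc_pmd_table_page(void)
        {
                struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

                if (!page)
                        return NULL;
                if (!pgtable_pmd_page_ctor(page)) {  /* may fail once ptl is dynamic */
                        __free_page(page);
                        return NULL;
                }
                return page;
        }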
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid increase sizeof(struct page) due to split page table lock · e9bb18c7
      Kirill A. Shutemov authored
      
      Alex Thorlton noticed that some massively threaded workloads perform
      poorly if THP is enabled.  This patchset fixes that by introducing a
      split page table lock for PMD tables.  hugetlbfs is not covered yet.
      
      This patchset is based on work by Naoya Horiguchi.
      
      : akpm result summary:
      :
      : THP off, v3.12-rc2: 18.059261877 seconds time elapsed
      : THP off, patched:   16.768027318 seconds time elapsed
      :
      : THP on, v3.12-rc2:  42.162306788 seconds time elapsed
      : THP on, patched:    8.397885779 seconds time elapsed
      :
      : HUGETLB, v3.12-rc2: 47.574936948 seconds time elapsed
      : HUGETLB, patched:   19.447481153 seconds time elapsed
      
      THP off, v3.12-rc2:
      -------------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
          1037072.835207 task-clock                #   57.426 CPUs utilized            ( +-  3.59% )
                  95,093 context-switches          #    0.092 K/sec                    ( +-  3.93% )
                     140 cpu-migrations            #    0.000 K/sec                    ( +-  5.28% )
              10,000,550 page-faults               #    0.010 M/sec                    ( +-  0.00% )
       2,455,210,400,261 cycles                    #    2.367 GHz                      ( +-  3.62% ) [83.33%]
       2,429,281,882,056 stalled-cycles-frontend   #   98.94% frontend cycles idle     ( +-  3.67% ) [83.33%]
       1,975,960,019,659 stalled-cycles-backend    #   80.48% backend  cycles idle     ( +-  3.88% ) [66.68%]
          46,503,296,013 instructions              #    0.02  insns per cycle
                                                   #   52.24  stalled cycles per insn  ( +-  3.21% ) [83.34%]
           9,278,997,542 branches                  #    8.947 M/sec                    ( +-  4.00% ) [83.34%]
              89,881,640 branch-misses             #    0.97% of all branches          ( +-  1.17% ) [83.33%]
      
            18.059261877 seconds time elapsed                                          ( +-  2.65% )
      
      THP on, v3.12-rc2:
      ------------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
          3114745.395974 task-clock                #   73.875 CPUs utilized            ( +-  1.84% )
                 267,356 context-switches          #    0.086 K/sec                    ( +-  1.84% )
                      99 cpu-migrations            #    0.000 K/sec                    ( +-  1.40% )
                  58,313 page-faults               #    0.019 K/sec                    ( +-  0.28% )
       7,416,635,817,510 cycles                    #    2.381 GHz                      ( +-  1.83% ) [83.33%]
       7,342,619,196,993 stalled-cycles-frontend   #   99.00% frontend cycles idle     ( +-  1.88% ) [83.33%]
       6,267,671,641,967 stalled-cycles-backend    #   84.51% backend  cycles idle     ( +-  2.03% ) [66.67%]
         117,819,935,165 instructions              #    0.02  insns per cycle
                                                   #   62.32  stalled cycles per insn  ( +-  4.39% ) [83.34%]
          28,899,314,777 branches                  #    9.278 M/sec                    ( +-  4.48% ) [83.34%]
              71,787,032 branch-misses             #    0.25% of all branches          ( +-  1.03% ) [83.33%]
      
            42.162306788 seconds time elapsed                                          ( +-  1.73% )
      
      HUGETLB, v3.12-rc2:
      -------------------
      
       Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
      
          2588052.787264 task-clock                #   54.400 CPUs utilized            ( +-  3.69% )
                 246,831 context-switches          #    0.095 K/sec                    ( +-  4.15% )
                     138 cpu-migrations            #    0.000 K/sec                    ( +-  5.30% )
                  21,027 page-faults               #    0.008 K/sec                    ( +-  0.01% )
       6,166,666,307,263 cycles                    #    2.383 GHz                      ( +-  3.68% ) [83.33%]
       6,086,008,929,407 stalled-cycles-frontend   #   98.69% frontend cycles idle     ( +-  3.77% ) [83.33%]
       5,087,874,435,481 stalled-cycles-backend    #   82.51% backend  cycles idle     ( +-  4.41% ) [66.67%]
         133,782,831,249 instructions              #    0.02  insns per cycle
                                                   #   45.49  stalled cycles per insn  ( +-  4.30% ) [83.34%]
          34,026,870,541 branches                  #   13.148 M/sec                    ( +-  4.24% ) [83.34%]
              68,670,942 branch-misses             #    0.20% of all branches          ( +-  3.26% ) [83.33%]
      
            47.574936948 seconds time elapsed                                          ( +-  2.09% )
      
      THP off, patched:
      -----------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
           943301.957892 task-clock                #   56.256 CPUs utilized            ( +-  3.01% )
                  86,218 context-switches          #    0.091 K/sec                    ( +-  3.17% )
                     121 cpu-migrations            #    0.000 K/sec                    ( +-  6.64% )
              10,000,551 page-faults               #    0.011 M/sec                    ( +-  0.00% )
       2,230,462,457,654 cycles                    #    2.365 GHz                      ( +-  3.04% ) [83.32%]
       2,204,616,385,805 stalled-cycles-frontend   #   98.84% frontend cycles idle     ( +-  3.09% ) [83.32%]
       1,778,640,046,926 stalled-cycles-backend    #   79.74% backend  cycles idle     ( +-  3.47% ) [66.69%]
          45,995,472,617 instructions              #    0.02  insns per cycle
                                                   #   47.93  stalled cycles per insn  ( +-  2.51% ) [83.34%]
           9,179,700,174 branches                  #    9.731 M/sec                    ( +-  3.04% ) [83.35%]
              89,166,529 branch-misses             #    0.97% of all branches          ( +-  1.45% ) [83.33%]
      
            16.768027318 seconds time elapsed                                          ( +-  2.47% )
      
      THP on, patched:
      ----------------
      
       Performance counter stats for './thp_memscale -c 80 -b 512m' (5 runs):
      
           458793.837905 task-clock                #   54.632 CPUs utilized            ( +-  0.79% )
                  41,831 context-switches          #    0.091 K/sec                    ( +-  0.97% )
                      98 cpu-migrations            #    0.000 K/sec                    ( +-  1.66% )
                  57,829 page-faults               #    0.126 K/sec                    ( +-  0.62% )
       1,077,543,336,716 cycles                    #    2.349 GHz                      ( +-  0.81% ) [83.33%]
       1,067,403,802,964 stalled-cycles-frontend   #   99.06% frontend cycles idle     ( +-  0.87% ) [83.33%]
         864,764,616,143 stalled-cycles-backend    #   80.25% backend  cycles idle     ( +-  0.73% ) [66.68%]
          16,129,177,440 instructions              #    0.01  insns per cycle
                                                   #   66.18  stalled cycles per insn  ( +-  7.94% ) [83.35%]
           3,618,938,569 branches                  #    7.888 M/sec                    ( +-  8.46% ) [83.36%]
              33,242,032 branch-misses             #    0.92% of all branches          ( +-  2.02% ) [83.32%]
      
             8.397885779 seconds time elapsed                                          ( +-  0.18% )
      
      HUGETLB, patched:
      -----------------
      
       Performance counter stats for './thp_memscale_hugetlbfs -c 80 -b 512M' (5 runs):
      
           395353.076837 task-clock                #   20.329 CPUs utilized            ( +-  8.16% )
                  55,730 context-switches          #    0.141 K/sec                    ( +-  5.31% )
                     138 cpu-migrations            #    0.000 K/sec                    ( +-  4.24% )
                  21,027 page-faults               #    0.053 K/sec                    ( +-  0.00% )
         930,219,717,244 cycles                    #    2.353 GHz                      ( +-  8.21% ) [83.32%]
         914,295,694,103 stalled-cycles-frontend   #   98.29% frontend cycles idle     ( +-  8.35% ) [83.33%]
         704,137,950,187 stalled-cycles-backend    #   75.70% backend  cycles idle     ( +-  9.16% ) [66.69%]
          30,541,538,385 instructions              #    0.03  insns per cycle
                                                   #   29.94  stalled cycles per insn  ( +-  3.98% ) [83.35%]
           8,415,376,631 branches                  #   21.286 M/sec                    ( +-  3.61% ) [83.36%]
              32,645,478 branch-misses             #    0.39% of all branches          ( +-  3.41% ) [83.32%]
      
            19.447481153 seconds time elapsed                                          ( +-  2.00% )
      
      This patch (of 11):
      
      CONFIG_GENERIC_LOCKBREAK increases sizeof(spinlock_t) to 8 bytes.  That
      increases sizeof(struct page) by 4 bytes on 32-bit systems if the split
      page table lock is in use, since page->ptl shares space in a union with
      longs and pointers.
      
      Let's disable the split page table lock on 32-bit systems with
      GENERIC_LOCKBREAK enabled.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 13 Nov, 2013 1 commit
    • mem-hotplug: introduce movable_node boot option · c5320926
      Tang Chen authored
      
      The hot-pluggable field in the SRAT specifies which memory is
      hotpluggable.  As we mentioned before, if hotpluggable memory is used
      by the kernel, it cannot be hot-removed.  So memory hotplug users may
      want to place all hotpluggable memory in ZONE_MOVABLE so that the
      kernel won't use it.
      
      Memory hotplug users may also set a node as a movable node, which
      contains only ZONE_MOVABLE, so that the whole node can be hot-removed.

      But the kernel cannot use memory in ZONE_MOVABLE, so configuring nodes
      this way leaves their memory unusable by the kernel.  That degrades
      NUMA performance, and other users may be unhappy.
      
      So we need a way to let users enable and disable this functionality.
      In this patch, we introduce the movable_node boot option, which lets
      users choose not to consume hotpluggable memory at early boot so that
      it can later be placed in ZONE_MOVABLE.

      To achieve this, the movable_node boot option controls the memblock
      allocation direction.  That is, after memblock is ready but before the
      SRAT is parsed, we should allocate memory near the kernel image, as
      explained in the previous patches.  So if the movable_node boot option
      is set, the kernel does the following:

      1. After memblock is ready, make memblock allocate memory bottom up.
      2. After the SRAT is parsed, restore memblock's default behavior,
         allocating memory top down.
      
      Users can specify "movable_node" on the kernel command line to enable
      this functionality.  Those who don't use memory hotplug, or who don't
      want to lose NUMA performance, can simply not specify anything; the
      kernel will work as before.
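
      A hedged sketch of the boot-option plumbing this implies (close in
      spirit to the patch; treat the exact body as illustrative):

        #include <linux/init.h>
        #include <linux/memblock.h>

        /* Sketch only: "movable_node" flips memblock to bottom-up allocation;
         * SRAT parsing later restores the default top-down direction. */
        static int __init cmdline_parse_movable_node(char *p)
        {
                memblock_set_bottom_up(true);
                return 0;
        }
        early_param("movable_node", cmdline_parse_movable_node);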
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 14 Oct, 2013 1 commit
  6. 03 Oct, 2013 1 commit
    • powerpc: Fix memory hotplug with sparse vmemmap · f7e3334a
      Nathan Fontenot authored
      Previous commit 46723bfa ("memory-hotplug: implement
      register_page_bootmem_info_section of sparse-vmemmap") introduced a new
      config option, HAVE_BOOTMEM_INFO_NODE, that ended up breaking memory
      hot-remove for ppc when sparse vmemmap is not defined.
      
      This patch defines HAVE_BOOTMEM_INFO_NODE for ppc and adds the call to
      register_page_bootmem_info_node().  Without this we hit a BUG_ON in
      put_page_bootmem() on memory hot-remove.

      It also adds a stub for register_page_bootmem_memmap() to allow ppc to
      build with sparse vmemmap defined.  Leaving this as a stub is fine,
      since the same vmemmap addresses are also handled in vmemmap_populate()
      and as such are properly mapped.
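
      A hedged sketch of such a stub (the signature mirrors the
      register_page_bootmem_memmap() hook of this era; treat it as
      illustrative):

        #include <linux/mm.h>

        /* Sketch only: the vmemmap backing is already mapped in
         * vmemmap_populate(), so nothing extra needs recording here. */
        void register_page_bootmem_memmap(unsigned long section_nr,
                                          struct page *start_page,
                                          unsigned long size)
        {
                /* intentionally empty */
        }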
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: <stable@vger.kernel.org> [v3.9+]
  7. 12 Sep, 2013 1 commit
  8. 11 Jul, 2013 2 commits
    • zswap: add to mm/ · 2b281117
      Seth Jennings authored
      
      zswap is a thin backend for frontswap that takes pages that are in the
      process of being swapped out and attempts to compress them and store
      them in a RAM-based memory pool.  This can result in a significant I/O
      reduction on the swap device and, in the case where decompressing from
      RAM is faster than reading from the swap device, can also improve
      workload performance.
      
      It also has support for evicting swap pages that are currently
      compressed in zswap to the swap device on an LRU(ish) basis.  This
      functionality makes zswap a true cache in that, once the cache is full,
      the oldest pages can be moved out of zswap to the swap device so newer
      pages can be compressed and stored in zswap.
      
      This patch adds the zswap driver to mm/.
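
      A hedged sketch of how such a backend plugs into frontswap (the
      callback names and signatures follow the frontswap API of this era;
      the trivial bodies below are stand-ins for the real ones in
      mm/zswap.c):

        #include <linux/frontswap.h>
        #include <linux/init.h>

        /* Sketch only: returning nonzero from store/load tells frontswap the
         * page was not handled, so it falls back to the real swap device. */
        static int zswap_frontswap_store(unsigned type, pgoff_t offset,
                                         struct page *page)
        {
                return -1;      /* real version compresses into the pool */
        }

        static int zswap_frontswap_load(unsigned type, pgoff_t offset,
                                        struct page *page)
        {
                return -1;      /* real version decompresses on swap-in */
        }

        static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset) { }
        static void zswap_frontswap_invalidate_area(unsigned type) { }
        static void zswap_frontswap_init(unsigned type) { }

        static struct frontswap_ops zswap_frontswap_ops = {
                .store           = zswap_frontswap_store,
                .load            = zswap_frontswap_load,
                .invalidate_page = zswap_frontswap_invalidate_page,
                .invalidate_area = zswap_frontswap_invalidate_area,
                .init            = zswap_frontswap_init,
        };

        static int __init zswap_setup(void)
        {
                frontswap_register_ops(&zswap_frontswap_ops);
                return 0;
        }
        late_initcall(zswap_setup);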
      Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Hugh Dickens <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zbud: add to mm/ · 4e2e2770
      Seth Jennings authored
      
      zbud is a special-purpose allocator for storing compressed pages.  It
      is designed to store up to two compressed pages per physical page.
      While this design limits storage density, it has simple and
      deterministic reclaim properties that make it preferable to a
      higher-density approach when reclaim will be used.
      
      zbud works by storing compressed pages, or "zpages", together in pairs
      in a single memory page called a "zbud page".  The first buddy is "left
      justified" at the beginning of the zbud page, and the last buddy is
      "right justified" at the end of the zbud page.  The benefit is that if
      either buddy is freed, the freed buddy space, coalesced with whatever
      slack space existed between the buddies, results in the largest
      possible free region within the zbud page.
      
      zbud also provides an attractive lower bound on density.  The ratio of
      zpages to zbud pages cannot be less than 1.  This ensures that zbud can
      never "do harm" by using more pages to store zpages than the
      uncompressed zpages would have used on their own.
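
      A hedged usage sketch of this pairing scheme (names follow the zbud
      interface added here; the pool is assumed to come from
      zbud_create_pool(), and the compressed length is illustrative):

        #include <linux/errno.h>
        #include <linux/gfp.h>
        #include <linux/string.h>
        #include <linux/zbud.h>

        /* Sketch only: allocate space for one zpage, copy it in, free it. */
        static int stash_zpage(struct zbud_pool *pool, void *src, size_t clen)
        {
                unsigned long handle;
                void *dst;

                if (zbud_alloc(pool, clen, GFP_KERNEL, &handle))
                        return -ENOMEM;

                dst = zbud_map(pool, handle);  /* buddies never cross a page */
                memcpy(dst, src, clen);
                zbud_unmap(pool, handle);

                zbud_free(pool, handle);
                return 0;
        }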
      
      This implementation is a rewrite of the zbud allocator internally used
      by zcache in the drivers/staging tree.  The rewrite was necessary to
      remove some of the zcache-specific elements that were ingrained
      throughout, and to provide a generic allocation interface that can
      later be used by zsmalloc and others.
      
      This patch adds zbud to mm/ for later use by zswap.
      Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Jenifer Hopper <jhopper@us.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Hugh Dickens <hughd@google.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 03 Jul, 2013 1 commit
    • mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Pavel Emelyanov authored
      
      Soft-dirty is a bit on a PTE which helps track which pages a task
      writes to.  To do this tracking one should:

        1. Clear the soft-dirty bits from the PTEs ("echo 4 > /proc/PID/clear_refs").
        2. Wait some time.
        3. Read the soft-dirty bits (the 55th bit in /proc/PID/pagemap2 entries).
      
      To make this tracking work, the writable bit is cleared from a PTE
      whenever its soft-dirty bit is.  Thus, when the task later tries to
      modify a page at some virtual address, a #PF occurs and the kernel sets
      the soft-dirty bit on the respective PTE.

      Note that although all of the task's address space is marked r/o after
      the soft-dirty bits are cleared, the #PFs that occur after that are
      processed quickly.  That's because the pages are still mapped to
      physical memory, so all the kernel does is note this fact and put the
      writable, dirty and soft-dirty bits back on the PTE.
      
      Another thing to note is that when mremap moves PTEs, they are marked
      soft-dirty as well, since from the user's perspective mremap modifies
      the virtual memory at mremap's new address.
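
      A hedged userspace sketch of the clear/wait/read cycle above (note: in
      mainline the bit landed as bit 55 of /proc/PID/pagemap entries; the
      pagemap2 file mentioned above is from this patchset's iteration):

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
                static char buf[1 << 16];
                long psz = sysconf(_SC_PAGESIZE);
                uint64_t entry = 0;
                int fd;

                fd = open("/proc/self/clear_refs", O_WRONLY);
                write(fd, "4", 1);      /* step 1: clear soft-dirty bits */
                close(fd);

                buf[0] = 1;             /* the "workload": dirty one page */

                fd = open("/proc/self/pagemap", O_RDONLY);
                pread(fd, &entry, sizeof(entry),
                      ((uintptr_t)buf / psz) * sizeof(entry)); /* one entry per page */
                close(fd);

                printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));
                return 0;
        }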
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 02 Jul, 2013 1 commit
  11. 03 Jun, 2013 1 commit
  12. 29 Apr, 2013 1 commit
  13. 12 Mar, 2013 1 commit
  14. 28 Feb, 2013 1 commit
  15. 24 Feb, 2013 1 commit
    • memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap · 46723bfa
      Yasuaki Ishimatsu authored
      
      To remove a memmap region of sparse-vmemmap that was allocated from
      bootmem, the region needs to be registered by get_page_bootmem().  So
      this patch walks the pages of the virtual mapping and registers them
      via get_page_bootmem().
      
      NOTE: register_page_bootmem_memmap() is not implemented for ia64,
            ppc, s390, and sparc.  So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
            and revert register_page_bootmem_info_node() when the platform
            doesn't support it.

            It's implemented by adding a new Kconfig option named
            CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically
            selected by archs that fully support the memory-hotplug feature
            (currently only x86_64).

            Since we have two config options, MEMORY_HOTPLUG and
            MEMORY_HOTREMOVE, used for memory hot-add and hot-remove
            separately, and the code in register_page_bootmem_info_node()
            is only used to collect information for hot-remove, place it
            under MEMORY_HOTREMOVE.

            page_isolation.c, selected by MEMORY_ISOLATION under
            MEMORY_HOTPLUG, is a similar case; move it too.
      
      [mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
      [linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
      [mhocko@suse.cz: remove the arch specific functions without any implementation]
      [linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
      [rientjes@google.com: fix defined but not used warning]
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Wu Jianguo <wujianguo@huawei.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 22 Feb, 2013 1 commit
    • block: optionally snapshot page contents to provide stable pages during write · ffecfd1a
      Darrick J. Wong authored
      
      This provides a band-aid for stable page writes on jbd without needing
      to backport the fixed locking and page-writeback-bit handling schemes
      of jbd2.  The band-aid works by using bounce buffers to snapshot page
      contents instead of waiting.
      
      For those wondering about the ext3 bandage -- fixing the jbd locking
      (which was done as part of ext4dev years ago) is a lot of surgery, and
      setting PG_writeback on data pages when we actually hold the page lock
      dropped ext3 performance by nearly an order of magnitude.  If we're
      going to migrate iscsi and raid to use stable page writes, the
      complaints about high latency will likely return.  We might as well
      centralize this page snapshotting in one place.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Tested-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Artem Bityutskiy <dedekind1@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ron Minnich <rminnich@sandia.gov>
      Cc: Latchesar Ionkov <lucho@ionkov.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 17 Jan, 2013 1 commit
  18. 18 Dec, 2012 1 commit
  19. 13 Dec, 2012 1 commit
  20. 12 Dec, 2012 1 commit
    • mm: introduce a common interface for balloon pages mobility · 18468d93
      Rafael Aquini authored
      
      Memory fragmentation introduced by ballooning can significantly reduce
      the number of 2MB contiguous memory blocks available within a guest,
      imposing performance penalties from the reduced number of transparent
      huge pages the guest workload could otherwise use.

      This patch introduces a common interface to help a balloon driver make
      its page set movable by compaction, allowing the system to better
      leverage compaction's memory-defragmentation efforts.
      
      [akpm@linux-foundation.org: use PAGE_FLAGS_CHECK_AT_PREP, s/__balloon_page_flags/page_flags_cleared/, small cleanups]
      [rientjes@google.com: allow balloon compaction for any system with memory compaction enabled, which is the defconfig]
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 09 Oct, 2012 2 commits
  22. 01 Aug, 2012 1 commit
  23. 29 May, 2012 1 commit
  24. 21 May, 2012 1 commit
  25. 15 May, 2012 1 commit
  26. 01 Nov, 2011 1 commit
  27. 14 Jul, 2011 2 commits
    • memblock, x86: Make ARCH_DISCARD_MEMBLOCK a config option · c378ddd5
      Tejun Heo authored
      
      Make ARCH_DISCARD_MEMBLOCK a config option so that it can be handled
      together with other MEMBLOCK options.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110714094603.GH3455@htj.dyndns.org
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
    • memblock: Add optional region->nid · 7c0caeb8
      Tejun Heo authored
      
      Add optional region->nid which can be enabled by arch using
      CONFIG_HAVE_MEMBLOCK_NODE_MAP.  When enabled, memblock also carries
      NUMA node information and replaces early_node_map[].
      
      Newly added memblocks have MAX_NUMNODES as their nid.  The arch can
      then call memblock_set_node() to set node information.  memblock takes
      care of merging and node-affine allocations w.r.t. node information.
      
      When MEMBLOCK_NODE_MAP is enabled, early_node_map[] and the related
      data structures and functions to manipulate and iterate it are
      disabled.  A memblock version of __next_mem_pfn_range() is provided so
      that for_each_mem_pfn_range() behaves the same and its users don't
      have to be updated.
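
      A hedged sketch of the arch-side sequence this enables (base, size and
      nid are illustrative):

        #include <linux/init.h>
        #include <linux/memblock.h>

        /* Sketch only: register a range, then attach its NUMA node so memblock
         * can merge regions and serve node-affine allocations. */
        static void __init sketch_add_node_memory(phys_addr_t base,
                                                  phys_addr_t size, int nid)
        {
                memblock_add(base, size);            /* nid starts as MAX_NUMNODES */
                memblock_set_node(base, size, nid);  /* arch supplies node info */
        }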
      
      -v2: Yinghai spotted section mismatch caused by missing
           __init_memblock in memblock_set_node().  Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110714094342.GF3455@htj.dyndns.org
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
  28. 10 Jun, 2011 1 commit
  29. 26 May, 2011 1 commit
    • mm: cleancache core ops functions and config · 077b1f83
      Dan Magenheimer authored
      
      This third patch of eight in the cleancache series provides the core
      code for cleancache, which interfaces between the hooks in the VFS and
      individual filesystems and a cleancache backend.  It also includes
      build and config patches.
      
      Two new files are added: mm/cleancache.c and include/linux/cleancache.h.
      
      Note that CONFIG_CLEANCACHE can default to on; on systems that do not
      provide a cleancache backend, all hooks devolve to a simple check of a
      global enable flag, so the performance impact should be negligible,
      and it can be reduced to zero impact if configured off.  However, for
      this first commit it defaults to off.
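
      A hedged sketch of that flag-check devolution (the hook name
      approximates the inlines in include/linux/cleancache.h; treat it as
      illustrative):

        struct page;

        extern int cleancache_enabled;
        extern int __cleancache_get_page(struct page *page);

        /* Sketch only: with no backend, the hook costs a single flag test. */
        static inline int cleancache_get_page(struct page *page)
        {
                if (!cleancache_enabled)
                        return -1;      /* miss; fall back to real I/O */
                return __cleancache_get_page(page);
        }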
      
      Details and a FAQ can be found in Documentation/vm/cleancache.txt.

      Credits: the cleancache_ops design is derived from Jeremy
      Fitzhardinge's design for tmem.
      
      [v8: dan.magenheimer@oracle.com: fix exportfs call affecting btrfs]
      [v8: akpm@linux-foundation.org: use static inline function, not macro]
      [v7: dan.magenheimer@oracle.com: cleanup sysfs and remove cleancache prefix]
      [v6: JBeulich@novell.com: robustly handle buggy fs encode_fh actor definition]
      [v5: jeremy@goop.org: clean up global usage and static var names]
      [v5: jeremy@goop.org: simplify init hook and any future fs init changes]
      [v5: hch@infradead.org: cleaner non-global interface for ops registration]
      [v4: adilger@sun.com: interface must support exportfs FS's]
      [v4: hch@infradead.org: interface must support 64-bit FS on 32-bit kernel]
      [v3: akpm@linux-foundation.org: use one ops struct to avoid pointer hops]
      [v3: akpm@linux-foundation.org: document and ensure PageLocked reqts are met]
      [v3: ngupta@vflare.org: fix success/fail codes, change funcs to void]
      [v2: viro@ZenIV.linux.org.uk: use sane types]
      Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
      Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Nitin Gupta <ngupta@vflare.org>
      Acked-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Andreas Dilger <adilger@sun.com>
      Acked-by: Jan Beulich <JBeulich@novell.com>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik Van Riel <riel@redhat.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
  30. 26 Jan, 2011 1 commit
    • mm: compaction: don't depend on HUGETLB_PAGE · 33a93877
      Andrea Arcangeli authored
      Commit 5d689240 ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE
      enabled") causes this warning during the configuration process:
      
        warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet
        direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU)
      
      COMPACTION doesn't depend on HUGETLB_PAGE, and it doesn't depend on THP
      either; it is also useful for regular alloc_pages(order > 0), including
      the kernel stack itself during fork (THREAD_ORDER = 1).  It's always
      better to enable COMPACTION.
      
      The warning should be an error, because we would end up with MIGRATION
      not selected, and COMPACTION doesn't work without migration (although
      it appears to build, with an inline migrate_pages() returning -ENOSYS).
      
      I'd also like to remove EXPERIMENTAL: compaction has been in the kernel
      for some releases (for full safety the default remains disabled, which
      I think is enough).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Luca Tettamanti <kronos.it@gmail.com>
      Tested-by: Luca Tettamanti <kronos.it@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  31. 14 Jan, 2011 4 commits
  32. 02 Oct, 2010 1 commit
    • percpu: use percpu allocator on UP too · 9b8327bb
      Tejun Heo authored
      
      On UP, percpu allocations were redirected to kmalloc.  This has the
      following problems.
      
      * For a certain number of allocations (determined by
        PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), the
        percpu allocator can be used before the usual kernel memory
        allocator is brought online.  On SMP, this is used to initialize the
        kernel memory allocator.
      
      * The percpu allocator honors alignment up to PAGE_SIZE, but kmalloc()
        doesn't.  For example, workqueue makes use of larger alignments for
        cpu_workqueues.
      
      Currently, users of percpu allocators need to handle UP differently,
      which is somewhat fragile and ugly.  Other than a small amount of
      memory, there isn't much to lose by enabling the percpu allocator on
      UP.  It can simply use the kernel-memory-based chunk allocation that
      was added for SMP archs without MMUs.
      
      This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too, and
      makes the UP build use percpu-km.  As percpu addresses and kernel
      addresses are always identity-mapped and static percpu variables don't
      need any special treatment, nothing is arch-dependent and mm/percpu.c
      implements a generic setup_per_cpu_areas() for UP.
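
      A hedged sketch of the caller's view once UP and SMP share one
      allocator (the struct and field names are illustrative):

        #include <linux/cpumask.h>
        #include <linux/errno.h>
        #include <linux/init.h>
        #include <linux/percpu.h>

        struct my_stats {
                unsigned long count;
        };

        /* Sketch only: the identical sequence now works on UP and SMP, with
         * percpu alignment guarantees that kmalloc() never provided. */
        static int __init sketch_percpu_usage(void)
        {
                struct my_stats __percpu *stats = alloc_percpu(struct my_stats);
                int cpu;

                if (!stats)
                        return -ENOMEM;

                for_each_possible_cpu(cpu)
                        per_cpu_ptr(stats, cpu)->count = 0;  /* touch each copy */

                this_cpu_inc(stats->count);     /* fast path on the local CPU */

                free_percpu(stats);
                return 0;
        }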
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>