1. 05 Apr, 2019 1 commit
  2. 27 Feb, 2019 1 commit
  3. 21 Nov, 2018 1 commit
• mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings · 818e5846
      Andrea Arcangeli authored
      commit ac5b2c18 upstream.
      
      THP allocation might be really disruptive when allocated on NUMA system
      with the local node full or hard to reclaim.  Stefan has posted an
      allocation stall report on 4.12 based SLES kernel which suggests the
      same issue:
      
        kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
        kvm cpuset=/ mems_allowed=0-1
  CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph #1 SLE15 (unreleased)
        Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
        Call Trace:
         dump_stack+0x5c/0x84
         warn_alloc+0xe0/0x180
         __alloc_pages_slowpath+0x820/0xc90
         __alloc_pages_nodemask+0x1cc/0x210
         alloc_pages_vma+0x1e5/0x280
         do_huge_pmd_wp_page+0x83f/0xf00
         __handle_mm_fault+0x93d/0x1060
         handle_mm_fault+0xc6/0x1b0
         __do_page_fault+0x230/0x430
         do_page_fault+0x2a/0x70
         page_fault+0x7b/0x80
         [...]
        Mem-Info:
        active_anon:126315487 inactive_anon:1612476 isolated_anon:5
         active_file:60183 inactive_file:245285 isolated_file:0
         unevictable:15657 dirty:286 writeback:1 unstable:0
         slab_reclaimable:75543 slab_unreclaimable:2509111
         mapped:81814 shmem:31764 pagetables:370616 bounce:0
         free:32294031 free_pcp:6233 free_cma:0
        Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
        Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
      
The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for a MADV_HUGEPAGE vma.
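
For context, this is how a mapping opts into that behaviour; a minimal
userspace sketch (the mapping size here is illustrative, not taken from
the report):

    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 512UL << 20;   /* 512MB, illustrative only */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;
        /* Opt this vma into THP; with defrag="madvise", only such
         * mappings attempt direct compaction/reclaim at fault time. */
        madvise(p, len, MADV_HUGEPAGE);
        memset(p, 0, len);          /* fault in: order-9 allocations here */
        return 0;
    }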
      
      Andrea has identified that the main source of the problem is
      __GFP_THISNODE usage:
      
      : The problem is that direct compaction combined with the NUMA
      : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
      : hard the local node, instead of failing the allocation if there's no
      : THP available in the local node.
      :
      : Such logic was ok until __GFP_THISNODE was added to the THP allocation
      : path even with MPOL_DEFAULT.
      :
      : The idea behind the __GFP_THISNODE addition, is that it is better to
      : provide local memory in PAGE_SIZE units than to use remote NUMA THP
      : backed memory. That largely depends on the remote latency though, on
      : threadrippers for example the overhead is relatively low in my
      : experience.
      :
      : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
      : extremely slow qemu startup with vfio, if the VM is larger than the
      : size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swap out get_user_pages()-pinned pages as a result of the
      : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
      : allocations and instead of trying to allocate THP on other nodes (it
      : would be even worse without vfio type1 GUP pins of course, except it'd
      : be swapping heavily instead).
      
Fix this by removing __GFP_THISNODE for THP requests which are
allowed to use direct reclaim.  This effectively reverts 5265047a
on the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free.  While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases.  The existing behaviour
is similar, if not as widespread, as it applies to a corner case, but
crucially it cannot be tuned around the way zone_reclaim_mode can.  The
default behaviour should always be to cause the least harm for the
common case.
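
The shape of the resulting hunk in alloc_pages_vma() (a sketch based on
the upstream patch; stable backports may differ in detail):

    /* mm/mempolicy.c: THP fast path in alloc_pages_vma() */
    if (!nmask || node_isset(hpage_node, *nmask)) {
        mpol_cond_put(pol);
        /*
         * Pin the allocation to the local node only when the caller
         * is not allowed to direct reclaim; otherwise __GFP_THISNODE
         * would make reclaim hammer (swap) the local node instead of
         * falling back to remote nodes or to base pages.
         */
        if (!(gfp & __GFP_DIRECT_RECLAIM))
            gfp |= __GFP_THISNODE;
        page = __alloc_pages_node(hpage_node, gfp, order);
        goto out;
    }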
      
If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top.  Long term we should
consider a memory policy which allows for node-reclaim-like behavior for
specific memory ranges which would allow a
      
      [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
      
      Mel said:
      
      : Both patches look correct to me but I'm responding to this one because
      : it's the fix.  The change makes sense and moves further away from the
      : severe stalling behaviour we used to see with both THP and zone reclaim
      : mode.
      :
      : I put together a basic experiment with usemem configured to reference a
      : buffer multiple times that is 80% the size of main memory on a 2-socket
      : box with symmetric node sizes and defrag set to "always".  The defrag
      : setting is not the default but it would be functionally similar to
      : accessing a buffer with madvise(MADV_HUGEPAGE).  Usemem is configured to
      : reference the buffer multiple times and while it's not an interesting
      : workload, it would be expected to complete reasonably quickly as it fits
      : within memory.  The results were;
      :
      : usemem
      :                                   vanilla           noreclaim-v1
      : Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
      : Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
      : Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
      :
      : This shows the elapsed time in seconds for 1 thread, 3 threads and 4
      : threads referencing buffers 80% the size of memory.  With the patches
: applied, it's 37.18% faster for the single thread and 73% faster with three
      : threads.  Note that 4 threads showing little difference does not indicate
      : the problem is related to thread counts.  It's simply the case that 4
      : threads gets spread so their workload mostly fits in one node.
      :
      : The overall view from /proc/vmstats is more startling
      :
:                          4.19.0-rc1      4.19.0-rc1
:                             vanilla  noreclaim-v1r1
      : Minor Faults               35593425      708164
      : Major Faults                 484088          36
      : Swap Ins                    3772837           0
      : Swap Outs                   3932295           0
      :
      : Massive amounts of swap in/out without the patch
      :
      : Direct pages scanned        6013214           0
      : Kswapd pages scanned              0           0
      : Kswapd pages reclaimed            0           0
      : Direct pages reclaimed      4033009           0
      :
      : Lots of reclaim activity without the patch
      :
      : Kswapd efficiency              100%        100%
      : Kswapd velocity               0.000       0.000
      : Direct efficiency               67%        100%
      : Direct velocity           11191.956       0.000
      :
      : Mostly from direct reclaim context as you'd expect without the patch.
      :
      : Page writes by reclaim  3932314.000       0.000
      : Page writes file                 19           0
      : Page writes anon            3932295           0
      : Page reclaim immediate        42336           0
      :
      : Writes from reclaim context is never good but the patch eliminates it.
      :
      : We should never have default behaviour to thrash the system for such a
      : basic workload.  If zone reclaim mode behaviour is ever desired but on a
      : single task instead of a global basis then the sensible option is to build
      : a mempolicy that enforces that behaviour.
      
This was a severe regression compared to previous kernels that made
important workloads unusable, and it started when __GFP_THISNODE was
added to THP allocations under MADV_HUGEPAGE.  It is not a significant
risk to go back to the previous behavior before __GFP_THISNODE was
added; it worked like that for years.
      
      This was simply an optimization to some lucky workloads that can fit in
      a single node, but it ended up breaking the VM for others that can't
      possibly fit in a single node, so going back is safe.
      
      [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
      Link: http://lkml.kernel.org/r/20180925120326.24392-2-mhocko@kernel.org
Fixes: 5265047a ("mm, thp: really limit transparent hugepage allocation to local node")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org>	[4.1+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      818e5846
  4. 30 May, 2018 3 commits
  5. 25 Aug, 2017 1 commit
• mm/mempolicy: fix use after free when calling get_mempolicy · 91105f2c
      zhong jiang authored
      commit 73223e4e upstream.
      
I hit a use after free issue when executing trinity and reproduced it
      with KASAN enabled.  The related call trace is as follows.
      
        BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766
        Read of size 2 by task syz-executor1/798
      
        INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799
           __slab_alloc+0x768/0x970
           kmem_cache_alloc+0x2e7/0x450
           mpol_new.part.2+0x74/0x160
           mpol_new+0x66/0x80
           SyS_mbind+0x267/0x9f0
           system_call_fastpath+0x16/0x1b
        INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799
           __slab_free+0x495/0x8e0
           kmem_cache_free+0x2f3/0x4c0
           __mpol_put+0x2b/0x40
           SyS_mbind+0x383/0x9f0
           system_call_fastpath+0x16/0x1b
        INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080
        INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600
      
        Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
        Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
        Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5                          kkkkkkk.
        Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb                          ........
        Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
        Memory state around the buggy address:
        ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
        ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
        >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc
      
A non-shared ("!shared") memory policy is not protected against parallel
removal by another thread; it is normally protected by the mmap_sem.
do_get_mempolicy, however, drops the lock midway while the policy can
still be accessed later.
      
The early, premature up_read is a historical artifact from the times
when put_user was called in this path (see
https://lwn.net/Articles/124754/), but that has been gone since
8bccd85f ("[PATCH] Implement sys_* do_* layering in the memory policy
layer.").  With the current mempolicy reference count model, however,
the premature unlock is what makes the use after free possible.
      
      Fix the issue by removing the premature release.
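
For reference, a sketch of the premature unlock that the patch removes
from do_get_mempolicy():

    /* mm/mempolicy.c, do_get_mempolicy(): the block the fix deletes */
    if (vma) {
        up_read(&current->mm->mmap_sem);
        vma = NULL;
    }
    /* pol is still dereferenced after this point (e.g. when copying
     * out the nodemask), so a parallel mbind() could free it under
     * us.  With the block removed, mmap_sem is released only at the
     * function's single exit path. */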
      
      Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com
      
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      91105f2c
  6. 12 Apr, 2017 1 commit
  7. 01 Feb, 2017 1 commit
  8. 19 Oct, 2016 1 commit
  9. 08 Oct, 2016 1 commit
  10. 02 Sep, 2016 1 commit
• mm, mempolicy: task->mempolicy must be NULL before dropping final reference · c11600e4
      David Rientjes authored
      KASAN allocates memory from the page allocator as part of
      kmem_cache_free(), and that can reference current->mempolicy through any
      number of allocation functions.  It needs to be NULL'd out before the
      final reference is dropped to prevent a use-after-free bug:
      
      	BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
      	CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
      	...
      	Call Trace:
      		dump_stack
      		kasan_object_err
      		kasan_report_error
      		__asan_report_load2_noabort
      		alloc_pages_current	<-- use after free
      		depot_save_stack
      		save_stack
      		kasan_slab_free
      		kmem_cache_free
      		__mpol_put		<-- free
      		do_exit
      
      This patch sets current->mempolicy to NULL before dropping the final
      reference.
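
A sketch of the helper the patch introduces (the ordering is the point:
clear the pointer first, drop the reference last):

    /* mm/mempolicy.c: drop the (possibly final) task policy reference.
     * task->mempolicy must already be NULL when mpol_put() runs, since
     * kmem_cache_free() may itself allocate (e.g. for KASAN's stack
     * depot) and reference current->mempolicy. */
    void mpol_put_task_policy(struct task_struct *task)
    {
        struct mempolicy *pol;

        task_lock(task);
        pol = task->mempolicy;
        task->mempolicy = NULL;
        task_unlock(task);
        mpol_put(pol);
    }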
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
      Fixes: cd11016e ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes <rientjes@google.com>
      c11600e4
  11. 28 Jul, 2016 1 commit
• mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman authored
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages at both the zone and
the node level.  Most reclaim logic is based on the node counters, but
the retry logic uses the zone counters, which do not distinguish
inactive and active sizes.  It would be possible to keep the LRU
counters on a per-zone basis, but that is a heavier calculation across
multiple cache lines that is much more frequent than the retry checks.
      
Other than the LRU counters, this is mostly a mechanical patch, but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but use per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed
later, but it keeps this patch easier to review.
      
In the event that there is excessive overhead on 32-bit systems due to
the LRU lists being per-node, there are two potential solutions:
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
   That would reduce the skip rate, with the potential corner case that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
      
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      599d0c95
  12. 26 Jul, 2016 2 commits
  13. 20 May, 2016 3 commits
• mm, page_alloc: avoid looking up the first zone in a zonelist twice · c33d6c06
      Mel Gorman authored
      
The allocator fast path looks up the first usable zone in a zonelist and
then get_page_from_freelist does the same job in the zonelist iterator.
This patch preserves the result of the first lookup so the work is not
repeated.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                              fastmark-v1r20             initonce-v1r20
        Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
        Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
        Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
        Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
        Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
        Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
        Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
        Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
        Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
        Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
        Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
        Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
        Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
        Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)
      
      The benefit is negligible and the results are within the noise but each
      cycle counts.
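
In outline, the first lookup is done once and cached in the allocation
context so get_page_from_freelist() can start from it (a sketch; names
follow the 4.6-era allocator, whose signatures shifted within this
patch series):

    /* mm/page_alloc.c, __alloc_pages_nodemask(): look up the first
     * usable zone once and keep the zoneref in the context ... */
    ac.preferred_zoneref = first_zones_zonelist(ac.zonelist,
                    ac.high_zoneidx, ac.nodemask);

    /* ... so get_page_from_freelist() resumes the zonelist iteration
     * from ac->preferred_zoneref instead of repeating the search. */
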
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c33d6c06
• mm/mempolicy.c:offset_il_node() document and clarify · fee83b3a
      Andrew Morton authored
      
This code was pretty obscure, relying upon side effects of
next_node(-1, ...) and upon NUMA_NO_NODE being equal to -1.

Clean that all up and document the function's intent.
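
The cleaned-up function then reads approximately as follows (a sketch
of the post-patch shape: the interleave index n is reduced modulo the
number of allowed nodes, then walked forward from first_node()):

    /* mm/mempolicy.c: return the node id for interleave index n */
    static unsigned offset_il_node(struct mempolicy *pol,
                                   struct vm_area_struct *vma,
                                   unsigned long n)
    {
        unsigned nnodes = nodes_weight(pol->v.nodes);
        unsigned target;
        int i;
        int nid;

        if (!nnodes)
            return numa_node_id();
        target = (unsigned int)n % nnodes;
        nid = first_node(pol->v.nodes);
        for (i = 0; i < target; i++)
            nid = next_node(nid, pol->v.nodes);
        return nid;
    }
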
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fee83b3a
• include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton authored
      
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
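
The helper, approximately as introduced (a sketch; the wrapper macro
lives in include/linux/nodemask.h with the out-of-line body in lib/):

    /* include/linux/nodemask.h */
    #define next_node_in(n, src) __next_node_in((n), &(src))

    /* like next_node(), but wrap around to the first node instead of
     * returning MAX_NUMNODES */
    int __next_node_in(int node, const nodemask_t *srcp)
    {
        int ret = __next_node(node, srcp);

        if (ret == MAX_NUMNODES)
            ret = __first_node(srcp);
        return ret;
    }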
      
      [mhocko@suse.com: use next_node_in() helper]
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Laura Abbott <lauraa@codeaurora.org>
Cc: Hui Zhu <zhuhui@xiaomi.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
  14. 17 Mar, 2016 1 commit
  15. 15 Mar, 2016 1 commit
  16. 09 Mar, 2016 1 commit
  17. 16 Feb, 2016 1 commit
• mm/gup: Switch all callers of get_user_pages() to not pass tsk/mm · d4edcf0d
      Dave Hansen authored
      
      We will soon modify the vanilla get_user_pages() so it can no
      longer be used on mm/tasks other than 'current/current->mm',
      which is by far the most common way it is called.  For now,
      we allow the old-style calls, but warn when they are used.
      (implemented in previous patch)
      
      This patch switches all callers of:
      
      	get_user_pages()
      	get_user_pages_unlocked()
      	get_user_pages_locked()
      
      to stop passing tsk/mm so they will no longer see the warnings.
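
The mechanical transformation looks like this (a sketch modeled on the
lookup_node() caller in mm/mempolicy.c, with 4.5-era argument lists):

    /* before: task/mm passed explicitly, always current/current->mm */
    err = get_user_pages(current, current->mm, addr & PAGE_MASK, 1,
                         0, 0, &page, NULL);

    /* after: tsk/mm dropped; the call implicitly acts on current */
    err = get_user_pages(addr & PAGE_MASK, 1, 0, 0, &page, NULL);
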
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: jack@suse.cz
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com
      
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d4edcf0d
  18. 06 Feb, 2016 1 commit
• mempolicy: do not try to queue pages from !vma_migratable() · 77bf45e7
      Kirill A. Shutemov authored
      
Maybe I miss some point, but I don't see a reason why we try to queue
pages from non-migratable VMAs.
      
      This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page():
      
          #include <fcntl.h>
          #include <unistd.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <numaif.h>
      
          #define SIZE 0x2000
      
          int foo;
      
          int main()
          {
              int fd;
              char *p;
              unsigned long mask = 2;
      
              fd = open("/dev/sg0", O_RDWR);
              p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
              /* Faultin pages */
              foo = p[0] + p[0x1000];
              mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT);
              return 0;
          }
      
The only case when we can queue pages from such a VMA is MPOL_MF_STRICT
plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for a VMA which has pages on the
LRU, but whose gfp mask is not suitable for migration (see the
mapping_gfp_mask() check in vma_migratable()).  That looks like a bug to
me.
      
Let's filter out non-migratable vmas at the start of
queue_pages_test_walk() and go to queue_pages_pte_range() only if the
MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set.
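
The resulting check, in outline (a sketch that omits the MPOL_MF_STRICT
range bookkeeping of the real function):

    static int queue_pages_test_walk(unsigned long start, unsigned long end,
                                     struct mm_walk *walk)
    {
        struct vm_area_struct *vma = walk->vma;
        struct queue_pages *qp = walk->private;

        /* never queue pages from a VMA that cannot be migrated */
        if (!vma_migratable(vma))
            return 1;                     /* 1 == skip this vma */

        /* descend into the page tables only if pages are to be moved */
        if (qp->flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
            return 0;                     /* 0 == walk this vma */
        return 1;
    }
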
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77bf45e7
  19. 16 Jan, 2016 3 commits
  20. 15 Jan, 2016 1 commit
• mm/mempolicy.c: convert the shared_policy lock to a rwlock · 4a8c7bb5
      Nathan Zimmer authored
      
When running the SPECint_rate gcc benchmark on some very large boxes it
was noticed that the system was spending lots of time in
mpol_shared_policy_lookup().  The gamess benchmark can also show it, and
it is what I mostly used to chase down the issue, since I found its
setup to be easier.
      
To be clear, the binaries were on tmpfs because of disk I/O
requirements.  We then used text replication to avoid icache misses and
to keep all the copies from banging on the memory where the instruction
code resides.  This results in us hitting a bottleneck in
mpol_shared_policy_lookup() since lookup is serialised by the
shared_policy lock.
      
I have only reproduced this on very large (3k+ core) boxes.  The
problem starts showing up at just a few hundred ranks and gets worse
until it threatens to livelock once it gets large enough.  For example,
on the gamess benchmark at 128 ranks this area consumes only ~1% of
time, at 512 ranks it consumes nearly 13%, and at 2k ranks it is over
90%.
      
To alleviate the contention in this area I converted the spinlock to an
rwlock.  This allows a large number of lookups to happen simultaneously.
The results were quite good, reducing this consumption at max ranks to
around 2%.
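
The conversion, in outline (a sketch; only the read-mostly lookup side
is shown, while insert/replace still serialise on write_lock):

    /* include/linux/mempolicy.h */
    struct shared_policy {
        struct rb_root root;
        rwlock_t       lock;       /* was: spinlock_t */
    };

    /* mm/mempolicy.c: lookups only need the read side */
    struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
                                                unsigned long idx)
    {
        struct mempolicy *pol = NULL;
        struct sp_node *sn;

        if (!sp->root.rb_node)
            return NULL;
        read_lock(&sp->lock);
        sn = sp_lookup(sp, idx, idx + 1);
        if (sn) {
            mpol_get(sn->policy);
            pol = sn->policy;
        }
        read_unlock(&sp->lock);
        return pol;
    }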
      
      [akpm@linux-foundation.org: tidy up code comments]
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a8c7bb5
  21. 08 Sep, 2015 2 commits
  22. 04 Sep, 2015 1 commit
  23. 25 Jun, 2015 1 commit
• mm, thp: respect MPOL_PREFERRED policy with non-local node · 0867a57c
      Vlastimil Babka authored
      Since commit 077fcf11 ("mm/thp: allocate transparent hugepages on
      local node"), we handle THP allocations on page fault in a special way -
      for non-interleave memory policies, the allocation is only attempted on
      the node local to the current CPU, if the policy's nodemask allows the
      node.
      
      This is motivated by the assumption that THP benefits cannot offset the
      cost of remote accesses, so it's better to fallback to base pages on the
      local node (which might still be available, while huge pages are not due
      to fragmentation) than to allocate huge pages on a remote node.
      
      The nodemask check prevents us from violating e.g.  MPOL_BIND policies
      where the local node is not among the allowed nodes.  However, the
      current implementation can still give surprising results for the
MPOL_PREFERRED policy when the preferred node is different from the
      current CPU's local node.
      
      In such case we should honor the preferred node and not use the local
      node, which is what this patch does.  If hugepage allocation on the
      preferred node fails, we fall back to base pages and don't try other
      nodes, with the same motivation as is done for the local node hugepage
      allocations.  The patch also moves the MPOL_INTERLEAVE check around to
      simplify the hugepage specific test.
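
The node selection after the patch, in outline (a sketch of the
alloc_pages_vma() hugepage path):

    /* mm/mempolicy.c: hugepage fast path in alloc_pages_vma() */
    if (pol->mode != MPOL_INTERLEAVE) {
        int hpage_node = node;    /* default: CPU-local node */

        /* honor an explicitly preferred node over the local one */
        if (pol->mode == MPOL_PREFERRED &&
            !(pol->flags & MPOL_F_LOCAL))
            hpage_node = pol->v.preferred_node;

        nmask = policy_nodemask(gfp, pol);
        if (!nmask || node_isset(hpage_node, *nmask)) {
            mpol_cond_put(pol);
            page = alloc_pages_exact_node(hpage_node,
                            gfp | __GFP_THISNODE, order);
            goto out;
        }
    }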
      
The difference can be demonstrated using the in-tree transhuge-stress
test on the following 2-node machine, where half the memory on one node
was occupied beforehand.
      
      > numactl --hardware
      available: 2 nodes (0-1)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
      node 0 size: 7878 MB
      node 0 free: 3623 MB
      node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
      node 1 size: 8045 MB
      node 1 free: 7818 MB
      node distances:
      node   0   1
        0:  10  21
        1:  21  10
      
      Before the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.197 s/loop, 0.276 ms/page,   7249.168 MiB/s 7962 succeed,    0 failed, 1786 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.962 s/loop, 0.372 ms/page,   5376.172 MiB/s 7962 succeed,    0 failed, 3873 different pages
      
The number of successful THP allocations corresponds to the free memory
on node 0 in the first case and node 1 in the second case, i.e. the -p
parameter is ignored and cpu binding "wins".
      
      After the patch:
      > numactl -p0 -C0 ./transhuge-stress
      transhuge-stress: 2.183 s/loop, 0.274 ms/page,   7295.516 MiB/s 7962 succeed,    0 failed, 1760 different pages
      
      > numactl -p0 -C12 ./transhuge-stress
      transhuge-stress: 2.878 s/loop, 0.361 ms/page,   5533.638 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -p1 -C0 ./transhuge-stress
      transhuge-stress: 4.628 s/loop, 0.581 ms/page,   3440.893 MiB/s 7962 succeed,    0 failed, 3918 different pages
      
      The -p parameter is respected regardless of cpu binding.
      
      > numactl -C0 ./transhuge-stress
      transhuge-stress: 2.202 s/loop, 0.277 ms/page,   7230.003 MiB/s 7962 succeed,    0 failed, 1750 different pages
      
      > numactl -C12 ./transhuge-stress
      transhuge-stress: 3.020 s/loop, 0.379 ms/page,   5273.324 MiB/s 7962 succeed,    0 failed, 3916 different pages
      
      Without -p parameter, hugepage restriction to CPU-local node works as before.
      
Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0867a57c
  24. 15 May, 2015 1 commit
  25. 14 Apr, 2015 2 commits
• mm, thp: really limit transparent hugepage allocation to local node · 5265047a
      David Rientjes authored
      Commit 077fcf11 ("mm/thp: allocate transparent hugepages on local
      node") restructured alloc_hugepage_vma() with the intent of only
      allocating transparent hugepages locally when there was not an effective
      interleave mempolicy.
      
alloc_pages_exact_node() does not limit the allocation to that single
node, however; it merely prefers it.  This is because __GFP_THISNODE is
not set, which would cause the node-local nodemask to be passed.
Without it, only a nodemask that prefers the local node is passed.
      
      Fix this by passing __GFP_THISNODE and falling back to small pages when
      the allocation fails.
      
      Commit 9f1b868a ("mm: thp: khugepaged: add policy for finding target
      node") suffers from a similar problem for khugepaged, which is also fixed.
      
      Fixes: 077fcf11 ("mm/thp: allocate transparent hugepages on local node")
Fixes: 9f1b868a ("mm: thp: khugepaged: add policy for finding target node")
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5265047a
• mm, mempolicy: migrate_to_node should only migrate to node · b360edb4
      David Rientjes authored
      
      migrate_to_node() is intended to migrate a page from one source node to
      a target node.
      
      Today, migrate_to_node() could end up migrating to any node, not only
      the target node.  This is because the page migration allocator,
      new_node_page() does not pass __GFP_THISNODE to
      alloc_pages_exact_node().  This causes the target node to be preferred
      but allows fallback to any other node in order of affinity.
      
      Prevent this by allocating with __GFP_THISNODE.  If memory is not
      available, -ENOMEM will be returned as appropriate.
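
The migration callback after the fix, approximately (a sketch; any
hugetlb special-casing in the real function is omitted):

    /* mm/mempolicy.c: allocate strictly on the target node, or fail
     * with -ENOMEM, instead of silently falling back by affinity */
    static struct page *new_node_page(struct page *page,
                                      unsigned long node, int **x)
    {
        return alloc_pages_exact_node(node,
                        GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
    }
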
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b360edb4
  26. 14 Feb, 2015 1 commit
  27. 13 Feb, 2015 1 commit
  28. 12 Feb, 2015 4 commits
• mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) · 48684a65
      Naoya Horiguchi authored
      
walk_page_range() silently skips any vma having VM_PFNMAP set, which
leads to undesirable behaviour at the client end (the caller of
walk_page_range()).  For example, for pagemap_read(), when no callbacks
are called against a VM_PFNMAP vma, pagemap_read() may prepare pagemap
data for the next virtual address range at the wrong index.  That could
confuse and/or break userspace applications.

This patch avoids this misbehavior for vma(VM_PFNMAP) as follows (a
sketch of the resulting behaviour appears after the list):
      - for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
        over vma(VM_PFNMAP),
      - for clear_refs and queue_pages which have their own ->tests_walk,
        just return 1 and skip vma(VM_PFNMAP). This is no problem because
        these are not interested in hole regions,
      - for other callers, just skip the vma(VM_PFNMAP) as a default behavior.
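
A sketch of the resulting logic in mm/pagewalk.c's walk_page_test()
(approximate):

    static int walk_page_test(unsigned long start, unsigned long end,
                              struct mm_walk *walk)
    {
        struct vm_area_struct *vma = walk->vma;

        /* a caller with its own policy decides for itself */
        if (walk->test_walk)
            return walk->test_walk(start, end, walk);

        /*
         * No struct pages back a VM_PFNMAP range, so never walk its
         * page tables; report it through ->pte_hole() if the caller
         * cares about holes, then skip the vma.
         */
        if (vma->vm_flags & VM_PFNMAP) {
            int err = 1;

            if (walk->pte_hole)
                err = walk->pte_hole(start, end, walk);
            return err ? err : 1;
        }
        return 0;       /* 0 == walk this vma normally */
    }
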
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48684a65
• mempolicy: apply page table walker on queue_pages_range() · 6f4576e3
      Naoya Horiguchi authored
      
queue_pages_range() does page table walking in its own way now, but
there is some code duplication.  This patch applies the generic page
table walker to reduce the lines of code.
      
queue_pages_range() has to do some prechecking to determine whether we
really walk over the vma or just skip it.  Now we have the test_walk()
callback in mm_walk for this purpose, so we can do this replacement
cleanly.  queue_pages_test_walk() depends not only on the current vma
but also on the previous one, so queue_pages->prev is introduced to
remember it.
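
In outline, queue_pages_range() now just fills in an mm_walk and lets
the generic walker drive the callbacks (a sketch; field layout follows
the patch):

    /* mm/mempolicy.c: queue_pages_range() built on walk_page_range() */
    struct queue_pages qp = {
        .pagelist = pagelist,
        .flags    = flags,
        .nmask    = nodes,
        .prev     = NULL,     /* remembers the previous vma */
    };
    struct mm_walk queue_pages_walk = {
        .hugetlb_entry = queue_pages_hugetlb,
        .pmd_entry     = queue_pages_pte_range,
        .test_walk     = queue_pages_test_walk,
        .mm            = mm,
        .private       = &qp,
    };

    return walk_page_range(start, end, &queue_pages_walk);
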
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f4576e3
• mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma · be97a41b
      Vlastimil Babka authored
      
      The previous commit ("mm/thp: Allocate transparent hugepages on local
      node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
      special policy for THP allocations.  The function has the same interface
      as alloc_pages_vma(), shares a lot of boilerplate code and a long
      comment.
      
This patch merges the hugepage special case into alloc_pages_vma.  The
extra if condition should be a cheap enough price to pay.  We also prevent
      a (however unlikely) race with parallel mems_allowed update, which could
      make hugepage allocation restart only within the fallback call to
      alloc_hugepage_vma() and not reconsider the special rule in
      alloc_hugepage_vma().
      
Also, by making sure mpol_cond_put(pol) is always called before the
actual allocation attempt, we can use a single exit path within the
function.
      
Also update the comment for the missing node parameter and the obsolete
reference to mm_sem.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be97a41b
• mm/thp: allocate transparent hugepages on local node · 077fcf11
      Aneesh Kumar K.V authored
      
This makes sure that we try to allocate hugepages from the local node if
allowed by mempolicy.  If we can't, we fall back to small page allocation
based on mempolicy.  This is based on the observation that allocating
pages on the local node is more beneficial than allocating hugepages on a
remote node.

With this patch applied we may find transparent huge page allocation
failures if the current node doesn't have enough free hugepages.  Before
this patch such failures resulted in us retrying the allocation on other
nodes in the numa node mask.
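
In outline, the new helper tries only the local node for the hugepage
and otherwise falls back to the regular policy path (a sketch; the
cpuset retry loop of the real function is omitted):

    /* mm/mempolicy.c: THP fault allocations try the local node only */
    struct page *alloc_hugepage_vma(gfp_t gfp, struct vm_area_struct *vma,
                                    unsigned long addr, int order)
    {
        struct mempolicy *pol = get_vma_policy(vma, addr);
        int node = numa_node_id();

        if (pol->mode != MPOL_INTERLEAVE) {
            nodemask_t *nmask = policy_nodemask(gfp, pol);

            /* allocate locally iff the policy allows this node */
            if (!nmask || node_isset(node, *nmask)) {
                mpol_cond_put(pol);
                return alloc_pages_exact_node(node, gfp, order);
            }
        }
        mpol_cond_put(pol);
        /* interleave, or local node disallowed: regular policy path */
        return alloc_pages_vma(gfp, order, vma, addr, node);
    }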
      
      [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      077fcf11