  1. 30 Apr, 2021 1 commit
  2. 15 Dec, 2020 1 commit
  3. 07 Aug, 2020 1 commit
    • mm: adjust vm_committed_as_batch according to vm overcommit policy · 56f3547b
      Feng Tang authored
      When checking a performance change for the will-it-scale scalability mmap
      test [1], we found very high lock contention on the spinlock of the percpu
      counter 'vm_committed_as':
      
          94.14%     0.35%  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
          48.21% _raw_spin_lock_irqsave;percpu_counter_add_batch;__vm_enough_memory;mmap_region;do_mmap;
          45.91% _raw_spin_lock_irqsave;percpu_counter_add_batch;__do_munmap;
      
      Actually, this heavy lock contention is not always necessary:
      'vm_committed_as' only needs to be very precise when the strict
      OVERCOMMIT_NEVER policy is set, which requires a rather small batch number
      for the percpu counter.
      
      So keep the batch number unchanged for the strict OVERCOMMIT_NEVER policy,
      and lift it to 64X for the OVERCOMMIT_ALWAYS and OVERCOMMIT_GUESS policies.
      Also add a sysctl handler to adjust it when the policy is reconfigured.
      
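      A minimal sketch of the sizing rule described above; the helper name, its
      placement and the exact divisors are assumptions for illustration, not
      necessarily the code in the patch:
      
          /*
           * Illustrative sketch: keep the precise (small) batch for
           * OVERCOMMIT_NEVER, lift it 64X for the looser policies, and
           * clamp the result to INT_MAX.
           */
          static void mm_compute_batch(int overcommit_policy)
          {
                  s32 nr = num_present_cpus();
                  s32 batch = max_t(s32, nr * 2, 32);
                  u64 memsized_batch;
      
                  if (overcommit_policy == OVERCOMMIT_NEVER)
                          memsized_batch = min_t(u64, totalram_pages() / nr / 256, INT_MAX);
                  else    /* OVERCOMMIT_ALWAYS / OVERCOMMIT_GUESS: 64X larger */
                          memsized_batch = min_t(u64, totalram_pages() / nr / 4, INT_MAX);
      
                  vm_committed_as_batch = max_t(s32, memsized_batch, batch);
          }
      
      With such a helper wired into the sysctl handler, switching the policy,
      e.g. via 'echo 2 > /proc/sys/vm/overcommit_memory', would recompute the
      batch with the stricter (OVERCOMMIT_NEVER) divisor.
      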
      A benchmark with the same testcase as in [1] shows a 53% improvement on an
      8C/16T desktop, and 2097% (20X) on a 4S/72C/144T server.  We tested with
      the test platforms in 0day (server, desktop and laptop), and 80%+ of the
      platforms show improvements with that test.  Whether a platform shows an
      improvement depends on whether its test mmap size is bigger than the
      computed batch number.
      
      If the lift is only 16X, 1/3 of the platforms will show improvements,
      though it should still help mmap/munmap usage generally, as Michal Hocko
      mentioned:
      
      : I believe that there are non-synthetic workloads which would benefit from
      : a larger batch.  E.g.  large in memory databases which do large mmaps
      : during startups from multiple threads.
      
      [1] https://lore.kernel.org/lkml/20200305062138.GI5972@shao2-debian/
      
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: kernel test robot <rong.a.chen@intel.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/1589611660-89854-4-git-send-email-feng.tang@intel.com
      Link: http://lkml.kernel.org/r/1592725000-73486-4-git-send-email-feng.tang@intel.com
      Link: http://lkml.kernel.org/r/1594389708-60781-5-git-send-email-feng.tang@intel.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56f3547b
  4. 02 Jun, 2020 1 commit
  5. 07 Apr, 2020 1 commit
  6. 21 May, 2019 1 commit
  7. 28 Dec, 2018 1 commit
  8. 22 Aug, 2018 1 commit
  9. 17 Mar, 2016 1 commit
  10. 01 Jul, 2015 2 commits
  11. 13 Feb, 2015 2 commits
    • mm/mm_init.c: mark mminit_loglevel __meminitdata · 194e8151
      Rasmus Villemoes authored
      
      mminit_loglevel is only referenced from __init and __meminit functions, so
      we can mark it __meminitdata.
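      A before/after sketch of the annotation, assuming the declaration in
      mm/mm_init.c looks roughly like this:
      
          /* before */ int               mminit_loglevel;
          /* after  */ int __meminitdata mminit_loglevel;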
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      194e8151
    • mm/mm_init.c: mark mminit_verify_zonelist as __init · 0e2342c7
      Rasmus Villemoes authored
      
      The only caller of mminit_verify_zonelist is build_all_zonelists_init,
      which is annotated with __init, so it should be safe to also mark the
      former as __init, saving ~400 bytes of .text.
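      A before/after sketch of the prototype, assuming the function lives in
      mm/mm_init.c as described:
      
          /* before */ void        mminit_verify_zonelist(void);
          /* after  */ void __init mminit_verify_zonelist(void);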
      Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e2342c7
  12. 28 Jan, 2014 1 commit
  13. 24 Jan, 2014 1 commit
    • mm/mm_init.c: make creation of the mm_kobj happen earlier than device_initcall · da29bd36
      Paul Gortmaker authored
      
      The use of __initcall is to eventually be replaced by choosing one of the
      prioritized groupings laid out in the init.h header:
      
      	pure_initcall               0
      	core_initcall               1
      	postcore_initcall           2
      	arch_initcall               3
      	subsys_initcall             4
      	fs_initcall                 5
      	device_initcall             6
      	late_initcall               7
      
      In the interim, all __initcall uses are mapped onto device_initcall, which,
      as can be seen above, comes quite late in the ordering.
      
      Currently the mm_kobj is created with __initcall in mm_sysfs_init().  This
      means that any other initcalls that want to reference the mm_kobj have to
      be device_initcall (or later); otherwise we will, for example, trip the
      BUG_ON(!kobj) in sysfs's internal_create_group().  This unfairly restricts
      those users; for example, something that clearly makes sense to be an
      arch_initcall will not be able to choose that.
      
      However, upon examination, it is only this way for historical reasons
      (i.e. it simply has not been reprioritized yet).  We see that sysfs is
      ready much earlier in init/main.c via:
      
       vfs_caches_init
       |_ mnt_init
          |_ sysfs_init
      
      well ahead of the processing of the prioritized calls listed above.
      
      So we can recategorize mm_sysfs_init to be a pure_initcall, which in
      turn allows any mm_kobj initcall users a wider range (1 --> 7) of
      initcall priorities to choose from.
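      A minimal sketch of the recategorized initcall; the body shown for
      mm_sysfs_init() is an assumption about what the kobject creation looks
      like, not a quote of the patch:
      
          struct kobject *mm_kobj;
      
          static int __init mm_sysfs_init(void)
          {
                  /* create /sys/kernel/mm so later initcalls can hang entries off it */
                  mm_kobj = kobject_create_and_add("mm", kernel_kobj);
                  if (!mm_kobj)
                          return -ENOMEM;
                  return 0;
          }
          pure_initcall(mm_sysfs_init);    /* was: __initcall(mm_sysfs_init); */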
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da29bd36
  14. 09 Oct, 2013 2 commits
    • mm: numa: Change page last {nid,pid} into {cpu,pid} · 90572890
      Peter Zijlstra authored
      
      Change the per-page last fault tracking to use cpu,pid instead of nid,pid.
      This will allow us to try to look up the alternate task more easily.  Note
      that even though it is the cpu that is stored in the page flags, the
      mpol_misplaced decision is still based on the node.
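      An illustrative sketch of the packing idea; the field widths and helper
      name are assumptions for illustration, not the exact kernel macros:
      
          /* pack the last-faulting cpu and the low pid bits into one value that
           * fits in the spare page->flags bits (widths shown are illustrative) */
          #define LAST__PID_SHIFT  8
          #define LAST__PID_MASK   ((1 << LAST__PID_SHIFT) - 1)
          #define LAST__CPU_MASK   ((1 << 8) - 1)
      
          static inline int cpu_pid_to_cpupid(int cpu, int pid)
          {
                  return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) |
                         (pid & LAST__PID_MASK);
          }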
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-43-git-send-email-mgorman@suse.de
      
      
      [ Fixed build failure on 32-bit systems. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90572890
    • sched/numa: Set preferred NUMA node based on number of private faults · b795854b
      Mel Gorman authored
      
      Ideally it would be possible to distinguish between NUMA hinting faults that
      are private to a task and those that are shared. If treated identically
      there is a risk that shared pages bounce between nodes depending on
      the order they are referenced by tasks. Ultimately what is desirable is
      that task private pages remain local to the task while shared pages are
      interleaved between sharing tasks running on different nodes to give good
      average performance. This is further complicated by THP as even
      applications that partition their data may not be partitioning on a huge
      page boundary.
      
      To start with, this patch assumes that multi-threaded or multi-process
      applications partition their data and that private accesses are generally
      more important for cpu->memory locality.  Also, no new infrastructure is
      required to treat private pages properly, but interleaving shared pages
      requires additional infrastructure.
      
      To detect private accesses, the pid of the last accessing task is required,
      but the storage requirements are high.  This patch borrows heavily from
      Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
      to encode some bits from the last accessing task in the page flags as
      well as the node information. Collisions will occur but it is better than
      just depending on the node information. Node information is then used to
      determine if a page needs to migrate. The PID information is used to detect
      private/shared accesses. The preferred NUMA node is selected based on where
      the maximum number of approximately private faults was measured.  Shared
      faults are not taken into consideration for a few reasons.
      
      First, if there are many tasks sharing the page then they'll all move
      towards the same node.  The node will become compute-overloaded and the
      tasks will later be scheduled away, only to bounce back again.
      Alternatively, the shared tasks would just bounce around nodes because the
      fault information is effectively noise.  Either way, accounting for shared
      faults the same as private faults can result in lower overall performance.
      
      The second reason is based on a hypothetical workload that has a small
      number of very important, heavily accessed private pages but a large shared
      array. The shared array would dominate the number of faults and be selected
      as a preferred node even though it's the wrong decision.
      
      The third reason is that multiple threads in a process will race each
      other to fault the shared page making the fault information unreliable.
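      An illustrative sketch of the private/shared classification and of
      preferring the node with the most private faults; the names and the
      per-node array are assumptions for illustration, not the patch itself:
      
          /* a fault counts as "private" when the low pid bits remembered in the
           * page match the faulting task (8 pid bits assumed here) */
          static bool fault_is_private(int last_pid_bits, struct task_struct *p)
          {
                  return last_pid_bits == (p->pid & 0xff);
          }
      
          /* prefer the node where the most (approximately) private faults were
           * recorded; private_faults[] is a hypothetical per-node counter array */
          static int preferred_node(unsigned long *private_faults)
          {
                  int nid, best = 0;
      
                  for_each_online_node(nid)
                          if (private_faults[nid] > private_faults[best])
                                  best = nid;
                  return best;
          }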
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      [ Fix compilation error when !NUMA_BALANCING. ]
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.de
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b795854b
  15. 03 Jul, 2013 1 commit
    • mm: tune vm_committed_as percpu_counter batching size · 917d9290
      Tim Chen authored
      
      Currently the per cpu counter's batch size for memory accounting is
      configured as twice the number of cpus in the system.  However, for
      systems with very large memory, it is more appropriate to make it
      proportional to the memory size per cpu in the system.
      
      For example, for an x86_64 system with 64 cpus and 128 GB of memory, the
      batch size is only 2*64 pages (0.5 MB).  So any memory accounting change of
      more than 0.5 MB will overflow the per cpu counter into the global counter.
      Instead, with the new scheme, the batch size is configured to be 0.4% of
      the memory per cpu = 8 MB (128 GB / 64 / 256), which is more in line with
      the memory size.
      
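      A minimal sketch of the described sizing; the helper name and the exact
      clamping are assumptions, not a quote of the patch:
      
          static void mm_compute_batch(void)
          {
                  u64 memsized_batch;
                  s32 nr = num_present_cpus();
                  s32 batch = max_t(s32, nr * 2, 32);
      
                  /* ~0.4% of (total memory / #cpus), e.g. 128 GB / 64 / 256 = 8 MB,
                   * capped so it still fits the percpu_counter's s32 batch */
                  memsized_batch = min_t(u64, totalram_pages / nr / 256, INT_MAX);
      
                  vm_committed_as_batch = max_t(s32, memsized_batch, batch);
          }
      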
      I've done a repeated brk test of 800 KB (from the will-it-scale test suite)
      with 80 concurrent processes on a 4-socket Westmere machine with a total
      of 40 cores.  Without the patch, about 80% of cpu time is spent on
      spin-lock contention within the vm_committed_as counter.  With the patch,
      there's a 73x speedup on the benchmark and the lock contention drops off
      almost entirely.
      
      [akpm@linux-foundation.org: fix section mismatch]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      917d9290
  16. 24 Feb, 2013 1 commit
    • mm: init: report on last-nid information stored in page->flags · a4e1b4c6
      Mel Gorman authored
      
      Answering the question "how much space remains in the page->flags" is
      time-consuming.  mminit_loglevel can help answer the question, but it does
      not take last_nid information into account.  This patch corrects that and,
      while there, also corrects the messages related to page flag usage,
      pgshifts and node/zone ids.  When applied, the relevant output looks
      something like this, though it will depend on the kernel configuration.
      
        mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
        mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
        mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
        mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
        mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags
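      A hedged sketch of how one such line could be emitted through the existing
      mminit_dprintk() helper; the exact prefix strings and field list in the
      patch may differ:
      
          mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
                  "Section %d Node %d Zone %d Lastnid %d Flags %d\n",
                  SECTIONS_WIDTH, NODES_WIDTH, ZONES_WIDTH,
                  LAST_NID_WIDTH, NR_PAGEFLAGS);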
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a4e1b4c6
  17. 31 Oct, 2011 1 commit
  18. 20 Aug, 2008 1 commit
  19. 05 Aug, 2008 1 commit
  20. 24 Jul, 2008 5 commits