1. 08 Sep, 2021 1 commit
  2. 03 Sep, 2021 1 commit
  3. 04 Jul, 2021 1 commit
  4. 17 Jun, 2021 1 commit
    • percpu: optimize locking in pcpu_balance_workfn() · e4d77700
      Roman Gushchin authored
      
      pcpu_balance_workfn() unconditionally calls pcpu_balance_free(),
      pcpu_reclaim_populated(), pcpu_balance_populated() and
      pcpu_balance_free() again.
      
      Each call to pcpu_balance_free() and pcpu_reclaim_populated() will
      cause at least one acquisition of the pcpu_lock. So even if the
      balancing was scheduled because of a failed atomic allocation,
      pcpu_lock will be acquired at least 4 times. This obviously
      increases the contention on the pcpu_lock.
      
      To optimize the scheme, let's grab the pcpu_lock at the upper level
      (in pcpu_balance_workfn()) and keep it held for most of the
      duration of the scheduled work, releasing it only to perform slow
      operations such as chunk (de)population and the creation of new
      chunks.
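
      A rough userspace sketch of this locking pattern, with a pthread mutex
      standing in for pcpu_lock and made-up helper names standing in for the
      slow steps (this is not the kernel code):

      #include <pthread.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      /* Stand-ins for the slow steps (chunk (de)population, chunk creation). */
      static void balance_free_slow(void) { }
      static void reclaim_populated_slow(void) { }

      static void balance_workfn(void)
      {
              pthread_mutex_lock(&lock);
              /* ... fast bookkeeping done under the lock ... */

              /* Drop the lock only around the slow work, then re-take it. */
              pthread_mutex_unlock(&lock);
              balance_free_slow();
              reclaim_populated_slow();
              pthread_mutex_lock(&lock);

              /* ... more bookkeeping under the lock ... */
              pthread_mutex_unlock(&lock);
      }

      int main(void)
      {
              balance_workfn();
              return 0;
      }
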
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
  5. 14 Jun, 2021 1 commit
  6. 05 Jun, 2021 1 commit
    • percpu: rework memcg accounting · faf65dde
      Roman Gushchin authored
      
      The current implementation of the memcg accounting of percpu
      memory is based on the idea of having two separate sets of chunks for
      accounted and non-accounted memory. This approach has the advantage
      of not wasting any extra memory on memcg data for non-accounted
      chunks; however, it complicates the code and leads to a higher number
      of chunks due to lower chunk utilization.
      
      Instead of having two chunk types it's possible to declare all* chunks
      memcg-aware unless the kernel memory accounting is disabled globally
      by a boot option. The size of objcg_array is usually small in
      comparison to chunks themselves (it obviously depends on the number of
      CPUs), so even if some chunks have no accounted allocations, the
      memory waste isn't significant and will likely be compensated for by
      higher chunk utilization. Also, over time more and more percpu
      allocations will likely become accounted.
      
      * The first chunk is initialized before the memory cgroup subsystem,
        so we don't know for sure whether we need to allocate obj_cgroups.
        Because it's small, let's make it free for use. Then we don't need
        to allocate obj_cgroups for it.
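
      A toy userspace illustration of the size trade-off: every chunk carries
      one owner pointer per possible object unless accounting is disabled
      globally. The structure and numbers below are made up for illustration,
      not the kernel's:

      #include <stdio.h>
      #include <stdlib.h>

      struct chunk {
              size_t nr_objects;   /* max allocations the chunk can hold */
              void **obj_owner;    /* one owner pointer per object, or NULL */
      };

      static struct chunk *chunk_create(size_t nr_objects, int accounting_enabled)
      {
              struct chunk *c = calloc(1, sizeof(*c));

              if (!c)
                      return NULL;
              c->nr_objects = nr_objects;
              /* Every chunk gets the owner array unless accounting is off globally. */
              if (accounting_enabled)
                      c->obj_owner = calloc(nr_objects, sizeof(*c->obj_owner));
              return c;
      }

      int main(void)
      {
              struct chunk *c = chunk_create(1024, 1);

              if (!c)
                      return 1;
              printf("owner array: %zu bytes for a chunk holding %zu objects\n",
                     c->nr_objects * sizeof(void *), c->nr_objects);
              return 0;
      }
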
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
  7. 14 May, 2021 1 commit
  8. 07 May, 2021 1 commit
  9. 21 Apr, 2021 3 commits
    • percpu: implement partial chunk depopulation · f1833241
      Roman Gushchin authored
      
      From Roman ("percpu: partial chunk depopulation"):
      In our [Facebook] production experience the percpu memory allocator
      sometimes struggles to return memory to the system. A typical example
      is the creation of several thousand memory cgroups (each with several
      chunks of percpu data used for vmstats, vmevents, ref counters, etc.).
      Deleting and completely releasing these cgroups doesn't always shrink
      the percpu memory, so sometimes several GBs of memory are wasted.
      
      The underlying problem is fragmentation: to release an underlying
      chunk, all percpu allocations in it must be released first. The percpu
      allocator tends to top up chunks to improve utilization, which means
      new small-ish allocations (e.g. percpu ref counters) are placed onto
      almost-filled old-ish chunks, effectively pinning them in memory.
      
      This patchset solves this problem by implementing a partial depopulation
      of percpu chunks: chunks with many empty pages are being asynchronously
      depopulated and the pages are returned to the system.
      
      To illustrate the problem the following script can be used:
      --
      
      cd /sys/fs/cgroup
      
      mkdir percpu_test
      echo "+memory" > percpu_test/cgroup.subtree_control
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          mkdir percpu_test/cg_"${i}"
          for j in `seq 1 10`; do
      	mkdir percpu_test/cg_"${i}"_"${j}"
          done
      done
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          for j in `seq 1 10`; do
      	rmdir percpu_test/cg_"${i}"_"${j}"
          done
      done
      
      sleep 10
      
      cat /proc/meminfo | grep Percpu
      
      for i in `seq 1 1000`; do
          rmdir percpu_test/cg_"${i}"
      done
      
      rmdir percpu_test
      --
      
      It creates 11000 memory cgroups and removes 10 out of every 11.
      It prints the initial size of the percpu memory, the size after
      creating all cgroups and the size after deleting most of them.
      
      Results:
        vanilla:
          ./percpu_test.sh
          Percpu:             7488 kB
          Percpu:           481152 kB
          Percpu:           481152 kB
      
        with this patchset applied:
          ./percpu_test.sh
          Percpu:             7488 kB
          Percpu:           481408 kB
          Percpu:           135552 kB
      
      The total size of the percpu memory was reduced by more than 3.5 times.
      
      This patch:
      
      This patch implements partial depopulation of percpu chunks.
      
      As of now, a chunk can be depopulated only as part of its final
      destruction, when there are no more outstanding allocations. However,
      to minimize memory waste it might be useful to depopulate a partially
      filled chunk, if a small number of outstanding allocations prevents
      the chunk from being fully reclaimed.
      
      This patch implements the following depopulation process: it scans
      over the chunk pages, looks for a range of empty and populated pages
      and performs the depopulation. To avoid races with new allocations,
      the chunk is isolated beforehand. After the depopulation the chunk is
      sidelined to a special list or freed. New allocations prefer using
      active chunks to sidelined chunks. If a sidelined chunk is used, it is
      reintegrated into the active lists.
      
      The depopulation is scheduled on the free path if the chunk meets all
      of the following conditions:
        1) it has more than 1/4 of its total pages free and populated
        2) the system has enough free percpu pages aside from this chunk
        3) it isn't the reserved chunk
        4) it isn't the first chunk
      If the chunk is already depopulated but has gained free populated
      pages, it's a good target too. The chunk is moved to a special slot,
      pcpu_to_depopulate_slot, chunk->isolated is set, and the balance work
      item is scheduled. On isolation, its empty populated pages are removed
      from pcpu_nr_empty_pop_pages. A chunk is placed back on the
      to_depopulate_slot whenever it meets these qualifications.
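
      A self-contained sketch of such an eligibility check (the field and
      parameter names below are illustrative, not the kernel's):

      #include <stdbool.h>

      struct chunk_stats {
              unsigned int nr_pages;        /* total pages backing the chunk */
              unsigned int free_pop_pages;  /* pages that are free and populated */
              bool is_reserved;
              bool is_first;
      };

      static bool should_depopulate(const struct chunk_stats *c,
                                    unsigned int system_empty_pop_pages,
                                    unsigned int empty_pop_target)
      {
              if (c->is_reserved || c->is_first)
                      return false;
              /* keep enough populated free pages system-wide, excluding this chunk */
              if (system_empty_pop_pages < c->free_pop_pages + empty_pop_target)
                      return false;
              /* only bother if at least a quarter of the chunk is free and populated */
              return c->free_pop_pages > c->nr_pages / 4;
      }

      int main(void)
      {
              struct chunk_stats c = { .nr_pages = 64, .free_pop_pages = 20 };

              return should_depopulate(&c, 256, 8) ? 0 : 1;
      }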
      
      pcpu_reclaim_populated() iterates over the to_depopulate_slot until it
      becomes empty. The depopulation is performed in the reverse direction to
      keep populated pages close to the beginning. Depopulated chunks are
      sidelined so that new allocations preferentially avoid them. When no
      active chunk can satisfy a new allocation, sidelined chunks are
      checked before creating a new chunk.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Co-developed-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Tested-by: Pratik Sampat <psampat@linux.ibm.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1 · 1c29a3ce
      Dennis Zhou authored
      
      This prepares for adding a to_depopulate list and sidelined list after
      the free slot in the set of lists in pcpu_slot.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • percpu: factor out pcpu_check_block_hint() · 8ea2e1e3
      Roman Gushchin authored
      
      Factor out the pcpu_check_block_hint() helper, which will be useful
      in the future. The new function checks if the allocation can likely
      fit within the contig hint.
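
      A hedged sketch of what such a pre-check can look like; the struct
      below is only a stand-in for the block metadata, not the actual
      struct pcpu_block_md:

      #include <stdbool.h>
      #include <stddef.h>

      struct block_hint {
              size_t contig_start;  /* start of the largest known free run */
              size_t contig_len;    /* its length, in allocation units */
      };

      /* Cheap pre-check: can a request of 'len' units, aligned to 'align'
       * (a power of two), plausibly fit inside the recorded contig hint? */
      static bool check_block_hint(const struct block_hint *b, size_t len, size_t align)
      {
              size_t start = (b->contig_start + align - 1) & ~(align - 1);

              if (start >= b->contig_start + b->contig_len)
                      return false;
              return b->contig_start + b->contig_len - start >= len;
      }

      int main(void)
      {
              struct block_hint b = { .contig_start = 5, .contig_len = 12 };

              return check_block_hint(&b, 8, 4) ? 0 : 1;
      }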
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
  10. 16 Apr, 2021 2 commits
  11. 09 Apr, 2021 1 commit
  12. 14 Feb, 2021 2 commits
    • percpu: fix clang modpost section mismatch · 258e0815
      Dennis Zhou authored
      pcpu_build_alloc_info() is an __init function that calls
      cpumask_clear_cpu(). With CONFIG_GCOV_PROFILE_ALL enabled, the inlining
      heuristics are modified and cpumask_clear_cpu(), although marked
      inline, doesn't get inlined. Because it operates on a mask in __initdata,
      modpost throws a section mismatch error.
      
      Arnd sent a patch with the flatten attribute as an alternative [2]. I've
      added it to compiler_attributes.h.
      
      modpost complaint:
        WARNING: modpost: vmlinux.o(.text+0x735425): Section mismatch in reference from the function cpumask_clear_cpu() to the variable .init.data:pcpu_build_alloc_info.mask
        The function cpumask_clear_cpu() references
        the variable __initdata pcpu_build_alloc_info.mask.
        This is often because cpumask_clear_cpu lacks a __initdata
        annotation or the annotation of pcpu_build_alloc_info.mask is wrong.
      
      clang output:
        mm/percpu.c:2724:5: remark: cpumask_clear_cpu not inlined into pcpu_build_alloc_info because too costly to inline (cost=725, threshold=325) [-Rpass-missed=inline]
      
      [1] https://lore.kernel.org/linux-mm/202012220454.9F6Bkz9q-lkp@intel.com/
      [2] https://lore.kernel.org/lkml/CAK8P3a2ZWfNeXKSm8K_SUhhwkor17jFo3xApLXjzfPqX0eUDUA@mail.gmail.com/
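
      For reference, a minimal standalone example of the attribute itself
      (unrelated to the kernel sources; it just shows that flatten forces
      calls made directly from the annotated function to be inlined):

      #include <stdio.h>

      static inline int clear_bit_in(unsigned long *mask, int bit)
      {
              *mask &= ~(1UL << bit);
              return 0;
      }

      /* flatten asks the compiler to inline every call made directly from
       * this function body, regardless of its usual cost heuristics. */
      __attribute__((flatten))
      static void build_mask(unsigned long *mask)
      {
              for (int i = 0; i < 8; i++)
                      clear_bit_in(mask, i);
      }

      int main(void)
      {
              unsigned long mask = ~0UL;

              build_mask(&mask);
              printf("%lx\n", mask);
              return 0;
      }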
      
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • percpu: reduce the number of cpu distance comparisons · d7d29ac7
      Wonhyuk Yang authored
      
      To build group_map[] and group_cnt[], we find out which group each
      CPU belongs to by comparing CPU distances. However, this includes
      cases where comparisons are not required.
      
      This patch uses a bitmap to record the CPUs that have not yet been
      classified into a group. CPUs whose group is already known are
      cleared from the bitmap. As a result, we can reduce the number of
      unnecessary comparisons.
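
      A toy standalone version of the idea (the distance function and CPU
      count are invented; the point is that a classified CPU is cleared from
      the bitmap and never compared again):

      #include <stdio.h>

      #define NR_CPUS 8

      /* Toy distance function: CPUs in the same half are "local". */
      static int cpu_distance(int a, int b)
      {
              return (a / 4 == b / 4) ? 1 : 2;
      }

      int main(void)
      {
              unsigned int unclassified = (1u << NR_CPUS) - 1;  /* all pending */
              int group_map[NR_CPUS], nr_groups = 0;

              for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                      if (!(unclassified & (1u << cpu)))
                              continue;            /* already placed: skip */
                      int g = nr_groups++;
                      for (int other = cpu; other < NR_CPUS; other++) {
                              if ((unclassified & (1u << other)) &&
                                  cpu_distance(cpu, other) == 1) {
                                      group_map[other] = g;
                                      unclassified &= ~(1u << other);
                              }
                      }
              }

              for (int cpu = 0; cpu < NR_CPUS; cpu++)
                      printf("cpu%d -> group %d\n", cpu, group_map[cpu]);
              return 0;
      }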
      Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      [Dennis: added cpumask_clear() call and #include cpumask.h.]
  13. 30 Oct, 2020 1 commit
  14. 18 Oct, 2020 1 commit
    • mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current() · 279c3393
      Roman Gushchin authored
      
      Patch series "mm: kmem: kernel memory accounting in an interrupt context".
      
      This patchset implements memcg-based memory accounting of allocations made
      from an interrupt context.
      
      Historically, such allocations went unaccounted mostly because
      charging the memory cgroup of the current process wasn't an option.
      Performance was likely a concern as well.
      
      The remote charging API allows temporarily overriding the currently
      active memory cgroup, so that all memory allocations are accounted towards
      some specified memory cgroup instead of the memory cgroup of the current
      process.
      
      This patchset extends the remote charging API so that it can be used from
      an interrupt context.  Then it removes the fence that prevented the
      accounting of allocations made from an interrupt context.  It also
      contains a couple of optimizations/code refactorings.
      
      This patchset doesn't directly enable accounting for any specific
      allocations, but prepares the code base for it.  The bpf memory accounting
      will likely be the first user of it: a typical example is a bpf program
      parsing an incoming network packet, which allocates an entry in a
      hashmap to store some information.
      
      This patch (of 4):
      
      Currently memcg_kmem_bypass() is called before obtaining the current
      memory/obj cgroup using get_mem/obj_cgroup_from_current().  Moving
      memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
      number of call sites and allows further code simplifications.
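
      A schematic userspace illustration of the refactoring: the bypass test
      moves inside the getter, so callers no longer need to open-code it
      (the helpers here are stand-ins, not the kernel API):

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      struct cgroup { int id; };

      /* Stand-ins for memcg_kmem_bypass() and the current task's cgroup. */
      static bool should_bypass(void) { return false; }
      static struct cgroup *current_cgroup(void)
      {
              static struct cgroup c = { .id = 1 };
              return &c;
      }

      /* After the change: the bypass check lives inside the getter, so every
       * call site gets it for free instead of calling the check first. */
      static struct cgroup *get_cgroup_from_current(void)
      {
              if (should_bypass())
                      return NULL;
              return current_cgroup();
      }

      int main(void)
      {
              struct cgroup *cg = get_cgroup_from_current();

              printf("%d\n", cg ? cg->id : -1);
              return 0;
      }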
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 17 Sep, 2020 1 commit
  16. 12 Aug, 2020 3 commits
    • mm: memcg/percpu: per-memcg percpu memory statistics · 772616b0
      Roman Gushchin authored
      Percpu memory can represent a noticeable chunk of the total memory
      consumption, especially on big machines with many CPUs.  Let's track
      percpu memory usage for each memcg and display it in memory.stat.
      
      A percpu allocation is usually scattered over multiple pages (and nodes),
      and can be significantly smaller than a page.  So let's add a byte-sized
      counter on the memcg level: MEMCG_PERCPU_B.  Byte-sized vmstat infra
      created for slabs can be perfectly reused for the percpu case.
      
      [guro@fb.com: v3]
        Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/percpu: account percpu memory to memory cgroups · 3c7be18a
      Roman Gushchin authored
      Percpu memory is becoming more and more widely used by various subsystems,
      and the total amount of memory controlled by the percpu allocator can make
      a good part of the total memory.
      
      As an example, bpf maps can consume a lot of percpu memory, and they are
      created by a user.  Also, some cgroup internals (e.g.  memory controller
      statistics) can be quite large.  On a machine with many CPUs and a big
      number of cgroups, they can consume hundreds of megabytes.
      
      So the lack of memcg accounting is creating a breach in the memory
      isolation.  Similar to the slab memory, percpu memory should be accounted
      by default.
      
      To implement the percpu accounting it's possible to take the slab memory
      accounting as a model to follow.  Let's introduce two types of percpu
      chunks: root and memcg.  What makes memcg chunks different is an
      additional space allocated to store memcg membership information.  If
      __GFP_ACCOUNT is passed on allocation, a memcg chunk should be used.
      If it's possible to charge the corresponding size to the target memory
      cgroup, allocation is performed, and the memcg ownership data is recorded.
      System-wide allocations are performed using root chunks, so there is no
      additional memory overhead.
      
      To implement a fast reparenting of percpu memory on memcg removal, we
      don't store mem_cgroup pointers directly: instead we use obj_cgroup API,
      introduced for slab accounting.
      
      [akpm@linux-foundation.org: fix CONFIG_MEMCG_KMEM=n build errors and warning]
      [akpm@linux-foundation.org: move unreachable code, per Roman]
      [cuibixuan@huawei.com: mm/percpu: fix 'defined but not used' warning]
        Link: http://lkml.kernel.org/r/6d41b939-a741-b521-a7a2-e7296ec16219@huawei.com
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623184515.4132564-3-guro@fb.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • percpu: return number of released bytes from pcpu_free_area() · 5b32af91
      Roman Gushchin authored
      
      Patch series "mm: memcg accounting of percpu memory", v3.
      
      This patchset adds percpu memory accounting to memory cgroups.  It's based
      on the rework of the slab controller and reuses concepts and features
      introduced for the per-object slab accounting.
      
      Percpu memory is becoming more and more widely used by various subsystems,
      and the total amount of memory controlled by the percpu allocator can make
      a good part of the total memory.
      
      As an example, bpf maps can consume a lot of percpu memory, and they are
      created by a user.  Also, some cgroup internals (e.g.  memory controller
      statistics) can be quite large.  On a machine with many CPUs and a big
      number of cgroups, they can consume hundreds of megabytes.
      
      So the lack of memcg accounting is creating a breach in the memory
      isolation.  Similar to the slab memory, percpu memory should be accounted
      by default.
      
      Percpu allocations by their nature are scattered over multiple pages, so
      they can't be tracked on a per-page basis.  So the per-object tracking
      introduced by the new slab controller is reused.
      
      The patchset implements charging of percpu allocations, adds memcg-level
      statistics, enables accounting for percpu allocations made by memory
      cgroup internals and provides some basic tests.
      
      To implement the accounting of percpu memory without a significant memory
      and performance overhead the following approach is used: all accounted
      allocations are placed into a separate percpu chunk (or chunks).  These
      chunks are similar to default chunks, except that they do have an attached
      vector of pointers to obj_cgroup objects, which is big enough to save a
      pointer for each allocated object.  On the allocation, if the allocation
      has to be accounted (__GFP_ACCOUNT is passed, the allocating process
      belongs to a non-root memory cgroup, etc.), the memory cgroup is
      charged and, if the maximum limit is not exceeded, the allocation is
      performed using a memcg-aware chunk.  Otherwise -ENOMEM is returned or the
      allocation is forced over the limit, depending on gfp (as any other kernel
      memory allocation).  The memory cgroup information is saved in the
      obj_cgroup vector at the corresponding offset.  At release time the
      memcg information is restored from the vector and the cgroup is
      uncharged.  Unaccounted allocations (at this point the absolute majority
      of all percpu allocations) are performed in the old way, so no additional
      overhead is expected.
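
      A minimal sketch of the offset-indexed owner vector described above
      (the unit size and names are invented; obj_cgroup is reduced to an
      opaque "owner"):

      #include <stddef.h>

      #define UNIT 16                        /* hypothetical allocation granularity */
      #define CHUNK_BYTES 1024

      struct owner { int id; };              /* stands in for obj_cgroup */

      static struct owner *owners[CHUNK_BYTES / UNIT];  /* one slot per object */

      /* On an accounted allocation, remember who owns the object at this offset. */
      static void record_owner(size_t off, struct owner *o)
      {
              owners[off / UNIT] = o;
      }

      /* On free, look the owner back up from the offset so it can be uncharged. */
      static struct owner *release_owner(size_t off)
      {
              struct owner *o = owners[off / UNIT];

              owners[off / UNIT] = NULL;
              return o;
      }

      int main(void)
      {
              static struct owner me = { .id = 42 };

              record_owner(64, &me);
              return release_owner(64) == &me ? 0 : 1;
      }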
      
      To avoid pinning dying memory cgroups by outstanding allocations,
      obj_cgroup API is used instead of directly saving memory cgroup pointers.
      obj_cgroup is basically a pointer to a memory cgroup with a standalone
      reference counter.  The trick is that it can be atomically swapped to
      point at the parent cgroup, so that the original memory cgroup can be
      released prior to all objects that have been charged to it.  Because all
      charges and statistics are fully recursive, it's perfectly correct to
      uncharge the parent cgroup instead.  This scheme is used in the slab
      memory accounting, and percpu memory can just follow the scheme.
      
      This patch (of 5):
      
      To implement accounting of percpu memory we need the information about the
      size of freed object.  Return it from pcpu_free_area().
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623184515.4132564-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200608230819.832349-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20200608230819.832349-2-guro@fb.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 16 Jul, 2020 1 commit
    • treewide: Remove uninitialized_var() usage · 3f649ab7
      Kees Cook authored
      Using uninitialized_var() is dangerous as it papers over real bugs[1]
      (or can in the future), and suppresses unrelated compiler warnings
      (e.g. "unused variable"). If the compiler thinks it is uninitialized,
      either simply initialize the variable or make compiler changes.
      
      In preparation for removing[2] the[3] macro[4], remove all remaining
      needless uses with the following script:
      
      git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
      	xargs perl -pi -e \
      		's/\buninitialized_var\(([^\)]+)\)/\1/g;
      		 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'
      
      drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
      pathological white-space.
      
      No outstanding warnings were found building allmodconfig with GCC 9.3.0
      for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
      alpha, and m68k.
      
      [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
      [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6ee...
  18. 02 Jun, 2020 1 commit
    • mm: remove the pgprot argument to __vmalloc · 88dca4ca
      Christoph Hellwig authored
      
      The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
      Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Wei Liu <wei.liu@kernel.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 08 May, 2020 1 commit
    • percpu: make pcpu_alloc() aware of current gfp context · 28307d93
      Filipe Manana authored
      Since 5.7-rc1, on btrfs we have a percpu counter initialization for
      which we always pass a GFP_KERNEL gfp_t argument (this happens since
      commit 2992df73 ("btrfs: Implement DREW lock")).
      
      That is safe in some contexts but not in others, where allowing fs
      reclaim could lead to a deadlock because we are either holding some
      btrfs lock needed for a transaction commit or holding a btrfs
      transaction handle open.  Because of that we surround the call to the
      function that initializes the percpu counter with a NOFS context using
      memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
      
      However it turns out that this is not enough to prevent a possible
      deadlock because percpu_alloc() determines if it is in an atomic context
      by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
      case) and it is not aware that a NOFS context is set.
      
      Because percpu_alloc() thinks it is in a non-atomic context it locks the
      pcpu_alloc_mutex.  This can result in a btrfs deadlock when
      pcpu_balance_workfn() is running, has acquired that mutex and is waiting
      for reclaim, while the btrfs task that called percpu_counter_init() (and
      therefore percpu_alloc()) is holding either the btrfs commit_root
      semaphore or a transaction handle (done at fs/btrfs/backref.c:
      iterate_extent_inodes()), which prevents reclaim from finishing as an
      attempt to commit the current btrfs transaction will deadlock.
      
      Lockdep reports this issue with the following trace:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.6.0-rc7-btrfs-next-77 #1 Not tainted
        ------------------------------------------------------
        kswapd0/91 is trying to acquire lock:
        ffff8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
      
        but task is already holding lock:
        ffffffffb4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #4 (fs_reclaim){+.+.}:
               fs_reclaim_acquire.part.0+0x25/0x30
               __kmalloc+0x5f/0x3a0
               pcpu_create_chunk+0x19/0x230
               pcpu_balance_workfn+0x56a/0x680
               process_one_work+0x235/0x5f0
               worker_thread+0x50/0x3b0
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        -> #3 (pcpu_alloc_mutex){+.+.}:
               __mutex_lock+0xa9/0xaf0
               pcpu_alloc+0x480/0x7c0
               __percpu_counter_init+0x50/0xd0
               btrfs_drew_lock_init+0x22/0x70 [btrfs]
               btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
               resolve_indirect_refs+0x120/0xa30 [btrfs]
               find_parent_nodes+0x50b/0xf30 [btrfs]
               btrfs_find_all_leafs+0x60/0xb0 [btrfs]
               iterate_extent_inodes+0x139/0x2f0 [btrfs]
               iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
               btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
               btrfs_ioctl+0x165a/0x3130 [btrfs]
               ksys_ioctl+0x87/0xc0
               __x64_sys_ioctl+0x16/0x20
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #2 (&fs_info->commit_root_sem){++++}:
               down_write+0x38/0x70
               btrfs_cache_block_group+0x2ec/0x500 [btrfs]
               find_free_extent+0xc6a/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               commit_cowonly_roots+0x55/0x310 [btrfs]
               btrfs_commit_transaction+0x509/0xb20 [btrfs]
               sync_filesystem+0x74/0x90
               generic_shutdown_super+0x22/0x100
               kill_anon_super+0x14/0x30
               btrfs_kill_super+0x12/0x20 [btrfs]
               deactivate_locked_super+0x31/0x70
               cleanup_mnt+0x100/0x160
               task_work_run+0x93/0xc0
               exit_to_usermode_loop+0xf9/0x100
               do_syscall_64+0x20d/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #1 (&space_info->groups_sem){++++}:
               down_read+0x3c/0x140
               find_free_extent+0xef6/0x1600 [btrfs]
               btrfs_reserve_extent+0x9b/0x180 [btrfs]
               btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
               alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
               __btrfs_cow_block+0x122/0x5a0 [btrfs]
               btrfs_cow_block+0x106/0x240 [btrfs]
               btrfs_search_slot+0x50c/0xd60 [btrfs]
               btrfs_lookup_inode+0x3a/0xc0 [btrfs]
               __btrfs_update_delayed_inode+0x90/0x280 [btrfs]
               __btrfs_commit_inode_delayed_items+0x81f/0x870 [btrfs]
               __btrfs_run_delayed_items+0x8e/0x180 [btrfs]
               btrfs_commit_transaction+0x31b/0xb20 [btrfs]
               iterate_supers+0x87/0xf0
               ksys_sync+0x60/0xb0
               __ia32_sys_sync+0xa/0x10
               do_syscall_64+0x5c/0x260
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #0 (&delayed_node->mutex){+.+.}:
               __lock_acquire+0xef0/0x1c80
               lock_acquire+0xa2/0x1d0
               __mutex_lock+0xa9/0xaf0
               __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
               btrfs_evict_inode+0x40d/0x560 [btrfs]
               evict+0xd9/0x1c0
               dispose_list+0x48/0x70
               prune_icache_sb+0x54/0x80
               super_cache_scan+0x124/0x1a0
               do_shrink_slab+0x176/0x440
               shrink_slab+0x23a/0x2c0
               shrink_node+0x188/0x6e0
               balance_pgdat+0x31d/0x7f0
               kswapd+0x238/0x550
               kthread+0x120/0x140
               ret_from_fork+0x3a/0x50
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> pcpu_alloc_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(fs_reclaim);
                                       lock(pcpu_alloc_mutex);
                                       lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/91:
         #0: (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
         #1: (shrinker_rwsem){++++}, at: shrink_slab+0x12f/0x2c0
         #2: (&type->s_umount_key#43){++++}, at: trylock_super+0x16/0x50
      
        stack backtrace:
        CPU: 1 PID: 91 Comm: kswapd0 Not tainted 5.6.0-rc7-btrfs-next-77 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8f/0xd0
         check_noncircular+0x170/0x190
         __lock_acquire+0xef0/0x1c80
         lock_acquire+0xa2/0x1d0
         __mutex_lock+0xa9/0xaf0
         __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
         btrfs_evict_inode+0x40d/0x560 [btrfs]
         evict+0xd9/0x1c0
         dispose_list+0x48/0x70
         prune_icache_sb+0x54/0x80
         super_cache_scan+0x124/0x1a0
         do_shrink_slab+0x176/0x440
         shrink_slab+0x23a/0x2c0
         shrink_node+0x188/0x6e0
         balance_pgdat+0x31d/0x7f0
         kswapd+0x238/0x550
         kthread+0x120/0x140
         ret_from_fork+0x3a/0x50
      
      This could be fixed by making btrfs pass GFP_NOFS instead of GFP_KERNEL
      to percpu_counter_init() in contexts where it is not reclaim-safe;
      however, that type of approach has been discouraged since
      memalloc_[nofs|noio]_save() were introduced.  Therefore this change
      makes pcpu_alloc() check for an existing nofs/noio context before
      deciding whether it is in an atomic context or not.
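
      The idea, sketched in userspace terms (the flag bits and the
      thread-local flag are invented stand-ins for gfp flags and the
      memalloc_nofs_save() state, not the kernel API):

      #include <stdbool.h>

      #define GFP_WAIT 0x1
      #define GFP_FS   0x2
      #define GFP_FULL (GFP_WAIT | GFP_FS)    /* stand-in for GFP_KERNEL */

      /* Set by a memalloc_nofs_save() analogue for the current thread. */
      static __thread bool nofs_context;

      static int effective_gfp(int gfp)
      {
              return nofs_context ? (gfp & ~GFP_FS) : gfp;
      }

      /* Treat the allocation as "atomic" (no mutex, no reclaim) unless the
       * effective flags still allow the full blocking path. */
      static bool treat_as_atomic(int gfp)
      {
              return (effective_gfp(gfp) & GFP_FULL) != GFP_FULL;
      }

      int main(void)
      {
              nofs_context = true;
              /* Even a "full" request must be treated as atomic under NOFS. */
              return treat_as_atomic(GFP_FULL) ? 0 : 1;
      }
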
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Link: http://lkml.kernel.org/r/20200430164356.15543-1-fdmanana@kernel.org
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 01 Apr, 2020 1 commit
  21. 20 Jan, 2020 1 commit
  22. 04 Sep, 2019 1 commit
    • percpu: Use struct_size() helper · 14d37612
      Gustavo A. R. Silva authored
      
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct pcpu_alloc_info {
      	...
              struct pcpu_group_info  groups[];
      };
      
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes.
      
      So, replace the following form:
      
      sizeof(*ai) + nr_groups * sizeof(ai->groups[0])
      
      with:
      
      struct_size(ai, groups, nr_groups)
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
  23. 23 Jul, 2019 1 commit
  24. 04 Jul, 2019 1 commit
  25. 05 Jun, 2019 1 commit
  26. 08 May, 2019 1 commit
    • percpu: remove spurious lock dependency between percpu and sched · 198790d9
      John Sperbeck authored
      
      In free_percpu() we sometimes call pcpu_schedule_balance_work() to
      queue a work item (which does a wakeup) while holding pcpu_lock.
      This creates an unnecessary lock dependency between pcpu_lock and
      the scheduler's pi_lock.  There are other places where we call
      pcpu_schedule_balance_work() without holding pcpu_lock, and this case
      doesn't need to be different.
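
      The shape of the fix, as a userspace sketch (a pthread mutex stands in
      for pcpu_lock and the worker wakeup is reduced to a stub):

      #include <pthread.h>
      #include <stdbool.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      /* Stub for the wakeup path that may take scheduler/workqueue locks. */
      static void schedule_balance_work(void) { }

      static void free_object(void)
      {
              bool need_balance;

              pthread_mutex_lock(&lock);
              /* ... return the object to its chunk, update counters ... */
              need_balance = true;            /* decide under the lock */
              pthread_mutex_unlock(&lock);

              /* Act on the decision only after dropping the lock, so the
               * wakeup never nests inside it. */
              if (need_balance)
                      schedule_balance_work();
      }

      int main(void)
      {
              free_object();
              return 0;
      }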
      
      Moving the call outside the lock prevents the following lockdep splat
      when running tools/testing/selftests/bpf/{test_maps,test_progs} in
      sequence with lockdep enabled:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.1.0-dbg-DEV #1 Not tainted
      ------------------------------------------------------
      kworker/23:255/18872 is trying to acquire lock:
      000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520
      
      but task is already holding lock:
      00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (pcpu_lock){..-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             pcpu_alloc+0xfa/0x780
             __alloc_percpu_gfp+0x12/0x20
             alloc_htab_elem+0x184/0x2b0
             __htab_percpu_map_update_elem+0x252/0x290
             bpf_percpu_hash_update+0x7c/0x130
             __do_sys_bpf+0x1912/0x1be0
             __x64_sys_bpf+0x1a/0x20
             do_syscall_64+0x59/0x400
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #3 (&htab->buckets[i].lock){....}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             htab_map_update_elem+0x1af/0x3a0
      
      -> #2 (&rq->lock){-.-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock+0x2f/0x40
             task_fork_fair+0x37/0x160
             sched_fork+0x211/0x310
             copy_process.part.43+0x7b1/0x2160
             _do_fork+0xda/0x6b0
             kernel_thread+0x29/0x30
             rest_init+0x22/0x260
             arch_call_rest_init+0xe/0x10
             start_kernel+0x4fd/0x520
             x86_64_start_reservations+0x24/0x26
             x86_64_start_kernel+0x6f/0x72
             secondary_startup_64+0xa4/0xb0
      
      -> #1 (&p->pi_lock){-.-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             try_to_wake_up+0x41/0x600
             wake_up_process+0x15/0x20
             create_worker+0x16b/0x1e0
             workqueue_init+0x279/0x2ee
             kernel_init_freeable+0xf7/0x288
             kernel_init+0xf/0x180
             ret_from_fork+0x24/0x30
      
      -> #0 (&(&pool->lock)->rlock){-.-.}:
             __lock_acquire+0x101f/0x12a0
             lock_acquire+0x9e/0x180
             _raw_spin_lock+0x2f/0x40
             __queue_work+0xb2/0x520
             queue_work_on+0x38/0x80
             free_percpu+0x221/0x260
             pcpu_freelist_destroy+0x11/0x20
             stack_map_free+0x2a/0x40
             bpf_map_free_deferred+0x3c/0x50
             process_one_work+0x1f7/0x580
             worker_thread+0x54/0x410
             kthread+0x10f/0x150
             ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcpu_lock);
                                     lock(&htab->buckets[i].lock);
                                     lock(pcpu_lock);
        lock(&(&pool->lock)->rlock);
      
       *** DEADLOCK ***
      
      3 locks held by kworker/23:255/18872:
       #0: 00000000b36a6e16 ((wq_completion)events){+.+.},
           at: process_one_work+0x17a/0x580
       #1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.},
           at: process_one_work+0x17a/0x580
       #2: 00000000e3e7a6aa (pcpu_lock){..-.},
           at: free_percpu+0x36/0x260
      
      stack backtrace:
      CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1
      Hardware name: ...
      Workqueue: events bpf_map_free_deferred
      Call Trace:
       dump_stack+0x67/0x95
       print_circular_bug.isra.38+0x1c6/0x220
       check_prev_add.constprop.50+0x9f6/0xd20
       __lock_acquire+0x101f/0x12a0
       lock_acquire+0x9e/0x180
       _raw_spin_lock+0x2f/0x40
       __queue_work+0xb2/0x520
       queue_work_on+0x38/0x80
       free_percpu+0x221/0x260
       pcpu_freelist_destroy+0x11/0x20
       stack_map_free+0x2a/0x40
       bpf_map_free_deferred+0x3c/0x50
       process_one_work+0x1f7/0x580
       worker_thread+0x54/0x410
       kthread+0x10f/0x150
       ret_from_fork+0x24/0x30
      Signed-off-by: John Sperbeck <jsperbeck@google.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
  27. 18 Mar, 2019 1 commit
  28. 13 Mar, 2019 7 commits
    • percpu: use chunk scan_hint to skip some scanning · d33d9f3d
      Dennis Zhou authored
      
      Just like blocks, chunks now maintain a scan_hint. This can be used to
      skip some scanning by promoting the scan_hint to be the contig_hint.
      The chunk's scan_hint is primarily updated on the backside and relies on
      full scanning when a block becomes free or the free region spans across
      blocks.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Peng Fan <peng.fan@nxp.com>
    • percpu: convert chunk hints to be based on pcpu_block_md · 92c14cab
      Dennis Zhou authored
      
      As mentioned in the last patch, a chunk's hints are no different than a
      block just responsible for more bits. This converts chunk level hints to
      use a pcpu_block_md to maintain them. This lets us reuse the same hint
      helper functions as a block. The left_free and right_free are unused by
      the chunk's pcpu_block_md.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Peng Fan <peng.fan@nxp.com>
    • percpu: make pcpu_block_md generic · 047924c9
      Dennis Zhou authored
      
      In reality, a chunk is just a block covering a larger number of bits.
      The hints themselves are one and the same. Rather than maintaining the
      hints separately, first introduce nr_bits to genericize
      pcpu_block_update() to correctly maintain block->right_free. The next
      patch will convert chunk hints to be managed as a pcpu_block_md.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Peng Fan <peng.fan@nxp.com>
    • percpu: use block scan_hint to only scan forward · da3afdd5
      Dennis Zhou authored
      
      Blocks now remember the latest scan_hint. This can be used on the
      allocation path as when a contig_hint is broken, we can promote the
      scan_hint to the contig_hint and scan forward from there. This works
      because pcpu_block_refresh_hint() is only called on the allocation path
      while block free regions are updated manually in
      pcpu_block_update_hint_free().
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • percpu: remember largest area skipped during allocation · b89462a9
      Dennis Zhou authored
      
      Percpu allocations attempt to do first fit by scanning forward from the
      first_free of a block. However, fragmentation from allocation requests
      can cause holes not seen by block hint update functions. To address
      this, create a local version of bitmap_find_next_zero_area_off() that
      remembers the largest area skipped over. The caveat is that it only sees
      regions skipped over due to not fitting, not regions skipped due to
      alignment.
      
      Prior to updating the scan_hint, a scan backwards is done to try and
      recover free bits skipped due to alignment. While this can cause
      scanning to miss earlier possible free areas, smaller allocations will
      eventually fill those holes due to first fit.
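
      A compact version of such a scan over a boolean free map: first fit
      that also reports the largest free run it skipped because the run was
      too small (illustrative only, not the kernel helper):

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      static size_t find_fit(const bool *is_free, size_t nbits, size_t want,
                             size_t *largest_skipped)
      {
              size_t run = 0;

              *largest_skipped = 0;
              for (size_t i = 0; i < nbits; i++) {
                      if (is_free[i]) {
                              if (++run == want)
                                      return i + 1 - want;   /* start of the fit */
                      } else {
                              if (run > *largest_skipped)
                                      *largest_skipped = run;
                              run = 0;
                      }
              }
              if (run > *largest_skipped)
                      *largest_skipped = run;
              return nbits;                                  /* no fit found */
      }

      int main(void)
      {
              bool map[] = { 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0 };
              size_t skipped;
              size_t at = find_fit(map, sizeof(map) / sizeof(map[0]), 4, &skipped);

              printf("fit at %zu, largest skipped run %zu\n", at, skipped);
              return 0;
      }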
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
    • percpu: add block level scan_hint · 382b88e9
      Dennis Zhou authored
      
      Fragmentation can cause both blocks and chunks to have an early
      first_free bit available, but only be able to satisfy allocations much
      later on. This patch introduces a scan_hint to help mitigate some
      unnecessary scanning.
      
      The scan_hint remembers the largest area prior to the contig_hint. If
      the contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
      This is necessary for scan_hint discovery when refreshing a block.
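
      As a brute-force illustration of the two hints over a boolean free map
      (the kernel maintains them incrementally and also handles the
      equal-size case described above; this only spells out the basic
      definition):

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      struct hints {
              size_t contig_start, contig_len;  /* largest free run */
              size_t scan_start, scan_len;      /* largest free run before it */
      };

      static void refresh_hints(const bool *is_free, size_t nbits, struct hints *h)
      {
              size_t start = 0, run = 0;

              h->contig_start = h->contig_len = 0;
              h->scan_start = h->scan_len = 0;

              /* pass 1: find the largest free run (the contig hint) */
              for (size_t i = 0; i <= nbits; i++) {
                      if (i < nbits && is_free[i]) {
                              if (!run)
                                      start = i;
                              run++;
                              continue;
                      }
                      if (run > h->contig_len) {
                              h->contig_len = run;
                              h->contig_start = start;
                      }
                      run = 0;
              }

              /* pass 2: largest free run that ends before the contig hint */
              run = 0;
              for (size_t i = 0; i < h->contig_start; i++) {
                      if (!is_free[i]) {
                              run = 0;
                              continue;
                      }
                      if (!run)
                              start = i;
                      if (++run > h->scan_len) {
                              h->scan_len = run;
                              h->scan_start = start;
                      }
              }
      }

      int main(void)
      {
              bool map[] = { 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1 };
              struct hints h;

              refresh_hints(map, sizeof(map) / sizeof(map[0]), &h);
              printf("contig %zu@%zu, scan %zu@%zu\n",
                     h.contig_len, h.contig_start, h.scan_len, h.scan_start);
              return 0;
      }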
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Peng Fan <peng.fan@nxp.com>
    • percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE · b239f7da
      Dennis Zhou authored
      
      Previously, block size was flexible based on the constraint that the
      GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the
      overhead that keeping a floating number of populated free pages required
      scanning over the free regions of a chunk.
      
      Setting the block size to be fixed at PAGE_SIZE lets us know when an
      empty page becomes used as we will break a full contig_hint of a block.
      This means we no longer have to scan the whole chunk upon breaking a
      contig_hint, which is what empty page management piggybacked off of. A
      later patch takes advantage of this to optimize the allocation path by
      only scanning forward, using the scan_hint that is also introduced later.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Peng Fan <peng.fan@nxp.com>