1. 18 Nov, 2021 5 commits
    • mm, oom: do not trigger out_of_memory from the #PF · c15aeead
      Michal Hocko authored
      commit 60e2793d upstream.
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM
      which in turn results in pagefault_out_of_memory.  This can happen for 2
      different reasons.  a) Memcg is out of memory and we rely on
      mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
      normal allocation fails.
      
      The latter is quite problematic because allocation paths already trigger
      out_of_memory and the page allocator tries really hard to not fail
      allocations.  Anyway, if the OOM killer has already been invoked there
      is no reason to invoke it again from the #PF path, especially since the
      OOM condition might be gone by that time and we have no way to find out
      other than by allocating.
      
      Moreover, if the allocation failed and the OOM killer hasn't been invoked
      then we are unlikely to do the right thing from the #PF context, because
      we have already lost the allocation context and restrictions and
      therefore might OOM-kill a task from a different NUMA domain.
      
      This all suggests that there is no legitimate reason to trigger
      out_of_memory from pagefault_out_of_memory, so drop it.  Just to be sure
      that no #PF path returns with VM_FAULT_OOM without an allocation attempt,
      print a warning that this is happening before we restart the #PF.
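      
      As a rough illustration of the resulting behaviour, here is a hedged
      userspace model with stand-in helpers (not the kernel code): when the
      memcg OOM handling does not resolve the situation, the #PF path now only
      warns and lets the fault be retried instead of calling out_of_memory().
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* stand-in for the memcg OOM synchronization outcome */
        static bool memcg_oom_handled(void)
        {
                return false;
        }
        
        static void pagefault_out_of_memory_model(void)
        {
                if (memcg_oom_handled())
                        return;         /* memcg OOM handled at the charge level */
        
                /* no out_of_memory() call any more: warn and let the #PF retry */
                fprintf(stderr, "VM_FAULT_OOM leaked to the #PF path, retrying\n");
        }
        
        int main(void)
        {
                pagefault_out_of_memory_model();
                return 0;
        }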
      
      [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller.
      This is a local problem related to memcg; however, it causes unnecessary
      global OOM kills that are repeated over and over again and escalate into a
      real disaster.  This has been broken since kmem accounting was introduced
      for cgroup v1 (3.8).  There was no kmem-specific reclaim for the separate
      limit, so the only way to handle the kmem hard limit was to return with
      ENOMEM.  Upstream, the problem will be fixed by removing the
      outdated kmem limit, however stable and LTS kernels cannot do it and are
      still affected.  This patch fixes the problem and should be backported
      into stable/LTS.]
      
      Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
      
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c15aeead
    • mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks · 487a4c60
      Vasily Averin authored
      commit 0b28179a upstream.
      
      Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
      
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  This can be misused to trigger a global OOM from inside a
      memcg-limited container.  On the other hand, if memcg fails an allocation
      called from inside the #PF handler, it triggers a global OOM from inside
      pagefault_out_of_memory().
      
      To prevent these problems, this patchset:
       (a) removes execution of out_of_memory() from
           pagefault_out_of_memory(), because nobody can explain why it is
           necessary.
       (b) allows memcg to fail allocations of dying/killed tasks.
      
      This patch (of 3):
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM,
      which in turn results in pagefault_out_of_memory, which in turn executes
      out_of_memory() and can kill a random task.
      
      An allocation might fail when the current task is the oom victim and
      there are no memory reserves left.  The OOM killer is already handled at
      the page allocator level for the global OOM and at the charging level
      for the memcg one.  Both have much more information about the scope of
      allocation/charge request.  This means that either the OOM killer has
      been invoked properly and didn't lead to the allocation success or it
      has been skipped because it couldn't have been invoked.  In both cases
      triggering it from here is pointless and even harmful.
      
      It makes much more sense to let the killed task die rather than to wake
      up an eternally hungry oom-killer and send him to choose a fatter victim
      for breakfast.
      
      Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      487a4c60
    • memcg: prohibit unconditional exceeding the limit of dying tasks · f1e83db2
      Vasily Averin authored
      commit a4ebf1b6 upstream.
      
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It is assumed that the amount of memory charged by those tasks
      is bounded and that most of the memory will be released while the task is
      exiting.  This resembles the heuristic for the global OOM situation when
      tasks get access to memory reserves.  There is no global memory shortage
      at the memcg level, so the memcg heuristic is more relaxed.
      
      The above assumption is overly optimistic though.  E.g.  vmalloc can
      scale to really large requests and the heuristic would allow that.  We
      used to have an early break in the vmalloc allocator for killed tasks
      but this has been reverted by commit b8c8a338 ("Revert "vmalloc:
      back off when the current task is killed"").  There are likely other
      similar code paths which do not check for fatal signals in an
      allocation&charge loop.  Also, there are some kernel objects charged to a
      memcg which are not bound to a process lifetime.
      
      It has been observed that it is not really hard to trigger these
      bypasses and cause global OOM situation.
      
      One potential way to address these runaways would be to limit the amount
      of excess (similar to the global OOM with limited oom reserves).  This
      is certainly possible but it is not really clear how much of an excess
      is desirable and still protects from global OOMs as that would have to
      consider the overall memcg configuration.
      
      This patch addresses the problem by removing the heuristic altogether.
      The bypass is only allowed for requests which either cannot fail or where
      failure is not desirable while the excess should still be limited
      (e.g. atomic requests).  Implementation-wise, a killed or dying task
      fails to charge if it has passed the OOM killer stage.  That should give
      all forms of reclaim a chance to restore the limit before the failure
      (ENOMEM) and tell the caller to back off.
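      
      As a toy model of that behaviour (hypothetical helper names, not the
      actual memcg charge path): once the task is dying and the OOM killer
      stage has already been passed, the charge fails instead of bypassing the
      limit, so the caller backs off with ENOMEM.
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* stand-in for the fatal_signal_pending()/PF_EXITING style checks */
        static bool task_is_dying(void)
        {
                return true;
        }
        
        /* -1 stands in for -ENOMEM */
        static int try_charge_model(bool passed_oom_stage)
        {
                /* Old behaviour: dying tasks bypassed the limit unconditionally.
                 * New behaviour: once reclaim and the OOM stage have run, fail
                 * the charge and let the caller back off. */
                if (task_is_dying() && passed_oom_stage)
                        return -1;
                return 0;       /* charge proceeds */
        }
        
        int main(void)
        {
                printf("charge after OOM stage: %d\n", try_charge_model(true));
                return 0;
        }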
      
      In addition, this patch renames the should_force_charge() helper to
      task_is_dying() because its use is no longer associated with forced
      charging.
      
      This patch depends on pagefault_out_of_memory() to not trigger
      out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
      and cause a global OOM killer.
      
      Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f1e83db2
    • mm/filemap.c: remove bogus VM_BUG_ON · 6560e8cd
      Matthew Wilcox (Oracle) authored
      commit d417b49f upstream.
      
      It is not safe to check page->index without holding the page lock.  It
      can be changed if the page is moved between the swap cache and the page
      cache for a shmem file, for example.  There is a VM_BUG_ON below which
      checks page->index is correct after taking the page lock.
      
      Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
      Fixes: 5c211ba2 ("mm: add and use find_lock_entries")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: <syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6560e8cd
    • mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration() · 3d6b113c
      Miaohe Lin authored
      [ Upstream commit afe8605c ]
      
      There is one possible race window between zs_pool_dec_isolated() and
      zs_unregister_migration() because wait_for_isolated_drain() checks the
      isolated count without holding class->lock and there is no ordering inside
      zs_pool_dec_isolated().  Thus the below race window is possible:
      
        zs_pool_dec_isolated		zs_unregister_migration
          check pool->destroying != 0
      				  pool->destroying = true;
      				  smp_mb();
      				  wait_for_isolated_drain()
      				    wait for pool->isolated_pages == 0
          atomic_long_dec(&pool->isolated_pages);
          atomic_long_read(&pool->isolated_pages) == 0
      
      Since pool->destroying is observed as false before the atomic_long_dec()
      of pool->isolated_pages, the wakeup of pool->migration_wait is missed.
      
      Fix this by ensuring that the check of pool->destroying happens after
      the atomic_long_dec(&pool->isolated_pages).
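      
      The fixed ordering can be modelled in userspace (an illustrative analogue
      with pthreads and C11 atomics, not the zsmalloc code): the decrementing
      side decrements the counter first and only then checks the destroying
      flag, so the waiter can no longer miss its wakeup.
      
        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>
        
        static atomic_long isolated = 1;        /* pool->isolated_pages */
        static atomic_bool destroying;          /* pool->destroying */
        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
        
        static void *dec_isolated(void *arg)    /* zs_pool_dec_isolated() role */
        {
                atomic_fetch_sub(&isolated, 1); /* decrement first ... */
                if (atomic_load(&destroying)) { /* ... then check the flag */
                        pthread_mutex_lock(&lock);
                        pthread_cond_signal(&drained);  /* the wakeup that used to be missed */
                        pthread_mutex_unlock(&lock);
                }
                return NULL;
        }
        
        static void *unregister(void *arg)      /* zs_unregister_migration() role */
        {
                atomic_store(&destroying, true);        /* + smp_mb() in the kernel */
                pthread_mutex_lock(&lock);
                while (atomic_load(&isolated) != 0)     /* wait_for_isolated_drain() */
                        pthread_cond_wait(&drained, &lock);
                pthread_mutex_unlock(&lock);
                puts("drained");
                return NULL;
        }
        
        int main(void)
        {
                pthread_t a, b;
                pthread_create(&a, NULL, unregister, NULL);
                pthread_create(&b, NULL, dec_isolated, NULL);
                pthread_join(a, NULL);
                pthread_join(b, NULL);
                return 0;
        }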
      
      Link: https://lkml.kernel.org/r/20210708115027.7557-1-linmiaohe@huawei.com
      Fixes: 701d6785 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Henry Burns <henryburns@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3d6b113c
  2. 12 Nov, 2021 1 commit
  3. 29 Oct, 2021 9 commits
  4. 25 Oct, 2021 1 commit
    • secretmem: Prevent secretmem_users from wrapping to zero · cb685432
      Matthew Wilcox (Oracle) authored
      Commit 11086054 ("mm/secretmem: use refcount_t instead of atomic_t")
      attempted to fix the problem of secretmem_users wrapping to zero and
      allowing suspend once again.
      
      But it was reverted in commit 87066fdd ("Revert 'mm/secretmem: use
      refcount_t instead of atomic_t'") because of the problems it caused - a
      refcount_t was not semantically the right type to use.
      
      Instead prevent secretmem_users from wrapping to zero by forbidding new
      users if the number of users has wrapped from positive to negative.
      This stops a long way short of reaching the necessary 4 billion users
      where it wraps to zero again, so there's no need to be clever with
      special anti-wrap types or checking the return value from atomic_inc().
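      
      A minimal userspace sketch of that check (a hypothetical model; the
      function name and errno choice are illustrative, not the kernel code):
      
        #include <errno.h>
        #include <stdatomic.h>
        #include <stdio.h>
        
        static atomic_long secretmem_users;
        
        static int secretmem_new_user(void)
        {
                /* Refuse new users once the signed count has gone negative;
                 * that is billions of users short of wrapping back to zero. */
                if (atomic_load(&secretmem_users) < 0)
                        return -ENFILE;
                atomic_fetch_add(&secretmem_users, 1);
                return 0;
        }
        
        int main(void)
        {
                printf("new user: %d\n", secretmem_new_user());
                return 0;
        }
      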
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jordy Zomer <jordy@pwning.systems>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb685432
  5. 24 Oct, 2021 1 commit
    • Revert "mm/secretmem: use refcount_t instead of atomic_t" · 87066fdd
      Linus Torvalds authored
      This reverts commit 11086054.
      
      Converting the "secretmem_users" counter to a refcount is incorrect,
      because a refcount is special at zero and can't just be incremented (but
      a count of users is not, and "no users" is actually perfectly valid and
      not a sign of a freed resource).
      
      Reported-by: syzbot+75639e6a0331cd61d3e2@syzkaller.appspotmail.com
      Cc: Jordy Zomer <jordy@pwning.systems>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jordy Zomer <jordy@jordyzomer.github.io>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87066fdd
  6. 22 Oct, 2021 2 commits
    • memblock: exclude MEMBLOCK_NOMAP regions from kmemleak · 658aafc8
      Mike Rapoport authored
      Vladimir Zapolskiy reports:
      
      Commit a7259df7 ("memblock: make memblock_find_in_range method
      private") causes a kernel panic while running kmemleak on OF platforms
      with nomap regions:
      
        Unable to handle kernel paging request at virtual address fff000021e00000
        [...]
          scan_block+0x64/0x170
          scan_gray_list+0xe8/0x17c
          kmemleak_scan+0x270/0x514
          kmemleak_write+0x34c/0x4ac
      
      The memory allocated from memblock is registered with kmemleak, but if
      it is marked MEMBLOCK_NOMAP it won't have linear map entries so an
      attempt to scan such areas will fault.
      
      Ideally, memblock_mark_nomap() would inform kmemleak to ignore
      MEMBLOCK_NOMAP memory, but it can be called before kmemleak interfaces
      operating on physical addresses can use __va() conversion.
      
      Make sure that functions that mark allocated memory as MEMBLOCK_NOMAP
      take care of informing kmemleak to ignore such memory.
      
      Link: https://lore.kernel.org/all/8ade5174-b143-d621-8c8e-dc6a1898c6fb@linaro.org
      Link: https://lore.kernel.org/all/c30ff0a2-d196-c50d-22f0-bd50696b1205@quicinc.com
      Fixes: a7259df7 ("memblock: make memblock_find_in_range method private")
      Reported-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Tested-by: Vladimir Zapolskiy <vladimir.zapolskiy@linaro.org>
      Tested-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      658aafc8
    • Revert "memblock: exclude NOMAP regions from kmemleak" · 6c9a5455
      Mike Rapoport authored
      Commit 6e44bd6d ("memblock: exclude NOMAP regions from kmemleak")
      breaks boot on EFI systems with kmemleak and VM_DEBUG enabled:
      
        efi: Processing EFI memory map:
        efi:   0x000090000000-0x000091ffffff [Conventional|   |  |  |  |  |  |  |  |  |   |WB|WT|WC|UC]
        efi:   0x000092000000-0x0000928fffff [Runtime Data|RUN|  |  |  |  |  |  |  |  |   |WB|WT|WC|UC]
        ------------[ cut here ]------------
        kernel BUG at mm/kmemleak.c:1140!
        Internal error: Oops - BUG: 0 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.0-rc6-next-20211019+ #104
        pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
        pc : kmemleak_free_part_phys+0x64/0x8c
        lr : kmemleak_free_part_phys+0x38/0x8c
        sp : ffff800011eafbc0
        x29: ffff800011eafbc0 x28: 1fffff7fffb41c0d x27: fffffbfffda0e068
        x26: 0000000092000000 x25: 1ffff000023d5f94 x24: ffff800011ed84d0
        x23: ffff800011ed84c0 x22: ffff800011ed83d8 x21: 0000000000900000
        x20: ffff800011782000 x19: 0000000092000000 x18: ffff800011ee0730
        x17: 0000000000000000 x16: 0000000000000000 x15: 1ffff0000233252c
        x14: ffff800019a905a0 x13: 0000000000000001 x12: ffff7000023d5ed7
        x11: 1ffff000023d5ed6 x10: ffff7000023d5ed6 x9 : dfff800000000000
        x8 : ffff800011eaf6b7 x7 : 0000000000000001 x6 : ffff800011eaf6b0
        x5 : 00008ffffdc2a12a x4 : ffff7000023d5ed7 x3 : 1ffff000023dbf99
        x2 : 1ffff000022f0463 x1 : 0000000000000000 x0 : ffffffffffffffff
        Call trace:
         kmemleak_free_part_phys+0x64/0x8c
         memblock_mark_nomap+0x5c/0x78
         reserve_regions+0x294/0x33c
         efi_init+0x2d0/0x490
         setup_arch+0x80/0x138
         start_kernel+0xa0/0x3ec
         __primary_switched+0xc0/0xc8
        Code: 34000041 97d526e7 f9418e80 36000040 (d4210000)
        random: get_random_bytes called from print_oops_end_marker+0x34/0x80 with crng_init=0
        ---[ end trace 0000000000000000 ]---
      
      The crash happens because kmemleak_free_part_phys() tries to use __va()
      before memstart_addr is initialized and this triggers a VM_BUG_ON() in
      arch/arm64/include/asm/memory.h.
      
      Revert 6e44bd6d ("memblock: exclude NOMAP regions from kmemleak");
      the issue it is fixing will be fixed differently.
      Reported-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c9a5455
  7. 19 Oct, 2021 11 commits
    • mm/thp: decrease nr_thps in file's mapping on THP split · 1ca7554d
      Marek Szyprowski authored
      Decrease the nr_thps counter in the file's mapping to ensure that the
      page cache won't be dropped excessively on file write access if the page
      has already been split.
      
      I've tried a test scenario running a big binary: the kernel remaps it
      with THPs, then a THP split is forced with
      /sys/kernel/debug/split_huge_pages.  During any further open of that
      binary with O_RDWR or O_WRONLY the kernel drops the page cache for it,
      because of the non-zero thps counter.
      
      Link: https://lkml.kernel.org/r/20211012120237.2600-1-m.szyprowski@samsung.com
      
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Fixes: 09d91cda ("mm,thp: avoid writes to file with THP in pagecache")
      Fixes: 06d3eff6 ("mm/thp: fix node page state in split_huge_page_to_list()")
      Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: <sfoon.kim@samsung.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1ca7554d
    • mm, slub: fix incorrect memcg slab count for bulk free · 3ddd6026
      Miaohe Lin authored
      kmem_cache_free_bulk() will call memcg_slab_free_hook() for all objects
      when doing bulk free.  So we shouldn't call memcg_slab_free_hook() again
      for bulk free to avoid incorrect memcg slab count.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-6-linmiaohe@huawei.com
      Fixes: d1b2cf6c ("mm: memcg/slab: uncharge during kmem_cache_free_bulk()")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ddd6026
    • mm, slub: fix potential use-after-free in slab_debugfs_fops · 67823a54
      Miaohe Lin authored
      When sysfs_slab_add() fails, we shouldn't call debugfs_slab_add() for s
      because s will be freed soon.  Otherwise slab_debugfs_fops will use s
      later, leading to a use-after-free.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-5-linmiaohe@huawei.com
      Fixes: 64dd6849 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      67823a54
    • mm, slub: fix potential memoryleak in kmem_cache_open() · 9037c576
      Miaohe Lin authored
      In the error path, the random_seq of the slub cache might be leaked.  Fix
      this by using __kmem_cache_release() to release all the relevant resources.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-4-linmiaohe@huawei.com
      Fixes: 210e7a43 ("mm: SLUB freelist randomization")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9037c576
    • mm, slub: fix mismatch between reconstructed freelist depth and cnt · 899447f6
      Miaohe Lin authored
      If an object's reuse is delayed, it will be excluded from the
      reconstructed freelist.  But we forgot to adjust the cnt accordingly, so
      there will be a mismatch between the reconstructed freelist depth and
      cnt.  This will lead to free_debug_processing() complaining about the
      freelist count or an incorrect slub inuse count.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-3-linmiaohe@huawei.com
      Fixes: c3895391 ("kasan, slub: fix handling of kasan_slab_free hook")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      899447f6
    • mm, slub: fix two bugs in slab_debug_trace_open() · 2127d225
      Miaohe Lin authored
      Patch series "Fixups for slub".
      
      This series contains various bug fixes for slub.  We fix a memory leak,
      a use-after-free, a NULL pointer dereference and so on in slub.  More
      details can be found in the respective changelogs.
      
      This patch (of 5):
      
      It's possible that __seq_open_private() will return NULL.  So we should
      check it before use lest we dereference a NULL pointer.  Also, in the
      error paths we forgot to release the private buffer via
      seq_release_private(), so memory will leak in these paths.
      
      Link: https://lkml.kernel.org/r/20210916123920.48704-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210916123920.48704-2-linmiaohe@huawei.com
      Fixes: 64dd6849 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Faiyaz Mohammed <faiyazm@codeaurora.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2127d225
    • mm/mempolicy: do not allow illegal MPOL_F_NUMA_BALANCING | MPOL_LOCAL in mbind() · 6d2aec9e
      Eric Dumazet authored
      syzbot reported access to uninitialized memory in mbind() [1].
      
      The issue came with commit bda420b9 ("numa balancing: migrate on fault
      among multiple bound nodes").
      
      This commit added a new bit to MPOL_MODE_FLAGS, but only checked the valid
      combination (MPOL_F_NUMA_BALANCING can only be used with MPOL_BIND) in
      do_set_mempolicy().
      
      This patch moves the check into sanitize_mpol_flags() so that it is also
      used by mbind().
      
        [1]
        BUG: KMSAN: uninit-value in __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        Uninit was created at:
         slab_alloc_node mm/slub.c:3221 [inline]
         slab_alloc mm/slub.c:3230 [inline]
         kmem_cache_alloc+0x751/0xff0 mm/slub.c:3235
         mpol_new mm/mempolicy.c:293 [inline]
         do_mbind+0x912/0x15f0 mm/mempolicy.c:1289
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        =====================================================
        Kernel panic - not syncing: panic_on_kmsan set ...
        CPU: 0 PID: 15049 Comm: syz-executor.0 Tainted: G    B             5.15.0-rc2-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:88 [inline]
         dump_stack_lvl+0x1ff/0x28e lib/dump_stack.c:106
         dump_stack+0x25/0x28 lib/dump_stack.c:113
         panic+0x44f/0xdeb kernel/panic.c:232
         kmsan_report+0x2ee/0x300 mm/kmsan/report.c:186
         __msan_warning+0xd7/0x150 mm/kmsan/instrumentation.c:208
         __mpol_equal+0x567/0x590 mm/mempolicy.c:2260
         mpol_equal include/linux/mempolicy.h:105 [inline]
         vma_merge+0x4a1/0x1e60 mm/mmap.c:1190
         mbind_range+0xcc8/0x1e80 mm/mempolicy.c:811
         do_mbind+0xf42/0x15f0 mm/mempolicy.c:1333
         kernel_mbind mm/mempolicy.c:1483 [inline]
         __do_sys_mbind mm/mempolicy.c:1490 [inline]
         __se_sys_mbind+0x437/0xb80 mm/mempolicy.c:1486
         __x64_sys_mbind+0x19d/0x200 mm/mempolicy.c:1486
         do_syscall_x64 arch/x86/entry/common.c:51 [inline]
         do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
         entry_SYSCALL_64_after_hwframe+0x44/0xae
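      
      A small userspace sketch of the now-rejected combination (assumes a
      kernel with this fix; the MPOL_* values below are copied from
      uapi/linux/mempolicy.h, and EINVAL is the expected result):
      
        #define _GNU_SOURCE
        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        
        #define MPOL_LOCAL              4
        #define MPOL_F_NUMA_BALANCING   (1 << 13)       /* only valid with MPOL_BIND */
        
        int main(void)
        {
                size_t len = 4096;
                void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        
                long ret = syscall(SYS_mbind, p, len,
                                   MPOL_LOCAL | MPOL_F_NUMA_BALANCING,
                                   NULL, 0UL, 0);
                printf("mbind: %ld (%s)\n", ret, ret ? strerror(errno) : "ok");
                return 0;
        }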
      
      Link: https://lkml.kernel.org/r/20211001215630.810592-1-eric.dumazet@gmail.com
      Fixes: bda420b9 ("numa balancing: migrate on fault among multiple bound nodes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d2aec9e
    • memblock: check memory total_size · 5173ed72
      Peng Fan authored
      mem=[X][G|M] is broken on the ARM64 platform: there are cases where even
      though type.cnt is 1, total_size is not 0 because regions have been
      merged into one.  So checking only 'cnt' is not enough; total_size should
      be used, otherwise the bootarg 'mem=[X][G|M]' does not work anymore.
      
      Link: https://lkml.kernel.org/r/20210930024437.32598-1-peng.fan@oss.nxp.com
      Fixes: e888fa7b ("memblock: Check memory add/cap ordering")
      Signed-off-by: Peng Fan <peng.fan@nxp.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5173ed72
    • mm/migrate: fix CPUHP state to update node demotion order · a6a0251c
      Huang Ying authored
      The node demotion order needs to be updated during CPU hotplug, because
      whether a NUMA node has CPUs may influence the demotion order.  The
      update function should be called during CPU online/offline after
      node_states[N_CPU] has been updated.  That update is done in
      CPUHP_AP_ONLINE_DYN during CPU online and in CPUHP_MM_VMSTAT_DEAD during
      CPU offline.  But in commit 884a6e5d ("mm/migrate: update node
      demotion order on hotplug events"), the function to update the node
      demotion order is called in CPUHP_AP_ONLINE_DYN during CPU
      online/offline.  This doesn't satisfy the order requirement.
      
      For example, there are 4 CPUs (P0, P1, P2, P3) in 2 sockets (P0, P1 in S0
      and P2, P3 in S1), the demotion order is
      
       - S0 -> NUMA_NO_NODE
       - S1 -> NUMA_NO_NODE
      
      After P2 and P3 are offlined, because S1 has no CPU now, the demotion
      order should have been changed to
      
       - S0 -> S1
       - S1 -> NUMA_NO_NODE
      
      but it isn't changed, because the order updating callback for CPU
      hotplug doesn't see the new nodemask.  After that, if P1 is offlined,
      the demotion order is changed to the expected order as above.
      
      So in this patch, we added CPUHP_AP_MM_DEMOTION_ONLINE and
      CPUHP_MM_DEMOTION_DEAD to be called after CPUHP_AP_ONLINE_DYN and
      CPUHP_MM_VMSTAT_DEAD during CPU online and offline, and register the
      update function on them.
      
      Link: https://lkml.kernel.org/r/20210929060351.7293-1-ying.huang@intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6a0251c
    • mm/migrate: add CPU hotplug to demotion #ifdef · 76af6a05
      Dave Hansen authored
      Once upon a time, the node demotion updates were driven solely by memory
      hotplug events.  But now, there are handlers for both CPU and memory
      hotplug.
      
      However, the #ifdef around the code checks only memory hotplug.  A
      system that has HOTPLUG_CPU=y but MEMORY_HOTPLUG=n would miss CPU
      hotplug events.
      
      Update the #ifdef around the common code.  Add memory and CPU-specific
      #ifdefs for their handlers.  These memory/CPU #ifdefs avoid unused
      function warnings when their Kconfig option is off.
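      
      The rough shape of the resulting layout looks like the sketch below (an
      illustration of the structure described above, not the actual diff):
      
        #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_HOTPLUG_CPU)
        /* common node-demotion order update code */
        
        #ifdef CONFIG_MEMORY_HOTPLUG
        /* memory hotplug notifier; compiled out when MEMORY_HOTPLUG=n */
        #endif
        
        #ifdef CONFIG_HOTPLUG_CPU
        /* CPU hotplug callbacks; compiled out when HOTPLUG_CPU=n */
        #endif
        
        #endif  /* CONFIG_MEMORY_HOTPLUG || CONFIG_HOTPLUG_CPU */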
      
      [arnd@arndb.de: rework hotplug_memory_notifier() stub]
        Link: https://lkml.kernel.org/r/20211013144029.2154629-1-arnd@kernel.org
      
      Link: https://lkml.kernel.org/r/20210924161255.E5FE8F7E@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76af6a05
    • mm/migrate: optimize hotplug-time demotion order updates · 295be91f
      Dave Hansen authored
      Patch series "mm/migrate: 5.15 fixes for automatic demotion", v2.
      
      This contains two fixes for the "automatic demotion" code which was
      merged into 5.15:
      
       * Fix a memory hotplug performance regression by suppressing any real
         action on irrelevant hotplug events.
      
       * Ensure CPU hotplug handler is registered when memory hotplug
         is disabled.
      
      This patch (of 2):
      
      == tl;dr ==
      
      Automatic demotion opted for a simple, lazy approach to handling hotplug
      events.  This noticeably slows down memory hotplug[1].  Optimize away
      updates to the demotion order when memory hotplug events should have no
      effect.
      
      This has no effect on CPU hotplug.  There is no known problem on the CPU
      side and any work there will be in a separate series.
      
      == Background ==
      
      Automatic demotion is a memory migration strategy to ensure that new
      allocations have room in faster memory tiers on tiered memory systems.
      The kernel maintains an array (node_demotion[]) to drive these
      migrations.
      
      The node_demotion[] path is calculated by starting at nodes with CPUs
      and then "walking" to nodes with memory.  Only hotplug events which
      online or offline a node with memory (N_ONLINE) or CPUs (N_CPU) will
      actually affect the migration order.
      
      == Problem ==
      
      However, the current code is lazy.  It completely regenerates the
      migration order on *any* CPU or memory hotplug event.  The logic was
      that these events are extremely rare and that the overhead from
      indiscriminate order regeneration is minimal.
      
      Part of the update logic involves a synchronize_rcu(), which is a pretty
      big hammer.  Its overhead was large enough to be detected by some 0day
      tests that watch memory hotplug performance[1].
      
      == Solution ==
      
      Add a new helper (node_demotion_topo_changed()) which can differentiate
      between superfluous and impactful hotplug events.  Skip the expensive
      update operation for superfluous events.
      
      == Aside: Locking ==
      
      It took me a few moments to declare the locking to be safe enough for
      node_demotion_topo_changed() to work.  It all hinges on the memory
      hotplug lock:
      
      During memory hotplug events, 'mem_hotplug_lock' is held for write.
      This ensures that two memory hotplug events can not be called
      simultaneously.
      
      CPU hotplug has a similar lock (cpuhp_state_mutex) which also provides
      mutual exclusion between CPU hotplug events.  In addition, the demotion
      code acquires and holds the mem_hotplug_lock for read during its CPU
      hotplug handlers.  This provides mutual exclusion between the demotion
      memory hotplug callbacks and the CPU hotplug callbacks.
      
      This effectively allows the migration target generation code to be
      treated as if it were single-threaded.
      
      1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/
      
      Link: https://lkml.kernel.org/r/20210924161251.093CCD06@davehans-spike.ostc.intel.com
      Link: https://lkml.kernel.org/r/20210924161253.D7673E31@davehans-spike.ostc.intel.com
      Fixes: 884a6e5d ("mm/migrate: update node demotion order on hotplug events")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      295be91f
  8. 13 Oct, 2021 1 commit
  9. 24 Sep, 2021 8 commits
  10. 23 Sep, 2021 1 commit
    • memcg: flush lruvec stats in the refault · 1f828223
      Shakeel Butt authored
      Prior to the commit 7e1c0d6f ("memcg: switch lruvec stats to rstat")
      and the commit aa48e47e ("memcg: infrastructure to flush memcg
      stats"), each lruvec memcg stat could be off by (nr_cgroups * nr_cpus *
      32) at worst, and for an unbounded amount of time.  Commit 7e1c0d6f
      moved the lruvec stats to the rstat infrastructure and commit aa48e47e
      bounded the error for all the lruvec stats to (nr_cpus * 32) at worst
      for at most 2 seconds.  More specifically, this decoupled the number of
      stats and the number of cgroups from the error rate.
      
      However this reduction in error comes at the cost of triggering the
      slowpath of the stats update more frequently.  Previously, in the
      slowpath the kernel added the stats up the memcg tree.  After
      aa48e47e, the kernel triggers the async lruvec stats flush through
      queue_work().  This caused regression reports from the 0day kernel bot
      [1] as well as from the phoronix test suite [2].
      
      We tried two options to fix the regression:
      
       1) Increase the threshold to trigger the slowpath in lruvec stats
          update codepath from 32 to 512.
      
       2) Remove the slowpath from the lruvec stats update codepath and instead
          flush the stats in the page refault codepath. The assumption is that
          the kernel flushes the stats in a timely manner, so the update tree
          seen in the refault codepath stays small enough not to cause a
          performance impact.
      
      Following are the results of will-it-scale/page_fault[1|2|3] benchmark
      on four settings i.e.  (1) 5.15-rc1 as baseline (2) 5.15-rc1 with
      aa48e47e and 7e1c0d6f reverted (3) 5.15-rc1 with option-1
      (4) 5.15-rc1 with option-2.
      
        test       (1)      (2)               (3)               (4)
        pg_f1   368563   406277 (10.23%)   399693  (8.44%)   416398 (12.97%)
        pg_f2   338399   372133  (9.96%)   369180  (9.09%)   381024 (12.59%)
        pg_f3   500853   575399 (14.88%)   570388 (13.88%)   576083 (15.02%)
      
      From the above results, it seems that option-2 not only solves the
      regression but also improves the performance for at least these
      benchmarks.
      
      Feng Tang (Intel) ran the aim7 benchmark with these two options and
      confirms that option-1 reduces the regression but option-2 removes the
      regression.
      
      Michael Larabel (Phoronix) ran multiple benchmarks with these options and
      reported the results at [3]; they show that for most benchmarks option-2
      removes the regression introduced by the commit aa48e47e
      ("memcg: infrastructure to flush memcg stats").
      
      Based on the experiment results, this patch proposes option-2 as the
      solution to resolve the regression.
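      
      As a toy userspace model of option-2 (illustrative only, with made-up
      names): stat updates stay cheap per-CPU deltas, and the aggregation
      ("flush") happens only when the value is actually needed, here standing
      in for the refault path.
      
        #include <stdio.h>
        
        #define NR_CPUS 4
        
        static long percpu_delta[NR_CPUS];
        static long flushed_total;
        
        static void mod_stat(int cpu, long delta)       /* fast path: no flush */
        {
                percpu_delta[cpu] += delta;
        }
        
        static void flush_stats(void)
        {
                for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                        flushed_total += percpu_delta[cpu];
                        percpu_delta[cpu] = 0;
                }
        }
        
        static void refault(void)                       /* option-2: flush here */
        {
                flush_stats();
                printf("refault sees total = %ld\n", flushed_total);
        }
        
        int main(void)
        {
                mod_stat(0, 10);
                mod_stat(3, 5);
                refault();
                return 0;
        }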
      
      Link: https://lore.kernel.org/all/20210726022421.GB21872@xsang-OptiPlex-9020 [1]
      Link: https://www.phoronix.com/scan.php?page=article&item=linux515-compile-regress [2]
      Link: https://openbenchmarking.org/result/2109226-DEBU-LINUX5104 [3]
      Fixes: aa48e47e ("memcg: infrastructure to flush memcg stats")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Tested-by: Michael Larabel <Michael@phoronix.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f828223