1. 21 Apr, 2020 1 commit
    • mm/ksm: fix NULL pointer dereference when KSM zero page is enabled · 56df70a6
      Muchun Song authored
      find_mergeable_vma() can return NULL.  In this case, it leads to a crash
      when we access vm_mm (its offset is 0x40) later in write_protect_page().
      And this case did happen on our server.  The following call trace was
      captured on a 4.19 kernel with the patch below applied and the KSM zero
      page feature enabled on our server.
      
        commit e86c59b1 ("mm/ksm: improve deduplication of zero pages with colouring")
      
      So add a vma check to fix it.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
        Oops: 0000 [#1] SMP NOPTI
        CPU: 9 PID: 510 Comm: ksmd Kdump: loaded Tainted: G OE 4.19.36.bsk.9-amd64 #4.19.36.bsk.9
        RIP: try_to_merge_one_page+0xc7/0x760
        Code: 24 58 65 48 33 34 25 28 00 00 00 89 e8 0f 85 a3 06 00 00 48 83 c4
              60 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 46 08 a8 01 75 b8 <49>
              8b 44 24 40 4c 8d 7c 24 20 b9 07 00 00 00 4c 89 e6 4c 89 ff 48
        RSP: 0018:ffffadbdd9fffdb0 EFLAGS: 00010246
        RAX: ffffda83ffd4be08 RBX: ffffda83ffd4be40 RCX: 0000002c6e800000
        RDX: 0000000000000000 RSI: ffffda83ffd4be40 RDI: 0000000000000000
        RBP: ffffa11939f02ec0 R08: 0000000094e1a447 R09: 00000000abe76577
        R10: 0000000000000962 R11: 0000000000004e6a R12: 0000000000000000
        R13: ffffda83b1e06380 R14: ffffa18f31f072c0 R15: ffffda83ffd4be40
        FS: 0000000000000000(0000) GS:ffffa0da43b80000(0000) knlGS:0000000000000000
        CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000040 CR3: 0000002c77c0a003 CR4: 00000000007626e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
          ksm_scan_thread+0x115e/0x1960
          kthread+0xf5/0x130
          ret_from_fork+0x1f/0x30
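
      A minimal sketch of the kind of vma check described above (the placement
      and the surrounding code are illustrative, not the exact diff):

        vma = find_mergeable_vma(mm, rmap_item->address);
        if (!vma) {
                /* the vma is out of date, just exit */
                err = 0;
        } else {
                err = try_to_merge_one_page(vma, page, kpage);
        }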
      
      [songmuchun@bytedance.com: if the vma is out of date, just exit]
        Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
      [akpm@linux-foundation.org: add the conventional braces, replace /** with /*]
      Fixes: e86c59b1 ("mm/ksm: improve deduplication of zero pages with colouring")
      Co-developed-by: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Markus Elfring <Markus.Elfring@web.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
      Link: http://lkml.kernel.org/r/20200414132905.83819-1-songmuchun@bytedance.com
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56df70a6
  2. 07 Apr, 2020 2 commits
  3. 28 Nov, 2019 1 commit
  4. 22 Nov, 2019 1 commit
  5. 24 Sep, 2019 1 commit
  6. 19 Jun, 2019 1 commit
  7. 14 May, 2019 2 commits
    • mm/mmu_notifier: use correct mmu_notifier events for each invalidation · 7269f999
      Jérôme Glisse authored
      This updates each existing invalidation to use the correct mmu notifier
      event that represents what is happening to the CPU page table.  See the
      patch which introduced the events for the rationale behind this.
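
      An illustrative example of what such a conversion looks like (the event
      name below is an example, not the per-site choice made by the patch):

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
                                start, end);
        mmu_notifier_invalidate_range_start(&range);
        /* ... modify the CPU page table ... */
        mmu_notifier_invalidate_range_end(&range);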
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7269f999
    • mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Jérôme Glisse authored
      CPU page table updates can happen for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).

      Users of the mmu notifier API track changes to the CPU page table and
      take specific action for them.  But the current API only provides the
      range of virtual addresses affected by the change, not why the change is
      happening.

      This patchset does the initial mechanical conversion of all the places
      that call mmu_notifier_range_init to also provide the default
      MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
      invalidations happen against a given vma).  Passing down the vma allows
      the users of the mmu notifier to inspect the new vma page protection.

      MMU_NOTIFY_UNMAP is always the safe default as users of the mmu notifier
      should assume that everything in the range is going away when that event
      happens.  A later patch converts the mm call paths to use more
      appropriate events for each call.

      This is done as 2 patches so that no call site is forgotten, especially
      as it uses the following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f4f13e8
  8. 06 Mar, 2019 3 commits
    • mm: ksm: do not block on page lock when searching stable tree · 2cee57d1
      Yang Shi authored
      ksmd needs to search the stable tree to look for a suitable KSM page,
      but the KSM page might be locked for a while due to, e.g., a KSM page
      rmap walk.  Basically this is not a big deal since commit 2c653d0e ("ksm:
      introduce ksm_max_page_sharing per page deduplication limit"), because
      max_page_sharing limits the number of shared KSM pages.

      But it still does not seem worth waiting for the lock: the page can be
      skipped and then merged in the next scan, avoiding a potential stall if
      its content is still intact.

      Introduce a trylock mode to get_ksm_page() so it does not block on the
      page lock, like what try_to_merge_one_page() does.  And, define the three
      possible operations (nolock, lock and trylock) as an enum type to avoid
      stacking up bools and to make the code more readable.

      Return -EBUSY if the trylock fails, since NULL means no suitable KSM page
      was found, which is a valid case.

      With the default max_page_sharing setting (256), there is almost no
      observed change comparing lock vs trylock.

      However, with ksm02 of LTP, which sets max_page_sharing to 786432, a
      reduced ksmd full scan time can be observed.  With the lock version, ksmd
      may take 10s - 11s to run two full scans; with the trylock version ksmd
      may take 8s - 11s to run two full scans.  And, the numbers of
      pages_sharing and pages_to_scan stay the same.  Basically, this change
      has no harm.
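
      A sketch of the shape this takes (names follow the description above;
      treat it as illustrative rather than the exact diff):

        enum get_ksm_page_flags {
                GET_KSM_PAGE_NOLOCK,
                GET_KSM_PAGE_LOCK,
                GET_KSM_PAGE_TRYLOCK
        };

        /* inside get_ksm_page(), once the page has been pinned */
        if (flags == GET_KSM_PAGE_TRYLOCK) {
                if (!trylock_page(page)) {
                        put_page(page);
                        return ERR_PTR(-EBUSY);
                }
        } else if (flags == GET_KSM_PAGE_LOCK) {
                lock_page(page);
        }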
      
      [hughd@google.com: fix BUG_ON()]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182122280.6914@eggly.anvils
      Link: http://lkml.kernel.org/r/1548793753-62377-1-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Suggested-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2cee57d1
    • mm: reuse only-pte-mapped KSM page in do_wp_page() · 52d1e606
      Kirill Tkhai authored
      Add an optimization for KSM pages almost in the same way that we have
      for ordinary anonymous pages.  If there is a write fault to a page which
      is mapped by only one pte and is not in the swap cache, the page may be
      reused without copying its content.

      [ Note that we do not consider PageSwapCache() pages at least for now,
        since we don't want to complicate __get_ksm_page(), which has a nice
        optimization based on this (for the migration case).  Currently it is
        spinning on PageSwapCache() pages, waiting for when they have
        unfrozen counters (i.e., for the migration to finish).  But we don't
        want to make it also spin on swap cache pages which we try to reuse,
        since there is not a very high probability of reusing them.  So, for
        now we do not consider PageSwapCache() pages at all. ]

      So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
      page_stable_node(), to skip a page which KSM is currently trying to
      link to the stable tree.  Then we do page_ref_freeze() to prohibit KSM
      from merging one more page into the page we are reusing.  After that,
      nobody else can refer to the page being reused: KSM skips
      !PageSwapCache() pages with zero refcount; and the protection against
      all other participants is the same as for reused ordinary anon pages:
      pte lock, page lock and mmap_sem.
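
      A sketch of the helper described above (assumed shape, not the literal
      patch):

        bool reuse_ksm_page(struct page *page, struct vm_area_struct *vma,
                            unsigned long address)
        {
                /* skip swap cache pages and pages KSM is still linking */
                if (PageSwapCache(page) || !page_stable_node(page))
                        return false;
                /* forbid KSM from merging more pages into this one */
                if (!page_ref_freeze(page, 1))
                        return false;

                page_move_anon_rmap(page, vma);
                page->index = linear_page_index(vma, address);
                page_ref_unfreeze(page, 1);
                return true;
        }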
      
      [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
      Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52d1e606
    • mm: replace all open encodings for NUMA_NO_NODE · 98fa15f3
      Anshuman Khandual authored
      Patch series "Replace all open encodings for NUMA_NO_NODE", v3.
      
      All these places for replacement were found by running the following
      grep patterns on the entire kernel code.  Please let me know if this
      might have missed some instances.  This might also have replaced some
      false positives.  I will appreciate suggestions, inputs and review.
      
      1. git grep "nid == -1"
      2. git grep "node == -1"
      3. git grep "nid = -1"
      4. git grep "node = -1"
      
      This patch (of 2):
      
      At present there are multiple places where an invalid node number is
      encoded as -1.  Even though this is implicitly understood, it is always
      better to have a macro for it.  Replace these open encodings for an
      invalid node number with the global macro NUMA_NO_NODE.  This helps
      remove NUMA-related assumptions like 'invalid node' from various places,
      redirecting them to a common definition.
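
      For example, a conversion of this kind simply replaces the literal with
      the macro (numa_mem_id() here is just an illustrative fallback):

        /* before */
        if (nid == -1)
                nid = numa_mem_id();

        /* after */
        if (nid == NUMA_NO_NODE)
                nid = numa_mem_id();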
      
      Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
      
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>	[ixgbe]
      Acked-by: Jens Axboe <axboe@kernel.dk>			[mtip32xx]
      Acked-by: Vinod Koul <vkoul@kernel.org>			[dmaengine.c]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Acked-by: Doug Ledford <dledford@redhat.com>		[drivers/infiniband]
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98fa15f3
  9. 28 Dec, 2018 3 commits
  10. 23 Aug, 2018 1 commit
  11. 22 Aug, 2018 1 commit
  12. 17 Aug, 2018 2 commits
    • mm: convert return type of handle_mm_fault() caller to vm_fault_t · 50a7ca3c
      Souptick Joarder authored
      Use new return type vm_fault_t for fault handler.  For now, this is just
      documenting that the function returns a VM_FAULT value rather than an
      errno.  Once all instances are converted, vm_fault_t will become a
      distinct type.
      
      Ref: commit 1c8f4220 ("mm: change return type to vm_fault_t")
      
      In this patch all the callers of handle_mm_fault() are changed to use
      the vm_fault_t type.
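
      A generic sketch of the caller-side change (not any particular
      architecture's fault handler):

        vm_fault_t fault;

        fault = handle_mm_fault(vma, address, flags);
        if (fault & VM_FAULT_ERROR) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                goto bad_area;
        }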
      
      Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
      
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50a7ca3c
    • dax: remove VM_MIXEDMAP for fsdax and device dax · e1fb4a08
      Dave Jiang authored
      This patch is reworked from an earlier patch that Dan has posted:
      https://patchwork.kernel.org/patch/10131727/
      
      VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
      the memory page it is dealing with is not typical memory from the linear
      map.  The get_user_pages_fast() path, since it does not resolve the vma,
      is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
      use that as a VM_MIXEDMAP replacement in some locations.  In the cases
      where there is no pte to consult we fall back to using vma_is_dax() to
      detect the VM_MIXEDMAP special case.
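
      A hypothetical sketch of the stand-in checks described above (the helper
      name is made up for illustration; pte_devmap() and vma_is_dax() are the
      kernel helpers the text refers to):

        static bool page_is_dax_mapped(struct vm_area_struct *vma, pte_t pte)
        {
                /* with a pte in hand, the devmap bit identifies DAX pages */
                if (pte_devmap(pte))
                        return true;
                /* with no pte to consult, fall back to the vma itself */
                return vma_is_dax(vma);
        }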
      
      Now that we have explicit driver pfn_t-flag opt-in/opt-out for
      get_user_pages() support for DAX we can stop setting VM_MIXEDMAP.  This
      also means we no longer need to worry about safely manipulating vm_flags
      in a future where we support dynamically changing the dax mode of a
      file.
      
      DAX should also now be supported with madvise_behavior(), vma_merge(),
      and copy_page_range().
      
      This patch has been tested against ndctl unit test.  It has also been
      tested against xfstests commit: 625515d using fake pmem created by
      memmap and no additional issues have been observed.
      
      Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
      
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1fb4a08
  13. 14 Jun, 2018 1 commit
    • mm/ksm.c: ignore STABLE_FLAG of rmap_item->address in rmap_walk_ksm() · 1105a2fc
      Jia He authored
      On our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by a
      PAGE_SIZE-unaligned rmap_item->address under memory pressure tests
      (start 20 guests and run memhog in the host).
      
        WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
        CPU: 4 PID: 4641 Comm: memhog Tainted: G        W 4.17.0-rc3+ #8
        Call trace:
         kvm_age_hva_handler+0xc0/0xc8
         handle_hva_to_gpa+0xa8/0xe0
         kvm_age_hva+0x4c/0xe8
         kvm_mmu_notifier_clear_flush_young+0x54/0x98
         __mmu_notifier_clear_flush_young+0x6c/0xa0
         page_referenced_one+0x154/0x1d8
         rmap_walk_ksm+0x12c/0x1d0
         rmap_walk+0x94/0xa0
         page_referenced+0x194/0x1b0
         shrink_page_list+0x674/0xc28
         shrink_inactive_list+0x26c/0x5b8
         shrink_node_memcg+0x35c/0x620
         shrink_node+0x100/0x430
         do_try_to_free_pages+0xe0/0x3a8
         try_to_free_pages+0xe4/0x230
         __alloc_pages_nodemask+0x564/0xdc0
         alloc_pages_vma+0x90/0x228
         do_anonymous_page+0xc8/0x4d0
         __handle_mm_fault+0x4a0/0x508
         handle_mm_fault+0xf8/0x1b0
         do_page_fault+0x218/0x4b8
         do_translation_fault+0x90/0xa0
         do_mem_abort+0x68/0xf0
         el0_da+0x24/0x28
      
      In rmap_walk_ksm, the rmap_item->address might still have the
      STABLE_FLAG set, so the start and end passed to handle_hva_to_gpa might
      not be PAGE_SIZE aligned.  Thus it will cause exceptions in
      handle_hva_to_gpa on arm64.

      This patch fixes it by ignoring (not removing) the low bits of the
      address when doing rmap_walk_ksm.
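
      A sketch of the idea in rmap_walk_ksm() (the mask used here is an
      assumption; the point is that rmap_item->address itself keeps its flag
      bits):

        unsigned long addr = rmap_item->address & PAGE_MASK;

        vma = find_vma(mm, addr);
        if (!vma || vma->vm_start > addr || vma->vm_end <= addr)
                continue;

        if (!rwc->rmap_one(page, vma, addr, rwc->arg))
                goto out;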
      
      IMO, it should be backported to the stable tree.  The storm of WARN_ONs
      is very easy for me to reproduce.  More than that, I watched a panic
      (not reproducible) as follows:
      
        page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
        flags: 0x1fffc00000000000()
        raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
        raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
        page dumped because: nonzero _refcount
        CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
        Hardware name: <snip for confidential issues>
        Call trace:
          dump_backtrace+0x0/0x22c
          show_stack+0x24/0x2c
          dump_stack+0x8c/0xb0
          bad_page+0xf4/0x154
          free_pages_check_bad+0x90/0x9c
          free_pcppages_bulk+0x464/0x518
          free_hot_cold_page+0x22c/0x300
          __put_page+0x54/0x60
          unmap_stage2_range+0x170/0x2b4
          kvm_unmap_hva_handler+0x30/0x40
          handle_hva_to_gpa+0xb0/0xec
          kvm_unmap_hva_range+0x5c/0xd0
      
      I even injected a fault on purpose in kvm_unmap_hva_range by setting
      size=size-0x200; the call trace is similar to the above.  So I think the
      panic is caused by the same root cause as the WARN_ONs.
      
      Andrea said:
      
      : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
      : zap those bits and hide the misalignment caused by the low metadata
      : bits being erroneously left set in the address, but the arm code
      : notices when that's the last page in the memslot and the hva_end is
      : getting aligned and the size is below one page.
      :
      : I think the problem triggers in the addr += PAGE_SIZE of
      : unmap_stage2_ptes that never matches end because end is aligned but
      : addr is not.
      :
      : 	} while (pte++, addr += PAGE_SIZE, addr != end);
      :
      : x86 again only works on hva_start/hva_end after converting it to
      : gfn_start/end and that being in pfn units the bits are zapped before
      : they risk to cause trouble.
      
      Jia He said:
      
      : I've tested by myself in arm64 server (QDF2400,46 cpus,96G mem) Without
      : this patch, the WARN_ON is very easy for reproducing.  After this patch, I
      : have run the same benchmarch for a whole day without any WARN_ONs
      
      Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
      
      Signed-off-by: Jia He <jia.he@hxt-semitech.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Cc: Suzuki K Poulose <Suzuki.Poulose@arm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: Arvind Yadav <arvind.yadav.cs@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1105a2fc
  14. 08 Jun, 2018 1 commit
  15. 27 Apr, 2018 1 commit
  16. 16 Apr, 2018 1 commit
  17. 11 Apr, 2018 1 commit
  18. 06 Apr, 2018 2 commits
  19. 18 Mar, 2018 1 commit
    • sparc64: Add support for ADI (Application Data Integrity) · 74a04967
      Khalid Aziz authored
      
      ADI is a new feature supported on SPARC M7 and newer processors to allow
      hardware to catch rogue accesses to memory. ADI is supported for data
      fetches only and not instruction fetches. An app can enable ADI on its
      data pages, set version tags on them and use versioned addresses to
      access the data pages. Upper bits of the address contain the version
      tag. On M7 processors, upper four bits (bits 63-60) contain the version
      tag. If a rogue app attempts to access ADI enabled data pages, its
      access is blocked and processor generates an exception. Please see
      Documentation/sparc/adi.txt for further details.
      
      This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
      MCD (Memory Corruption Detection) on selected memory ranges, enable
      TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
      version tags on page swap out/in or migration. ADI is not enabled by
      default for any task. A task must explicitly enable ADI on a memory
      range and set version tag for ADI to be effective for the task.
      Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Khalid Aziz <khalid@gonehiking.org>
      Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74a04967
  20. 07 Feb, 2018 1 commit
  21. 04 Dec, 2017 1 commit
    • mm/ksm: Remove now-redundant smp_read_barrier_depends() · 08df4774
      Paul E. McKenney authored
      
      Because READ_ONCE() now implies smp_read_barrier_depends(), the
      smp_read_barrier_depends() in get_ksm_page() is now redundant.
      This commit removes it and updates the comments.
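
      A sketch of the pattern in get_ksm_page() (variable names assumed):

        /* before: dependent load plus an explicit dependency barrier */
        kpfn = READ_ONCE(stable_node->kpfn);
        smp_read_barrier_depends();
        page = pfn_to_page(kpfn);

        /* after: READ_ONCE() already orders the dependent access */
        kpfn = READ_ONCE(stable_node->kpfn);
        page = pfn_to_page(kpfn);
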
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
      Cc: <linux-mm@kvack.org>
      08df4774
  22. 16 Nov, 2017 1 commit
    • mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse authored
      This patch only affects users of the mmu_notifier->invalidate_range
      callback, which are device drivers related to ATS/PASID, CAPI, IOMMUv2,
      SVM ... and it is an optimization for those users.  Everyone else is
      unaffected by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (the notify versions of the *_clear_flush helpers do
      call mmu_notifier_invalidate_range).  But that notification is not
      necessary in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in-depth analysis of why it is fine to do this:

      For a secondary TLB (non-CPU TLB) like an IOMMU TLB or a device TLB
      (when the device uses something like ATS/PASID to get the IOMMU to walk
      the CPU page table to access a process virtual address space), there are
      only 2 cases when you need to notify those secondary TLBs while holding
      the page table lock when clearing a pte/pmd:
      
        A) page backing address is free before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious: you do not want to take the risk of the device
      writing to a page that might now be used by something completely
      different.

      Case B is more subtle.  For correctness it requires the following
      sequence to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to the new page

      If clearing the page table entry is not followed by a notify before
      setting the new pte/pmd value, then you can break the memory model (like
      C11 or C++11) for the device.
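
      A sketch of the case B sequence using the generic helpers (illustrative,
      not a specific call site; ptl is the page table lock):

        spin_lock(ptl);
        /* clears the pte, flushes the CPU TLB and notifies the device TLB */
        ptep_clear_flush_notify(vma, address, ptep);
        set_pte_at(mm, address, ptep, new_pte);
        spin_unlock(ptl);
        /* mmu_notifier_invalidate_range_end() still runs later, unlocked */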
      
      Consider the following scenario (device use a feature similar to ATS/
      PASID):
      
      Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
      assume they are write protected for COW (other case of B apply too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here, because at time N+2 the cleared page table entry was not paired
      with a notification to invalidate the secondary TLB, the device sees the
      new value for addrB before seeing the new value for addrA.  This breaks
      total memory ordering for the device.
      
      When changing a pte to write protect, or to point to a new write
      protected page with the same content (KSM), it is ok to delay the
      invalidate_range callback to mmu_notifier_invalidate_range_end() outside
      the page table lock.  This is true even if the thread doing the page
      table update is preempted right after releasing the page table lock,
      before calling mmu_notifier_invalidate_range_end().
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
      
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f10851e
  23. 04 Oct, 2017 1 commit
  24. 07 Sep, 2017 1 commit
  25. 10 Aug, 2017 1 commit
    • mm: fix KSM data corruption · b3a81d08
      Minchan Kim authored
      Nadav reported that KSM can corrupt user data via the TLB batching
      race [1].  That means data the user has written can be lost.
      
      Quote from Nadav Amit:
       "For this race we need 4 CPUs:
      
        CPU0: Caches a writable and dirty PTE entry, and uses the stale value
        for write later.
      
        CPU1: Runs madvise_free on the range that includes the PTE. It would
        clear the dirty-bit. It batches TLB flushes.
      
        CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
        We care about the fact that it clears the PTE write-bit, and of
        course, batches TLB flushes.
      
        CPU3: Runs KSM. Our purpose is to pass the following test in
        write_protect_page():
      
      	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
      	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
      
        Since it will avoid TLB flush. And we want to do it while the PTE is
        stale. Later, and before replacing the page, we would be able to
        change the page.
      
        Note that all the operations CPU1-3 perform can happen in parallel
        since they only acquire mmap_sem for read.
      
        We start with two identical pages. Everything below regards the same
        page/PTE.
      
        CPU0        CPU1        CPU2        CPU3
        ----        ----        ----        ----
        Write the same
        value on page
      
        [cache PTE as
         dirty in TLB]
      
                    MADV_FREE
                    pte_mkclean()
      
                                4 > clear_refs
                                pte_wrprotect()
      
                                            write_protect_page()
                                            [ success, no flush ]
      
                                            pages_identical()
                                            [ ok ]
      
        Write to page
        different value
      
        [Ok, using stale
         PTE]
      
                                            replace_page()
      
        Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
        CPU0 already wrote on the page, but KSM ignored this write, and it got
        lost"
      
      In the above scenario, MADV_FREE is fixed by changing the TLB batching
      API, including [set|clear]_tlb_flush_pending.  What remains is the
      soft-dirty part.

      This patch changes soft-dirty to use the TLB batching API instead of
      flush_tlb_mm, and makes KSM check for a pending TLB flush by using
      mm_tlb_flush_pending, so that it will flush the TLB to avoid data loss
      if other parallel threads have a TLB flush pending.
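
      A sketch of the KSM side of the change in write_protect_page() (the
      condition is the one quoted in the scenario above; treat the exact
      placement as illustrative):

        if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
            (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
            mm_tlb_flush_pending(mm)) {
                /* take the flush path: a parallel batched flush is pending */
                entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
        }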
      
      [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Reported-by: Nadav Amit <namit@vmware.com>
      Tested-by: Nadav Amit <namit@vmware.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3a81d08
  26. 06 Jul, 2017 5 commits
    • ksm: optimize refile of stable_node_dup at the head of the chain · 80b18dfa
      Andrea Arcangeli authored
      If a candidate stable_node_dup has been found and it can accept further
      merges, it can be refiled to the head of the list to speed up the next
      searches, without altering which dup is found and how the dups accumulate
      in the chain.

      We already refiled it back to the head in the prune_stale_stable_nodes
      case, but we didn't refile it if not pruning (which is more common).
      And we also refiled it when it was already at the head, which is
      unnecessary (in the prune_stale_stable_nodes case, nr > 1 means there's
      more than one dup in the chain; it doesn't mean it's not already at the
      head of the chain).
      
      The stable_node_chain list is single threaded and there's no SMP locking
      contention so it should be faster to refile it to the head of the list
      also if prune_stale_stable_nodes is false.
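
      A hypothetical sketch of the refile (variable names assumed):

        if (found && found_rmap_hlist_len < ksm_max_page_sharing &&
            stable_node->hlist.first != &found->hlist_dup) {
                /* move the still-mergeable dup to the head of the chain */
                hlist_del(&found->hlist_dup);
                hlist_add_head(&found->hlist_dup, &stable_node->hlist);
        }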
      
      Profiling shows the refile happens 1.9% of the time when a dup is found
      with a max_page_sharing limit setting of 3 (with max_page_sharing of 2
      the refile never happens of course as there's never space for one more
      merge) which is reasonably low.  At higher max_page_sharing values it
      should be much less frequent.
      
      This is just an optimization.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-4-aarcange@redhat.com
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80b18dfa
    • ksm: swap the two output parameters of chain/chain_prune · 8dc5ffcd
      Andrea Arcangeli authored
      Some static checkers complain if chain/chain_prune returns a potentially
      stale pointer.

      There are two output parameters to chain/chain_prune: one is tree_page,
      the other is stable_node_dup.  Like in get_ksm_page, the caller has to
      check that tree_page is not NULL before touching the stable_node.
      Similarly in chain/chain_prune the caller has to check tree_page before
      touching the stable_node_dup returned or the original stable_node passed
      as parameter.

      Because the tree_page is never returned as a stale pointer, it may be
      more intuitive to return tree_page and to pass stable_node_dup by
      reference instead of the reverse.

      This patch purely swaps the two output parameters of chain/chain_prune
      as a cleanup for the static checker and to mimic the get_ksm_page
      behavior more closely.  There's no change to the caller at all except
      the swap; it's purely a cleanup and it is a noop from the caller's point
      of view.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-3-aarcange@redhat.com
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Tested-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8dc5ffcd
    • ksm: cleanup stable_node chain collapse case · 0ba1d0f7
      Andrea Arcangeli authored
      Patch series "KSMscale cleanup/optimizations".

      There are no fixes here, just minor cleanups and optimizations.

      1/3 makes the "fix" for the stale stable_node fall in the standard case
          without introducing new cases.  Setting stable_node to NULL was
          marginally safer, but the stale pointer is still wiped from the
          caller, and this looks cleaner.

      2/3 should fix the false positive from Dan's static checker.

      3/3 is a microoptimization to apply the refile of future merge candidate
          dups at the head of the chain in all cases and to skip it in the one
          case where we did it but it was a noop (to avoid checking if it was
          already at the head, but now we have to check that anyway, so it got
          optimized away).
      
      This patch (of 3):
      
      When the stable_node chain is collapsed we can as well set the caller
      stable_node to match the returned stable_node_dup in chain_prune().
      
      This way the collapse case becomes indistinguishable from the regular
      stable_node case and we can remove two branches from the KSM page
      migration handling slow paths.
      
      While it was all correct, this looks cleaner (and faster) as the caller
      has to deal with fewer special cases.
      
      Link: http://lkml.kernel.org/r/20170518173721.22316-2-aarcange@redhat.com
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ba1d0f7
    • ksm: fix use after free with merge_across_nodes = 0 · b4fecc67
      Andrea Arcangeli authored
      If merge_across_nodes was manually set to 0 (not the default value) by
      the admin or a tuned profile on NUMA systems triggering cross-NODE page
      migrations, a stable_node use after free could materialize.
      
      If the chain is collapsed stable_node would point to the old chain that
      was already freed.  stable_node_dup would be the stable_node dup now
      converted to a regular stable_node and indexed in the rbtree in
      replacement of the freed stable_node chain (not anymore a dup).
      
      This special case where the chain is collapsed in the NUMA replacement
      path, is now detected by setting stable_node to NULL by the chain_prune
      callee if it decides to collapse the chain.  This tells the NUMA
      replacement code that even if stable_node and stable_node_dup are
      different, this is not a chain if stable_node is NULL, as the
      stable_node_dup was converted to a regular stable_node and the chain was
      collapsed.
      
      It is generally safer for the callee to force the caller's stable_node
      to NULL the moment it becomes stale, so any other mistake like this
      would result in an instant Oops, which is easier to debug than a use
      after free.

      Otherwise the replace logic would act as if stable_node were a valid
      chain, when in fact it was freed.  Notably
      stable_node_chain_add_dup(page_node, stable_node) would run on a stale
      stable_node.
      
      Andrey Ryabinin found the source of the use after free in chain_prune().
      
      Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.com
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: Evgheni Dereveanchin <ederevea@redhat.com>
      Tested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4fecc67
    • ksm: introduce ksm_max_page_sharing per page deduplication limit · 2c653d0e
      Andrea Arcangeli authored
      
      Without a max deduplication limit for each KSM page, the list of the
      rmap_items associated to each stable_node can grow infinitely large.
      
      During the rmap walk each entry can take up to ~10usec to process
      because of IPIs for the TLB flushing (both for the primary MMU and the
      secondary MMUs with the MMU notifier).  With only 16GB of address space
      shared in the same KSM page, that would amount to dozens of seconds of
      kernel runtime.
      
      A ~256 max deduplication factor will reduce the latencies of the rmap
      walks on KSM pages to order of a few msec.  Just doing the
      cond_resched() during the rmap walks is not enough, the list size must
      have a limit too, otherwise the caller could get blocked in (schedule
      friendly) kernel computations for seconds, unexpectedly.
      
      There's room for optimization to significantly reduce the IPI delivery
      cost during the page_referenced(), but at least for page_migration in
      the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
      it may be inevitable to send lots of IPIs if each rmap_item->mm is
      active on a different CPU and there are lots of CPUs.  Even if we ignore
      the IPI delivery cost, we still have to walk the whole KSM rmap list, so
      we can't allow millions or billions (i.e. an unlimited number) of
      entries in the KSM stable_node rmap_item lists.
      
      The limit is enforced efficiently by adding a second dimension to the
      stable rbtree.  So there are three types of stable_nodes: the regular
      ones (identical as before, living in the first flat dimension of the
      stable rbtree), the "chains" and the "dups".
      
      Every "chain" and all "dups" linked into a "chain" enforce the invariant
      that they represent the same write protected memory content, even if
      each "dup" will be pointed by a different KSM page copy of that content.
      This way the stable rbtree lookup computational complexity is unaffected
      if compared to an unlimited max_sharing_limit.  It is still enforced
      that there cannot be KSM page content duplicates in the stable rbtree
      itself.
      
      Adding the second dimension to the stable rbtree only after the
      max_page_sharing limit hits, provides for a zero memory footprint
      increase on 64bit archs.  The memory overhead of the per-KSM page
      stable_tree and per virtual mapping rmap_item is unchanged.  Only after
      the max_page_sharing limit hits, we need to allocate a stable_tree
      "chain" and rb_replace() the "regular" stable_node with the newly
      allocated stable_node "chain".  After that we simply add the "regular"
      stable_node to the chain as a stable_node "dup" by linking hlist_dup in
      the stable_node_chain->hlist.  This way the "regular" (flat) stable_node
      is converted to a stable_node "dup" living in the second dimension of
      the stable rbtree.
      
      During stable rbtree lookups the stable_node "chain" is identified as
      stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
      is_stable_node_chain()).
      
      When dropping stable_nodes, the stable_node "dup" is identified as
      stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
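
      A sketch of the two helpers implied above (the numeric value of
      STABLE_NODE_CHAIN is an assumption):

        #define STABLE_NODE_CHAIN -1024  /* never a valid rmap_hlist_len */

        static inline bool is_stable_node_chain(struct stable_node *chain)
        {
                return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
        }

        static inline bool is_stable_node_dup(struct stable_node *dup)
        {
                return dup->head == STABLE_NODE_DUP_HEAD;
        }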
      
      The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used
      elsewhere in any stable_node->head/node to avoid clashes with the
      stable_node->node.rb_parent_color pointer, and different from
      &migrate_nodes.  So the second field of &migrate_nodes is picked and
      verified as always safe with a BUILD_BUG_ON in case the list_head
      implementation changes in the future.

      STABLE_NODE_CHAIN is picked as a random negative value in
      stable_node->rmap_hlist_len.  rmap_hlist_len cannot become negative when
      it's a "regular" stable_node or a stable_node "dup".
      
      The stable_node_chain->nid is irrelevant.  The stable_node_chain->kpfn
      is aliased in a union with a time field used to rate limit the
      stable_node_chain->hlist prunes.
      
      The garbage collection of the stable_node_chain happens lazily during
      stable rbtree lookups (as for all other kind of stable_nodes), or while
      disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
      entire stable rbtree.
      
      While the "regular" stable_nodes and the stable_node "dups" must wait
      for their underlying tree_page to be freed before they can be freed
      themselves, the stable_node "chains" can be freed immediately if the
      stable_node->hlist turns empty.  This is because the "chains" are never
      pointed by any page->mapping and they're effectively stable rbtree KSM
      self contained metadata.
      
      [akpm@linux-foundation.org: fix non-NUMA build]
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Tested-by: Petr Holasek <pholasek@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Evgheni Dereveanchin <ederevea@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c653d0e
  27. 02 Jun, 2017 1 commit
    • ksm: prevent crash after write_protect_page fails · a7306c34
      Andrea Arcangeli authored
      "err" needs to be left set to -EFAULT if split_huge_page succeeds.
      Otherwise, if "err" gets clobbered with zero and write_protect_page
      fails, try_to_merge_one_page() will succeed instead of returning
      -EFAULT, and then try_to_merge_with_ksm_page() will continue thinking
      kpage is a PageKsm page when in fact it's still an anonymous page.
      Eventually it'll crash in page_add_anon_rmap.
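
      A sketch of the fix in try_to_merge_one_page() (not the literal diff):
      do not let a successful split clobber err, which must stay -EFAULT in
      case write_protect_page() fails afterwards:

        if (PageTransCompound(page)) {
                if (split_huge_page(page))
                        goto out_unlock;
        }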
      
      This has been reproduced on a Fedora 25 kernel but I can reproduce it
      with upstream too.

      The bug was introduced by commit f765f540 ("ksm: prepare to new THP
      semantics"), which went into v4.5.
      
          page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
          flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
          page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
          page->mem_cgroup:ffffa09674bf0000
          ------------[ cut here ]------------
          kernel BUG at mm/rmap.c:1222!
          CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
          RIP: do_page_add_anon_rmap+0x1c4/0x240
          Call Trace:
            page_add_anon_rmap+0x18/0x20
            try_to_merge_with_ksm_page+0x50b/0x780
            ksm_scan_thread+0x1211/0x1410
            ? prepare_to_wait_event+0x100/0x100
            ? try_to_merge_with_ksm_page+0x780/0x780
            kthread+0xd9/0xf0
            ? kthread_park+0x60/0x60
            ret_from_fork+0x25/0x30
      
      Fixes: f765f540 ("ksm: prepare to new THP semantics")
      Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Federico Simoncelli <fsimonce@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7306c34
  28. 03 May, 2017 1 commit