1. 29 Jul, 2020 1 commit
  2. 10 Apr, 2020 2 commits
  3. 07 Apr, 2020 2 commits
  4. 02 Apr, 2020 2 commits
  5. 20 Feb, 2020 1 commit
  6. 31 Jan, 2020 1 commit
  7. 14 Jan, 2020 1 commit
  8. 06 Jan, 2020 1 commit
    • Catalin Marinas's avatar
      arm64: Revert support for execute-only user mappings · 24cecc37
      Catalin Marinas authored
      The ARMv8 64-bit architecture supports execute-only user permissions by
      clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
      privileged mapping but from which user running at EL0 can still execute.
      
      The downside, however, is that the kernel at EL1 inadvertently reading
      such mapping would not trip over the PAN (privileged access never)
      protection.
      
      Revert the relevant bits from commit cab15ce6 ("arm64: Introduce
      execute-only page access permissions") so that PROT_EXEC implies
      PROT_READ (and therefore PTE_USER) until the architecture gains proper
      support for execute-only user mappings.
      
      Fixes: cab15ce6
      
       ("arm64: Introduce execute-only page access permissions")
      Cc: <stable@vger.kernel.org> # 4.9.x-
      Acked-by: default avatarWill Deacon <will@kernel.org>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24cecc37
  9. 01 Dec, 2019 7 commits
  10. 26 Sep, 2019 2 commits
  11. 24 Sep, 2019 2 commits
  12. 20 Aug, 2019 1 commit
  13. 21 May, 2019 1 commit
  14. 09 May, 2019 1 commit
    • Dave Hansen's avatar
      x86/mpx, mm/core: Fix recursive munmap() corruption · 5a28fc94
      Dave Hansen authored
      This is a bit of a mess, to put it mildly.  But, it's a bug
      that only seems to have showed up in 4.20 but wasn't noticed
      until now, because nobody uses MPX.
      
      MPX has the arch_unmap() hook inside of munmap() because MPX
      uses bounds tables that protect other areas of memory.  When
      memory is unmapped, there is also a need to unmap the MPX
      bounds tables.  Barring this, unused bounds tables can eat 80%
      of the address space.
      
      But, the recursive do_munmap() that gets called vi arch_unmap()
      wreaks havoc with __do_munmap()'s state.  It can result in
      freeing populated page tables, accessing bogus VMA state,
      double-freed VMAs and more.
      
      See the "long story" further below for the gory details.
      
      To fix this, call arch_unmap() before __do_unmap() has a chance
      to do anything meaningful.  Also, remove the 'vma' argument
      and force the MPX code to do its own, independent VMA lookup.
      
      == UML / unicore32 impact ==
      
      Remove unused 'vma' argument to arch_unmap().  No functiona...
      5a28fc94
  15. 19 Apr, 2019 1 commit
    • Andrea Arcangeli's avatar
      coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · 04f5866e
      Andrea Arcangeli authored
      The core dumping code has always run without holding the mmap_sem for
      writing, despite that is the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
      In particular because the growsdown and growsup can move the
      vm_start/vm_end the various loops the core dump does around the vma will
      not be consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3
      
       ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarJann Horn <jannh@google.com>
      Suggested-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarPeter Xu <peterx@redhat.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarJann Horn <jannh@google.com>
      Acked-by: default avatarJason Gunthorpe <jgg@mellanox.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      04f5866e
  16. 06 Mar, 2019 2 commits
  17. 28 Feb, 2019 1 commit
  18. 28 Dec, 2018 1 commit
  19. 10 Dec, 2018 1 commit
    • Steve Capper's avatar
      mm: mmap: Allow for "high" userspace addresses · f6795053
      Steve Capper authored
      
      This patch adds support for "high" userspace addresses that are
      optionally supported on the system and have to be requested via a hint
      mechanism ("high" addr parameter to mmap).
      
      Architectures such as powerpc and x86 achieve this by making changes to
      their architectural versions of arch_get_unmapped_* functions. However,
      on arm64 we use the generic versions of these functions.
      
      Rather than duplicate the generic arch_get_unmapped_* implementations
      for arm64, this patch instead introduces two architectural helper macros
      and applies them to arch_get_unmapped_*:
       arch_get_mmap_end(addr) - get mmap upper limit depending on addr hint
       arch_get_mmap_base(addr, base) - get mmap_base depending on addr hint
      
      If these macros are not defined in architectural code then they default
      to (TASK_SIZE) and (base) so should not introduce any behavioural
      changes to architectures that do not define them.
      Signed-off-by: default avatarSteve Capper <steve.capper@arm.com>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarWill Deacon <will.deacon@arm.com>
      f6795053
  20. 26 Oct, 2018 5 commits
    • Yang Shi's avatar
      mm: brk: downgrade mmap_sem to read when shrinking · 9bc8039e
      Yang Shi authored
      brk might be used to shrink memory mapping too other than munmap().  So,
      it may hold write mmap_sem for long time when shrinking large mapping, as
      what commit ("mm: mmap: zap pages with read mmap_sem in munmap")
      described.
      
      The brk() will not manipulate vmas anymore after __do_munmap() call for
      the mapping shrink use case.  But, it may set mm->brk after __do_munmap(),
      which needs hold write mmap_sem.
      
      However, a simple trick can workaround this by setting mm->brk before
      __do_munmap().  Then restore the original value if __do_munmap() fails.
      With this trick, it is safe to downgrade to read mmap_sem.
      
      So, the same optimization, which downgrades mmap_sem to read for zapping
      pages, is also feasible and reasonable to this case.
      
      The period of holding exclusive mmap_sem for shrinking large mapping would
      be reduced significantly with this optimization.
      
      [akpm@linux-foundation.org: tweak comment]
      [yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
        Link: http://lkml.kernel.org/r/1538687672-17795-1-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1538067582-60038-2-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9bc8039e
    • Yang Shi's avatar
      mm: mremap: downgrade mmap_sem to read when shrinking · 85a06835
      Yang Shi authored
      Other than munmap, mremap might be used to shrink memory mapping too.
      So, it may hold write mmap_sem for long time when shrinking large
      mapping, as what commit ("mm: mmap: zap pages with read mmap_sem in
      munmap") described.
      
      The mremap() will not manipulate vmas anymore after __do_munmap() call for
      the mapping shrink use case, so it is safe to downgrade to read mmap_sem.
      
      So, the same optimization, which downgrades mmap_sem to read for zapping
      pages, is also feasible and reasonable to this case.
      
      The period of holding exclusive mmap_sem for shrinking large mapping
      would be reduced significantly with this optimization.
      
      MREMAP_FIXED and MREMAP_MAYMOVE are more complicated to adopt this
      optimization since they need manipulate vmas after do_munmap(),
      downgrading mmap_sem may create race window.
      
      Simple mapping shrink is the low hanging fruit, and it may cover the
      most cases of unmap with munmap together.
      
      [akpm@linux-foundation.org: tweak comment]
      [yang.shi@linux.alibaba.com: fix unsigned compare against 0 issue]
        Link: http://lkml.kernel.org/r/1538687672-17795-2-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1538067582-60038-1-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      85a06835
    • Yang Shi's avatar
      mm: unmap VM_PFNMAP mappings with optimized path · cb492249
      Yang Shi authored
      When unmapping VM_PFNMAP mappings, vm flags need to be updated.  Since the
      vmas have been detached, so it sounds safe to update vm flags with read
      mmap_sem.
      
      Link: http://lkml.kernel.org/r/1537376621-51150-4-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: default avatarMatthew Wilcox <willy@infradead.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cb492249
    • Yang Shi's avatar
      mm: unmap VM_HUGETLB mappings with optimized path · b4cefb36
      Yang Shi authored
      When unmapping VM_HUGETLB mappings, vm flags need to be updated.  Since
      the vmas have been detached, so it sounds safe to update vm flags with
      read mmap_sem.
      
      Link: http://lkml.kernel.org/r/1537376621-51150-3-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: default avatarMatthew Wilcox <willy@infradead.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b4cefb36
    • Yang Shi's avatar
      mm: mmap: zap pages with read mmap_sem in munmap · dd2283f2
      Yang Shi authored
      Patch series "mm: zap pages with read mmap_sem in munmap for large
      mapping", v11.
      
      Background:
      Recently, when we ran some vm scalability tests on machines with large memory,
      we ran into a couple of mmap_sem scalability issues when unmapping large memory
      space, please refer to https://lkml.org/lkml/2017/12/14/733 and
      https://lkml.org/lkml/2018/2/20/576.
      
      History:
      Then akpm suggested to unmap large mapping section by section and drop mmap_sem
      at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
      
      V1 patch series was submitted to the mailing list per Andrew's suggestion
      (see https://lkml.org/lkml/2018/3/20/786).  Then I received a lot great
      feedback and suggestions.
      
      Then this topic was discussed on LSFMM summit 2018.  In the summit, Michal
      Hocko suggested (also in the v1 patches review) to try "two phases"
      approach.  Zapping pages with read mmap_sem, then doing via cleanup with
      write mmap_sem (for discussion detail, see
      https://lwn.net/Articles/753269/)
      
      Approach:
      Zapping pages is the most time consuming part, according to the suggestion from
      Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
      what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
      
      But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
        * The unexpected state from PF if it wins the race in the middle of munmap.
          It may return zero page, instead of the content or SIGSEGV.
        * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
          is a showstopper from akpm
      
      But, some part may need write mmap_sem, for example, vma splitting. So,
      the design is as follows:
              acquire write mmap_sem
              lookup vmas (find and split vmas)
              deal with special mappings
              detach vmas
              downgrade_write
      
              zap pages
              free page tables
              release mmap_sem
      
      The vm events with read mmap_sem may come in during page zapping, but
      since vmas have been detached before, they, i.e.  page fault, gup, etc,
      will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
      expected.
      
      If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
      mappings.  They will be handled by falling back to regular do_munmap()
      with exclusive mmap_sem held in this patch since they may update vm flags.
      
      But, with the "detach vmas first" approach, the vmas have been detached
      when vm flags are updated, so it sounds safe to update vm flags with read
      mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
      handled by using the optimized path in the following separate patches for
      bisectable sake.
      
      Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
      However it is fine to have false-positive MMF_RECALC_UPROBES according to
      uprobes developer.  So, uprobe unmap will not be handled by the regular
      path.
      
      With the "detach vmas first" approach we don't have to re-acquire mmap_sem
      again to clean up vmas to avoid race window which might get the address
      space changed since downgrade_write() doesn't release the lock to lead
      regression, which simply downgrades to read lock.
      
      And, since the lock acquire/release cost is managed to the minimum and
      almost as same as before, the optimization could be extended to any size
      of mapping without incurring significant penalty to small mappings.
      
      For the time being, just do this in munmap syscall path.  Other
      vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
      intact due to some implementation difficulties since they acquire write
      mmap_sem from very beginning and hold it until the end, do_munmap() might
      be called in the middle.  But, the optimized do_munmap would like to be
      called without mmap_sem held so that we can do the optimization.  So, if
      we want to do the similar optimization for mmap/mremap path, I'm afraid we
      would have to redesign them.  mremap might be called on very large area
      depending on the usecases, the optimization to it will be considered in
      the future.
      
      This patch (of 3):
      
      When running some mmap/munmap scalability tests with large memory (i.e.
      > 300GB), the below hung task issue may happen occasionally.
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
        ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
        ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
        00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
       Call Trace:
        [<ffffffff817154d0>] ? __schedule+0x250/0x730
        [<ffffffff817159e6>] schedule+0x36/0x80
        [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
        [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
        [<ffffffff81717db0>] down_read+0x20/0x40
        [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
        [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
        [<ffffffff81241d87>] __vfs_read+0x37/0x150
        [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
        [<ffffffff81242266>] vfs_read+0x96/0x130
        [<ffffffff812437b5>] SyS_read+0x55/0xc0
        [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      It is because munmap holds mmap_sem exclusively from very beginning to all
      the way down to the end, and doesn't release it in the middle.  When
      unmapping large mapping, it may take long time (take ~18 seconds to unmap
      320GB mapping with every single page mapped on an idle machine).
      
      Zapping pages is the most time consuming part, according to the suggestion
      from Michal Hocko [1], zapping pages can be done with holding read
      mmap_sem, like what MADV_DONTNEED does.  Then re-acquire write mmap_sem to
      cleanup vmas.
      
      But, some part may need write mmap_sem, for example, vma splitting. So,
      the design is as follows:
              acquire write mmap_sem
              lookup vmas (find and split vmas)
              deal with special mappings
              detach vmas
              downgrade_write
      
              zap pages
              free page tables
              release mmap_sem
      
      The vm events with read mmap_sem may come in during page zapping, but
      since vmas have been detached before, they, i.e.  page fault, gup, etc,
      will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
      expected.
      
      If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
      mappings.  They will be handled by without downgrading mmap_sem in this
      patch since they may update vm flags.
      
      But, with the "detach vmas first" approach, the vmas have been detached
      when vm flags are updated, so it sounds safe to update vm flags with read
      mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
      handled by using the optimized path in the following separate patches for
      bisectable sake.
      
      Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
      However it is fine to have false-positive MMF_RECALC_UPROBES according to
      uprobes developer.
      
      With the "detach vmas first" approach we don't have to re-acquire mmap_sem
      again to clean up vmas to avoid race window which might get the address
      space changed since downgrade_write() doesn't release the lock to lead
      regression, which simply downgrades to read lock.
      
      And, since the lock acquire/release cost is managed to the minimum and
      almost as same as before, the optimization could be extended to any size
      of mapping without incurring significant penalty to small mappings.
      
      For the time being, just do this in munmap syscall path.  Other
      vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
      intact due to some implementation difficulties since they acquire write
      mmap_sem from very beginning and hold it until the end, do_munmap() might
      be called in the middle.  But, the optimized do_munmap would like to be
      called without mmap_sem held so that we can do the optimization.  So, if
      we want to do the similar optimization for mmap/mremap path, I'm afraid we
      would have to redesign them.  mremap might be called on very large area
      depending on the usecases, the optimization to it will be considered in
      the future.
      
      With the patches, exclusive mmap_sem hold time when munmap a 80GB address
      space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
      from second.
      
      munmap_test-15002 [008]   594.380138: funcgraph_entry: |
      __vm_munmap() {
      munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us
      |    unmap_region();
      munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us
      |  }
      
      Here the execution time of unmap_region() is used to evaluate the time of
      holding read mmap_sem, then the remaining time is used with holding
      exclusive lock.
      
      [1] https://lwn.net/Articles/753269/
      
      Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: default avatarYang Shi &lt;yang.shi@linux.alibaba.com&gt;Suggested-by: Michal Hocko <mhocko@kernel.org>
      Suggested-by: default avatarKirill A. Shutemov <kirill@shutemov.name>
      Suggested-by: default avatarMatthew Wilcox <willy@infradead.org>
      Reviewed-by: default avatarMatthew Wilcox <willy@infradead.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd2283f2
  21. 13 Oct, 2018 1 commit
    • Jann Horn's avatar
      mm/mmap.c: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE · 7aa867dd
      Jann Horn authored
      Daniel Micay reports that attempting to use MAP_FIXED_NOREPLACE in an
      application causes that application to randomly crash.  The existing check
      for handling MAP_FIXED_NOREPLACE looks up the first VMA that either
      overlaps or follows the requested region, and then bails out if that VMA
      overlaps *the start* of the requested region.  It does not bail out if the
      VMA only overlaps another part of the requested region.
      
      Fix it by checking that the found VMA only starts at or after the end of
      the requested region, in which case there is no overlap.
      
      Test case:
      
      user@debian:~$ cat mmap_fixed_simple.c
      #include <sys/mman.h>
      #include <errno.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
      
      #ifndef MAP_FIXED_NOREPLACE
      #define MAP_FIXED_NOREPLACE 0x100000
      #endif
      
      int main(void) {
        char *p;
      
        errno = 0;
        p = mmap((void*)0x10001000, 0x4000, PROT_NONE,
      MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
        printf("p1=%p err=%m\n", p);
      
        errno = 0;
        p = mmap((void*)0x10000000, 0x2000, PROT_READ,
      MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE, -1, 0);
        printf("p2=%p err=%m\n", p);
      
        char cmd[100];
        sprintf(cmd, "cat /proc/%d/maps", getpid());
        system(cmd);
      
        return 0;
      }
      user@debian:~$ gcc -o mmap_fixed_simple mmap_fixed_simple.c
      user@debian:~$ ./mmap_fixed_simple
      p1=0x10001000 err=Success
      p2=0x10000000 err=Success
      10000000-10002000 r--p 00000000 00:00 0
      10002000-10005000 ---p 00000000 00:00 0
      564a9a06f000-564a9a070000 r-xp 00000000 fe:01 264004
        /home/user/mmap_fixed_simple
      564a9a26f000-564a9a270000 r--p 00000000 fe:01 264004
        /home/user/mmap_fixed_simple
      564a9a270000-564a9a271000 rw-p 00001000 fe:01 264004
        /home/user/mmap_fixed_simple
      564a9a54a000-564a9a56b000 rw-p 00000000 00:00 0                          [heap]
      7f8eba447000-7f8eba5dc000 r-xp 00000000 fe:01 405885
        /lib/x86_64-linux-gnu/libc-2.24.so
      7f8eba5dc000-7f8eba7dc000 ---p 00195000 fe:01 405885
        /lib/x86_64-linux-gnu/libc-2.24.so
      7f8eba7dc000-7f8eba7e0000 r--p 00195000 fe:01 405885
        /lib/x86_64-linux-gnu/libc-2.24.so
      7f8eba7e0000-7f8eba7e2000 rw-p 00199000 fe:01 405885
        /lib/x86_64-linux-gnu/libc-2.24.so
      7f8eba7e2000-7f8eba7e6000 rw-p 00000000 00:00 0
      7f8eba7e6000-7f8eba809000 r-xp 00000000 fe:01 405876
        /lib/x86_64-linux-gnu/ld-2.24.so
      7f8eba9e9000-7f8eba9eb000 rw-p 00000000 00:00 0
      7f8ebaa06000-7f8ebaa09000 rw-p 00000000 00:00 0
      7f8ebaa09000-7f8ebaa0a000 r--p 00023000 fe:01 405876
        /lib/x86_64-linux-gnu/ld-2.24.so
      7f8ebaa0a000-7f8ebaa0b000 rw-p 00024000 fe:01 405876
        /lib/x86_64-linux-gnu/ld-2.24.so
      7f8ebaa0b000-7f8ebaa0c000 rw-p 00000000 00:00 0
      7ffcc99fa000-7ffcc9a1b000 rw-p 00000000 00:00 0                          [stack]
      7ffcc9b44000-7ffcc9b47000 r--p 00000000 00:00 0                          [vvar]
      7ffcc9b47000-7ffcc9b49000 r-xp 00000000 00:00 0                          [vdso]
      ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0
        [vsyscall]
      user@debian:~$ uname -a
      Linux debian 4.19.0-rc6+ #181 SMP Wed Oct 3 23:43:42 CEST 2018 x86_64 GNU/Linux
      user@debian:~$
      
      As you can see, the first page of the mapping at 0x10001000 was clobbered.
      
      Link: http://lkml.kernel.org/r/20181010152736.99475-1-jannh@google.com
      Fixes: a4ff8e86
      
       ("mm: introduce MAP_FIXED_NOREPLACE")
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reported-by: default avatarDaniel Micay <danielmicay@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7aa867dd
  22. 22 Aug, 2018 2 commits
    • Michal Hocko's avatar
      mm, oom: remove oom_lock from oom_reaper · af5679fb
      Michal Hocko authored
      oom_reaper used to rely on the oom_lock since e2fe1456 ("oom_reaper:
      close race with exiting task").  We do not really need the lock anymore
      though.  21292580 ("mm: oom: let oom_reap_task and exit_mmap run
      concurrently") has removed serialization with the exit path based on the
      mm reference count and so we do not really rely on the oom_lock anymore.
      
      Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
      to prevent from races when the page allocator didn't manage to get the
      freed (reaped) memory in __alloc_pages_may_oom but it sees the flag later
      on and move on to another victim.  Although this is possible in principle
      let's wait for it to actually happen in real life before we make the
      locking more complex again.
      
      Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
      oom_reap_task_mm).  The reaper serializes with exit_mmap by mmap_sem +
      MMF_OOM_SKIP flag.  There is no synchronization with out_of_memory path
      now.
      
      [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
        Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Suggested-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af5679fb
    • Michal Hocko's avatar
      mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko authored
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee a forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover
      majority of notifiers only care about a portion of the address space and
      there is absolutely zero reason to fail when we are unmapping an unrelated
      range.  Many notifiers do really block and wait for HW which is harder to
      handle and we have to bail out though.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continue as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g.  after the test faults in all
      the mmu notifier managed memory and set the hard limit to something really
      small.  Then we are looking for a proper process tear down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      93065ac7
  23. 17 Aug, 2018 1 commit