1. 15 May, 2022 1 commit
  2. 10 Mar, 2022 1 commit
  3. 03 Feb, 2022 1 commit
    • Revert "mm/gup: small refactoring: simplify try_grab_page()" · c36c04c2
      John Hubbard authored
      This reverts commit 54d516b1
      
      That commit did a refactoring that effectively combined fast and slow
      gup paths (again).  And that was again incorrect, for two reasons:
      
       a) Fast gup and slow gup get reference counts on pages in different
          ways and with different goals: see Linus' writeup in commit
          cd1adf1b ("Revert "mm/gup: remove try_get_page(), call
          try_get_compound_head() directly""), and
      
       b) try_grab_compound_head() also has a specific check for
          "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller
          can fall back to slow gup. This resulted in new failures, as
          recently reported by Will McVicker [1].
      
      But (a) has problems too, even though they may not have been reported
      yet.  So just revert this.
      
      Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1]
      Fixes: 54d516b1 ("mm/gup: small refactoring: simplify try_grab_page()")
      Reported-and-tested-by: Will McVicker <willmcvicker@google.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: stable@vger.kernel.org # 5.15
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 15 Jan, 2022 2 commits
  5. 06 Nov, 2021 1 commit
  6. 24 Oct, 2021 1 commit
  7. 20 Oct, 2021 1 commit
  8. 18 Oct, 2021 1 commit
    • gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} · bb523b40
      Andreas Gruenbacher authored
      
      Turn fault_in_pages_{readable,writeable} into versions that return the
      number of bytes not faulted in, similar to copy_to_user, instead of
      returning a non-zero value when any of the requested pages couldn't be
      faulted in.  This supports the existing users that require all pages to
      be faulted in as well as new users that are happy if any pages can be
      faulted in.
      
      Rename the functions to fault_in_{readable,writeable} to make sure
      this change doesn't silently break things.
      
      Neither of these functions is entirely trivial and it doesn't seem
      useful to inline them, so move them to mm/gup.c.
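
      A minimal caller sketch of the new convention (kernel context; the
      declarations live in linux/pagemap.h, and the surrounding function is
      hypothetical). A return of zero means everything was faulted in; a
      full-length return means nothing was:

      	static int example_write(char __user *ubuf, const char *src, size_t len)
      	{
      		size_t left = fault_in_writeable(ubuf, len);

      		if (left == len)
      			return -EFAULT;	/* nothing could be faulted in */
      		len -= left;		/* partial progress may be acceptable */
      		if (copy_to_user(ubuf, src, len))
      			return -EFAULT;
      		return 0;
      	}
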
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
  9. 07 Sep, 2021 1 commit
    • Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly" · cd1adf1b
      Linus Torvalds authored
      This reverts commit 9857a17f.
      
      That commit was completely broken, and I should have caught on to it
      earlier.  But happily, the kernel test robot noticed the breakage fairly
      quickly.
      
      The breakage is because "try_get_page()" is about avoiding the page
      reference count overflow case, but is otherwise the exact same as a
      plain "get_page()".
      
      In contrast, "try_get_compound_head()" is an entirely different beast,
      and uses __page_cache_add_speculative() because it's not just about the
      page reference count, but also about possibly racing with the underlying
      page going away.
      
      So all the commentary about how
      
       "try_get_page() has fallen a little behind in terms of maintenance,
        try_get_compound_head() handles speculative page references more
        thoroughly"
      
      was just completely wrong: yes, try_get_compound_head() handles
      speculative page references, but the point is that try_get_page() does
      not, and must not.
      
      So there's no lack of maintenance - there are fundamentally different
      semantics.
      
      A speculative page reference would be entirely wrong in "get_page()",
      and it's entirely wrong in "try_get_page()".  It's not about
      speculation, it's purely about "uhhuh, you can't get this page because
      you've tried to increment the reference count too much already".
      
      The reason the kernel test robot noticed this bug was that it hit the
      VM_BUG_ON() in __page_cache_add_speculative(), which is all about
      verifying that the context of any speculative page access is correct.
      But since that isn't what try_get_page() is all about, the VM_BUG_ON()
      tests things that are not correct to test for try_get_page().
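
      For reference, the semantics being defended, in the approximate shape
      of the era's helper (a sketch, not a verbatim copy of
      include/linux/mm.h):

      	static inline bool try_get_page(struct page *page)
      	{
      		page = compound_head(page);
      		/* Refuse only a saturated or bogus refcount... */
      		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
      			return false;
      		/* ...otherwise behave exactly like get_page(). */
      		page_ref_inc(page);
      		return true;
      	}
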
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 03 Sep, 2021 9 commits
  11. 14 Aug, 2021 1 commit
    • mm/madvise: report SIGBUS as -EFAULT for MADV_POPULATE_(READ|WRITE) · eb2faa51
      David Hildenbrand authored
      While doing some extended tests and polishing the man page update for
      MADV_POPULATE_(READ|WRITE), I realized that we also end up converting
      SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
      madvise() user error.
      
      We want to report as -EINVAL only problematic mappings and permission
      problems that the user could have known about.
      
      Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
      but instead indicate -EFAULT to user space.  While we could also convert
      it to -ENOMEM, using -EFAULT looks more helpful when user space might
      want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
      not part of a final Linux release and we can still adjust the behavior.
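
      A hypothetical user-space check this change enables (fragment; assumes
      <errno.h>, <stdio.h> and <sys/mman.h>):

      	if (madvise(addr, len, MADV_POPULATE_WRITE) == -1) {
      		if (errno == EFAULT)
      			fprintf(stderr, "memory error (SIGBUS/SIGSEGV)\n");
      		else if (errno == EINVAL)
      			fprintf(stderr, "unsuitable mapping or permissions\n");
      	}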
      
      Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com
      Fixes: 4ca9b385 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 08 Jul, 2021 1 commit
    • mm: introduce memfd_secret system call to create "secret" memory areas · 1507f512
      Mike Rapoport authored
      Introduce "memfd_secret" system call with the ability to create memory
      areas visible only in the context of the owning process and not mapped not
      only to other processes but in the kernel page tables as well.
      
      The secretmem feature is off by default and the user must explicitly
      enable it at boot time.
      
      Once secretmem is enabled, the user will be able to create a file
      descriptor using the memfd_secret() system call.  The memory areas created
      by mmap() calls from this file descriptor will be unmapped from the
      kernel direct map and will only be mapped in the page tables of the
      processes that have access to the file descriptor.
      
      Secretmem is designed to provide the following protections:
      
      * Enhanced protection (in conjunction with all the other in-kernel
        attack prevention systems) against ROP attacks.  Secretmem makes
        "simple" ROP insufficient to perform exfiltration, which increases the
        required complexity of the attack.  Along with other protections like
        the kernel stack size limit and address space layout randomization which
        make finding gadgets really hard, absence of any in-kernel primitive
        for accessing secret memory means the one gadget ROP attack can't work.
        Since the only way to access secret memory is to reconstruct the missing
        mapping entry, the attacker has to recover the physical page and insert
        a PTE pointing to it in the kernel and then retrieve the contents.  That
        takes at least three gadgets which is a level of difficulty beyond most
        standard attacks.
      
      * Prevent cross-process secret userspace memory exposures.  Once the
        secret memory is allocated, the user can't accidentally pass it into the
        kernel to be transmitted somewhere.  The secretmem pages cannot be
        accessed via the direct map and they are disallowed in GUP.
      
      * Harden against exploited kernel flaws.  In order to access secretmem,
        a kernel-side attack would need to either walk the page tables and
        create new ones, or spawn a new privileged userspace process to perform
        secrets exfiltration using ptrace.
      
      The file descriptor based memory has several advantages over the
      "traditional" mm interfaces, such as mlock(), mprotect(), madvise().  The
      file descriptor approach allows explicit and controlled sharing of the
      memory areas, and it allows the operations to be sealed.  Besides, file
      descriptor based memory paves the way for VMMs to remove the secret
      memory range from the userspace hypervisor process, for instance QEMU.
      Andy Lutomirski says:
      
        "Getting fd-backed memory into a guest will take some possibly major
        work in the kernel, but getting vma-backed memory into a guest without
        mapping it in the host user address space seems much, much worse."
      
      memfd_secret() is made a dedicated system call rather than an extension to
      memfd_create() because its purpose is to allow the user to create more
      secure memory mappings rather than to simply allow file based access to
      the memory.  Nowadays the cost of a new system call is negligible, while
      it is far simpler for userspace to deal with a clear-cut system call than
      with a multiplexer or an overloaded syscall.  Moreover, the initial
      implementation of memfd_secret() is completely distinct from
      memfd_create(), so there is not much sense in overloading memfd_create()
      to begin with.  If a need for code sharing between these implementations
      arises, it can easily be achieved without adjusting user-visible APIs.
      
      The secret memory remains accessible in the process context using uaccess
      primitives, but it is not exposed to the kernel otherwise; secret memory
      areas are removed from the direct map and functions in the
      follow_page()/get_user_page() family will refuse to return a page that
      belongs to the secret memory area.
      
      If a use case arises that requires exposing secretmem to the kernel, it
      will be an opt-in request in the system call flags, so that the user
      would have to decide what data can be exposed to the kernel.
      
      Removing the pages from the direct map may cause fragmentation of the
      direct map on architectures that use large pages to map physical memory,
      which affects system performance.  However, the original Kconfig text for
      CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "...  can
      improve the kernel's performance a tiny bit ..." (commit 00d1c5e0
      ("x86: add gbpages switches")) and the recent report [1] showed that "...
      although 1G mappings are a good default choice, there is no compelling
      evidence that it must be the only choice".  Hence, it is sufficient to
      have secretmem disabled by default with the ability of a system
      administrator to enable it at boot time.
      
      Pages in the secretmem regions are unevictable and unmovable to avoid
      accidental exposure of the sensitive data via swap or during page
      migration.
      
      Since the secretmem mappings are locked in memory they cannot exceed
      RLIMIT_MEMLOCK.  Since these mappings are already locked independently
      from mlock(), an attempt to mlock()/munlock() secretmem range would fail
      and mlockall()/munlockall() will ignore secretmem mappings.
      
      However, unlike mlock()ed memory, secretmem currently behaves more like
      long-term GUP: secretmem mappings are unmovable mappings directly consumed
      by user space.  With default limits, there is no excessive use of
      secretmem and it poses no real problem in combination with
      ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
      balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.
      
      A page that was a part of the secret memory area is cleared when it is
      freed to ensure the data is not exposed to the next user of that page.
      
      The following example demonstrates creation of a secret mapping (error
      handling is omitted):
      
      	fd = memfd_secret(0);
      	ftruncate(fd, MAP_SIZE);
      	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      		   MAP_SHARED, fd, 0);
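
      A complete, hedged version of the snippet above with error handling
      (user space; assumes a <sys/syscall.h> that defines SYS_memfd_secret,
      447 on x86-64, and a kernel booted with secretmem enabled):

      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>

      	#ifndef SYS_memfd_secret
      	#define SYS_memfd_secret 447	/* x86-64 syscall number */
      	#endif

      	#define MAP_SIZE 4096UL

      	int main(void)
      	{
      		int fd = syscall(SYS_memfd_secret, 0);

      		if (fd < 0) {
      			perror("memfd_secret");	/* e.g. ENOSYS if disabled */
      			return 1;
      		}
      		if (ftruncate(fd, MAP_SIZE) < 0) {
      			perror("ftruncate");
      			return 1;
      		}
      		char *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      				 MAP_SHARED, fd, 0);
      		if (ptr == MAP_FAILED) {
      			perror("mmap");
      			return 1;
      		}
      		strcpy(ptr, "not visible via the kernel direct map");
      		munmap(ptr, MAP_SIZE);
      		close(fd);
      		return 0;
      	}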
      
      [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/
      
      [akpm@linux-foundation.org: suppress Kconfig whine]
      
      Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Acked-by: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Bottomley <jejb@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Palmer Dabbelt <palmerdabbelt@google.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tycho Andersen <tycho@tycho.ws>
      Cc: Will Deacon <will@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: kernel test robot <lkp@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 01 Jul, 2021 1 commit
    • mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables · 4ca9b385
      David Hildenbrand authored
      I. Background: Sparse Memory Mappings
      
      When we manage sparse memory mappings dynamically in user space - also
      sometimes involving MAP_NORESERVE - we want to dynamically populate/
      discard memory inside such a sparse memory region.  Example users are
      hypervisors (especially implementing memory ballooning or similar
      technologies like virtio-mem) and memory allocators.  In addition, we want
      to fail in a nice way (instead of generating SIGBUS) if populating does
      not succeed because we are out of backend memory (which can happen easily
      with file-based mappings, especially tmpfs and hugetlbfs).
      
      While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
      reliably discarding memory for most mapping types, there is no generic
      approach to populate page tables and preallocate memory.
      
      Although mmap() supports MAP_POPULATE, it is not applicable to the concept
      of sparse memory mappings, where we want to populate/discard dynamically
      and avoid expensive/problematic remappings.  In addition, we never
      actually report errors during the final populate phase - it is best-effort
      only.
      
      fallocate() can be used to preallocate file-based memory and fail in a
      safe way.  However, it cannot really be used for any private mappings on
      anonymous files via memfd due to COW semantics.  In addition, fallocate()
      does not actually populate page tables, so we still always get pagefaults
      on first access - which is sometimes undesired (i.e., real-time workloads)
      and requires real prefaulting of page tables, not just a preallocation of
      backend storage.  There might be interesting use cases for sparse memory
      regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
      as it does not prefault page tables.
      
      II. On preallocation/prefaulting from user space
      
      Because we don't have a proper interface, what applications (like QEMU and
      databases) end up doing is touching (i.e., reading+writing one byte to not
      overwrite existing data) all individual pages.
      
      However, that approach
      1) Can result in wear on storage backing, because we end up reading/writing
         each page; this is especially a problem for dax/pmem.
      2) Can result in mmap_sem contention when prefaulting via multiple
         threads.
      3) Requires expensive signal handling, especially to catch SIGBUS in case
         of hugetlbfs/shmem/file-backed memory. For example, this is
         problematic in hypervisors like QEMU where SIGBUS handlers might already
         be used by other subsystems concurrently to, e.g., handle hardware errors.
         "Simply" doing preallocation concurrently from other thread is not that
         easy.
      
      III. On MADV_WILLNEED
      
      Extending MADV_WILLNEED is not an option because
      1. It would change the semantics: "Expect access in the near future." and
         "might be a good idea to read some pages" vs. "Definitely populate/
         preallocate all memory and definitely fail on errors.".
      2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
         don't want populate/prealloc semantics. They treat this rather as a hint
         to give a little performance boost without too much overhead - and don't
         expect that a lot of memory might get consumed or a lot of time
         might be spent.
      
      IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE
      
      Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
      MAP_POPULATE, with the following semantics (a short usage sketch follows
      the list):
      1. MADV_POPULATE_READ can be used to prefault page tables just like
         manually reading each individual page. This will not break any COW
         mappings. The shared zero page might get mapped and no backend storage
         might get preallocated -- allocation might be deferred to
         write-fault time. Especially shared file mappings require an explicit
         fallocate() upfront to actually preallocate backend memory (blocks in
         the file system) in case the file might have holes.
      2. If MADV_POPULATE_READ succeeds, all page tables have been populated
         (prefaulted) readable once.
      3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
         prefault page tables just like manually writing (or
         reading+writing) each individual page. This will break any COW
         mappings -- e.g., the shared zeropage is never populated.
      4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
         (prefaulted) writable once.
      5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
         mappings marked with VM_PFNMAP and VM_IO. Also, proper access
         permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
         mapping is encountered, madvise() fails with -EINVAL.
      6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
         might have been populated.
      7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
         when encountering a HW poisoned page in the range.
      8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
         cannot protect from the OOM (Out Of Memory) handler killing the
         process.
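
      The usage sketch promised above (user space, error handling trimmed;
      the fallback constants match include/uapi/asm-generic/mman-common.h on
      kernels carrying this patch):

      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <sys/mman.h>

      	#ifndef MADV_POPULATE_READ
      	#define MADV_POPULATE_READ  22
      	#define MADV_POPULATE_WRITE 23
      	#endif

      	int main(void)
      	{
      		size_t len = 64UL << 20;	/* 64 MiB sparse region */
      		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
      			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
      			       -1, 0);

      		if (p == MAP_FAILED) {
      			perror("mmap");
      			return 1;
      		}
      		/* Preallocate memory and prefault page tables writable. */
      		if (madvise(p, len, MADV_POPULATE_WRITE)) {
      			perror("madvise(MADV_POPULATE_WRITE)");
      			return 1;
      		}
      		munmap(p, len);
      		return 0;
      	}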
      
      While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
      preallocate memory and prefault page tables for VMs), one issue is that
      whenever we prefault pages writable, the pages have to be marked dirty,
      because the CPU could dirty them at any time.  While not a real problem for
      hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
      page will be marked dirty and has to be written back later when evicting.
      
      MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
      mapping from backend storage without marking it dirty, such that eviction
      won't have to write it back.  As discussed above, shared file mappings
      might require an explicit fallocate() upfront to achieve
      preallocation+prepopulation.
      
      Although sparse memory mappings are the primary use case, this will also
      be useful for other preallocate/prefault use cases where MAP_POPULATE is
      not desired or the semantics of MAP_POPULATE are not sufficient: as one
      example, QEMU users can trigger preallocation/prefaulting of guest RAM
      after the mapping was created -- and don't want errors to be silently
      suppressed.
      
      Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
      however, the main motivation back then was performance improvements --
      which should also still be the case.
      
      V. Single-threaded performance comparison
      
      I did a short experiment, prefaulting page tables on completely *empty
      mappings/files* and repeated the experiment 10 times.  The results
      correspond to the shortest execution time.  In general, the performance
      benefit for huge pages is negligible with small mappings.
      
      V.1: Private mappings
      
      POPULATE_READ and POPULATE_WRITE are fastest.  Note that
      Reading/POPULATE_READ will populate the shared zeropage where applicable
      -- which results in short population times.
      
      The fastest way to allocate backend storage (here: swap or huge pages) and
      prefault page tables is POPULATE_WRITE.
      
      V.2: Shared mappings
      
      fallocate() is fastest; however, it doesn't prefault page tables.
      POPULATE_WRITE is faster than simple writes and read/writes.
      POPULATE_READ is faster than simple reads.
      
      Without a fd, the fastest way to allocate backend storage and prefault
      page tables is POPULATE_WRITE.  With an fd, the fastest way is usually
      FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
      exception are actual files: FALLOCATE+Read is slightly faster than
      FALLOCATE+POPULATE_READ.
      
      The fastest way to allocate backend storage and prefault page tables is
      FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
      FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
      dirty.
      
      V.3: Detailed results
      
      ==================================================
      2 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :     0.119 ms
      Anon 4 KiB     : Write                    :     0.222 ms
      Anon 4 KiB     : Read/Write               :     0.380 ms
      Anon 4 KiB     : POPULATE_READ            :     0.060 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.158 ms
      Memfd 4 KiB    : Read                     :     0.034 ms
      Memfd 4 KiB    : Write                    :     0.310 ms
      Memfd 4 KiB    : Read/Write               :     0.362 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.039 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.229 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      tmpfs          : Read                     :     0.033 ms
      tmpfs          : Write                    :     0.313 ms
      tmpfs          : Read/Write               :     0.406 ms
      tmpfs          : POPULATE_READ            :     0.039 ms
      tmpfs          : POPULATE_WRITE           :     0.285 ms
      file           : Read                     :     0.033 ms
      file           : Write                    :     0.351 ms
      file           : Read/Write               :     0.408 ms
      file           : POPULATE_READ            :     0.039 ms
      file           : POPULATE_WRITE           :     0.290 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      **************************************************
      4096 MiB MAP_PRIVATE:
      **************************************************
      Anon 4 KiB     : Read                     :   237.940 ms
      Anon 4 KiB     : Write                    :   708.409 ms
      Anon 4 KiB     : Read/Write               :  1054.041 ms
      Anon 4 KiB     : POPULATE_READ            :   124.310 ms
      Anon 4 KiB     : POPULATE_WRITE           :   572.582 ms
      Memfd 4 KiB    : Read                     :   136.928 ms
      Memfd 4 KiB    : Write                    :   963.898 ms
      Memfd 4 KiB    : Read/Write               :  1106.561 ms
      Memfd 4 KiB    : POPULATE_READ            :    78.450 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   805.881 ms
      Memfd 2 MiB    : Read                     :   357.116 ms
      Memfd 2 MiB    : Write                    :   357.210 ms
      Memfd 2 MiB    : Read/Write               :   357.606 ms
      Memfd 2 MiB    : POPULATE_READ            :   356.094 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   356.937 ms
      tmpfs          : Read                     :   137.536 ms
      tmpfs          : Write                    :   954.362 ms
      tmpfs          : Read/Write               :  1105.954 ms
      tmpfs          : POPULATE_READ            :    80.289 ms
      tmpfs          : POPULATE_WRITE           :   822.826 ms
      file           : Read                     :   137.874 ms
      file           : Write                    :   987.025 ms
      file           : Read/Write               :  1107.439 ms
      file           : POPULATE_READ            :    80.413 ms
      file           : POPULATE_WRITE           :   857.622 ms
      hugetlbfs      : Read                     :   355.607 ms
      hugetlbfs      : Write                    :   355.729 ms
      hugetlbfs      : Read/Write               :   356.127 ms
      hugetlbfs      : POPULATE_READ            :   354.585 ms
      hugetlbfs      : POPULATE_WRITE           :   355.138 ms
      **************************************************
      2 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :     0.394 ms
      Anon 4 KiB     : Write                    :     0.348 ms
      Anon 4 KiB     : Read/Write               :     0.400 ms
      Anon 4 KiB     : POPULATE_READ            :     0.326 ms
      Anon 4 KiB     : POPULATE_WRITE           :     0.273 ms
      Anon 2 MiB     : Read                     :     0.030 ms
      Anon 2 MiB     : Write                    :     0.030 ms
      Anon 2 MiB     : Read/Write               :     0.030 ms
      Anon 2 MiB     : POPULATE_READ            :     0.030 ms
      Anon 2 MiB     : POPULATE_WRITE           :     0.030 ms
      Memfd 4 KiB    : Read                     :     0.412 ms
      Memfd 4 KiB    : Write                    :     0.372 ms
      Memfd 4 KiB    : Read/Write               :     0.419 ms
      Memfd 4 KiB    : POPULATE_READ            :     0.343 ms
      Memfd 4 KiB    : POPULATE_WRITE           :     0.288 ms
      Memfd 4 KiB    : FALLOCATE                :     0.137 ms
      Memfd 4 KiB    : FALLOCATE+Read           :     0.446 ms
      Memfd 4 KiB    : FALLOCATE+Write          :     0.330 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :     0.454 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :     0.379 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :     0.268 ms
      Memfd 2 MiB    : Read                     :     0.030 ms
      Memfd 2 MiB    : Write                    :     0.030 ms
      Memfd 2 MiB    : Read/Write               :     0.030 ms
      Memfd 2 MiB    : POPULATE_READ            :     0.030 ms
      Memfd 2 MiB    : POPULATE_WRITE           :     0.030 ms
      Memfd 2 MiB    : FALLOCATE                :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+Read           :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Write          :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :     0.031 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :     0.030 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :     0.030 ms
      tmpfs          : Read                     :     0.416 ms
      tmpfs          : Write                    :     0.369 ms
      tmpfs          : Read/Write               :     0.425 ms
      tmpfs          : POPULATE_READ            :     0.346 ms
      tmpfs          : POPULATE_WRITE           :     0.295 ms
      tmpfs          : FALLOCATE                :     0.139 ms
      tmpfs          : FALLOCATE+Read           :     0.447 ms
      tmpfs          : FALLOCATE+Write          :     0.333 ms
      tmpfs          : FALLOCATE+Read/Write     :     0.454 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :     0.380 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :     0.272 ms
      file           : Read                     :     0.191 ms
      file           : Write                    :     0.511 ms
      file           : Read/Write               :     0.524 ms
      file           : POPULATE_READ            :     0.196 ms
      file           : POPULATE_WRITE           :     0.434 ms
      file           : FALLOCATE                :     0.004 ms
      file           : FALLOCATE+Read           :     0.197 ms
      file           : FALLOCATE+Write          :     0.554 ms
      file           : FALLOCATE+Read/Write     :     0.480 ms
      file           : FALLOCATE+POPULATE_READ  :     0.201 ms
      file           : FALLOCATE+POPULATE_WRITE :     0.381 ms
      hugetlbfs      : Read                     :     0.030 ms
      hugetlbfs      : Write                    :     0.030 ms
      hugetlbfs      : Read/Write               :     0.030 ms
      hugetlbfs      : POPULATE_READ            :     0.030 ms
      hugetlbfs      : POPULATE_WRITE           :     0.030 ms
      hugetlbfs      : FALLOCATE                :     0.030 ms
      hugetlbfs      : FALLOCATE+Read           :     0.031 ms
      hugetlbfs      : FALLOCATE+Write          :     0.031 ms
      hugetlbfs      : FALLOCATE+Read/Write     :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :     0.030 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :     0.030 ms
      **************************************************
      4096 MiB MAP_SHARED:
      **************************************************
      Anon 4 KiB     : Read                     :  1053.090 ms
      Anon 4 KiB     : Write                    :   913.642 ms
      Anon 4 KiB     : Read/Write               :  1060.350 ms
      Anon 4 KiB     : POPULATE_READ            :   893.691 ms
      Anon 4 KiB     : POPULATE_WRITE           :   782.885 ms
      Anon 2 MiB     : Read                     :   358.553 ms
      Anon 2 MiB     : Write                    :   358.419 ms
      Anon 2 MiB     : Read/Write               :   357.992 ms
      Anon 2 MiB     : POPULATE_READ            :   357.533 ms
      Anon 2 MiB     : POPULATE_WRITE           :   357.808 ms
      Memfd 4 KiB    : Read                     :  1078.144 ms
      Memfd 4 KiB    : Write                    :   942.036 ms
      Memfd 4 KiB    : Read/Write               :  1100.391 ms
      Memfd 4 KiB    : POPULATE_READ            :   925.829 ms
      Memfd 4 KiB    : POPULATE_WRITE           :   804.394 ms
      Memfd 4 KiB    : FALLOCATE                :   304.632 ms
      Memfd 4 KiB    : FALLOCATE+Read           :  1163.359 ms
      Memfd 4 KiB    : FALLOCATE+Write          :   933.186 ms
      Memfd 4 KiB    : FALLOCATE+Read/Write     :  1187.304 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_READ  :  1013.660 ms
      Memfd 4 KiB    : FALLOCATE+POPULATE_WRITE :   794.560 ms
      Memfd 2 MiB    : Read                     :   358.131 ms
      Memfd 2 MiB    : Write                    :   358.099 ms
      Memfd 2 MiB    : Read/Write               :   358.250 ms
      Memfd 2 MiB    : POPULATE_READ            :   357.563 ms
      Memfd 2 MiB    : POPULATE_WRITE           :   357.334 ms
      Memfd 2 MiB    : FALLOCATE                :   356.735 ms
      Memfd 2 MiB    : FALLOCATE+Read           :   358.152 ms
      Memfd 2 MiB    : FALLOCATE+Write          :   358.331 ms
      Memfd 2 MiB    : FALLOCATE+Read/Write     :   358.018 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_READ  :   357.286 ms
      Memfd 2 MiB    : FALLOCATE+POPULATE_WRITE :   357.523 ms
      tmpfs          : Read                     :  1087.265 ms
      tmpfs          : Write                    :   950.840 ms
      tmpfs          : Read/Write               :  1107.567 ms
      tmpfs          : POPULATE_READ            :   922.605 ms
      tmpfs          : POPULATE_WRITE           :   810.094 ms
      tmpfs          : FALLOCATE                :   306.320 ms
      tmpfs          : FALLOCATE+Read           :  1169.796 ms
      tmpfs          : FALLOCATE+Write          :   933.730 ms
      tmpfs          : FALLOCATE+Read/Write     :  1191.610 ms
      tmpfs          : FALLOCATE+POPULATE_READ  :  1020.474 ms
      tmpfs          : FALLOCATE+POPULATE_WRITE :   798.945 ms
      file           : Read                     :   654.101 ms
      file           : Write                    :  1259.142 ms
      file           : Read/Write               :  1289.509 ms
      file           : POPULATE_READ            :   661.642 ms
      file           : POPULATE_WRITE           :  1106.816 ms
      file           : FALLOCATE                :     1.864 ms
      file           : FALLOCATE+Read           :   656.328 ms
      file           : FALLOCATE+Write          :  1153.300 ms
      file           : FALLOCATE+Read/Write     :  1180.613 ms
      file           : FALLOCATE+POPULATE_READ  :   668.347 ms
      file           : FALLOCATE+POPULATE_WRITE :   996.143 ms
      hugetlbfs      : Read                     :   357.245 ms
      hugetlbfs      : Write                    :   357.413 ms
      hugetlbfs      : Read/Write               :   357.120 ms
      hugetlbfs      : POPULATE_READ            :   356.321 ms
      hugetlbfs      : POPULATE_WRITE           :   356.693 ms
      hugetlbfs      : FALLOCATE                :   355.927 ms
      hugetlbfs      : FALLOCATE+Read           :   357.074 ms
      hugetlbfs      : FALLOCATE+Write          :   357.120 ms
      hugetlbfs      : FALLOCATE+Read/Write     :   356.983 ms
      hugetlbfs      : FALLOCATE+POPULATE_READ  :   356.413 ms
      hugetlbfs      : FALLOCATE+POPULATE_WRITE :   356.266 ms
      **************************************************
      
      [1] https://lkml.org/lkml/2013/6/27/698
      
      [akpm@linux-foundation.org: coding style fixes]
      
      Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com
      
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 29 Jun, 2021 3 commits
    • mm: gup: pack has_pinned in MMF_HAS_PINNED · a458b76a
      Andrea Arcangeli authored
      The 32-bit has_pinned can be packed into the MMF_HAS_PINNED bit as a
      noop cleanup.
      
      Any atomic_inc/dec to the mm cacheline shared by all threads in pin-fast
      would reintroduce a loss of SMP scalability to pin-fast, so there's no
      future potential usefulness to keep an atomic in the mm for this.
      
      set_bit(MMF_HAS_PINNED) will theoretically be a bit slower than WRITE_ONCE
      (atomic_set is equivalent to WRITE_ONCE), but the set_bit (just like
      atomic_set after this commit) still has to be issued only once per "mm",
      so the difference between the two will be lost in the noise.
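
      A sketch of the helper mentioned below in the changelog (kernel
      context, simplified): the flag is set at most once per mm, and testing
      first avoids even the atomic RMW when the bit is already set.

      	static void mm_set_has_pinned_flag(unsigned long *mm_flags)
      	{
      		if (!test_bit(MMF_HAS_PINNED, mm_flags))
      			set_bit(MMF_HAS_PINNED, mm_flags);
      	}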
      
      will-it-scale "mmap2" shows no change in performance with enterprise
      config as expected.
      
      will-it-scale "pin_fast" retains the > 4000% SMP scalability performance
      improvement against upstream as expected.
      
      This is a noop as far as overall performance and SMP scalability are
      concerned.
      
      [peterx@redhat.com: pack has_pinned in MMF_HAS_PINNED]
        Link: https://lkml.kernel.org/r/YJqWESqyxa8OZA+2@t490s
      [akpm@linux-foundation.org: coding style fixes]
      [peterx@redhat.com: fix build for task_mmu.c, introduce mm_set_has_pinned_flag, fix comments]
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-4-peterx@redhat.com
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: gup: allow FOLL_PIN to scale in SMP · 292648ac
      Andrea Arcangeli authored
      has_pinned cannot be written by each pin-fast or it won't scale in SMP.
      This isn't "false sharing" strictly speaking (it's more like "true
      non-sharing"), but it creates the same SMP scalability bottleneck of
      "false sharing".
      
      To verify the improvement, the test below was run on a 40-CPU host with
      an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (CONFIG_GUP_TEST=y is
      required):
      
        $ sudo chrt -f 1 ./gup_test -a  -m 512 -j 40
      
      which gives (average value across 40 threads):
      
        Old kernel: 477729.97 (+- 3.79%)
        New kernel:  89144.65 (+-11.76%)
      
      Under a similar setup with 256 CPUs, this commit increases the SMP
      scalability of pin_user_pages_fast() executed by different threads of the
      same process by more than 4000%.
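
      A sketch of the pattern this commit introduces (kernel context,
      simplified): read the shared mm field first, and dirty the cacheline
      only on the very first pin.

      	if ((gup_flags & FOLL_PIN) && !atomic_read(&mm->has_pinned))
      		atomic_set(&mm->has_pinned, 1);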
      
      [peterx@redhat.com: rewrite commit message, add parentheses against "(A & B)"]
      
      Link: https://lkml.kernel.org/r/20210507150553.208763-3-peterx@redhat.com
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: fix try_grab_compound_head() race with split_huge_page() · c24d3732
      Jann Horn authored
      try_grab_compound_head() is used to grab a reference to a page from
      get_user_pages_fast(), which is only protected against concurrent freeing
      of page tables (via local_irq_save()), but not against concurrent TLB
      flushes, freeing of data pages, or splitting of compound pages.
      
      Because no reference is held to the page when try_grab_compound_head() is
      called, the page may have been freed and reallocated by the time its
      refcount has been elevated; therefore, once we're holding a stable
      reference to the page, the caller re-checks whether the PTE still points
      to the same page (with the same access rights).
      
      The problem is that try_grab_compound_head() has to grab a reference on
      the head page; but between the time we look up what the head page is and
      the time we actually grab a reference on the head page, the compound page
      may have been split up (either explicitly through split_huge_page() or by
      freeing the compound page to the buddy allocator and then allocating its
      individual order-0 pages).  If that happens, get_user_pages_fast() may end
      up returning the right page but lifting the refcount on a now-unrelated
      page, leading to use-after-free of pages.
      
      To fix it: Re-check whether the pages still belong together after lifting
      the refcount on the head page.  Move anything else that checks
      compound_head(page) below the refcount increment.
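
      A simplified sketch of that recheck (kernel context; put_page_refs()
      stands for the undo step that drops the references just taken):

      	head = try_get_compound_head(page, refs);
      	if (!head)
      		return NULL;
      	if (unlikely(compound_head(page) != head)) {
      		/* The compound page was split while we were grabbing
      		 * the head page: drop the reference and bail out.
      		 */
      		put_page_refs(head, refs);
      		return NULL;
      	}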
      
      This can't actually happen on bare-metal x86 (because there, disabling
      IRQs locks out remote TLB flushes), but it can happen on virtualized x86
      (e.g.  under KVM) and probably also on arm64.  The race window is pretty
      narrow, and constantly allocating and shattering hugepages isn't exactly
      fast; for now I've only managed to reproduce this in an x86 KVM guest with
      an artificially widened timing window (by adding a loop that repeatedly
      calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so
      that PV TLB flushes are used instead of IPIs).
      
      As requested on the list, also replace the existing VM_BUG_ON_PAGE() with
      a warning and bailout.  Since the existing code only performed the BUG_ON
      check on DEBUG_VM kernels, ensure that the new code also only performs the
      check under that configuration - I don't want to mix two logically
      separate changes together too much.  The macro VM_WARN_ON_ONCE_PAGE()
      doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef
      block.  An alternative would be to change the VM_WARN_ON_ONCE_PAGE()
      definition for !DEBUG_VM such that it always returns false, but since that
      would differ from the behavior of the normal WARN macros, it might be too
      confusing for readers.
      
      Link: https://lkml.kernel.org/r/20210615012014.1100672-1-jannh@google.com
      Fixes: 7aef4172 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
      Signed-off-by: Jann Horn <jannh@google.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jan Kara <jack@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 23 May, 2021 1 commit
  16. 07 May, 2021 1 commit
  17. 05 May, 2021 8 commits
    • mm/gup: longterm pin migration cleanup · f68749ec
      Pavel Tatashin authored
      When pages are longterm pinned, we must migrate them out of the movable
      zone.  The function that migrates them has a hidden loop with goto.  The
      loop retries on isolation failures and after successful migration.
      
      Make this code better by moving this loop to the caller.
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-13-pasha.tatashin@soleen.com
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: change index type to long as it counts pages · 24dc20c7
      Pavel Tatashin authored
      In __get_user_pages_locked(), i counts the number of pages and should
      therefore be long: long is used everywhere else to hold page counts, and
      32 bits is increasingly small for values proportional to page counts.
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-12-pasha.tatashin@soleen.com
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: migrate pinned pages out of movable zone · d1e153fe
      Pavel Tatashin authored
      We should not pin pages in ZONE_MOVABLE.  Currently, we avoid pinning
      only movable CMA pages.  Generalize the function that migrates CMA pages
      to migrate all movable pages.  Use is_pinnable_page() to check which
      pages need to be migrated.
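
      A simplified sketch of the generalized check (kernel context; the list
      name is illustrative, following the changelog):

      	if (!is_pinnable_page(head))
      		/* queue for migration out of ZONE_MOVABLE/CMA */
      		list_add_tail(&head->lru, &movable_page_list);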
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-10-pasha.tatashin@soleen.com
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm cma: rename PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN · 1a08ae36
      Pavel Tatashin authored
      PF_MEMALLOC_NOCMA is used to guarantee that the allocator will not
      return pages that might belong to the CMA region.  This is currently
      used for long term gup to make sure that such pins are not going to be
      done on any CMA pages.
      
      When PF_MEMALLOC_NOCMA was introduced, we didn't realize that it focuses
      too much on CMA pages and that there is a larger class of pages that
      needs the same treatment.  The MOVABLE zone cannot contain any long term
      pins either, so it makes sense to reuse and redefine this flag for that
      usecase as well.  Rename the flag to PF_MEMALLOC_PIN, which defines an
      allocation context which can only get pages suitable for long-term pins.
      
      Also rename memalloc_nocma_save()/memalloc_nocma_restore() to
      memalloc_pin_save()/memalloc_pin_restore() and make the new functions
      common.
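
      A usage sketch of the renamed scope API (kernel context): inside the
      save/restore pair, allocations only return pages suitable for
      long-term pinning, i.e. not from ZONE_MOVABLE or CMA.

      	unsigned int flags = memalloc_pin_save();

      	/* ... allocate pages destined for a long-term pin ... */

      	memalloc_pin_restore(flags);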
      
      [rppt@linux.ibm.com: fix renaming of PF_MEMALLOC_NOCMA to PF_MEMALLOC_PIN]
        Link: https://lkml.kernel.org/r/20210331163816.11517-1-rppt@kernel.org
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-6-pasha.tatashin@soleen.com
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: check for isolation errors · 6e7f34eb
      Pavel Tatashin authored
      It is still possible that we pin movable CMA pages if there are
      isolation errors and cma_page_list stays empty when we check again.
      
      Check for isolation errors, and return success only when there are no
      isolation errors, and cma_page_list is empty after checking.
      
      Because isolation errors are transient, we retry indefinitely.
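
      A simplified sketch of the success condition described above (kernel
      context; names follow this patch's description):

      	if (list_empty(&cma_page_list) && !isolation_error_count)
      		return ret;	/* nothing failed, nothing left to migrate */
      	/* otherwise: unpin everything and retry the whole pass */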
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-5-pasha.tatashin@soleen.com
      Fixes: 9a4e9f3b ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6e7f34eb
    • mm/gup: return an error on migration failure · f0f44638
      Pavel Tatashin authored
      When migration fails we still pin the pages, which means that we may
      pin movable CMA pages, which should never be the case.

      Instead, return an error without pinning pages when migration fails.
      
      No need to retry migrating, because migrate_pages() already retries 10
      times.
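
      The shape of the change (a sketch along the lines of the migration
      call in mm/gup.c, where mtc is the migration_target_control used
      there; simplified, not the literal diff):

          ret = migrate_pages(&cma_page_list, alloc_migration_target, NULL,
                              (unsigned long)&mtc, MIGRATE_SYNC,
                              MR_CONTIG_RANGE);
          if (ret) {
                  /* put back whatever failed to migrate, then give up */
                  if (!list_empty(&cma_page_list))
                          putback_movable_pages(&cma_page_list);
                  /*
                   * migrate_pages() returns the number of pages it could
                   * not migrate, or a negative errno.
                   */
                  return ret > 0 ? -ENOMEM : ret;
          }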
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-4-pasha.tatashin@soleen.com
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0f44638
    • mm/gup: check every subpage of a compound page during isolation · 83c02c23
      Pavel Tatashin authored
      When pages are isolated in check_and_migrate_movable_pages() we skip a
      compound number of pages at a time.  However, as Jason noted, it is not
      necessarily correct that pages[i] corresponds to the pages that we
      skipped.  This is because the addresses in this range may have been
      through split_huge_pmd()/split_huge_pud(), and these functions do not
      update the compound page metadata.

      The problem can be reproduced if something like this occurs:

      1. User faulted huge pages.
      2. split_huge_pmd() was called for some reason.
      3. User has unmapped some sub-pages in the range.
      4. User tries to longterm pin the addresses.

      The resulting pages[i] might end up containing pages that are not
      aligned to the compound page size.
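
      Concretely, the isolation loop now walks every subpage and
      de-duplicates via the current head page instead of skipping a compound
      number of pages at a time.  A simplified sketch (not the literal diff;
      i, nr_pages and pages are locals from the surrounding code):

          struct page *head, *prev_head = NULL;

          for (i = 0; i < nr_pages; i++) {
                  /*
                   * Re-derive the head for every subpage: after a THP
                   * split, neighbouring entries of pages[] may belong to
                   * different compound pages.
                   */
                  head = compound_head(pages[i]);
                  if (head == prev_head)
                          continue;       /* same compound page as before */
                  prev_head = head;
                  /* ... isolate 'head' for migration as before ... */
          }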
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-3-pasha.tatashin@soleen.com
      Fixes: aa712399 ("mm/gup: speed up check_and_migrate_cma_pages() on huge page")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reported-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83c02c23
    • mm/gup: don't pin migrated cma pages in movable zone · c991ffef
      Pavel Tatashin authored
      Patch series "prohibit pinning pages in ZONE_MOVABLE", v11.
      
      When a page is pinned it cannot be moved and its physical address stays
      the same until the page is unpinned.

      This functionality is useful because it allows userland to implement
      DMA access.  For example, it is used by vfio in vfio_pin_pages().
      
      However, this functionality breaks memory hotplug/hotremove assumptions
      that pages in ZONE_MOVABLE can always be migrated.
      
      This patch series fixes the issue by forcing new allocations during
      page pinning to avoid ZONE_MOVABLE, and by migrating any existing
      pages out of ZONE_MOVABLE during pinning.

      It uses the same logic that is currently used by CMA, and extends the
      functionality to all allocations.
      
      For more information read the discussion [1] about this problem.
      [1] https://lore.kernel.org/lkml/CA+CK2bBffHBxjmb9jmSKacm0fJMinyt3Nhk8Nx6iudcQSj80_w@mail.gmail.com
      
      This patch (of 14):
      
      In order not to fragment CMA, the pinned pages are migrated.  However,
      they are migrated to ZONE_MOVABLE, which also should not contain pinned
      pages.

      Remove __GFP_MOVABLE so that pages can be migrated to zones where
      pinning is allowed.
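
      Concretely, for the migration-target allocation this boils down to
      dropping __GFP_MOVABLE from the gfp mask used in mm/gup.c, roughly (a
      sketch of the idea; the exact flag bases in the patch may differ, and
      the THP and hugetlb allocation paths are adjusted similarly):

          -       gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN;
          +       gfp_t gfp_mask = GFP_USER | __GFP_NOWARN;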
      
      Link: https://lkml.kernel.org/r/20210215161349.246722-1-pasha.tatashin@soleen.com
      Link: https://lkml.kernel.org/r/20210215161349.246722-2-pasha.tatashin@soleen.com
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c991ffef
  18. 30 Apr, 2021 4 commits
  19. 09 Apr, 2021 1 commit
    • mm/gup: check page poison status for coredump · d3378e86
      Aili Yao authored
      When we do a coredump for a user process signal, it may be a SIGBUS
      signal with BUS_MCEERR_AR or BUS_MCEERR_AO code, which means the signal
      resulted from an ECC memory failure such as SRAR or SRAO.  We expect
      the memory recovery work to have finished correctly, in which case
      get_dump_page() will not return the error page, because its pte in the
      process was invalidated by memory_failure().

      But memory_failure() may fail, and then the process's related pte may
      not be correctly invalidated.  With the current code we return the
      poisoned page, it gets dumped, and the system then panics because the
      poisoned page is touched from kernel code.

      So check the poison status in get_dump_page(), and if the page is
      poisoned, return NULL.

      There may be other scenarios where it is also better to check the
      poison status and not panic, so make a wrapper for this check.  Thanks
      to David (<david@redhat.com>) for the suggestion.
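
      The wrapper and its caller end up looking roughly like this (a sketch
      of the approach, with the helper in mm/internal.h; simplified, not the
      literal diff):

          static inline bool is_page_poisoned(struct page *page)
          {
                  if (PageHWPoison(page))
                          return true;
                  else if (PageHuge(page) && PageHWPoison(compound_head(page)))
                          return true;
                  return false;
          }

      and in get_dump_page(), where ret and page come from the gup call:

          /* never hand a poisoned page to the coredump writer */
          if (ret == 1 && is_page_poisoned(page))
                  return NULL;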
      
      [akpm@linux-foundation.org: s/0/false/]
      [yaoaili@kingsoft.com: is_page_poisoned() arg cannot be null, per Matthew]
      
      Link: https://lkml.kernel.org/r/20210322115233.05e4e82a@alex-virtual-machine
      Link: https://lkml.kernel.org/r/20210319104437.6f30e80d@alex-virtual-machine
      
      Signed-off-by: Aili Yao <yaoaili@kingsoft.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3378e86