- 14 Aug, 2021 1 commit
-
-
David Hildenbrand authored
Doing some extended tests and polishing the man page update for MADV_POPULATE_(READ|WRITE), I realized that we also end up converting SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another madvise() user error. We want to report only problematic mappings and permission problems that the user could have known about as -EINVAL.

Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL, but instead indicate -EFAULT to user space. While we could also convert it to -ENOMEM, -EFAULT looks more helpful when user space wants to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is not yet part of a final Linux release, so we can still adjust the behavior.

Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com Fixes: 4ca9b385 ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc...
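A minimal userspace sketch of the resulting error semantics (my illustration, not part of the commit; the MADV_POPULATE_WRITE constant may need to be defined by hand on pre-5.14 headers):

    #include <sys/mman.h>
    #include <errno.h>
    #include <stdio.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23	/* uapi value, Linux 5.14+ */
    #endif

    static int prefault_writable(void *addr, size_t len)
    {
        if (!madvise(addr, len, MADV_POPULATE_WRITE))
            return 0;

        switch (errno) {
        case EFAULT:	/* populating would have raised SIGBUS/SIGSEGV */
            fprintf(stderr, "backing memory inaccessible\n");
            break;
        case EINVAL:	/* problematic mapping (e.g., VM_PFNMAP/VM_IO) or permissions */
            fprintf(stderr, "mapping/permissions not suitable\n");
            break;
        case ENOMEM:	/* range (partially) unmapped, or allocation failed */
            fprintf(stderr, "unmapped range or out of memory\n");
            break;
        }
        return -1;
    }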
-
- 01 Jul, 2021 1 commit
-
-
David Hildenbrand authored
I. Background: Sparse Memory Mappings

When we manage sparse memory mappings dynamically in user space - also sometimes involving MAP_NORESERVE - we want to dynamically populate/discard memory inside such a sparse memory region. Example users are hypervisors (especially ones implementing memory ballooning or similar technologies like virtio-mem) and memory allocators. In addition, we want to fail in a nice way (instead of generating SIGBUS) if populating does not succeed because we are out of backend memory (which can happen easily with file-based mappings, especially tmpfs and hugetlbfs).

While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for reliably discarding memory for most mapping types, there is no generic approach to populate page tables and preallocate memory.

Although mmap() supports MAP_POPULATE, it is not applicable to the concept of sparse memory mappings, where we want to populate/discard dynamically and avoid expensive/problematic remappings. In addition, we never actually report errors during the final populate phase - it is best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a safe way. However, it cannot really be used for any private mappings on anonymous files via memfd due to COW semantics. In addition, fallocate() does not actually populate page tables, so we still always get pagefaults on first access - which is sometimes undesired (i.e., real-time workloads) and requires real prefaulting of page tables, not just a preallocation of backend storage. There might be interesting use cases for sparse memory regions along with mlockall(MCL_ONFAULT), which fallocate() cannot satisfy as it does not prefault page tables.

II. On preallocation/prefaulting from user space

Because we don't have a proper interface, what applications (like QEMU and databases) end up doing is touching (i.e., reading+writing one byte to not overwrite existing data) all individual pages. However, that approach

1) Can result in wear on storage backing, because we end up reading/writing each page; this is especially a problem for dax/pmem.
2) Can result in mmap_sem contention when prefaulting via multiple threads.
3) Requires expensive signal handling, especially to catch SIGBUS in case of hugetlbfs/shmem/file-backed memory. For example, this is problematic in hypervisors like QEMU where SIGBUS handlers might already be used by other subsystems concurrently to, e.g., handle hardware errors. "Simply" doing preallocation concurrently from another thread is not that easy.

III. On MADV_WILLNEED

Extending MADV_WILLNEED is not an option because

1. It would change the semantics: "Expect access in the near future." and "might be a good idea to read some pages" vs. "Definitely populate/preallocate all memory and definitely fail on errors.".
2. Existing users (like virtio-balloon in QEMU when deflating the balloon) don't want populate/prealloc semantics. They treat this rather as a hint to give a little performance boost without too much overhead - and don't expect that a lot of memory might get consumed or a lot of time might be spent.

IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by MAP_POPULATE, with the following semantics:

1. MADV_POPULATE_READ can be used to prefault page tables just like manually reading each individual page. This will not break any COW mappings. The shared zero page might get mapped and no backend storage might get preallocated -- allocation might be deferred to write-fault time. Especially shared file mappings require an explicit fallocate() upfront to actually preallocate backend memory (blocks in the file system) in case the file might have holes.
2. If MADV_POPULATE_READ succeeds, all page tables have been populated (prefaulted) readable once.
3. MADV_POPULATE_WRITE can be used to preallocate backend memory and prefault page tables just like manually writing (or reading+writing) each individual page. This will break any COW mappings -- e.g., the shared zeropage is never populated.
4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated (prefaulted) writable once.
5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special mappings marked with VM_PFNMAP and VM_IO. Also, proper access permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such mapping is encountered, madvise() fails with -EINVAL.
6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables might have been populated.
7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON when encountering a HW poisoned page in the range.
8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot protect from the OOM (Out Of Memory) handler killing the process.

While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e., preallocate memory and prefault page tables for VMs), one issue is that whenever we prefault pages writable, the pages have to be marked dirty, because the CPU could dirty them any time. While not a real problem for hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each page will be marked dirty and has to be written back later when evicting.

MADV_POPULATE_READ allows for optimizing this scenario: pre-read a whole mapping from backend storage without marking it dirty, such that eviction won't have to write it back. As discussed above, shared file mappings might require an explicit fallocate() upfront to achieve preallocation+prepopulation.

Although sparse memory mappings are the primary use case, this will also be useful for other preallocate/prefault use cases where MAP_POPULATE is not desired or the semantics of MAP_POPULATE are not sufficient: as one example, QEMU users can trigger preallocation/prefaulting of guest RAM after the mapping was created -- and don't want errors to be silently suppressed.

Looking at the history, MADV_POPULATE was already proposed in 2013 [1]; however, the main motivation back then was performance improvements -- which should also still be the case.

V. Single-threaded performance comparison

I did a short experiment, prefaulting page tables on completely *empty mappings/files*, and repeated the experiment 10 times. The results correspond to the shortest execution time. In general, the performance benefit for huge pages is negligible with small mappings.

V.1: Private mappings

POPULATE_READ and POPULATE_WRITE are fastest. Note that Reading/POPULATE_READ will populate the shared zeropage where applicable -- which results in short population times. The fastest way to allocate backend storage (here: swap or huge pages) and prefault page tables is POPULATE_WRITE.

V.2: Shared mappings

fallocate() is fastest; however, it doesn't prefault page tables. POPULATE_WRITE is faster than simple writes and read/writes. POPULATE_READ is faster than simple reads. Without an fd, the fastest way to allocate backend storage and prefault page tables is POPULATE_WRITE.
With an fd, the fastest way is usually FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one exception are actual files: FALLOCATE+Read is slightly faster than FALLOCATE+POPULATE_READ. The fastest way to allocate backend storage and prefault page tables is FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then, FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as dirty.

V.3: Detailed results

==================================================
2 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB   : Read                     : 0.119 ms
Anon 4 KiB   : Write                    : 0.222 ms
Anon 4 KiB   : Read/Write               : 0.380 ms
Anon 4 KiB   : POPULATE_READ            : 0.060 ms
Anon 4 KiB   : POPULATE_WRITE           : 0.158 ms
Memfd 4 KiB  : Read                     : 0.034 ms
Memfd 4 KiB  : Write                    : 0.310 ms
Memfd 4 KiB  : Read/Write               : 0.362 ms
Memfd 4 KiB  : POPULATE_READ            : 0.039 ms
Memfd 4 KiB  : POPULATE_WRITE           : 0.229 ms
Memfd 2 MiB  : Read                     : 0.030 ms
Memfd 2 MiB  : Write                    : 0.030 ms
Memfd 2 MiB  : Read/Write               : 0.030 ms
Memfd 2 MiB  : POPULATE_READ            : 0.030 ms
Memfd 2 MiB  : POPULATE_WRITE           : 0.030 ms
tmpfs        : Read                     : 0.033 ms
tmpfs        : Write                    : 0.313 ms
tmpfs        : Read/Write               : 0.406 ms
tmpfs        : POPULATE_READ            : 0.039 ms
tmpfs        : POPULATE_WRITE           : 0.285 ms
file         : Read                     : 0.033 ms
file         : Write                    : 0.351 ms
file         : Read/Write               : 0.408 ms
file         : POPULATE_READ            : 0.039 ms
file         : POPULATE_WRITE           : 0.290 ms
hugetlbfs    : Read                     : 0.030 ms
hugetlbfs    : Write                    : 0.030 ms
hugetlbfs    : Read/Write               : 0.030 ms
hugetlbfs    : POPULATE_READ            : 0.030 ms
hugetlbfs    : POPULATE_WRITE           : 0.030 ms
**************************************************
4096 MiB MAP_PRIVATE:
**************************************************
Anon 4 KiB   : Read                     : 237.940 ms
Anon 4 KiB   : Write                    : 708.409 ms
Anon 4 KiB   : Read/Write               : 1054.041 ms
Anon 4 KiB   : POPULATE_READ            : 124.310 ms
Anon 4 KiB   : POPULATE_WRITE           : 572.582 ms
Memfd 4 KiB  : Read                     : 136.928 ms
Memfd 4 KiB  : Write                    : 963.898 ms
Memfd 4 KiB  : Read/Write               : 1106.561 ms
Memfd 4 KiB  : POPULATE_READ            : 78.450 ms
Memfd 4 KiB  : POPULATE_WRITE           : 805.881 ms
Memfd 2 MiB  : Read                     : 357.116 ms
Memfd 2 MiB  : Write                    : 357.210 ms
Memfd 2 MiB  : Read/Write               : 357.606 ms
Memfd 2 MiB  : POPULATE_READ            : 356.094 ms
Memfd 2 MiB  : POPULATE_WRITE           : 356.937 ms
tmpfs        : Read                     : 137.536 ms
tmpfs        : Write                    : 954.362 ms
tmpfs        : Read/Write               : 1105.954 ms
tmpfs        : POPULATE_READ            : 80.289 ms
tmpfs        : POPULATE_WRITE           : 822.826 ms
file         : Read                     : 137.874 ms
file         : Write                    : 987.025 ms
file         : Read/Write               : 1107.439 ms
file         : POPULATE_READ            : 80.413 ms
file         : POPULATE_WRITE           : 857.622 ms
hugetlbfs    : Read                     : 355.607 ms
hugetlbfs    : Write                    : 355.729 ms
hugetlbfs    : Read/Write               : 356.127 ms
hugetlbfs    : POPULATE_READ            : 354.585 ms
hugetlbfs    : POPULATE_WRITE           : 355.138 ms
**************************************************
2 MiB MAP_SHARED:
**************************************************
Anon 4 KiB   : Read                     : 0.394 ms
Anon 4 KiB   : Write                    : 0.348 ms
Anon 4 KiB   : Read/Write               : 0.400 ms
Anon 4 KiB   : POPULATE_READ            : 0.326 ms
Anon 4 KiB   : POPULATE_WRITE           : 0.273 ms
Anon 2 MiB   : Read                     : 0.030 ms
Anon 2 MiB   : Write                    : 0.030 ms
Anon 2 MiB   : Read/Write               : 0.030 ms
Anon 2 MiB   : POPULATE_READ            : 0.030 ms
Anon 2 MiB   : POPULATE_WRITE           : 0.030 ms
Memfd 4 KiB  : Read                     : 0.412 ms
Memfd 4 KiB  : Write                    : 0.372 ms
Memfd 4 KiB  : Read/Write               : 0.419 ms
Memfd 4 KiB  : POPULATE_READ            : 0.343 ms
Memfd 4 KiB  : POPULATE_WRITE           : 0.288 ms
Memfd 4 KiB  : FALLOCATE                : 0.137 ms
Memfd 4 KiB  : FALLOCATE+Read           : 0.446 ms
Memfd 4 KiB  : FALLOCATE+Write          : 0.330 ms
Memfd 4 KiB  : FALLOCATE+Read/Write     : 0.454 ms
Memfd 4 KiB  : FALLOCATE+POPULATE_READ  : 0.379 ms
Memfd 4 KiB  : FALLOCATE+POPULATE_WRITE : 0.268 ms
Memfd 2 MiB  : Read                     : 0.030 ms
Memfd 2 MiB  : Write                    : 0.030 ms
Memfd 2 MiB  : Read/Write               : 0.030 ms
Memfd 2 MiB  : POPULATE_READ            : 0.030 ms
Memfd 2 MiB  : POPULATE_WRITE           : 0.030 ms
Memfd 2 MiB  : FALLOCATE                : 0.030 ms
Memfd 2 MiB  : FALLOCATE+Read           : 0.031 ms
Memfd 2 MiB  : FALLOCATE+Write          : 0.031 ms
Memfd 2 MiB  : FALLOCATE+Read/Write     : 0.031 ms
Memfd 2 MiB  : FALLOCATE+POPULATE_READ  : 0.030 ms
Memfd 2 MiB  : FALLOCATE+POPULATE_WRITE : 0.030 ms
tmpfs        : Read                     : 0.416 ms
tmpfs        : Write                    : 0.369 ms
tmpfs        : Read/Write               : 0.425 ms
tmpfs        : POPULATE_READ            : 0.346 ms
tmpfs        : POPULATE_WRITE           : 0.295 ms
tmpfs        : FALLOCATE                : 0.139 ms
tmpfs        : FALLOCATE+Read           : 0.447 ms
tmpfs        : FALLOCATE+Write          : 0.333 ms
tmpfs        : FALLOCATE+Read/Write     : 0.454 ms
tmpfs        : FALLOCATE+POPULATE_READ  : 0.380 ms
tmpfs        : FALLOCATE+POPULATE_WRITE : 0.272 ms
file         : Read                     : 0.191 ms
file         : Write                    : 0.511 ms
file         : Read/Write               : 0.524 ms
file         : POPULATE_READ            : 0.196 ms
file         : POPULATE_WRITE           : 0.434 ms
file         : FALLOCATE                : 0.004 ms
file         : FALLOCATE+Read           : 0.197 ms
file         : FALLOCATE+Write          : 0.554 ms
file         : FALLOCATE+Read/Write     : 0.480 ms
file         : FALLOCATE+POPULATE_READ  : 0.201 ms
file         : FALLOCATE+POPULATE_WRITE : 0.381 ms
hugetlbfs    : Read                     : 0.030 ms
hugetlbfs    : Write                    : 0.030 ms
hugetlbfs    : Read/Write               : 0.030 ms
hugetlbfs    : POPULATE_READ            : 0.030 ms
hugetlbfs    : POPULATE_WRITE           : 0.030 ms
hugetlbfs    : FALLOCATE                : 0.030 ms
hugetlbfs    : FALLOCATE+Read           : 0.031 ms
hugetlbfs    : FALLOCATE+Write          : 0.031 ms
hugetlbfs    : FALLOCATE+Read/Write     : 0.030 ms
hugetlbfs    : FALLOCATE+POPULATE_READ  : 0.030 ms
hugetlbfs    : FALLOCATE+POPULATE_WRITE : 0.030 ms
**************************************************
4096 MiB MAP_SHARED:
**************************************************
Anon 4 KiB   : Read                     : 1053.090 ms
Anon 4 KiB   : Write                    : 913.642 ms
Anon 4 KiB   : Read/Write               : 1060.350 ms
Anon 4 KiB   : POPULATE_READ            : 893.691 ms
Anon 4 KiB   : POPULATE_WRITE           : 782.885 ms
Anon 2 MiB   : Read                     : 358.553 ms
Anon 2 MiB   : Write                    : 358.419 ms
Anon 2 MiB   : Read/Write               : 357.992 ms
Anon 2 MiB   : POPULATE_READ            : 357.533 ms
Anon 2 MiB   : POPULATE_WRITE           : 357.808 ms
Memfd 4 KiB  : Read                     : 1078.144 ms
Memfd 4 KiB  : Write                    : 942.036 ms
Memfd 4 KiB  : Read/Write               : 1100.391 ms
Memfd 4 KiB  : POPULATE_READ            : 925.829 ms
Memfd 4 KiB  : POPULATE_WRITE           : 804.394 ms
Memfd 4 KiB  : FALLOCATE                : 304.632 ms
Memfd 4 KiB  : FALLOCATE+Read           : 1163.359 ms
Memfd 4 KiB  : FALLOCATE+Write          : 933.186 ms
Memfd 4 KiB  : FALLOCATE+Read/Write     : 1187.304 ms
Memfd 4 KiB  : FALLOCATE+POPULATE_READ  : 1013.660 ms
Memfd 4 KiB  : FALLOCATE+POPULATE_WRITE : 794.560 ms
Memfd 2 MiB  : Read                     : 358.131 ms
Memfd 2 MiB  : Write                    : 358.099 ms
Memfd 2 MiB  : Read/Write               : 358.250 ms
Memfd 2 MiB  : POPULATE_READ            : 357.563 ms
Memfd 2 MiB  : POPULATE_WRITE           : 357.334 ms
Memfd 2 MiB  : FALLOCATE                : 356.735 ms
Memfd 2 MiB  : FALLOCATE+Read           : 358.152 ms
Memfd 2 MiB  : FALLOCATE+Write          : 358.331 ms
Memfd 2 MiB  : FALLOCATE+Read/Write     : 358.018 ms
Memfd 2 MiB  : FALLOCATE+POPULATE_READ  : 357.286 ms
Memfd 2 MiB  : FALLOCATE+POPULATE_WRITE : 357.523 ms
tmpfs        : Read                     : 1087.265 ms
tmpfs        : Write                    : 950.840 ms
tmpfs        : Read/Write               : 1107.567 ms
tmpfs        : POPULATE_READ            : 922.605 ms
tmpfs        : POPULATE_WRITE           : 810.094 ms
tmpfs        : FALLOCATE                : 306.320 ms
tmpfs        : FALLOCATE+Read           : 1169.796 ms
tmpfs        : FALLOCATE+Write          : 933.730 ms
tmpfs        : FALLOCATE+Read/Write     : 1191.610 ms
tmpfs        : FALLOCATE+POPULATE_READ  : 1020.474 ms
tmpfs        : FALLOCATE+POPULATE_WRITE : 798.945 ms
file         : Read                     : 654.101 ms
file         : Write                    : 1259.142 ms
file         : Read/Write               : 1289.509 ms
file         : POPULATE_READ            : 661.642 ms
file         : POPULATE_WRITE           : 1106.816 ms
file         : FALLOCATE                : 1.864 ms
file         : FALLOCATE+Read           : 656.328 ms
file         : FALLOCATE+Write          : 1153.300 ms
file         : FALLOCATE+Read/Write     : 1180.613 ms
file         : FALLOCATE+POPULATE_READ  : 668.347 ms
file         : FALLOCATE+POPULATE_WRITE : 996.143 ms
hugetlbfs    : Read                     : 357.245 ms
hugetlbfs    : Write                    : 357.413 ms
hugetlbfs    : Read/Write               : 357.120 ms
hugetlbfs    : POPULATE_READ            : 356.321 ms
hugetlbfs    : POPULATE_WRITE           : 356.693 ms
hugetlbfs    : FALLOCATE                : 355.927 ms
hugetlbfs    : FALLOCATE+Read           : 357.074 ms
hugetlbfs    : FALLOCATE+Write          : 357.120 ms
hugetlbfs    : FALLOCATE+Read/Write     : 356.983 ms
hugetlbfs    : FALLOCATE+POPULATE_READ  : 356.413 ms
hugetlbfs    : FALLOCATE+POPULATE_WRITE : 356.266 ms
**************************************************

[1] https://lkml.org/lkml/2013/6/27/698

[akpm@linux-foundation.org: coding style fixes] Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com Signed-off-by:
David Hildenbrand <david@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Helge Deller <deller@gmx.de> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
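A self-contained userspace sketch of the sparse-mapping use case described above (my illustration under the stated semantics, not code from the commit; the constants are defined by hand for pre-5.14 headers):

    #include <sys/mman.h>
    #include <stdio.h>

    #ifndef MADV_POPULATE_READ
    #define MADV_POPULATE_READ  22
    #define MADV_POPULATE_WRITE 23
    #endif

    int main(void)
    {
        const size_t region = 1ULL << 30;   /* 1 GiB sparse region */
        const size_t chunk  = 2UL << 20;    /* dynamically populate 2 MiB of it */

        char *p = mmap(NULL, region, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Preallocate + prefault writable; fails with an error instead of SIGBUS. */
        if (madvise(p, chunk, MADV_POPULATE_WRITE))
            perror("MADV_POPULATE_WRITE");

        /* ... use p[0..chunk) without taking write faults ... */

        /* Discard the chunk again; the rest of the region stays sparse. */
        if (madvise(p, chunk, MADV_DONTNEED))
            perror("MADV_DONTNEED");

        munmap(p, region);
        return 0;
    }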
-
- 07 May, 2021 1 commit
-
-
Ingo Molnar authored
Fix ~94 single-word typos in locking code comments, plus a few very obvious grammar mistakes. Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com Signed-off-by:
Ingo Molnar <mingo@kernel.org> Reviewed-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by:
Randy Dunlap <rdunlap@infradead.org> Cc: Bhaskar Chowdhury <unixbhaskar@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 Mar, 2021 1 commit
-
-
Suren Baghdasaryan authored
process_madvise currently requires ptrace attach capability. PTRACE_MODE_ATTACH gives one process complete control over another process. It effectively removes the security boundary between the two processes (in one direction). Granting ptrace attach capability even to a system process is considered dangerous since it creates an attack surface. This severely limits the usage of this API.

The operations process_madvise can perform do not affect the correctness of the operation of the target process; they only affect where the data is physically located (and therefore, how fast it can be accessed). What we want is the ability for one process to influence another process in order to optimize performance across the entire system while leaving the security boundary intact.

Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and CAP_SYS_NICE. PTRACE_MODE_READ prevents leaking ASLR metadata, and CAP_SYS_NICE is required for influencing process performance.

Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com Signed-off-by:
Suren Baghdasaryan <surenb@google.com> Reviewed-by:
Kees Cook <keescook@chromium.org> Acked-by:
Minchan Kim <minchan@kernel.org> Acked-by:
David Rientjes <rientjes@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: <stable@vger.kernel.org> [5.10+] Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
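Roughly, the new permission model looks like the following kernel-side fragment (a paraphrase of the check described above, not the verbatim patch; label and error-path names are illustrative):

    /* Read-only access to the target's mm is enough: no ASLR leak beyond
     * what PTRACE_MODE_READ already permits. */
    mm = mm_access(task, PTRACE_MODE_READ_FSCREDS);
    if (IS_ERR_OR_NULL(mm)) {
        ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
        goto release_task;
    }

    /* Require CAP_SYS_NICE for influencing process performance. */
    if (!capable(CAP_SYS_NICE)) {
        ret = -EPERM;
        goto release_mm;
    }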
-
- 29 Jan, 2021 2 commits
-
-
Will Deacon authored
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by:
Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by:
Will Deacon <will@kernel.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Yu Zhao <yuzhao@google.com> Acked-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com
-
Will Deacon authored
Since commit 7a30df49 ("mm: mmu_gather: remove __tlb_reset_range() for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu() are no longer used, since we flush the whole mm in case of a nested invalidation. Remove the unused arguments and update all callers. Signed-off-by:
Will Deacon <will@kernel.org> Signed-off-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by:
Yu Zhao <yuzhao@google.com> Acked-by:
Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by:
Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org
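A sketch of what a caller looks like after these two cleanups (my illustration; unmap_vmas() stands in for whatever gathers the pages, and full-mm teardown uses the dedicated tlb_gather_mmu_fullmm() variant mentioned above):

    struct mmu_gather tlb;

    tlb_gather_mmu(&tlb, mm);              /* was: tlb_gather_mmu(&tlb, mm, start, end) */
    unmap_vmas(&tlb, vma, start, end);     /* the gather tracks the flushed range itself */
    tlb_finish_mmu(&tlb);                  /* was: tlb_finish_mmu(&tlb, start, end) */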
-
- 24 Jan, 2021 2 commits
-
-
Christian Brauner authored
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount, it is mapped according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed, nothing changes, so non-idmapped mounts will see identical behavior as before.

Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed, nothing changes, so non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> ...
-
Christian Brauner authored
Add two simple helpers to check permissions on a file and path respectively and convert over some callers. It simplifies quite a few codepaths and also reduces the churn in later patches quite a bit. Christoph also correctly points out that this makes codepaths (e.g. ioctls) way easier to follow that would otherwise have to do more complex argument passing than necessary. Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
James Morris <jamorris@linux.microsoft.com> Signed-off-by:
Christian Brauner <christian.brauner@ubuntu.com>
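A sketch of how the two helpers get used at a call site (my illustration of the described API; the helpers wrap the usual inode permission check for the "open file" and "path" cases):

    int err;

    /* ioctl-style code path: may the caller read from this open file? */
    err = file_permission(file, MAY_READ);
    if (err)
        return err;

    /* path-based code path: may the caller write to the path's inode? */
    err = path_permission(&path, MAY_WRITE);
    if (err)
        return err;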
-
- 15 Dec, 2020 2 commits
-
-
Oscar Salvador authored
madvise_inject_error() uses get_user_pages_fast to translate the address we specified to a page. After [1], we drop the extra reference count for the memory_failure() path. That commit says that memory_failure wanted to keep the pin in order to take the page out of circulation.

The truth is that we need to keep the page pinned, otherwise the page might be re-used after the put_page() and we can end up messing with someone else's memory. E.g.:

    CPU0 (process X)               CPU1 (process Y)
    madvise_inject_error
     get_user_pages
      put_page
                                   page gets reclaimed
                                   process Y allocates the page
     memory_failure
      // We mess with process Y memory

madvise() is meant to operate on the caller's own address space, so messing with pages that do not belong to us seems the wrong thing to do. To avoid that, let us keep the page pinned for memory_failure as well. Pages for DAX mappings will release this extra refcount in memory_failure_dev_pagemap.

[1] ("23e7b5c2: mm, madvise_inject_error: Let memory_failure() optionally take a page reference")

Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de Fixes: 23e7b5c2 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference") Signed-off-by:
Oscar Salvador <osalvador@suse.de> Suggested-by:
Vlastimil Babka <vbabka@suse.cz> Acked-by:
Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Oscar Salvador authored
The memory_failure and soft_offline paths now drain pcplists by calling get_hwpoison_page. memory_failure flags the page as HWPoison beforehand, so that page can no longer go into a pcplist, and soft_offline_page only flags a page as HWPoison if 1) we took the page off a buddy freelist, 2) the page was in-use and we migrated it, or 3) it was a clean pagecache page. Because of that, a page can no longer be poisoned and sit in a pcplist at the same time.

Link: https://lkml.kernel.org/r/20201013144447.6706-5-osalvador@suse.de Signed-off-by:
Oscar Salvador <osalvador@suse.de> Acked-by:
Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 09 Dec, 2020 1 commit
-
-
Minchan Kim authored
Jann spotted a security hole due to a race in the mm ownership check. If the task is sharing the mm_struct but goes through execve() before mm_access(), it could skip the process_madvise_behavior_valid check. That allows *any* advice hint to reach into the remote process.

This patch removes the mm ownership check. With it, we lose the ability for a local process to give *any* advice hint via the vector interface for some reason (e.g., performance). Since there is no concrete example in upstream yet, it is better to remove the ability at this moment and review it again when such new advice comes up.

Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API") Reported-by:
Jann Horn <jannh@google.com> Suggested-by:
Jann Horn <jannh@google.com> Signed-off-by:
Minchan Kim <minchan@kernel.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 22 Nov, 2020 2 commits
-
-
Matthew Wilcox (Oracle) authored
The calculation of the end page index was incorrect, leading to a regression of 70% when running stress-ng. With this fix, we instead see a performance improvement of 3%. Fixes: e6e88712 ("mm: optimise madvise WILLNEED") Reported-by:
kernel test robot <rong.a.chen@intel.com> Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Tested-by:
Xing Zhengjun <zhengjun.xing@linux.intel.com> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Feng Tang <feng.tang@intel.com> Cc: "Chen, Rong A" <rong.a.chen@intel.com> Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Eric Dumazet authored
The early return in process_madvise() will produce a memory leak. Fix it. Fixes: ecb8ac8b ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API") Signed-off-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Minchan Kim <minchan@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/20201116155132.GA3805951@google.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 18 Oct, 2020 2 commits
-
-
Minchan Kim authored
There is a use case where system management software (SMS) wants to give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case of Android, it is the ActivityManagerService.

The information required to make the reclaim decision is not known to the app. Instead, it is known to the centralized userspace daemon (ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement.

To solve the issue, this patch introduces a new syscall, process_madvise(2). It uses the pidfd of an external process to give the hint. It also supports vectored address ranges, because an Android app has thousands of vmas due to zygote, so it is a total waste of CPU and power if we have to call the syscall one by one for each vma. (Testing a 2000-vma syscall vs a 1-vector syscall showed a 15% performance improvement. I think it would be bigger in real practice because the test ran in a very cache-friendly environment.)

Another potential use case for the vector range is to amortize the cost of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could benefit users like TCP receive zerocopy and malloc implementations. In the future, we could find more use cases for other advice, so let's make it happen as an API, since we introduce a new syscall at this moment. With that, existing madvise(2) users could replace it with process_madvise(2) with their own pid if they want batch address range support.

Since it could affect another process's address range, only a privileged process (PTRACE_MODE_ATTACH_FSCREDS), or something else (e.g., being the same UID) that gives it the right to ptrace the process, could use it successfully. The flag argument is reserved for future use if we need to extend the API.

I think supporting every hint madvise has/will support for process_madvise is rather risky: we are not sure all hints make sense from an external process, and the implementation of a hint may rely on the caller being in the current context, so it could be error-prone. Thus, I just limited the hints to MADV_[COLD|PAGEOUT] in this patch. If someone wants to add other hints, we can hear the use case and review it for each hint. That is safer for maintenance than introducing a buggy syscall that is hard to fix later.

So finally, the API is as follows:

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                            unsigned long vlen, int advice,
                            unsigned int flags);

DESCRIPTION

The process_madvise() system call is used to give advice or directions to the kernel about the address ranges of an external process as well as the local process. It provides the advice for the address ranges of the process described by iovec and vlen. The goal of such advice is to improve system or application performance.

The pidfd selects the process referred to by the PID file descriptor specified in pidfd. (See pidfd_open(2) for further information.)

The pointer iovec points to an array of iovec structures, defined in <sys/uio.h> as:

    struct iovec {
        void   *iov_base;  /* starting address */
        size_t  iov_len;   /* number of bytes to be advised */
    };

The iovec describes address ranges beginning at iov_base and with a size of iov_len bytes. The vlen represents the number of elements in iovec.

The advice is indicated in the advice argument, which is one of the following at this moment if the target process specified by pidfd is external:

    MADV_COLD
    MADV_PAGEOUT

Permission to provide a hint to an external process is governed by a ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

process_madvise supports every advice madvise(2) has if the target process is in the same thread group as the calling process, so users can use process_madvise(2) to extend existing madvise(2) with support for vectored address ranges.

RETURN VALUE

On success, process_madvise() returns the number of bytes advised. This return value may be less than the total number of requested bytes if an error occurred. The caller should check the return value to determine whether a partial advice occurred.

FAQ:

Q.1 - Why does any external entity have better knowledge?

Quote from Sandeep: "For Android, every application (including the special SystemServer) are forked from Zygote. The reason of course is to share as many libraries and classes between the two as possible to benefit from the preloading during boot. After applications start, (almost) all of the APIs end up calling into this SystemServer process over IPC (binder) and back to the application. In a fully running system, the SystemServer monitors every single process periodically to calculate their PSS / RSS and also decides which process is "important" to the user for interactivity. So, because of how these processes start _and_ the fact that the SystemServer is looping to monitor each process, it does tend to *know* which address range of the application is not used / useful. Besides, we can never rely on applications to clean things up themselves. We've had the "hey app1, the system is low on memory, please trim your memory usage down" notifications for a long time[1]. They rely on applications honoring the broadcasts and very few do. So, if we want to avoid the inevitable killing of the application and restarting it, some way to be able to tell the OS about unimportant memory in these applications will be useful. - ssp

Q.2 - How to handle the race (i.e., object validation) between giving a hint from an external process and the target process changing its address space?

process_madvise operates on the target process's address space as it exists at the instant that process_madvise is called. If the target process can run between the time the process_madvise caller inspects the target process's address space and the time that process_madvise is actually called, process_madvise may operate on memory regions that the calling process does not expect. It's the responsibility of the process calling process_madvise to close this race condition. For example, the calling process can suspend the target process with ptrace, SIGSTOP, or the freezer cgroup so that it doesn't have an opportunity to change its own address space before process_madvise is called. Another option is to operate on memory regions that the caller knows a priori will be unchanged in the target process. Yet another option is to accept the race for certain process_madvise calls after reasoning that mistargeting will do no harm. The suggested API itself does not provide synchronization. The same applies to other APIs like move_pages and process_vm_writev.

The race isn't really a problem, though. Why is it so wrong to require that callers do their own synchronization in some manner? Nobody objects to write(2) merely because it's possible for two processes to open the same file and clobber each other's writes --- instead, we tell people to use flock or something. Think about mmap: it never guarantees newly allocated address space is still valid when the user tries to access it, because other threads could unmap the memory right before. That's where we need synchronization via another API or a design on the user's side; it shouldn't be part of the API itself. If someone needs more fine-grained synchronization than process level, there were two ideas suggested - cookie[2] and anon-fd[3]. Both are applicable via the last reserved argument of the API, but I don't think it's necessary right now, since we already have ways to prevent the race, and I don't want to add additional complexity for a more fine-grained optimization model. To keep the API extensible, it reserved an unsigned long as the last argument, so we could support this in the future if someone really needs it.

Q.3 - Why doesn't ptrace work?

Injecting an madvise into the target process using ptrace would not work for us, because such an injected madvise would have to be executed by the target process, which means that process would have to be runnable, and that creates the risk of the abovementioned race and hinting a wrong VMA. Furthermore, we want to apply the hint in the caller's context, not the callee's, because the callee is usually limited by cpuset/cgroups or even in a frozen state, so it can't act by itself quickly enough, which causes more thrashing/kills. It also doesn't work if the target process is ptraced (e.g., strace, debugger, minidump), because a process can have at most one ptracer.

[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever vma of process address layout are changed - Daniel Colascione - https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object (i.e., address range) validation - Michal Hocko - https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

[minchan@kernel.org: fix process_madvise build break for arm64] Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise] Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering] Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h] Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build] Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice] Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise] Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"] Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

Signed-off-by:
Minchan Kim <minchan@kernel.org> Signed-off-by:
YueHaibing <yuehaibing@huawei.com> Signed-off-by:
Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Suren Baghdasaryan <surenb@google.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Acked-by:
David Rientjes <rientjes@google.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian@brauner.io> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Dias <joaodias@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Sandeep Patil <sspatil@google.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
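A hedged userspace sketch of the API described above (my illustration; the syscall numbers and MADV_COLD fallbacks are the x86-64 values for older headers, and the target pid and addresses are placeholders):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include <stdio.h>

    #ifndef __NR_pidfd_open
    #define __NR_pidfd_open 434
    #endif
    #ifndef __NR_process_madvise
    #define __NR_process_madvise 440
    #endif
    #ifndef MADV_COLD
    #define MADV_COLD 20
    #endif

    int main(void)
    {
        pid_t target = 1234;    /* placeholder: pid of the managed process */

        int pidfd = syscall(__NR_pidfd_open, target, 0);
        if (pidfd < 0) {
            perror("pidfd_open");
            return 1;
        }

        /* Placeholder address ranges inside the *target* process. */
        struct iovec iov[2] = {
            { .iov_base = (void *)0x7f0000000000, .iov_len = 1 << 20 },
            { .iov_base = (void *)0x7f0000400000, .iov_len = 1 << 20 },
        };

        ssize_t advised = syscall(__NR_process_madvise, pidfd, iov, 2,
                                  MADV_COLD, 0);
        if (advised < 0)
            perror("process_madvise");
        else
            printf("advised %zd bytes\n", advised);

        close(pidfd);
        return 0;
    }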
-
Minchan Kim authored
Patch series "introduce memory hinting API for external process", v9. Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that, application could give hints to kernel what memory range are preferred to be reclaimed. However, in some platform(e.g., Android), the information required to make the hinting decision is not known to the app. Instead, it is known to a centralized userspace daemon(e.g., ActivityManagerService), and that daemon must be able to initiate reclaim on its own without any app involvement. To solve the concern, this patch introduces new syscall - process_madvise(2). Bascially, it's same with madvise(2) syscall but it has some differences. 1. It needs pidfd of target process to provide the hint 2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMEREABLE} at this moment. Other hints in madvise will be opened when there are explicit requests from community to prevent unexpected bugs we couldn't support. 3. Only privileged processes can do something for other process's address space. For more detail of the new API, please see "mm: introduce external memory hinting API" description in this patchset. This patch (of 3): In upcoming patches, do_madvise will be called from external process context so we shouldn't asssume "current" is always hinted process's task_struct. Furthermore, we must not access mm_struct via task->mm, but obtain it via access_mm() once (in the following patch) and only use that pointer [1], so pass it to do_madvise() as well. Note the vma->vm_mm pointers are safe, so we can use them further down the call stack. And let's pass current->mm as arguments of do_madvise so it shouldn't change existing behavior but prepare next patch to make review easy. [vbabka@suse.cz: changelog tweak] [minchan@kernel.org: use current->mm for io_uring] Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org [akpm@linux-foundation.org: fix it for upstream changes] [akpm@linux-foundation.org: whoops] [rdunlap@infradead.org: add missing includes] Signed-off-by:
Minchan Kim <minchan@kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Suren Baghdasaryan <surenb@google.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Acked-by:
David Rientjes <rientjes@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jann Horn <jannh@google.com> Cc: Tim Murray <timmurray@google.com> Cc: Daniel Colascione <dancol@google.com> Cc: Sandeep Patil <sspatil@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: John Dias <joaodias@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: SeongJae Park <sj38.park@gmail.com> Cc: Christian Brauner <christian@brauner.io> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: <linux-man@vger.kernel.org> Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 Oct, 2020 3 commits
-
-
Jann Horn authored
The preceding patches have ensured that core dumping properly takes the mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all its users. Signed-off-by:
Jann Horn <jannh@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Oscar Salvador authored
Currently, there is an inconsistency when calling soft-offline from different paths on a page that is already poisoned:

1) madvise: madvise_inject_error skips any poisoned page and continues the loop. If that was the only page to madvise, it returns 0.
2) /sys/devices/system/memory/: When calling soft_offline_page_store()->soft_offline_page(), we return -EBUSY in case the page is already poisoned. This is inconsistent with a) the above example and b) memory_failure, where we return 0 if the page was poisoned.

Fix this by dropping the PageHWPoison() check in madvise_inject_error, and let soft_offline_page return 0 if it finds the page already poisoned.

Please note that this represents a user-api change, since the error returned when calling soft_offline_page_store()->soft_offline_page() will now be different.

Signed-off-by:
Oscar Salvador <osalvador@suse.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Oscar Salvador authored
Make a proper if-else condition for {hard,soft}-offline. Signed-off-by:
Oscar Salvador <osalvador@suse.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Tony Luck <tony.luck@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Aristeu Rozanski <aris@ruivo.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Yakunin <zeil@yandex-team.ru> Cc: Mike Kravetz <mike.kravetz@oracle.com> Link: https://lkml.kernel.org/r/20200908075626.11976-3-osalvador@suse.de Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 14 Oct, 2020 1 commit
-
-
Matthew Wilcox (Oracle) authored
Instead of calling find_get_entry() for every page index, use an XArray iterator to skip over NULL entries, and avoid calling get_page(), because we only want the swap entries. [willy@infradead.org: fix LTP soft lockups] Link: https://lkml.kernel.org/r/20200914165032.GS6583@casper.infradead.org Signed-off-by:
Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Qian Cai <cai@redhat.com> Link: https://lkml.kernel.org/r/20200910183318.20139-4-willy@infradead.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
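A hedged sketch of the iteration pattern this switches to (my paraphrase of the described approach, not the verbatim kernel code; variable names are illustrative):

    /* Walk the mapping's XArray directly: present pages are skipped
     * without taking a reference; only shadow/swap entries are acted on. */
    XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
    pgoff_t end_index = linear_page_index(vma, end) - 1;
    struct page *page;

    rcu_read_lock();
    xas_for_each(&xas, page, end_index) {
        swp_entry_t swap;

        if (!xa_is_value(page))      /* a real page, not a swap entry */
            continue;
        swap = radix_to_swp_entry(page);
        /* ... kick off swap readahead for 'swap' ... */
    }
    rcu_read_unlock();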
-
- 26 Sep, 2020 1 commit
-
-
Minchan Kim authored
syzbot reported the following KASAN splat:

general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 <80> 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
Call Trace:
 lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:354 [inline]
 madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
 walk_pmd_range mm/pagewalk.c:89 [inline]
 walk_pud_range mm/pagewalk.c:160 [inline]
 walk_p4d_range mm/pagewalk.c:193 [inline]
 walk_pgd_range mm/pagewalk.c:229 [inline]
 __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
 walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
 madvise_pageout_page_range mm/madvise.c:521 [inline]
 madvise_pageout mm/madvise.c:557 [inline]
 madvise_vma mm/madvise.c:946 [inline]
 do_madvise+0x12d0/0x2090 mm/madvise.c:1145
 __do_sys_madvise mm/madvise.c:1171 [inline]
 __se_sys_madvise mm/madvise.c:1169 [inline]
 __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The backing vma was shmem. When splitting a page of a file-backed THP, madvise zaps the pmd instead of remapping the sub-pages. So we need to check pmd validity after the split.

Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT") Signed-off-by:
Minchan Kim <minchan@kernel.org> Acked-by:
Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 05 Sep, 2020 1 commit
-
-
Yang Shi authored
syzbot reported the below use-after-free:

BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline]
BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline]
BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996
CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x18f/0x20d lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
 __kasan_report mm/kasan/report.c:513 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
 madvise_willneed mm/madvise.c:293 [inline]
 madvise_vma mm/madvise.c:942 [inline]
 do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
 do_madvise mm/madvise.c:1169 [inline]
 __do_sys_madvise mm/madvise.c:1171 [inline]
 __se_sys_madvise mm/madvise.c:1169 [inline]
 __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Allocated by task 9992:
 kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482
 vm_area_alloc+0x1c/0x110 kernel/fork.c:347
 mmap_region+0x8e5/0x1780 mm/mmap.c:1743
 do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
 vm_mmap_pgoff+0x195/0x200 mm/util.c:506
 ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 9992:
 kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693
 remove_vma+0x132/0x170 mm/mmap.c:184
 remove_vma_list mm/mmap.c:2613 [inline]
 __do_munmap+0x743/0x1170 mm/mmap.c:2869
 do_munmap mm/mmap.c:2877 [inline]
 mmap_region+0x257/0x1780 mm/mmap.c:1716
 do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
 vm_mmap_pgoff+0x195/0x200 mm/util.c:506
 ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because the vma is accessed after releasing the mmap_lock: someone else acquired the mmap_lock and the vma is gone. Releasing the mmap_lock only after accessing the vma should fix the problem.

Fixes: 692fe624 ("mm: Handle MADV_WILLNEED through vfs_fadvise()") Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com Signed-off-by:
Yang Shi <shy828301@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org> [5.4+] Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.com Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 09 Jun, 2020 2 commits
-
-
Michel Lespinasse authored
Convert comments that reference mmap_sem to reference mmap_lock instead. [akpm@linux-foundation.org: fix up linux-next leftovers] [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil] [akpm@linux-foundation.org: more linux-next fixups, per Michel] Signed-off-by:
Michel Lespinasse <walken@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Reviewed-by:
Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com S...
-
Michel Lespinasse authored
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by:
Michel Lespinasse <walken@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Reviewed-by:
Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by:
Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@...
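In other words, a converted call site looks like this (grounded in the coccinelle rule above; find_vma() is just an example body):

    /* before */
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    up_read(&mm->mmap_sem);

    /* after */
    mmap_read_lock(mm);
    vma = find_vma(mm, addr);
    mmap_read_unlock(mm);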
-
- 24 Apr, 2020 1 commit
-
-
Linus Torvalds authored
IORING_OP_MADVISE can end up basically doing mprotect() on the VM of another process, which means that it can race with our crazy core dump handling which accesses the VM state without holding the mmap_sem (because it incorrectly thinks that it is the final user).

This is clearly a core dumping problem, but we've never fixed it the right way, and instead have the notion of "check that the mm is still ok" using mmget_still_valid() after getting the mmap_sem for writing in any situation where we're not the original VM thread.

See commit 04f5866e ("coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping") for more background on this whole mmget_still_valid() thing. You might want to have a barf bag handy when you do.

We're discussing just fixing this properly in the only remaining core dumping routines. But even if we do that, let's make do_madvise() do the right thing, and then when we fix core dumping, we can remove all these mmget_still_valid() checks.

Reported-and-tested-by:
Jann Horn <jannh@google.com> Fixes: c1ca757b ("io_uring: add IORING_OP_MADVISE") Acked-by:
Jens Axboe <axboe@kernel.dk> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 22 Mar, 2020 1 commit
-
-
Michal Hocko authored
Jann has brought up a very interesting point [1]. While shared pages are excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed that way. This can lead to all sorts of hard-to-debug problems, e.g., the performance problems outlined by Daniel [2].

There are runtime environments where substantial memory is shared among security domains via CoW, and an easy way to reclaim that memory, which MADV_{COLD,PAGEOUT} offers, can lead either to performance degradation for the parent process (which might be more privileged) or even open side-channel attacks. The feasibility of the latter is not really clear to me, TBH, but there is no real reason for exposure at this stage.

It seems there is no real use case that depends on reclaiming CoW memory via madvise at this stage, so it is much easier to simply disallow it, and this is what this patch does. Put simply, MADV_{PAGEOUT,COLD} can operate only on exclusively owned memory, which is a straightforward semantic.

[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com

Fixes: 9c276cc6 ("mm: introduce MADV_COLD") Reported-by:
Jann Horn <jannh@google.com> Signed-off-by:
Michal Hocko <mhocko@suse.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 21 Jan, 2020 1 commit
-
-
Jens Axboe authored
This is in preparation for enabling this functionality through io_uring. Add a helper that is just exporting what sys_madvise() does, and have the system call use it. No functional changes in this patch. Reviewed-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
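The shape of that refactor, sketched (the helper name comes from the message; the body is elided):

    int do_madvise(unsigned long start, size_t len_in, int behavior)
    {
        /* ... former sys_madvise() body ... */
    }

    SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
    {
        return do_madvise(start, len_in, behavior);
    }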
-
- 01 Dec, 2019 3 commits
-
-
Wei Yang authored
Improve readability, no functional change. Link: http://lkml.kernel.org/r/20191118032857.22683-1-richardw.yang@linux.intel.com Signed-off-by:
Wei Yang <richardw.yang@linux.intel.com> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Yunfeng Ye authored
page_size() has been available since commit a50b854e ("mm: introduce page_size()"). Use page_size() in madvise_inject_error() for readability. [akpm@linux-foundation.org: use ulong for `size', per David] Link: http://lkml.kernel.org/r/29dce60c-38d6-0220-f292-e298f0c78c4d@huawei.com Signed-off-by:
Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by:
Andrew Morton <akpm@linux-foundation.org> Acked-by:
David Rientjes <rientjes@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
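A sketch of the substitution (page_size() folds the open-coded compound-page computation into one helper; `size' is the ulong mentioned in the note above):

    unsigned long size;

    /* before: open-coded compound page size */
    size = PAGE_SIZE << compound_order(page);
    /* after: */
    size = page_size(page);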
-
Naoya Horiguchi authored
Currently soft_offline_page() receives a struct page, while its sibling memory_failure() receives a pfn. This discrepancy looks weird and makes prechecking pfn validity tricky. So let's align them. Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp Signed-off-by:
Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by:
Andrew Morton <akpm@linux-foundation.org> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
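The alignment amounts to a signature change; a sketch under the era's prototypes (not the full diff):

    /* before: sibling interfaces disagree */
    int memory_failure(unsigned long pfn, int flags);
    int soft_offline_page(struct page *page, int flags);

    /* after: both take a pfn, so validity can be checked up front */
    int soft_offline_page(unsigned long pfn, int flags);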
-
- 16 Nov, 2019 1 commit
-
-
zhong jiang authored
Recently, I hit the following issue when running upstream.

    kernel BUG at mm/vmscan.c:1521!
    invalid opcode: 0000 [#1] SMP KASAN PTI
    CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
    RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
    Call Trace:
     reclaim_pages+0x499/0x800 mm/vmscan.c:2188
     madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
     walk_pmd_range mm/pagewalk.c:53 [inline]
     walk_pud_range mm/pagewalk.c:112 [inline]
     walk_p4d_range mm/pagewalk.c:139 [inline]
     walk_pgd_range mm/pagewalk.c:166 [inline]
     __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
     walk_page_range+0x179/0x310 mm/pagewalk.c:349
     madvise_pageout_page_range mm/madvise.c:506 [inline]
     madvise_pageout+0x1f0/0x330 mm/madvise.c:542
     madvise_vma mm/madvise.c:931 [inline]
     __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
     do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

madvise_pageout() walks the specified range of the vma and isolates its pages, then runs shrink_page_list() to reclaim their memory. But it also isolates unevictable pages for reclaim, and shrink_page_list() catches that case with the BUG above. The root cause is that we scan the page tables instead of a specific LRU list, so we need to filter out the unevictable LRU pages on our end. Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT") Signed-off-by:
zhong jiang <zhongjiang@huawei.com> Suggested-by:
Johannes Weiner <hannes@cmpxchg.org> Acked-by:
Johannes Weiner <hannes@cmpxchg.org> Acked-by:
Minchan Kim <minchan@kernel.org> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
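A minimal sketch of the kind of filter the message describes, placed in the madvise page walker (the exact upstream diff may differ):

    /* never hand unevictable pages to reclaim */
    if (PageUnevictable(page))
        continue;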
-
- 26 Sep, 2019 4 commits
-
-
Minchan Kim authored
There are many common parts between MADV_COLD and MADV_PAGEOUT. This patch factors them out to avoid code duplication. Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.org Signed-off-by:
Minchan Kim <minchan@kernel.org> Suggested-by:
Johannes Weiner <hannes@cmpxchg.org> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: kbuild test robot <lkp@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
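The deduplication boils down to one shared walker parameterized by a flag; a sketch matching the description (the struct and field names are assumptions based on the surrounding commits):

    struct madvise_walk_private {
        struct mmu_gather *tlb;
        bool pageout;   /* true: reclaim (MADV_PAGEOUT); false: deactivate (MADV_COLD) */
    };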
-
Minchan Kim authored
When a process expects no accesses to a certain memory range for a long time, it can hint to the kernel that the pages can be reclaimed instantly while their data is preserved for future use. This can reduce working-set eviction and so ends up improving performance. This patch introduces the new MADV_PAGEOUT hint to the madvise(2) syscall. MADV_PAGEOUT can be used by a process to mark a memory range as not expected to be used for a long time, so that the kernel reclaims *any LRU* pages instantly. The hint can help the kernel in deciding which pages to evict proactively. A note: it intentionally does not apply the SWAP_CLUSTER_MAX LRU page isolation limit, because the work is automatically bounded by the PMD size. If the PMD size (e.g., 256 pages) causes trouble, we can fix it later by limiting it to SWAP_CLUSTER_MAX [1]. - man-page material MADV_PAGEOUT (since Linux x.x) Do not expect access in the near future, so pages in the specified regions can be reclaimed instantly regardless of memory pressure. Thus, access in the range after a successful operation can cause a major page fault but, unlike MADV_DONTNEED, never loses the up-to-date contents. Pages belonging to a shared mapping are only processed if a write access is allowed for the calling process. MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP pages. [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/ [minchan@kernel.org: clear PG_active on MADV_PAGEOUT] Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com [akpm@linux-foundation.org: resolve conflicts with hmm.git] Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org Signed-off-by:
Minchan Kim <minchan@kernel.org> Reported-by:
kbuild test robot <lkp@intel.com> Acked-by:
Michal Hocko <mhocko@suse.com> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Chris Zankel <chris@zankel.net> Cc: Daniel Colascione <dancol@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oleksandr Natalenko <oleksandr@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Sonny Rao <sonnyrao@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tim Murray <timmurray@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
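From user space the new hint is a one-liner; a sketch assuming libc headers that already define MADV_PAGEOUT, where buf/len describe an existing mapping:

    #include <stdio.h>
    #include <sys/mman.h>

    static void drop_cold_range(void *buf, size_t len)
    {
        /* reclaim now; contents are kept, the next access may major-fault */
        if (madvise(buf, len, MADV_PAGEOUT) == -1)
            perror("madvise(MADV_PAGEOUT)");
    }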
-
Minchan Kim authored
Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7. - Background The Android terminology used for forking a new process and starting an app from scratch is a cold start, while resuming an existing app is a hot start. While we continually try to improve the performance of cold starts, hot starts will always be significantly less power hungry as well as faster, so we are trying to make hot starts more likely than cold starts. To increase hot starts, Android userspace manages the order in which apps should be killed in a process called ActivityManagerService. ActivityManagerService tracks every Android app or service that the user could be interacting with at any time and translates that into a ranked list for lmkd (low memory killer daemon). They are likely to be killed by lmkd if the system has to reclaim memory. In that sense they are similar to entries in any other cache. Those apps are kept alive for opportunistic performance improvements but those performance improvements wil...
-
Andrey Konovalov authored
This patch is part of a series that extends the kernel ABI to allow passing tagged user pointers (with the top byte set to something other than 0x00) as syscall arguments. This patch allows tagged pointers to be passed to the following memory syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect, mremap, msync, munlock, move_pages. The mmap and mremap syscalls do not currently accept tagged addresses. Architectures may interpret the tag as a background colour for the corresponding vma. Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com Signed-off-by:
Andrey Konovalov <andreyknvl@google.com> Reviewed-by:
Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by:
Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by:
Catalin Marinas <catalin.marinas@arm.com> Reviewed-by:
Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dave Hansen <dave.hansen@intel.com>...
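For each listed syscall the change is mechanical; a sketch for madvise, using the kernel's untagged_addr() helper to mask the tag byte (the exact placement is an assumption):

    SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
    {
        start = untagged_addr(start);   /* drop the top-byte tag up front */
        /* ... unchanged body operating on the untagged address ... */
    }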
-
- 24 Sep, 2019 1 commit
-
-
Mike Rapoport authored
madvise_behavior() converts -ENOMEM to -EAGAIN in several places using identical code. Move that code to a common error handling path. No functional changes. Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com Signed-off-by:
Mike Rapoport <rppt@linux.ibm.com> Acked-by:
Pankaj Gupta <pagupta@redhat.com> Reviewed-by:
Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
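A sketch of the common exit path the message describes (the label name is an assumption):

    out:
        if (error == -ENOMEM)
            error = -EAGAIN;
        return error;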
-
- 07 Sep, 2019 2 commits
-
-
Christoph Hellwig authored
The mm_walk structure currently mixes data and code. Split out the operations vectors into a new mm_walk_ops structure, and while we are changing the API, also declare the mm_walk structure inside the walk_page_range and walk_page_vma functions. Based on a patch from Linus Torvalds. Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by:
Steven Price <steven.price@arm.com> Reviewed-by:
Jason Gunthorpe <jgg@mellanox.com> Signed-off-by:
Jason Gunthorpe <jgg@mellanox.com>
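After the split, callers keep their callbacks in a const operations vector and pass state separately; a sketch against the new API (the ops and callback names are hypothetical):

    static const struct mm_walk_ops my_walk_ops = {
        .pmd_entry = my_pmd_entry,      /* code lives here, const and shared */
    };

    /* data (mm, range, private state) is supplied per call */
    int err = walk_page_range(mm, start, end, &my_walk_ops, private);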
-
Christoph Hellwig authored
Add a new header for the two handfuls of users of the walk_page_range / walk_page_vma interface instead of polluting all users of mm.h with it. Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by:
Steven Price <steven.price@arm.com> Reviewed-by:
Jason Gunthorpe <jgg@mellanox.com> Signed-off-by:
Jason Gunthorpe <jgg@mellanox.com>
-
- 31 Aug, 2019 1 commit
-
-
Jan Kara authored
Currently, handling of the MADV_WILLNEED hint calls directly into the readahead code. Handle it by calling vfs_fadvise() instead, so that the filesystem can use its ->fadvise() callback to acquire necessary locks or otherwise prepare for the request. Suggested-by:
Amir Goldstein <amir73il@gmail.com> Reviewed-by:
Boaz Harrosh <boazh@netapp.com> CC: stable@vger.kernel.org Signed-off-by:
Jan Kara <jack@suse.cz> Reviewed-by:
Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by:
Darrick J. Wong <darrick.wong@oracle.com>
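The redirection is small; a sketch of the call the message describes, where offset/len are assumed to be the byte range of the file backing the madvised region:

    /* let the filesystem's ->fadvise() prepare for the readahead request */
    return vfs_fadvise(file, offset, len, POSIX_FADV_WILLNEED);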
-
- 02 Jul, 2019 1 commit
-
-
Christoph Hellwig authored
The code hasn't been used since it was added to the tree, and doesn't appear to actually be usable. Signed-off-by:
Christoph Hellwig <hch@lst.de> Reviewed-by:
Jason Gunthorpe <jgg@mellanox.com> Acked-by:
Michal Hocko <mhocko@suse.com> Reviewed-by:
Dan Williams <dan.j.williams@intel.com> Tested-by:
Dan Williams <dan.j.williams@intel.com> Signed-off-by:
Jason Gunthorpe <jgg@mellanox.com>
-
- 14 May, 2019 1 commit
-
-
Jérôme Glisse authored
This updates each existing invalidation to use the correct mmu notifier event that represents what is happening to the CPU page table. See the patch which introduced the events for the rationale behind this. Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com Signed-off-by:
Jérôme Glisse <jglisse@redhat.com> Reviewed-by:
Ralph Campbell <rcampbell@nvidia.com> Reviewed-by:
Ira Weiny <ira.weiny@intel.com> Cc: Christian König <christian.koenig@amd.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krcmar <rkrcmar@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
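Each invalidation site now names its event; a sketch for a madvise-style zap using the updated initializer (MMU_NOTIFY_CLEAR is chosen here for illustration):

    struct mmu_notifier_range range;

    /* tag the invalidation with what is happening to the page table */
    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... clear the PTEs in [start, end) ... */
    mmu_notifier_invalidate_range_end(&range);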
-