- 11 Jan, 2012 1 commit
-
-
Andrea Arcangeli authored
migrate was doing an rmap_walk with speculative lock-less access on pagetables. That could lead it to not serializing properly against mremap PT locks. But a second problem remains in the order of vmas in the same_anon_vma list used by the rmap_walk. If vma_merge succeeds in copy_vma, the src vma could be placed after the dst vma in the same_anon_vma list. That could still lead to migrate missing some pte. This patch adds an anon_vma_moveto_tail() function to force the dst vma at the end of the list before mremap starts to solve the problem. If the mremap is very large and there are a lots of parents or childs sharing the anon_vma root lock, this should still scale better than taking the anon_vma root lock around every pte copy practically for the whole duration of mremap. Update: Hugh noticed special care is needed in the error path where move_page_tables goes in the reverse direction, a second anon_vma_moveto_tail() call is needed in the error path. This program exercises the anon_vma_moveto_tail: === int main() { static struct timeval oldstamp, newstamp; long diffsec; char *p, *p2, *p3, *p4; if (posix_memalign((void **)&p, 2*1024*1024, SIZE)) perror("memalign"), exit(1); if (posix_memalign((void **)&p2, 2*1024*1024, SIZE)) perror("memalign"), exit(1); if (posix_memalign((void **)&p3, 2*1024*1024, SIZE)) perror("memalign"), exit(1); memset(p, 0xff, SIZE); printf("%p\n", p); memset(p2, 0xff, SIZE); memset(p3, 0x77, 4096); if (memcmp(p, p2, SIZE)) printf("error\n"); p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3); if (p4 != p3) perror("mremap"), exit(1); p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2); if (p4 != p+SIZE/2) perror("mremap"), exit(1); if (memcmp(p, p2, SIZE)) printf("error\n"); printf("ok\n"); return 0; } === $ perf probe -a anon_vma_moveto_tail Add new event: probe:anon_vma_moveto_tail (on anon_vma_moveto_tail) You can now use it on all perf tools, such as: perf record -e probe:anon_vma_moveto_tail -aR sleep 1 $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail 0x7f2ca2800000 ok [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ] $ perf report --stdio 100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Reported-by:
Nai Xia <nai.xia@gmail.com> Acked-by:
Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Pawel Sikora <pluto@agmk.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 01 Nov, 2011 3 commits
-
-
Andrea Arcangeli authored
This adds THP support to mremap (decreases the number of split_huge_page() calls). Here are also some benchmarks with a proggy like this: === #define _GNU_SOURCE #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #include <string.h> #include <sys/time.h> #define SIZE (5UL*1024*1024*1024) int main() { static struct timeval oldstamp, newstamp; long diffsec; char *p, *p2, *p3, *p4; if (posix_memalign((void **)&p, 2*1024*1024, SIZE)) perror("memalign"), exit(1); if (posix_memalign((void **)&p2, 2*1024*1024, SIZE)) perror("memalign"), exit(1); if (posix_memalign((void **)&p3, 2*1024*1024, 4096)) perror("memalign"), exit(1); memset(p, 0xff, SIZE); memset(p2, 0xff, SIZE); memset(p3, 0x77, 4096); gettimeofday(&oldstamp, NULL); p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3); gettimeofday(&newstamp, NULL); diffsec = newstamp.tv_sec - oldstamp.tv_sec; diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec; printf("usec %ld\n", diffsec); if (p == MAP_FAILED || p4 != p3) //if (p == MAP_FAILED) perror("mremap"), exit(1); if (memcmp(p4, p2, SIZE)) printf("mremap bug\n"), exit(1); printf("ok\n"); return 0; } === THP on Performance counter stats for './largepage13' (3 runs): 69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%) 60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%) 676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%) 29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%) 1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%) 8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%) 7.314454164 seconds time elapsed ( +- 0.023% ) THP off Performance counter stats for './largepage13' (3 runs): 1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%) 9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%) 2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%) 3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%) 6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%) 8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%) 9.693655243 seconds time elapsed ( +- 0.069% ) grep thp /proc/vmstat thp_fault_alloc 35849 thp_fault_fallback 0 thp_collapse_alloc 3 thp_collapse_alloc_failed 0 thp_split 0 thp_split 0 confirms no thp split despite plenty of hugepages allocated. The measurement of only the mremap time (so excluding the 3 long memset and final long 10GB memory accessing memcmp): THP on usec 14824 usec 14862 usec 14859 THP off usec 256416 usec 255981 usec 255847 With an older kernel without the mremap optimizations (the below patch optimizes the non THP version too). THP on usec 392107 usec 390237 usec 404124 THP off usec 444294 usec 445237 usec 445820 I guess with a threaded program that sends more IPI on large SMP it'd create an even larger difference. All debug options are off except DEBUG_VM to avoid skewing the results. The only problem for native 2M mremap like it happens above both the source and destination address must be 2M aligned or the hugepmd can't be moved without a split but that is an hardware limitation. [akpm@linux-foundation.org: coding-style nitpicking] Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Acked-by:
Johannes Weiner <jweiner@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Acked-by:
Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andrea Arcangeli authored
This replaces ptep_clear_flush() with ptep_get_and_clear() and a single flush_tlb_range() at the end of the loop, to avoid sending one IPI for each page. The mmu_notifier_invalidate_range_start/end section is enlarged accordingly but this is not going to fundamentally change things. It was more by accident that the region under mremap was for the most part still available for secondary MMUs: the primary MMU was never allowed to reliably access that region for the duration of the mremap (modulo trapping SIGSEGV on the old address range which sounds unpractical and flakey). If users wants secondary MMUs not to lose access to a large region under mremap they should reduce the mremap size accordingly in userland and run multiple calls. Overall this will run faster so it's actually going to reduce the time the region is under mremap for the primary MMU which should provide a net benefit to apps. For KVM this is a noop because the guest physical memory is never mremapped, there's just no point it ever moving it while guest runs. One target of this optimization is JVM GC (so unrelated to the mmu notifier logic). Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Acked-by:
Johannes Weiner <jweiner@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Acked-by:
Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andrea Arcangeli authored
Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1, those are reasonable requirements but the check remains obscure and it looks more like an off by one error than an overflow check. This I feel will improve readability. Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Acked-by:
Johannes Weiner <jweiner@redhat.com> Acked-by:
Mel Gorman <mgorman@suse.de> Acked-by:
Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 25 May, 2011 2 commits
-
-
Peter Zijlstra authored
Straightforward conversion of i_mmap_lock to a mutex. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by:
Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Peter Zijlstra authored
Hugh says: "The only significant loser, I think, would be page reclaim (when concurrent with truncation): could spin for a long time waiting for the i_mmap_mutex it expects would soon be dropped? " Counter points: - cpu contention makes the spin stop (need_resched()) - zap pages should be freeing pages at a higher rate than reclaim ever can I think the simplification of the truncate code is definitely worth it. Effectively reverts: 2aa15890 ("mm: prevent concurrent unmap_mapping_range() on the same inode") and takes out the code that caused its problem. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by:
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Jeff Dike <jdike@addtoit.com> Cc: Richard Weinberger <richard@nod.at> Cc: Tony Luck <tony.luck@intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Namhyung Kim <namhyung@gmail.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 07 Apr, 2011 1 commit
-
-
Linus Torvalds authored
The normal mmap paths all avoid creating a mapping where the pgoff inside the mapping could wrap around due to overflow. However, an expanding mremap() can take such a non-wrapping mapping and make it bigger and cause a wrapping condition. Noticed by Robert Swiecki when running a system call fuzzer, where it caused a BUG_ON() due to terminally confusing the vma_prio_tree code. A vma dumping patch by Hugh then pinpointed the crazy wrapped case. Reported-and-tested-by:
Robert Swiecki <robert@swiecki.net> Acked-by:
Hugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 24 Feb, 2011 1 commit
-
-
Hugh Dickins authored
Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching a hole with madvise(,, MADV_REMOVE). That path is under mutex, and cannot be explained by lack of serialization in unmap_mapping_range(). Reviewing the code, I found one place where vm_truncate_count handling should have been updated, when I switched at the last minute from one way of managing the restart_addr to another: mremap move changes the virtual addresses, so it ought to adjust the restart_addr. But rather than exporting the notion of restart_addr from memory.c, or converting to restart_pgoff throughout, simply reset vm_truncate_count to 0 to force a rescan if mremap move races with preempted truncation. We have no confirmation that this fixes Robert's BUG, but it is a fix that's worth making anyway. Signed-off-by:
Hugh Dickins <hughd@google.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 14 Jan, 2011 2 commits
-
-
Andrea Arcangeli authored
split_huge_page_pmd compat code. Each one of those would need to be expanded to hundred of lines of complex code without a fully reliable split_huge_page_pmd design. Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mel@csn.ul.ie> Signed-off-by:
Johannes Weiner <hannes@cmpxchg.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andrea Arcangeli authored
pte alloc routines must wait for split_huge_page if the pmd is not present and not null (i.e. pmd_trans_splitting). The additional branches are optimized away at compile time by pmd_trans_splitting if the config option is off. However we must pass the vma down in order to know the anon_vma lock to wait for. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by:
Andrea Arcangeli <aarcange@redhat.com> Acked-by:
Rik van Riel <riel@redhat.com> Acked-by:
Mel Gorman <mel@csn.ul.ie> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 26 Oct, 2010 1 commit
-
-
Peter Zijlstra authored
Since we no longer need to provide KM_type, the whole pte_*map_nested() API is now redundant, remove it. Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by:
Chris Metcalf <cmetcalf@tilera.com> Cc: David Howells <dhowells@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Miller <davem@davemloft.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 30 Mar, 2010 1 commit
-
-
Tejun Heo authored
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include bloc...
-
- 06 Mar, 2010 2 commits
-
-
Rik van Riel authored
The old anon_vma code can lead to scalability issues with heavily forking workloads. Specifically, each anon_vma will be shared between the parent process and all its child processes. In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page. This can result in systems where one CPU is walking the page tables of 1000 processes in page_referenced_one, while all other CPUs are stuck on the anon_vma lock. This leads to catastrophic failure for a benchmark like AIM7, where the total number of processes can reach in the tens of thousands. Real workloads are still a factor 10 less process intensive than AIM7, but they are catching up. This patch changes the way anon_vmas and VMAs are linked, which allows us to associate multiple anon_vmas with a VMA. At fork time, each child process gets its own anon_vmas, in which its COWed pages will be instantiated. The parents' anon_vma is also linked to the VMA, because non-COWed pages could be present in any of the children. This reduces rmap scanning complexity to O(1) for the pages of the 1000 child processes, with O(N) complexity for at most 1/N pages in the system. This reduces the average scanning cost in heavily forking workloads from O(N) to 2. The only real complexity in this patch stems from the fact that linking a VMA to anon_vmas now involves memory allocations. This means vma_adjust can fail, if it needs to attach a VMA to anon_vma structures. This in turn means error handling needs to be added to the calling functions. A second source of complexity is that, because there can be multiple anon_vmas, the anon_vma linking in vma_adjust can no longer be done under "the" anon_vma lock. To prevent the rmap code from walking up an incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h to make sure it is impossible to compile a kernel that needs both symbolic values for the same bitflag. Some test results: Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test box with 16GB RAM and not quite enough IO), the system ends up running >99% in system time, with every CPU on the same anon_vma lock in the pageout code. With these changes, AIM7 hits the cross-over point around 29.7k users. This happens with ~99% IO wait time, there never seems to be any spike in system time. The anon_vma lock contention appears to be resolved. [akpm@linux-foundation.org: cleanups] Signed-off-by:
Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Jiri Slaby authored
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in 3e10e716 ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by:
Jiri Slaby <jslaby@suse.cz> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 11 Dec, 2009 7 commits
-
-
Al Viro authored
Acked-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Acked-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Acked-by:
Russell King <rmk+kernel@arm.linux.org.uk> Acked-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Acked-by:
Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Take the check for being able to expand vma in place into a separate helper. Acked-by:
Russell King <rmk+kernel@arm.linux.org.uk> Acked-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Take the MREMAP_FIXED into a separate helper, simplify the living hell out of conditions in both cases. Acked-by:
Russell King <rmk+kernel@arm.linux.org.uk> Acked-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
Take locating vma and checks on it to a separate helper (it will be shared between MREMAP_FIXED/non-MREMAP_FIXED cases when we split them in the next patch) Acked-by:
Russell King <rmk+kernel@arm.linux.org.uk> Acked-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- 24 Sep, 2009 1 commit
-
-
npiggin@suse.de authored
Introduce new truncate helpers truncate_pagecache and inode_newsize_ok. vmtruncate is also consolidated from mm/memory.c and mm/nommu.c and into mm/truncate.c. Reviewed-by:
Christoph Hellwig <hch@lst.de> Signed-off-by:
Nick Piggin <npiggin@suse.de> Signed-off-by:
Al Viro <viro@zeniv.linux.org.uk>
-
- 22 Sep, 2009 2 commits
-
-
Hugh Dickins authored
mremap move's use of ksm_madvise() was assuming -ENOMEM on failure, because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says -ENOMEM (letting madvise convert that to -EAGAIN), and can also say -ERESTARTSYS when signalled: so pass the error from ksm_madvise. Signed-off-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by:
Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Hugh Dickins authored
KSM's scan allows for user pages to be COWed or unmapped at any time, without requiring any notification. But its stable tree does assume that when it finds a KSM page where it placed a KSM page, then it is the same KSM page that it placed there. mremap move could break that assumption: if an area containing a KSM page was unmapped, then an area containing a different KSM page was moved with mremap into the place of the original, before KSM's scan came around to notice. That could then poison a node of the stable tree, so that memcmps would "lie" and upset the ordering of the tree. Probably noone will ever need mremap move on a VM_MERGEABLE area; except that prohibiting it would make trouble for schemes in which we try making everything VM_MERGEABLE e.g. for testing: an mremap which normally works would then fail mysteriously. There's no need to go to any trouble, such as re-sorting KSM's list of rmap_items to match the new layout: simply unmerge the area to COW all its KSM pages before moving, but leave VM_MERGEABLE on so that they're remerged later. Signed-off-by:
Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by:
Chris Wright <chrisw@redhat.com> Signed-off-by:
Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Avi Kivity <avi@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 14 Jan, 2009 2 commits
-
-
Heiko Carstens authored
Signed-off-by:
Heiko Carstens <heiko.carstens@de.ibm.com>
-
Heiko Carstens authored
Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by:
Heiko Carstens <heiko.carstens@de.ibm.com>
-
- 06 Jan, 2009 1 commit
-
-
Alan Cox authored
Signed-off-by:
Alan Cox <alan@redhat.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 20 Oct, 2008 1 commit
-
-
Rik van Riel authored
Originally by Nick Piggin <npiggin@suse.de> Remove mlocked pages from the LRU using "unevictable infrastructure" during mmap(), munmap(), mremap() and truncate(). Try to move back to normal LRU lists on munmap() when last mlocked mapping removed. Remove PageMlocked() status when page truncated from file. [akpm@linux-foundation.org: cleanup] [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()] [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework] [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block] [akpm@linux-foundation.org: remove bogus kerneldoc token] Signed-off-by:
Nick Piggin <npiggin@suse.de> Signed-off-by:
Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by:
Rik van Riel <riel@redhat.com> Signed-off-by:
KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by:
KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 28 Jul, 2008 1 commit
-
-
Andrea Arcangeli authored
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages. There are secondary MMUs (with secondary sptes and secondary tlbs) too. sptes in the kvm case are shadow pagetables, but when I say spte in mmu-notifier context, I mean "secondary pte". In GRU case there's no actual secondary pte and there's only a secondary tlb because the GRU secondary MMU has no knowledge about sptes and every secondary tlb miss event in the MMU always generates a page fault that has to be resolved by the CPU (this is not the case of KVM where the a secondary tlb miss will walk sptes in hardware and it will refill the secondary tlb transparently to software if the corresponding spte is present). The same way zap_page_range has to invalidate the pte before freeing the page, the spte (and secondary tlb) must also be invalidated before any page is freed and reused. Currently we take a page_count pin on every page mapped by sptes, but that means the pages can't be swapped whenever they're map...
-
- 18 Oct, 2007 1 commit
-
-
Stephen Hemminger authored
Get rid of sparse related warnings from places that use integer as NULL pointer. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by:
Stephen Hemminger <shemminger@linux-foundation.org> Cc: Andi Kleen <ak@suse.de> Cc: Jeff Garzik <jeff@garzik.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Ian Kent <raven@themaw.net> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 19 Jul, 2007 1 commit
-
-
Ollie Wild authored
Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from the old mm into the new mm. We create the new mm before the binfmt code runs, and place the new stack at the very top of the address space. Once the binfmt code runs and figures out where the stack should be, we move it downwards. It is a bit peculiar in that we have one task with two mm's, one of which is inactive. [a.p.zijlstra@chello.nl: limit stack size] Signed-off-by:
Ollie Wild <aaw@google.com> Signed-off-by:
Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: <linux-arch@vger.kernel.org> Cc: Hugh Dickins <hugh@veritas.com> [bunk@stusta.de: unexport bprm_mm_init] Signed-off-by:
Adrian Bunk <bunk@stusta.de> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 12 Jul, 2007 1 commit
-
-
Eric Paris authored
Add a new security check on mmap operations to see if the user is attempting to mmap to low area of the address space. The amount of space protected is indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to 0, preserving existing behavior. This patch uses a new SELinux security class "memprotect." Policy already contains a number of allow rules like a_t self:process * (unconfined_t being one of them) which mean that putting this check in the process class (its best current fit) would make it useless as all user processes, which we also want to protect against, would be allowed. By taking the memprotect name of the new class it will also make it possible for us to move some of the other memory protect permissions out of 'process' and into the new class next time we bump the policy version number (which I also think is a good future idea) Acked-by:
Stephen Smalley <sds@tycho.nsa.gov> Acked-by:
Chris Wright <chrisw@sous-sol.org> Signed-off-by:
Eric Paris <eparis@redhat.com> Signed-off-by:
James Morris <jmorris@namei.org>
-
- 30 Jan, 2007 1 commit
-
-
Hugh Dickins authored
Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE. Instead of complicating that move_pte, just forget the minor optimization when mremapping, and change the one thing which needed it for correctness - filemap_xip use ZERO_PAGE(0) throughout instead of according to address. [ "There is no block device driver one could use for XIP on mips platforms" - Carsten Otte ] Signed-off-by:
Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andrew Morton <akpm@osdl.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Carsten Otte <cotte@de.ibm.com> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
- 01 Oct, 2006 1 commit
-
-
Zachary Amsden authored
Implement lazy MMU update hooks which are SMP safe for both direct and shadow page tables. The idea is that PTE updates and page invalidations while in lazy mode can be batched into a single hypercall. We use this in VMI for shadow page table synchronization, and it is a win. It also can be used by PPC and for direct page tables on Xen. For SMP, the enter / leave must happen under protection of the page table locks for page tables which are being modified. This is because otherwise, you end up with stale state in the batched hypercall, which other CPUs can race ahead of. Doing this under the protection of the locks guarantees the synchronization is correct, and also means that spurious faults which are generated during this window by remote CPUs are properly handled, as the page fault handler must re-check the PTE under protection of the same lock. Signed-off-by:
Zachary Amsden <zach@vmware.com> Signed-off-by:
Jeremy Fitzhardinge <jeremy@xensource.com> Cc...
-
- 03 Jul, 2006 1 commit
-
-
Ingo Molnar authored
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by:
Ingo Molnar <mingo@elte.hu> Signed-off-by:
Arjan van de Ven <arjan@linux.intel.com> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 12 Jan, 2006 1 commit
-
-
Randy.Dunlap authored
- Move capable() from sched.h to capability.h; - Use <linux/capability.h> where capable() is used (in include/, block/, ipc/, kernel/, a few drivers/, mm/, security/, & sound/; many more drivers/ to go) Signed-off-by:
Randy Dunlap <rdunlap@xenotime.net> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 16 Dec, 2005 1 commit
-
-
Linus Torvalds authored
The logic that decides that a fork() might be able to avoid copying a VM area when it can be re-created by page faults didn't know about the new vm_insert_page() case. Also make some things a bit more anal wrt VM_PFNMAP. Pointed out by Hugh Dickins <hugh@veritas.com> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
- 30 Oct, 2005 3 commits
-
-
Hugh Dickins authored
Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with a many-threaded application which concurrently initializes different parts of a large anonymous area. This patch corrects that, by using a separate spinlock per page table page, to guard the page table entries in that page, instead of using the mm's single page_table_lock. (But even then, page_table_lock is still used to guard page table allocation, and anon_vma allocation.) In this implementation, the spinlock is tucked inside the struct page of the page table page: with a BUILD_BUG_ON in case it overflows - which it would in the case of 32-bit PA-RISC with spinlock debugging enabled. Splitting the lock is not quite for free: another cacheline access. Ideally, I suppose we would use split ptlock only for multi-threaded processes on multi-cpu machines; but deciding that dynamically would have its own costs. So for now enable it by config, at some number of cpus - since the Kconfig language doesn't support inequalities, let preprocessor compare that with NR_CPUS. But I don't think it's worth being user-configurable: for good testing of both split and unsplit configs, split now at 4 cpus, and perhaps change that to 8 later. There is a benefit even for singly threaded processes: kswapd can be attacking one part of the mm while another part is busy faulting. Signed-off-by:
Hugh Dickins <hugh@veritas.com> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
Hugh Dickins authored
Second step in pushing down the page_table_lock. Remove the temporary bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not to hold page_table_lock, whether it's on init_mm or a user mm; take page_table_lock internally to check if a racing task already allocated. Convert their callers from common code. But avoid coming back to change them again later: instead of moving the spin_lock(&mm->page_table_lock) down, switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which encapsulate the mapping+locking and unlocking+unmapping together, and in the end may use alternatives to the mm page_table_lock itself. These callers all hold mmap_sem (some exclusively, some not), so at no level can a page table be whipped away from beneath them; and pte_alloc uses the "atomic" pmd_present to test whether it needs to allocate. It appears that on all arches we can safely descend without page_table_lock. Signed-off-by:
Hugh Dickins <hugh@veritas.com> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-
Hugh Dickins authored
It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only calling out-of-line __pud_alloc __pmd_alloc if allocation needed, pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does add a little to kernel size, change them to macros testing inline, calling __pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them as fastcalls, leave that to CONFIG_REGPARM or not. It also seems more natural for the out-of-line functions to leave the offset calculation and map to the inline, which has to do it anyway for the common case. At least mremap move wants __pte_alloc without _map. Macros rather than inline functions, certainly to avoid the header file issues which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any architectures I haven't built would have other such problems. Signed-off-by:
Hugh Dickins <hugh@veritas.com> Signed-off-by:
Andrew Morton <akpm@osdl.org> Signed-off-by:
Linus Torvalds <torvalds@osdl.org>
-