1. 11 Jan, 2012 1 commit
    • Andrea Arcangeli's avatar
      mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli authored
      
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead it to not serializing properly against mremap
      PT locks.  But a second problem remains in the order of vmas in the
      same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function to force the dst vma at
      the end of the list before mremap starts to solve the problem.
      
      If the mremap is very large and there are a lots of parents or childs
      sharing the anon_vma root lock, this should still scale better than taking
      the anon_vma root lock around every pte copy practically for the whole
      duration of mremap.
      
      Update: Hugh noticed special care is needed in the error path where
      move_page_tables goes in the reverse direction, a second
      anon_vma_moveto_tail() call is needed in the error path.
      
      This program exercises the anon_vma_moveto_tail:
      
      ===
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarNai Xia <nai.xia@gmail.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      948f017b
  2. 01 Nov, 2011 3 commits
    • Andrea Arcangeli's avatar
      thp: mremap support and TLB optimization · 37a1c49a
      Andrea Arcangeli authored
      
      This adds THP support to mremap (decreases the number of split_huge_page()
      calls).
      
      Here are also some benchmarks with a proggy like this:
      
      ===
      #define _GNU_SOURCE
      #include <sys/mman.h>
      #include <stdlib.h>
      #include <stdio.h>
      #include <string.h>
      #include <sys/time.h>
      
      #define SIZE (5UL*1024*1024*1024)
      
      int main()
      {
              static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	gettimeofday(&oldstamp, NULL);
      	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	gettimeofday(&newstamp, NULL);
      	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
      	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
      	printf("usec %ld\n", diffsec);
      	if (p == MAP_FAILED || p4 != p3)
      	//if (p == MAP_FAILED)
      		perror("mremap"), exit(1);
      	if (memcmp(p4, p2, SIZE))
      		printf("mremap bug\n"), exit(1);
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      THP on
      
       Performance counter stats for './largepage13' (3 runs):
      
                69195836 dTLB-loads                 ( +-   3.546% )  (scaled from 50.30%)
                   60708 dTLB-load-misses           ( +-  11.776% )  (scaled from 52.62%)
               676266476 dTLB-stores                ( +-   5.654% )  (scaled from 69.54%)
                   29856 dTLB-store-misses          ( +-   4.081% )  (scaled from 89.22%)
              1055848782 iTLB-loads                 ( +-   4.526% )  (scaled from 80.18%)
                    8689 iTLB-load-misses           ( +-   2.987% )  (scaled from 58.20%)
      
              7.314454164  seconds time elapsed   ( +-   0.023% )
      
      THP off
      
       Performance counter stats for './largepage13' (3 runs):
      
              1967379311 dTLB-loads                 ( +-   0.506% )  (scaled from 60.59%)
                 9238687 dTLB-load-misses           ( +-  22.547% )  (scaled from 61.87%)
              2014239444 dTLB-stores                ( +-   0.692% )  (scaled from 60.40%)
                 3312335 dTLB-store-misses          ( +-   7.304% )  (scaled from 67.60%)
              6764372065 iTLB-loads                 ( +-   0.925% )  (scaled from 79.00%)
                    8202 iTLB-load-misses           ( +-   0.475% )  (scaled from 70.55%)
      
              9.693655243  seconds time elapsed   ( +-   0.069% )
      
      grep thp /proc/vmstat
      thp_fault_alloc 35849
      thp_fault_fallback 0
      thp_collapse_alloc 3
      thp_collapse_alloc_failed 0
      thp_split 0
      
      thp_split 0 confirms no thp split despite plenty of hugepages allocated.
      
      The measurement of only the mremap time (so excluding the 3 long
      memset and final long 10GB memory accessing memcmp):
      
      THP on
      
      usec 14824
      usec 14862
      usec 14859
      
      THP off
      
      usec 256416
      usec 255981
      usec 255847
      
      With an older kernel without the mremap optimizations (the below patch
      optimizes the non THP version too).
      
      THP on
      
      usec 392107
      usec 390237
      usec 404124
      
      THP off
      
      usec 444294
      usec 445237
      usec 445820
      
      I guess with a threaded program that sends more IPI on large SMP it'd
      create an even larger difference.
      
      All debug options are off except DEBUG_VM to avoid skewing the
      results.
      
      The only problem for native 2M mremap like it happens above both the
      source and destination address must be 2M aligned or the hugepmd can't be
      moved without a split but that is an hardware limitation.
      
      [akpm@linux-foundation.org: coding-style nitpicking]
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      37a1c49a
    • Andrea Arcangeli's avatar
      mremap: avoid sending one IPI per page · 7b6efc2b
      Andrea Arcangeli authored
      
      This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
      flush_tlb_range() at the end of the loop, to avoid sending one IPI for
      each page.
      
      The mmu_notifier_invalidate_range_start/end section is enlarged
      accordingly but this is not going to fundamentally change things.  It was
      more by accident that the region under mremap was for the most part still
      available for secondary MMUs: the primary MMU was never allowed to
      reliably access that region for the duration of the mremap (modulo
      trapping SIGSEGV on the old address range which sounds unpractical and
      flakey).  If users wants secondary MMUs not to lose access to a large
      region under mremap they should reduce the mremap size accordingly in
      userland and run multiple calls.  Overall this will run faster so it's
      actually going to reduce the time the region is under mremap for the
      primary MMU which should provide a net benefit to apps.
      
      For KVM this is a noop because the guest physical memory is never
      mremapped, there's just no point it ever moving it while guest runs.  One
      target of this optimization is JVM GC (so unrelated to the mmu notifier
      logic).
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7b6efc2b
    • Andrea Arcangeli's avatar
      mremap: check for overflow using deltas · ebed4846
      Andrea Arcangeli authored
      
      Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
      those are reasonable requirements but the check remains obscure and it
      looks more like an off by one error than an overflow check.  This I feel
      will improve readability.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarJohannes Weiner <jweiner@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebed4846
  3. 25 May, 2011 2 commits
    • Peter Zijlstra's avatar
      mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra authored
      
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d48ae45
    • Peter Zijlstra's avatar
      mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra authored
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped? "
      
      Counter points:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890
      
       ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97a89413
  4. 07 Apr, 2011 1 commit
  5. 24 Feb, 2011 1 commit
    • Hugh Dickins's avatar
      mm: fix possible cause of a page_mapped BUG · a3e8cc64
      Hugh Dickins authored
      
      Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
      a hole with madvise(,, MADV_REMOVE).  That path is under mutex, and
      cannot be explained by lack of serialization in unmap_mapping_range().
      
      Reviewing the code, I found one place where vm_truncate_count handling
      should have been updated, when I switched at the last minute from one
      way of managing the restart_addr to another: mremap move changes the
      virtual addresses, so it ought to adjust the restart_addr.
      
      But rather than exporting the notion of restart_addr from memory.c, or
      converting to restart_pgoff throughout, simply reset vm_truncate_count
      to 0 to force a rescan if mremap move races with preempted truncation.
      
      We have no confirmation that this fixes Robert's BUG,
      but it is a fix that's worth making anyway.
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a3e8cc64
  6. 14 Jan, 2011 2 commits
  7. 26 Oct, 2010 1 commit
  8. 30 Mar, 2010 1 commit
    • Tejun Heo's avatar
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo authored
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        bloc...
      5a0e3ad6
  9. 06 Mar, 2010 2 commits
    • Rik van Riel's avatar
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel authored
      
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • Jiri Slaby's avatar
      mm: use rlimit helpers · 59e99e5b
      Jiri Slaby authored
      Make sure compiler won't do weird things with limits.  E.g.  fetching them
      twice may return 2 different values after writable limits are implemented.
      
      I.e.  either use rlimit helpers added in
      3e10e716
      
       ("resource: add helpers for
      fetching rlimits") or ACCESS_ONCE if not applicable.
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      59e99e5b
  10. 11 Dec, 2009 7 commits
  11. 24 Sep, 2009 1 commit
  12. 22 Sep, 2009 2 commits
    • Hugh Dickins's avatar
      ksm: mremap use err from ksm_madvise · 7103ad32
      Hugh Dickins authored
      
      mremap move's use of ksm_madvise() was assuming -ENOMEM on failure,
      because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says
      -ENOMEM (letting madvise convert that to -EAGAIN), and can also say
      -ERESTARTSYS when signalled: so pass the error from ksm_madvise.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7103ad32
    • Hugh Dickins's avatar
      ksm: prevent mremap move poisoning · 1ff82995
      Hugh Dickins authored
      
      KSM's scan allows for user pages to be COWed or unmapped at any time,
      without requiring any notification.  But its stable tree does assume that
      when it finds a KSM page where it placed a KSM page, then it is the same
      KSM page that it placed there.
      
      mremap move could break that assumption: if an area containing a KSM page
      was unmapped, then an area containing a different KSM page was moved with
      mremap into the place of the original, before KSM's scan came around to
      notice.  That could then poison a node of the stable tree, so that memcmps
      would "lie" and upset the ordering of the tree.
      
      Probably noone will ever need mremap move on a VM_MERGEABLE area; except
      that prohibiting it would make trouble for schemes in which we try making
      everything VM_MERGEABLE e.g.  for testing: an mremap which normally works
      would then fail mysteriously.
      
      There's no need to go to any trouble, such as re-sorting KSM's list of
      rmap_items to match the new layout: simply unmerge the area to COW all its
      KSM pages before moving, but leave VM_MERGEABLE on so that they're
      remerged later.
      Signed-off-by: default avatarHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: default avatarChris Wright <chrisw@redhat.com>
      Signed-off-by: default avatarIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1ff82995
  13. 14 Jan, 2009 2 commits
  14. 06 Jan, 2009 1 commit
  15. 20 Oct, 2008 1 commit
  16. 28 Jul, 2008 1 commit
    • Andrea Arcangeli's avatar
      mmu-notifiers: core · cddb8a5c
      Andrea Arcangeli authored
      With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
       There are secondary MMUs (with secondary sptes and secondary tlbs) too.
      sptes in the kvm case are shadow pagetables, but when I say spte in
      mmu-notifier context, I mean "secondary pte".  In GRU case there's no
      actual secondary pte and there's only a secondary tlb because the GRU
      secondary MMU has no knowledge about sptes and every secondary tlb miss
      event in the MMU always generates a page fault that has to be resolved by
      the CPU (this is not the case of KVM where the a secondary tlb miss will
      walk sptes in hardware and it will refill the secondary tlb transparently
      to software if the corresponding spte is present).  The same way
      zap_page_range has to invalidate the pte before freeing the page, the spte
      (and secondary tlb) must also be invalidated before any page is freed and
      reused.
      
      Currently we take a page_count pin on every page mapped by sptes, but that
      means the pages can't be swapped whenever they're map...
      cddb8a5c
  17. 18 Oct, 2007 1 commit
  18. 19 Jul, 2007 1 commit
  19. 12 Jul, 2007 1 commit
    • Eric Paris's avatar
      security: Protection for exploiting null dereference using mmap · ed032189
      Eric Paris authored
      
      Add a new security check on mmap operations to see if the user is attempting
      to mmap to low area of the address space.  The amount of space protected is
      indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
      0, preserving existing behavior.
      
      This patch uses a new SELinux security class "memprotect."  Policy already
      contains a number of allow rules like a_t self:process * (unconfined_t being
      one of them) which mean that putting this check in the process class (its
      best current fit) would make it useless as all user processes, which we also
      want to protect against, would be allowed. By taking the memprotect name of
      the new class it will also make it possible for us to move some of the other
      memory protect permissions out of 'process' and into the new class next time
      we bump the policy version number (which I also think is a good future idea)
      Acked-by: default avatarStephen Smalley <sds@tycho.nsa.gov>
      Acked-by: default avatarChris Wright <chrisw@sous-sol.org>
      Signed-off-by: default avatarEric Paris <eparis@redhat.com>
      Signed-off-by: default avatarJames Morris <jmorris@namei.org>
      ed032189
  20. 30 Jan, 2007 1 commit
    • Hugh Dickins's avatar
      [PATCH] mm: mremap correct rmap accounting · 701dfbc1
      Hugh Dickins authored
      
      Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
      is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.
      
      Instead of complicating that move_pte, just forget the minor optimization
      when mremapping, and change the one thing which needed it for correctness
      - filemap_xip use ZERO_PAGE(0) throughout instead of according to address.
      
      [ "There is no block device driver one could use for XIP on mips
         platforms" - Carsten Otte ]
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Andrew Morton <akpm@osdl.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      701dfbc1
  21. 01 Oct, 2006 1 commit
    • Zachary Amsden's avatar
      [PATCH] paravirt: lazy mmu mode hooks.patch · 6606c3e0
      Zachary Amsden authored
      
      Implement lazy MMU update hooks which are SMP safe for both direct and shadow
      page tables.  The idea is that PTE updates and page invalidations while in
      lazy mode can be batched into a single hypercall.  We use this in VMI for
      shadow page table synchronization, and it is a win.  It also can be used by
      PPC and for direct page tables on Xen.
      
      For SMP, the enter / leave must happen under protection of the page table
      locks for page tables which are being modified.  This is because otherwise,
      you end up with stale state in the batched hypercall, which other CPUs can
      race ahead of.  Doing this under the protection of the locks guarantees the
      synchronization is correct, and also means that spurious faults which are
      generated during this window by remote CPUs are properly handled, as the page
      fault handler must re-check the PTE under protection of the same lock.
      Signed-off-by: default avatarZachary Amsden <zach@vmware.com>
      Signed-off-by: default avatarJeremy Fitzhardinge <jeremy@xensource.com>
      Cc...
      6606c3e0
  22. 03 Jul, 2006 1 commit
  23. 12 Jan, 2006 1 commit
  24. 16 Dec, 2005 1 commit
  25. 30 Oct, 2005 3 commits
    • Hugh Dickins's avatar
      [PATCH] mm: split page table lock · 4c21e2f2
      Hugh Dickins authored
      
      Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
      a many-threaded application which concurrently initializes different parts of
      a large anonymous area.
      
      This patch corrects that, by using a separate spinlock per page table page, to
      guard the page table entries in that page, instead of using the mm's single
      page_table_lock.  (But even then, page_table_lock is still used to guard page
      table allocation, and anon_vma allocation.)
      
      In this implementation, the spinlock is tucked inside the struct page of the
      page table page: with a BUILD_BUG_ON in case it overflows - which it would in
      the case of 32-bit PA-RISC with spinlock debugging enabled.
      
      Splitting the lock is not quite for free: another cacheline access.  Ideally,
      I suppose we would use split ptlock only for multi-threaded processes on
      multi-cpu machines; but deciding that dynamically would have its own costs.
      So for now enable it by config, at some number of cpus - since the Kconfig
      language doesn't support inequalities, let preprocessor compare that with
      NR_CPUS.  But I don't think it's worth being user-configurable: for good
      testing of both split and unsplit configs, split now at 4 cpus, and perhaps
      change that to 8 later.
      
      There is a benefit even for singly threaded processes: kswapd can be attacking
      one part of the mm while another part is busy faulting.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      4c21e2f2
    • Hugh Dickins's avatar
      [PATCH] mm: ptd_alloc take ptlock · c74df32c
      Hugh Dickins authored
      
      Second step in pushing down the page_table_lock.  Remove the temporary
      bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
      to hold page_table_lock, whether it's on init_mm or a user mm; take
      page_table_lock internally to check if a racing task already allocated.
      
      Convert their callers from common code.  But avoid coming back to change them
      again later: instead of moving the spin_lock(&mm->page_table_lock) down,
      switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
      encapsulate the mapping+locking and unlocking+unmapping together, and in the
      end may use alternatives to the mm page_table_lock itself.
      
      These callers all hold mmap_sem (some exclusively, some not), so at no level
      can a page table be whipped away from beneath them; and pte_alloc uses the
      "atomic" pmd_present to test whether it needs to allocate.  It appears that on
      all arches we can safely descend without page_table_lock.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      c74df32c
    • Hugh Dickins's avatar
      [PATCH] mm: ptd_alloc inline and out · 1bb3630e
      Hugh Dickins authored
      
      It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
      calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
      pte_alloc_map and pte_alloc_kernel are entirely out-of-line.  Though it does
      add a little to kernel size, change them to macros testing inline, calling
      __pte_alloc or __pte_alloc_kernel to allocate out-of-line.  Mark none of them
      as fastcalls, leave that to CONFIG_REGPARM or not.
      
      It also seems more natural for the out-of-line functions to leave the offset
      calculation and map to the inline, which has to do it anyway for the common
      case.  At least mremap move wants __pte_alloc without _map.
      
      Macros rather than inline functions, certainly to avoid the header file issues
      which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
      architectures I haven't built would have other such problems.
      Signed-off-by: default avatarHugh Dickins <hugh@veritas.com>
      Signed-off-by: default avatarAndrew Morton <akpm@osdl.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@osdl.org>
      1bb3630e