1. 17 Nov, 2011 1 commit
  2. 01 Nov, 2011 1 commit
    • David Rientjes's avatar
      oom: remove oom_disable_count · c9f01245
      David Rientjes authored
      
      This removes mm->oom_disable_count entirely since it's unnecessary and
      currently buggy.  The counter was intended to be per-process but it's
      currently decremented in the exit path for each thread that exits, causing
      it to underflow.
      
      The count was originally intended to prevent oom killing threads that
      share memory with threads that cannot be killed since it doesn't lead to
      future memory freeing.  The counter could be fixed to represent all
      threads sharing the same mm, but it's better to remove the count since:
      
       - it is possible that the OOM_DISABLE thread sharing memory with the
         victim is waiting on that thread to exit and will actually cause
         future memory freeing, and
      
       - there is no guarantee that a thread is disabled from oom killing just
         because another thread sharing its mm is oom disabled.
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reported-by: default avatarOleg Nesterov <oleg@redhat.com>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c9f01245
  3. 03 Oct, 2011 1 commit
    • Wu Fengguang's avatar
      writeback: per task dirty rate limit · 9d823e8f
      Wu Fengguang authored
      
      Add two fields to task_struct.
      
      1) account dirtied pages in the individual tasks, for accuracy
      2) per-task balance_dirty_pages() call intervals, for flexibility
      
      The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
      scale near-sqrt to the safety gap between dirty pages and threshold.
      
      The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
      pages at exactly the same time, each task will be assigned a large
      initial nr_dirtied_pause, so that the dirty threshold will be exceeded
      long before each task reached its nr_dirtied_pause and hence call
      balance_dirty_pages().
      
      The solution is to watch for the number of pages dirtied on each CPU in
      between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
      (3% dirty threshold), force call balance_dirty_pages() for a chance to
      set bdi->dirty_exceeded. In normal situations, this safeguarding
      condition is not expected to trigger at all.
      
      On the sqrt in dirty_poll_interval():
      
      It will serve as an initial guess when dirty pages are still in the
      freerun area.
      
      When dirty pages are floating inside the dirty control scope [freerun,
      limit], a followup patch will use some refined dirty poll interval to
      get the desired pause time.
      
         thresh-dirty (MB)    sqrt
      		   1      16
      		   2      22
      		   4      32
      		   8      45
      		  16      64
      		  32      90
      		  64     128
      		 128     181
      		 256     256
      		 512     362
      		1024     512
      
      The above table means, given 1MB (or 1GB) gap and the dd tasks polling
      balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't
      be exceeded as long as there are less than 16 (or 512) concurrent dd's.
      
      So sqrt naturally leads to less overheads and more safe concurrent tasks
      for large memory servers, which have large (thresh-freerun) gaps.
      
      peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
      
      CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: default avatarAndrea Righi <andrea@betterlinux.com>
      Signed-off-by: default avatarWu Fengguang <fengguang.wu@intel.com>
      9d823e8f
  4. 11 Aug, 2011 1 commit
    • Vasiliy Kulikov's avatar
      move RLIMIT_NPROC check from set_user() to do_execve_common() · 72fa5997
      Vasiliy Kulikov authored
      The patch http://lkml.org/lkml/2003/7/13/226
      
       introduced an RLIMIT_NPROC
      check in set_user() to check for NPROC exceeding via setuid() and
      similar functions.
      
      Before the check there was a possibility to greatly exceed the allowed
      number of processes by an unprivileged user if the program relied on
      rlimit only.  But the check created new security threat: many poorly
      written programs simply don't check setuid() return code and believe it
      cannot fail if executed with root privileges.  So, the check is removed
      in this patch because of too often privilege escalations related to
      buggy programs.
      
      The NPROC can still be enforced in the common code flow of daemons
      spawning user processes.  Most of daemons do fork()+setuid()+execve().
      The check introduced in execve() (1) enforces the same limit as in
      setuid() and (2) doesn't create similar security issues.
      
      Neil Brown suggested to track what specific process has exceeded the
      limit by setting PF_NPROC_EXCEEDED process flag.  With the change only
      this process would fail on execve(), and other processes' execve()
      behaviour is not changed.
      
      Solar Designer suggested to re-check whether NPROC limit is still
      exceeded at the moment of execve().  If the process was sleeping for
      days between set*uid() and execve(), and the NPROC counter step down
      under the limit, the defered execve() failure because NPROC limit was
      exceeded days ago would be unexpected.  If the limit is not exceeded
      anymore, we clear the flag on successful calls to execve() and fork().
      
      The flag is also cleared on successful calls to set_user() as the limit
      was exceeded for the previous user, not the current one.
      
      Similar check was introduced in -ow patches (without the process flag).
      
      v3 - clear PF_NPROC_EXCEEDED on successful calls to set_user().
      Reviewed-by: default avatarJames Morris <jmorris@namei.org>
      Signed-off-by: default avatarVasiliy Kulikov <segoon@openwall.com>
      Acked-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72fa5997
  5. 26 Jul, 2011 2 commits
  6. 20 Jul, 2011 1 commit
  7. 17 Jul, 2011 1 commit
    • Oleg Nesterov's avatar
      ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task() · dcace06c
      Oleg Nesterov authored
      
      If the new child is traced, do_fork() adds the pending SIGSTOP.
      It assumes that either it is traced because of auto-attach or the
      tracer attached later, in both cases sigaddset/set_thread_flag is
      correct even if SIGSTOP is already pending.
      
      Now that we have PTRACE_SEIZE this is no longer right in the latter
      case. If the tracer does PTRACE_SEIZE after copy_process() makes the
      child visible the queued SIGSTOP is wrong.
      
      We could check PT_SEIZED bit and change ptrace_attach() to set both
      PT_PTRACED and PT_SEIZED bits simultaneously but see the next patch,
      we need to know whether this child was auto-attached or not anyway.
      
      So this patch simply moves this code to ptrace_init_task(), this
      way we can never race with ptrace_attach().
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      dcace06c
  8. 12 Jul, 2011 1 commit
    • Justin TerAvest's avatar
      fixlet: Remove fs_excl from struct task. · 4aede84b
      Justin TerAvest authored
      
      fs_excl is a poor man's priority inheritance for filesystems to hint to
      the block layer that an operation is important. It was never clearly
      specified, not widely adopted, and will not prevent starvation in many
      cases (like across cgroups).
      
      fs_excl was introduced with the time sliced CFQ IO scheduler, to
      indicate when a process held FS exclusive resources and thus needed
      a boost.
      
      It doesn't cover all file systems, and it was never fully complete.
      Lets kill it.
      Signed-off-by: default avatarJustin TerAvest <teravest@google.com>
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      4aede84b
  9. 08 Jul, 2011 1 commit
  10. 22 Jun, 2011 2 commits
    • Tejun Heo's avatar
      ptrace: kill clone/exec tracehooks · 4b9d33e6
      Tejun Heo authored
      
      At this point, tracehooks aren't useful to mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users, it is difficult to tell what are their
      assumptions and they're actually trying to achieve.  To mainline
      kernel, they just aren't worth keeping around.
      
      This patch kills the following clone and exec related tracehooks.
      
      	tracehook_prepare_clone()
      	tracehook_finish_clone()
      	tracehook_report_clone()
      	tracehook_report_clone_complete()
      	tracehook_unsafe_exec()
      
      The changes are mostly trivial - logic is moved to the caller and
      comments are merged and adjusted appropriately.
      
      The only exception is in check_unsafe_exec() where LSM_UNSAFE_PTRACE*
      are OR'd to bprm->unsafe instead of setting it, which produces the
      same result as the field is always zero on entry.  It also tests
      p->ptrace instead of (p->ptrace & PT_PTRACED) for consistency, which
      also gives the same result.
      
      This doesn't introduce any behavior change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      4b9d33e6
    • Tejun Heo's avatar
      ptrace: kill trivial tracehooks · a288eecc
      Tejun Heo authored
      
      At this point, tracehooks aren't useful to mainline kernel and mostly
      just add an extra layer of obfuscation.  Although they have comments,
      without actual in-kernel users, it is difficult to tell what are their
      assumptions and they're actually trying to achieve.  To mainline
      kernel, they just aren't worth keeping around.
      
      This patch kills the following trivial tracehooks.
      
      * Ones testing whether task is ptraced.  Replace with ->ptrace test.
      
      	tracehook_expect_breakpoints()
      	tracehook_consider_ignored_signal()
      	tracehook_consider_fatal_signal()
      
      * ptrace_event() wrappers.  Call directly.
      
      	tracehook_report_exec()
      	tracehook_report_exit()
      	tracehook_report_vfork_done()
      
      * ptrace_release_task() wrapper.  Call directly.
      
      	tracehook_finish_release_task()
      
      * noop
      
      	tracehook_prepare_release_task()
      	tracehook_report_death()
      
      This doesn't introduce any behavior change.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      a288eecc
  11. 29 May, 2011 1 commit
    • Linus Torvalds's avatar
      mm: Fix boot crash in mm_alloc() · 6345d24d
      Linus Torvalds authored
      Thomas Gleixner reports that we now have a boot crash triggered by
      CONFIG_CPUMASK_OFFSTACK=y:
      
          BUG: unable to handle kernel NULL pointer dereference at   (null)
          IP: [<c11ae035>] find_next_bit+0x55/0xb0
          Call Trace:
           [<c11addda>] cpumask_any_but+0x2a/0x70
           [<c102396b>] flush_tlb_mm+0x2b/0x80
           [<c1022705>] pud_populate+0x35/0x50
           [<c10227ba>] pgd_alloc+0x9a/0xf0
           [<c103a3fc>] mm_init+0xec/0x120
           [<c103a7a3>] mm_alloc+0x53/0xd0
      
      which was introduced by commit de03c72c
      
       ("mm: convert
      mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
      mm_init() vs mm_init_cpumask
      
      Thomas wrote a patch to just fix the ordering of initialization, but I
      hate the new double allocation in the fork path, so I ended up instead
      doing some more radical surgery to clean it all up.
      Reported-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reported-by: default avatarIngo Molnar <mingo@elte.hu>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6345d24d
  12. 27 May, 2011 3 commits
  13. 25 May, 2011 3 commits
    • KOSAKI Motohiro's avatar
      mm: convert mm->cpu_vm_cpumask into cpumask_var_t · de03c72c
      KOSAKI Motohiro authored
      
      cpumask_t is very big struct and cpu_vm_mask is placed wrong position.
      It might lead to reduce cache hit ratio.
      
      This patch has two change.
      1) Move the place of cpumask into last of mm_struct. Because usually cpumask
         is accessed only front bits when the system has cpu-hotplug capability
      2) Convert cpu_vm_mask into cpumask_var_t. It may help to reduce memory
         footprint if cpumask_size() will use nr_cpumask_bits properly in future.
      
      In addition, this patch change the name of cpu_vm_mask with cpu_vm_mask_var.
      It may help to detect out of tree cpu_vm_mask users.
      
      This patch has no functional change.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      de03c72c
    • Peter Zijlstra's avatar
      mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra authored
      
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d48ae45
    • Peter Zijlstra's avatar
      mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra authored
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped? "
      
      Counter points:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890
      
       ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: default avatarKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97a89413
  14. 12 May, 2011 1 commit
  15. 24 Apr, 2011 1 commit
  16. 24 Mar, 2011 2 commits
  17. 23 Mar, 2011 4 commits
  18. 17 Mar, 2011 1 commit
  19. 10 Mar, 2011 1 commit
    • Jens Axboe's avatar
      block: initial patch for on-stack per-task plugging · 73c10101
      Jens Axboe authored
      
      This patch adds support for creating a queuing context outside
      of the queue itself. This enables us to batch up pieces of IO
      before grabbing the block device queue lock and submitting them to
      the IO scheduler.
      
      The context is created on the stack of the process and assigned in
      the task structure, so that we can auto-unplug it if we hit a schedule
      event.
      
      The current queue plugging happens implicitly if IO is submitted to
      an empty device, yet callers have to remember to unplug that IO when
      they are going to wait for it. This is an ugly API and has caused bugs
      in the past. Additionally, it requires hacks in the vm (->sync_page()
      callback) to handle that logic. By switching to an explicit plugging
      scheme we make the API a lot nicer and can get rid of the ->sync_page()
      hack in the vm.
      Signed-off-by: default avatarJens Axboe <jaxboe@fusionio.com>
      73c10101
  20. 14 Jan, 2011 4 commits
    • Andrea Arcangeli's avatar
      thp: khugepaged · ba76149f
      Andrea Arcangeli authored
      
      Add khugepaged to relocate fragmented pages into hugepages if new
      hugepages become available.  (this is indipendent of the defrag logic that
      will have to make new hugepages available)
      
      The fundamental reason why khugepaged is unavoidable, is that some memory
      can be fragmented and not everything can be relocated.  So when a virtual
      machine quits and releases gigabytes of hugepages, we want to use those
      freely available hugepages to create huge-pmd in the other virtual
      machines that may be running on fragmented memory, to maximize the CPU
      efficiency at all times.  The scan is slow, it takes nearly zero cpu time,
      except when it copies data (in which case it means we definitely want to
      pay for that cpu time) so it seems a good tradeoff.
      
      In addition to the hugepages being released by other process releasing
      memory, we have the strong suspicion that the performance impact of
      potentially defragmenting hugepages during or before each page fault could
      lead to more performance inconsistency than allocating small pages at
      first and having them collapsed into large pages later...  if they prove
      themselfs to be long lived mappings (khugepaged scan is slow so short
      lived mappings have low probability to run into khugepaged if compared to
      long lived mappings).
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba76149f
    • Andrea Arcangeli's avatar
      thp: add pmd_huge_pte to mm_struct · e7a00c45
      Andrea Arcangeli authored
      
      This increase the size of the mm struct a bit but it is needed to
      preallocate one pte for each hugepage so that split_huge_page will not
      require a fail path.  Guarantee of success is a fundamental property of
      split_huge_page to avoid decrasing swapping reliability and to avoid
      adding -ENOMEM fail paths that would otherwise force the hugepage-unaware
      VM code to learn rolling back in the middle of its pte mangling operations
      (if something we need it to learn handling pmd_trans_huge natively rather
      being capable of rollback).  When split_huge_page runs a pte is needed to
      succeed the split, to map the newly splitted regular pages with a regular
      pte.  This way all existing VM code remains backwards compatible by just
      adding a split_huge_page* one liner.  The memory waste of those
      preallocated ptes is negligible and so it is worth it.
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMel Gorman <mel@csn.ul.ie>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e7a00c45
    • Mandeep Singh Baines's avatar
      oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down · dabb16f6
      Mandeep Singh Baines authored
      
      We'd like to be able to oom_score_adj a process up/down as it
      enters/leaves the foreground.  Currently, it is not possible to oom_adj
      down without CAP_SYS_RESOURCE.  This patch allows a task to decrease its
      oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
      or its inherited value at fork.  Assuming the thread that has forked it
      has oom_score_adj of 0, each process could decrease it back from 0 upon
      activation unless a CAP_SYS_RESOURCE thread elevated it to something
      higher.
      
      Alternative considered:
      
      * a setuid binary
      * a daemon with CAP_SYS_RESOURCE
      
      Since you don't wan't all processes to be able to reduce their oom_adj, a
      setuid or daemon implementation would be complex.  The alternatives also
      have much higher overhead.
      
      This patch updated from original patch based on feedback from David
      Rientjes.
      Signed-off-by: default avatarMandeep Singh Baines <msb@chromium.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dabb16f6
    • Dave Jones's avatar
      sched: remove long deprecated CLONE_STOPPED flag · 43bb40c9
      Dave Jones authored
      This warning was added in commit bdff746a
      
       ("clone: prepare to recycle
      CLONE_STOPPED") three years ago.  2.6.26 came and went.  As far as I know,
      no-one is actually using CLONE_STOPPED.
      Signed-off-by: default avatarDave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      43bb40c9
  21. 07 Jan, 2011 1 commit
  22. 04 Jan, 2011 1 commit
  23. 17 Dec, 2010 1 commit
  24. 08 Dec, 2010 1 commit
    • Mike Galbraith's avatar
      Sched: fix skip_clock_update optimization · f26f9aff
      Mike Galbraith authored
      
      idle_balance() drops/retakes rq->lock, leaving the previous task
      vulnerable to set_tsk_need_resched().  Clear it after we return
      from balancing instead, and in setup_thread_stack() as well, so
      no successfully descheduled or never scheduled task has it set.
      
      Need resched confused the skip_clock_update logic, which assumes
      that the next call to update_rq_clock() will come nearly immediately
      after being set.  Make the optimization robust against the waking
      a sleeper before it sucessfully deschedules case by checking that
      the current task has not been dequeued before setting the flag,
      since it is that useless clock update we're trying to save, and
      clear unconditionally in schedule() proper instead of conditionally
      in put_prev_task().
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Reported-by: default avatarBjoern B. Brandenburg <bbb.lst@gmail.com>
      Tested-by: default avatarYong Zhang <yong.zhang0@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      LKML-Reference: <1291802742.1417.9.camel@marge.simson.net>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      f26f9aff
  25. 30 Nov, 2010 1 commit
    • Mike Galbraith's avatar
      sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
      Mike Galbraith authored
      
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer, the default for all tasks pointing to the
      init_task_group.  When a task calls setsid(), a new task group
      is created, the process is moved into the new task group, and a
      reference to the preveious task group is dropped.  Child
      processes inherit this task group thereafter, and increase it's
      refcount.  When the last thread of a process exits, the
      process's reference is dropped, such that when the last process
      referencing an autogroup exits, the autogroup is destroyed.
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
      Autogroup bandwidth is controllable via setting it's nice level
      through the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
      Sets the task group's shares to the weight of nice <level> task.
      Setting nice level is rate limited for !admin users due to the
      abuse risk of task group locking.
      
      The feature is enabled from boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:
      
        echo [01] > /proc/sys/kernel/sched_autogroup_enabled
      
      ... which will automatically move tasks to/from the root task group.
      Signed-off-by: default avatarMike Galbraith <efault@gmx.de>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
      Signed-off-by: default avatarIngo Molnar <mingo@elte.hu>
      5091faa4
  26. 28 Oct, 2010 1 commit
  27. 26 Oct, 2010 1 commit
    • Ying Han's avatar
      oom: add per-mm oom disable count · 3d5992d2
      Ying Han authored
      
      It's pointless to kill a task if another thread sharing its mm cannot be
      killed to allow future memory freeing.  A subsequent patch will prevent
      kills in such cases, but first it's necessary to have a way to flag a task
      that shares memory with an OOM_DISABLE task that doesn't incur an
      additional tasklist scan, which would make select_bad_process() an O(n^2)
      function.
      
      This patch adds an atomic counter to struct mm_struct that follows how
      many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
      They cannot be killed by the kernel, so their memory cannot be freed in
      oom conditions.
      
      This only requires task_lock() on the task that we're operating on, it
      does not require mm->mmap_sem since task_lock() pins the mm and the
      operation is atomic.
      
      [rientjes@google.com: changelog and sys_unshare() code]
      [rientjes@google.com: protect oom_disable_count with task_lock in fork]
      [rientjes@google.com: use old_mm for oom_disable_count in exec]
      Signed-off-by: default avatarYing Han <yinghan@google.com>
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3d5992d2