1. 13 Aug, 2019 1 commit
  2. 25 Sep, 2018 2 commits
    • futex: Remove unnecessary warning from get_futex_key · 3887a9ac
      Mel Gorman authored
      commit 48fb6f4d upstream.
      
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption was missed and
      the first warning assumed a correct application would not alter a
      mapping backing a futex key.  Since merging, it has not triggered for
      any unexpected case but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary so this patch
      removes the warning.  The warning may potentially be triggered with the
      following test program from Mark although it may be necessary to adjust
      NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
      system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
    • futex: Remove requirement for lock_page() in get_futex_key() · 862b19bc
      Mel Gorman authored
      commit 65d8fc77 upstream.
      
      When dealing with key handling for shared futexes, we can drastically reduce
      the usage/need of the page lock. 1) For anonymous pages, the associated futex
      object is the mm_struct which does not require the page lock. 2) For
      inode-based keys, we can check under the RCU read lock if the page mapping is
      still valid and take a reference to the inode. This just leaves one rare race
      that requires the page lock in the slow path when examining the swapcache.
      
      Additionally realtime users currently have a problem with the page lock being
      contended for unbounded periods of time during futex operations.
      
      Task A
           get_futex_key()
           lock_page()
          ---> preempted
      
      Now any other task trying to lock that page will have to wait until
      task A gets scheduled back in, which is an unbounded time.

      With this patch, we pretty much have a lockless get_futex_key().
      
      Experiments show that this patch can boost/speedup the hashing of shared
      futexes with the perf futex benchmarks (which is good for measuring such
      change) by up to 45% when there are high (> 100) thread counts on a 60 core
      Westmere. Lower counts are pretty much in the noise range or less than 10%,
      but mid range can be seen at over 30% overall throughput (hash ops/sec).
      This makes anon-mem shared futexes much closer to their private counterparts.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      [ Ported on top of thp refcount rework, changelog, comments, fixes. ]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Chenbo Feng <fengc@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      [bwh: Backported to 3.16: s/READ_ONCE/ACCESS_ONCE/]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
  3. 03 Mar, 2018 1 commit
  4. 15 Sep, 2017 1 commit
    • ptrace: use fsuid, fsgid, effective creds for fs access checks · 229aba44
      Jann Horn authored
      commit caaee623 upstream.
      
      By checking the effective credentials instead of the real UID / permitted
      capabilities, ensure that the calling process actually intended to use its
      credentials.
      
      To ensure that all ptrace checks use the correct caller credentials (e.g.
      in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
      flag), use two new flags and require one of them to be set.
      
      The problem was that when a privileged task had temporarily dropped its
      privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
      perform following syscalls with the credentials of a user, it still passed
      ptrace access checks that the user would not be able to pass.
      
      While an attacker should not be able to convince the privileged task to
      perform a ptrace() syscall, this is a problem because the ptrace access
      check is reused for things in procfs.
      
      In particular, the following somewhat interesting procfs entries only rely
      on ptrace access checks:
      
       /proc/$pid/stat - uses the check for determining whether pointers
           should be visible, useful for bypassing ASLR
       /proc/$pid/maps - also useful for bypassing ASLR
       /proc/$pid/cwd - useful for gaining access to restricted
           directories that contain files with lax permissions, e.g. in
           this scenario:
           lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
           drwx------ root root /root
           drwxr-xr-x root root /root/foobar
           -rw-r--r-- root root /root/foobar/secret
      
      Therefore, on a system where a root-owned mode 6755 binary changes its
      effective credentials as described and then dumps a user-specified file,
      this could be used by an attacker to reveal the memory layout of root's
      processes or reveal the contents of files he is not allowed to access
      (through /proc/$pid/cwd).
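The difference between checking real and filesystem credentials can be illustrated with a small user space model. This is only a sketch under strong assumptions: the `toy_*` names are hypothetical, only UIDs are modelled, and capabilities, GIDs, dumpability and LSM hooks that the real ptrace access check also considers are left out.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified credential set (UIDs only). */
struct toy_cred {
    unsigned int uid;    /* real UID */
    unsigned int euid;   /* effective UID */
    unsigned int suid;   /* saved UID */
    unsigned int fsuid;  /* filesystem UID, normally follows euid */
};

/* Which of the caller's credentials the check should look at. */
enum toy_mode { TOY_MODE_FSCREDS, TOY_MODE_REALCREDS };

/* Returns true if the caller may inspect the target in this toy model. */
static bool toy_ptrace_may_access(const struct toy_cred *caller,
                                  const struct toy_cred *target,
                                  enum toy_mode mode)
{
    unsigned int caller_uid =
        (mode == TOY_MODE_FSCREDS) ? caller->fsuid : caller->uid;

    /* The caller's chosen UID must match all of the target's UIDs. */
    return caller_uid == target->euid &&
           caller_uid == target->suid &&
           caller_uid == target->uid;
}
```

In this model a task that called setreuid(0, 1000) still has real UID 0, so a check on the real credentials lets it inspect a root-owned task, while a check on the filesystem credentials refuses.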
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [bwh: Backported to 3.16:
       - Update mm_access() calls in fs/proc/task_{,no}mmu.c too
       - Adjust context]
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
  5. 18 Jul, 2017 2 commits
  6. 16 Mar, 2017 1 commit
    • futex: Move futex_init() to core_initcall · dcc9b826
      Yang Yang authored
      commit 25f71d1c upstream.
      
      The UEVENT user mode helper is enabled before the initcalls are executed
      and is available when the root filesystem has been mounted.
      
      The user mode helper is triggered by device init calls and the executable
      might use the futex syscall.
      
      futex_init() is marked __initcall which maps to device_initcall, but there
      is no guarantee that futex_init() is invoked _before_ the first device init
      call which triggers the UEVENT user mode helper.
      
      If the user mode helper uses the futex syscall before futex_init() then the
      syscall crashes with a NULL pointer dereference because the futex subsystem
      has not been initialized yet.
      
      Move futex_init() to core_initcall so futexes are initialized before the
      root filesystem is mounted and the usermode helper becomes available.
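The ordering argument can be sketched with a toy model of initcall levels. The `toy_*` names are hypothetical and this is not how the kernel implements initcalls (those are laid out in linker sections); the sketch only assumes that lower levels run first and that calls on the same level keep an order the author cannot rely on.

```c
#include <stddef.h>

/* Level numbers follow the kernel's initcall ordering (core = 1, device = 6). */
enum toy_level { TOY_CORE_INITCALL = 1, TOY_DEVICE_INITCALL = 6 };

struct toy_initcall {
    enum toy_level level;
    const char *name;
};

/*
 * Stable insertion sort by level: lower levels run first, and calls on
 * the same level keep their existing (link-order dependent) order.
 */
static void toy_order_initcalls(struct toy_initcall *calls, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        struct toy_initcall cur = calls[i];
        size_t j = i;

        while (j > 0 && calls[j - 1].level > cur.level) {
            calls[j] = calls[j - 1];
            j--;
        }
        calls[j] = cur;
    }
}
```

With both entries at device level, a driver init that happens to be linked first stays ahead of futex_init(); once futex_init() is at core level it always runs first.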
      
      [ tglx: Rewrote changelog ]
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Cc: jiang.biao2@zte.com.cn
      Cc: jiang.zhengxiong@zte.com.cn
      Cc: zhong.weidong@zte.com.cn
      Cc: deng.huali@zte.com.cn
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
  7. 15 Jun, 2016 1 commit
  8. 02 Feb, 2016 1 commit
  9. 13 Nov, 2014 1 commit
  10. 30 Oct, 2014 1 commit
    • futex: Ensure get_futex_key_refs() always implies a barrier · b8981499
      Catalin Marinas authored
      commit 76835b0e upstream.
      
      Commit b0c29f79 (futexes: Avoid taking the hb->lock if there's
      nothing to wake up) changes the futex code to avoid taking a lock when
      there are no waiters. This code has been subsequently fixed in commit
      11d4616b (futex: revert back to the explicit waiter counting code).
      Both the original commit and the fix-up rely on get_futex_key_refs() to
      always imply a barrier.
      
      However, for private futexes, none of the cases in the switch statement
      of get_futex_key_refs() would be hit and the function completes without
      a memory barrier as required before checking the "waiters" in
      futex_wake() -> hb_waiters_pending(). The consequence is a race with a
      thread waiting on a futex on another CPU, allowing the waker thread to
      read "waiters == 0" while the waiter thread has read "futex_val ==
      locked" (in kernel).
      
      Without this fix, the problem (user space deadlocks) can be seen with
      Android bionic's mutex implementation on an arm64 multi-cluster system.
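The ordering requirement described above is the classic store-buffering pattern: each side writes one variable and then reads the other, and both need a full barrier in between. A minimal C11 sketch follows; the `toy_*` names are hypothetical stand-ins, not kernel code, and in the real code the waker's store happens in user space while the barrier comes from get_futex_key_refs().

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int toy_waiters;    /* stands in for the hash bucket waiter count */
static atomic_int toy_futex_val;  /* stands in for the user space futex word */

/* Waiter side: announce ourselves, full barrier, then re-read the futex word.
 * Returns true if the waiter would go to sleep. */
static bool toy_waiter_should_block(int expected)
{
    atomic_fetch_add_explicit(&toy_waiters, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&toy_futex_val, memory_order_relaxed) == expected;
}

/* Waker side: change the futex word, full barrier, then look for waiters.
 * Returns true if the waker has to take the lock and wake somebody. */
static bool toy_waker_must_wake(int new_val)
{
    atomic_store_explicit(&toy_futex_val, new_val, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&toy_waiters, memory_order_relaxed) > 0;
}
```

If either fence is dropped, both loads may observe the old values, which is the "waiters == 0" and "futex_val == locked" outcome from the changelog.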
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Reported-by: Matteo Franchin <Matteo.Franchin@arm.com>
      Fixes: b0c29f79 (futexes: Avoid taking the hb->lock if there's nothing to wake up)
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  11. 05 Oct, 2014 1 commit
  12. 05 Jun, 2014 4 commits
    • futex: Make lookup_pi_state more robust · 54a21788
      Thomas Gleixner authored
      
      The current implementation of lookup_pi_state has ambiguous handling of
      the TID value 0 in the user space futex.  We can get into the kernel
      even if the TID value is 0, because either there is a stale waiters bit
      or the owner died bit is set or we are called from the requeue_pi path
      or from user space just for fun.
      
      The current code avoids an explicit sanity check for pid = 0 in case
      that kernel internal state (waiters) is found for the user space
      address.  This can lead to state leakage and worse under some
      circumstances.

      Handle the cases explicitly:
      
             Waiter | pi_state | pi->owner | uTID      | uODIED | ?
      
        [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
        [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid
      
        [3]  Found  | NULL     | ---       | Any       | 0/1    | Invalid
      
        [4]  Found  | Found    | NULL      | 0         | 1      | Valid
        [5]  Found  | Found    | NULL      | >0        | 1      | Invalid
      
        [6]  Found  | Found    | task      | 0         | 1      | Valid
      
        [7]  Found  | Found    | NULL      | Any       | 0      | Invalid
      
        [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
        [9]  Found  | Found    | task      | 0         | 0      | Invalid
        [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid
      
       [1] Indicates that the kernel can acquire the futex atomically. We
           came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
      
       [2] Valid, if TID does not belong to a kernel thread. If no matching
           thread is found then it indicates that the owner TID has died.
      
       [3] Invalid. The waiter is queued on a non PI futex
      
       [4] Valid state after exit_robust_list(), which sets the user space
           value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
      
       [5] The user space value got manipulated between exit_robust_list()
           and exit_pi_state_list()
      
       [6] Valid state after exit_pi_state_list() which sets the new owner in
           the pi_state but cannot access the user space value.
      
       [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
      
       [8] Owner and user space value match
      
       [9] There is no transient state which sets the user space TID to 0
           except exit_robust_list(), but this is indicated by the
           FUTEX_OWNER_DIED bit. See [4]
      
      [10] There is no transient state which leaves owner and user space
           TID out of sync.
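The table above can be read as a small decision function. The following is a sketch only: `toy_state` and `toy_pi_state_valid()` are hypothetical names, the kernel-thread caveat of case [2] is ignored, and the two bit masks mirror the values of FUTEX_OWNER_DIED and FUTEX_TID_MASK from linux/futex.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_FUTEX_OWNER_DIED 0x40000000u
#define TOY_FUTEX_TID_MASK   0x3fffffffu

/* Hypothetical summary of the kernel state found for a futex address. */
struct toy_state {
    bool waiter_found;    /* a waiter is queued on the futex */
    bool pi_state_found;  /* that waiter has a pi_state */
    bool owner_set;       /* pi_state->owner != NULL */
    uint32_t owner_tid;   /* TID of pi_state->owner, if owner_set */
};

/* Returns true for the rows marked "Valid" in the table. */
static bool toy_pi_state_valid(const struct toy_state *s, uint32_t uval)
{
    uint32_t utid = uval & TOY_FUTEX_TID_MASK;
    bool odied = (uval & TOY_FUTEX_OWNER_DIED) != 0;

    if (!s->waiter_found)
        return true;                /* [1], [2] */
    if (!s->pi_state_found)
        return false;               /* [3] */
    if (!s->owner_set)
        return odied && utid == 0;  /* [4] valid; [5], [7] invalid */
    if (utid == 0)
        return odied;               /* [6] valid; [9] invalid */
    return utid == s->owner_tid;    /* [8] valid; [10] invalid */
}
```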
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex: Always cleanup owner tid in unlock_pi · 13fbca4c
      Thomas Gleixner authored
      
      If the owner died bit is set at futex_unlock_pi, we currently do not
      clean up the user space futex.  So the owner TID of the current owner
      (the unlocker) persists.  That's observable inconsistent state,
      especially when the ownership of the pi state got transferred.
      
      Clean it up unconditionally.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex: Validate atomic acquisition in futex_lock_pi_atomic() · b3eaa9fc
      Thomas Gleixner authored
      
      We need to protect the atomic acquisition in the kernel against rogue
      user space which sets the user space futex to 0, so the kernel side
      acquisition succeeds while there is existing state in the kernel
      associated with the real owner.
      
      Verify whether the futex has waiters associated with kernel state.  If
      it has, return -EINVAL.  The state is corrupted already, so no point in
      cleaning it up.  Subsequent calls will fail as well.  Not our problem.
      
      [ tglx: Use futex_top_waiter() and explain why we do not need to try
        restoring the already corrupted user space state. ]
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in... · e9c243a5
      Thomas Gleixner authored
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
      
      If uaddr == uaddr2, then we have broken the rule of only requeueing from
      a non-pi futex to a pi futex with this call.  If we attempt this, then
      dangling pointers may be left for rt_waiter resulting in an exploitable
      condition.
      
      This change brings futex_requeue() in line with futex_wait_requeue_pi()
      which performs the same check as per commit 6f7b0a2a ("futex: Forbid
      uaddr == uaddr2 in futex_wait_requeue_pi()")
      
      [ tglx: Compare the resulting keys as well, as uaddrs might be
        different depending on the mapping ]
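Why both the addresses and the keys are compared can be shown with a toy model: two different virtual addresses may map the same shared page and therefore resolve to the same futex key. The struct layout and the `toy_*` names below are assumptions for illustration, not the kernel's union futex_key.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplified futex key: identifies the futex, not the address. */
struct toy_futex_key {
    uint64_t object;  /* identity of the backing object (mm or inode) */
    uint64_t index;   /* page index or page-aligned address */
    uint32_t offset;  /* offset of the futex word within the page */
};

static bool toy_key_match(const struct toy_futex_key *a,
                          const struct toy_futex_key *b)
{
    return a->object == b->object &&
           a->index == b->index &&
           a->offset == b->offset;
}

/* Rejects a requeue-PI request whose source and target are the same futex. */
static int toy_requeue_pi_check(const void *uaddr1, const void *uaddr2,
                                const struct toy_futex_key *key1,
                                const struct toy_futex_key *key2)
{
    if (uaddr1 == uaddr2)
        return -EINVAL;  /* same address: cheap early check */
    if (toy_key_match(key1, key2))
        return -EINVAL;  /* different addresses, same underlying futex */
    return 0;
}
```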
      
      Fixes CVE-2014-3153.
      
      Reported-by: Pinkie Pie
      Signed-off-by: Will Drewry <wad@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 19 May, 2014 2 commits
    • futex: Prevent attaching to kernel threads · f0d71b3d
      Thomas Gleixner authored
      
      We happily allow userspace to declare a random kernel thread to be the
      owner of a user space PI futex.
      
      Found while analysing the fallout of Dave Jones' syscall fuzzer.
      
      We also should validate the thread group for private futexes and find
      some fast way to validate whether the "alleged" owner has RW access on
      the file which backs the SHM, but that's a separate issue.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Carlos ODonell <carlos@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
    • futex: Add another early deadlock detection check · 866293ee
      Thomas Gleixner authored
      Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
      detection code of rtmutex:
        http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com
      
      That underlying issue has been fixed with a patch to the rtmutex code,
      but the futex code must not call into rtmutex in that case because
          - it can detect that issue early
          - it avoids a different and more complex fixup for backing out
      
      If the user space variable got manipulated to 0x80000000 which means
      no lock holder, but the waiters bit set and an active pi_state in the
      kernel is found we can figure out the recursive locking issue by
      looking at the pi_state owner. If that is the current task, then we
      can safely return -EDEADLK.
      
      The check should have been added in commit 59fa6245 (futex: Handle
      futex_pi OWNER_DIED take over correctly) already, but I did not see
      the above issue caused by user space manipulation back then.
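The early check described above can be sketched as a pure function. The function name and its parameters are hypothetical; the masks mirror FUTEX_WAITERS and FUTEX_TID_MASK from linux/futex.h. The sketch assumes the caller has already looked up whether a pi_state exists and who owns it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FUTEX_WAITERS  0x80000000u
#define TOY_FUTEX_TID_MASK 0x3fffffffu

/*
 * uval:          current user space futex value
 * have_pi_state: an active pi_state was found in the kernel
 * pi_owner_tid:  TID of the pi_state owner (only meaningful if have_pi_state)
 * current_tid:   TID of the task trying to take the lock
 */
static int toy_early_deadlock_check(uint32_t uval, bool have_pi_state,
                                    uint32_t pi_owner_tid, uint32_t current_tid)
{
    bool no_owner_in_uval = (uval & TOY_FUTEX_TID_MASK) == 0;

    /* User space claims "no owner", but the kernel knows we own the lock. */
    if (no_owner_in_uval && have_pi_state && pi_owner_tid == current_tid)
        return -EDEADLK;

    return 0;
}
```

A value of 0x80000000 (waiters bit set, TID 0) combined with a pi_state owned by the caller is exactly the recursive locking case from the changelog.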
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Carlos ODonell <carlos@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
  14. 18 Apr, 2014 1 commit
  15. 13 Apr, 2014 1 commit
  16. 09 Apr, 2014 1 commit
    • futex: avoid race between requeue and wake · 69cd9eba
      Linus Torvalds authored
      
      Jan Stancek reported:
       "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
        occasionally fails, because some threads fail to wake up.
      
        Testcase creates 5 threads, which are all waiting on same condition.
        Main thread then calls pthread_cond_broadcast() without holding mutex,
        which calls:
      
            futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)
      
        This immediately wakes up single thread A, which unlocks mutex and
        tries to wake up another thread:
      
            futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)
      
        If thread A manages to call futex_wake() before any waiters are
        requeued for uaddr2, no other thread is woken up"
      
      The ordering constraints for the hash bucket waiter counting are that
      the waiter counts have to be incremented _before_ getting the spinlock
      (because the spinlock acts as part of the memory barrier), but the
      "requeue" operation didn't honor those rules, and nobody had even
      thought about that case.
      
      This fairly simple patch just increments the waiter count for the target
      hash bucket (hb2) when requeueing a futex before taking the locks.  It
      then decrements them again after releasing the lock - the code that
      actually moves the futex(es) between hash buckets will do the additional
      required waiter count housekeeping.
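The counting scheme can be sketched in user space with C11 atomics. This is a single-threaded illustration with hypothetical `toy_*` names; the bucket locks are only indicated by comments, and the returned value shows what a concurrent waker would read from hb2 while the requeue is still in progress.

```c
#include <stdatomic.h>

/* Hypothetical hash bucket reduced to its waiter counter. */
struct toy_bucket {
    atomic_int waiters;
};

/*
 * Moves nr_requeue waiters from hb1 to hb2. Returns the hb2 waiter count
 * that was visible after the locks were (notionally) taken but before any
 * waiter had been moved.
 */
static int toy_requeue(struct toy_bucket *hb1, struct toy_bucket *hb2,
                       int nr_requeue)
{
    int seen;

    /* Announce pending waiters on hb2 before taking the locks. */
    atomic_fetch_add(&hb2->waiters, 1);

    /* ... both bucket locks would be taken here ... */
    seen = atomic_load(&hb2->waiters);

    /* Moving the waiters does its own per-waiter accounting. */
    atomic_fetch_sub(&hb1->waiters, nr_requeue);
    atomic_fetch_add(&hb2->waiters, nr_requeue);

    /* ... both bucket locks would be released here ... */

    /* Drop the temporary count again. */
    atomic_fetch_sub(&hb2->waiters, 1);
    return seen;
}
```

Because the temporary increment makes the count non-zero early, a waker checking hb2 during the requeue would not skip taking the lock.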
      Reported-and-tested-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org # 3.14
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 21 Mar, 2014 1 commit
    • futex: revert back to the explicit waiter counting code · 11d4616b
      Linus Torvalds authored
      Srikar Dronamraju reports that commit b0c29f79 ("futexes: Avoid
      taking the hb->lock if there's nothing to wake up") causes Java threads
      to get stuck on futexes when running specjbb on a power7 NUMA box.

      The cause appears to be that the powerpc spinlocks aren't using the same
      ticket lock model that we use on x86 (and other) architectures, which in
      turn results in the "spin_is_locked()" test in hb_waiters_pending()
      occasionally reporting an unlocked spinlock even when there are pending
      waiters.
      
      So this reinstates Davidlohr Bueso's original explicit waiter counting
      code, which I had convinced Davidlohr to drop in favor of figuring out
      the pending waiters by just using the existing state of the spinlock and
      the wait queue.
      Reported-and-tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Original-code-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 03 Mar, 2014 1 commit
  19. 16 Jan, 2014 1 commit
    • futexes: Fix futex_hashsize initialization · 63b1a816
      Heiko Carstens authored
      
      "futexes: Increase hash table size for better performance"
      introduces a new alloc_large_system_hash() call.
      
      alloc_large_system_hash() however may allocate less memory than
      requested, e.g. limited by MAX_ORDER.
      
      Hence pass a pointer to alloc_large_system_hash() which will
      contain the hash shift when the function returns. Afterwards
      correctly set futex_hashsize.
      
      Fixes a crash on s390 where the requested allocation size was
      4MB but only 1MB was allocated.
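The fix boils down to deriving the table size from what was actually allocated rather than from what was requested. A minimal sketch follows, where `toy_alloc_entries()` is a hypothetical stand-in for an allocator that may round down and reports the resulting log2 size through a pointer; it does not reproduce the signature of alloc_large_system_hash().

```c
/*
 * Hypothetical allocator model: grants the largest power-of-two entry
 * count that does not exceed either the request or the limit, and stores
 * log2 of the granted count in *shift.
 */
static unsigned long toy_alloc_entries(unsigned long requested,
                                       unsigned long max_entries,
                                       unsigned int *shift)
{
    unsigned long n = 1;
    unsigned int s = 0;

    while ((n << 1) <= requested && (n << 1) <= max_entries) {
        n <<= 1;
        s++;
    }
    *shift = s;
    return n;
}

/* The hash size must follow the granted shift, not the request. */
static unsigned long toy_hashsize_from_shift(unsigned int shift)
{
    return 1UL << shift;
}
```

Using the requested size after a smaller grant would make hash lookups index past the end of the table, which matches the kind of crash the changelog reports.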
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  20. 13 Jan, 2014 5 commits
    • rtmutex: Turn the plist into an rb-tree · fb00aca4
      Peter Zijlstra authored
      
      Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
      and provide a proper comparison function for -deadline and
      -priority tasks.
      
      This is done mainly because:
       - classical prio field of the plist is just an int, which might
         not be enough for representing a deadline;
       - manipulating such a list would become O(nr_deadline_tasks),
         which might be too much, as the number of -deadline tasks increases.
      
      Therefore, an rb-tree is used, and tasks are queued in it according
      to the following logic:
       - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
         one with the higher (lower, actually!) prio wins;
       - among a -priority and a -deadline task, the latter always wins;
       - among two -deadline tasks, the one with the earliest deadline
         wins.
      
      Queueing and dequeueing functions are changed accordingly, for both
      the list of a task's pi-waiters and the list of tasks blocked on
      a pi-lock.
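The three queueing rules translate into a comparison function of the kind an rb-tree insert needs. The sketch below uses a hypothetical `toy_waiter` struct and ignores details such as deadline wrap-around; it is not the kernel's comparison code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical waiter descriptor. */
struct toy_waiter {
    bool is_deadline;   /* true for a -deadline task */
    int prio;           /* for -priority tasks: lower value wins */
    uint64_t deadline;  /* for -deadline tasks: earlier value wins */
};

/* Returns true if a must be queued before b. */
static bool toy_waiter_less(const struct toy_waiter *a,
                            const struct toy_waiter *b)
{
    /* A -deadline task always beats a -priority task. */
    if (a->is_deadline != b->is_deadline)
        return a->is_deadline;

    /* Two -deadline tasks: earliest deadline first. */
    if (a->is_deadline)
        return a->deadline < b->deadline;

    /* Two -priority tasks: numerically lower prio first. */
    return a->prio < b->prio;
}
```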
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-again-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • futexes: Avoid taking the hb->lock if there's nothing to wake up · b0c29f79
      Davidlohr Bueso authored
      In futex_wake() there is clearly no point in taking the hb->lock
      if we know beforehand that there are no tasks to be woken. While
      the hash bucket's plist head is a cheap way of knowing this, we
      cannot rely 100% on it as there is a racy window between the
      futex_wait call and when the task is actually added to the
      plist. To this end, we couple it with the spinlock check as
      tasks trying to enter the critical region are most likely
      potential waiters that will be added to the plist, thus
      preventing tasks sleeping forever if wakers don't acknowledge
      all possible waiters.
      
      Furthermore, the futex ordering guarantees are preserved,
      ensuring that waiters either observe the changed user space
      value before blocking or are woken by a concurrent waker. For
      wakers, this is done by relying on the barriers in
      get_futex_key_refs() -- for archs that do not have implicit mb
      in atomic_inc(), we explicitly add them through a new
      futex_get_mm function. For waiters we rely on the fact that
      spin_lock calls already update the head counter, so spinners
      are visible even if the lock hasn't been acquired yet.
      
      For more details please refer to the updated comments in the
      code and related discussion:
      
        https://lkml.org/lkml/2013/11/26/556

      Special thanks to tglx for careful review and feedback.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Thomas Gleixner's avatar
      futexes: Document multiprocessor ordering guarantees · 99b60ce6
      Thomas Gleixner authored
      
      That's essential, if you want to hack on futexes.
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Davidlohr Bueso's avatar
      futexes: Increase hash table size for better performance · a52b89eb
      Davidlohr Bueso authored
      Currently, the futex global hash table suffers from its fixed,
      smallish (by today's standards) size of 256 entries, as well as
      its lack of NUMA awareness. Large systems using many futexes
      can be prone to a high number of collisions, where these futexes
      hash to the same bucket and lead to extra contention on the same
      hb->lock. Furthermore, cacheline bouncing is a reality when we
      have multiple hb->locks residing on the same cacheline and
      different futexes hash to adjacent buckets.
      
      This patch keeps the current static size of 16 entries for small
      systems, or otherwise, 256 * ncpus (or larger as we need to
      round the number to a power of 2). Note that this number of CPUs
      accounts for all CPUs that can ever be available in the system,
      taking into consideration things like hotplugging. While we do
      impose extra overhead at bootup by making the hash table larger,
      this is a one-time cost and does not overshadow the benefits of
      this patch.
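
      The sizing rule in the paragraph above can be written out as a
      small helper. This is a sketch, not the patch itself: the names
      model_futex_hashsize and model_roundup_pow_of_two, and the
      base_small flag standing in for "small systems", are assumptions
      made for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Round v up to the next power of two; v must be at least 1 and small
 * enough that the result still fits in an unsigned long. */
static unsigned long model_roundup_pow_of_two(unsigned long v)
{
    unsigned long p = 1;

    assert(v >= 1 && v <= ULONG_MAX / 2 + 1);
    while (p < v)
        p <<= 1;
    return p;
}

/* Number of hash buckets: 16 on small systems, otherwise 256 per
 * possible CPU, rounded up to a power of two. */
static unsigned long model_futex_hashsize(unsigned int possible_cpus,
                                          bool base_small)
{
    assert(possible_cpus >= 1);
    if (base_small)
        return 16;
    return model_roundup_pow_of_two(256UL * possible_cpus);
}
```

      For the 80-core machine used in the benchmark, 256 * 80 = 20480
      is not a power of two, so this rule rounds up to 32768 buckets.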
      
      Furthermore, as suggested by tglx, by cache aligning the hash
      buckets we can avoid access across cacheline boundaries and also
      avoid massive cache line bouncing if multiple cpus are hammering
      away at different hash buckets which happen to reside in the
      same cache line.
      
      Also, similar to other core kernel components (pid, dcache,
      tcp), by using alloc_large_system_hash() we benefit from its
      NUMA awareness and thus the table is distributed among the nodes
      instead of in a single one.
      
      For a custom microbenchmark that pounds on the uaddr hashing --
      making the wait path fail at futex_wait_setup() returning
      -EWOULDBLOCK for large numbers of futexes, we can see the
      following benefits on an 80-core, 8-socket, 1 TB server:
      
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
       | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
       |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
       |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
       |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
       |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
       |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
       |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
       +---------+--------------------+------------------------+-----------------------+-------------------------------+
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Waiman Long <Waiman.Long@hp.com>
      Reviewed-and-tested-by: Jason Low <jason.low2@hp.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • Jason Low's avatar
      futexes: Clean up various details · 0d00c7b2
      Jason Low authored
      
      - Remove unnecessary head variables.
      - Delete unused parameter in queue_unlock().
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Scott Norton <scott.norton@hp.com>
      Cc: Tom Vaden <tom.vaden@hp.com>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 12 Dec, 2013 2 commits
    • Linus Torvalds's avatar
      futex: move user address verification up to common code · 5cdec2d8
      Linus Torvalds authored
      
      When debugging the read-only hugepage case, I was confused by the fact
      that get_futex_key() did an access_ok() only for the non-shared futex
      case, since the user address checking really isn't in any way specific
      to the private key handling.
      
      Now, it turns out that the shared key handling does effectively do the
      equivalent checks inside get_user_pages_fast() (it doesn't actually
      check the address range on x86, but does check the page protections for
      being a user page).  So it wasn't actually a bug, but the fact that we
      treat the address differently for private and shared futexes threw me
      for a loop.
      
      Just move the check up, so that it gets done for both cases.  Also, use
      the 'rw' parameter for the type, even if it doesn't actually matter any
      more (it's a historical artifact of the old racy i386 "page faults from
      kernel space don't check write protections").
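
      The effect of moving the check up can be illustrated with a small
      user-space model. This is a sketch under stated assumptions:
      MODEL_USER_LIMIT, model_access_ok() and model_get_key() are
      hypothetical stand-ins, and the real key lookup is reduced to
      picking a key kind. The point is only that alignment and address
      range are validated once, before the private/shared split.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical end of the user address range, for illustration only. */
#define MODEL_USER_LIMIT UINT64_C(0x0000800000000000)

/* Overflow-safe range check: [addr, addr + size) must lie below the limit. */
static bool model_access_ok(uint64_t addr, uint64_t size)
{
    return size <= MODEL_USER_LIMIT && addr <= MODEL_USER_LIMIT - size;
}

enum model_key_kind { MODEL_KEY_PRIVATE, MODEL_KEY_SHARED };

/* Validate the futex word address in common code, then branch. */
static int model_get_key(uint64_t uaddr, bool shared,
                         enum model_key_kind *kind)
{
    assert(kind != NULL);

    /* A futex is a naturally aligned 32-bit word. */
    if (uaddr % sizeof(uint32_t) != 0)
        return -EINVAL;

    /* Done for both private and shared futexes. */
    if (!model_access_ok(uaddr, sizeof(uint32_t)))
        return -EFAULT;

    *kind = shared ? MODEL_KEY_SHARED : MODEL_KEY_PRIVATE;
    return 0;
}
```

      With this layout an out-of-range address is rejected with -EFAULT
      regardless of the shared flag, which is the symmetry the commit
      message asks for.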
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Linus Torvalds's avatar
      futex: fix handling of read-only-mapped hugepages · f12d5bfc
      Linus Torvalds authored
      The hugepage code had the exact same bug that regular pages had in
      commit 7485d0d3 ("futexes: Remove rw parameter from
      get_futex_key()").
      
      The regular page case was fixed by commit 9ea71503 ("futex: Fix
      regression with read only mappings"), but the transparent hugepage
      case (added in a5b338f2: "thp: update futex compound knowledge")
      remained broken.
      
      Found by Dave Jones and his trinity tool.
      Reported-and-tested-by: Dave Jones <davej@fedoraproject.org>
      Cc: stable@kernel.org # v2.6.38+
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 06 Nov, 2013 1 commit
  23. 25 Jun, 2013 2 commits
    • Colin Cross's avatar
      futex: Use freezable blocking call · 88c8004f
      Colin Cross authored
      
      Avoid waking up every thread sleeping in a futex_wait call during
      suspend and resume by calling a freezable blocking call.  Previous
      patches modified the freezer to avoid sending wakeups to threads
      that are blocked in freezable blocking calls.
      
      This call was selected to be converted to a freezable call because
      it doesn't hold any locks or release any resources when interrupted
      that might be needed by another freezing task or a kernel driver
      during suspend, and is a common site where idle userspace tasks are
      blocked.
      Signed-off-by: Colin Cross <ccross@android.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: arve@android.com
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • Zhang Yi's avatar
      futex: Take hugepages into account when generating futex_key · 13d60f4b
      Zhang Yi authored
      
      The futex_keys of process shared futexes are generated from the page
      offset, the mapping host and the mapping index of the futex user space
      address. This should result in a unique identifier for each futex.

      However, this is not true when futexes are located in different
      subpages of a hugepage. The reason is that the mapping index for all
      those futexes evaluates to the index of the base page of the hugetlbfs
      mapping. So a futex at offset 0 of the hugepage mapping and another
      one at offset PAGE_SIZE of the same hugepage mapping have identical
      futex_keys. This happens because the futex code blindly uses
      page->index.
      
      Steps to reproduce the bug:
      
      1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
         and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
         mapping.
      
         The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
         PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
         their keys solely depend on the user space address.
      
      2. Lock mutex1 and mutex2
      
      3. Create thread1 and in the thread function lock mutex1, which
         results in thread1 blocking on the locked mutex1.
      
      4. Create thread2 and in the thread function lock mutex2, which
         results in thread2 blocking on the locked mutex2.
      
      5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
         still blocks on mutex2 because the futex_key points to mutex1.
      
      To solve this issue we need to take the normal page index of the page
      which contains the futex into account, if the futex is in a hugetlbfs
      mapping. In other words, we calculate the normal page mapping index of
      the subpage in the hugetlbfs mapping.
      
      Mappings which are not based on hugetlbfs are not affected and still
      use page->index.
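
      The key collision and its fix can be shown with plain arithmetic on
      file offsets. This is a user-space sketch, not the hugetlbfs helper
      itself: the model_* names are hypothetical, and 4 KiB base pages
      with 2 MiB huge pages (order 9) are assumed for illustration.

```c
#include <assert.h> /* handy when exercising the helpers below */
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12u /* assumed: 4 KiB base pages */
#define MODEL_HUGE_ORDER 9u  /* assumed: 2 MiB huge page = 512 base pages */

/* Old behaviour: every subpage reports the index of the huge page
 * itself, counted in huge-page units. */
static uint64_t model_naive_index(uint64_t file_offset)
{
    return file_offset >> (MODEL_PAGE_SHIFT + MODEL_HUGE_ORDER);
}

/* Fixed behaviour: scale the huge page index to base-page units and
 * add the position of the subpage inside the huge page. */
static uint64_t model_basepage_index(uint64_t file_offset)
{
    uint64_t huge_index = file_offset >> (MODEL_PAGE_SHIFT + MODEL_HUGE_ORDER);
    uint64_t subpage = (file_offset >> MODEL_PAGE_SHIFT) &
                       ((UINT64_C(1) << MODEL_HUGE_ORDER) - 1);

    return (huge_index << MODEL_HUGE_ORDER) + subpage;
}
```

      With the naive index, offsets 0 and PAGE_SIZE of the same huge page
      both map to index 0, which is the mutex1/mutex2 collision from the
      reproduction steps. With the base-page index they map to 0 and 1.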
      
      Thanks to Mel Gorman who provided a patch for adding proper evaluation
      functions to the hugetlbfs code to avoid exposing hugetlbfs specific
      details to the futex code.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Zhang Yi <zhang.yi20@zte.com.cn>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Tested-by: Ma Chenggong <ma.chenggong@zte.com.cn>
      Reviewed-by: 'Mel Gorman' <mgorman@suse.de>
      Acked-by: 'Darren Hart' <dvhart@linux.intel.com>
      Cc: 'Peter Zijlstra' <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  24. 12 May, 2013 1 commit
  25. 13 Mar, 2013 1 commit
  26. 27 Feb, 2013 1 commit
  27. 19 Feb, 2013 1 commit
    • Thomas Gleixner's avatar
      futex: Revert "futex: Mark get_robust_list as deprecated" · fe2b05f7
      Thomas Gleixner authored
      This reverts commit ec0c4274.

      get_robust_list() is in use and a removal would break existing user
      space. With the permission checks in place it's no longer a security
      hole. Remove the deprecation warnings.
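
      For context, this is roughly how user space reaches the call. The
      sketch assumes a Linux system with the kernel UAPI headers
      installed, and the wrapper name query_own_robust_list is
      hypothetical. The C library is not assumed to provide a wrapper,
      so the raw syscall() interface is used.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/futex.h> /* struct robust_list_head */
#include <stddef.h>
#include <sys/syscall.h> /* SYS_get_robust_list */
#include <unistd.h>      /* syscall() */

/* Query the calling thread's robust list; a pid of 0 selects the
 * calling thread. Returns 0 on success, -1 with errno set on failure. */
static long query_own_robust_list(struct robust_list_head **head,
                                  size_t *len)
{
    assert(head != NULL && len != NULL);
    return syscall(SYS_get_robust_list, 0, head, len);
}
```

      On success the kernel stores the registered list head pointer
      (possibly NULL) in *head and the size of struct robust_list_head
      in *len.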
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: akpm@linux-foundation.org
      Cc: paul.gortmaker@windriver.com
      Cc: davej@redhat.com
      Cc: keescook@chromium.org
      Cc: stable@vger.kernel.org
      Cc: ebiederm@xmission.com
  28. 07 Feb, 2013 1 commit