  1. 11 Feb, 2022 1 commit
  2. 08 Jan, 2022 5 commits
  3. 14 Dec, 2021 1 commit
  4. 19 Nov, 2021 2 commits
  5. 17 Nov, 2021 3 commits
    • signal: Requeue ptrace signals · b171f667
      Eric W. Biederman authored
      Kyle Huey <me@kylehuey.com> writes:
      
      > rr, a userspace record and replay debugger[0], uses the recorded register
      > state at PTRACE_EVENT_EXIT to find the point in time at which to cease
      > executing the program during replay.
      >
      > If a SIGKILL races with processing another signal in get_signal, it is
      > possible for the kernel to decline to notify the tracer of the original
      > signal. But if the original signal had a handler, the kernel proceeds
      > with setting up a signal handler frame as if the tracer had chosen to
      > deliver the signal unmodified to the tracee. When the kernel goes to
      > execute the signal handler that it has now modified the stack and registers
      > for, it will discover the pending SIGKILL, and terminate the tracee
      > without executing the handler. When PTRACE_EVENT_EXIT is delivered to
      > the tracer, however, the effects of handler setup will be visible to
      > the tracer.
      >
      > Because rr (the tracer) was never notified of the signal, it is not aware
      > that a signal handler frame was set up and expects the state of the program
      > at PTRACE_EVENT_EXIT to be a state that will be reconstructed naturally
      > by allowing the program to execute from the last event. When that fails
      > to happen during replay, rr will assert and die.
      >
      > The following patches add an explicit check for a newly pending SIGKILL
      > after the ptracer has been notified and the siglock has been reacquired.
      > If this happens, we stop processing the current signal and proceed
      > immediately to handling the SIGKILL. This makes the state reported at
      > PTRACE_EVENT_EXIT the unmodified state of the program, and also avoids the
      > work to set up a signal handler frame that will never be used.
      >
      > [0] https://rr-project.org/
      
      
      
      The problem is that while the traced process makes it into ptrace_stop,
      the tracee is killed before the tracer manages to wait for the
      tracee and discover which signal was about to be delivered.
      
      More generally, the problem is that while siglock was dropped a signal
      with process-wide effect was short-circuit delivered to the entire
      process, killing it, but the process continues to try to deliver
      another signal.
      
      In general it is impossible to avoid all cases where work is performed
      after the process has been killed.  In particular, if the process is
      killed after get_signal returns, the code will simply not know it has
      been killed until after delivering the signal frame to userspace.
      
      On the other hand, when the code has already discovered that the
      process has been killed and has taken user-space visible action
      showing that the kernel knows the process has been killed, it is just
      silly to then write the signal frame to the user space stack.
      
      Instead of being silly, detect that the process has been killed in
      ptrace_signal and requeue the signal so the code can pretend it was
      simply never dequeued for delivery.
      
      To test whether the process has been killed I use fatal_signal_pending
      rather than signal_group_exit, to match the test in
      signal_pending_state, which is used in schedule, which is where
      ptrace_stop detects that the process has been killed.
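
      A loose sketch of the resulting shape of ptrace_signal (hedged: the
      requeue helper named here is hypothetical; only the
      fatal_signal_pending test and the requeue-and-bail flow come from the
      description above):

        static int ptrace_signal(int signr, kernel_siginfo_t *info,
                                 enum pid_type type)
        {
                /* ... tracer notified via ptrace_stop(), siglock reacquired ... */
                if (fatal_signal_pending(current)) {
                        /* A SIGKILL arrived while siglock was dropped: put
                         * the dequeued signal back so it appears never to
                         * have been dequeued, and let get_signal() proceed
                         * straight to the SIGKILL. */
                        requeue_signal(current, info, type); /* hypothetical */
                        return 0;
                }
                /* ... otherwise continue with normal delivery ... */
        }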
      
      Requeuing the signal so the code can pretend it was simply never
      dequeued improves the user space visible behavior that has been
      present since ebf5ebe31d2c ("[PATCH] signal-fixes-2.5.59-A4").
      
      Kyle Huey verified that this change in behavior makes rr happy.
      Reported-by: Kyle Huey <khuey@kylehuey.com>
      Reported-by: Marko Mäkelä <marko.makela@mariadb.com>
      History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/87tugcd5p2.fsf_-_@email.froward.int.ebiederm.org
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: Requeue signals in the appropriate queue · 5768d890
      Eric W. Biederman authored
      In the event that a tracer changes which signal needs to be delivered
      and that signal is currently blocked, the signal needs to be
      requeued for later delivery.
      
      With the advent of CLONE_THREAD the kernel has two signal queues per
      task: the per-process queue and the per-task queue.  Update the code
      so that if the signal is removed from the per-process queue it is
      requeued on the per-process queue.  This is necessary to make it
      appear the signal was never dequeued.
      
      The rr debugger reasonably believes that the state of the process from
      the last ptrace_stop it observed until PTRACE_EVENT_EXIT can be recreated
      by simply letting a process run.  If a SIGKILL interrupts a ptrace_stop
      this is not true today.
      
      So return signals to their original queue in ptrace_signal so that
      signals that are not delivered appear like they were never dequeued.
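
      A minimal sketch of the queue selection this describes (illustrative
      only; assumes the enum pid_type that ptrace_signal already carries and
      a dequeued struct sigqueue *q):

        /* Requeue on the queue the signal was dequeued from: the per-task
         * queue for task-directed signals, the shared per-process queue
         * otherwise. */
        struct sigpending *pending = (type == PIDTYPE_PID) ?
                &current->pending : &current->signal->shared_pending;

        list_add_tail(&q->list, &pending->list);
        sigaddset(&pending->signal, q->info.si_signo);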
      
      Fixes: 794aa320b79d ("[PATCH] sigfix-2.5.40-D6")
      History Tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.gi
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/87zgq4d5r4.fsf_-_@email.froward.int.ebiederm.org
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: In get_signal test for signal_group_exit every time through the loop · e7f7c99b
      Eric W. Biederman authored
      Recently while investigating a problem with rr and signals I noticed
      that siglock is dropped in ptrace_signal and get_signal does not jump
      to relock.
      
      Looking further to see if the problem exists anywhere else, I see that
      do_signal_stop also returns if signal_group_exit is true.  I believe
      that test can now never be true, but it is a bit hard to trace
      through and be certain.
      
      Testing signal_group_exit is not expensive, so move the test for
      signal_group_exit into the for loop inside of get_signal to ensure
      the test is never skipped improperly.
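
      Roughly, the change moves the check from before the loop to inside it
      (a sketch, not the literal diff):

        for (;;) {
                /* Re-check every iteration: siglock may have been dropped
                 * and reacquired (e.g. in ptrace_signal()) since the last
                 * check, and the group may have started exiting meanwhile. */
                if (signal_group_exit(signal)) {
                        ksig->info.si_signo = signr = SIGKILL;
                        sigdelset(&current->pending.signal, SIGKILL);
                        recalc_sigpending();
                        goto fatal;
                }
                /* ... dequeue and handle the next signal ... */
        }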
      
      This has been a potential problem since the test for
      signal_group_exit was added.
      
      Fixes: 35634ffa ("signal: Always notice exiting tasks")
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Link: https://lkml.kernel.org/r/875yssekcd.fsf_-_@email.froward.int.ebiederm.org
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  6. 03 Nov, 2021 1 commit
  7. 29 Oct, 2021 1 commit
    • signal: Implement force_fatal_sig · 26d5badb
      Eric W. Biederman authored
      Add a simple helper force_fatal_sig that causes a signal to be
      delivered to a process as if the signal handler was set to SIG_DFL.
      
      Reimplement force_sigsegv based upon this new helper.  This fixes
      force_sigsegv so that when it forces the default signal handler
      to be used the code now forces the signal to be unblocked as well.
      
      Reusing the tested logic in force_sig_info_to_task that was built for
      force_sig_seccomp makes the implementation trivial.
      
      This is interesting both because it makes force_sigsegv simpler and
      because there are a couple of buggy places in the kernel that call
      do_exit(SIGILL) or do_exit(SIGSYS) because there is no straightforward
      way today for those places to simply force the exit of a
      process with the chosen signal.  Creating force_fatal_sig allows
      those places to be implemented with normal signal exits.
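
      For instance, such a call site could change along these lines
      (illustrative):

        /* Before: terminates the process directly, bypassing normal
         * signal death (no SIG_DFL semantics, no coredump handling). */
        do_exit(SIGILL);

        /* After: deliver SIGILL as if the handler were SIG_DFL, with
         * the signal forced unblocked. */
        force_fatal_sig(SIGILL);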
      
      Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
      
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  8. 26 Oct, 2021 1 commit
    • signal: Add an optional check for altstack size · 1bdda24c
      Thomas Gleixner authored
      
      New x86 FPU features will be very large, requiring ~10k of stack in
      signal handlers.  These new features require a new approach called
      "dynamic features".
      
      The kernel currently tries to ensure that altstacks are reasonably
      sized. Right now, on x86, sys_sigaltstack() requires a size of >=2k.
      However, that 2k is a constant. Simply raising that 2k requirement
      to >10k for the new features would break existing apps which have a
      compiled-in size of 2k.
      
      Instead of universally enforcing a larger stack, prohibit a process from
      using dynamic features without properly-sized altstacks. This must be
      enforced in two places:
      
       * A dynamic feature cannot be enabled without a large-enough altstack
         for each process thread.
       * Once a dynamic feature is enabled, any request to install a too-small
         altstack will be rejected.
      
      The dynamic feature enabling code must examine each thread in a
      process to ensure that the altstacks are large enough. Add a new lock
      (sigaltstack_lock()) to ensure that threads cannot race and change
      their altstack after being examined.
      
      Add the infrastructure in the form of a config option and provide empty
      stubs for architectures which do not need dynamic altstack size checks.
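
      On the userspace side, a program that wants to stay compatible with
      such checks can size its altstack from the AT_MINSIGSTKSZ aux vector
      entry instead of a compiled-in constant. A sketch (assumes a libc
      exposing getauxval; the fallback constant covers older headers):

        #include <signal.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/auxv.h>

        #ifndef AT_MINSIGSTKSZ
        #define AT_MINSIGSTKSZ 51  /* uapi value; missing from older headers */
        #endif

        int main(void)
        {
                /* Kernel-reported minimum signal stack size; 0 when the
                 * kernel does not provide this aux vector entry. */
                size_t min = (size_t)getauxval(AT_MINSIGSTKSZ);
                size_t size = SIGSTKSZ > min ? SIGSTKSZ : min;
                stack_t ss = { .ss_sp = malloc(size), .ss_size = size };

                if (!ss.ss_sp || sigaltstack(&ss, NULL) == -1) {
                        perror("sigaltstack");
                        return 1;
                }
                printf("installed %zu-byte altstack (kernel min %zu)\n",
                       size, min);
                return 0;
        }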
      
      This implementation will be fleshed out for x86 in a future patch called
      
        x86/arch_prctl: Add controls for dynamic XSTATE components
      
        [dhansen: commit message. ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20211021225527.10184-2-chang.seok.bae@intel.com
  9. 18 Oct, 2021 1 commit
    • ucounts: Fix signal ucount refcounting · 15bc01ef
      Eric W. Biederman authored
      In commit fda31c50 ("signal: avoid double atomic counter
      increments for user accounting") Linus made a clever optimization to
      how rlimits and the struct user_struct work together.  Unfortunately
      that optimization does not work in the obvious way when moved to
      nested rlimits.  The problem is that the last decrement of the per
      user namespace per user sigpending counter might also be the last
      decrement of the sigpending counter in the parent user namespace as
      well, which means that simply freeing the leaf ucount in
      __free_sigqueue is not enough.
      
      Maintain the optimization and handle the tricky cases by introducing
      inc_rlimit_get_ucounts and dec_rlimit_put_ucounts.
      
      By moving the entire optimization into functions that perform all of
      the work it becomes possible to ensure that every level is handled
      properly.
      
      The new function inc_rlimit_get_ucounts returns 0 on failure to
      increment the ucount.  This is different from inc_rlimit_ucounts, which
      increments the ucounts and returns LONG_MAX if the ucount counter has
      exceeded its maximum or wrapped (to indicate the counter needs to be
      decremented).
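
      A self-contained toy model of the inc-with-rollback idea
      (single-threaded, no atomics; the kernel names are borrowed for
      flavor only): increment every level of the namespace chain, and on
      hitting a limit unwind the levels already incremented so no parent is
      left pinned:

        #include <stdbool.h>
        #include <stddef.h>

        struct ucounts {
                struct ucounts *ns_parent;  /* NULL at the root namespace */
                long sigpending;            /* pending-signal count here */
                long max;                   /* per-level limit */
        };

        static bool inc_rlimit_get_ucounts(struct ucounts *uc)
        {
                struct ucounts *iter, *bad = NULL;

                for (iter = uc; iter; iter = iter->ns_parent) {
                        if (++iter->sigpending > iter->max) {
                                bad = iter;
                                break;
                        }
                }
                if (!bad)
                        return true;

                /* Unwind everything up to and including the level that
                 * went over its limit. */
                for (iter = uc; ; iter = iter->ns_parent) {
                        iter->sigpending--;
                        if (iter == bad)
                                break;
                }
                return false;
        }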
      
      I wish we had a single user to account all pending signals to across
      all of the threads of a process, so this complexity was not necessary.
      
      Cc: stable@vger.kernel.org
      Fixes: d6469690 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
      v1: https://lkml.kernel.org/r/87mtnavszx.fsf_-_@disp2133
      Link: https://lkml.kernel.org/r/87fssytizw.fsf_-_@disp2133
      
      Reviewed-by: Alexey Gladkov <legion@kernel.org>
      Tested-by: Rune Kleveland <rune.kleveland@infomedia.dk>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Tested-by: Jordan Glover <Golden_Miller83@protonmail.ch>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  10. 06 Oct, 2021 3 commits
    • coredump: Don't perform any cleanups before dumping core · 92307383
      Eric W. Biederman authored
      Rename coredump_exit_mm to coredump_task_exit and call it from do_exit
      before PTRACE_EVENT_EXIT, and before any cleanup work for a task
      happens.  This ensures that an accurate copy of the process can be
      captured in the coredump as no cleanup for the process happens before
      the coredump completes.  This also ensures that PTRACE_EVENT_EXIT
      will not be visited by any thread until the coredump is complete.
      
      Add a new flag PF_POSTCOREDUMP so that tasks that have passed through
      coredump_task_exit can be recognized and ignored in zap_process.
      
      Now that all of the coredumping happens before exit_mm, remove the code
      to test for a coredump in progress from mm_release.
      
      Replace "may_ptrace_stop()" with a simple test of "current->ptrace".
      The other tests in may_ptrace_stop all concern avoiding stopping
      during a coredump.  These tests are no longer necessary as it is now
      guaranteed that fatal_signal_pending will be set if the code enters
      ptrace_stop during a coredump.  The code in ptrace_stop is guaranteed
      not to stop if fatal_signal_pending returns true.
      
      Until this change "ptrace_event(PTRACE_EVENT_EXIT)" could call
      ptrace_stop without fatal_signal_pending being true, as signals are
      dequeued in get_signal before calling do_exit.  This is no longer
      an issue as "ptrace_event(PTRACE_EVENT_EXIT)" is no longer reached
      until after the coredump completes.
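
      The resulting ordering in do_exit then looks roughly like this (a
      sketch of the sequence described above, not the literal code):

        void __noreturn do_exit(long code)
        {
                /* ... */
                coredump_task_exit();   /* dump first: no cleanup has run
                                         * yet; sets PF_POSTCOREDUMP so
                                         * zap_process() ignores this task */
                ptrace_event(PTRACE_EVENT_EXIT, code);  /* after the dump */
                /* ... */
                exit_mm();              /* cleanup strictly afterwards */
                /* ... */
        }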
      
      Link: https://lkml.kernel.org/r/874kaax26c.fsf@disp2133
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • ptrace: Remove the unnecessary arguments from arch_ptrace_stop · 4f627af8
      Eric W. Biederman authored
      Both arch_ptrace_stop_needed and arch_ptrace_stop are called with an
      exit_code and a siginfo structure.  Neither argument is used by any of
      the implementations so just remove the unneeded arguments.
      
      The two architectures that implement arch_ptrace_stop are ia64 and
      sparc.  Both architectures flush their register stacks before a
      ptrace_stop so that all of the register information can be accessed
      by debuggers.

      As the question of whether a register stack needs to be flushed is
      independent of why ptrace is stopping, not needing the arguments
      makes sense.
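
      The interface change itself, sketched as before/after prototypes:

        /* before */
        #define arch_ptrace_stop_needed(code, info)   (...)
        #define arch_ptrace_stop(code, info)          do { ... } while (0)

        /* after: whether to flush does not depend on why we stop */
        #define arch_ptrace_stop_needed()             (...)
        #define arch_ptrace_stop()                    do { ... } while (0)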
      
      Cc: David Miller <davem@davemloft.net>
      Cc: sparclinux@vger.kernel.org
      Link: https://lkml.kernel.org/r/87lf3mx290.fsf@disp2133
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: Remove the bogus sigkill_pending in ptrace_stop · 7d613f9f
      Eric W. Biederman authored
      The existence of sigkill_pending is a little silly as it is
      functionally a duplicate of fatal_signal_pending that is used in
      exactly one place.
      
      Checking for pending fatal signals and returning early in ptrace_stop
      is actively harmful.  It causes the ptrace_stop called by
      ptrace_signal to return early before setting current->exit_code.
      Later, what ptrace_signal reads from current->exit_code is undefined,
      making it unpredictable what will happen.
      
      Instead rely on the fact that schedule will not sleep if there is a
      pending signal that can awaken a task.
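
      For reference, the test in signal_pending_state that schedule relies
      on looks approximately like this:

        static inline int signal_pending_state(unsigned int state,
                                               struct task_struct *p)
        {
                if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                        return 0;
                if (!signal_pending(p))
                        return 0;

                /* A killable sleep is woken by a fatal signal even when
                 * it is not otherwise interruptible. */
                return (state & TASK_INTERRUPTIBLE) ||
                       __fatal_signal_pending(p);
        }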
      
      Removing the explicit sigkill_pending test fixes ptrace_signal
      when ptrace_stop does not stop, because current->exit_code is always
      set to signr.
      
      Cc: stable@vger.kernel.org
      Fixes: 3d749b9e ("ptrace: simplify ptrace_stop()->sigkill_pending() path")
      Fixes: 1a669c2f ("Add arch_ptrace_stop")
      Link: https://lkml.kernel.org/r/87pmsyx29t.fsf@disp2133
      
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  11. 03 Sep, 2021 1 commit
    • memcg: enable accounting for signals · 5f58c398
      Vasily Averin authored
      When a user sends a signal to another process it forces the kernel to
      allocate memory for 'struct sigqueue' objects.  The number of signals
      is limited by the RLIMIT_SIGPENDING resource limit, but even the
      default settings allow each user to consume up to several megabytes
      of memory.
      
      It makes sense to account for these allocations to restrict the host's
      memory consumption from inside the memcg-limited container.
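
      The mechanics are likely just the memcg accounting flag on the
      sigqueue slab cache (a sketch assuming the usual SLAB_ACCOUNT idiom
      for memcg-charged kernel allocations):

        /* Charge 'struct sigqueue' allocations to the allocating task's
         * memory cgroup. */
        sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_ACCOUNT);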
      
      Link: https://lkml.kernel.org/r/e34e958c-e785-712e-a62a-2c7b66c646c7@virtuozzo.com
      
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 26 Aug, 2021 1 commit
  13. 10 Aug, 2021 1 commit
  14. 23 Jul, 2021 4 commits
  15. 08 Jul, 2021 1 commit
    • Fix UCOUNT_RLIMIT_SIGPENDING counter leak · f3791f4d
      Alexey Gladkov authored
      We must properly handle errors when we increase the rlimit counter
      and the ucounts reference counter. We have to do this with RCU
      protection to prevent a possible use-after-free that could occur due
      to a concurrent put_cred_rcu().
      
      The following reproducer triggers the problem:
      
        $ cat testcase.sh
        case "${STEP:-0}" in
        0)
      	ulimit -Si 1
      	ulimit -Hi 1
      	STEP=1 unshare -rU "$0"
      	killall sleep
      	;;
        1)
      	for i in 1 2 3 4 5; do unshare -rU sleep 5 & done
      	;;
        esac
      
      with the KASAN report being along the lines of
      
        BUG: KASAN: use-after-free in put_ucounts+0x17/0xa0
        Write of size 4 at addr ffff8880045f031c by task swapper/2/0
      
        CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.13.0+ #19
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-alt4 04/01/2014
        Call Trace:
         <IRQ>
         put_ucounts+0x17/0xa0
         put_cred_rcu+0xd5/0x190
         rcu_core+0x3bf/0xcb0
         __do_softirq+0xe3/0x341
         irq_exit_rcu+0xbe/0xe0
         sysvec_apic_timer_interrupt+0x6a/0x90
         </IRQ>
         asm_sysvec_apic_timer_interrupt+0x12/0x20
         default_idle_call+0x53/0x130
         do_idle+0x311/0x3c0
         cpu_startup_entry+0x14/0x20
         secondary_startup_64_no_verify+0xc2/0xcb
      
        Allocated by task 127:
         kasan_save_stack+0x1b/0x40
         __kasan_kmalloc+0x7c/0x90
         alloc_ucounts+0x169/0x2b0
         set_cred_ucounts+0xbb/0x170
         ksys_unshare+0x24c/0x4e0
         __x64_sys_unshare+0x16/0x20
         do_syscall_64+0x37/0x70
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        Freed by task 0:
         kasan_save_stack+0x1b/0x40
         kasan_set_track+0x1c/0x30
         kasan_set_free_info+0x20/0x30
         __kasan_slab_free+0xeb/0x120
         kfree+0xaa/0x460
         put_cred_rcu+0xd5/0x190
         rcu_core+0x3bf/0xcb0
         __do_softirq+0xe3/0x341
      
        The buggy address belongs to the object at ffff8880045f0300
         which belongs to the cache kmalloc-192 of size 192
        The buggy address is located 28 bytes inside of
         192-byte region [ffff8880045f0300, ffff8880045f03c0)
        The buggy address belongs to the page:
        page:000000008de0a388 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8880045f0000 pfn:0x45f0
        flags: 0x100000000000200(slab|node=0|zone=1)
        raw: 0100000000000200 ffffea00000f4640 0000000a0000000a ffff888001042a00
        raw: ffff8880045f0000 000000008010000d 00000001ffffffff 0000000000000000
        page dumped because: kasan: bad access detected
      
        Memory state around the buggy address:
         ffff8880045f0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
         ffff8880045f0280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
        >ffff8880045f0300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                    ^
         ffff8880045f0380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
         ffff8880045f0400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ==================================================================
        Disabling lock debugging due to kernel taint
      
      Fixes: d6469690 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 01 Jul, 2021 1 commit
  17. 27 Jun, 2021 1 commit
    • Revert "signal: Allow tasks to cache one sigqueue struct" · b4b27b9e
      Linus Torvalds authored
      This reverts commits 4bad58eb (and
      399f8dd9, which tried to fix it).
      
      I do not believe these are correct, and I'm about to release 5.13, so am
      reverting them out of an abundance of caution.
      
      The locking is odd, and appears broken.
      
      On the allocation side (in __sigqueue_alloc()), the locking is somewhat
      straightforward: it depends on sighand->siglock.  Since one caller
      doesn't hold that lock, it further then tests 'sigqueue_flags' to avoid
      the case with no locks held.
      
      On the freeing side (in sigqueue_cache_or_free()), there is no locking
      at all, and the logic instead depends on 'current' being a single
      thread, and not able to race with itself.
      
      To make things more exciting, there's also the data race between freeing
      a signal and allocating one, which is handled by using WRITE_ONCE() and
      READ_ONCE(), and being mutually exclusive wrt the initial state (ie
      freeing will only free if the old state was NULL, while allocating will
      obviously only use the value if it was non-NULL, so only one or the
      other will actually act on the value).
      
      However, while the free->alloc paths do seem mutually exclusive thanks
      to just the data value dependency, it's not clear what the memory
      ordering constraints are on it.  Could writes from the previous
      allocation possibly be delayed and seen by the new allocation later,
      causing logical inconsistencies?
      
      So it's all very exciting and unusual.
      
      And in particular, it seems that the freeing side is incorrect in
      depending on "current" being single-threaded.  Yes, 'current' is a
      single thread, but in the presence of asynchronous events even a
      single thread can have data races.
      
      And such asynchronous events can and do happen, with interrupts causing
      signals to be flushed and thus free'd (for example - sending a
      SIGCONT/SIGSTOP can happen from interrupt context, and can flush
      previously queued process control signals).
      
      So regardless of all the other questions about the memory ordering and
      locking for this new cached allocation, the sigqueue_cache_or_free()
      assumptions seem to be fundamentally incorrect.
      
      It may be that people will show me the errors of my ways, and tell me
      why this is all safe after all.  We can reinstate it if so.  But my
      current belief is that the WRITE_ONCE() that sets the cached entry needs
      to be a smp_store_release(), and the READ_ONCE() that finds a cached
      entry needs to be a smp_load_acquire() to handle memory ordering
      correctly.
      
      And the sequence in sigqueue_cache_or_free() would need to either use a
      lock or at least be interrupt-safe some way (perhaps by using something
      like the percpu 'cmpxchg': it doesn't need to be SMP-safe, but like the
      percpu operations it needs to be interrupt-safe).
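
      In C11-atomics terms, the suggested fix maps to a release store when
      parking an entry and an acquire operation when taking one, so that
      all writes to the cached object happen-before its reuse. A userspace
      sketch of that pattern (a model, not the kernel code):

        #include <stdatomic.h>
        #include <stdlib.h>

        struct sigqueue { int sig; /* ... */ };

        static _Atomic(struct sigqueue *) cache;

        static struct sigqueue *cache_get(void)
        {
                /* acquire: everything written before the matching release
                 * store is visible before we reuse the object */
                struct sigqueue *q = atomic_exchange_explicit(
                        &cache, NULL, memory_order_acquire);
                return q ? q : malloc(sizeof(*q));
        }

        static void cache_put(struct sigqueue *q)
        {
                struct sigqueue *expected = NULL;
                /* release: publish the object only after all writes to it */
                if (!atomic_compare_exchange_strong_explicit(
                            &cache, &expected, q,
                            memory_order_release, memory_order_relaxed))
                        free(q);  /* slot occupied: back to the allocator */
        }

      Even with the ordering fixed, the put side would still need to be
      made interrupt-safe in the kernel, as the text notes.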
      
      Fixes: 399f8dd9 ("signal: Prevent sigqueue caching after task got released")
      Fixes: 4bad58eb ("signal: Allow tasks to cache one sigqueue struct")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 22 Jun, 2021 1 commit
    • signal: Prevent sigqueue caching after task got released · 399f8dd9
      Thomas Gleixner authored
      syzbot reported a memory leak related to sigqueue caching.
      
      The assumption that a task cannot cache a sigqueue after the signal handler
      has been dropped and exit_task_sigqueue_cache() has been invoked turns out
      to be wrong.
      
      Such a task can still invoke release_task(other_task), which cleans up the
      signals of 'other_task' and ends up in sigqueue_cache_or_free(), which in
      turn will cache the signal because task->sigqueue_cache is NULL. That's
      obviously bogus because nothing will free the cached signal of that task
      anymore, so the cached item is leaked.
      
      This happens when e.g. the last non-leader thread exits and reaps the
      zombie leader.
      
      Prevent this by setting tsk::sigqueue_cache to an error pointer value in
      exit_task_sigqueue_cache() which forces any subsequent invocation of
      sigqueue_cache_or_free() from that task to hand the sigqueue back to the
      kmemcache.
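
      The sentinel idea, sketched (assuming an ERR_PTR-style poison value;
      the exact details may differ from the patch):

        void exit_task_sigqueue_cache(struct task_struct *tsk)
        {
                struct sigqueue *q = tsk->sigqueue_cache;

                /* Poison rather than NULL: any later attempt to cache
                 * into this slot must fail. */
                tsk->sigqueue_cache = ERR_PTR(-1);
                if (q)
                        kmem_cache_free(sigqueue_cachep, q);
        }

        /* ... and sigqueue_cache_or_free() only caches into an empty
         * (NULL) slot, so a poisoned slot sends the entry back to the
         * kmemcache. */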
      
      Add comments to all relevant places.
      
      Fixes: 4bad58eb ("signal: Allow tasks to cache one sigqueue struct")
      Reported-by: syzbot+0bac5fec63d4f399ba98@syzkaller.appspotmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/878s32g6j5.ffs@nanos.tec.linutronix.de
  19. 18 Jun, 2021 1 commit
  20. 18 May, 2021 4 commits
  21. 30 Apr, 2021 1 commit
  22. 23 Apr, 2021 1 commit
  23. 16 Apr, 2021 1 commit
  24. 14 Apr, 2021 2 commits
    • signal: Allow tasks to cache one sigqueue struct · 4bad58eb
      Thomas Gleixner authored
      The idea for this originates from the real time tree, to make signal
      delivery for realtime applications more efficient. In quite a few of
      these application scenarios a control task signals workers to start
      their computations. There is usually only one signal per worker in
      flight.  This works nicely as long as the kmem cache allocations do
      not hit the slow path and cause latencies.
      
      To cure this an optimistic caching was introduced (limited to RT tasks)
      which allows a task to cache a single sigqueue in a pointer in task_struct
      instead of handing it back to the kmem cache after consuming a signal. When
      the next signal is sent to the task then the cached sigqueue is used
      instead of allocating a new one. This solved the problem for this set of
      application scenarios nicely.
      
      The task cache is not preallocated, so the first signal sent to a task
      always goes to the kmem cache allocator. The cached sigqueue stays
      around until the task exits and is freed when task::sighand is dropped.
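
      A minimal userspace model of that alloc/free fast path (the kernel
      hangs the slot off task_struct; note that the unsynchronized slot in
      this model is exactly what the revert above, b4b27b9e, later takes
      issue with):

        #include <stdlib.h>

        struct sigqueue { int sig; };

        /* One cached entry per task, modeled as a single slot here;
         * NULL means the slot is empty. */
        static struct sigqueue *sigqueue_cache;

        static struct sigqueue *sigqueue_alloc_cached(void)
        {
                struct sigqueue *q = sigqueue_cache;

                if (q) {
                        sigqueue_cache = NULL;  /* take the cached entry */
                        return q;
                }
                return malloc(sizeof(*q));      /* slow path: allocator */
        }

        static void sigqueue_cache_or_free(struct sigqueue *q)
        {
                if (!sigqueue_cache)
                        sigqueue_cache = q;     /* park it for next time */
                else
                        free(q);                /* slot occupied */
        }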
      
      After posting this solution for mainline, the question came up whether
      this would be useful in general and should not be limited to realtime
      tasks: https://lore.kernel.org/r/m11rcu7nbr.fsf@fess.ebiederm.org

      One concern leading to the original limitation was to avoid a large amount
      of pointlessly cached sigqueues in alive tasks. The other concern was
      vs. RLIMIT_SIGPENDING as these cached sigqueues are not accounted for.
      
      The accounting problem is real, but on the other hand slightly academic.
      After gathering some statistics it turned out that after boot of a regular
      distro install there are less than 10 sigqueues cached in ~1500 tasks.
      
      In case of a 'mass fork and fire signal to child' scenario the extra 80
      bytes of memory per task are well in the noise of the overall memory
      consumption of the fork bomb.
      
      If this should be limited then this would need an extra counter in struct
      user, more atomic instructions and a separate rlimit. Yet another tunable
      which is mostly unused.
      
      The caching is actually used. After boot and a full kernel compile on a
      64CPU machine with make -j128 the number of 'allocations' looks like this:
      
        From slab:	   23996
        From task cache: 52223
      
      I.e. it reduces the number of slab cache operations by ~68%.
      
      A typical pattern there is:
      
      <...>-58490 __sigqueue_alloc:  for 58488 from slab ffff8881132df460
      <...>-58488 __sigqueue_free:   cache ffff8881132df460
      <...>-58488 __sigqueue_alloc:  for 1149 from cache ffff8881103dc550
        bash-1149 exit_task_sighand: free ffff8881132df460
        bash-1149 __sigqueue_free:   cache ffff8881103dc550
      
      The interesting sequence is that the exiting task 58488 grabs the
      sigqueue from bash's task cache to signal exit and bash sticks it back
      into its own cache. Lather, rinse and repeat.
      
      The caching is probably not noticeable for the general use case, but
      the benefit for latency sensitive applications is clear. While kmem
      caches usually serve from the fast path, slab merging (the default)
      can, depending on the usage pattern of the merged slabs, cause
      occasional slow path allocations.
      
      The time spared per cached entry is a few microseconds per signal,
      which is not relevant for e.g. a kernel build, but for signal heavy
      workloads it's measurable.
      
      As there is no real downside to this caching mechanism, making it
      unconditionally available is preferred over more conditional code or
      new magic tunables.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lkml.kernel.org/r/87sg4lbmxo.fsf@nanos.tec.linutronix.de
    • signal: Hand SIGQUEUE_PREALLOC flag to __sigqueue_alloc() · 69995ebb
      Thomas Gleixner authored
      
      There is no point in having the conditional at the callsite.
      
      Just hand in the allocation mode flag to __sigqueue_alloc() and use it to
      initialize sigqueue::flags.
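
      Sketch of the call-site change (the signature details are assumed):

        /* before: allocate, then fix up the flag at the callsite */
        q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0);
        if (q)
                q->flags |= SIGQUEUE_PREALLOC;

        /* after: the allocator initializes sigqueue::flags itself */
        q = __sigqueue_alloc(-1, current, GFP_KERNEL, 0, SIGQUEUE_PREALLOC);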
      
      No functional change.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210322092258.898677147@linutronix.de