1. 16 Sep, 2021 1 commit
  2. 15 Sep, 2021 21 commits
    • bpf: Fix possible out of bound write in narrow load handling · b0d7d802
      Andrey Ignatov authored
      [ Upstream commit d7af7e49 ]
      
      Fix a verifier bug found by smatch static checker in [0].
      
      This problem has never been seen in prod to the best of my knowledge.
      Fixing it still seems to be a good idea since it's hard to say for sure
      whether it's possible or not to have a scenario where a combination of
      convert_ctx_access() and a narrow load would lead to an out-of-bounds
      write.
      
      When a narrow load is handled, one or two new instructions are added to
      the insn_buf array, but previously it was only checked that

      	cnt >= ARRAY_SIZE(insn_buf)

      so it is safe to add a new instruction to insn_buf[cnt++] only once;
      the second attempt leads to an out-of-bounds write. And this is what
      can happen if `shift` is set.
      
      Fix it by making sure that if the BPF_RSH instruction has to be added in
      addition to BPF_AND then there is enough space for two more instructions
      in insn_buf.
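
The fixed bounds handling can be illustrated with a small standalone sketch (plain C with hypothetical simplified names, not the kernel code):

```c
/* Stand-in for the verifier's scratch buffer of patched instructions. */
#define INSN_BUF_ENTRIES 32

/* Returns 0 when the narrow-load fixup still fits into insn_buf, -1 when
 * appending it would write out of bounds. 'cnt' is the number of
 * instructions already emitted by convert_ctx_access(); a nonzero 'shift'
 * means both a BPF_RSH and a BPF_AND instruction must be appended, so two
 * free slots are required instead of one. */
int narrow_load_fits(unsigned long cnt, unsigned int shift)
{
    unsigned long needed = shift ? 2 : 1;

    if (cnt + needed > INSN_BUF_ENTRIES)
        return -1;
    return 0;
}
```

With cnt == 31 a single BPF_AND still fits, but a BPF_RSH plus BPF_AND pair does not, which is exactly the case the old `cnt >= ARRAY_SIZE(insn_buf)` check missed.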
      
      The full report [0] is below:
      
      kernel/bpf/verifier.c:12304 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
      kernel/bpf/verifier.c:12311 convert_ctx_accesses() warn: offset 'cnt' incremented past end of array
      
      kernel/bpf/verifier.c
          12282
          12283 			insn->off = off & ~(size_default - 1);
          12284 			insn->code = BPF_LDX | BPF_MEM | size_code;
          12285 		}
          12286
          12287 		target_size = 0;
          12288 		cnt = convert_ctx_access(type, insn, insn_buf, env->prog,
          12289 					 &target_size);
          12290 		if (cnt == 0 || cnt >= ARRAY_SIZE(insn_buf) ||
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      Bounds check.
      
          12291 		    (ctx_field_size && !target_size)) {
          12292 			verbose(env, "bpf verifier is misconfigured\n");
          12293 			return -EINVAL;
          12294 		}
          12295
          12296 		if (is_narrower_load && size < target_size) {
          12297 			u8 shift = bpf_ctx_narrow_access_offset(
          12298 				off, size, size_default) * 8;
          12299 			if (ctx_field_size <= 4) {
          12300 				if (shift)
          12301 					insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH,
                                                               ^^^^^
      increment beyond end of array
      
          12302 									insn->dst_reg,
          12303 									shift);
      --> 12304 				insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg,
                                                       ^^^^^
      out of bounds write
      
          12305 								(1 << size * 8) - 1);
          12306 			} else {
          12307 				if (shift)
          12308 					insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH,
          12309 									insn->dst_reg,
          12310 									shift);
          12311 				insn_buf[cnt++] = BPF_ALU64_IMM(BPF_AND, insn->dst_reg,
                                              ^^^^^^^^^^^^^^^
      Same.
      
          12312 								(1ULL << size * 8) - 1);
          12313 			}
          12314 		}
          12315
          12316 		new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
          12317 		if (!new_prog)
          12318 			return -ENOMEM;
          12319
          12320 		delta += cnt - 1;
          12321
          12322 		/* keep walking new program and skip insns we just inserted */
          12323 		env->prog = new_prog;
          12324 		insn      = new_prog->insnsi + i + delta;
          12325 	}
          12326
          12327 	return 0;
          12328 }
      
      [0] https://lore.kernel.org/bpf/20210817050843.GA21456@kili/
      
      v1->v2:
      - clarify that problem was only seen by static checker but not in prod;
      
      Fixes: 46f53a65 ("bpf: Allow narrow loads with offset > 0")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210820163935.1902398-1-rdna@fb.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • PM: cpu: Make notifier chain use a raw_spinlock_t · f4a79b7d
      Valentin Schneider authored
      [ Upstream commit b2f6662a ]
      
      Invoking atomic_notifier_chain_notify() requires acquiring a spinlock_t,
      which can block under CONFIG_PREEMPT_RT. Notifications for members of the
      cpu_pm notification chain will be issued by the idle task, which can never
      block.
      
      Making *all* atomic_notifiers use a raw_spinlock is too big of a hammer, as
      only notifications issued by the idle task are problematic.
      
      Special-case cpu_pm_notifier_chain by kludging a raw_notifier and
      raw_spinlock_t together, matching the atomic_notifier behavior with a
      raw_spinlock_t.
      
      Fixes: 70d93298 ("notifier: Fix broken error handling pattern")
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cgroup/cpuset: Fix violation of cpuset locking rule · b3d3890e
      Waiman Long authored
      [ Upstream commit 6ba34d3c ]
      
      The cpuset fields that manage the partition root state do not strictly
      follow the cpuset locking rule that updates to a cpuset have to be done
      with both the callback_lock and cpuset_mutex held. This is now fixed
      by making sure that the locking rule is upheld.
      
      Fixes: 3881b861 ("cpuset: Add an error state to cpuset.sched.partition")
      Fixes: 4b842da2 ("cpuset: Make CPU hotplug work with partition")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cgroup/cpuset: Miscellaneous code cleanup · 3f75d479
      Waiman Long authored
      [ Upstream commit 0f3adb8a ]
      
      Use more descriptive variable names for update_prstate(), remove
      unnecessary code and fix some typos. There is no functional change.

      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • PM: EM: Increase energy calculation precision · 2fa4fef6
      Lukasz Luba authored
      [ Upstream commit 7fcc17d0 ]
      
      The Energy Model (EM) provides useful information about device power in
      each performance state to other subsystems, such as the Energy Aware
      Scheduler (EAS). The energy calculation in EAS does arithmetic based on
      the EM's em_cpu_energy(). The current implementation of that function
      uses em_perf_state::cost as a pre-computed cost coefficient equal to:
      cost = power * max_frequency / frequency.
      The 'power' is expressed in milli-Watts (or in an abstract scale).
      
      There are corner cases when the EAS energy calculations for two
      Performance Domains (PDs) return the same value. EAS compares these
      values to choose the smaller one. It might happen that the values are
      equal only due to rounding error. In such scenarios we need better
      resolution, e.g. 1000 times better. To provide this, increase the
      resolution of em_perf_state::cost on 64-bit architectures. The cost of
      increasing the resolution on 32-bit is pretty high (64-bit division)
      and is not justified, since no new 32-bit big.LITTLE EAS systems are
      expected that would benefit from this higher resolution.

      This avoids the rounding-to-milli-Watt errors which might occur in the
      EAS energy estimation for each PD. The rounding error is common for
      small tasks which have a small utilization value.
      
      There are two places in the code where it makes a difference:
      1. In find_energy_efficient_cpu(), where we are searching for
      best_delta. We might suffer there when two PDs return the same result,
      like in the example below.

      Scenario:
      A low-utilization system, e.g. ~200 sum_util for PD0 and ~220 for PD1.
      There are quite a few small tasks of ~10-15 util. These tasks would
      suffer from the rounding error. Such utilization values are typical
      when running games on Android. One of our partners has reported 5..10mA
      less battery drain when running with increased resolution.
      
      Some details:
      We have two PDs: PD0 (big) and PD1 (little)
      Let's compare w/o patch set ('old') and w/ patch set ('new')
      We are comparing energy w/ task and w/o task placed in the PDs
      
      a) 'old' w/o patch set, PD0
      task_util = 13
      cost = 480
      sum_util_w/o_task = 215
      sum_util_w_task = 228
      scale_cpu = 1024
      energy_w/o_task = 480 * 215 / 1024 = 100.78 => 100
      energy_w_task = 480 * 228 / 1024 = 106.87 => 106
      energy_diff = 106 - 100 = 6
      (this is equal to 'old' PD1's energy_diff in 'c)')
      
      b) 'new' w/ patch set, PD0
      task_util = 13
      cost = 480 * 1000 = 480000
      sum_util_w/o_task = 215
      sum_util_w_task = 228
      energy_w/o_task = 480000 * 215 / 1024 = 100781
      energy_w_task = 480000 * 228 / 1024  = 106875
      energy_diff = 106875 - 100781 = 6094
      (this is not equal to 'new' PD1's energy_diff in 'd)')
      
      c) 'old' w/o patch set, PD1
      task_util = 13
      cost = 160
      sum_util_w/o_task = 283
      sum_util_w_task = 296
      scale_cpu = 355
      energy_w/o_task = 160 * 283 / 355 = 127.55 => 127
      energy_w_task = 160 * 296 / 355 = 133.41 => 133
      energy_diff = 133 - 127 = 6
      (this is equal to 'old' PD0's energy_diff in 'a)')
      
      d) 'new' w/ patch set, PD1
      task_util = 13
      cost = 160 * 1000 = 160000
      sum_util_w/o_task = 283
      sum_util_w_task = 296
      scale_cpu = 355
      energy_w/o_task = 160000 * 283 / 355 = 127549
      energy_w_task = 160000 * 296 / 355 =   133408
      energy_diff = 133408 - 127549 = 5859
      (this is not equal to 'new' PD0's energy_diff in 'b)')
      
      2. Difference in the 6% energy margin filter at the end of
      find_energy_efficient_cpu(). With this patch the margin comparison also
      has better resolution, so it's possible to have better task placement
      thanks to that.
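
The arithmetic in the examples above can be checked with a standalone sketch (plain C, ordinary integer division as in the kernel's fixed-point math; an illustration, not the kernel implementation):

```c
/* Energy estimate in the style of em_cpu_energy(): cost * sum_util /
 * scale_cpu, using integer division. With the old milli-Watt-resolution
 * cost both PDs yield the same energy_diff of 6; with cost scaled by
 * 1000 the diffs become 6094 vs 5859 and the tie disappears. */
long pd_energy(long cost, long sum_util, long scale_cpu)
{
    return cost * sum_util / scale_cpu;
}

long pd_energy_diff(long cost, long sum_no_task, long sum_with_task,
                    long scale_cpu)
{
    return pd_energy(cost, sum_with_task, scale_cpu) -
           pd_energy(cost, sum_no_task, scale_cpu);
}
```

Plugging in the numbers from a) through d) above reproduces the tie in the 'old' case and the distinct diffs in the 'new' case.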
      
      Fixes: 27871f7a ("PM: Introduce an Energy Model management framework")
      Reported-by: CCJ Yeh <CCj.Yeh@mediatek.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • cgroup/cpuset: Fix a partition bug with hotplug · 5e99b869
      Waiman Long authored
      [ Upstream commit 15d428e6 ]
      
      In cpuset_hotplug_workfn(), the detection of whether the cpu list has
      changed is done by comparing the effective cpus of the top cpuset with
      cpu_active_mask. However, in the rare case that exactly the CPUs in
      subparts_cpus are offlined, the detection fails and the partition
      states are not updated correctly. Fix it by forcing the cpus_updated
      flag to true in this particular case.
      
      Fixes: 4b842da2 ("cpuset: Make CPU hotplug work with partition")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bpf: Fix potential memleak and UAF in the verifier. · 746f0ad4
      He Fengqing authored
      [ Upstream commit 75f0fc7b ]
      
      In bpf_patch_insn_data(), we first use bpf_patch_insn_single() to
      insert new instructions, then use adjust_insn_aux_data() to adjust
      insn_aux_data. If the old env->prog does not have enough room for the
      newly inserted instructions, we use bpf_prog_realloc() to construct
      new_prog and free the old env->prog.

      There are two errors here. First, if adjust_insn_aux_data() fails with
      -ENOMEM, we should free new_prog. Second, in that case
      bpf_patch_insn_data() returns NULL while env->prog has already been
      freed by bpf_prog_realloc(), yet bpf_check() will still use it.

      So in this patch, make adjust_insn_aux_data() infallible: in
      bpf_patch_insn_data(), first pre-allocate memory for the new
      insn_aux_data, then call bpf_patch_insn_single() to insert the new
      instructions, and finally call adjust_insn_aux_data() to adjust
      insn_aux_data.
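
The pre-allocate-then-commit pattern described above can be sketched in userspace C (hypothetical names and types, not the kernel API):

```c
#include <stdlib.h>
#include <string.h>

/* Toy model of the fix: allocate the enlarged aux array *before* the
 * reallocating patch step, so the final "adjust" step is a plain copy
 * that cannot fail. This closes the window in which the old program had
 * already been freed while an error path still returned it to the caller. */
struct toy_prog {
    int *aux;
    unsigned long len;
};

int toy_patch(struct toy_prog *p, unsigned long extra)
{
    unsigned long new_len = p->len + extra;
    int *new_aux = calloc(new_len, sizeof(*new_aux));

    if (!new_aux)
        return -1;              /* fail before anything is freed */

    /* ... the bpf_patch_insn_single()-like step would go here ... */

    memcpy(new_aux, p->aux, p->len * sizeof(*p->aux));  /* infallible */
    free(p->aux);
    p->aux = new_aux;
    p->len = new_len;
    return 0;
}

int toy_patch_demo(void)
{
    struct toy_prog p = { calloc(2, sizeof(*p.aux)), 2 };

    if (!p.aux)
        return -1;
    p.aux[0] = 7;
    if (toy_patch(&p, 3))
        return -1;
    return (p.len == 5 && p.aux[0] == 7 && p.aux[4] == 0) ? 0 : -1;
}
```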
      
      Fixes: 8041902d ("bpf: adjust insn_aux_data when patching insns")
      Signed-off-by: He Fengqing <hefengqing@huawei.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • genirq/timings: Fix error return code in irq_timings_test_irqs() · 21862736
      Zhen Lei authored
      [ Upstream commit 290fdc4b ]
      
      Return a negative error code from the error handling case instead of 0, as
      done elsewhere in this function.
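
The bug class is the common "error path returns the initial 0" pattern; a minimal sketch (hypothetical function, -22 standing in for -EINVAL):

```c
/* Returns 0 on success and a negative error code on failure. The bug
 * fixed by the commit above is the variant where the error path falls
 * through to 'return ret;' with ret still holding its initial 0, so the
 * caller cannot tell that the selftest failed. */
int toy_selftest(int resource_ok)
{
    int ret = 0;

    if (!resource_ok) {
        ret = -22;      /* the fix: set -EINVAL instead of returning 0 */
        goto out;
    }
    /* ... the actual selftest body would run here ... */
out:
    return ret;
}
```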
      
      Fixes: f52da98d ("genirq/timings: Add selftest for irqs circular buffer")
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210811093333.2376-1-thunder.leizhen@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rcu: Fix stall-warning deadlock due to non-release of rcu_node ->lock · 8af14fb3
      Yanfei Xu authored
      [ Upstream commit dc87740c ]
      
      If rcu_print_task_stall() is invoked on an rcu_node structure that does
      not contain any tasks blocking the current grace period, it takes an
      early exit that fails to release that rcu_node structure's lock.  This
      results in a self-deadlock, which is detected by lockdep.
      
      To reproduce this bug:
      
      tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
      
      This will also result in other complaints, including RCU's scheduler
      hook complaining about blocking rather than preemption and an rcutorture
      writer stall.
      
      Only a partial RCU CPU stall warning message will be printed because of
      the self-deadlock.
      
      This commit therefore releases the lock on the rcu_print_task_stall()
      function's early exit path.
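
The shape of the bug and the fix, as a standalone sketch (a counter stands in for the rcu_node lock; hypothetical names, not the kernel code):

```c
static int toy_lock_depth;

static void toy_lock(void)   { toy_lock_depth++; }
static void toy_unlock(void) { toy_lock_depth--; }

/* Models rcu_print_task_stall(): the early exit for "no tasks blocking
 * the grace period" must release the lock it acquired, otherwise the
 * caller self-deadlocks the next time the lock is taken. Returns the
 * number of tasks reported. */
int toy_print_task_stall(int nblocked)
{
    toy_lock();
    if (!nblocked) {
        toy_unlock();   /* the fix: release before the early exit */
        return 0;
    }
    /* ... print the blocked tasks ... */
    toy_unlock();
    return nblocked;
}
```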
      
      Fixes: c583bcb8 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
      Tested-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rcu: Fix to include first blocked task in stall warning · 0133d52e
      Yanfei Xu authored
      [ Upstream commit e6a901a4 ]
      
      The for loop in rcu_print_task_stall() always omits ts[0], which points
      to the first task blocking the stalled grace period.  This in turn fails
      to count this first task, which means that ndetected will be equal to
      zero when all CPUs have passed through their quiescent states and only
      one task is blocking the stalled grace period.  This zero value for
      ndetected will in turn result in an incorrect "All QSes seen" message:
      
      rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
      rcu:    Tasks blocked on level-1 rcu_node (CPUs 12-23):
              (detected by 15, t=6504 jiffies, g=164777, q=9011209)
      rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2
      BUG: sleeping function called from invalid context at include/linux/uaccess.h:156
      in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04
      INFO: lockdep is turned off.
      Preemption disabled at:
      [<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0
      CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted
      5.12.2-yoctodev-standard #1
      Hardware name: Marvell OcteonTX CN96XX board (DT)
      Call trace:
       dump_backtrace+0x0/0x2cc
       show_stack+0x24/0x30
       dump_stack+0x110/0x188
       ___might_sleep+0x214/0x2d0
       __might_sleep+0x7c/0xe0
      
      This commit therefore fixes the loop to include ts[0].
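
The off-by-one is the classic descending loop that stops at i > 0; a minimal sketch (hypothetical, not the kernel code):

```c
/* Counts the entries of ts[] that represent still-blocked tasks. The bug
 * fixed above is the 'i > 0' variant of the loop condition, which never
 * visits ts[0] and so reports a count of 0 when exactly one task blocks
 * the grace period. */
int toy_count_blocked(const int *ts, int n)
{
    int ndetected = 0;
    int i;

    for (i = n - 1; i >= 0; i--)   /* fix: i >= 0, not i > 0 */
        if (ts[i])
            ndetected++;
    return ndetected;
}
```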
      
      Fixes: c583bcb8 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
      Tested-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched: Fix UCLAMP_FLAG_IDLE setting · ba3396c3
      Quentin Perret authored
      [ Upstream commit ca4984a7 ]
      
      The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
      uclamp active task (that is, when buckets.tasks reaches 0 for all
      buckets) to maintain the last uclamp.max and prevent blocked util from
      suddenly becoming visible.
      
      However, there is an asymmetry in how the flag is set and cleared which
      can lead to having the flag set whilst there are active tasks on the rq.
      Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
      called at enqueue time, but set in uclamp_rq_dec_id() which is called
      both when dequeueing a task _and_ in the update_uclamp_active() path. As
      a result, when both uclamp_rq_{inc,dec}_id() are called from
      update_uclamp_active(), the flag ends up being set but not cleared,
      hence leaving the runqueue in a broken state.
      
      Fix this by clearing the flag in update_uclamp_active() as well.
      
      Fixes: e496187d ("sched/uclamp: Enforce last task's UCLAMP_MAX")
      Reported-by: Rick Yiu <rickyiu@google.com>
      Signed-off-by: Quentin Perret <qperret@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Qais Yousef <qais.yousef@arm.com>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/numa: Fix is_core_idle() · 54ae6683
      Mika Penttilä authored
      [ Upstream commit 1c6829cf ]
      
      Use the loop variable instead of the function argument to test the
      other SMT siblings for idle.
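
A minimal sketch of the bug pattern (an idle[] array stands in for idle_cpu(); hypothetical, not the kernel code):

```c
/* Returns 1 when every SMT sibling of the core is idle. The bug fixed
 * above is testing idle[cpu] (the function argument) inside the loop
 * instead of idle[sibling] (the loop variable), which makes the result
 * depend only on the one CPU that was passed in. */
int toy_core_is_idle(const int *idle, const int *siblings, int nsiblings)
{
    int i;

    for (i = 0; i < nsiblings; i++) {
        int sibling = siblings[i];

        if (!idle[sibling])    /* fix: test the loop variable */
            return 0;
    }
    return 1;
}
```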
      
      Fixes: ff7db0bf ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
      Signed-off-by: Mika Penttilä <mika.penttila@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com>
      Link: https://lkml.kernel.org/r/20210722063946.28951-1-mika.penttila@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/debug: Don't update sched_domain debug directories before sched_debug_init() · d2964558
      Valentin Schneider authored
      [ Upstream commit 459b09b5 ]
      
      Since CPU capacity asymmetry can stem purely from maximum frequency
      differences (e.g. Pixel 1), a rebuild of the scheduler topology can be
      issued upon loading cpufreq, see:
      
        arch_topology.c::init_cpu_capacity_callback()
      
      Turns out that if this rebuild happens *before* sched_debug_init() is
      run (which is a late initcall), we end up messing up the sched_domain debug
      directory: passing a NULL parent to debugfs_create_dir() ends up creating
      the directory at the debugfs root, which in this case creates
      /sys/kernel/debug/domains (instead of /sys/kernel/debug/sched/domains).
      
      This currently doesn't happen on asymmetric systems which use cpufreq-scpi
      or cpufreq-dt drivers, as those are loaded via
      deferred_probe_initcall() (it is also a late initcall, but appears to be
      ordered *after* sched_debug_init()).
      
      Ionela has been working on detecting maximum frequency asymmetry via
      ACPI, and that actually happens via a *device* initcall, thus before
      sched_debug_init(), and causes the aforementioned debugfs mayhem.
      
      One option would be to punt sched_debug_init() down to
      fs_initcall_sync(). Preventing update_sched_domain_debugfs() from running
      before sched_debug_init() appears to be the safer option.
      
      Fixes: 3b87f136 ("sched,debug: Convert sysctl sched_domains to debugfs")
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lore.kernel.org/r/20210514095339.12979-1-ionela.voinescu@arm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/topology: Skip updating masks for non-online nodes · 34c03340
      Valentin Schneider authored
      [ Upstream commit 0083242c ]
      
      The scheduler currently expects NUMA node distances to be stable from
      init onwards, and as a consequence builds the related data structures
      once-and-for-all at init (see sched_init_numa()).
      
      Unfortunately, on some architectures node distance is unreliable for
      offline nodes and may very well change upon onlining.
      
      Skip over offline nodes during sched_init_numa(). Track nodes that have
      been onlined at least once, and trigger a build of a node's NUMA masks
      when it is first onlined post-init.

      Reported-by: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20210818074333.48645-1-srikar@linux.vnet.ibm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • hrtimer: Ensure timerfd notification for HIGHRES=n · add6659e
      Thomas Gleixner authored
      [ Upstream commit 8c3b5e6e ]
      
      If high resolution timers are disabled, the timerfd notification about
      a clock-was-set event does not happen for all cases which use
      clock_was_set_delayed(), because that function is a NOP for HIGHRES=n.
      That is wrong.

      Make clock_was_set_delayed() unconditionally available to fix that.

      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() · 0d7541f4
      Thomas Gleixner authored
      [ Upstream commit 627ef5ae ]
      
      If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
      the timer has to be canceled first and then added back. If the timer is the
      first expiring timer then on removal the clockevent device is reprogrammed
      to the next expiring timer to avoid that the pending expiry fires needlessly.
      
      If the new expiry time ends up being the first expiry again, then the
      clock event device has to be reprogrammed again.
      
      Avoid this by checking whether the timer is the first to expire and in that
      case, keep the timer on the current CPU and delay the reprogramming up to
      the point where the timer has been enqueued again.

      Reported-by: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • posix-cpu-timers: Force next expiration recalc after itimer reset · 56400580
      Frederic Weisbecker authored
      [ Upstream commit 406dd42b ]
      
      When an itimer deactivates a previously armed expiration, it simply doesn't
      do anything. As a result the process wide cputime counter keeps running and
      the tick dependency stays set until it reaches the old ghost expiration
      value.
      
      This can be reproduced with the following snippet:
      
      	void trigger_process_counter(void)
      	{
      		struct itimerval n = {};
      
      		n.it_value.tv_sec = 100;
      		setitimer(ITIMER_VIRTUAL, &n, NULL);
      		n.it_value.tv_sec = 0;
      		setitimer(ITIMER_VIRTUAL, &n, NULL);
      	}
      
      Fix this by resetting the relevant base expiration. This is similar to
      disarming a timer.

      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210726125513.271824-4-frederic@kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rcu/tree: Handle VM stoppage in stall detection · f08a6566
      Sergey Senozhatsky authored
      [ Upstream commit ccfc9dd6 ]
      
      The soft watchdog timer function checks whether a virtual machine was
      suspended, in which case what looks like a lockup is in fact a false
      positive.

      This is what kvm_check_and_clear_guest_paused() does: it tests the
      guest's PVCLOCK_GUEST_STOPPED flag (which is set by the host) and, if
      it is set, touches all watchdogs and bails out.

      The watchdog timer function runs from IRQ, so the
      PVCLOCK_GUEST_STOPPED check works fine there.

      There is, however, one more watchdog that runs from IRQ and races with
      the watchdog timer function, yet is not aware of
      PVCLOCK_GUEST_STOPPED: the RCU stall detector.
      
      apic_timer_interrupt()
       smp_apic_timer_interrupt()
        hrtimer_interrupt()
         __hrtimer_run_queues()
          tick_sched_timer()
           tick_sched_handle()
            update_process_times()
             rcu_sched_clock_irq()
      
      This triggers RCU stalls on our devices during VM resume.
      
      If tick_sched_handle()->rcu_sched_clock_irq() runs on a vCPU before
      watchdog_timer_fn()->kvm_check_and_clear_guest_paused(), then nothing
      on that vCPU touches the watchdogs, and RCU reads a stale grace-period
      stall timestamp together with the new jiffies value, which makes it
      think that RCU has stalled.
      
      Make the RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and do not
      report RCU stalls when resuming the VM.

      Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/deadline: Fix missing clock update in migrate_task_rq_dl() · daf2ceb7
      Dietmar Eggemann authored
      [ Upstream commit b4da13aa ]
      
      A missing clock update is causing the following warning:
      
      rq->clock_update_flags < RQCF_ACT_SKIP
      WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
      sub_running_bw.isra.0+0x190/0x1a0
      ...
      CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1
      Hardware name: WIWYNN Mt.Jade Server System
      B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP:
      1.06.20210526) 2021/05/26
      ...
      Call trace:
        sub_running_bw.isra.0+0x190/0x1a0
        migrate_task_rq_dl+0xf8/0x1e0
        set_task_cpu+0xa8/0x1f0
        try_to_wake_up+0x150/0x3d4
        wake_up_q+0x64/0xc0
        __up_write+0xd0/0x1c0
        up_write+0x4c/0x2b0
        cppc_set_perf+0x120/0x2d0
        cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
        __cpufreq_driver_target+0x74/0x140
        sugov_work+0x64/0x80
        kthread_worker_fn+0xe0/0x230
        kthread+0x138/0x140
        ret_from_fork+0x10/0x18
      
      The task causing this is the `cppc_fie` DL task introduced by commit
      1eb5dde6 ("cpufreq: CPPC: Add support for frequency invariance").

      With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and the schedutil cpufreq governor
      on a slow-switching system (like this Ampere Altra WIWYNN Mt. Jade Arm
      server):
      
      DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter
      is in `non_contending` state, migrate_task_rq_dl() calls
      
        sub_running_bw()->__sub_running_bw()->cpufreq_update_util()->
        rq_clock()->assert_clock_updated()
      
      on p.
      
      Fix this by updating the clock for a non_contending task in
      migrate_task_rq_dl() before calling sub_running_bw().

      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
      Acked-by: Juri Lelli <juri.lelli@redhat.com>
      Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/deadline: Fix reset_on_fork reporting of DL tasks · 82e0f5ad
      Quentin Perret authored
      [ Upstream commit f9509153 ]
      
      It is possible for sched_getattr() to incorrectly report the state of
      the reset_on_fork flag when called on a deadline task.
      
      Indeed, if the flag was set on a deadline task using sched_setattr()
      with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then
      p->sched_reset_on_fork will be set, but __setscheduler() will bail out
      early, which means that the dl_se->flags will not get updated by
      __setscheduler_params()->__setparam_dl(). Consequently, if
      sched_getattr() is then called on the task, __getparam_dl() will
      override kattr.sched_flags with the now out-of-date copy in dl_se->flags
      and report the stale value to userspace.
      
      To fix this, make sure to only copy the flags that are relevant to
      sched_deadline to and from the dl_se->flags field.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com
      
Signed-off-by: Sasha Levin <sashal@kernel.org>
      82e0f5ad
• locking/mutex: Fix HANDOFF condition · 73c080d7
      Peter Zijlstra authored
[ Upstream commit 048661a1 ]
      
Yanfei reported that setting HANDOFF should not depend on recomputing
@first, only on the state of @first. That would give:
      
        if (ww_ctx || !first)
          first = __mutex_waiter_is_first(lock, &waiter);
        if (first)
          __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
      
      But because 'ww_ctx || !first' is basically 'always' and the test for
      first is relatively cheap, omit that first branch entirely.
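Dropping the branch leaves a shape like this (a sketch inferred from the description above, not the verbatim patch):

```c
	first = __mutex_waiter_is_first(lock, &waiter);
	if (first)
		__mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);
```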
Reported-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Reviewed-by: Yanfei Xu <yanfei.xu@windriver.com>
      Link: https://lore.kernel.org/r/20210630154114.896786297@infradead.org
      
Signed-off-by: Sasha Levin <sashal@kernel.org>
      73c080d7
  3. 03 Sep, 2021 1 commit
• audit: move put_tree() to avoid trim_trees refcount underflow and UAF · 2424b0eb
      Richard Guy Briggs authored
      commit 67d69e9d upstream.
      
      AUDIT_TRIM is expected to be idempotent, but multiple executions resulted
      in a refcount underflow and use-after-free.
      
git bisect fingered commit fb041bb7 ("locking/refcount: Consolidate
implementations of refcount_t"), but that patch, with its more thorough
checking than the x86 assembly code it replaced, merely exposed a
pre-existing tree refcount imbalance in the tree trimming code that was
refactored with prune_one() to remove a tree, introduced in
commit 8432c700 ("audit: Simplify locking around untag_chunk()").
      
      Move the put_tree() to cover only the prune_one() case.
      
      Passes audit-testsuite and 3 passes of "auditctl -t" with at least one
      directory watch.
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Seiji Nishikawa <snishika@redhat.com>
      Cc: stable@vger.kernel.org
Fixes: 8432c700 ("audit: Simplify locking around untag_chunk()")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      [PM: reformatted/cleaned-up the commit description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2424b0eb
  4. 26 Aug, 2021 1 commit
  5. 23 Aug, 2021 3 commits
• ucounts: Increase ucounts reference counter before the security hook · bbb6d0f3
      Alexey Gladkov authored
We need to increment the ucounts reference counter before
security_prepare_creds(), because this function may fail and then
abort_creds() will try to decrement this reference.
      
      [   96.465056][ T8641] FAULT_INJECTION: forcing a failure.
      [   96.465056][ T8641] name fail_page_alloc, interval 1, probability 0, space 0, times 0
      [   96.478453][ T8641] CPU: 1 PID: 8641 Comm: syz-executor668 Not tainted 5.14.0-rc6-syzkaller #0
      [   96.487215][ T8641] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      [   96.497254][ T8641] Call Trace:
      [   96.500517][ T8641]  dump_stack_lvl+0x1d3/0x29f
      [   96.505758][ T8641]  ? show_regs_print_info+0x12/0x12
      [   96.510944][ T8641]  ? log_buf_vmcoreinfo_setup+0x498/0x498
      [   96.516652][ T8641]  should_fail+0x384/0x4b0
      [   96.521141][ T8641]  prepare_alloc_pages+0x1d1/0x5a0
      [   96.526236][ T8641]  __alloc_pages+0x14d/0x5f0
      [   96.530808][ T8641]  ? __rmqueue_pcplist+0x2030/0x2030
      [   96.536073][ T8641]  ? lockdep_hardirqs_on_prepare+0x3e2/0x750
      [   96.542056][ T8641]  ? alloc_pages+0x3f3/0x500
      [   96.546635][ T8641]  allocate_slab+0xf1/0x540
      [   96.551120][ T8641]  ___slab_alloc+0x1cf/0x350
      [   96.555689][ T8641]  ? kzalloc+0x1d/0x30
      [   96.559740][ T8641]  __kmalloc+0x2e7/0x390
      [   96.563980][ T8641]  ? kzalloc+0x1d/0x30
      [   96.568029][ T8641]  kzalloc+0x1d/0x30
      [   96.571903][ T8641]  security_prepare_creds+0x46/0x220
      [   96.577174][ T8641]  prepare_creds+0x411/0x640
      [   96.581747][ T8641]  __sys_setfsuid+0xe2/0x3a0
      [   96.586333][ T8641]  do_syscall_64+0x3d/0xb0
      [   96.590739][ T8641]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [   96.596611][ T8641] RIP: 0033:0x445a69
      [   96.600483][ T8641] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      [   96.620152][ T8641] RSP: 002b:00007f1054173318 EFLAGS: 00000246 ORIG_RAX: 000000000000007a
      [   96.628543][ T8641] RAX: ffffffffffffffda RBX: 00000000004ca4c8 RCX: 0000000000445a69
      [   96.636600][ T8641] RDX: 0000000000000010 RSI: 00007f10541732f0 RDI: 0000000000000000
      [   96.644550][ T8641] RBP: 00000000004ca4c0 R08: 0000000000000001 R09: 0000000000000000
      [   96.652500][ T8641] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004ca4cc
      [   96.660631][ T8641] R13: 00007fffffe0b62f R14: 00007f1054173400 R15: 0000000000022000
      
Fixes: 905ae01c ("Add a reference to ucounts for each cred")
      Reported-by: syzbot+01985d7909f9468f013c@syzkaller.appspotmail.com
Signed-off-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/97433b1742c3331f02ad92de5a4f07d673c90613.1629735352.git.legion@kernel.org
      
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      bbb6d0f3
• ucounts: Fix regression preventing increasing of rlimits in init_user_ns · 5ddf994f
      Eric W. Biederman authored
      "Ma, XinjianX" <xinjianx.ma@intel.com> reported:
      
      > When lkp team run kernel selftests, we found after these series of patches, testcase mqueue: mq_perf_tests
      > in kselftest failed with following message.
      >
      > # selftests: mqueue: mq_perf_tests
      > #
      > # Initial system state:
      > #       Using queue path:                       /mq_perf_tests
      > #       RLIMIT_MSGQUEUE(soft):                  819200
      > #       RLIMIT_MSGQUEUE(hard):                  819200
      > #       Maximum Message Size:                   8192
      > #       Maximum Queue Size:                     10
      > #       Nice value:                             0
      > #
      > # Adjusted system state for testing:
      > #       RLIMIT_MSGQUEUE(soft):                  (unlimited)
      > #       RLIMIT_MSGQUEUE(hard):                  (unlimited)
      > #       Maximum Message Size:                   16777216
      > #       Maximum Queue Size:                     65530
      > #       Nice value:                             -20
      > #       Continuous mode:                        (disabled)
      > #       CPUs to pin:                            3
      > # ./mq_perf_tests: mq_open() at 296: Too many open files
      > not ok 2 selftests: mqueue: mq_perf_tests # exit=1
      >
      > Test env:
      > rootfs: debian-10
      > gcc version: 9
      
After investigation the problem turned out to be that ucount_max for
the rlimits in init_user_ns was being set to the initial rlimit value.
The practical problem is that ucount_max provides a limit that
applications inside the user namespace can not exceed, which meant in
practice that rlimits converted to use the ucount infrastructure could
not exceed their initial values.
      
Solve this by setting the relevant values of ucount_max to
RLIM_INFINITY.  A limit in init_user_ns is pointless, so the code
should allow the values to grow as large as possible without risking
an underflow or an overflow.
      
As the LTP test case was a bit of a pain, I reproduced the rlimit failure
and tested the fix with the following little C program:
      > #include <stdio.h>
      > #include <fcntl.h>
      > #include <sys/stat.h>
      > #include <mqueue.h>
      > #include <sys/time.h>
      > #include <sys/resource.h>
      > #include <errno.h>
      > #include <string.h>
      > #include <stdlib.h>
      > #include <limits.h>
      > #include <unistd.h>
      >
      > int main(int argc, char **argv)
      > {
      > 	struct mq_attr mq_attr;
      > 	struct rlimit rlim;
      > 	mqd_t mqd;
      > 	int ret;
      >
      > 	ret = getrlimit(RLIMIT_MSGQUEUE, &rlim);
      > 	if (ret != 0) {
      > 		fprintf(stderr, "getrlimit(RLIMIT_MSGQUEUE) failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      > 	printf("RLIMIT_MSGQUEUE %lu %lu\n",
      > 	       rlim.rlim_cur, rlim.rlim_max);
      > 	rlim.rlim_cur = RLIM_INFINITY;
      > 	rlim.rlim_max = RLIM_INFINITY;
      > 	ret = setrlimit(RLIMIT_MSGQUEUE, &rlim);
      > 	if (ret != 0) {
      > 		fprintf(stderr, "setrlimit(RLIMIT_MSGQUEUE, RLIM_INFINITY) failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      >
      > 	memset(&mq_attr, 0, sizeof(struct mq_attr));
      > 	mq_attr.mq_maxmsg = 65536 - 1;
      > 	mq_attr.mq_msgsize = 16*1024*1024 - 1;
      >
      > 	mqd = mq_open("/mq_rlimit_test", O_RDONLY|O_CREAT, 0600, &mq_attr);
      > 	if (mqd == (mqd_t)-1) {
      > 		fprintf(stderr, "mq_open failed: %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      > 	ret = mq_close(mqd);
      > 	if (ret) {
      > 		fprintf(stderr, "mq_close failed; %s\n", strerror(errno));
      > 		exit(EXIT_FAILURE);
      > 	}
      >
      > 	return EXIT_SUCCESS;
      > }
      
      Fixes: 6e52a9f0 ("Reimplement RLIMIT_MSGQUEUE on top of ucounts")
      Fixes: d7c9e99a ("Reimplement RLIMIT_MEMLOCK on top of ucounts")
      Fixes: d6469690 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
Fixes: 21d1c5e3 ("Reimplement RLIMIT_NPROC on top of ucounts")
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Alexey Gladkov <legion@kernel.org>
      Link: https://lkml.kernel.org/r/87eeajswfc.fsf_-_@disp2133
      
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      5ddf994f
• bpf: Fix ringbuf helper function compatibility · 5b029a32
      Daniel Borkmann authored
      Commit 457f4436 ("bpf: Implement BPF ring buffer and verifier support
      for it") extended check_map_func_compatibility() by enforcing map -> helper
      function match, but not helper -> map type match.
      
Due to this, all of the bpf_ringbuf_*() helper functions could be used
with a wrong map type such as an array or hash map, leading to invalid
accesses due to type confusion.
      
Also, both BPF_FUNC_ringbuf_{submit,discard} take ARG_PTR_TO_ALLOC_MEM
as their argument rather than a BPF map, so their presence in
check_map_func_compatibility() is incorrect, since that function only
checks map types.
      
Fixes: 457f4436 ("bpf: Implement BPF ring buffer and verifier support for it")
      Reported-by: Ryota Shiga (Flatt Security)
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
      5b029a32
  6. 20 Aug, 2021 1 commit
  7. 16 Aug, 2021 1 commit
  8. 13 Aug, 2021 1 commit
• bpf: Clear zext_dst of dead insns · 45c709f8
      Ilya Leoshkevich authored
      "access skb fields ok" verifier test fails on s390 with the "verifier
      bug. zext_dst is set, but no reg is defined" message. The first insns
      of the test prog are ...
      
         0:	61 01 00 00 00 00 00 00 	ldxw %r0,[%r1+0]
         8:	35 00 00 01 00 00 00 00 	jge %r0,0,1
        10:	61 01 00 08 00 00 00 00 	ldxw %r0,[%r1+8]
      
      ... and the 3rd one is dead (this does not look intentional to me, but
      this is a separate topic).
      
      sanitize_dead_code() converts dead insns into "ja -1", but keeps
      zext_dst. When opt_subreg_zext_lo32_rnd_hi32() tries to parse such
      an insn, it sees this discrepancy and bails. This problem can be seen
      only with JITs whose bpf_jit_needs_zext() returns true.
      
Fix by clearing the zext_dst of dead insns.
      
      The commits that contributed to this problem are:
      
      1. 5aa5bd14 ("bpf: add initial suite for selftests"), which
         introduced the test with the dead code.
      2. 5327ed3d ("bpf: verifier: mark verified-insn with
         sub-register zext flag"), which introduced the zext_dst flag.
      3. 83a28819 ("bpf: Account for BPF_FETCH in
         insn_has_def32()"), which introduced the sanity check.
      4. 9183671a ("bpf: Fix leakage under speculation on
         mispredicted branches"), which bisect points to.
      
      It's best to fix this on stable branches that contain the second one,
      since that's the point where the inconsistency was introduced.
      
Fixes: 5327ed3d ("bpf: verifier: mark verified-insn with sub-register zext flag")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20210812151811.184086-2-iii@linux.ibm.com
      45c709f8
  9. 12 Aug, 2021 5 commits
• tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name · 5acce0bf
      Steven Rostedt (VMware) authored
      The following commands:
      
       # echo 'read_max u64 size;' > synthetic_events
       # echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)' > events/syscalls/sys_enter_read/trigger
      
      Causes:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU: 4 PID: 1763 Comm: bash Not tainted 5.14.0-rc2-test+ #155
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
      v03.03 07/14/2016
       RIP: 0010:strcmp+0xc/0x20
       Code: 75 f7 31 c0 0f b6 0c 06 88 0c 02 48 83 c0 01 84 c9 75 f1 4c 89 c0
      c3 0f 1f 80 00 00 00 00 31 c0 eb 08 48 83 c0 01 84 d2 74 0f <0f> b6 14 07
      3a 14 06 74 ef 19 c0 83 c8 01 c3 31 c0 c3 66 90 48 89
       RSP: 0018:ffffb5fdc0963ca8 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffffffffb3a4e040 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: ffff9714c0d0b640 RDI: ...
      5acce0bf
• tracing: define needed config DYNAMIC_FTRACE_WITH_ARGS · 12f9951d
      Lukas Bulwahn authored
      Commit 2860cd8a ("livepatch: Use the default ftrace_ops instead of
      REGS when ARGS is available") intends to enable config LIVEPATCH when
      ftrace with ARGS is available. However, the chain of configs to enable
      LIVEPATCH is incomplete, as HAVE_DYNAMIC_FTRACE_WITH_ARGS is available,
      but the definition of DYNAMIC_FTRACE_WITH_ARGS, combining DYNAMIC_FTRACE
      and HAVE_DYNAMIC_FTRACE_WITH_ARGS, needed to enable LIVEPATCH, is missing
      in the commit.
      
      Fortunately, ./scripts/checkkconfigsymbols.py detects this and warns:
      
      DYNAMIC_FTRACE_WITH_ARGS
      Referencing files: kernel/livepatch/Kconfig
      
      So, define the config DYNAMIC_FTRACE_WITH_ARGS analogously to the already
      existing similar configs, DYNAMIC_FTRACE_WITH_REGS and
      DYNAMIC_FTRACE_WITH_DIRECT_CALLS, in ./kernel/trace/Kconfig to connect the
      chain of configs.
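Per the description above, the missing definition mirrors its sibling configs; a sketch of the resulting Kconfig entry (exact placement and dependencies follow kernel/trace/Kconfig conventions):

```
config DYNAMIC_FTRACE_WITH_ARGS
	def_bool y
	depends on DYNAMIC_FTRACE
	depends on HAVE_DYNAMIC_FTRACE_WITH_ARGS
```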
      
      Link: https://lore.kernel.org/kernel-janitors/CAKXUXMwT2zS9fgyQHKUUiqo8ynZBdx2UEUu1WnV_q0OCmknqhw@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20210806...
      12f9951d
• trace/osnoise: Print a stop tracing message · 0e05ba49
      Daniel Bristot de Oliveira authored
When using osnoise/timerlat with stop tracing, it is sometimes not
clear on which CPU the stop condition was hit, mainly when using some
extra events.

Print a message indicating on which CPU the trace stopped, like in
the example below:
      
                <idle>-0       [006] d.h.  2932.676616: #1672599 context    irq timer_latency     34689 ns
                <idle>-0       [006] dNh.  2932.676618: irq_noise: local_timer:236 start 2932.676615639 duration 2391 ns
                <idle>-0       [006] dNh.  2932.676620: irq_noise: virtio0-output.0:47 start 2932.676620180 duration 86 ns
                <idle>-0       [003] d.h.  2932.676621: #1673374 context    irq timer_latency      1200 ns
                <idle>-0       [006] d...  2932.676623: thread_noise: swapper/6:0 start 2932.676615964 duration 4339 ns
                <idle>-0       [003] dNh.  2932.676623: irq_noise: local_timer:236 start 2932.676620597 duration 1881 ns
                <idle>-0       [006] d...  2932.676623: sched_switch: prev_comm=swapper/6 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/6 next_pid=852 next_prio=4
            timerlat/6-852     [006] ....  2932.676623: #1672599 context thread timer_latency     41931 ns
                <idle>-0       [003] d...  2932.676623: thread_noise: swapper/3:0 start 2932.676620854 duration 880 ns
                <idle>-0       [003] d...  2932.676624: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/3 next_pid=849 next_prio=4
            timerlat/6-852     [006] ....  2932.676624: timerlat_main: stop tracing hit on cpu 6
            timerlat/3-849     [003] ....  2932.676624: #1673374 context thread timer_latency      4310 ns
      
Link: https://lkml.kernel.org/r/b30a0d7542adba019185f44ee648e60e14923b11.1626598844.git.bristot@kernel.org

Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      0e05ba49
• trace/timerlat: Add a header with PREEMPT_RT additional fields · e1c4ad4a
      Daniel Bristot de Oliveira authored
      Some extra flags are printed to the trace header when using the
      PREEMPT_RT config. The extra flags are: need-resched-lazy,
      preempt-lazy-depth, and migrate-disable.
      
      Without printing these fields, the timerlat specific fields are
      shifted by three positions, for example:
      
       # tracer: timerlat
       #
       #                                _-----=> irqs-off
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth
       #                              || /
       #                              ||||             ACTIVATION
       #           TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY
       #              | |         |   ||||      |         |                  |                       |
                 <idle>-0       [000] d..h...  3279.798871: #1     context    irq timer_latency       830 ns
                  <...>-807     [000] .......  3279.798881: #1     context thread timer_latency     11301 ns
      
Add a new header for timerlat with the missing fields, to be used
when PREEMPT_RT is enabled.
      
Link: https://lkml.kernel.org/r/babb83529a3211bd0805be0b8c21608230202c55.1626598844.git.bristot@kernel.org

Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      e1c4ad4a
• trace/osnoise: Add a header with PREEMPT_RT additional fields · d03721a6
      Daniel Bristot de Oliveira authored
      Some extra flags are printed to the trace header when using the
      PREEMPT_RT config. The extra flags are: need-resched-lazy,
      preempt-lazy-depth, and migrate-disable.
      
      Without printing these fields, the osnoise specific fields are
      shifted by three positions, for example:
      
       # tracer: osnoise
       #
       #                                _-----=> irqs-off
       #                               / _----=> need-resched
       #                              | / _---=> hardirq/softirq
       #                              || / _--=> preempt-depth                            MAX
       #                              || /                                             SINGLE      Interference counters:
       #                              ||||               RUNTIME      NOISE  %% OF CPU  NOISE    +-----------------------------+
       #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
       #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
                  <...>-741     [000] .......  1105.690909: 1000000        234  99.97660      36     21      0   1001     22      3
                  <...>-742     [001] .......  1105.691923: 1000000        281  99.97190     197      7      0   1012     35     14
                  <...>-743     [002] .......  1105.691958: 1000000       1324  99.86760     118     11      0   1016    155    143
                  <...>-744     [003] .......  1105.691998: 1000000        109  99.98910      21      4      0   1004     33      7
                  <...>-745     [004] .......  1105.692015: 1000000       2023  99.79770      97     37      0   1023     52     18
      
Add a new header for osnoise with the missing fields, to be used
when PREEMPT_RT is enabled.
      
Link: https://lkml.kernel.org/r/1f03289d2a51fde5a58c2e7def063dc630820ad1.1626598844.git.bristot@kernel.org

Cc: Tom Zanussi <zanussi@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      d03721a6
  10. 11 Aug, 2021 3 commits
• cfi: Use rcu_read_{un}lock_sched_notrace · 14c4c8e4
      Elliot Berman authored
      
If rcu_read_lock_sched tracing is enabled, the tracing subsystem can
perform a jump which needs to be checked by CFI. For example, the
stm_ftrace source is enabled as a module and hooks into enabled ftrace
events. This can cause a recursive loop where find_shadow_check_fn ->
rcu_read_lock_sched -> (call to stm_ftrace generates cfi slowpath) ->
find_shadow_check_fn -> rcu_read_lock_sched -> ...

To avoid the recursion, either the ftrace code needs to be marked with
__no_cfi or CFI should not trace. Use the "_notrace" variants in CFI
to avoid tracing so that CFI can guard ftrace.
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
      Cc: stable@vger.kernel.org
Fixes: cf68fffb ("add support for Clang CFI")
Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20210811155914.19550-1-quic_eberman@quicinc.com
      14c4c8e4
• seccomp: Fix setting loaded filter count during TSYNC · b4d8a58f
      Hsuan-Chi Kuo authored
      
      The desired behavior is to set the caller's filter count to thread's.
      This value is reported via /proc, so this fixes the inaccurate count
      exposed to userspace; it is not used for reference counting, etc.
Signed-off-by: Hsuan-Chi Kuo <hsuanchikuo@gmail.com>
      Link: https://lore.kernel.org/r/20210304233708.420597-1-hsuanchikuo@gmail.com
      
Co-developed-by: Wiktor Garbacz <wiktorg@google.com>
Signed-off-by: Wiktor Garbacz <wiktorg@google.com>
      Link: https://lore.kernel.org/lkml/20210810125158.329849-1-wiktorg@google.com
      
Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Fixes: c818c03b ("seccomp: Report number of loaded filters in /proc/$pid/status")
      b4d8a58f
• bpf: Add rcu_read_lock in bpf_get_current_[ancestor_]cgroup_id() helpers · 2d3a1e36
      Yonghong Song authored
Currently, if the bpf_get_current_cgroup_id() or
bpf_get_current_ancestor_cgroup_id() helper is called from sleepable
programs, e.g., sleepable fentry/fmod_ret/fexit/lsm programs, an RCU
warning may appear. For example, if I added the following hack to the
test_progs/test_lsm sleepable fentry program test_sys_setdomainname:
      
        --- a/tools/testing/selftests/bpf/progs/lsm.c
        +++ b/tools/testing/selftests/bpf/progs/lsm.c
        @@ -168,6 +168,10 @@ int BPF_PROG(test_sys_setdomainname, struct pt_regs *regs)
                int buf = 0;
                long ret;
      
        +       __u64 cg_id = bpf_get_current_cgroup_id();
        +       if (cg_id == 1000)
        +               copy_test++;
        +
                ret = bpf_copy_from_user(&buf, sizeof(buf), ptr);
                if (len == -2 && ret == 0 && buf == 1234)
                        copy_test++;
      
      I will hit the following rcu warning:
      
        include/linux/cgroup.h:481 suspicious rcu_dereference_check() usage!
        other info that might help us debug this:
          rcu_scheduler_active = 2, debug_locks = 1
          1 lock held by test_progs/260:
            #0: ffffffffa5173360 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xa0
          stack backtrace:
          CPU: 1 PID: 260 Comm: test_progs Tainted: G           O      5.14.0-rc2+ #176
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
          Call Trace:
            dump_stack_lvl+0x56/0x7b
            bpf_get_current_cgroup_id+0x9c/0xb1
            bpf_prog_a29888d1c6706e09_test_sys_setdomainname+0x3e/0x89c
            bpf_trampoline_6442469132_0+0x2d/0x1000
            __x64_sys_setdomainname+0x5/0x110
            do_syscall_64+0x3a/0x80
            entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      I can get similar warning using bpf_get_current_ancestor_cgroup_id() helper.
      syzbot reported a similar issue in [1] for syscall program. Helper
      bpf_get_current_cgroup_id() or bpf_get_current_ancestor_cgroup_id()
      has the following callchain:
         task_dfl_cgroup
           task_css_set
             task_css_set_check
      and we have
         #define task_css_set_check(task, __c)                                   \
                 rcu_dereference_check((task)->cgroups,                          \
                         lockdep_is_held(&cgroup_mutex) ||                       \
                         lockdep_is_held(&css_set_lock) ||                       \
                         ((task)->flags & PF_EXITING) || (__c))
Since neither cgroup_mutex nor css_set_lock is held, the task is not
exiting, and rcu_read_lock() is not held, a warning will be issued.
Note that sleepable bpf programs are protected by rcu_read_lock_trace().
      
The above sleepable bpf programs are already protected by
migrate_disable(). Adding rcu_read_lock() in these two helpers
silences the above warning.

I marked the patch as fixing 95b861a7 ("bpf: Allow
bpf_get_current_ancestor_cgroup_id for tracing"), which added
bpf_get_current_ancestor_cgroup_id() to tracing programs in 5.14. I
think backporting to 5.14 is probably good enough, as sleepable
programs are not widely used.
      
      This patch should fix [1] as well since syscall program is a sleepable
      program protected with migrate_disable().
      
       [1] https://lore.kernel.org/bpf/0000000000006d5cab05c7d9bb87@google.com/
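The shape of the fix in the two helpers can be sketched as follows (a simplified, non-runnable kernel-style sketch in the BPF_CALL_0 style; not the verbatim patch):

```c
BPF_CALL_0(bpf_get_current_cgroup_id)
{
	struct cgroup *cgrp;
	u64 cgrp_id;

	rcu_read_lock();	/* the fix: satisfies task_css_set_check() */
	cgrp = task_dfl_cgroup(current);
	cgrp_id = cgroup_id(cgrp);
	rcu_read_unlock();

	return cgrp_id;
}
```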
      
Fixes: 95b861a7 ("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
      Reported-by: syzbot+7ee5c2c09c284495371f@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20210810230537.2864668-1-yhs@fb.com
      2d3a1e36
  11. 10 Aug, 2021 2 commits