1. 03 Sep, 2021 1 commit
  2. 04 Aug, 2021 1 commit
    • blk-iocost: fix operation ordering in iocg_wake_fn() · caed0df2
      Tejun Heo authored
      commit 5ab189cf upstream.
      
      iocg_wake_fn() open-codes wait_queue_entry removal and wakeup because it
      wants the wq_entry to be always removed whether it ended up waking the
      task or not. finish_wait() tests whether wq_entry needs removal without
      grabbing the wait_queue lock and expects the waker to use
      list_del_init_careful() after all waking operations are complete, which
      iocg_wake_fn() didn't do. The operation order was wrong and the regular
      list_del_init() was used.
      
      The result is that if a waiter wakes up racing the waker, it can free or
      pop the wq_entry off its stack while the waker is still looking at it,
      which can lead to a backtrace like the following.
      
        [7312084.588951] general protection fault, probably for non-canonical address 0x586bf4005b2b88: 0000 [#1] SMP
        ...
        [7312084.647079] RIP: 0010:queued_spin_lock_slowpath+0x171/0x1b0
        ...
        [7312084.858314] Call Trace:
        [7312084.863548]  _raw_spin_lock_irqsave+0x22/0x30
        [7312084.872605]  try_to_wake_up+0x4c/0x4f0
        [7312084.880444]  iocg_wake_fn+0x71/0x80
        [7312084.887763]  __wake_up_common+0x71/0x140
        [7312084.895951]  iocg_kick_waitq+0xe8/0x2b0
        [7312084.903964]  ioc_rqos_throttle+0x275/0x650
        [7312084.922423]  __rq_qos_throttle+0x20/0x30
        [7312084.930608]  blk_mq_make_request+0x120/0x650
        [7312084.939490]  generic_make_request+0xca/0x310
        [7312084.957600]  submit_bio+0x173/0x200
        [7312084.981806]  swap_readpage+0x15c/0x240
        [7312084.989646]  read_swap_cache_async+0x58/0x60
        [7312084.998527]  swap_cluster_readahead+0x201/0x320
        [7312085.023432]  swapin_readahead+0x2df/0x450
        [7312085.040672]  do_swap_page+0x52f/0x820
        [7312085.058259]  handle_mm_fault+0xa16/0x1420
        [7312085.066620]  do_page_fault+0x2c6/0x5c0
        [7312085.074459]  page_fault+0x2f/0x40
      
      Fix it by switching to list_del_init_careful() and putting it at the end.
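      
      A minimal sketch of the corrected ordering (simplified; the vtime budget
      accounting of the real function is elided and only the wake/removal part
      described above is shown):
      
        static int iocg_wake_fn(struct wait_queue_entry *wq_entry, unsigned mode,
                                int flags, void *key)
        {
                /* ... budget accounting elided ... */
      
                /* wake the task first ... */
                default_wake_function(wq_entry, mode, flags, key);
                /* ... then publish the removal last, so a racing finish_wait()
                 * cannot see the on-stack wq_entry as removed too early */
                list_del_init_careful(&wq_entry->entry);
                return 0;
        }
      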
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Rik van Riel <riel@surriel.com>
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      caed0df2
  3. 12 May, 2021 1 commit
    • blk-iocost: fix weight updates of inner active iocgs · e9f4eee9
      Tejun Heo authored
      
      When the weight of an active iocg is updated, weight_updated() is called
      which in turn calls __propagate_weights() to update the active and inuse
      weights so that the effective hierarchical weights are updated accordingly.
      
      The current implementation is incorrect for inner active nodes. For an
      active leaf iocg, inuse can be any value between 1 and active and the
      difference represents how much the iocg is donating. When weight is updated,
      as long as inuse is clamped between 1 and the new weight, we're alright and
      this is what __propagate_weights() currently implements.
      
      However, that's not how an active inner node's inuse is set. An inner node's
      inuse is solely determined by the ratio between the sums of inuse's and
      active's of its children - ie. they're results of propagating the leaves'
      active and inuse weights upwards. __propagate_weights() incorrectly applies
      the same clamping as for a leaf when an active inner node's weight is
      updated. Consider a hierarchy which looks like the following with saturating
      workloads in AA and BB.
      
           R
         /   \
        A     B
        |     |
       AA     BB
      
      1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
      
      2. echo 200 > A/io.weight
      
      3. __propagate_weights() updates A's active to 200 and leaves inuse at 100 as
         it's already between 1 and the new active, making A:active=200,
         A:inuse=100. As R's active_sum is updated along with A's active,
         A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
         hwi's remain unchanged at 0.5.
      
      4. The weight of A is now twice that of B but AA and BB still have the same
         hwi of 0.5 and thus are doing the same amount of IOs.
      
      Fix it by making __propagate_weights() always calculate the inuse of an
      active inner iocg based on the ratio of child_inuse_sum to child_active_sum.
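      
      A minimal sketch of the calculation described above (how an "inner active"
      node is detected is abstracted into a hypothetical iocg_is_inner_active()
      helper; child_inuse_sum/child_active_sum are the sums named in the text):
      
        if (iocg_is_inner_active(iocg)) {
                /* inner node: inuse purely follows the children's ratio */
                inuse = DIV64_U64_ROUND_UP((u64)active * iocg->child_inuse_sum,
                                           iocg->child_active_sum);
        } else {
                /* leaf: clamp between 1 and the (possibly new) active weight */
                inuse = clamp_t(u32, inuse, 1, active);
        }
      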
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dan Schatzberg <dschatzberg@fb.com>
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Cc: stable@vger.kernel.org # v5.4+
      Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e9f4eee9
  4. 26 Apr, 2021 1 commit
  5. 05 Jan, 2021 1 commit
    • blk-iocost: fix NULL iocg deref from racing against initialization · d16baa3f
      Tejun Heo authored
      
      When initializing iocost for a queue, its rqos should be registered before
      the blkcg policy is activated to allow policy data initialization to look up
      the associated ioc. This unfortunately means that the rqos methods can be
      called on bios before iocgs are attached to all existing blkgs.
      
      While the race is theoretically possible on ioc_rqos_throttle(), it mostly
      happened in ioc_rqos_merge() due to the difference in how they look up ioc.
      The former determines it from the passed in @rqos and then bails before
      dereferencing iocg if the looked up ioc is disabled, which most likely is
      the case if initialization is still in progress. The latter looked up ioc by
      dereferencing the possibly NULL iocg, making it a lot more prone to actually
      triggering the bug.
      
      * Make ioc_rqos_merge() use the same method as ioc_rqos_throttle() to look
        up ioc for consistency.
      
      * Make ioc_rqos_throttle() and ioc_rqos_merge() test for NULL iocg before
        dereferencing it.
      
      * Explain the danger of NULL iocgs in blk_iocost_init().
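      
      A minimal sketch of the lookup and NULL test described by the first two
      points (heavily abbreviated; rqos_to_ioc() and blkg_to_iocg() are the
      existing blk-iocost helpers):
      
        static void ioc_rqos_merge(struct rq_qos *rqos, struct request *rq,
                                   struct bio *bio)
        {
                struct ioc *ioc = rqos_to_ioc(rqos);    /* same lookup as throttle */
                struct ioc_gq *iocg = blkg_to_iocg(bio->bi_blkg);
      
                /* bail if disabled or still racing against initialization */
                if (!ioc->enabled || !iocg || !iocg->level)
                        return;
      
                /* ... merge cost accounting ... */
        }
      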
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jonathan Lemon <bsd@fb.com>
      Cc: stable@vger.kernel.org # v5.4+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d16baa3f
  6. 17 Dec, 2020 1 commit
  7. 07 Dec, 2020 5 commits
  8. 01 Dec, 2020 1 commit
    • block: simplify bdev/disk lookup in blkdev_get · 22ae8ce8
      Christoph Hellwig authored
      
      To simplify block device lookup and a few other upcoming areas, make sure
      that we always have a struct block_device available for each disk and
      each partition, and only find existing block devices in bdget.  The only
      downside of this is that each device and partition uses a little more
      memory.  The upside will be that a lot of code can be simplified.
      
      With that all we need to look up the block device is to lookup the inode
      and do a few sanity checks on the gendisk, instead of the separate lookup
      for the gendisk.  For blk-cgroup which wants to access a gendisk without
      opening it, a new blkdev_{get,put}_no_open low-level interface is added
      to replace the previous get_gendisk use.
      
      Note that the change to look up block device directly instead of the two
      step lookup using struct gendisk causes a subtle change in behavior:
      accessing a non-existing partition on an existing block device can now
      cause a call to request_module.  That call is harmless, and in practice
      no recent system will access these nodes as they aren't created by udev
      and static /dev/ setups are unusual.
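      
      A hedged usage sketch of the new interface (the wrapper function and its
      error handling are hypothetical; only blkdev_get_no_open()/
      blkdev_put_no_open() come from this patch):
      
        static int inspect_disk(dev_t dev)
        {
                struct block_device *bdev;
      
                bdev = blkdev_get_no_open(dev);   /* no exclusive open, no holder */
                if (!bdev)
                        return -ENODEV;
      
                /* the gendisk is reachable without opening the device */
                pr_info("disk: %s\n", bdev->bd_disk->disk_name);
      
                blkdev_put_no_open(bdev);
                return 0;
        }
      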
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      22ae8ce8
  9. 09 Oct, 2020 1 commit
  10. 08 Oct, 2020 1 commit
  11. 25 Sep, 2020 6 commits
    • iocost: consider iocgs with active delays for debt forgiveness · bec02dbb
      Tejun Heo authored
      
      An iocg may have 0 debt but non-zero delay. The current debt forgiveness
      logic doesn't act on such iocgs. This can lead to unexpected behaviors - an
      iocg with a little bit of debt will have its delay canceled through debt
      forgiveness, but one with no debt and an active delay will have to wait
      until its delay decays away.
      
      This patch updates the debt handling logic so that it treats delays the same
      as debts. If either debt or delay is active, debt forgiveness logic kicks in
      and acts on both the same way.
      
      Also, avoid turning the debt and delay directly to zero as that can confuse
      state transitions.
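      
      A minimal sketch of the updated trigger (surrounding bookkeeping elided;
      abs_vdebt and delay are the existing iocg fields):
      
        /* an active delay now counts the same as outstanding debt */
        if (iocg->abs_vdebt || iocg->delay) {
                /* ... apply the forgiveness path to both, decaying them
                 * rather than zeroing to keep state transitions sane ... */
        }
      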
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bec02dbb
    • iocost: add iocg_forgive_debt tracepoint · c5a6561b
      Tejun Heo authored
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c5a6561b
    • iocost: reimplement debt forgiveness using average usage · c7af2a00
      Tejun Heo authored
      
      Debt forgiveness logic was counting the number of consecutive !busy periods
      as the trigger condition. While this usually works, it can easily be thrown
      off by temporary fluctuations especially on configurations w/ short periods.
      
      This patch reimplements debt forgiveness so that:
      
      * The average usage over the forgiveness period is used instead of a count
        of consecutive periods.
      
      * Debt is reduced at around the target rate (1/2 every 100ms) regardless of
        ioc period duration.
      
      * Usage threshold is raised to 50%. Combined with the preceding changes and
        the switch to average usage, this makes debt forgiveness a lot more
        effective at reducing the amount of unnecessary idleness.
      
      * Constants are renamed with DFGV_ prefix.
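      
      A rough illustration of the target decay rate (not the upstream code;
      nr_periods, usage_pct and the coarse shift-based halving are
      simplifications for the sketch):
      
        u64 span_us = nr_periods * ioc->period_us;        /* time this pass covers */
        u64 nr_halvings = div64_u64(span_us, 100 * USEC_PER_MSEC);
      
        if (usage_pct < 50 && nr_halvings) {              /* below the usage threshold */
                iocg->abs_vdebt >>= nr_halvings;
                iocg->delay >>= nr_halvings;
        }
      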
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c7af2a00
    • iocost: recalculate delay after debt reduction · d9517841
      Tejun Heo authored
      
      Debt sets the initial delay duration which is decayed over time. The current
      debt reduction halved the debt but didn't change the delay. It prevented
      future debts from increasing delay but didn't do anything to lower the
      existing delay, limiting the mechanism's ability to reduce unnecessary
      idling.
      
      Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
      recalculates new delay value based on the reduced debt amount.
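      
      A minimal sketch (the iocg_kick_waitq() signature with @pay_debt is taken
      from the related patch in this series; locking elided):
      
        iocg->abs_vdebt /= 2;               /* debt forgiveness */
        iocg->delay = 0;                    /* force recalculation ... */
        iocg_kick_waitq(iocg, true, &now);  /* ... from the reduced debt */
      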
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d9517841
    • iocost: replace nr_shortages cond in ioc_forgive_debts() with busy_level one · 33a1fe6d
      Tejun Heo authored
      
      Debt reduction was blocked if any iocg was short on budget in the past
      period to avoid reducing debts while some iocgs are saturated. However, this
      ends up unnecessarily blocking debt reduction due to temporary local
      imbalances when the device is generally being underutilized, while also
      failing to block when the underlying device is overwhelmed and the usage
      becomes low from high latency.
      
      Given that debt accumulation mostly happens with swapout bursts which can
      significantly deteriorate the underlying device's latency response, the
      current logic is not great.
      
      Let's replace it with an ioc->busy_level based condition so that we block debt
      reduction when the underlying device is being saturated. The ioc_forgive_debts()
      call is moved after busy_level determination.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      33a1fe6d
    • iocost: factor out ioc_forgive_debts() · ab8df828
      Tejun Heo authored
      
      Debt reduction logic is going to be improved and expanded. Factor it out
      into ioc_forgive_debts() and generalize the comment a bit. No functional
      change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ab8df828
  12. 14 Sep, 2020 1 commit
    • iocost: fix infinite loop bug in adjust_inuse_and_calc_cost() · aa67db24
      Tejun Heo authored
      adjust_inuse_and_calc_cost() is responsible for reducing the amount of
      donated weights dynamically in period as the budget runs low. Because we
      don't want to do full donation calculation in period, we keep latching up
      inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the
      resulting hweight_inuse is satisfactory.
      
      Unfortunately, the adj_step calculation was reading the active weight before
      acquiring ioc->lock. Because the current thread could have lost the race to
      activate the iocg to another thread before entering this function, it may
      read the active weight as zero before acquiring ioc->lock. When this
      happens, the adj_step is calculated as zero and the incremental adjustment
      loop becomes an infinite one.
      
      Fix it by fetching the active weight after acquiring ioc->lock.
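      
      A minimal sketch of the fix (the surrounding inuse-latching loop is
      elided; INUSE_ADJ_STEP_PCT and ->active are the names used above):
      
        spin_lock_irq(&ioc->lock);
      
        /* read ->active only under ioc->lock; read earlier it could still be
         * 0, making adj_step 0 and the latching loop below spin forever */
        adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);
      
        /* ... latch inuse up by adj_step until hweight_inuse is enough ... */
      
        spin_unlock_irq(&ioc->lock);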
      
      Fixes: b0853ab4 ("blk-iocost: revamp in-period donation snapbacks")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      aa67db24
  13. 11 Sep, 2020 1 commit
  14. 02 Sep, 2020 18 commits
    • blk-iocost: add three debug stat - cost.wait, indebt and indelay · f0bf84a5
      Tejun Heo authored
      
      These are really cheap to collect and can be useful in debugging iocost
      behavior. Add them as debug stats for now.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f0bf84a5
    • blk-iocost: restore inuse update tracepoints · 04603755
      Tejun Heo authored
      
      Update and restore the inuse update tracepoints.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      04603755
    • blk-iocost: implement vtime loss compensation · ac33e91e
      Tejun Heo authored
      
      When an iocg accumulates too much vtime or gets deactivated, we throw away
      some vtime, which lowers the overall device utilization. As the exact amount
      which is being thrown away is known, we can compensate by accelerating the
      vrate accordingly so that the extra vtime generated in the current period
      matches what got lost.
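      
      Conceptually (illustrative only; the variable names are hypothetical and
      the real implementation folds the compensation into the existing vrate
      calculation):
      
        /* regenerate the vtime thrown away within the coming period */
        u64 vrate = base_vrate + div64_u64(vtime_lost, period_us);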
      
      This significantly improves work conservation when involving high weight
      cgroups with intermittent and bursty IO patterns.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ac33e91e
    • blk-iocost: halve debts if device stays idle · dda1315f
      Tejun Heo authored
      
      A low weight iocg can amass a large amount of debt, for example, when
      anonymous memory gets reclaimed aggressively. If the system has a lot of
      memory paired with a slow IO device, the debt can span multiple seconds or
      more. If there are no other subsequent IO issuers, the in-debt iocg may end
      up blocked paying its debt while the IO device is idle.
      
      This patch implements a mechanism to protect against such pathological
      cases. If the device has been sufficiently idle for a substantial amount of
      time, the debts are halved. The criteria are on the conservative side as we
      want to resolve the rare extreme cases without impacting regular operation
      by forgiving debts too readily.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dda1315f
    • blk-iocost: implement delay adjustment hysteresis · 5160a5a5
      Tejun Heo authored
      
      Currently, iocost syncs the delay duration to the outstanding debt amount,
      which seemed enough to protect the system from anon memory hogs. However,
      that was mostly because the delay calculation was using hweight_inuse, which
      quickly converges towards zero under debt for delay duration calculation,
      often punishing debtors overly harshly for longer than deserved.
      
      The previous patch fixed the delay calculation and now the protection against
      anonymous memory hogs isn't enough because the effect of delay is indirect
      and non-linear and a huge amount of future debt can accumulate abruptly
      while unthrottled.
      
      This patch implements delay hysteresis so that delay is decayed
      exponentially over time instead of getting cleared immediately as debt is
      paid off. While the overall behavior is similar to the blk-cgroup
      implementation used by blk-iolatency, a lot of the details are different and
      due to the empirical nature of the mechanism, it's challenging to adapt the
      mechanism for one controller without negatively impacting the other.
      
      As the delay is gradually decayed now, there's no point in running it from
      its own hrtimer. Periodic updates are now performed from ioc_timer_fn() and
      the dedicated hrtimer is removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5160a5a5
    • blk-iocost: revamp debt handling · c421a3eb
      Tejun Heo authored
      
      Debt handling had several issues.
      
      * How much inuse a debtor carries wasn't clearly defined. inuse would be
        driven down over time from not issuing IOs but it'd be better to clamp it
        to minimum immediately once in debt.
      
      * How much can be paid off was determined by hweight_inuse. As inuse was
        driven down, the payment amount would fall together regardless of the
        debtor's active weight. This means that the debtors were punished harshly.
      
      * ioc_rqos_merge() wasn't calling blkcg_schedule_throttle() after
        iocg_kick_delay().
      
      This patch revamps debt handling so that
      
      * Debt handling owns inuse for iocgs in debt and keeps them at zero.
      
      * Payment amount is determined by hweight_active. This is more deterministic
        and safer than hweight_inuse but still far from ideal in that it doesn't
        factor in possible donations from other iocgs for debt payments. This
        likely needs further improvements in the future.
      
      * ioc_rqos_merge() now calls blkcg_schedule_throttle() as necessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c421a3eb
    • blk-iocost: revamp in-period donation snapbacks · b0853ab4
      Tejun Heo authored
      
      When the margin drops below the minimum on a donating iocg, donation is
      immediately canceled in full. There are a couple shortcomings with the
      current behavior.
      
      * It's abrupt. A small temporary budget deficit can lead to a wide swing in
        weight allocation and a large surplus.
      
      * It's open coded in the issue path but not implemented for the merge path.
        A series of merges at a low inuse can make the iocg incur debts and stall
        incorrectly.
      
      This patch reimplements in-period donation snapbacks so that
      
      * inuse adjustment and cost calculations are factored into
        adjust_inuse_and_calc_cost() which is called from both the issue and merge
        paths.
      
      * Snapbacks are more gradual. They occur in quarter steps.
      
      * A snapback triggers if the margin goes below the low threshold and is
        lower than the budget at the time of the last adjustment.
      
      * For the above, __propagate_weights() stores the margin in
        iocg->saved_margin. Move iocg->last_inuse storing together into
        __propagate_weights() for consistency.
      
      * Full snapback is guaranteed when there are waiters.
      
      * With precise donation and gradual snapbacks, inuse adjustments are now a
        lot more effective and the value of scaling inuse on weight changes isn't
        clear. Removed inuse scaling from weight_updated().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b0853ab4
    • blk-iocost: revamp donation amount determination · f1de2439
      Tejun Heo authored
      
      iocost has various safety nets to combat inuse adjustment calculation
      inaccuracies. With Andy's method implemented in transfer_surpluses(), inuse
      adjustment calculations are now accurate and we can make donation amount
      determinations accurate too.
      
      * Stop keeping track of past usage history and using the maximum. Act on the
        immediate usage information.
      
      * Remove donation constraints defined by SURPLUS_* constants. Donate
        whatever isn't used.
      
      * Determine the donation amount so that the iocg will end up with
        MARGIN_TARGET_PCT budget at the end of the coming period assuming the same
        usage as the previous period. TARGET is set at 50% of period, which is the
        previous maximum. This provides smooth convergence for most repetitive IO
        patterns.
      
      * Apply donation logic early at 20% budget. There's no risk in doing so as
        the calculation is based on the delta between the current budget and the
        target budget at the end of the coming period.
      
      * Remove preemptive iocg activation for zero cost IOs. As donation can reach
        near zero now, the mere activation doesn't provide any protection anymore.
        In the unlikely case that this becomes a problem, the right solution is
        assigning appropriate costs for such IOs.
      
      This significantly improves the donation determination logic while also
      simplifying it. Now all donations are immediate, exact and smooth.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f1de2439
    • blk-iocost: implement Andy's method for donation weight updates · e08d02aa
      Tejun Heo authored
      iocost implements work conservation by reducing iocg->inuse and propagating
      the adjustment upwards proportionally. However, while I knew the target
      absolute hierarchical proportion - adjusted hweight_inuse, I couldn't figure
      out how to determine the iocg->inuse adjustment to achieve that and
      approximated the adjustment by scaling iocg->inuse using the proportion of
      the needed hweight_inuse changes.
      
      When nested, these scalings aren't accurate even when adjusting a single
      node as the donating node also receives the benefit of the donated portion.
      When multiple nodes are donating as they often do, they can be wildly wrong.
      
      iocost employed various safety nets to combat the inaccuracies. There are
      ample buffers in determining how much to donate, the adjustments are
      conservative and gradual. While it can achieve a reasonable level of work
      conservation in simple scenarios, the inaccuracies can easily add up leading
      to significant loss of total work. This in turn makes it difficult to
      closely cap vrate as vrate adjustment is needed to compensate for the loss
      of work. The combination of inaccurate donation calculations and vrate
      adjustments can lead to wide fluctuations and clunky overall behaviors.
      
      Andy Newell devised a method to calculate the needed ->inuse updates to
      achieve the target hweight_inuse's. The method is compatible with the
      proportional inuse adjustment propagation which allows all hot path
      operations to be local to each iocg.
      
      To roughly summarize, Andy's method divides the tree into donating and
      non-donating parts, calculates global donation rate which is used to
      determine the target hweight_inuse for each node, and then derives per-level
      proportions. There's a non-trivial amount of math involved. Please refer to
      the following pdfs for detailed descriptions.
      
        https://drive.google.com/file/d/1PsJwxPFtjUnwOY1QJ5AeICCcsL7BM3bo
        https://drive.google.com/file/d/1vONz1-fzVO7oY5DXXsLjSxEtYYQbOvsE
        https://drive.google.com/file/d/1WcrltBOSPN0qXVdBgnKm4mdp9FhuEFQN
      
      
      
      This patch implements Andy's method in transfer_surpluses(). This makes the
      donation calculations accurate per cycle and enables further improvements in
      other parts of the donation logic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e08d02aa
    • blk-iocost: restructure surplus donation logic · 93f7d2db
      Tejun Heo authored
      
      The way the surplus donation logic is structured isn't great. There are two
      separate paths for starting/increasing donations and for decreasing them,
      making the logic harder to follow and prone to unnecessary behavior differences.
      
      In preparation for improved donation handling, this patch restructures the
      code so that
      
      * All donors - new, increasing and decreasing - are funneled through the
        same code path.
      
      * The target donation calculation is factored into hweight_after_donation()
        which is called once from the same spot for all possible donors.
      
      * Actual inuse adjustment is factored into transfer_surpluses().
      
      This change introduces a few behavior differences - e.g. donation amount
      reduction now uses the max usage of the recent three periods just like new
      and increasing donations, and inuse now gets adjusted upwards the same way
      it gets downwards. These differences are unlikely to have severely negative
      implications and the whole logic will be revamped soon.
      
      This patch also removes two tracepoints. The existing TPs don't quite fit
      the new implementation. A later patch will update and reinstate them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      93f7d2db
    • blk-iocost: decouple vrate adjustment from surplus transfers · 065655c8
      Tejun Heo authored
      
      Budget donations are inaccurate and could take multiple periods to converge.
      To prevent triggering vrate adjustments while surplus transfers were
      catching up, vrate adjustment was suppressed if donations were increasing,
      which was indicated by non-zero nr_surpluses.
      
      This entangling won't be necessary with the scheduled rewrite of donation
      mechanism which will make it precise and immediate. Let's decouple the two
      in preparation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      065655c8
    • blk-iocost: replace iocg->has_surplus with ->surplus_list · 8692d2db
      Tejun Heo authored
      
      Instead of marking iocgs with surplus with a flag and filtering for them
      while walking all active iocgs, build a surpluses list. This doesn't make
      much difference now but will help implementing improved donation logic which
      will iterate iocgs with surplus multiple times.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8692d2db
    • blk-iocost: calculate iocg->usages[] from iocg->local_stat.usage_us · 1aa50d02
      Tejun Heo authored
      
      Currently, iocg->usages[] which are used to guide inuse adjustments are
      calculated from vtime deltas. This, however, assumes that the hierarchical
      inuse weight at the time of calculation held for the entire period, which
      often isn't true and can lead to significant errors.
      
      Now that we have absolute usage information collected, we can derive
      iocg->usages[] from iocg->local_stat.usage_us so that inuse adjustment
      decisions are made based on actual absolute usage. The calculated usage is
      clamped between 1 and WEIGHT_ONE and WEIGHT_ONE is also used to signal
      saturation regardless of the current hierarchical inuse weight.
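      
      A minimal sketch of the conversion (variable names simplified; the real
      code divides by the portion of the period the iocg was actually active):
      
        /* absolute usage time -> WEIGHT_ONE-based ratio, clamped as described */
        usage = clamp_t(u32,
                        DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE, period_us),
                        1, WEIGHT_ONE);
      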
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1aa50d02
    • blk-iocost: add absolute usage stat · 97eb1975
      Tejun Heo authored
      
      Currently, iocost doesn't collect or expose any statistics, punting off all
      monitoring duties to the drgn-based iocost_monitor.py. While it works for some
      scenarios, there are some usability and data availability challenges. For
      example, accurate per-cgroup usage information can't be tracked by vtime
      progression at all and the numbers available in iocg->usages[] are really
      short-term snapshots used for control heuristics with possibly significant
      errors.
      
      This patch implements per-cgroup absolute usage stat counter and exposes it
      through io.stat along with the current vrate. Usage stat collection and
      flushing employ the same method as cgroup rstat on the active iocg's and the
      only hot path overhead is preemption toggling and adding to a percpu
      counter.
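      
      A minimal sketch of the hot-path side (the percpu stat struct and field
      names are assumptions; flushing into io.stat happens later from the
      periodic timer):
      
        /* only preemption toggling plus a local64 add in the issue path */
        struct iocg_pcpu_stat *pstat = get_cpu_ptr(iocg->pcpu_stat);
      
        local64_add(abs_cost, &pstat->abs_vusage);
        put_cpu_ptr(iocg->pcpu_stat);
      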
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      97eb1975
    • blk-iocost: grab ioc->lock for debt handling · da437b95
      Tejun Heo authored
      
      Currently, debt handling requires only iocg->waitq.lock. In the future, we
      want to adjust and propagate inuse changes depending on debt status. Let's
      grab ioc->lock in debt handling paths in preparation.
      
      * Because ioc->lock nests outside iocg->waitq.lock, the decision to grab
        ioc->lock needs to be made before entering the critical sections.
      
      * Add and use iocg_[un]lock() which handles the conditional double locking.
      
      * Add @pay_debt to iocg_kick_waitq() so that debt payment happens only when
        the caller grabbed both locks.
      
      This patch is preparatory and the comments contain references to future
      changes.
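      
      A minimal sketch of the conditional double locking (consistent with the
      nesting described above; the matching iocg_unlock() mirrors it):
      
        static void iocg_lock(struct ioc_gq *iocg, bool lock_ioc, unsigned long *flags)
        {
                if (lock_ioc) {
                        /* ioc->lock nests outside iocg->waitq.lock */
                        spin_lock_irqsave(&iocg->ioc->lock, *flags);
                        spin_lock(&iocg->waitq.lock);
                } else {
                        spin_lock_irqsave(&iocg->waitq.lock, *flags);
                }
        }
      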
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      da437b95
    • blk-iocost: streamline vtime margin and timer slack handling · 7ca5b2e6
      Tejun Heo authored
      
      The margin handling was pretty inconsistent.
      
      * ioc->margin_us and ioc->inuse_margin_vtime were used as vtime margin
        thresholds. However, the two are in different units with the former
        requiring conversion to vtime on use.
      
      * iocg_kick_waitq() was using a quarter of WAITQ_TIMER_MARGIN_PCT of
        period_us as the timer slack - ~1.2%. While iocg_kick_delay() was using a
        quarter of ioc->margin_us - ~12.5%. There aren't strong reasons to use
        different values for the two.
      
      This patch cleans up margin and timer slack handling:
      
      * vtime margins are now recorded in ioc->margins.{min, max} on period
        duration changes and used consistently.
      
      * Timer slack is now 1% of period_us and recorded in ioc->timer_slack_ns and
        used consistently for iocg_kick_waitq() and iocg_kick_delay().
      
      The only functional change is shortening of timer slack. No meaningful
      visible change is expected.
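      
      A minimal sketch of the slack calculation (TIMER_SLACK_PCT is an assumed
      constant name for the 1% figure; recomputed when period_us changes):
      
        ioc->timer_slack_ns = div64_u64((u64)ioc->period_us * NSEC_PER_USEC *
                                        TIMER_SLACK_PCT, 100);
      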
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7ca5b2e6
    • blk-iocost: make ioc_now->now and ioc->period_at 64bit · ce95570a
      Tejun Heo authored
      
      They are in microseconds and wrap in around 1.2 hours with u32. While
      unlikely, confusions from wraparounds are still possible. We aren't saving
      anything meaningful by keeping these u32. Let's make them u64.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ce95570a
    • blk-iocost: use WEIGHT_ONE based fixed point number for weights · bd0adb91
      Tejun Heo authored
      
      To improve weight donations, we want to be able to scale inuse with greater
      accuracy and down below 1. Let's make non-hierarchical weights use
      WEIGHT_ONE based fixed point numbers too, like hierarchical ones.
      
      This doesn't cause any behavior changes yet.
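      
      A minimal illustration (the WEIGHT_ONE value and the weight-write path
      shown are assumptions; the point is that configured weights are stored
      pre-scaled):
      
        #define WEIGHT_ONE      (1 << 16)       /* fixed point "1.0" (assumed) */
      
        /* a user-configured weight of, say, 100 is stored scaled ... */
        iocg->cfg_weight = user_weight * WEIGHT_ONE;
        /* ... so inuse can later be adjusted with sub-1 granularity */
      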
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bd0adb91