1. 31 Jul, 2020 1 commit
  2. 17 Jul, 2020 1 commit
  3. 15 Jul, 2020 1 commit
    • Jens Axboe's avatar
      block: relax jiffies rounding for timeouts · 9054650f
      Jens Axboe authored
      
      In doing high IOPS testing, blk-mq is generally pretty well optimized.
      There are a few things that stuck out as using more CPU than what is
      really warranted, and one thing is the round_jiffies_up() that we do
      twice for each request. That accounts for about 0.8% of the CPU in
      my testing.
      
      We can make this cheaper by avoiding an integer division, by just adding
      a rough HZ mask that we can AND with instead. The timeouts are only on a
      second granularity already, we don't have to be that accurate here and
      this patch barely changes that. All we care about is nice grouping.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9054650f
  4. 24 Jun, 2020 1 commit
  5. 30 Apr, 2019 1 commit
  6. 15 Nov, 2018 2 commits
  7. 09 Nov, 2018 1 commit
  8. 07 Nov, 2018 1 commit
  9. 23 Jun, 2018 1 commit
  10. 29 May, 2018 3 commits
  11. 02 Apr, 2018 1 commit
  12. 08 Mar, 2018 1 commit
  13. 10 Jan, 2018 1 commit
  14. 09 Jan, 2018 3 commits
    • Tejun Heo's avatar
      blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq · 634f9e46
      Tejun Heo authored
      
      After the recent updates to use generation number and state based
      synchronization, blk-mq no longer depends on REQ_ATOM_COMPLETE except
      to avoid firing the same timeout multiple times.
      
      Remove all REQ_ATOM_COMPLETE usages and use a new rq_flags flag
      RQF_MQ_TIMEOUT_EXPIRED to avoid firing the same timeout multiple
      times.  This removes atomic bitops from hot paths too.
      
      v2: Removed blk_clear_rq_complete() from blk_mq_rq_timed_out().
      
      v3: Added RQF_MQ_TIMEOUT_EXPIRED flag.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      634f9e46
    • Tejun Heo's avatar
      blk-mq: make blk_abort_request() trigger timeout path · 358f70da
      Tejun Heo authored
      
      With issue/complete and timeout paths now using the generation number
      and state based synchronization, blk_abort_request() is the only one
      which depends on REQ_ATOM_COMPLETE for arbitrating completion.
      
      There's no reason for blk_abort_request() to be a completely separate
      path.  This patch makes blk_abort_request() piggyback on the timeout
      path instead of trying to terminate the request directly.
      
      This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.
      
      Note that this makes blk_abort_request() asynchronous - it initiates
      abortion but the actual termination will happen after a short while,
      even when the caller owns the request.  AFAICS, SCSI and ATA should be
      fine with that and I think mtip32xx and dasd should be safe but not
      completely sure.  It'd be great if people who know the drivers take a
      look.
      
      v2: - Add comment explaining the lack of synchronization around
            ->deadline update as requested by Bart.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Asai Thambi SP <asamymuthupa@micron.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      358f70da
    • Tejun Heo's avatar
      blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo authored
      
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1d9bd516
  15. 30 Oct, 2017 1 commit
  16. 04 Oct, 2017 1 commit
    • Peter Zijlstra's avatar
      blk-mq: attempt to fix atomic flag memory ordering · a7af0af3
      Peter Zijlstra authored
      Attempt to untangle the ordering in blk-mq. The patch introducing the
      single smp_mb__before_atomic() is obviously broken in that it doesn't
      clearly specify a pairing barrier and an obtained guarantee.
      
      The comment is further misleading in that it hints that the
      deadline store and the COMPLETE store also need to be ordered, but
      AFAICT there is no such dependency. However what does appear to be
      important is the clear happening _after_ the store, and that worked by
      pure accident.
      
      This clarifies blk_mq_start_request() -- we should not get there with
      STARTING set -- this simplifies the code and makes the barrier usage
      sane (the old code could be read to allow not having _any_ atomic after
      the barrier, in which case the barrier hasn't got anything to order). We
      then also introduce the missing pairing barrier for it.
      
      Also down-grade the barrier to smp_wmb(), this is cheaper for
      PowerPC/ARM and doesn't cost anything extra on x86.
      
      And it documents the STARTING vs COMPLETE ordering. Although I've not
      been entirely successful in reverse engineering the blk-mq state
      machine so there might still be more funnies around timeout vs
      requeue.
      
      If I got anything wrong, feel free to educate me by adding comments to
      clarify things ;-)
      
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ming Lei <tom.leiming@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Fixes: 538b7534
      
       ("blk-mq: request deadline must be visible before marking rq as started")
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a7af0af3
  17. 21 Jun, 2017 1 commit
  18. 20 Apr, 2017 1 commit
  19. 22 Dec, 2015 2 commits
    • Christoph Hellwig's avatar
      block: remove REQ_NO_TIMEOUT flag · bbc758ec
      Christoph Hellwig authored
      
      This was added for the 'magic' AEN requests in the NVMe driver that never
      return.  We now handle them purely inside the driver and don't need this
      core hack any more.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarKeith Busch <keith.busch@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      bbc758ec
    • Christoph Hellwig's avatar
      block: defer timeouts to a workqueue · 287922eb
      Christoph Hellwig authored
      
      Timer context is not very useful for drivers to perform any meaningful abort
      action from.  So instead of calling the driver from this useless context
      defer it to a workqueue as soon as possible.
      
      Note that while a delayed_work item would seem the right thing here I didn't
      dare to use it due to the magic in blk_add_timer that pokes deep into timer
      internals.  But maybe this encourages Tejun to add a sensible API for that to
      the workqueue API and we'll all be fine in the end :)
      
      Contains a major update from Keith Bush:
      
      "This patch removes synchronizing the timeout work so that the timer can
       start a freeze on its own queue. The timer enters the queue, so timer
       context can only start a freeze, but not wait for frozen."
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarKeith Busch <keith.busch@intel.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      287922eb
  20. 24 Nov, 2015 2 commits
  21. 08 Jan, 2015 1 commit
  22. 22 Sep, 2014 3 commits
  23. 30 May, 2014 1 commit
  24. 13 May, 2014 1 commit
    • Jens Axboe's avatar
      blk-mq: improve support for shared tags maps · 0d2602ca
      Jens Axboe authored
      
      This adds support for active queue tracking, meaning that the
      blk-mq tagging maintains a count of active users of a tag set.
      This allows us to maintain a notion of fairness between users,
      so that we can distribute the tag depth evenly without starving
      some users while allowing others to try unfair deep queues.
      
      If sharing of a tag set is detected, each hardware queue will
      track the depth of its own queue. And if this exceeds the total
      depth divided by the number of active queues, the user is actively
      throttled down.
      
      The active queue count is done lazily to avoid bouncing that data
      between submitter and completer. Each hardware queue gets marked
      active when it allocates its first tag, and gets marked inactive
      when 1) the last tag is cleared, and 2) the queue timeout grace
      period has passed.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0d2602ca
  25. 25 Apr, 2014 1 commit
  26. 24 Apr, 2014 1 commit
    • Jens Axboe's avatar
      blk-mq: fix race with timeouts and requeue events · 87ee7b11
      Jens Axboe authored
      
      If a requeue event races with a timeout, we can get into the
      situation where we attempt to complete a request from the
      timeout handler when it's not start anymore. This causes a crash.
      So have the timeout handler check that REQ_ATOM_STARTED is still
      set on the request - if not, we ignore the event. If this happens,
      the request has now been marked as complete. As a consequence, we
      need to ensure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
      as to maintain proper request state.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      87ee7b11
  27. 16 Apr, 2014 1 commit
    • Jens Axboe's avatar
      block: relax when to modify the timeout timer · f793aa53
      Jens Axboe authored
      
      Since we are now, by default, applying timer slack to expiry times,
      the logic for when to modify a timer in the block code is suboptimal.
      The block layer keeps a forward rolling timer per queue for all
      requests, and modifies this timer if a request has a shorter timeout
      than what the current expiry time is. However, this breaks down
      when our rounded timer values get applied slack. Then each new
      request ends up modifying the timer, since we're still a little
      in front of the timer + slack.
      
      Fix this by allowing a tolerance of HZ / 2, the timeout handling
      doesn't need to be very precise. This drastically cuts down
      the number of timer modifications we have to make.
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      f793aa53
  28. 10 Feb, 2014 1 commit
    • Christoph Hellwig's avatar
      blk-mq: rework I/O completions · 30a91cb4
      Christoph Hellwig authored
      
      Rework I/O completions to work more like the old code path.  blk_mq_end_io
      now stays out of the business of deferring completions to others CPUs
      and calling blk_mark_rq_complete.  The latter is very important to allow
      completing requests that have timed out and thus are already marked completed,
      the former allows using the IPI callout even for driver specific completions
      instead of having to reimplement them.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      30a91cb4
  29. 08 Nov, 2013 2 commits
    • Duan Jiong's avatar
      block: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO · 8616ebb1
      Duan Jiong authored
      
      This patch fixes coccinelle error regarding usage of IS_ERR and
      PTR_ERR instead of PTR_ERR_OR_ZERO.
      Signed-off-by: default avatarDuan Jiong <duanj.fnst@cn.fujitsu.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8616ebb1
    • Jeff Moyer's avatar
      block: fix race between request completion and timeout handling · 4912aa6c
      Jeff Moyer authored
      
      crocode i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support shpchp ioatdma dca be2net sg ses enclosure ext4 mbcache jbd2 sd_mod crc_t10dif ahci megaraid_sas(U) dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
      
      Pid: 491, comm: scsi_eh_0 Tainted: G        W  ----------------   2.6.32-220.13.1.el6.x86_64 #1 IBM  -[8722PAX]-/00D1461
      RIP: 0010:[<ffffffff8124e424>]  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
      RSP: 0018:ffff881057eefd60  EFLAGS: 00010012
      RAX: ffff881d99e3e8a8 RBX: ffff881d99e3e780 RCX: ffff881d99e3e8a8
      RDX: ffff881d99e3e8a8 RSI: ffff881d99e3e780 RDI: ffff881d99e3e780
      RBP: ffff881057eefd80 R08: ffff881057eefe90 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff881057f92338
      R13: 0000000000000000 R14: ffff881057f92338 R15: ffff883058188000
      FS:  0000000000000000(0000) GS:ffff880040200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
      CR2: 00000000006d3ec0 CR3: 000000302cd7d000 CR4: 00000000000406b0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process scsi_eh_0 (pid: 491, threadinfo ffff881057eee000, task ffff881057e29540)
      Stack:
       0000000000001057 0000000000000286 ffff8810275efdc0 ffff881057f16000
      <0> ffff881057eefdd0 ffffffff81362323 ffff881057eefe20 ffffffff8135f393
      <0> ffff881057e29af8 ffff8810275efdc0 ffff881057eefe78 ffff881057eefe90
      Call Trace:
       [<ffffffff81362323>] __scsi_queue_insert+0xa3/0x150
       [<ffffffff8135f393>] ? scsi_eh_ready_devs+0x5e3/0x850
       [<ffffffff81362a23>] scsi_queue_insert+0x13/0x20
       [<ffffffff8135e4d4>] scsi_eh_flush_done_q+0x104/0x160
       [<ffffffff8135fb6b>] scsi_error_handler+0x35b/0x660
       [<ffffffff8135f810>] ? scsi_error_handler+0x0/0x660
       [<ffffffff810908c6>] kthread+0x96/0xa0
       [<ffffffff8100c14a>] child_rip+0xa/0x20
       [<ffffffff81090830>] ? kthread+0x0/0xa0
       [<ffffffff8100c140>] ? child_rip+0x0/0x20
      Code: 00 00 eb d1 4c 8b 2d 3c 8f 97 00 4d 85 ed 74 bf 49 8b 45 00 49 83 c5 08 48 89 de 4c 89 e7 ff d0 49 8b 45 00 48 85 c0 75 eb eb a4 <0f> 0b eb fe 0f 1f 84 00 00 00 00 00 55 48 89 e5 0f 1f 44 00 00
      RIP  [<ffffffff8124e424>] blk_requeue_request+0x94/0xa0
       RSP <ffff881057eefd60>
      
      The RIP is this line:
              BUG_ON(blk_queued_rq(rq));
      
      After digging through the code, I think there may be a race between the
      request completion and the timer handler running.
      
      A timer is started for each request put on the device's queue (see
      blk_start_request->blk_add_timer).  If the request does not complete
      before the timer expires, the timer handler (blk_rq_timed_out_timer)
      will mark the request complete atomically:
      
      static inline int blk_mark_rq_complete(struct request *rq)
      {
              return test_and_set_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
      }
      
      and then call blk_rq_timed_out.  The latter function will call
      scsi_times_out, which will return one of BLK_EH_HANDLED,
      BLK_EH_RESET_TIMER or BLK_EH_NOT_HANDLED.  If BLK_EH_RESET_TIMER is
      returned, blk_clear_rq_complete is called, and blk_add_timer is again
      called to simply wait longer for the request to complete.
      
      Now, if the request happens to complete while this is going on, what
      happens?  Given that we know the completion handler will bail if it
      finds the REQ_ATOM_COMPLETE bit set, we need to focus on the completion
      handler running after that bit is cleared.  So, from the above
      paragraph, after the call to blk_clear_rq_complete.  If the completion
      sets REQ_ATOM_COMPLETE before the BUG_ON in blk_add_timer, we go boom
      there (I haven't seen this in the cores).  Next, if we get the
      completion before the call to list_add_tail, then the timer will
      eventually fire for an old req, which may either be freed or reallocated
      (there is evidence that this might be the case).  Finally, if the
      completion comes in *after* the addition to the timeout list, I think
      it's harmless.  The request will be removed from the timeout list,
      req_atom_complete will be set, and all will be well.
      
      This will only actually explain the coredumps *IF* the request
      structure was freed, reallocated *and* queued before the error handler
      thread had a chance to process it.  That is possible, but it may make
      sense to keep digging for another race.  I think that if this is what
      was happening, we would see other instances of this problem showing up
      as null pointer or garbage pointer dereferences, for example when the
      request structure was not re-used.  It looks like we actually do run
      into that situation in other reports.
      
      This patch moves the BUG_ON(test_bit(REQ_ATOM_COMPLETE,
      &req->atomic_flags)); from blk_add_timer to the only caller that could
      trip over it (blk_start_request).  It then inverts the calls to
      blk_clear_rq_complete and blk_add_timer in blk_rq_timed_out to address
      the race.  I've boot tested this patch, but nothing more.
      Signed-off-by: default avatarJeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarHannes Reinecke <hare@suse.de>
      Cc: stable@kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4912aa6c
  30. 25 Oct, 2013 1 commit
    • Jens Axboe's avatar
      blk-mq: new multi-queue block IO queueing mechanism · 320ae51f
      Jens Axboe authored
      
      Linux currently has two models for block devices:
      
      - The classic request_fn based approach, where drivers use struct
        request units for IO. The block layer provides various helper
        functionalities to let drivers share code, things like tag
        management, timeout handling, queueing, etc.
      
      - The "stacked" approach, where a driver squeezes in between the
        block layer and IO submitter. Since this bypasses the IO stack,
        driver generally have to manage everything themselves.
      
      With drivers being written for new high IOPS devices, the classic
      request_fn based driver doesn't work well enough. The design dates
      back to when both SMP and high IOPS was rare. It has problems with
      scaling to bigger machines, and runs into scaling issues even on
      smaller machines when you have IOPS in the hundreds of thousands
      per device.
      
      The stacked approach is then most often selected as the model
      for the driver. But this means that everybody has to re-invent
      everything, and along with that we get all the problems again
      that the shared approach solved.
      
      This commit introduces blk-mq, block multi queue support. The
      design is centered around per-cpu queues for queueing IO, which
      then funnel down into x number of hardware submission queues.
      We might have a 1:1 mapping between the two, or it might be
      an N:M mapping. That all depends on what the hardware supports.
      
      blk-mq provides various helper functions, which include:
      
      - Scalable support for request tagging. Most devices need to
        be able to uniquely identify a request both in the driver and
        to the hardware. The tagging uses per-cpu caches for freed
        tags, to enable cache hot reuse.
      
      - Timeout handling without tracking request on a per-device
        basis. Basically the driver should be able to get a notification,
        if a request happens to fail.
      
      - Optional support for non 1:1 mappings between issue and
        submission queues. blk-mq can redirect IO completions to the
        desired location.
      
      - Support for per-request payloads. Drivers almost always need
        to associate a request structure with some driver private
        command structure. Drivers can tell blk-mq this at init time,
        and then any request handed to the driver will have the
        required size of memory associated with it.
      
      - Support for merging of IO, and plugging. The stacked model
        gets neither of these. Even for high IOPS devices, merging
        sequential IO reduces per-command overhead and thus
        increases bandwidth.
      
      For now, this is provided as a potential 3rd queueing model, with
      the hope being that, as it matures, it can replace both the classic
      and stacked model. That would get us back to having just 1 real
      model for block devices, leaving the stacked approach to dm/md
      devices (as it was originally intended).
      
      Contributions in this patch from the following people:
      
      Shaohua Li <shli@fusionio.com>
      Alexander Gordeev <agordeev@redhat.com>
      Christoph Hellwig <hch@infradead.org>
      Mike Christie <michaelc@cs.wisc.edu>
      Matias Bjorling <m@bjorling.me>
      Jeff Moyer <jmoyer@redhat.com>
      Acked-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      320ae51f