1. 21 Jan, 2011 1 commit
    • kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT · 6a108a14
      David Rientjes authored
      
      The meaning of CONFIG_EMBEDDED has long been obsolete; the option is
      used to configure any non-standard kernel, a much larger scope than
      small devices alone.
      
      This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
      references to the option throughout the kernel.  A new CONFIG_EMBEDDED
      option is added that automatically selects CONFIG_EXPERT when enabled and
      can be used in the future to isolate options that should only be
      considered for embedded systems (RISC architectures, SLOB, etc).
      
      Calling the option "EXPERT" more accurately represents its intention: only
      expert users who understand the impact of the configuration changes they
      are making should enable it.
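      
      As a simplified sketch, the resulting shape in init/Kconfig (help text
      abbreviated, not the patch verbatim):
      
              config EXPERT
                      bool "Configure standard kernel features (expert users)"
                      help
                        Allows certain base kernel options and settings
                        to be disabled or tweaked.
      
              config EMBEDDED
                      bool "Embedded system"
                      select EXPERT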
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: David Woodhouse <david.woodhouse@intel.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Greg KH <gregkh@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Arnd Bergmann <arn...
  2. 14 Jan, 2011 2 commits
  3. 07 Jan, 2011 3 commits
  4. 05 Jan, 2011 1 commit
    • block: fix accounting bug on cross partition merges · 09e099d4
      Jerome Marchand authored
      
      /proc/diskstats would display a strange output as follows.
      
      $ cat /proc/diskstats |grep sda
         8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
         8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                      ~~~~~~~~~~
         8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
         8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
         8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
         8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
      
      The reason is incorrect accounting of hd_struct->in_flight when a bio
      is merged by ELEVATOR_FRONT_MERGE into a request that belongs to a
      different partition.
      
      The detailed root cause is as follows.
      
      Assume that there are two partitions, sda1 and sda2.
      
      1. A request for sda2 is in the request_queue. Hence sda1's
         hd_struct->in_flight is 0 and sda2's is 1.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      2. A bio belonging to sda1 is issued and merged into the request from
         step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request moves
         from the sda2 region into the sda1 region, but neither partition's
         hd_struct->in_flight is changed.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |          0
         sda2 |          1
         ---------------------------
      
      3. The request is finished and blk_account_io_done() is called. Because
         the request's first sector now lies in the sda1 region, it is sda1's
         hd_struct->in_flight, not sda2's, that is decremented.
      
              | hd_struct->in_flight
         ---------------------------
         sda1 |         -1
         sda2 |          1
         ---------------------------
      
      The patch fixes the problem by caching the partition lookup
      inside the request structure, hence making sure that the increment
      and decrement will always happen on the same partition struct. This
      also speeds up IO with accounting enabled, since it cuts down on
      the number of lookups we have to do.
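      
      A hedged sketch of the caching idea (simplified; for example, the real
      blk_account_io_start() takes additional arguments):
      
              struct request {
                      /* ... */
                      struct hd_struct *part; /* partition resolved once,
                                                 when accounting starts */
              };
      
              static void blk_account_io_start(struct request *rq)
              {
                      rq->part = disk_map_sector_rcu(rq->rq_disk,
                                                     blk_rq_pos(rq));
                      part_inc_in_flight(rq->part, rq_data_dir(rq));
              }
      
              static void blk_account_io_done(struct request *rq)
              {
                      /* decrement the same partition we incremented, even if
                       * a front merge later moved the request's first sector
                       * across a partition boundary */
                      part_dec_in_flight(rq->part, rq_data_dir(rq));
              }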
      
      Also add a refcount to struct hd_struct to keep the partition in
      memory as long as users exist. We use kref_test_and_get() to ensure
      we don't add a reference to a partition which is going away.
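      
      The lookup-side guard, sketched under the same assumption (the
      hd_struct_try_get() wrapper is hypothetical; the message only names
      kref_test_and_get()):
      
              static inline int hd_struct_try_get(struct hd_struct *part)
              {
                      /* take a reference only if the refcount hasn't already
                       * dropped to zero, i.e. the partition isn't going away */
                      return kref_test_and_get(&part->ref);
              }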
      Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: stable@kernel.org
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  5. 03 Jan, 2011 1 commit
  6. 21 Dec, 2010 1 commit
  7. 17 Dec, 2010 5 commits
  8. 16 Dec, 2010 3 commits
    • implement in-kernel gendisk events handling · 77ea887e
      Tejun Heo authored
      
      Currently, media presence polling for removable block devices is done
      from userland.  There are several issues with this.
      
      * Polling is done by periodically opening the device.  For SCSI
        devices, the command sequence generated by such action involves a
        few different commands including TEST_UNIT_READY.  This behavior,
        while perfectly legal, is different from Windows, which issues only
        a single command, GET_EVENT_STATUS_NOTIFICATION.  Unfortunately,
        some ATAPI devices lock up after being periodically queried with
        such command sequences.
      
      * There is no reliable and unintrusive way for a userland program to
        tell whether the target device is safe for media presence polling.
        For example, polling for media presence during an on-going burning
        session can make it fail.  The polling program can avoid this by
        opening the device with O_EXCL but then it risks making a valid
        exclusive user of the device fail w/ -EBUSY.
      
      * Userland polling is unnecessarily heavy; an in-kernel implementation
        is lighter and better coordinated (workqueue, timer slack).
      
      This patch implements a framework for in-kernel disk event handling,
      which includes media presence polling.
      
      * bdops->check_events() is added, which supersedes ->media_changed().
        It should check whether there are any pending events and, if so,
        return them.  Currently, two events are defined -
        DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST.
        ->check_events() is guaranteed not to be called in parallel (a
        minimal driver-side sketch follows this list).
      
      * gendisk->events and ->async_events are added.  These should be
        initialized by the block driver before passing the device to add_disk().
        The former contains the mask of all supported events and the latter
        the mask of all events which the device can report without polling.
        /sys/block/*/events[_async] export these to userland.
      
      * The kernel parameter block.events_dfl_poll_msecs controls the system
        polling interval (default 0, meaning disabled) and
        /sys/block/*/events_poll_msecs controls the polling interval for
        individual devices (default -1, meaning use the system setting).  Note
        that if a device can report all supported events asynchronously and
        its polling interval isn't explicitly set, the device won't be
        polled regardless of the system polling interval.
      
      * If a device is opened exclusively with write access, event checking
        is automatically disabled until all write exclusive accesses are
        released.
      
      * Events can be 'cleared' by certain actions.  For example, both
        currently defined events are cleared after the device has been
        successfully opened.  This information is passed to the
        ->check_events() callback via the @clearing argument as a hint.
      
      * Event checking is always performed from system_nrt_wq and timer
        slack is set to 25% for polling.
      
      * Nothing changes for drivers which implement ->media_changed() but
        not ->check_events().  Going forward, all drivers will be converted
        to ->check_events() and ->media_changed() will be dropped.
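      
      A minimal driver-side sketch of the new interface (the foo_* names are
      hypothetical placeholders, not drivers touched by the patch):
      
              static unsigned int foo_check_events(struct gendisk *disk,
                                                   unsigned int clearing)
              {
                      unsigned int pending = 0;
      
                      /* @clearing hints which pending events are being cleared */
                      if (foo_media_changed(disk->private_data))
                              pending |= DISK_EVENT_MEDIA_CHANGE;
                      return pending;
              }
      
              static const struct block_device_operations foo_fops = {
                      .check_events = foo_check_events,
              };
      
              /* before add_disk(): declare what the device can report */
              disk->events = DISK_EVENT_MEDIA_CHANGE | DISK_EVENT_EJECT_REQUEST;
              disk->async_events = 0; /* nothing is reported without polling */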
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: move register_disk() and del_gendisk() to block/genhd.c · d2bf1b67
      Tejun Heo authored
      
      There's no reason for register_disk() and del_gendisk() to be in
      fs/partitions/check.c.  Move both to genhd.c.  While at it, collapse
      unlink_gendisk(), which was artificially in a separate function due to
      genhd.c / check.c split, into del_gendisk().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • block: kill genhd_media_change_notify() · dddd9dc3
      Tejun Heo authored
      
      There's no user of the facility.  Kill it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  9. 13 Dec, 2010 1 commit
  10. 09 Dec, 2010 1 commit
  11. 01 Dec, 2010 2 commits
    • blk-throttle: Correct the placement of smp_rmb() · 04a6b516
      Vivek Goyal authored
      o While discussing which variables are updated without a spin lock and
        why we need barriers, Oleg pointed out that the smp_rmb() should be
        placed between the read of td->limits_changed and the read of
        tg->limits_changed. This patch fixes that.
      
      o The following is one possible sequence of events, with cpu0 executing
        throtl_update_blkio_group_read_bps() and cpu1 executing
        throtl_process_limit_change().
      
       cpu0                                                cpu1
      
       tg->limits_changed = true;
       smp_mb__before_atomic_inc();
       atomic_inc(&td->limits_changed);
      
                                           if (!atomic_read(&td->limits_changed))
                                                   return;
      
                                           if (tg->limits_changed)
                                                   do_something;
      
       If cpu0 has updated tg->limits_changed and td->limits_changed, we want to
       make sure that if the update to td->limits_changed is visible on cpu1, then
       the update t...
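      
       A sketch of the corrected read side, matching the sequence above:
      
              if (!atomic_read(&td->limits_changed))
                      return;
      
              /* pairs with smp_mb__before_atomic_inc() on the update side:
               * the read of td->limits_changed must complete before the
               * read of tg->limits_changed */
              smp_rmb();
      
              if (tg->limits_changed)
                      do_something;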
    • blk-throttle: Trim/adjust slice_end once a bio has been dispatched · d1ae8ffd
      Vivek Goyal authored
      
      o During some testing I did the following and noticed that throttling
        stopped working.
      
              - Put a very low limit on a cgroup, say 1 byte per second.
              - Start some reads; this will set slice_end to a very high value.
              - Change the limit to a higher value, say 1MB/s.
              - Now IO unthrottles and finishes as expected.
              - Try the read again: IO is not limited to 1MB/s as expected.
      
      o What is happening:
              - Initially, the low limit sets slice_end to a very high value.
              - When the limit is updated, slice_end is not truncated.
              - The very high slice_end keeps the existing slice valid for a
                very long time, and a new slice does not start.
              - tg_may_dispatch() is called in blk_throtl_bio(), and trim_slice()
                is not called in this path. So slice_start is some old value and
                practically we are able to do a huge amount of IO.
      
      o There are many ways this can be fixed. I have fixed it by adjusting and
        cleaning up slice_end in trim_slice(). Generally we extend slices if a
        bio is big and can't be dispatched in one slice. After dispatching a
        bio, readjust slice_end to make sure we don't end up with huge values.
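      
      A hedged sketch of the trim (simplified; the real function also rounds
      to slice boundaries and adjusts the consumed byte/io budgets):
      
              static void throtl_trim_slice(struct throtl_grp *tg, bool rw)
              {
                      unsigned long nr_slices;
      
                      /* nothing to trim once the slice has expired */
                      if (time_after(jiffies, tg->slice_end[rw]))
                              return;
      
                      nr_slices = (jiffies - tg->slice_start[rw]) / throtl_slice;
                      if (!nr_slices)
                              return;
      
                      /* drop the fully used sub-slices and pull slice_end back
                       * near "now", so a stale, far-future slice_end cannot
                       * defeat a newly lowered limit */
                      tg->slice_start[rw] += nr_slices * throtl_slice;
                      tg->slice_end[rw] = jiffies + throtl_slice;
              }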
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  12. 30 Nov, 2010 2 commits
  13. 29 Nov, 2010 1 commit
  14. 17 Nov, 2010 1 commit
  15. 16 Nov, 2010 1 commit
  16. 15 Nov, 2010 2 commits
    • blk-cgroup: Allow creation of hierarchical cgroups · bdc85df7
      Vivek Goyal authored
      
      o Allow hierarchical cgroup creation for blkio controller
      
      o Currently we disallow it, as both io controller policies (throttling
        as well as proportional bandwidth) lack support for hierarchical
        accounting and control. The flip side is that the blkio controller
        cannot be used with libvirt, as libvirt creates a cgroup hierarchy
        deeper than 1 level:
      
        <top-level-cgroup-dir>/<controller>/libvirt/qemu/<virtual-machine-groups>
      
      o So this patch allows creation of a cgroup hierarchy, but at the backend
        everything will be treated as flat. So if somebody creates a hierarchy
        like the following:
      
      			root
      			/  \
      		     test1 test2
      			|
      		     test3
      
        CFQ and throttling will practically treat all groups as being at the
        same level:
      
      				pivot
      			     /  |   \  \
      			root  test1 test2  test3
      
      o Once we have actual support for hierarchical accounting and control,
        we can introduce another cgroup tunable file "blkio.use_hierarchy",
        which will be 0 by default, but if a user wants to enforce hierarchical
        control it can be set to 1. This way there should not be any
        ABI problems down the line.
      
      o The only not-so-pretty part is the introduction of the extra file
        "use_hierarchy" down the line. Kame-san had mentioned that hierarchical
        accounting is expensive in the memory controller, hence they keep it
        off by default. I suspect the same will be the case for the IO
        controller, as for each IO completion we shall have to account the IO
        through the hierarchy up to the root. If so, it is probably not a bad
        idea to introduce this extra file so that it is used only when
        somebody needs it, and some people might enable hierarchy only in part
        of the hierarchy.
      
      o This is basically how the memory controller also uses "use_hierarchy";
        it likewise allowed creation of hierarchies before actual backend
        support was available.
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
      Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • blk-throttle: Fix calculation of max number of WRITES to be dispatched · c2f6805d
      Vivek Goyal authored
      
      o Currently we try to dispatch more READS and fewer WRITES (75%, 25%) in
        one dispatch round. ummy pointed out that there is a bug in the
        max_nr_writes calculation. This patch fixes it.
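      
        A hedged sketch of the intended split (throtl_grp_quantum is the
        per-group dispatch budget; the bug was in how max_nr_writes was
        derived):
      
              /* 75% of the budget for reads, the remaining 25% for writes */
              unsigned int max_nr_reads = throtl_grp_quantum * 3 / 4;
              unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;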
      Reported-by: ummy y <yummylln@yahoo.com.cn>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  17. 13 Nov, 2010 1 commit
    • block: make blkdev_get/put() handle exclusive access · e525fd89
      Tejun Heo authored
      
      Over time, the block layer has accumulated a set of APIs dealing with
      bdev open, close, claim and release.
      
      * blkdev_get/put() are the primary open and close functions.
      
      * bd_claim/release() deal with exclusive open.
      
      * open/close_bdev_exclusive() combine open with claim and release
        with close, respectively.
      
      * bd_link/unlink_disk_holder() to create and remove holder/slave
        symlinks.
      
      * open_by_devnum() wraps bdget() + blkdev_get().
      
      The interface is a bit confusing and the decoupling of open and claim
      makes it impossible to properly guarantee exclusive access, as an
      in-kernel open + claim sequence can disturb an existing exclusive
      open even before the block layer knows the current open is for another
      exclusive access.  Reorganize the interface such that,
      
      * blkdev_get() is extended to include exclusive access management.
        A @holder argument is added and, if @FMODE_EXCL is specified, it
        will gain exclusive access atomically w.r.t. other exclusive
        accesses.
      
      * blkdev_put() is similarly extended.  It now takes a @mode argument
        and, if @FMODE_EXCL is set, releases an exclusive access.  Also, when
        the last exclusive claim is released, the holder/slave symlinks are
        removed automatically.
      
      * bd_claim/release() and close_bdev_exclusive() are no longer
        necessary and either made static or removed.
      
      * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
        is no longer necessary and removed.
      
      * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
        and blkdev_get().  It also has an unexpected extra bdev_read_only()
        test which probably should be moved into blkdev_get().
      
      * open_by_devnum() is modified to take @holder argument and pass it to
        blkdev_get().
      
      Most bdev open/close operations are unified into blkdev_get/put()
      and most exclusive accesses are tested atomically at open time (as
      they should be).  This cleans up the code and removes some corner
      cases - both valid and invalid, but unnecessary all the same.
      
      open_bdev_exclusive() and open_by_devnum() can use further cleanup -
      rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
      special features.  Well, let's leave them for another day.
      
      Most conversions are straightforward.  The drbd conversion is a bit
      more involved as there was some reordering, but the logic should stay
      the same.
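      
      In-kernel usage under the reorganized interface, as a hedged sketch
      (my_claimer is a hypothetical holder cookie identifying the claimer):
      
              struct block_device *bdev = bdget(dev);
              int err;
      
              /* open and claim atomically; fails if another exclusive
               * holder already exists */
              err = blkdev_get(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL,
                               my_claimer);
              if (err)
                      return err;
      
              /* ... use the device ... */
      
              /* FMODE_EXCL in @mode drops the exclusive claim as well */
              blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);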
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Brown <neilb@suse.de>
      Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Peter Osterlund <petero2@telia.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <joel.becker@oracle.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: dm-devel@redhat.com
      Cc: drbd-dev@lists.linbit.com
      Cc: Leo Chen <leochen@broadcom.com>
      Cc: Scott Branden <sbranden@broadcom.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
      Cc: Joern Engel <joern@logfs.org>
      Cc: reiserfs-devel@vger.kernel.org
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
  18. 11 Nov, 2010 1 commit
  19. 10 Nov, 2010 5 commits
  20. 09 Nov, 2010 1 commit
  21. 08 Nov, 2010 3 commits
    • cfq-iosched: don't idle if a deep seek queue is slow · 8e1ac665
      Shaohua Li authored
      
      If a deep seek queue delivers requests slowly but the disk is much
      faster, idling for the queue just wastes disk throughput. If the queue
      delivers all of its requests before half its slice is used, the patch
      disables idling for it.
      In my test, the application submits 32 requests at a time, the disk can
      accept 128 requests at maximum, and the disk is fast. Without the patch,
      throughput is just around 30MB/s; with it, about 80MB/s. The disk is an
      SSD, but it is detected as a rotational disk. I could configure it as an
      SSD, but I thought the deep seek queue logic should be fixed too,
      considering, for example, a fast RAID.
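      
      A hedged sketch of the check (the actual patch gates this further, e.g.
      on the queue being marked seeky):
      
              /* the queue drained with less than half its slice used: the
               * disk is faster than the queue submits, so stop treating it
               * as deep and stop idling for it */
              if (RB_EMPTY_ROOT(&cfqq->sort_list) &&
                  cfqq->slice_end - jiffies > jiffies - cfqq->slice_start) {
                      cfq_clear_cfqq_deep(cfqq);
                      cfq_clear_cfqq_idle_window(cfqq);
              }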
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • cfq-iosched: schedule dispatch for noidle queue · d2d59e18
      Shaohua Li authored
      
      A queue is idle at cfq_dispatch_requests() but becomes noidle later.
      Unless another task explicitly unplugs or all requests are drained, we
      will not deliver requests to the disk even though cfq_arm_slice_timer
      doesn't make the queue idle. For example, cfq_should_idle() returns true
      because service_tree->count == 1, and then other queues are added. Note,
      I haven't seen obvious performance impacts with the patch so far; I just
      thought this could be a problem.
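      
      A hedged sketch of the completion-side kick:
      
              /* on request completion: if the active queue has nothing queued
               * and we decided not to idle on it, schedule a dispatch rather
               * than waiting for an explicit unplug */
              if (RB_EMPTY_ROOT(&cfqq->sort_list) && !cfq_should_idle(cfqd, cfqq))
                      cfq_schedule_dispatch(cfqd);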
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
    • cfq-iosched: do cleanup · c1e44756
      Shaohua Li authored
      
      Some functions should return boolean.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
  22. 01 Nov, 2010 1 commit