linux

Author	SHA1	Message	Date
Shaohua Li	e9ce335df5	cfq-iosched: fix a kbuild regression Alex Shi reported a kbuild regression which is about 10% performance lost. He bisected to this commit: `3dde36ddea`. The reason is cfqq_close() can't find close cooperator. Restoring cfq_rq_close()'s threshold to original value makes the regression go away. Since for_preempt parameter isn't used anymore, this patch deletes it. Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-19 08:03:04 +01:00
Li Zefan	910ac735ba	block: make CONFIG_BLK_CGROUP visible Make the config visible, so we can choose from CONFIG_BLK_CGROUP=y and CONFIG_BLK_CGROUP=m when CONFIG_IOSCHED_CFQ=m. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-16 08:57:15 +01:00
Martin K. Petersen	c77a5710b7	block: Export max number of segments and max segment size in sysfs These two values are useful when debugging issues surrounding maximum I/O size. Put them in sysfs with the rest of the queue limits. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-15 12:48:00 +01:00
Martin K. Petersen	2cda2728aa	block: Fix overrun in lcm() and move it to lib lcm() was defined to take integer-sized arguments. The supplied arguments are multiplied, however, causing us to overflow given sufficiently large input. That in turn led to incorrect optimal I/O size reporting in some cases (RAID over RAID). Switch lcm() over to unsigned long similar to gcd() and move the function from blk-settings.c to lib. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-15 12:47:59 +01:00
Linus Torvalds	c32da02342	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (56 commits) doc: fix typo in comment explaining rb_tree usage Remove fs/ntfs/ChangeLog doc: fix console doc typo doc: cpuset: Update the cpuset flag file Fix of spelling in arch/sparc/kernel/leon_kernel.c no longer needed Remove drivers/parport/ChangeLog Remove drivers/char/ChangeLog doc: typo - Table 1-2 should refer to "status", not "statm" tree-wide: fix typos "ass?o[sc]iac?te" -> "associate" in comments No need to patch AMD-provided drivers/gpu/drm/radeon/atombios.h devres/irq: Fix devm_irq_match comment Remove reference to kthread_create_on_cpu tree-wide: Assorted spelling fixes tree-wide: fix 'lenght' typo in comments and code drm/kms: fix spelling in error message doc: capitalization and other minor fixes in pnp doc devres: typo fix s/dev/devm/ Remove redundant trailing semicolons from macros fix typo "definetly" -> "definitely" in comment tree-wide: s/widht/width/g typo in comments ... Fix trivial conflict in Documentation/laptops/00-INDEX	2010-03-12 16:04:50 -08:00
Ben Blum	67523c48aa	cgroups: blkio subsystem as module Modify the Block I/O cgroup subsystem to be able to be built as a module. As the CFQ disk scheduler optionally depends on blk-cgroup, config options in block/Kconfig, block/Kconfig.iosched, and block/blk-cgroup.h are enhanced to support the new module dependency. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-03-12 15:52:36 -08:00
Jiri Kosina	318ae2edc3	Merge branch 'for-next' into for-linus Conflicts: Documentation/filesystems/proc.txt arch/arm/mach-u300/include/mach/debug-macro.S drivers/net/qlge/qlge_ethtool.c drivers/net/qlge/qlge_main.c drivers/net/typhoon.c	2010-03-08 16:55:37 +01:00
Emese Revfy	52cf25d0ab	Driver core: Constify struct sysfs_ops in struct kobj_type Constify struct sysfs_ops. This is part of the ops structure constification effort started by Arjan van de Ven et al. Benefits of this constification: * prevents modification of data that is shared (referenced) by many other structure instances at runtime * detects/prevents accidental (but not intentional) modification attempts on archs that enforce read-only kernel data at runtime * potentially better optimized code as the compiler can assume that the const data cannot be changed * the compiler/linker move const data into .rodata and therefore exclude them from false sharing Signed-off-by: Emese Revfy <re.emese@gmail.com> Acked-by: David Teigland <teigland@redhat.com> Acked-by: Matt Domsch <Matt_Domsch@dell.com> Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com> Acked-by: Hans J. Koch <hjk@linutronix.de> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-03-07 17:04:49 -08:00
Richard Kennedy	4671a13220	block: don't access jiffies when initialising io_context As the comment says the initial value of last_waited is never used, so there is no need to initialise it with the current jiffies. Jiffies is hot enough without accessing it for no reason. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-01 10:57:22 +01:00
Richard Kennedy	73e9ffdd0c	cfq: remove 8 bytes of padding from cfq_rb_root on 64 bit builds Reorder cfq_rb_root to remove 8 bytes of padding on 64 bit builds. Consequently removing 56 bytes from cfq_group and 64 bytes from cfq_data. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-01 10:50:20 +01:00
Shaohua Li	abc3c744d0	cfq-iosched: quantum check tweak Currently a queue can only dispatch up to 4 requests if there are other queues. This isn't optimal, device can handle more requests, for example, AHCI can handle 31 requests. I can understand the limit is for fairness, but we could do a tweak: if the queue still has a lot of slice left, sounds we could ignore the limit. Test shows this boost my workload (two thread randread of a SSD) from 78m/s to 100m/s. Thanks for suggestions from Corrado and Vivek for the patch. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-03-01 09:20:54 +01:00
Corrado Zoccolo	53c583d226	cfq-iosched: requests "in flight" vs "in driver" clarification Counters for requests "in flight" and "in driver" are used asymmetrically in cfq_may_dispatch, and have slightly different meaning. We split the rq_in_flight counter (was sync_flight) to count both sync and async requests, in order to use this one, which is more accurate in some corner cases. The rq_in_driver counter is coalesced, since individual sync/async counts are not used any more. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-28 19:45:05 +01:00
Corrado Zoccolo	41647e7a91	cfq-iosched: rethink seeky detection for SSDs CFQ currently applies the same logic of detecting seeky queues and grouping them together for rotational disks as well as SSDs. For SSDs, the time to complete a request doesn't depend on the request location, but only on the size. This patch therefore changes the criterion to group queues by request size in case of SSDs, in order to achieve better fairness. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-28 19:41:25 +01:00
Corrado Zoccolo	3dde36ddea	cfq-iosched: rework seeky detection Current seeky detection is based on average seek lenght. This is suboptimal, since the average will not distinguish between: * a process doing medium sized seeks * a process doing some sequential requests interleaved with larger seeks and even a medium seek can take lot of time, if the requested sector happens to be behind the disk head in the rotation (50% probability). Therefore, we change the seeky queue detection to work as follows: * each request can be classified as sequential if it is very close to the current head position, i.e. it is likely in the disk cache (disks usually read more data than requested, and put it in cache for subsequent reads). Otherwise, the request is classified as seeky. * an history window of the last 32 requests is kept, storing the classification result. * A queue is marked as seeky if more than 1/8 of the last 32 requests were seeky. This patch fixes a regression reported by Yanmin, on mmap 64k random reads. Reported-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-28 19:41:25 +01:00
Martin K. Petersen	8a78362c4e	block: Consolidate phys_segment and hw_segment limits Except for SCSI no device drivers distinguish between physical and hardware segment limits. Consolidate the two into a single segment limit. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:08 +01:00
Martin K. Petersen	086fa5ff08	block: Rename blk_queue_max_sectors to blk_queue_max_hw_sectors The block layer calling convention is blk_queue_<limit name>. blk_queue_max_sectors predates this practice, leading to some confusion. Rename the function to appropriately reflect that its intended use is to set max_hw_sectors. Also introduce a temporary wrapper for backwards compability. This can be removed after the merge window is closed. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:08 +01:00
Martin K. Petersen	eb28d31bc9	block: Add BLK_ prefix to definitions Add a BLK_ prefix to block layer constants. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:08 +01:00
Martin K. Petersen	e751e76a5f	block: Remove unused accessor function blk_queue_max_hw_sectors is no longer called by any subsystem and can be removed. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:07 +01:00
Martin K. Petersen	2800aac111	block: Update blk_queue_max_sectors and documentation Clarify blk_queue_max_sectors and update documentation. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 13:58:07 +01:00
Gui Jianfeng	024f906616	cfq: Remove useless css reference get There's no need to take css reference here, for the caller has already called rcu_read_lock() to prevent cgroup from being removed. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-26 08:56:15 +01:00
Jens Axboe	7f03292ee1	Merge branch 'master' into for-2.6.34 Conflicts: include/linux/blkdev.h Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-25 08:48:05 +01:00
Akinobu Mita	bddd87c7e6	blk-core: use BIO list management functions Now that the bio list management stuff is generic, convert generic_make_request to use bio lists instead of its own private bio list implementation. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-23 08:55:42 +01:00
Jens Axboe	79da0644a8	Revert "block: improve queue_should_plug() by looking at IO depths" This reverts commit `fb1e75389b`. "Benjamin S." <sbenni@gmx.de> reports that the patch in question causes a big drop in sequential throughput for him, dropping from 200MB/sec down to only 70MB/sec. Needs to be investigated more fully, for now lets just revert the offending commit. Conflicts: include/linux/blkdev.h Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-23 08:40:43 +01:00
Richard Kennedy	c4081ba5c9	cfq: reorder cfq_queue removing padding on 64bit This removes 8 bytes of padding from struct cfq_queue on 64 bit builds, shrinking it's size to 256 bytes, so fitting into 1 fewer cachelines and allowing 1 more object/slab in it's kmem_cache. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> ---- patch against 2.6.33-rc8 tested on x86_64 AMDX2 Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-22 13:49:24 +01:00
Jens Axboe	f11cbd74c5	Merge branch 'master' into for-2.6.34	2010-02-22 13:48:51 +01:00
Daniel Mack	3ad2f3fbb9	tree-wide: Assorted spelling fixes In particular, several occurances of funny versions of 'success', 'unknown', 'therefore', 'acknowledge', 'argument', 'achieve', 'address', 'beginning', 'desirable', 'separate' and 'necessary' are fixed. Signed-off-by: Daniel Mack <daniel@caiaq.de> Cc: Joe Perches <joe@perches.com> Cc: Junio C Hamano <gitster@pobox.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2010-02-09 11:13:56 +01:00
Shaohua Li	ae54abed63	cfq-iosched: split seeky coop queues after one slice Currently we split seeky coop queues after 1s, which is too big. Below patch marks seeky coop queue split_coop flag after one slice. After that, if new requests come in, the queues will be splitted. Patch is suggested by Corrado. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-05 13:11:45 +01:00
Vivek Goyal	1efe8fe1c2	cfq-iosched: Do not idle on async queues Few weeks back, Shaohua Li had posted similar patch. I am reposting it with more test results. This patch does two things. - Do not idle on async queues. - It also changes the write queue depth CFQ drives (cfq_may_dispatch()). Currently, we seem to driving queue depth of 1 always for WRITES. This is true even if there is only one write queue in the system and all the logic of infinite queue depth in case of single busy queue as well as slowly increasing queue depth based on last delayed sync request does not seem to be kicking in at all. This patch will allow deeper WRITE queue depths (subjected to the other WRITE queue depth contstraints like cfq_quantum and last delayed sync request). Shaohua Li had reported getting more out of his SSD. For me, I have got one Lun exported from an HP EVA and when pure buffered writes are on, I can get more out of the system. Following are test results of pure buffered writes (with end_fsync=1) with vanilla and patched kernel. These results are average of 3 sets of run with increasing number of threads. AVERAGE[bufwfs][vanilla] ------- job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us) --- --- -- ------------ ----------- ------------- ----------- bufwfs 3 1 0 0 95349 474141 bufwfs 3 2 0 0 100282 806926 bufwfs 3 4 0 0 109989 2.7301e+06 bufwfs 3 8 0 0 116642 3762231 bufwfs 3 16 0 0 118230 6902970 AVERAGE[bufwfs] [patched kernel] ------- bufwfs 3 1 0 0 270722 404352 bufwfs 3 2 0 0 206770 1.06552e+06 bufwfs 3 4 0 0 195277 1.62283e+06 bufwfs 3 8 0 0 260960 2.62979e+06 bufwfs 3 16 0 0 299260 1.70731e+06 I also ran buffered writes along with some sequential reads and some buffered reads going on in the system on a SATA disk because the potential risk could be that we should not be driving queue depth higher in presence of sync IO going to keep the max clat low. With some random and sequential reads going on in the system on one SATA disk I did not see any significant increase in max clat. So it looks like other WRITE queue depth control logic is doing its job. Here are the results. AVERAGE[brr, bsr, bufw together] [vanilla] ------- job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us) --- --- -- ------------ ----------- ------------- ----------- brr 3 1 850 546345 0 0 bsr 3 1 14650 729543 0 0 bufw 3 1 0 0 23908 8274517 brr 3 2 981.333 579395 0 0 bsr 3 2 14149.7 1175689 0 0 bufw 3 2 0 0 21921 1.28108e+07 brr 3 4 898.333 1.75527e+06 0 0 bsr 3 4 12230.7 1.40072e+06 0 0 bufw 3 4 0 0 19722.3 2.4901e+07 brr 3 8 900 3160594 0 0 bsr 3 8 9282.33 1.91314e+06 0 0 bufw 3 8 0 0 18789.3 23890622 AVERAGE[brr, bsr, bufw mixed] [patched kernel] ------- job Set NR ReadBW(KB/s) MaxClat(us) WriteBW(KB/s) MaxClat(us) --- --- -- ------------ ----------- ------------- ----------- brr 3 1 837 417973 0 0 bsr 3 1 14357.7 591275 0 0 bufw 3 1 0 0 24869.7 8910662 brr 3 2 1038.33 543434 0 0 bsr 3 2 13351.3 1205858 0 0 bufw 3 2 0 0 18626.3 13280370 brr 3 4 913 1.86861e+06 0 0 bsr 3 4 12652.3 1430974 0 0 bufw 3 4 0 0 15343.3 2.81305e+07 brr 3 8 890 2.92695e+06 0 0 bsr 3 8 9635.33 1.90244e+06 0 0 bufw 3 8 0 0 17200.3 24424392 So looks like it might make sense to include this patch. Thanks Vivek Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-02 20:46:10 +01:00
Gui Jianfeng	bcf4dd4342	blk-cgroup: Fix potential deadlock in blk-cgroup I triggered a lockdep warning as following. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.33-rc2 #1 ------------------------------------------------------- test_io_control/7357 is trying to acquire lock: (blkio_list_lock){+.+...}, at: [<c053a990>] blkiocg_weight_write+0x82/0x9e but task is already holding lock: (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&(&blkcg->lock)->rlock){......}: [<c04583b7>] validate_chain+0x8bc/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a [<c053a4e1>] blkiocg_add_blkio_group+0x1a/0x6d [<c053cac7>] cfq_get_queue+0x225/0x3de [<c053eec2>] cfq_set_request+0x217/0x42d [<c052c8a6>] elv_set_request+0x17/0x26 [<c0532a0f>] get_request+0x203/0x2c5 [<c0532ae9>] get_request_wait+0x18/0x10e [<c0533470>] __make_request+0x2ba/0x375 [<c0531985>] generic_make_request+0x28d/0x30f [<c0532da7>] submit_bio+0x8a/0x8f [<c04d827a>] submit_bh+0xf0/0x10f [<c04d91d2>] ll_rw_block+0xc0/0xf9 [<f86e9705>] ext3_find_entry+0x319/0x544 [ext3] [<f86eae58>] ext3_lookup+0x2c/0xb9 [ext3] [<c04c3e1b>] do_lookup+0xd3/0x172 [<c04c56c8>] link_path_walk+0x5fb/0x95c [<c04c5a65>] path_walk+0x3c/0x81 [<c04c5b63>] do_path_lookup+0x21/0x8a [<c04c66cc>] do_filp_open+0xf0/0x978 [<c04c0c7e>] open_exec+0x1b/0xb7 [<c04c1436>] do_execve+0xbb/0x266 [<c04081a9>] sys_execve+0x24/0x4a [<c04028a2>] ptregs_execve+0x12/0x18 -> #1 (&(&q->__queue_lock)->rlock){..-.-.}: [<c04583b7>] validate_chain+0x8bc/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c0692b0a>] _raw_spin_lock_irqsave+0x27/0x5a [<c053dd2a>] cfq_unlink_blkio_group+0x17/0x41 [<c053a6eb>] blkiocg_destroy+0x72/0xc7 [<c0467df0>] cgroup_diput+0x4a/0xb2 [<c04ca473>] dentry_iput+0x93/0xb7 [<c04ca4b3>] d_kill+0x1c/0x36 [<c04cb5c5>] dput+0xf5/0xfe [<c04c6084>] do_rmdir+0x95/0xbe [<c04c60ec>] sys_rmdir+0x10/0x12 [<c04027cc>] sysenter_do_call+0x12/0x32 -> #0 (blkio_list_lock){+.+...}: [<c0458117>] validate_chain+0x61c/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c06929fd>] _raw_spin_lock+0x1e/0x4e [<c053a990>] blkiocg_weight_write+0x82/0x9e [<c0467f1e>] cgroup_file_write+0xc6/0x1c0 [<c04bd2f3>] vfs_write+0x8c/0x116 [<c04bd7c6>] sys_write+0x3b/0x60 [<c04027cc>] sysenter_do_call+0x12/0x32 other info that might help us debug this: 1 lock held by test_io_control/7357: #0: (&(&blkcg->lock)->rlock){......}, at: [<c053a949>] blkiocg_weight_write+0x3b/0x9e stack backtrace: Pid: 7357, comm: test_io_control Not tainted 2.6.33-rc2 #1 Call Trace: [<c045754f>] print_circular_bug+0x91/0x9d [<c0458117>] validate_chain+0x61c/0xb9c [<c0458dba>] __lock_acquire+0x723/0x789 [<c0458eb0>] lock_acquire+0x90/0xa7 [<c053a990>] ? blkiocg_weight_write+0x82/0x9e [<c06929fd>] _raw_spin_lock+0x1e/0x4e [<c053a990>] ? blkiocg_weight_write+0x82/0x9e [<c053a990>] blkiocg_weight_write+0x82/0x9e [<c0467f1e>] cgroup_file_write+0xc6/0x1c0 [<c0454df5>] ? trace_hardirqs_off+0xb/0xd [<c044d93a>] ? cpu_clock+0x2e/0x44 [<c050e6ec>] ? security_file_permission+0xf/0x11 [<c04bcdda>] ? rw_verify_area+0x8a/0xad [<c0467e58>] ? cgroup_file_write+0x0/0x1c0 [<c04bd2f3>] vfs_write+0x8c/0x116 [<c04bd7c6>] sys_write+0x3b/0x60 [<c04027cc>] sysenter_do_call+0x12/0x32 To prevent deadlock, we should take locks as following sequence: blkio_list_lock -> queue_lock -> blkcg_lock. The following patch should fix this bug. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-02-01 09:58:54 +01:00
Alan D. Brunelle	488991e28e	block: Added in stricter no merge semantics for block I/O Updated 'nomerges' tunable to accept a value of '2' - indicating that _no_ merges at all are to be attempted (not even the simple one-hit cache). The following table illustrates the additional benefit - 5 minute runs of a random I/O load were applied to a dozen devices on a 16-way x86_64 system. nomerges Throughput %System Improvement (tput / %sys) -------- ------------ ----------- ------------------------- 0 12.45 MB/sec 0.669365609 1 12.50 MB/sec 0.641519199 0.40% / 2.71% 2 12.52 MB/sec 0.639849750 0.56% / 2.96% Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-29 09:04:08 +01:00
Divyesh Shah	875feb63b9	cfq-iosched: Respect ioprio_class when preempting In cfq_should_preempt(), we currently allow some cases where a non-RT request can preempt an ongoing RT cfqq timeslice. This should not happen. Examples include: o A sync_noidle wl type non-RT request pre-empting a sync_noidle wl type cfqq on which we are idling. o Once we have per-cgroup async queues, a non-RT sync request pre-empting a RT async cfqq. Signed-off-by: Divyesh Shah<dpshah@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 16:16:18 +01:00
Martin K. Petersen	e03a72e136	block: Stop using byte offsets All callers of the stacking functions use 512-byte sector units rather than byte offsets. Simplify the code so the stacking functions take sectors when specifying data offsets. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:30:09 +01:00
Kirill Afonshin	ce289321b7	block: removed unused as_io_context It isn't used anymore, since AS was deleted. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:20 +01:00
Martin K. Petersen	17be8c2450	block: bdev_stack_limits wrapper DM does not want to know about partition offsets. Add a partition-aware wrapper that DM can use when stacking block devices. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:20 +01:00
Martin K. Petersen	dd3d145d49	block: Fix discard alignment calculation and printing Discard alignment reporting for partitions was incorrect. Update to match the algorithm used elsewhere. The alignment can be negative (misaligned). Fix format string accordingly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:19 +01:00
Martin K. Petersen	fe0b393f2c	block: Correct handling of bottom device misaligment The top device misalignment flag would not be set if the added bottom device was already misaligned as opposed to causing a stacking failure. Also massage the reporting so that an error is only returned if adding the bottom device caused the misalignment. I.e. don't return an error if the top is already flagged as misaligned. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-01-11 14:29:19 +01:00
OGAWA Hirofumi	e79e95db5c	block: Honor the gfp_mask for alloc_page() in blkdev_issue_discard() Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-29 08:53:54 +01:00
Martin K. Petersen	81744ee44a	block: Fix incorrect alignment offset reporting and update documentation queue_sector_alignment_offset returned the wrong value which caused partitions to report an incorrect alignment_offset. Since offset alignment calculation is needed several places it has been split into a separate helper function. The topology stacking function has been updated accordingly. Furthermore, comments have been added to clarify how the stacking function works. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Tested-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-29 08:35:35 +01:00
Shaohua Li	2f7a2d89a8	cfq-iosched: don't regard requests with long distance as close seek_mean could be very big sometimes, using it as close criteria is meaningless as this doen't improve any performance. So if it's big, let's fallback to default value. Reviewed-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Shaohua Li<shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-28 13:18:44 +01:00
Martin K. Petersen	9504e0864b	block: Fix topology stacking for data and discard alignment The stacking code incorrectly scaled up the data offset in some cases causing misaligned devices to report alignment. Rewrite the stacking algorithm to remedy this and apply the same alignment principles to the discard handling. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-21 15:55:51 +01:00
Vivek Goyal	65b32a573e	cfq-iosched: Remove prio_change logic for workload selection o CFQ now internally divides cfq queues in therr workload categories. sync-idle, sync-noidle and async. Which workload to run depends primarily on rb_key offset across three service trees. Which is a combination of mulitiple things including what time queue got queued on the service tree. There is one exception though. That is if we switched the prio class, say we served some RT tasks and again started serving BE class, then with-in BE class we always started with sync-noidle workload irrespective of rb_key offset in service trees. This can provide better latencies for sync-noidle workload in the presence of RT tasks. o This patch gets rid of that exception and which workload to run with-in class always depends on lowest rb_key across service trees. The reason being that now we have multiple BE class groups and if we always switch to sync-noidle workload with-in group, we can potentially starve a sync-idle workload with-in group. Same is true for async workload which will be in root group. Also the workload-switching with-in group will become very unpredictable as it now depends whether some RT workload was running in the system or not. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Acked-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-18 12:40:21 +01:00
Vivek Goyal	fb104db41e	cfq-iosched: Get rid of nr_groups o Currently code does not seem to be using cfqd->nr_groups. Get rid of it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-18 12:40:21 +01:00
Vivek Goyal	1db32c4060	cfq-iosched: Remove the check for same cfq group from allow_merge o allow_merge() already checks if submitting task is pointing to same cfqq as rq has been queued in. If everything is fine, we should not be having a task in one cgroup and having a pointer to cfqq in other cgroup. Well I guess in some situations it can happen and that is, when a random IO queue has been moved into root cgroup for group_isolation=0. In this case, tasks's cgroup/group is different from where actually cfqq is, but this is intentional and in this case merging should be allowed. The second situation is where due to close cooperator patches, multiple processes can be sharing a cfqq. If everything implemented right, we should not end up in a situation where tasks from different processes in different groups are sharing the same cfqq as we allow merging of cooperating queues only if they are in same group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-18 12:40:21 +01:00
Jens Axboe	b568be627a	block: temporarily disable discard granularity Commit `86b3728141` adds a check for misaligned stacking offsets, but it's buggy since the defaults are 0. Hence all dm devices that pass in a non-zero starting offset will be marked as misaligned amd dm will complain. A real fix is coming, in the mean time disable the discard granularity check so that users don't worry about dm reporting about misaligned devices. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-16 09:16:41 +01:00
Linus Torvalds	51b736b851	Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: cfq: set workload as expired if it doesn't have any slice left Fix a CFQ crash in "for-2.6.33" branch of block tree cfq: Remove wait_request flag when idle time is being deleted cfq-iosched: commenting non-obvious initialization cfq-iosched: Take care of corner cases of group losing share due to deletion cfq-iosched: Get rid of cfqq wait_busy_done flag cfq: Optimization for close cooperating queue searching block,xd: Delay allocation of DMA buffers until device is known drbd: Following the hmac change to SHASH (see linux commit `8bd1209cff`) cfq-iosched: reduce write depth only if sync was delayed	2009-12-15 09:11:28 -08:00
Gui Jianfeng	66ae291978	cfq: set workload as expired if it doesn't have any slice left When a group is resumed, if it doesn't have workload slice left, we should set workload_expires as expired. Otherwise, we might start from where we left in previous group by error. Thanks the idea from Corrado. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-15 10:08:45 +01:00
Vivek Goyal	82bbbf28db	Fix a CFQ crash in "for-2.6.33" branch of block tree I think my previous patch introduced a bug which can lead to CFQ hitting BUG_ON(). The offending commit in for-2.6.33 branch is. commit `7667aa0630` Author: Vivek Goyal <vgoyal@redhat.com> Date: Tue Dec 8 17:52:58 2009 -0500 cfq-iosched: Take care of corner cases of group losing share due to deletion While doing some stress testing on my box, I enountered following. login: [ 3165.148841] BUG: scheduling while atomic: swapper/0/0x10000100 [ 3165.149821] Modules linked in: cfq_iosched dm_multipath qla2xxx igb scsi_transport_fc dm_snapshot [last unloaded: scsi_wait_scan] [ 3165.149821] Pid: 0, comm: swapper Not tainted 2.6.32-block-for-33-merged-new #3 [ 3165.149821] Call Trace: [ 3165.149821] <IRQ> [<ffffffff8103fab8>] __schedule_bug+0x5c/0x60 [ 3165.149821] [<ffffffff8103afd7>] ? __wake_up+0x44/0x4d [ 3165.149821] [<ffffffff8153a979>] schedule+0xe3/0x7bc [ 3165.149821] [<ffffffff8103a796>] ? cpumask_next+0x1d/0x1f [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff810422d8>] __cond_resched+0x2a/0x35 [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff8153b1ee>] _cond_resched+0x2c/0x37 [ 3165.149821] [<ffffffff8100e2db>] is_valid_bugaddr+0x16/0x2f [ 3165.149821] [<ffffffff811e4161>] report_bug+0x18/0xac [ 3165.149821] [<ffffffff8100f1fc>] die+0x39/0x63 [ 3165.149821] [<ffffffff8153cde1>] do_trap+0x11a/0x129 [ 3165.149821] [<ffffffff8100d470>] do_invalid_op+0x96/0x9f [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff81034b4d>] ? enqueue_task+0x5c/0x67 [ 3165.149821] [<ffffffff8103ae83>] ? task_rq_unlock+0x11/0x13 [ 3165.149821] [<ffffffff81041aae>] ? try_to_wake_up+0x292/0x2a4 [ 3165.149821] [<ffffffff8100c935>] invalid_op+0x15/0x20 [ 3165.149821] [<ffffffffa000b21d>] ? cfq_dispatch_requests+0x6ba/0x93e [cfq_iosched] [ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f [ 3165.149821] [<ffffffff811d8c2a>] blk_peek_request+0x191/0x1a7 [ 3165.149821] [<ffffffff811e5b8d>] ? kobject_get+0x1a/0x21 [ 3165.149821] [<ffffffff812c8d4c>] scsi_request_fn+0x82/0x3df [ 3165.149821] [<ffffffff8110b2de>] ? bio_fs_destructor+0x15/0x17 [ 3165.149821] [<ffffffff810df5a6>] ? virt_to_head_page+0xe/0x2f [ 3165.149821] [<ffffffff811d931f>] __blk_run_queue+0x42/0x71 [ 3165.149821] [<ffffffff811d9403>] blk_run_queue+0x26/0x3a [ 3165.149821] [<ffffffff812c8761>] scsi_run_queue+0x2de/0x375 [ 3165.149821] [<ffffffff812b60ac>] ? put_device+0x17/0x19 [ 3165.149821] [<ffffffff812c92d7>] scsi_next_command+0x3b/0x4b [ 3165.149821] [<ffffffff812c9b9f>] scsi_io_completion+0x1c9/0x3f5 [ 3165.149821] [<ffffffff812c3c36>] scsi_finish_command+0xb5/0xbe I think I have hit following BUG_ON() in cfq_dispatch_request(). BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list)); Please find attached the patch to fix it. I have done some stress testing with it and have not seen it happening again. o We should wait on a queue even after slice expiry only if it is empty. If queue is not empty then continue to expire it. o If we decide to keep the queue then make cfqq=NULL. Otherwise select_queue() will return a valid cfqq and cfq_dispatch_request() can hit following BUG_ON(). BUG_ON(RB_EMPTY_ROOT(&cfqq->sort_list)) Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-10 19:25:41 +01:00
Gui Jianfeng	554554f60a	cfq: Remove wait_request flag when idle time is being deleted Remove wait_request flag when idle time is being deleted, otherwise it'll hit this path every time when a request is enqueued. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-10 09:38:39 +01:00
Linus Torvalds	4ef58d4e2a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...	2009-12-09 19:43:33 -08:00
Corrado Zoccolo	edc71131c4	cfq-iosched: commenting non-obvious initialization Added a comment to explain the initialization of last_delayed_sync. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 20:56:04 +01:00
Vivek Goyal	7667aa0630	cfq-iosched: Take care of corner cases of group losing share due to deletion If there is a sequential reader running in a group, we wait for next request to come in that group after slice expiry and once new request is in, we expire the queue. Otherwise we delete the group from service tree and group looses its fair share. So far I was marking a queue as wait_busy if it had consumed its slice and it was last queue in the group. But this condition did not cover following two cases. 1.If a request completed and slice has not expired yet. Next request comes in and is dispatched to disk. Now select_queue() hits and slice has expired. This group will be deleted. Because request is still in the disk, this queue will never get a chance to wait_busy. 2.If request completed and slice has not expired yet. Before next request comes in (delay due to think time), select_queue() hits and expires the queue hence group. This queue never got a chance to wait busy. Gui was hitting the boundary condition 1 and not getting fairness numbers proportional to weight. This patch puts the checks for above two conditions and improves the fairness numbers for sequential workload on rotational media. Check in select_queue() takes care of case 1 and additional check in should_wait_busy() takes care of case 2. Reported-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 15:11:04 +01:00
Vivek Goyal	c244bb50a9	cfq-iosched: Get rid of cfqq wait_busy_done flag o Get rid of wait_busy_done flag. This flag only tells we were doing wait busy on a queue and that queue got request so expire it. That information can easily be obtained by (cfq_cfqq_wait_busy() && queue_is_not_empty). So remove this flag and keep code simple. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 15:11:03 +01:00
Gui Jianfeng	b9d8f4c73b	cfq: Optimization for close cooperating queue searching It doesn't make any sense to try to find out a close cooperating queue if current cfqq is the only one in the group. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 15:11:03 +01:00
Corrado Zoccolo	573412b295	cfq-iosched: reduce write depth only if sync was delayed The introduction of ramp-up formula for async queue depths has slowed down dirty page reclaim, by reducing async write performance. This patch makes sure the formula kicks in only when sync request was recently delayed. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-09 12:32:55 +01:00
Vivek Goyal	878eaddd05	cfq-iosched: Do not access cfqq after freeing it Fix a crash during boot reported by Jeff Moyer. Fix the issue of accessing cfqq after freeing it. Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@carl.(none)>	2009-12-07 19:37:15 +01:00
Stephen Rothwell	accee7854b	block: include linux/err.h to use ERR_PTR Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-07 09:47:07 +01:00
Jens Axboe	bb729bc98c	cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit After the merge of the IO controller patches, booting on my megaraid box ran much slower. Vivek Goyal traced it down to megaraid discovery creating tons of devices, each suffering a grace period when they later kill that queue (if no device is found). So lets use call_rcu() to batch these deferred frees, instead of taking the grace period hit for each one. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-06 09:54:19 +01:00
Vivek Goyal	846954b0a3	blkio: Allow CFQ group IO scheduling even when CFQ is a module o Now issues of blkio controller and CFQ in module mode should be fixed. Enable the cfq group scheduling support in module mode. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:38:14 +01:00
Vivek Goyal	3e25206689	blkio: Implement dynamic io controlling policy registration o One of the goals of block IO controller is that it should be able to support mulitple io control policies, some of which be operational at higher level in storage hierarchy. o To begin with, we had one io controlling policy implemented by CFQ, and I hard coded the CFQ functions called by blkio. This created issues when CFQ is compiled as module. o This patch implements a basic dynamic io controlling policy registration functionality in blkio. This is similar to elevator functionality where ioschedulers register the functions dynamically. o Now in future, when more IO controlling policies are implemented, these can dynakically register with block IO controller. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:38:14 +01:00
Vivek Goyal	9d6a986c0b	blkio: Export some symbols from blkio as its user CFQ can be a module o blkio controller is inside the kernel and cfq makes use of interfaces exported by blkio. CFQ can be a module too, hence export symbols used by CFQ. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:38:14 +01:00
Louis Rilling	b69f229206	block: Fix io_context leak after failure of clone with CLONE_IO With CLONE_IO, parent's io_context->nr_tasks is incremented, but never decremented whenever copy_process() fails afterwards, which prevents exit_io_context() from calling IO schedulers exit functions. Give a task_struct to exit_io_context(), and call exit_io_context() instead of put_io_context() in copy_process() cleanup path. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:36:18 +01:00
Louis Rilling	61cc74fbb8	block: Fix io_context leak after clone with CLONE_IO With CLONE_IO, copy_io() increments both ioc->refcount and ioc->nr_tasks. However exit_io_context() only decrements ioc->refcount if ioc->nr_tasks reaches 0. Always call put_io_context() in exit_io_context(). Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 16:36:18 +01:00
André Goddard Rosa	af901ca181	tree-wide: fix assorted typos all over the place That is "success", "unknown", "through", "performance", "[re\|un]mapping" , "access", "default", "reasonable", "[con]currently", "temperature" , "channel", "[un]used", "application", "example","hierarchy", "therefore" , "[over\|under]flow", "contiguous", "threshold", "enough" and others. Signed-off-by: André Goddard Rosa <andre.goddard@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2009-12-04 15:39:55 +01:00
Shaohua Li	3c764b7a65	cfq-iosched: make nonrot check logic consistent cfq_arm_slice_timer() has logic to disable idle window for SSD device. The same thing should be done at cfq_select_queue() too, otherwise we will still see idle window. This makes the nonrot check logic consistent in cfq. Tests in a intel SSD with low_latency knob close, below patch can triple disk thoughput for muti-thread sequential read. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 13:12:06 +01:00
Jens Axboe	237e5bc4e5	io controller: quick fix for blk-cgroup and modular CFQ It's currently not an allowed configuration, so express that in Kconfig. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 10:07:38 +01:00
Jens Axboe	f2eecb9152	cfq-iosched: move IO controller declerations to a header file They should not be declared inside some other file that's not related to CFQ. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-04 10:06:35 +01:00
Jens Axboe	2f5ea47712	cfq-iosched: fix compile problem with !CONFIG_CGROUP Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 21:07:17 +01:00
Vivek Goyal	c04645e592	blkio: Wait on sync-noidle queue even if rq_noidle = 1 o rq_noidle() is supposed to tell cfq that do not expect a request after this one, hence don't idle. But this does not seem to work very well. For example for direct random readers, rq_noidle = 1 but there is next request coming after this. Not idling, leads to a group not getting its share even if group_isolation=1. o The right solution for this issue is to scan the higher layers and set right flag (WRITE_SYNC or WRITE_ODIRECT). For the time being, this single line fix helps. This should not have any significant impact when we are not using cgroups. I will later figure out IO paths in higher layer and fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	ae30c28655	blkio: Implement group_isolation tunable o If a group is running only a random reader, then it will not have enough traffic to keep disk busy and we will reduce overall throughput. This should result in better latencies for random reader though. If we don't idle on random reader service tree, then this random reader will experience large latencies if there are other groups present in system with sequential readers running in these. o One solution suggested by corrado is that by default keep the random readers or sync-noidle workload in root group so that during one dispatch round we idle only once on sync-noidle tree. This means that all the sync-idle workload queues will be in their respective group and we will see service differentiation in those but not on sync-noidle workload. o Provide a tunable group_isolation. If set, this will make sure that even sync-noidle queues go in their respective group and we wait on these. This provides stronger isolation between groups but at the expense of throughput if group does not have enough traffic to keep the disk busy. o By default group_isolation = 0 Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	f26bd1f0a3	blkio: Determine async workload length based on total number of queues o Async queues are not per group. Instead these are system wide and maintained in root group. Hence their workload slice length should be calculated based on total number of queues in the system and not just queues in the root group. o As root group's default weight is 1000, make sure to charge async queue more in terms of vtime so that it does not get more time on disk because root group has higher weight. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	f75edf2dc8	blkio: Wait for cfq queue to get backlogged if group is empty o If a queue consumes its slice and then gets deleted from service tree, its associated group will also get deleted from service tree if this was the only queue in the group. That will make group loose its share. o For the queues on which we have idling on and if these have used their slice, wait a bit for these queues to get backlogged again and then expire these queues so that group does not loose its share. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	f8d461d692	blkio: Propagate cgroup weight updation to cfq groups o Propagate blkio cgroup weight updation to associated cfq groups. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:53 +01:00
Vivek Goyal	24610333d5	blkio: Drop the reference to queue once the task changes cgroup o If a task changes cgroup, drop reference to the cfqq associated with io context and set cfqq pointer stored in ioc to NULL so that upon next request arrival we will allocate a new queue in new group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	8682e1f15f	blkio: Provide some isolation between groups o Do not allow following three operations across groups for isolation. - selection of co-operating queues - preemtpions across groups - request merging across groups. o Async queues are currently global and not per group. Allow preemption of an async queue if a sync queue in other group gets backlogged. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	220841906f	blkio: Export disk time and sectors used by a group to user space o Export disk time and sector used by a group to user space through cgroup interface. o Also export a "dequeue" interface to cgroup which keeps track of how many a times a group was deleted from service tree. Helps in debugging. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	2868ef7b39	blkio: Some debugging aids for CFQ o Some debugging aids for CFQ. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	b1c3576961	blkio: Take care of cgroup deletion and cfq group reference counting o One can choose to change elevator or delete a cgroup. Implement group reference counting so that both elevator exit and cgroup deletion can take place gracefully. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Nauman Rafique <nauman@google.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	25fb5169d4	blkio: Dynamic cfq group creation based on cgroup tasks belongs to o Determine the cgroup IO submitting task belongs to and create the cfq group if it does not exist already. o Also link cfqq and associated cfq group. o Currently all async IO is mapped to root group. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	dae739ebc4	blkio: Group time used accounting and workload context save restore o This patch introduces the functionality to do the accounting of group time when a queue expires. This time used decides which is the group to go next. o Also introduce the functionlity to save and restore the workload type context with-in group. It might happen that once we expire the cfq queue and group, a different group will schedule in and we will lose the context of the workload type. Hence save and restore it upon queue expiry. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	58ff82f34c	blkio: Implement per cfq group latency target and busy queue avg o So far we had 300ms soft target latency system wide. Now with the introduction of cfq groups, divide that latency by number of groups so that one can come up with group target latency which will be helpful in determining the workload slice with-in group and also the dynamic slice length of the cfq queue. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	25bc6b0776	blkio: Introduce per cfq group weights and vdisktime calculations o Bring in the per cfq group weight and how vdisktime is calculated for the group. Also bring in the functionality of updating the min_vdisktime of the group service tree. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:52 +01:00
Vivek Goyal	31e4c28d95	blkio: Introduce blkio controller cgroup interface o This is basic implementation of blkio controller cgroup interface. This is the common interface visible to user space and should be used by different IO control policies as we implement those. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:51 +01:00
Vivek Goyal	1fa8f6d68b	blkio: Introduce the root service tree for cfq groups o So far we just had one cfq_group in cfq_data. To create space for more than one cfq_group, we need to have a service tree of groups where all the groups can be queued if they have active cfq queues backlogged in these. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:51 +01:00
Vivek Goyal	f04a642463	blkio: Keep queue on service tree until we expire it o Currently cfqq deletes a queue from service tree if it is empty (even if we might idle on the queue). This patch keeps the queue on service tree hence associated group remains on the service tree until we decide that we are not going to idle on the queue and expire it. o This just helps in time accounting for queue/group and in implementation of rest of the patches. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:51 +01:00
Vivek Goyal	615f0259e6	blkio: Implement macro to traverse each service tree in group o Implement a macro to traverse each service tree in the group. This avoids usage of double for loop and special condition for idle tree 4 times. o Macro is little twisted because of special handling of idle class service tree. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:51 +01:00
Vivek Goyal	cdb16e8f73	blkio: Introduce the notion of cfq groups o This patch introduce the notion of cfq groups. Soon we will can have multiple groups of different weights in the system. o Various service trees (prioclass and workload type trees), will become per cfq group. So hierarchy looks as follows. cfq_groups \| workload type \| cfq queue o When an scheduling decision has to be taken, first we select the cfq group then workload with-in the group and then cfq queue with-in the workload type. o This patch just makes various workload service tree per cfq group and introduce the function to be able to choose a group for scheduling. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:51 +01:00
Vivek Goyal	bf79193710	blkio: Set must_dispatch only if we decided to not dispatch the request o must_dispatch flag should be set only if we decided not to run the queue and dispatch the request. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 19:28:51 +01:00
Shaohua Li	474b18ccc2	cfq-iosched: no dispatch limit for single queue Since commit `2f5cb7381b`, each queue can send up to 4 * 4 requests if only one queue exists. I wonder why we have such limit. Device supports tag can send more requests. For example, AHCI can send 31 requests. Test (direct aio randread) shows the limits reduce about 4% disk thoughput. On the other hand, since we send one request one time, if other queue pop when current is sending more than cfq_quantum requests, current queue will stop send requests soon after one request, so sounds there is no big latency. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 12:58:05 +01:00
Martin K. Petersen	98262f2762	block: Allow devices to indicate whether discarded blocks are zeroed The discard ioctl is used by mkfs utilities to clear a block device prior to putting metadata down. However, not all devices return zeroed blocks after a discard. Some drives return stale data, potentially containing old superblocks. It is therefore important to know whether discarded blocks are properly zeroed. Both ATA and SCSI drives have configuration bits that indicate whether zeroes are returned after a discard operation. Implement a block level interface that allows this information to be bubbled up the stack and queried via a new block device ioctl. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-12-03 09:24:48 +01:00
Jens Axboe	464191c65b	Revert "cfq: Make use of service count to estimate the rb_key offset" This reverts commit `3586e917f2`. Corrado Zoccolo <czoccolo@gmail.com> correctly points out, that we need consistency of rb_key offset across groups. This means we cannot properly use the per-service_tree service count. Revert this change. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-30 09:38:13 +01:00
Corrado Zoccolo	8e550632cc	cfq-iosched: fix corner cases in idling logic Idling logic was disabled in some corner cases, leading to unfair share for noidle queues. * the idle timer was not armed if there were other requests in the driver. unfortunately, those requests could come from other workloads, or queues for which we don't enable idling. So we will check only pending requests from the active queue * rq_noidle check on no-idle queue could disable the end of tree idle if the last completed request was rq_noidle. Now, we will disable that idle only if all the queues served in the no-idle tree had rq_noidle requests. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-26 10:39:31 +01:00
Corrado Zoccolo	76280aff1c	cfq-iosched: idling on deep seeky sync queues Seeky sync queues with large depth can gain unfairly big share of disk time, at the expense of other seeky queues. This patch ensures that idling will be enabled for queues with I/O depth at least 4, and small think time. The decision to enable idling is sticky, until an idle window times out without seeing a new request. The reasoning behind the decision is that, if an application is using large I/O depth, it is already optimized to make full utilization of the hardware, and therefore we reserve a slice of exclusive use for it. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-26 10:39:31 +01:00
Corrado Zoccolo	e4a229196a	cfq-iosched: fix no-idle preemption logic An incoming no-idle queue should preempt the active no-idle queue only if the active queue is idling due to service tree empty. Previous code was buggy in two ways: * it relied on service_tree field to be set on the active queue, while it is not set when the code is idling for a new request * it didn't check for the service tree empty condition, so could lead to LIFO behaviour if multiple queues with depth > 1 were preempting each other on an non-NCQ device. Reported-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-26 10:39:31 +01:00
Corrado Zoccolo	e459dd08f4	cfq-iosched: fix ncq detection code CFQ's detection of queueing devices initially assumes a queuing device and detects if the queue depth reaches a certain threshold. However, it will reconsider this choice periodically. Unfortunately, if device is considered not queuing, CFQ will force a unit queue depth for some workloads, thus defeating the detection logic. This leads to poor performance on queuing hardware, since the idle window remains enabled. Given this premise, switching to hw_tag = 0 after we have proved at least once that the device is NCQ capable is not a good choice. The new detection code starts in an indeterminate state, in which CFQ behaves as if hw_tag = 1, and then, if for a long observation period we never saw large depth, we switch to hw_tag = 0, otherwise we stick to hw_tag = 1, without reconsidering it again. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-26 10:02:57 +01:00
Corrado Zoccolo	c16632bab1	cfq-iosched: cleanup unreachable code cfq_should_idle returns false for no-idle queues that are not the last, so the control flow will never reach the removed code in a state that satisfies the if condition. The unreachable code was added to emulate previous cfq behaviour for non-NCQ rotational devices. My tests show that even without it, the performances and fairness are comparable with previous cfq, thanks to the fact that all seeky queues are grouped together, and that we idle at the end of the tree. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-26 09:46:46 +01:00
Ilya Loginov	2d4dc890b5	block: add helpers to run flush_dcache_page() against a bio and a request's pages Mtdblock driver doesn't call flush_dcache_page for pages in request. So, this causes problems on architectures where the icache doesn't fill from the dcache or with dcache aliases. The patch fixes this. The ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE symbol was introduced to avoid pointless empty cache-thrashing loops on architectures for which flush_dcache_page() is a no-op. Every architecture was provided with this flush pages on architectires where ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE is equal 1 or do nothing otherwise. See "fix mtd_blkdevs problem with caches on some architectures" discussion on LKML for more information. Signed-off-by: Ilya Loginov <isloginov@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Peter Horton <phorton@bitbox.co.uk> Cc: "Ed L. Cashin" <ecashin@coraid.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-26 09:16:19 +01:00
Gui Jianfeng	3586e917f2	cfq: Make use of service count to estimate the rb_key offset For the moment, different workload cfq queues are put into different service trees. But CFQ still uses "busy_queues" to estimate rb_key offset when inserting a cfq queue into a service tree. I think this isn't appropriate, and it should make use of service tree count to do this estimation. This patch is for for-2.6.33 branch. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-26 09:14:11 +01:00
Randy Dunlap	ad5ebd2fa2	block: jiffies fixes Use HZ-independent calculation of milliseconds. Add jiffies.h where it was missing since functions or macros from it are used. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-11 13:47:45 +01:00
Martin K. Petersen	86b3728141	block: Expose discard granularity While SSDs track block usage on a per-sector basis, RAID arrays often have allocation blocks that are bigger. Allow the discard granularity and alignment to be set and teach the topology stacking logic how to handle them. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-10 11:50:21 +01:00
Corrado Zoccolo	cf7c25cf91	cfq-iosched: fix next_rq computation Cfq has a bug in computation of next_rq, that affects transition between multiple sequential request streams in a single queue (e.g.: two sequential buffered writers of the same priority), causing the alternation between the two streams for a transient period. 8,0 1 18737 0.260400660 5312 D W 141653311 + 256 8,0 1 20839 0.273239461 5400 D W 141653567 + 256 8,0 1 20841 0.276343885 5394 D W 142803919 + 256 8,0 1 20843 0.279490878 5394 D W 141668927 + 256 8,0 1 20845 0.292459993 5400 D W 142804175 + 256 8,0 1 20847 0.295537247 5400 D W 141668671 + 256 8,0 1 20849 0.298656337 5400 D W 142804431 + 256 8,0 1 20851 0.311481148 5394 D W 141668415 + 256 8,0 1 20853 0.314421305 5394 D W 142804687 + 256 8,0 1 20855 0.318960112 5400 D W 142804943 + 256 The fix makes sure that the next_rq is computed from the last dispatched request, and not affected by merging. 8,0 1 37776 4.305161306 0 D W 141738087 + 256 8,0 1 37778 4.308298091 0 D W 141738343 + 256 8,0 1 37780 4.312885190 0 D W 141738599 + 256 8,0 1 37782 4.315933291 0 D W 141738855 + 256 8,0 1 37784 4.319064459 0 D W 141739111 + 256 8,0 1 37786 4.331918431 5672 D W 142803007 + 256 8,0 1 37788 4.334930332 5672 D W 142803263 + 256 8,0 1 37790 4.337902723 5672 D W 142803519 + 256 8,0 1 37792 4.342359774 5672 D W 142803775 + 256 8,0 1 37794 4.345318286 0 D W 142804031 + 256 Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-08 17:16:46 +01:00
H Hartley Sweeten	476d42f138	block/scsi_ioctl.c: quiet sparse noise Quiet sparse noise about symbol's not being declared. Symbol blk_default_cmd_filter is only used locally and should be static. The function blk_scsi_ioctl_init() is a fs_initcall and should also be static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-04 09:10:33 +01:00
Jens Axboe	e00ef79971	cfq-iosched: get rid of the coop_preempt flag We need to rework this logic post the cooperating cfq_queue merging, for now just get rid of it and Jeff Moyer will fix the fall out. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-04 08:54:55 +01:00
Jens Axboe	125c4f221a	cfq-iosched: fix merge error We ended up with testing the same condition twice, pretty pointless. Remove that first if. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-03 21:25:45 +01:00
Jens Axboe	2058297d2d	Merge branch 'for-linus' into for-2.6.33 Conflicts: block/cfq-iosched.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-03 21:14:39 +01:00
Jens Axboe	150e6c67f4	Merge branch 'cfq-2.6.33' into for-2.6.33	2009-11-03 21:12:10 +01:00
Shaohua Li	4b27e1bb44	cfq-iosched: limit coop preemption CFQ has an optimization for cooperated applications. if several io-context have close requests, they will get boost. But the optimization get abused. Considering thread a, b, which work on one file. a reads sectors s, s+2, s+4, ...; b reads sectors s+1, s+3, s +5, ... Both a and b are sequential read, so they can open idle window. a reads a sector s and goes to idle window and wakeup b. b reads sector s+1, since in current implementation, cfq_should_preempt() thinks a and b are cooperators, b will preempt a. b then reads sector s+1 and goes to idle window and wakeup a. for the same reason, a will preempt b and reads s+2. a and b will continue the circle. The circle will be very long, and a and b will occupy whole disk queue. Other applications will nearly have no chance to run. Fix this limiting coop preempt until a queue is scheduled normally again. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-03 20:25:02 +01:00
Jens Axboe	e6ec4fe245	cfq-iosched: fix bad return value cfq_should_preempt() Commit `a6151c3a5c` inadvertently reversed a preempt condition check, potentially causing a performance regression. Make the meta check correct again. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-03 20:21:35 +01:00
Corrado Zoccolo	dddb74519a	cfq-iosched: simplify prio-unboost code Eliminate redundant checks. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-11-02 10:40:37 +01:00
Jens Axboe	5869619cb5	cfq-iosched: fix style issue in cfq_get_avg_queues() Line breaks and bad brace placement. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-28 09:27:07 +01:00
Corrado Zoccolo	718eee0579	cfq-iosched: fairness for sync no-idle queues Currently no-idle queues in cfq are not serviced fairly: even if they can only dispatch a small number of requests at a time, they have to compete with idling queues to be serviced, experiencing large latencies. We should notice, instead, that no-idle queues are the ones that would benefit most from having low latency, in fact they are any of: * processes with large think times (e.g. interactive ones like file managers) * seeky (e.g. programs faulting in their code at startup) * or marked as no-idle from upper levels, to improve latencies of those requests. This patch improves the fairness and latency for those queues, by: * separating sync idle, sync no-idle and async queues in separate service_trees, for each priority * service all no-idle queues together * and idling when the last no-idle queue has been serviced, to anticipate for more no-idle work * the timeslices allotted for idle and no-idle service_trees are computed proportionally to the number of processes in each set. Servicing all no-idle queues together should have a performance boost for NCQ-capable drives, without compromising fairness. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-28 09:23:26 +01:00
Corrado Zoccolo	a6d44e982d	cfq-iosched: enable idling for last queue on priority class cfq can disable idling for queues in various circumstances. When workloads of different priorities are competing, if the higher priority queue has idling disabled, lower priority queues may steal its disk share. For example, in a scenario with an RT process performing seeky reads vs a BE process performing sequential reads, on an NCQ enabled hardware, with low_latency unset, the RT process will dispatch only the few pending requests every full slice of service for the BE process. The patch solves this issue by always performing idle on the last queue at a given priority class > idle. If the same process, or one that can pre-empt it (so at the same priority or higher), submits a new request within the idle window, the lower priority queue won't dispatch, saving the disk bandwidth for higher priority ones. Note: this doesn't touch the non_rotational + NCQ case (no hardware to test if this is a benefit in that case). Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-28 09:23:26 +01:00
Corrado Zoccolo	c0324a020e	cfq-iosched: reimplement priorities using different service trees We use different service trees for different priority classes. This allows a simplification in the service tree insertion code, that no longer has to consider priority while walking the tree. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-28 09:23:26 +01:00
Corrado Zoccolo	aa6f6a3de1	cfq-iosched: preparation to handle multiple service trees We embed a pointer to the service tree in each queue, to handle multiple service trees easily. Service trees are enriched with a counter. cfq_add_rq_rb is invoked after putting the rq in the fifo, to ensure that all fields in rq are properly initialized. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-28 09:23:26 +01:00
Corrado Zoccolo	5db5d64277	cfq-iosched: adapt slice to number of processes doing I/O When the number of processes performing I/O concurrently increases, a fixed time slice per process will cause large latencies. This patch, if low_latency mode is enabled, will scale the time slice assigned to each process according to a 300ms target latency. In order to keep fairness among processes: * The number of active processes is computed using a special form of running average, that quickly follows sudden increases (to keep latency low), and decrease slowly (to have fairness in spite of rapid decreases of this value). To safeguard sequential bandwidth, we impose a minimum time slice (computed using 2*cfq_slice_idle as base, adjusted according to priority and async-ness). Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-28 09:23:26 +01:00
Shaohua Li	1a1238a7dd	cfq-iosched: improve hw_tag detection If active queue hasn't enough requests and idle window opens, cfq will not dispatch sufficient requests to hardware. In such situation, current code will zero hw_tag. But this is because cfq doesn't dispatch enough requests instead of hardware queue doesn't work. Don't zero hw_tag in such case. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-27 08:46:23 +01:00
Jeff Moyer	e6c5bc737a	cfq: break apart merged cfqqs if they stop cooperating cfq_queues are merged if they are issuing requests within the mean seek distance of one another. This patch detects when the coopearting stops and breaks the queues back up. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-26 14:34:47 +01:00
Jeff Moyer	b3b6d0408c	cfq: change the meaning of the cfqq_coop flag The flag used to indicate that a cfqq was allowed to jump ahead in the scheduling order due to submitting a request close to the queue that just executed. Since closely cooperating queues are now merged, the flag holds little meaning. Change it to indicate that multiple queues were merged. This will later be used to allow the breaking up of merged queues when they are no longer cooperating. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-26 14:34:47 +01:00
Jeff Moyer	df5fe3e8e1	cfq: merge cooperating cfq_queues When cooperating cfq_queues are detected currently, they are allowed to skip ahead in the scheduling order. It is much more efficient to automatically share the cfq_queue data structure between cooperating processes. Performance of the read-test2 benchmark (which is written to emulate the dump(8) utility) went from 12MB/s to 90MB/s on my SATA disk. NFS servers with multiple nfsd threads also saw performance increases. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-26 14:34:47 +01:00
Jeff Moyer	b2c18e1e08	cfq: calculate the seek_mean per cfq_queue not per cfq_io_context async cfq_queue's are already shared between processes within the same priority, and forthcoming patches will change the mapping of cic to sync cfq_queue from 1:1 to 1:N. So, calculate the seekiness of a process based on the cfq_queue instead of the cfq_io_context. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-26 14:34:46 +01:00
Mark McLoughlin	6cafb12dc8	block: silently error unsupported empty barriers too With 2.6.32-rc5 in a KVM guest using dm and virtio_blk, we see the following errors: end_request: I/O error, dev vda, sector 0 end_request: I/O error, dev vda, sector 0 The errors go away if dm stops submitting empty barriers, by reverting: commit `52b1fd5a27` Author: Mikulas Patocka <mpatocka@redhat.com> dm: send empty barriers to targets in dm_flush We should silently error all barriers, even empty barriers, on devices like virtio_blk which don't support them. See also: https://bugzilla.redhat.com/514901 Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-24 14:14:31 +02:00
Jens Axboe	c30f33437c	Merge branch 'for-linus' into for-2.6.33	2009-10-13 12:29:45 +02:00
Randy Dunlap	c7ebf0657b	blk-settings: fix function parameter kernel-doc notation Fix kernel-doc notation in blk-settings.c::blk_queue_max_discard_sectors(). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-12 08:20:47 +02:00
KOSAKI Motohiro	8c27959858	elv_iosched_store(): fix strstrip() misuse elv_iosched_store() ignore the return value of strstrip(). It makes small inconsistent behavior. This patch fixes it. <before> ==================================== # cd /sys/block/{blockdev}/queue case1: # echo "anticipatory" > scheduler # cat scheduler noop [anticipatory] deadline cfq case2: # echo "anticipatory " > scheduler # cat scheduler noop [anticipatory] deadline cfq case3: # echo " anticipatory" > scheduler bash: echo: write error: Invalid argument <after> ==================================== # cd /sys/block/{blockdev}/queue case1: # echo "anticipatory" > scheduler # cat scheduler noop [anticipatory] deadline cfq case2: # echo "anticipatory " > scheduler # cat scheduler noop [anticipatory] deadline cfq case3: # echo " anticipatory" > scheduler noop [anticipatory] deadline cfq Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-09 08:48:08 +02:00
Corrado Zoccolo	355b659c87	cfq-iosched: avoid probable slice overrun when idling If the average think time is larger than the remaining time slice for any given queue, don't allow it to idle. A succesful idle also means that we need to dispatch and complete a request, so if we don't even have time left for the idle process, we would overrun the slice in any case. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-08 08:43:32 +02:00
Jens Axboe	a6151c3a5c	cfq-iosched: apply bool value where we return 0/1 Saves 16 bytes of text, woohoo. But the more important point is that it makes the code more readable when returning bool for 0/1 cases. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-07 20:02:57 +02:00
Corrado Zoccolo	ec60e4f674	cfq-iosched: fix think time allowed for seekers CFQ enables idle only for processes that think less than the allowed idle time. Since idle time is lower for seeky queues, we should use the correct value in the comparison. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-07 19:51:54 +02:00
Jens Axboe	b9c8946b19	cfq-iosched: fix the slice residual sign We should subtract the slice residual from the rb tree key, since a negative residual count indicates that the cfqq overran its slice the last time. Hence we want to add the overrun time, to position it a bit further away in the service tree. Reported-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-06 21:09:32 +02:00
Jens Axboe	0b182d617e	cfq-iosched: abstract out the 'may this cfqq dispatch' logic Makes the whole thing easier to read, cfq_dispatch_requests() was a bit messy before. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-06 20:49:37 +02:00
Jens Axboe	1b59dd511b	block: use proper BLK_RW_ASYNC in blk_queue_start_tag() Makes it easier to read than the 0. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-06 20:19:02 +02:00
Nikanth Karthikesan	316d315bff	block: Seperate read and write statistics of in_flight requests v2 Commit `a9327cac44` added seperate read and write statistics of in_flight requests. And exported the number of read and write requests in progress seperately through sysfs. But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange output from "iostat -kx 2". Global values for service time and utilization were garbage. For interval values, utilization was always 100%, and service time is higher than normal. So this was reverted by commit `0f78ab9899` The problem was in part_round_stats_single(), I missed the following: if (now == part->stamp) return; - if (part->in_flight) { + if (part_in_flight(part)) { __part_stat_add(cpu, part, time_in_queue, part_in_flight(part) * (now - part->stamp)); __part_stat_add(cpu, part, io_ticks, (now - part->stamp)); With this chunk included, the reported regression gets fixed. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> -- Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-06 20:16:55 +02:00
Jens Axboe	23e018a1b0	block: get rid of kblock_schedule_delayed_work() It was briefly introduced to allow CFQ to to delayed scheduling, but we ended up removing that feature again. So lets kill the function and export, and just switch CFQ back to the normal work schedule since it is now passing in a '0' delay from all call sites. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-05 11:03:58 +02:00
Corrado Zoccolo	48e025e63a	cfq-iosched: fix possible problem with jiffies wraparound The RR service tree is indexed by a key that is relative to current jiffies. This can cause problems on jiffies wraparound. The patch fixes it using time_before comparison, and changing the add_front path to use a relative number, too. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-05 11:03:55 +02:00
Jens Axboe	30996f40bf	cfq-iosched: fix issue with rq-rq merging and fifo list ordering cfq uses rq->start_time as the fifo indicator, but that field may get modified prior to cfq doing it's fifo list adjustment when a request gets merged with another request. This can cause the fifo list to become unordered. Reported-by: Corrado Zoccolo <czoccolo@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-05 11:03:39 +02:00
Jens Axboe	5d13379a4d	Merge branch 'master' into for-2.6.33	2009-10-05 09:30:10 +02:00
Jens Axboe	0f78ab9899	Revert "Seperate read and write statistics of in_flight requests" This reverts commit `a9327cac44`. Corrado Zoccolo <czoccolo@gmail.com> reports: "with 2.6.32-rc1 I started getting the following strange output from "iostat -kx 2": Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 10,70 0,00 3,16 15,75 0,00 70,38 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 18,22 0,00 0,67 0,01 14,77 0,02 43,94 0,01 10,53 39043915,03 2629219,87 sdb 60,89 9,68 50,79 3,04 1724,43 50,52 65,95 0,70 13,06 488437,47 2629219,87 avg-cpu: %user %nice %system %iowait %steal %idle 2,72 0,00 0,74 0,00 0,00 96,53 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 6,68 0,00 0,99 0,00 0,00 92,33 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 4,40 0,00 0,73 1,47 0,00 93,40 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 4,00 0,00 3,00 0,00 28,00 18,67 0,06 19,50 333,33 100,00 Global values for service time and utilization are garbage. For interval values, utilization is always 100%, and service time is higher than normal. I bisected it down to: [`a9327cac44`] Seperate read and write statistics of in_flight requests and verified that reverting just that commit indeed solves the issue on 2.6.32-rc1." So until this is debugged, revert the bad commit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-04 21:04:38 +02:00
Jens Axboe	e00c54c36a	cfq-iosched: don't delay async queue if it hasn't dispatched at all We cannot delay for the first dispatch of the async queue if it hasn't dispatched at all, since that could present a local user DoS attack vector using an app that just did slow timed sync reads while filling memory. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-04 20:36:19 +02:00
Martin K. Petersen	ac481c20ef	block: Topology ioctls Not all users of the topology information want to use libblkid. Provide the topology information through bdev ioctls. Also clarify sector size comments for existing BLK ioctls. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-03 20:52:01 +02:00
Jens Axboe	61f0c1dcaa	cfq-iosched: use assigned slice sync value, not default We should use the sysfs modified slice sync value, in case it differs from the default. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-03 19:46:03 +02:00
Jens Axboe	963b72fc66	cfq-iosched: rename 'desktop' sysfs entry to 'low_latency' Don't think that's necessarily a perfect description of what this option fiddles with, but it's probably better than 'desktop'. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-03 19:42:18 +02:00
Jens Axboe	8e29675555	cfq-iosched: implement slower async initiate and queue ramp up This slowly ramps up the async queue depth based on the time passed since the sync IO, and doesn't allow async at all until a sync slice period has passed. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-03 16:27:13 +02:00
Vivek Goyal	365722bb91	cfq-iosched: delay async IO dispatch, if sync IO was just done o Do not allow more than max_dispatch requests from an async queue, if some sync request has finished recently. This is in the hope that sync activity is still going on in the system and we might receive a sync request soon. Most likely from a sync queue which finished a request and we did not enable idling on it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-03 15:21:27 +02:00
Jens Axboe	08dc8726d4	block: CFQ is more than a desktop scheduler Update Kconfig.iosched entry. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-03 09:40:47 +02:00
Jens Axboe	492af6350a	block: remove the anticipatory IO scheduler AS is mostly a subset of CFQ, so there's little point in still providing this separate IO scheduler. Hopefully at some point we can get down to one single IO scheduler again, at least this brings us closer by having only one intelligent IO scheduler. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-03 09:37:51 +02:00
Jens Axboe	1d2235152d	cfq-iosched: add a knob for desktop interactiveness This is basically identical to what Vivek Goyal posted, but combined into one and labelled 'desktop' instead of 'fairness'. The goal is to continue to improve on the latency side of things as it relates to interactiveness, keeping the questionable bits under this sysfs tunable so it would be easy for throughput-only people to turn off. Apart from adding the interactive sysfs knob, it also adds the behavioural change of allowing slice idling even if the hardware does tagged command queuing. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-02 20:06:02 +02:00
Jun'ichi Nomura	b0da3f0dad	Add a tracepoint for block request remapping Since 2.6.31 now has request-based device-mapper, it's useful to have a tracepoint for request-remapping as well as bio-remapping. This patch adds a tracepoint for request-remapping, trace_block_rq_remap(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:19:34 +02:00
Christoph Hellwig	67efc92580	block: allow large discard requests Currently we set the bio size to the byte equivalent of the blocks to be trimmed when submitting the initial DISCARD ioctl. That means it is subject to the max_hw_sectors limitation of the HBA which is much lower than the size of a DISCARD request we can support. Add a separate max_discard_sectors tunable to limit the size for discard requests. We limit the max discard request size in bytes to 32bit as that is the limit for bio->bi_size. This could be much larger if we had a way to pass that information through the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:19:34 +02:00
Christoph Hellwig	c15227de13	block: use normal I/O path for discard requests prepare_discard_fn() was being called in a place where memory allocation was effectively impossible. This makes it inappropriate for all but the most trivial translations of Linux's DISCARD operation to the block command set. Additionally adding a payload there makes the ownership of the bio backing unclear as it's now allocated by the device driver and not the submitter as usual. It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether the queue supports discard operations or not. blkdev_issue_discard now allocates a one-page, sector-length payload which is the right thing for the common ATA and SCSI implementations. The mtd implementation of prepare_discard_fn() is replaced with simply checking for the request being a discard. Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx> which did the prepare_discard_fn but not the different payload allocation yet. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:19:30 +02:00
Jun'ichi Nomura	1a35e0f644	Add a tracepoint for block request remapping Since 2.6.31 now has request-based device-mapper, it's useful to have a tracepoint for request-remapping as well as bio-remapping. This patch adds a tracepoint for request-remapping, trace_block_rq_remap(). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:16:13 +02:00
Christoph Hellwig	ca80650cfb	block: allow large discard requests Currently we set the bio size to the byte equivalent of the blocks to be trimmed when submitting the initial DISCARD ioctl. That means it is subject to the max_hw_sectors limitation of the HBA which is much lower than the size of a DISCARD request we can support. Add a separate max_discard_sectors tunable to limit the size for discard requests. We limit the max discard request size in bytes to 32bit as that is the limit for bio->bi_size. This could be much larger if we had a way to pass that information through the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:15:46 +02:00
Christoph Hellwig	1122a26f2a	block: use normal I/O path for discard requests prepare_discard_fn() was being called in a place where memory allocation was effectively impossible. This makes it inappropriate for all but the most trivial translations of Linux's DISCARD operation to the block command set. Additionally adding a payload there makes the ownership of the bio backing unclear as it's now allocated by the device driver and not the submitter as usual. It is replaced with QUEUE_FLAG_DISCARD which is used to indicate whether the queue supports discard operations or not. blkdev_issue_discard now allocates a one-page, sector-length payload which is the right thing for the common ATA and SCSI implementations. The mtd implementation of prepare_discard_fn() is replaced with simply checking for the request being a discard. Largely based on a previous patch from Matthew Wilcox <matthew@wil.cx> which did the prepare_discard_fn but not the different payload allocation yet. Signed-off-by: Christoph Hellwig <hch@lst.de>	2009-10-01 21:15:46 +02:00
Zdenek Kabelac	48c0d4d4c0	Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs introduced in commit `1d54ad6da9`. Release kobject also in case the request_fn is NULL. Problem was noticed via kmemleak backtrace when some sysfs entries were note properly destroyed during device removal: unreferenced object 0xffff88001aa76640 (size 80): comm "lvcreate", pid 2120, jiffies 4294885144 hex dump (first 32 bytes): 01 00 00 00 00 00 00 00 f0 65 a7 1a 00 88 ff ff .........e...... 90 66 a7 1a 00 88 ff ff 86 1d 53 81 ff ff ff ff .f........S..... backtrace: [<ffffffff813f9cc6>] kmemleak_alloc+0x26/0x60 [<ffffffff8111d693>] kmem_cache_alloc+0x133/0x1c0 [<ffffffff81195891>] sysfs_new_dirent+0x41/0x120 [<ffffffff81194b0c>] sysfs_add_file_mode+0x3c/0xb0 [<ffffffff81197c81>] internal_create_group+0xc1/0x1a0 [<ffffffff81197d93>] sysfs_create_group+0x13/0x20 [<ffffffff810d8004>] blk_trace_init_sysfs+0x14/0x20 [<ffffffff8123f45c>] blk_register_queue+0x3c/0xf0 [<ffffffff812447e4>] add_disk+0x94/0x160 [<ffffffffa00d8b08>] dm_create+0x598/0x6e0 [dm_mod] [<ffffffffa00de951>] dev_create+0x51/0x350 [dm_mod] [<ffffffffa00de823>] ctl_ioctl+0x1a3/0x240 [dm_mod] [<ffffffffa00de8f2>] dm_compat_ctl_ioctl+0x12/0x20 [dm_mod] [<ffffffff81177bfd>] compat_sys_ioctl+0xcd/0x4f0 [<ffffffff81036ed8>] sysenter_dispatch+0x7/0x2c [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:15:46 +02:00
Martin K. Petersen	5dee2477df	block: Do not clamp max_hw_sectors for stacking devices Stacking devices do not have an inherent max_hw_sector limit. Set the default to INT_MAX so we are bounded only by capabilities of the underlying storage. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:15:45 +02:00
Martin K. Petersen	80ddf247c8	block: Set max_sectors correctly for stacking devices The topology changes unintentionally caused SAFE_MAX_SECTORS to be set for stacking devices. Set the default limit to BLK_DEF_MAX_SECTORS and provide SAFE_MAX_SECTORS in blk_queue_make_request() for legacy hw drivers that depend on the old behavior. Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-10-01 21:15:45 +02:00
Kay Sievers	e454cea20b	Driver-Core: extend devnode callbacks to provide permissions This allows subsytems to provide devtmpfs with non-default permissions for the device node. Instead of the default mode of 0600, null, zero, random, urandom, full, tty, ptmx now have a mode of 0666, which allows non-privileged processes to access standard device nodes in case no other userspace process applies the expected permissions. This also fixes a wrong assignment in pktcdvd and a checkpatch.pl complain. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-09-19 12:50:38 -07:00
Linus Torvalds	ab86e5765d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev debugfs: Modify default debugfs directory for debugging pktcdvd. debugfs: Modified default dir of debugfs for debugging UHCI. debugfs: Change debugfs directory of IWMC3200 debugfs: Change debuhgfs directory of trace-events-sample.h debugfs: Fix mount directory of debugfs by default in events.txt hpilo: add poll f_op hpilo: add interrupt handler hpilo: staging for interrupt handling driver core: platform_device_add_data(): use kmemdup() Driver core: Add support for compatibility classes uio: add generic driver for PCI 2.3 devices driver-core: move dma-coherent.c from kernel to driver/base mem_class: fix bug mem_class: use minor as index instead of searching the array driver model: constify attribute groups UIO: remove 'default n' from Kconfig Driver core: Add accessor for device platform data Driver core: move dev_get/set_drvdata to drivers/base/dd.c Driver core: add new device to bus's list before probing	2009-09-16 08:27:10 -07:00
David Brownell	a4dbd6740d	driver model: constify attribute groups Let attribute group vectors be declared "const". We'd like to let most attribute metadata live in read-only sections... this is a start. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-09-15 09:50:47 -07:00
Linus Torvalds	ada3fa1505	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits) powerpc64: convert to dynamic percpu allocator sparc64: use embedding percpu first chunk allocator percpu: kill lpage first chunk allocator x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA percpu: update embedding first chunk allocator to handle sparse units percpu: use group information to allocate vmap areas sparsely vmalloc: implement pcpu_get_vm_areas() vmalloc: separate out insert_vmalloc_vm() percpu: add chunk->base_addr percpu: add pcpu_unit_offsets[] percpu: introduce pcpu_alloc_info and pcpu_group_info percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward percpu: add @align to pcpu_fc_alloc_fn_t percpu: make @dyn_size mandatory for pcpu_setup_first_chunk() percpu: drop @static_size from first chunk allocators percpu: generalize first chunk allocator selection percpu: build first chunk allocators selectively percpu: rename 4k first chunk allocator to page percpu: improve boot messages percpu: fix pcpu_reclaim() locking ... Fix trivial conflict as by Tejun Heo in kernel/sched.c	2009-09-15 09:39:44 -07:00
Linus Torvalds	355bbd8cb8	Merge branch 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits) block: use blkdev_issue_discard in blk_ioctl_discard Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads block: don't assume device has a request list backing in nr_requests store block: Optimal I/O limit wrapper cfq: choose a new next_req when a request is dispatched Seperate read and write statistics of in_flight requests aoe: end barrier bios with EOPNOTSUPP block: trace bio queueing trial only when it occurs block: enable rq CPU completion affinity by default cfq: fix the log message after dispatched a request block: use printk_once cciss: memory leak in cciss_init_one() splice: update mtime and atime on files block: make blk_iopoll_prep_sched() follow normal 0/1 return convention cfq-iosched: get rid of must_alloc flag block: use interrupts disabled version of raise_softirq_irqoff() block: fix comment in blk-iopoll.c block: adjust default budget for blk-iopoll block: fix long lines in block/blk-iopoll.c block: add blk-iopoll, a NAPI like approach for block devices ...	2009-09-14 17:55:15 -07:00
Christoph Hellwig	746cd1e7e4	block: use blkdev_issue_discard in blk_ioctl_discard blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard, the only difference between the two is that blkdev_issue_discard needs to send a barrier discard request and blk_ioctl_discard a non-barrier one, and blk_ioctl_discard needs to wait on the request. To facilitates this add a flags argument to blkdev_issue_discard to control both aspects of the behaviour. This will be very useful later on for using the waiting funcitonality for other callers. Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:53 +02:00
Jens Axboe	b8a9ae779f	block: don't assume device has a request list backing in nr_requests store Stacked devices do not. For now, just error out with -EINVAL. Later we could make the limit apply on stacked devices too, for throttling reasons. This fixes 5a54cd13353bb3b88887604e2c980aa01e314309 and should go into 2.6.31 stable as well. Cc: stable@kernel.org Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:53 +02:00
Martin K. Petersen	3c5820c743	block: Optimal I/O limit wrapper Implement blk_limits_io_opt() and make blk_queue_io_opt() a wrapper around it. DM needs this to avoid poking at the queue_limits directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:52 +02:00
Jeff Moyer	06d2188644	cfq: choose a new next_req when a request is dispatched This patch addresses http://bugzilla.kernel.org/show_bug.cgi?id=13401, a regression introduced in 2.6.30. From the bug report: Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:52 +02:00
Nikanth Karthikesan	a9327cac44	Seperate read and write statistics of in_flight requests Currently, there is a single in_flight counter measuring the number of requests in the request_queue. But some monitoring tools would like to know how many read requests and write requests are in progress. Split the current in_flight counter into two seperate counters for read and write. This information is exported as a sysfs attribute, as changing the currently available stat files would break the existing tools. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-14 08:24:52 +02:00
Minchan Kim	01edede41e	block: trace bio queueing trial only when it occurs If BIO is discarded or cross over end of device, BIO queueing trial doesn't occur. Actually the trace was called just before make_request at first: [PATCH] Block queue IO tracing support (blktrace) as of 2006-03-23 2056a782f8e7e65fd4bfd027506b4ce1c5e9ccd4 And then 2 patches added some checks between them: [PATCH] md: check bio address after mapping through partitions 5ddfe9691c91a244e8d1be597b6428fcefd58103, [BLOCK] Don't allow empty barriers to be passed down to queues that don't grok them 51fd77bd9f512ab6cc9df0733ba1caaab89eb957 It breaks original goal. Let's trace it only when it happens. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:34:34 +02:00
Shan Wei	b217a903ab	cfq: fix the log message after dispatched a request The blktrace tools can show process id when cfq dispatched a request, using cfq_log_cfqq() instead of cfq_log(). Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:34:33 +02:00
Jens Axboe	1b379d8daf	cfq-iosched: get rid of must_alloc flag It's not currently used, as pointed out by Gui Jianfeng <guijianfeng@cn.fujitsu.com>. We already check the wait_request flag to allow an idling queue priority allocation access, so we don't need this extra flag. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:32 +02:00
Jens Axboe	a33dac26d4	block: use interrupts disabled version of raise_softirq_irqoff() We already have interrupts disabled at that point, so use the __raise_softirq_irqoff() variant. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:32 +02:00
Jens Axboe	fca51d64c5	block: fix comment in blk-iopoll.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:32 +02:00
Jens Axboe	37867ae7c5	block: adjust default budget for blk-iopoll It's not exported, I doubt we'll have a reason to change this... Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
Jens Axboe	1badcfbd7f	block: fix long lines in block/blk-iopoll.c Note sure why they happened in the first place, probably some bad terminal setting. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
Jens Axboe	5e605b64a1	block: add blk-iopoll, a NAPI like approach for block devices This borrows some code from NAPI and implements a polled completion mode for block devices. The idea is the same as NAPI - instead of doing the command completion when the irq occurs, schedule a dedicated softirq in the hopes that we will complete more IO when the iopoll handler is invoked. Devices have a budget of commands assigned, and will stay in polled mode as long as they continue to consume their budget from the iopoll softirq handler. If they do not, the device is set back to interrupt completion mode. This patch holds the core bits for blk-iopoll, device driver support sold separately. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
Jens Axboe	fb1e75389b	block: improve queue_should_plug() by looking at IO depths Instead of just checking whether this device uses block layer tagging, we can improve the detection by looking at the maximum queue depth it has reached. If that crosses 4, then deem it a queuing device. This is important on high IOPS devices, since plugging hurts the performance there (it can be as much as 10-15% of the sys time). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
Jens Axboe	1f98a13f62	bio: first step in sanitizing the bio->bi_rw flag testing Get rid of any functions that test for these bits and make callers use bio_rw_flagged() directly. Then it is at least directly apparent what variable and flag they check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
Hannes Reinecke	e3264a4d7d	Send uevents for write_protect changes Whenever a block device changes it's read-only attribute notify the userspace about it. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:31 +02:00
Vivek Goyal	d58b85e1e8	cfq-iosched: no need to keep track of busy_rt_queues o Get rid of busy_rt_queues infrastructure. Looks like it is redundant. o Once an RT queue gets request it will preempt any of the BE or IDLE queues immediately. Otherwise this queue will be put on service tree and scheduler will anyway select this queue before any of the BE or IDLE queue. Hence looks like there is no need to keep track of how many busy RT queues are currently on service tree. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:30 +02:00
Jens Axboe	5ad531db6e	cfq-iosched: drain device queue before switching to a sync queue To lessen the impact of async IO on sync IO, let the device drain of any async IO in progress when switching to a sync cfqq that has idling enabled. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:30 +02:00
Tejun Heo	da6c5c720c	scsi,block: update SCSI to handle mixed merge failures Update scsi_io_completion() such that it only fails requests till the next error boundary and retry the leftover. This enables block layer to merge requests with different failfast settings and still behave correctly on errors. Allow merge of requests of different failfast settings. As SCSI is currently the only subsystem which follows failfast status, there's no need to worry about other block drivers for now. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Niel Lambrechts <niel.lambrechts@gmail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:30 +02:00
Tejun Heo	80a761fd33	block: implement mixed merge of different failfast requests Failfast has characteristics from other attributes. When issuing, executing and successuflly completing requests, failfast doesn't make any difference. It only affects how a request is handled on failure. Allowing requests with different failfast settings to be merged cause normal IOs to fail prematurely while not allowing has performance penalties as failfast is used for read aheads which are likely to be located near in-flight or to-be-issued normal IOs. This patch introduces the concept of 'mixed merge'. A request is a mixed merge if it is merge of segments which require different handling on failure. Currently the only mixable attributes are failfast ones (or lack thereof). When a bio with different failfast settings is added to an existing request or requests of different failfast settings are merged, the merged request is marked mixed. Each bio carries failfast settings and the request always tracks failfast state of the first bio. When the request fails, blk_rq_err_bytes() can be used to determine how many bytes can be safely failed without crossing into an area which requires further retrials. This allows request merging regardless of failfast settings while keeping the failure handling correct. This patch only implements mixed merge but doesn't enable it. The next one will update SCSI to make use of mixed merge. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Niel Lambrechts <niel.lambrechts@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:30 +02:00
Tejun Heo	a82afdfcb8	block: use the same failfast bits for bio and request bio and request use the same set of failfast bits. This patch makes the following changes to simplify things. * enumify BIO_RW* bits and reorder bits such that BIOS_RW_FAILFAST_* bits coincide with __REQ_FAILFAST_* bits. * The above pushes BIO_RW_AHEAD out of sync with __REQ_FAILFAST_DEV but the matching is useless anyway. init_request_from_bio() is responsible for setting FAILFAST bits on FS requests and non-FS requests never use BIO_RW_AHEAD. Drop the code and comment from blk_rq_bio_prep(). * Define REQ_FAILFAST_MASK which is OR of all FAILFAST bits and simplify FAILFAST flags handling in init_request_from_bio(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 14:33:27 +02:00
Jens Axboe	d993831fa7	writeback: add name to backing_dev_info This enables us to track who does what and print info. Its main use is catching dirty inodes on the default_backing_dev_info, so we can fix that up. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-11 09:20:26 +02:00
Nikanth Karthikesan	c295fc0578	block: Allow changing max_sectors_kb above the default 512 The patch "block: Use accessor functions for queue limits" (`ae03bf639a`) changed queue_max_sectors_store() to use blk_queue_max_sectors() instead of directly assigning the value. But blk_queue_max_sectors() differs a bit 1. It sets both max_sectors_kb, and max_hw_sectors_kb 2. Never allows one to change max_sectors_kb above BLK_DEF_MAX_SECTORS. If one specifies a value greater then max_hw_sectors is set to that value but max_sectors is set to BLK_DEF_MAX_SECTORS I am not sure whether blk_queue_max_sectors() should be changed, as it seems to be that way for a long time. And there may be callers dependent on that behaviour. This patch simply reverts to the older way of directly assigning the value to max_sectors as it was before. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-09-01 22:40:15 +02:00
Tejun Heo	384be2b18a	Merge branch 'percpu-for-linus' into percpu-for-next Conflicts: arch/sparc/kernel/smp_64.c arch/x86/kernel/cpu/perf_counter.c arch/x86/kernel/setup_percpu.c drivers/cpufreq/cpufreq_ondemand.c mm/percpu.c Conflicts in core and arch percpu codes are mostly from commit ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all the first chunk allocators into mm/percpu.c, the changes are moved from arch code to mm/percpu.c. Signed-off-by: Tejun Heo <tj@kernel.org>	2009-08-14 14:45:31 +09:00
John Stoffel	14d9fa3525	Make SCSI SG v4 driver enabled by default and remove EXPERIMENTAL dependency, since udev depends on BSG Make Block Layer SG support v4 the default, since recent udev versions depend on this to access serial numbers and other low level info properly. This should be backported to older kernels as well, since most distros have enabled this for a long time. Signed-off-by: John Stoffel <john@stoffel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-08-04 22:10:17 +02:00
Martin K. Petersen	7e5f5fb09e	block: Update topology documentation Update topology comments and sysfs documentation based upon discussions with Neil Brown. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-08-01 10:24:35 +02:00
Martin K. Petersen	70dd5bf3b9	block: Stack optimal I/O size When stacking block devices ensure that optimal I/O size is scaled accordingly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-08-01 10:24:35 +02:00
Martin K. Petersen	7c958e3264	block: Add a wrapper for setting minimum request size without a queue Introduce blk_limits_io_min() and make blk_queue_io_min() call it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-08-01 10:24:35 +02:00
Martin K. Petersen	fef246672b	block: Make blk_queue_stack_limits use the new stacking interface blk_queue_stack_limits() has been superceded by blk_stack_limits() and disk_stack_limits(). Wrap the function call for now, we'll deprecate it later. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-08-01 10:24:35 +02:00
Jens Axboe	56ad1740d9	block: make the end_io functions be non-GPL exports Prior to the change for more sane end_io functions, we exported the helpers with the normal EXPORT_SYMBOL(). That got changed to _GPL() for the new interface. Revert that particular change, on the basis that this is basic functionality and doesn't dip into internal structures. If these exports can't be non-GPL, then we may as well make EXPORT_SYMBOL() imply GPL for everything. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-28 22:11:24 +02:00
Xiaotian Feng	3839e4b29b	block: fix improper kobject release in blk_integrity_unregister blk_integrity_unregister should use kobject_put to release the kobject, otherwise after bi is freed, memory of bi->kobj->name is leaked. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-28 09:11:14 +02:00
Jens Axboe	a4e7d46407	block: always assign default lock to queues Move the assignment of a default lock below blk_init_queue() to blk_queue_make_request(), so we also get to set the default lock for ->make_request_fn() based drivers. This is important since the queue flag locking requires a lock to be in place. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-28 09:07:29 +02:00
Xiaotian Feng	9cb308ce8d	block: sysfs fix mismatched queue_var_{store,show} in 64bit kernel In blk-sysfs.c, queue_var_store uses unsigned long to store data, but queue_var_show uses unsigned int to show data. This causes, # echo 70000000000 > /sys/block/<dev>/queue/read_ahead_kb # cat /sys/block/<dev>/queue/read_ahead_kb => get wrong value Fix it by using unsigned long. While at it, convert queue_rq_affinity_show() such that it uses bool variable instead of explicit != 0 testing. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>	2009-07-17 16:54:15 +09:00
Tejun Heo	0a09f4319c	block: fix failfast merge testing in elv_rq_merge_ok() Commit `ab0fd1debe` tries to prevent merge of requests with different failfast settings. In elv_rq_merge_ok(), it compares new bio's failfast flags against the merge target request's. However, the flag testing accessors for bio and blk don't return boolean but the tested bit value directly and FAILFAST on bio and blk don't match, so directly comparing them with == results in false negative unnecessary preventing merge of readahead requests. This patch convert the results to boolean by negating them before comparison. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jeff Garzik <jeff@garzik.org>	2009-07-17 14:50:43 +09:00
Vivek Goyal	32f2e807a3	cfq-iosched: reset oom_cfqq in cfq_set_request() In case memory is scarce, we now default to oom_cfqq. Once memory is available again, we should allocate a new cfqq and stop using oom_cfqq for a particular io context. Once a new request comes in, check if we are using oom_cfqq, and if yes, try to allocate a new cfqq. Tested the patch by forcing the use of oom_cfqq and upon next request thread realized that it was using oom_cfqq and it allocated a new cfqq. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-10 20:31:54 +02:00
FUJITA Tomonori	76da03467a	block: call blk_scsi_ioctl_init() Currently, blk_scsi_ioctl_init() is not called since it lacks an initcall marking. This causes the command table to be unitialized, hence somce commands are block when they should not have been. This fixes a regression introduced by commit `018e044689` Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-10 20:31:53 +02:00
Tejun Heo	c43768cbb7	Merge branch 'master' into for-next Pull linus#master to merge PER_CPU_DEF_ATTRIBUTES and alpha build fix changes. As alpha in percpu tree uses 'weak' attribute instead of inline assembly, there's no need for __used attribute. Conflicts: arch/alpha/include/asm/percpu.h arch/mn10300/kernel/vmlinux.lds.S include/linux/percpu-defs.h	2009-07-04 07:13:18 +09:00
Tejun Heo	ab0fd1debe	block: don't merge requests of different failfast settings Block layer used to merge requests and bios with different failfast settings. This caused regular IOs to fail prematurely when they were merged into failfast requests for readahead. Niel Lambrechts could trigger the problem semi-reliably on ext4 when resuming from STR. ext4 uses readahead when reading inodes and combined with the deterministic extra SATA PHY exception cycle during resume on the specific configuration, non-readahead inode read would fail causing ext4 errors. Please read the following thread for details. http://lkml.org/lkml/2009/5/23/21 This patch makes block layer reject merging if the failfast settings don't match. This is correct but likely to lower IO performance by preventing regular IOs from mingling into surrounding readahead requests. Changes to allow such mixed merges and handle errors correctly will be added later. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Niel Lambrechts <niel.lambrechts@gmail.com> Cc: Theodore Tso <tytso@mit.edu> Signed-off-by: Jens Axboe <axboe@carl.(none)>	2009-07-03 21:06:45 +02:00
Shan Wei	b706f64281	cfq-iosched: remove redundant check for NULL cfqq in cfq_set_request() With the changes for falling back to an oom_cfqq, we never fail to find/allocate a queue in cfq_get_queue(). So remove the check. Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-01 12:41:14 +02:00
NeilBrown	db64f680ba	blocK: Restore barrier support for md and probably other virtual devices. The next_ordered flag is only meaningful for devices that use __make_request. So move the test against next_ordered out of generic code and in to __make_request Since this test was added, barriers have not worked on md or any devices that don't use __make_request and so don't bother to set next_ordered. (dm explicitly sets something other than QUEUE_ORDERED_NONE since commit `99360b4c18` but notes in the comments that it is otherwise meaningless). Cc: Ken Milmore <ken.milmore@googlemail.com> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-01 10:56:26 +02:00
Jens Axboe	018e044689	block: get rid of queue-private command filter The initial patches to support this through sysfs export were broken and have been if 0'ed out in any release. So lets just kill the code and reclaim some space in struct request_queue, if anyone would later like to fixup the sysfs bits, the git history can easily restore the removed bits. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-01 10:56:26 +02:00
Martin K. Petersen	7878cba9f0	block: Create bip slabs with embedded integrity vectors This patch restores stacking ability to the block layer integrity infrastructure by creating a set of dedicated bip slabs. Each bip slab has an embedded bio_vec array at the end. This cuts down on memory allocations and also simplifies the code compared to the original bvec version. Only the largest bip slab is backed by a mempool. The pool is contained in the bio_set so stacking drivers can ensure forward progress. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@carl.(none)>	2009-07-01 10:56:25 +02:00
Jens Axboe	6118b70b3a	cfq-iosched: get rid of the need for __GFP_NOFAIL in cfq_find_alloc_queue() Setup an emergency fallback cfqq that we allocate at IO scheduler init time. If the slab allocation fails in cfq_find_alloc_queue(), we'll just punt IO to that cfqq instead. This ensures that cfq_find_alloc_queue() never fails without having to ensure free memory. On cfqq lookup, always try to allocate a new cfqq if the given cfq io context has the oom_cfqq assigned. This ensures that we only temporarily punt to this shared queue. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-01 10:56:25 +02:00
Jens Axboe	d5036d770f	cfq-iosched: move cfqq initialization out of cfq_find_alloc_queue() We're going to be needing that init code outside of that function to get rid of the __GFP_NOFAIL in cfqq allocation. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-07-01 10:56:25 +02:00
Tejun Heo	245b2e70ea	percpu: clean up percpu variable definitions Percpu variable definition is about to be updated such that all percpu symbols including the static ones must be unique. Update percpu variable definitions accordingly. * as,cfq: rename ioc_count uniquely * cpufreq: rename cpu_dbs_info uniquely * xen: move nesting_count out of xen_evtchn_do_upcall() and rename it * mm: move ratelimits out of balance_dirty_pages_ratelimited_nr() and rename it * ipv4,6: rename cookie_scratch uniquely * x86 perf_counter: rename prev_left to pmc_prev_left, irq_entry to pmc_irq_entry and nmi_entry to pmc_nmi_entry * perf_counter: rename disable_count to perf_disable_count * ftrace: rename test_event_disable to ftrace_test_event_disable * kmemleak: rename test_pointer to kmemleak_test_pointer * mce: rename next_interval to mce_next_interval [ Impact: percpu usage cleanups, no duplicate static percpu var names ] Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Dave Jones <davej@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: linux-mm <linux-mm@kvack.org> Cc: David S. Miller <davem@davemloft.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andi Kleen <andi@firstfloor.org>	2009-06-24 15:13:48 +09:00
FUJITA Tomonori	1119865935	block: revert "bsg: setting rq->bio to NULL" The SMP handler (sas_smp_request) was fixed to use the block API properly, so we don't need this workaround to avoid blk_put_request() warning. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>	2009-06-21 10:52:42 -05:00
Randy Dunlap	f740f5ca05	Fix kernel-doc parameter name typo in blk-settings.c: Warning(block/blk-settings.c:108): No description found for parameter 'lim' Warning(block/blk-settings.c:108): Excess function parameter 'limits' description in 'blk_set_default_limits' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-19 09:18:32 +02:00
Bartlomiej Zolnierkiewicz	90c699a9ee	block: rename CONFIG_LBD to CONFIG_LBDAF Follow-up to "block: enable by default support for large devices and files on 32-bit archs". Rename CONFIG_LBD to CONFIG_LBDAF to: - allow update of existing [def]configs for "default y" change - reflect that it is used also for large files support nowadays Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-19 08:08:50 +02:00
Martin K. Petersen	3a02c8e814	block: Fix bounce_pfn setting Correct stacking bounce_pfn limit setting and prevent warnings on 32-bit. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-18 09:56:20 +02:00
Linus Torvalds	6fd03301d7	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (64 commits) debugfs: use specified mode to possibly mark files read/write only debugfs: Fix terminology inconsistency of dir name to mount debugfs filesystem. xen: remove driver_data direct access of struct device from more drivers usb: gadget: at91_udc: remove driver_data direct access of struct device uml: remove driver_data direct access of struct device block/ps3: remove driver_data direct access of struct device s390: remove driver_data direct access of struct device parport: remove driver_data direct access of struct device parisc: remove driver_data direct access of struct device of_serial: remove driver_data direct access of struct device mips: remove driver_data direct access of struct device ipmi: remove driver_data direct access of struct device infiniband: ehca: remove driver_data direct access of struct device ibmvscsi: gadget: at91_udc: remove driver_data direct access of struct device hvcs: remove driver_data direct access of struct device xen block: remove driver_data direct access of struct device thermal: remove driver_data direct access of struct device scsi: remove driver_data direct access of struct device pcmcia: remove driver_data direct access of struct device PCIE: remove driver_data direct access of struct device ... Manually fix up trivial conflicts due to different direct driver_data direct access fixups in drivers/block/{ps3disk.c,ps3vram.c}	2009-06-16 12:57:37 -07:00
Li Zefan	e212d6f250	block: remove some includings of blktrace_api.h When porting blktrace to tracepoints, we changed to trace/block.h for trace prober declarations. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-16 11:19:36 +02:00
Martin K. Petersen	e475bba2fd	block: Introduce helper to reset queue limits to default values DM reuses the request queue when swapping in a new device table Introduce blk_set_default_limits() which can be used to reset the the queue_limits prior to stacking devices. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Alasdair G Kergon <agk@redhat.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-16 08:23:52 +02:00
Jeff Moyer	6923715ae3	cfq: remove extraneous '\n' in blktrace output I noticed a blank line in blktrace output. This patch fixes that. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-16 08:21:04 +02:00
Jens Axboe	0989a025d2	block: don't overwrite bdi->state after bdi_init() has been run Move the defaults to where we do the init of the backing_dev_info. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-16 08:21:03 +02:00
Gui Jianfeng	81be834713	cfq: cleanup for last_end_request in cfq_data Actually, last_end_request in cfq_data isn't used now. So lets just remove it. Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-16 08:21:03 +02:00
Kay Sievers	2bdf914915	Driver Core: bsg: add nodename for bsg driver This adds support to the BSG driver to report the proper device name to userspace for the bsg devices. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-06-15 21:30:26 -07:00
Kay Sievers	b03f38b685	Driver Core: block: add nodename support for block drivers. This adds support for block drivers to report their requested nodename to userspace. It also updates a number of block drivers to provide the needed subdirectory and device name to be used for them. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-06-15 21:30:25 -07:00
Randy Dunlap	8ebf975608	block: fix kernel-doc in recent block/ changes Fix kernel-doc warnings in recently changed block/ source code. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-06-11 20:14:23 -07:00
Linus Torvalds	c9059598ea	Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block: (153 commits) block: add request clone interface (v2) floppy: fix hibernation ramdisk: remove long-deprecated "ramdisk=" boot-time parameter fs/bio.c: add missing __user annotation block: prevent possible io_context->refcount overflow Add serial number support for virtio_blk, V4a block: Add missing bounce_pfn stacking and fix comments Revert "block: Fix bounce limit setting in DM" cciss: decode unit attention in SCSI error handling code cciss: Remove no longer needed sendcmd reject processing code cciss: change SCSI error handling routines to work with interrupts enabled. cciss: separate error processing and command retrying code in sendcmd_withirq_core() cciss: factor out fix target status processing code from sendcmd functions cciss: simplify interface of sendcmd() and sendcmd_withirq() cciss: factor out core of sendcmd_withirq() for use by SCSI error handling code cciss: Use schedule_timeout_uninterruptible in SCSI error handling code block: needs to set the residual length of a bidi request Revert "block: implement blkdev_readpages" block: Fix bounce limit setting in DM Removed reference to non-existing file Documentation/PCI/PCI-DMA-mapping.txt ... Manually fix conflicts with tracing updates in: block/blk-sysfs.c drivers/ide/ide-atapi.c drivers/ide/ide-cd.c drivers/ide/ide-floppy.c drivers/ide/ide-tape.c include/trace/events/block.h kernel/trace/blktrace.c	2009-06-11 11:10:35 -07:00
Linus Torvalds	27951daa71	Merge branch 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6 * 'for-2.6.31' of git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: (28 commits) ide-tape: fix debug call alim15x3: Remove historical hacks, re-enable init_hwif for PowerPC ide-dma: don't reset request fields on dma_timeout_retry() ide: drop rq->data handling from ide_map_sg() ide-atapi: kill unused fields and callbacks ide-tape: simplify read/write functions ide-tape: use byte size instead of sectors on rw issue functions ide-tape: unify r/w init paths ide-tape: kill idetape_bh ide-tape: use standard data transfer mechanism ide-tape: use single continuous buffer ide-atapi,tape,floppy: allow ->pc_callback() to change rq->data_len ide-tape,floppy: fix failed command completion after request sense ide-pm: don't abuse rq->data ide-cd,atapi: use bio for internal commands ide-atapi: convert ide-{floppy,tape} to using preallocated sense buffer ide-cd: convert to using generic sense request ide: add helpers for preparing sense requests ide-cd: don't abuse rq->buffer ide-atapi: don't abuse rq->buffer ...	2009-06-11 10:00:03 -07:00
Kiyoshi Ueda	b0fd271d5f	block: add request clone interface (v2) This patch adds the following 2 interfaces for request-stacking drivers: - blk_rq_prep_clone(struct request clone, struct request orig, struct bio_set bs, gfp_t gfp_mask, int (bio_ctr)(struct bio , struct bio, void ), void data) * Clones bios in the original request to the clone request (bio_ctr is called for each cloned bios.) * Copies attributes of the original request to the clone request. The actual data parts (e.g. ->cmd, ->buffer, ->sense) are not copied. - blk_rq_unprep_clone(struct request clone) Frees cloned bios from the clone request. Request stacking drivers (e.g. request-based dm) need to make a clone request for a submitted request and dispatch it to other devices. To allocate request for the clone, request stacking drivers may not be able to use blk_get_request() because the allocation may be done in an irq-disabled context. So blk_rq_prep_clone() takes a request allocated by the caller as an argument. For each clone bio in the clone request, request stacking drivers should be able to set up their own completion handler. So blk_rq_prep_clone() takes a callback function which is called for each clone bio, and a pointer for private data which is passed to the callback. NOTE: blk_rq_prep_clone() doesn't copy any actual data of the original request. Pages are shared between original bios and cloned bios. So caller must not complete the original request before the clone request. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-11 13:11:05 +02:00
Linus Torvalds	8623661180	Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits) Revert "x86, bts: reenable ptrace branch trace support" tracing: do not translate event helper macros in print format ftrace/documentation: fix typo in function grapher name tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK tracing: add protection around module events unload tracing: add trace_seq_vprint interface tracing: fix the block trace points print size tracing/events: convert block trace points to TRACE_EVENT() ring-buffer: fix ret in rb_add_time_stamp ring-buffer: pass in lockdep class key for reader_lock tracing: add annotation to what type of stack trace is recorded tracing: fix multiple use of __print_flags and __print_symbolic tracing/events: fix output format of user stack tracing/events: fix output format of kernel stack tracing/trace_stack: fix the number of entries in the header ring-buffer: discard timestamps that are at the start of the buffer ring-buffer: try to discard unneeded timestamps ring-buffer: fix bug in ring_buffer_discard_commit ftrace: do not profile functions when disabled tracing: make trace pipe recognize latency format flag ...	2009-06-10 19:53:40 -07:00
Nikanth Karthikesan	d9c7d394a8	block: prevent possible io_context->refcount overflow Currently io_context has an atomic_t(32-bit) as refcount. In the case of cfq, for each device against whcih a task does I/O, a reference to the io_context would be taken. And when there are multiple process sharing io_contexts(CLONE_IO) would also have a reference to the same io_context. Theoretically the possible maximum number of processes sharing the same io_context + the number of disks/cfq_data referring to the same io_context can overflow the 32-bit counter on a very high-end machine. Even though it is an improbable case, let us make it atomic_long_t. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-10 23:07:15 +02:00
Li Zefan	55782138e4	tracing/events: convert block trace points to TRACE_EVENT() TRACE_EVENT is a more generic way to define tracepoints. Doing so adds these new capabilities to this tracepoint: - zero-copy and per-cpu splice() tracing - binary tracing without printf overhead - structured logging records exposed under /debug/tracing/events - trace events embedded in function tracer output and other plugins - user-defined, per tracepoint filter expressions ... Cons: - no dev_t info for the output of plug, unplug_timer and unplug_io events. no dev_t info for getrq and sleeprq events if bio == NULL. no dev_t info for rq_abort,...,rq_requeue events if rq->rq_disk == NULL. This is mainly because we can't get the deivce from a request queue. But this may change in the future. - A packet command is converted to a string in TP_assign, not TP_print. While blktrace do the convertion just before output. Since pc requests should be rather rare, this is not a big issue. - In blktrace, an event can have 2 different print formats, but a TRACE_EVENT has a unique format, which means we have some unused data in a trace entry. The overhead is minimized by using __dynamic_array() instead of __array(). I've benchmarked the ioctl blktrace vs the splice based TRACE_EVENT tracing: dd dd + ioctl blktrace dd + TRACE_EVENT (splice) 1 7.36s, 42.7 MB/s 7.50s, 42.0 MB/s 7.41s, 42.5 MB/s 2 7.43s, 42.3 MB/s 7.48s, 42.1 MB/s 7.43s, 42.4 MB/s 3 7.38s, 42.6 MB/s 7.45s, 42.2 MB/s 7.41s, 42.5 MB/s So the overhead of tracing is very small, and no regression when using those trace events vs blktrace. And the binary output of TRACE_EVENT is much smaller than blktrace: # ls -l -h -rw-r--r-- 1 root root 8.8M 06-09 13:24 sda.blktrace.0 -rw-r--r-- 1 root root 195K 06-09 13:24 sda.blktrace.1 -rw-r--r-- 1 root root 2.7M 06-09 13:25 trace_splice.out Following are some comparisons between TRACE_EVENT and blktrace: plug: kjournald-480 [000] 303.084981: block_plug: [kjournald] kjournald-480 [000] 303.084981: 8,0 P N [kjournald] unplug_io: kblockd/0-118 [000] 300.052973: block_unplug_io: [kblockd/0] 1 kblockd/0-118 [000] 300.052974: 8,0 U N [kblockd/0] 1 remap: kjournald-480 [000] 303.085042: block_remap: 8,0 W 102736992 + 8 <- (8,8) 33384 kjournald-480 [000] 303.085043: 8,0 A W 102736992 + 8 <- (8,8) 33384 bio_backmerge: kjournald-480 [000] 303.085086: block_bio_backmerge: 8,0 W 102737032 + 8 [kjournald] kjournald-480 [000] 303.085086: 8,0 M W 102737032 + 8 [kjournald] getrq: kjournald-480 [000] 303.084974: block_getrq: 8,0 W 102736984 + 8 [kjournald] kjournald-480 [000] 303.084975: 8,0 G W 102736984 + 8 [kjournald] bash-2066 [001] 1072.953770: 8,0 G N [bash] bash-2066 [001] 1072.953773: block_getrq: 0,0 N 0 + 0 [bash] rq_complete: konsole-2065 [001] 300.053184: block_rq_complete: 8,0 W () 103669040 + 16 [0] konsole-2065 [001] 300.053191: 8,0 C W 103669040 + 16 [0] ksoftirqd/1-7 [001] 1072.953811: 8,0 C N (5a 00 08 00 00 00 00 00 24 00) [0] ksoftirqd/1-7 [001] 1072.953813: block_rq_complete: 0,0 N (5a 00 08 00 00 00 00 00 24 00) 0 + 0 [0] rq_insert: kjournald-480 [000] 303.084985: block_rq_insert: 8,0 W 0 () 102736984 + 8 [kjournald] kjournald-480 [000] 303.084986: 8,0 I W 102736984 + 8 [kjournald] Changelog from v2 -> v3: - use the newly introduced __dynamic_array(). Changelog from v1 -> v2: - use __string() instead of __array() to minimize the memory required to store hex dump of rq->cmd(). - support large pc requests. - add missing blk_fill_rwbs_rq() in block_rq_requeue TRACE_EVENT. - some cleanups. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> LKML-Reference: <4A2DF669.5070905@cn.fujitsu.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2009-06-09 12:34:23 -04:00
FUJITA Tomonori	c1d4c41f2f	bsg: setting rq->bio to NULL Due to commit `1cd96c242a` ("block: WARN in __blk_put_request() for potential bio leak"), BSG SMP requests get the false warnings: WARNING: at block/blk-core.c:1068 __blk_put_request+0x52/0xc0() This sets rq->bio to NULL to avoid that false warnings. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-09 15:17:37 +02:00
Martin K. Petersen	77634f33d4	block: Add missing bounce_pfn stacking and fix comments DM no longer needs to set limits explicitly when calling blk_stack_limits. Let the latter automatically deal with bounce_pfn scaling. Fix kerneldoc variable names. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-09 06:23:22 +02:00
Jens Axboe	9df1bb9b51	Revert "block: Fix bounce limit setting in DM" This reverts commit `a05c0205ba`. DM doesn't need to access the bounce_pfn directly. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-09 06:22:57 +02:00
FUJITA Tomonori	dbb66c4be0	block: needs to set the residual length of a bidi request Tejun's "block: set rq->resid_len to blk_rq_bytes() on issue" patch seems to be incomplete; It doesn't set rq->resid_len to blk_rq_bytes() for a bidi request (req->next_rq). As a result, all bidi users are broken. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-09 05:47:10 +02:00
Martin K. Petersen	a05c0205ba	block: Fix bounce limit setting in DM blk_queue_bounce_limit() is more than a wrapper about the request queue limits.bounce_pfn variable. Introduce blk_queue_bounce_pfn() which can be called by stacking drivers that wish to set the bounce limit explicitly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-03 09:33:18 +02:00
Kiyoshi Ueda	53c663ce0f	block: fix a possible oops on elv_abort_queue() I found one more mis-conversion to the 'request is always dequeued when completing' model in elv_abort_queue() during code inspection. Although I haven't hit any problem caused by this mis-conversion yet and just done compile/boot test, please apply if you have no problem. Request must be dequeued when it completes. However, elv_abort_queue() completes requests without dequeueing. This will cause oops in the __blk_end_request_all(). This patch fixes the oops. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-06-02 08:44:01 +02:00
James Bottomley	c143dc903d	block: fix an oops on BLKPREP_KILL Doing a bit of torture testing, I ran across a BUG in the block subsystem (at blk-core.c:2048): the test for if the request is queued. It turns out the trigger was a BLKPREP_KILL coming out of the SCSI prep function. Currently for BLKPREP_KILL requests, we send them straight into __blk_end_request_all() with an error, but they've never been dequeued, so they trip the bug. Fix this by starting requests before killing them. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-30 06:43:49 +02:00
Mike Snitzer	5d85d3247c	block: export blk_stack_limits() DM needs to use blk_stack_limits(), so it needs to be exported. Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-28 11:04:53 +02:00
Kiyoshi Ueda	3c4198e874	block: fix no diskstat problem The commit below in 2.6-block/for-2.6.31 causes no diskstat problem because the blk_discard_rq() check was added with '&&'. It should be 'blk_fs_request() \|\| blk_discard_rq()'. This patch does it and fixes the no diskstat problem. Please review and apply. ------ /proc/diskstat without this patch ------------------------------------- 8 0 sda 0 0 0 0 0 0 0 0 0 0 0 ------------------------------------------------------------------------------ ----- /proc/diskstat with this patch applied --------------------------------- 8 0 sda 4186 303 373621 61600 9578 3859 107468 169479 2 89755 231059 ------------------------------------------------------------------------------ -------------------------------------------------------------------------- commit `c69d48540c` Author: Jens Axboe <jens.axboe@oracle.com> Date: Fri Apr 24 08:12:19 2009 +0200 block: include discard requests in IO accounting We currently don't do merging on discard requests, but we potentially could. If we do, then we need to include discard requests in the IO accounting, or merging would end up decrementing in_flight IO counters for an IO which never incremented them. So enable accounting for discard requests. <snip> static inline int blk_do_io_stat(struct request *rq) { - return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq); + return rq->rq_disk && blk_rq_io_stat(rq) && blk_fs_request(rq) && + blk_discard_rq(rq); } -------------------------------------------------------------------------- Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-27 14:50:02 +02:00
James Bottomley	ba396a6c10	block: fix oops with block tag queueing commit e8939a50466fd963eb1ba9118c34b9ffb7ff6aa6 Author: Tejun Heo <tj@kernel.org> Date: Fri May 8 11:54:16 2009 +0900 block: implement and enforce request peek/start/fetch Added a BUG_ON(blk_queued_rq(req)) to the top of blk_finish_req(). Unfortunately, this checks whether req->queuelist is empty. This list is doing double duty both as the queue list and the tag list, so tagged requests come in here with this not empty and boom (the tag list is emptied by blk_queue_end_tag() lower down). Fix this by moving the BUG_ON to below the end tag we also seem vulnerable to this in blk_requeue_request() as well. I think all uses of blk_queued_rq() need auditing because the check is clearly wrong in the tagged case. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-27 14:17:08 +02:00
Martin K. Petersen	c72758f337	block: Export I/O topology for block devices and partitions To support devices with physical block sizes bigger than 512 bytes we need to ensure proper alignment. This patch adds support for exposing I/O topology characteristics as devices are stacked. logical_block_size is the smallest unit the device can address. physical_block_size indicates the smallest I/O the device can write without incurring a read-modify-write penalty. The io_min parameter is the smallest preferred I/O size reported by the device. In many cases this is the same as the physical block size. However, the io_min parameter can be scaled up when stacking (RAID5 chunk size > physical block size). The io_opt characteristic indicates the optimal I/O size reported by the device. This is usually the stripe width for arrays. The alignment_offset parameter indicates the number of bytes the start of the device/partition is offset from the device's natural alignment. Partition tools and MD/DM utilities can use this to pad their offsets so filesystems start on proper boundaries. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:55 +02:00
Martin K. Petersen	cd43e26f07	block: Expose stacked device queues in sysfs Currently stacking devices do not have a queue directory in sysfs. However, many of the I/O characteristics like sector size, maximum request size, etc. are queue properties. This patch enables the queue directory for MD/DM devices. The elevator code has been modified to deal with queues that do not have an I/O scheduler. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:55 +02:00
Martin K. Petersen	025146e13b	block: Move queue limits to an embedded struct To accommodate stacking drivers that do not have an associated request queue we're moving the limits to a separate, embedded structure. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:55 +02:00
Martin K. Petersen	ae03bf639a	block: Use accessor functions for queue limits Convert all external users of queue limits to using wrapper functions instead of poking the request queue variables directly. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Martin K. Petersen	e1defc4ff0	block: Do away with the notion of hardsect_size Until now we have had a 1:1 mapping between storage device physical block size and the logical block sized used when addressing the device. With SATA 4KB drives coming out that will no longer be the case. The sector size will be 4KB but the logical block size will remain 512-bytes. Hence we need to distinguish between the physical block size and the logical ditto. This patch renames hardsect_size to logical_block_size. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 23:22:54 +02:00
Jens Axboe	e4b636366c	Merge branch 'master' into for-2.6.31 Conflicts: drivers/block/hd.c drivers/block/mg_disk.c Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-22 20:25:34 +02:00
Jens Axboe	0a7ae2ff0d	block: change the tag sync vs async restriction logic Make them fully share the tag space, but disallow async requests using the last any two slots. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-20 08:54:31 +02:00
Jens Axboe	53674ac5a9	block: add warning to blk_make_request() Add a note about how one needs to be careful when setting up these bio chains. Extracted from Boaz's updated patch. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 19:52:35 +02:00
Boaz Harrosh	a411f4bbb8	block: Un-export blk_rq_append_bio OSD was the last in-tree user of blk_rq_append_bio(). Now that it is fixed blk_rq_append_bio is un-exported and is only used internally by block layer. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 12:14:56 +02:00
Boaz Harrosh	79eb63e9e5	block: Add blk_make_request(), takes bio, returns a request New block API: given a struct bio allocates a new request. This is the parallel of generic_make_request for BLOCK_PC commands users. The passed bio may be a chained-bio. The bio is bounced if needed inside the call to this member. This is in the effort of un-exporting blk_rq_append_bio(). Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 12:14:56 +02:00
James Bottomley	3a5a39276d	block: allow blk_rq_map_kern to append to requests Use blk_rq_append_bio() internally instead of blk_rq_bio_prep() so blk_rq_map_kern can be called multiple times, to map multiple buffers. This is in the effort to un-export blk_rq_append_bio() Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 12:14:55 +02:00
Tejun Heo	5f49f63178	block: set rq->resid_len to blk_rq_bytes() on issue In commit `c3a4d78c58`, while introducing rq->resid_len, the default value of residue count was changed from full count to zero. The conversion was done under the assumption that when a request fails residue count wasn't defined. However, Boaz and James pointed out that this wasn't true and the residue count should be preserved for failed requests too. This patchset restores the original behavior by setting rq->resid_len to blk_rq_bytes(rq) on request start and restoring explicit clearing in affected drivers. While at it, take advantage of the fact that rq->resid_len is set to full count where applicable. * ide-cd: rq->resid_len cleared on pc success * mptsas: req->resid_len cleared on success * sas_expander: rsp/req->resid_len cleared on success * mpt2sas_transport: req->resid_len cleared on success * ide-cd, ide-tape, mptsas, sas_host_smp, mpt2sas_transport, ub: take advantage of initial full count to simplify code Boaz Harrosh spotted bug in resid_len initialization. Fixed as suggested. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Borislav Petkov <petkovbb@googlemail.com> Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Eric Moore <Eric.Moore@lsi.com> Cc: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-19 11:36:08 +02:00
Ingo Molnar	1079cac0f4	Merge commit 'v2.6.30-rc6' into tracing/core Merge reason: we were on an -rc4 base, sync up to -rc6 Signed-off-by: Ingo Molnar <mingo@elte.hu>	2009-05-18 10:15:35 +02:00
Kazuhisa Ichikawa	af498d7fa3	block: fix the bio_vec array index out-of-bounds test Current bio_vec array index out-of-bounds test within __end_that_request_first() does not seem correct. It checks bio->bi_idx against bio->bi_vcnt, but the subsequent code uses idx (which is, bio->bi_idx + next_idx) as the array index into bio_vec array. This means that the test really make sense only at the first iteration of !(nr_bytes >=bio->bi_size) case (when next_idx == zero). Fix this by replacing bio->bi_idx with idx. (This patch applies to 2.6.30-rc4.) Signed-off-by: Kazuhisa Ichikawa <ki@epsilou.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-12 13:27:45 +02:00
FUJITA Tomonori	b1f744937f	block: move completion related functions back to blk-core.c Let's put the completion related functions back to block/blk-core.c where they have lived. We can also unexport blk_end_bidi_request() and __blk_end_bidi_request(), which nobody uses. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 11:06:48 +02:00
Tejun Heo	9934c8c045	block: implement and enforce request peek/start/fetch Till now block layer allowed two separate modes of request execution. A request is always acquired from the request queue via elv_next_request(). After that, drivers are free to either dequeue it or process it without dequeueing. Dequeue allows elv_next_request() to return the next request so that multiple requests can be in flight. Executing requests without dequeueing has its merits mostly in allowing drivers for simpler devices which can't do sg to deal with segments only without considering request boundary. However, the benefit this brings is dubious and declining while the cost of the API ambiguity is increasing. Segment based drivers are usually for very old or limited devices and as converting to dequeueing model isn't difficult, it doesn't justify the API overhead it puts on block layer and its more modern users. Previous patches converted all block low level drivers to dequeueing model. This patch completes the API transition by... * renaming elv_next_request() to blk_peek_request() * renaming blkdev_dequeue_request() to blk_start_request() * adding blk_fetch_request() which is combination of peek and start * disallowing completion of queued (not started) requests * applying new API to all LLDs Renamings are for consistency and to break out of tree code so that it's apparent that out of tree drivers need updating. [ Impact: block request issue API cleanup, no functional change ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Mike Miller <mike.miller@hp.com> Cc: unsik Kim <donari75@gmail.com> Cc: Paul Clements <paul.clements@steeleye.com> Cc: Tim Waugh <tim@cyberelk.net> Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Cc: David S. Miller <davem@davemloft.net> Cc: Laurent Vivier <Laurent@lvivier.info> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Grant Likely <grant.likely@secretlab.ca> Cc: Adrian McMenamin <adrian@mcmen.demon.co.uk> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Cc: Borislav Petkov <petkovbb@googlemail.com> Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Alex Dubov <oakad@yahoo.com> Cc: Pierre Ossman <drzeus@drzeus.cx> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Markus Lidel <Markus.Lidel@shadowconnect.com> Cc: Stefan Weinhuber <wein@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 09:52:18 +02:00
Tejun Heo	a2dec7b363	block: hide request sector and data_len Block low level drivers for some reason have been pretty good at abusing block layer API. Especially struct request's fields tend to get violated in all possible ways. Make it clear that low level drivers MUST NOT access or manipulate rq->sector and rq->data_len directly by prefixing them with double underscores. This change is also necessary to break build of out-of-tree codes which assume the previous block API where internal fields can be manipulated and rq->data_len carries residual count on completion. [ Impact: hide internal fields, block API change ] Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 09:50:55 +02:00
Tejun Heo	2e46e8b27a	block: drop request->hard_* and nr_sectors struct request has had a few different ways to represent some properties of a request. ->hard_ represent block layer's view of the request progress (completion cursor) and the ones without the prefix are supposed to represent the issue cursor and allowed to be updated as necessary by the low level drivers. The thing is that as block layer supports partial completion, the two cursors really aren't necessary and only cause confusion. In addition, manual management of request detail from low level drivers is cumbersome and error-prone at the very least. Another interesting duplicate fields are rq->[hard_]nr_sectors and rq->{hard_cur\|current}_nr_sectors against rq->data_len and rq->bio->bi_size. This is more convoluted than the hard_ case. rq->[hard_]nr_sectors are initialized for requests with bio but blk_rq_bytes() uses it only for !pc requests. rq->data_len is initialized for all request but blk_rq_bytes() uses it only for pc requests. This causes good amount of confusion throughout block layer and its drivers and determining the request length has been a bit of black magic which may or may not work depending on circumstances and what the specific LLD is actually doing. rq->{hard_cur\|current}_nr_sectors represent the number of sectors in the contiguous data area at the front. This is mainly used by drivers which transfers data by walking request segment-by-segment. This value always equals rq->bio->bi_size >> 9. However, data length for pc requests may not be multiple of 512 bytes and using this field becomes a bit confusing. In general, having multiple fields to represent the same property leads only to confusion and subtle bugs. With recent block low level driver cleanups, no driver is accessing or manipulating these duplicate fields directly. Drop all the duplicates. Now rq->sector means the current sector, rq->data_len the current total length and rq->bio->bi_size the current segment length. Everything else is defined in terms of these three and available only through accessors. * blk_recalc_rq_sectors() is collapsed into blk_update_request() and now handles pc and fs requests equally other than rq->sector update. This means that now pc requests can use partial completion too (no in-kernel user yet tho). * bio_cur_sectors() is replaced with bio_cur_bytes() as block layer now uses byte count as the primary data length. * blk_rq_pos() is now guranteed to be always correct. In-block users converted. * blk_rq_bytes() is now guaranteed to be always valid as is blk_rq_sectors(). In-block users converted. * blk_rq_sectors() is now guaranteed to equal blk_rq_bytes() >> 9. More convenient one is used. * blk_rq_bytes() and blk_rq_cur_bytes() are now inlined and take const pointer to request. [ Impact: API cleanup, single way to represent one property of a request ] Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-05-11 09:50:54 +02:00

... 3 4 5 6 7 ...

1340 Commits