Commit Graph

4666 Commits

Author SHA1 Message Date
Jens Axboe 6222d1721d NVMe: cq_vector should be signed
This was inadvertently dropped from an earlier commit, otherwise
the check against cq_vector == -1 to prevent double free doesn't
make any sense.

Fixes: 2b25d98179
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-15 15:19:10 -07:00
Boaz Harrosh c8fa31730f brd: Request from fdisk 4k alignment
Because of the direct_access() API which returns a PFN. partitions
better start on 4K boundary, else offset ZERO of a partition will
not be aligned and blk_direct_access() will fail the call.

By setting blk_queue_physical_block_size(PAGE_SIZE) we can communicate
this to fdisk and friends.

The call to blk_queue_physical_block_size() is harmless and will
not affect the Kernel behavior in any way. It is only for
communication to user-mode.

before this patch running fdisk on a default size brd of 4M
the first sector offered is 34 (BAD), but after this patch it
will be 40, ie 8 sectors aligned. Also when entering some random
partition sizes the next partition-start sector is offered 8 sectors
aligned after this patch. (Please note that with fdisk the user
can still enter bad values, only the offered default values will
be correct)

Note that with bdev-size > 4M fdisk will try to align on a 1M
boundary (above first-sector will be 2048), in any case.

CC: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-13 21:59:15 -07:00
Boaz Harrosh 937af5ecd0 brd: Fix all partitions BUGs
This patch fixes up brd's partitions scheme, now enjoying all worlds.

The MAIN fix here is that currently, if one fdisks some partitions,
a BAD bug will make all partitions point to the same start-end sector
ie: 0 - brd_size And an mkfs of any partition would trash the partition
table and the other partition.

Another fix is that "mount -U uuid" will not work if show_part was not
specified, because of the GENHD_FL_SUPPRESS_PARTITION_INFO flag.
We now always load without it and remove the show_part parameter.

[We remove Dmitry's new module-param part_show it is now always
 show]

So NOW the logic goes like this:
* max_part - Just says how many minors to reserve between ramX
  devices. In any way, there can be as many partition as requested.
  If minors between devices ends, then dynamic 259-major ids will
  be allocated on the fly.
  The default is now max_part=1, which means all partitions devt(s)
  will be from the dynamic (259) major-range.
  (If persistent partition minors is needed use max_part=X)
  For example with /dev/sdX max_part is hard coded 16.

* Creation of new devices on the fly still/always work:
  mknod /path/devnod b 1 X
  fdisk -l /path/devnod
  Will create a new device if [X / max_part] was not already
  created before. (Just as before)

  partitions on the dynamically created device will work as well
  Same logic applies with minors as with the pre-created ones.

TODO: dynamic grow of device size. So each device can have it's
      own size.

CC: Dmitry Monakhov <dmonakhov@openvz.org>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-13 21:59:03 -07:00
Jens Axboe d4119ee0e1 Merge branch 'for-3.20/core' into for-3.20/drivers 2015-01-13 21:58:45 -07:00
Matthew Wilcox dd22f551ac block: Change direct_access calling convention
In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Add a new helper function, bdev_direct_access(), to handle common
functionality including partition handling, checking the length requested
is positive, checking for the sector being page-aligned, and checking
the length of the request does not pass the end of the partition.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-13 21:58:11 -07:00
Keith Busch 7a509a6b07 NVMe: Fix locking on abort handling
The queues and device need to be locked when messing with them.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08 09:02:23 -07:00
Keith Busch c9d3bf8810 NVMe: Start and stop h/w queues on reset
This freezes and stops all the queues on device shutdown and restarts
them on resume. This fixes hotplug and reset issues when the controller
is actively being used.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08 09:02:20 -07:00
Keith Busch cef6a94827 NVMe: Command abort handling fixes
Aborts all requeued commands prior to killing the request_queue. For
commands that time out on a dying request queue, set the "Do Not Retry"
bit on the command status so the command cannot be requeued. Finanally, if
the driver is requested to abort a command it did not start, do nothing.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08 09:02:18 -07:00
Keith Busch 0fb59cbc5f NVMe: Admin queue removal handling
This protects admin queue access on shutdown. When the controller is
disabled, the queue is frozen to prevent new entry, and unfrozen on
resume, and fixes cq_vector signedness to not suspend a queue twice.

Since unfreezing the queue makes it available for commands, it requires
the queue be initialized, so this moves this part after that.

Special handling is done when the device is unresponsive during
shutdown. This can be optimized to not require subsequent commands to
timeout, but saving that fix for later.

This patch also removes the kill signals in this path that were left-over
artifacts from the blk-mq conversion and no longer necessary.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08 09:02:08 -07:00
Keith Busch ea191d2f36 NVMe: Reference count admin queue usage
Since there is no gendisk associated with the admin queue, the driver
needs to hold a reference to it until all open references to the
controller are closed.

This also combines queue cleanup with freeing the tag set since these
should not be separate.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08 09:00:32 -07:00
Keith Busch c917dfe528 NVMe: Start all requests
Once the nvme callback is set for a request, the driver can start it
and make it available for timeout handling. For timed out commands on a
device that is not initialized, this fixes potential deadlocks that can
occur on startup and shutdown when a device is unresponsive since they
can now be cancelled.

Asynchronous requests do not have any expected timeout, so these are
using the new "REQ_NO_TIMEOUT" request flags.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-08 09:00:29 -07:00
Jens Axboe 78e367a360 loop: add blk-mq.h include
Looks like we pull it in through other ways on x86, but we fail
on sparc:

In file included from drivers/block/cryptoloop.c:30:0:
drivers/block/loop.h:63:24: error: field 'tag_set' has incomplete type
struct blk_mq_tag_set tag_set;

Add the include to loop.h, kill it from loop.c.

Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02 15:20:25 -07:00
Ming Lei af65aa8ea7 block: loop: don't handle REQ_FUA explicitly
block core handles REQ_FUA by its flush state machine, so
won't do it in loop explicitly.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02 15:07:49 -07:00
Ming Lei cf655d9534 block: loop: introduce lo_discard() and lo_req_flush()
No behaviour change, just move the handling for REQ_DISCARD
and REQ_FLUSH in these two functions.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02 15:07:49 -07:00
Ming Lei 3011201346 block: loop: say goodby to bio
Switch to block request completely.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02 15:07:49 -07:00
Ming Lei b5dd2f6047 block: loop: improve performance via blk-mq
The conversion is a bit straightforward, and use work queue to
dispatch requests of loop block, and one big change is that requests
is submitted to backend file/device concurrently with work queue,
so throughput may get improved much. Given write requests over same
file are often run exclusively, so don't handle them concurrently for
avoiding extra context switch cost, possible lock contention and work
schedule cost. Also with blk-mq, there is opportunity to get loop I/O
merged before submitting to backend file/device.

In the following test:
	- base: v3.19-rc2-2041231
	- loop over file in ext4 file system on SSD disk
	- bs: 4k, libaio, io depth: 64, O_DIRECT, num of jobs: 1
	- throughput: IOPS

	------------------------------------------------------
	|            | base      | base with loop-mq | delta |
	------------------------------------------------------
	| randread   | 1740      | 25318             | +1355%|
	------------------------------------------------------
	| read       | 42196     | 51771             | +22.6%|
	-----------------------------------------------------
	| randwrite  | 35709     | 34624             | -3%   |
	-----------------------------------------------------
	| write      | 39137     | 40326             | +3%   |
	-----------------------------------------------------

So loop-mq can improve throughput for both read and randread, meantime,
performance of write and randwrite isn't hurted basically.

Another benefit is that loop driver code gets simplified
much after blk-mq conversion, and the patch can be thought as
cleanup too.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02 15:07:49 -07:00
Ming Lei 35b489d32f block: fix checking return value of blk_mq_init_queue
Check IS_ERR_OR_NULL(return value) instead of just return value.

Signed-off-by: Ming Lei <ming.lei@canonical.com>

Reduced to IS_ERR() by me, we never return NULL.
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-02 10:32:02 -07:00
Keith Busch 2b25d98179 NVMe: Fix double free irq
Sets the vector to an invalid value after it's freed so we don't free
it twice.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-22 12:59:04 -07:00
Linus Torvalds 57666509b7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph updates from Sage Weil:
 "The big item here is support for inline data for CephFS and for
  message signatures from Zheng.  There are also several bug fixes,
  including interrupted flock request handling, 0-length xattrs, mksnap,
  cached readdir results, and a message version compat field.  Finally
  there are several cleanups from Ilya, Dan, and Markus.

  Note that there is another series coming soon that fixes some bugs in
  the RBD 'lingering' requests, but it isn't quite ready yet"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits)
  ceph: fix setting empty extended attribute
  ceph: fix mksnap crash
  ceph: do_sync is never initialized
  libceph: fixup includes in pagelist.h
  ceph: support inline data feature
  ceph: flush inline version
  ceph: convert inline data to normal data before data write
  ceph: sync read inline data
  ceph: fetch inline data when getting Fcr cap refs
  ceph: use getattr request to fetch inline data
  ceph: add inline data to pagecache
  ceph: parse inline data in MClientReply and MClientCaps
  libceph: specify position of extent operation
  libceph: add CREATE osd operation support
  libceph: add SETXATTR/CMPXATTR osd operations support
  rbd: don't treat CEPH_OSD_OP_DELETE as extent op
  ceph: remove unused stringification macros
  libceph: require cephx message signature by default
  ceph: introduce global empty snap context
  ceph: message versioning fixes
  ...
2014-12-17 16:03:12 -08:00
Ilya Dryomov 7e868b6eff rbd: don't treat CEPH_OSD_OP_DELETE as extent op
CEPH_OSD_OP_DELETE is not an extent op, stop treating it as such.  This
sneaked in with discard patches - it's one of the three osd ops (the
other two are CEPH_OSD_OP_TRUNCATE and CEPH_OSD_OP_ZERO) that discard
is implemented with.

Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
Reviewed-by: Alex Elder <elder@linaro.org>
2014-12-17 20:09:51 +03:00
SF Markus Elfring e96a650a81 ceph, rbd: delete unnecessary checks before two function calls
The functions ceph_put_snap_context() and iput() test whether their
argument is NULL and then return immediately. Thus the test around the
call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[idryomov@redhat.com: squashed rbd.c hunk, changelog]
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-12-17 20:09:50 +03:00
Linus Torvalds e6b5be2be4 Driver core patches for 3.19-rc1
Here's the set of driver core patches for 3.19-rc1.
 
 They are dominated by the removal of the .owner field in platform
 drivers.  They touch a lot of files, but they are "simple" changes, just
 removing a line in a structure.
 
 Other than that, a few minor driver core and debugfs changes.  There are
 some ath9k patches coming in through this tree that have been acked by
 the wireless maintainers as they relied on the debugfs changes.
 
 Everything has been in linux-next for a while.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iEYEABECAAYFAlSOD20ACgkQMUfUDdst+ylLPACg2QrW1oHhdTMT9WI8jihlHVRM
 53kAoLeteByQ3iVwWurwwseRPiWa8+MI
 =OVRS
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core update from Greg KH:
 "Here's the set of driver core patches for 3.19-rc1.

  They are dominated by the removal of the .owner field in platform
  drivers.  They touch a lot of files, but they are "simple" changes,
  just removing a line in a structure.

  Other than that, a few minor driver core and debugfs changes.  There
  are some ath9k patches coming in through this tree that have been
  acked by the wireless maintainers as they relied on the debugfs
  changes.

  Everything has been in linux-next for a while"

* tag 'driver-core-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (324 commits)
  Revert "ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries"
  fs: debugfs: add forward declaration for struct device type
  firmware class: Deletion of an unnecessary check before the function call "vunmap"
  firmware loader: fix hung task warning dump
  devcoredump: provide a one-way disable function
  device: Add dev_<level>_once variants
  ath: ath9k: use debugfs_create_devm_seqfile() helper for seq_file entries
  ath: use seq_file api for ath9k debugfs files
  debugfs: add helper function to create device related seq_file
  drivers/base: cacheinfo: remove noisy error boot message
  Revert "core: platform: add warning if driver has no owner"
  drivers: base: support cpu cache information interface to userspace via sysfs
  drivers: base: add cpu_device_create to support per-cpu devices
  topology: replace custom attribute macros with standard DEVICE_ATTR*
  cpumask: factor out show_cpumap into separate helper function
  driver core: Fix unbalanced device reference in drivers_probe
  driver core: fix race with userland in device_add()
  sysfs/kernfs: make read requests on pre-alloc files use the buffer.
  sysfs/kernfs: allow attributes to request write buffer be pre-allocated.
  fs: sysfs: return EGBIG on write if offset is larger than file size
  ...
2014-12-14 16:10:09 -08:00
Linus Torvalds 9ea18f8cab Merge branch 'for-3.19/drivers' of git://git.kernel.dk/linux-block
Pull block layer driver updates from Jens Axboe:

 - NVMe updates:
        - The blk-mq conversion from Matias (and others)

        - A stack of NVMe bug fixes from the nvme tree, mostly from Keith.

        - Various bug fixes from me, fixing issues in both the blk-mq
          conversion and generic bugs.

        - Abort and CPU online fix from Sam.

        - Hot add/remove fix from Indraneel.

 - A couple of drbd fixes from the drbd team (Andreas, Lars, Philipp)

 - With the generic IO stat accounting from 3.19/core, converting md,
   bcache, and rsxx to use those.  From Gu Zheng.

 - Boundary check for queue/irq mode for null_blk from Matias.  Fixes
   cases where invalid values could be given, causing the device to hang.

 - The xen blkfront pull request, with two bug fixes from Vitaly.

* 'for-3.19/drivers' of git://git.kernel.dk/linux-block: (56 commits)
  NVMe: fix race condition in nvme_submit_sync_cmd()
  NVMe: fix retry/error logic in nvme_queue_rq()
  NVMe: Fix FS mount issue (hot-remove followed by hot-add)
  NVMe: fix error return checking from blk_mq_alloc_request()
  NVMe: fix freeing of wrong request in abort path
  xen/blkfront: remove redundant flush_op
  xen/blkfront: improve protection against issuing unsupported REQ_FUA
  NVMe: Fix command setup on IO retry
  null_blk: boundary check queue_mode and irqmode
  block/rsxx: use generic io stats accounting functions to simplify io stat accounting
  md: use generic io stats accounting functions to simplify io stat accounting
  drbd: use generic io stats accounting functions to simplify io stat accounting
  md/bcache: use generic io stats accounting functions to simplify io stat accounting
  NVMe: Update module version major number
  NVMe: fail pci initialization if the device doesn't have any BARs
  NVMe: add ->exit_hctx() hook
  NVMe: make setup work for devices that don't do INTx
  NVMe: enable IO stats by default
  NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
  NVMe: replace blk_put_request() with blk_mq_free_request()
  ...
2014-12-13 14:22:26 -08:00
Linus Torvalds caf292ae5b Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block
Pull block driver core update from Jens Axboe:
 "This is the pull request for the core block IO changes for 3.19.  Not
  a huge round this time, mostly lots of little good fixes:

   - Fix a bug in sysfs blktrace interface causing a NULL pointer
     dereference, when enabled/disabled through that API.  From Arianna
     Avanzini.

   - Various updates/fixes/improvements for blk-mq:

        - A set of updates from Bart, mostly fixing buts in the tag
          handling.

        - Cleanup/code consolidation from Christoph.

        - Extend queue_rq API to be able to handle batching issues of IO
          requests. NVMe will utilize this shortly. From me.

        - A few tag and request handling updates from me.

        - Cleanup of the preempt handling for running queues from Paolo.

        - Prevent running of unmapped hardware queues from Ming Lei.

        - Move the kdump memory limiting check to be in the correct
          location, from Shaohua.

        - Initialize all software queues at init time from Takashi. This
          prevents a kobject warning when CPUs are brought online that
          weren't online when a queue was registered.

   - Single writeback fix for I_DIRTY clearing from Tejun.  Queued with
     the core IO changes, since it's just a single fix.

   - Version X of the __bio_add_page() segment addition retry from
     Maurizio.  Hope the Xth time is the charm.

   - Documentation fixup for IO scheduler merging from Jan.

   - Introduce (and use) generic IO stat accounting helpers for non-rq
     drivers, from Gu Zheng.

   - Kill off artificial limiting of max sectors in a request from
     Christoph"

* 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
  bio: modify __bio_add_page() to accept pages that don't start a new segment
  blk-mq: Fix uninitialized kobject at CPU hotplugging
  blktrace: don't let the sysfs interface remove trace from running list
  blk-mq: Use all available hardware queues
  blk-mq: Micro-optimize bt_get()
  blk-mq: Fix a race between bt_clear_tag() and bt_get()
  blk-mq: Avoid that __bt_get_word() wraps multiple times
  blk-mq: Fix a use-after-free
  blk-mq: prevent unmapped hw queue from being scheduled
  blk-mq: re-check for available tags after running the hardware queue
  blk-mq: fix hang in bt_get()
  blk-mq: move the kdump check to blk_mq_alloc_tag_set
  blk-mq: cleanup tag free handling
  blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map
  blk: introduce generic io stat accounting help function
  blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
  genhd: check for int overflow in disk_expand_part_tbl()
  blk-mq: add blk_mq_free_hctx_request()
  blk-mq: export blk_mq_free_request()
  blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
  ...
2014-12-13 14:14:23 -08:00
Linus Torvalds 78a45c6f06 Merge branch 'akpm' (second patch-bomb from Andrew)
Merge second patchbomb from Andrew Morton:
 - the rest of MM
 - misc fs fixes
 - add execveat() syscall
 - new ratelimit feature for fault-injection
 - decompressor updates
 - ipc/ updates
 - fallocate feature creep
 - fsnotify cleanups
 - a few other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (99 commits)
  cgroups: Documentation: fix trivial typos and wrong paragraph numberings
  parisc: percpu: update comments referring to __get_cpu_var
  percpu: update local_ops.txt to reflect this_cpu operations
  percpu: remove __get_cpu_var and __raw_get_cpu_var macros
  fsnotify: remove destroy_list from fsnotify_mark
  fsnotify: unify inode and mount marks handling
  fallocate: create FAN_MODIFY and IN_MODIFY events
  mm/cma: make kmemleak ignore CMA regions
  slub: fix cpuset check in get_any_partial
  slab: fix cpuset check in fallback_alloc
  shmdt: use i_size_read() instead of ->i_size
  ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segments
  ipc/msg: increase MSGMNI, remove scaling
  ipc/sem.c: increase SEMMSL, SEMMNI, SEMOPM
  ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()
  lib/decompress.c: consistency of compress formats for kernel image
  decompress_bunzip2: off by one in get_next_block()
  usr/Kconfig: make initrd compression algorithm selection not expert
  fault-inject: add ratelimit option
  ratelimit: add initialization macro
  ...
2014-12-13 13:00:36 -08:00
Ganesh Mahendran 083914eab9 zram: use DEVICE_ATTR_[RW|RO|WO] to define zram sys device attribute
In current zram, we use DEVICE_ATTR() to define sys device attributes.
SO, we need to set (S_IRUGO | S_IWUSR) permission and other arguments
manually.  Linux already provids the macro DEVICE_ATTR_[RW|RO|WO] to
define sys device attribute.  It is simple and readable.

This patch uses kernel defined macro DEVICE_ATTR_[RW|RO|WO] to define
zram device attribute.

Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:50 -08:00
Mahendran Ganesh d49b1c254c mm/zram: correct ZRAM_ZERO flag bit position
In struct zram_table_entry, the element *value* contains obj size and obj
zram flags.  Bit 0 to bit (ZRAM_FLAG_SHIFT - 1) represent obj size, and
bit ZRAM_FLAG_SHIFT to the highest bit of unsigned long represent obj
zram_flags.  So the first zram flag(ZRAM_ZERO) should be from
ZRAM_FLAG_SHIFT instead of (ZRAM_FLAG_SHIFT + 1).

This patch fixes this cosmetic issue.

Also fix a typo, "page in now accessed" -> "page is now accessed"

Signed-off-by: Mahendran Ganesh <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:50 -08:00
karam.lee 8c7f01025f zram: implement rw_page operation of zram
This patch implements rw_page operation for zram block device.

I implemented the feature in zram and tested it.  Test bed was the G2, LG
electronic mobile device, whtich has msm8974 processor and 2GB memory.

With a memory allocation test program consuming memory, the system
generates swap.

Operating time of swap_write_page() was measured.

--------------------------------------------------
|             |   operating time   | improvement |
|             |  (20 runs average) |             |
--------------------------------------------------
|with patch   |    1061.15 us      |    +2.4%    |
--------------------------------------------------
|without patch|    1087.35 us      |             |
--------------------------------------------------

Each test(with paged_io,with BIO) result set shows normal distribution and
has equal variance.  I mean the two values are valid result to compare.  I
can say operation with paged I/O(without BIO) is faster 2.4% with
confidence level 95%.

[minchan@kernel.org: make rw_page opeartion return 0]
[minchan@kernel.org: rely on the bi_end_io for zram_rw_page fails]
[sergey.senozhatsky@gmail.com: code cleanup]
[minchan@kernel.org: add comment]
Signed-off-by: karam.lee <karam.lee@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: <seungho1.park@lge.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:50 -08:00
karam.lee 54850e73e8 zram: change parameter from vaild_io_request()
This patch changes parameter of valid_io_request for common usage.  The
purpose of valid_io_request() is to determine if bio request is valid or
not.

This patch use I/O start address and size instead of a BIO parameter for
common usage.

Signed-off-by: karam.lee <karam.lee@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: <seungho1.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:50 -08:00
karam.lee b627cff3d3 zram: remove bio parameter from zram_bvec_rw()
Recently rw_page block device operation has been added.  This patchset
implements rw_page operation for zram block device and does some clean-up.

This patch (of 3):

Remove an unnecessary parameter(bio) from zram_bvec_rw() and
zram_bvec_read().  zram_bvec_read() doesn't use a bio parameter, so remove
it.  zram_bvec_rw() calls a read/write operation not using bio, so a rw
parameter replaces a bio parameter.

Signed-off-by: karam.lee <karam.lee@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: <seungho1.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13 12:42:49 -08:00
Jens Axboe 849c6e7746 NVMe: fix race condition in nvme_submit_sync_cmd()
If we have a race between the schedule timing out and the command
completing, we could have the task issuing the command exit
nvme_submit_sync_cmd() while the irq is running sync_completion().
If that happens, we could be corrupting memory, since the stack
that held 'cmdinfo' is no longer valid.

Fix this by always calling nvme_abort_cmd_info(). Once that call
completes, we know that we have either run sync_completion() if
the completion came in, or that we will never run it since we now
have special_completion() as the command callback handler.

Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-12 08:53:40 -07:00
Dwight Engen 76e74bbe0a sunvdc: reconnect ldc after vds service domain restarts
This change enables the sunvdc driver to reconnect and recover if a vds
service domain is disconnected or bounced.

By default, it will wait indefinitely for the service domain to become
available again, but will honor a non-zero vdc-timout md property if one
is set. If a timeout is reached, any in-progress I/O's are completed
with -EIO.

Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 18:52:45 -08:00
Dwight Engen fe47c3c262 vio: create routines for inc,dec vio dring indexes
Both sunvdc and sunvnet implemented distinct functionality for incrementing
and decrementing dring indexes. Create common functions for use by both
from the sunvnet versions, which were chosen since they will still work
correctly in case a non power of two ring size is used.

Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 18:52:45 -08:00
Dwight Engen 31f4888f51 sunvdc: fix module unload/reload
Free resources allocated during port/disk probing so that the module may be
successfully reloaded after unloading.

Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11 18:51:57 -08:00
Jens Axboe fe54303ee2 NVMe: fix retry/error logic in nvme_queue_rq()
The logic around retrying and erroring IO in nvme_queue_rq() is broken
in a few ways:

- If we fail allocating dma memory for a discard, we return retry. We
  have the 'iod' stored in ->special, but we free the 'iod'.

- For a normal request, if we fail dma mapping of setting up prps, we
  have the same iod situation. Additionally, we haven't set the callback
  for the request yet, so we also potentially leak IOMMU resources.

Get rid of the ->special 'iod' store. The retry is uncommon enough that
it's not worth optimizing for or holding on to resources to attempt to
speed it up. Additionally, it's usually best practice to free any
request related resources when doing retries.

Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-11 13:58:39 -07:00
Linus Torvalds 6b9e2cea42 virtio: virtio 1.0 support, misc patches
This adds a lot of infrastructure for virtio 1.0 support.
 Notable missing pieces: virtio pci, virtio balloon (needs spec extension),
 vhost scsi.
 
 Plus, there are some minor fixes in a couple of places.
 
 Cc: David Miller <davem@davemloft.net>
 Cc: Rusty Russell <rusty@rustcorp.com.au>
 Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUh1CVAAoJECgfDbjSjVRpWZcH/2+EGPyng7Lca820UHA0cU1U
 u4D8CAAwOGaVdnUUo8ox1eon3LNB2UgRtgsl3rBDR3YTgFfNPrfuYdnHO0dYIDc1
 lS26NuPrVrTX0lA+OBPe2nlKrsrOkn8aw1kxG9Y0gKtNg/+HAGNW5e2eE7R/LrA5
 94XbWZ8g9Yf4GPG1iFmih9vQvvN0E68zcUlojfCnllySgaIEYr8nTiGQBWpRgJat
 fCqFAp1HMDZzGJQO+m1/Vw0OftTRVybyfai59e6uUTa8x1djvzPb/1MvREqQjegM
 ylSuofIVyj7JPu++FbAjd9mikkb53GSc8ql3YmWNZLdr69rnkzP0GdzQvrdheAo=
 =RtrR
 -----END PGP SIGNATURE-----

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull virtio updates from Michael Tsirkin:
 "virtio: virtio 1.0 support, misc patches

  This adds a lot of infrastructure for virtio 1.0 support.  Notable
  missing pieces: virtio pci, virtio balloon (needs spec extension),
  vhost scsi.

  Plus, there are some minor fixes in a couple of places.

  Note: some net drivers are affected by these patches.  David said he's
  fine with merging these patches through my tree.

  Rusty's on vacation, he acked using my tree for these, too"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (70 commits)
  virtio_ccw: finalize_features error handling
  virtio_ccw: future-proof finalize_features
  virtio_pci: rename virtio_pci -> virtio_pci_common
  virtio_pci: update file descriptions and copyright
  virtio_pci: split out legacy device support
  virtio_pci: setup config vector indirectly
  virtio_pci: setup vqs indirectly
  virtio_pci: delete vqs indirectly
  virtio_pci: use priv for vq notification
  virtio_pci: free up vq->priv
  virtio_pci: fix coding style for structs
  virtio_pci: add isr field
  virtio: drop legacy_only driver flag
  virtio_balloon: drop legacy_only driver flag
  virtio_ccw: rev 1 devices set VIRTIO_F_VERSION_1
  virtio: allow finalize_features to fail
  virtio_ccw: legacy: don't negotiate rev 1/features
  virtio: add API to detect legacy devices
  virtio_console: fix sparse warnings
  vhost: remove unnecessary forward declarations in vhost.h
  ...
2014-12-11 12:20:31 -08:00
Indraneel M 285dffc910 NVMe: Fix FS mount issue (hot-remove followed by hot-add)
After Hot-remove of a device with a mounted partition,
when the device is hot-added again, the new node reappears
as nvme0n1. Mounting this new node fails with the error:

mount: mount /dev/nvme0n1p1 on /mnt failed: File exists.

The old nodes's FS entries still exist and the kernel can't re-create
procfs and sysfs entries for the new node with the same name.
The patch fixes this issue.

Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Indraneel M <indraneel.m@samsung.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-11 08:24:18 -07:00
Linus Torvalds cbfe0de303 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS changes from Al Viro:
 "First pile out of several (there _definitely_ will be more).  Stuff in
  this one:

   - unification of d_splice_alias()/d_materialize_unique()

   - iov_iter rewrite

   - killing a bunch of ->f_path.dentry users (and f_dentry macro).

     Getting that completed will make life much simpler for
     unionmount/overlayfs, since then we'll be able to limit the places
     sensitive to file _dentry_ to reasonably few.  Which allows to have
     file_inode(file) pointing to inode in a covered layer, with dentry
     pointing to (negative) dentry in union one.

     Still not complete, but much closer now.

   - crapectomy in lustre (dead code removal, mostly)

   - "let's make seq_printf return nothing" preparations

   - assorted cleanups and fixes

  There _definitely_ will be more piles"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  copy_from_iter_nocache()
  new helper: iov_iter_kvec()
  csum_and_copy_..._iter()
  iov_iter.c: handle ITER_KVEC directly
  iov_iter.c: convert copy_to_iter() to iterate_and_advance
  iov_iter.c: convert copy_from_iter() to iterate_and_advance
  iov_iter.c: get rid of bvec_copy_page_{to,from}_iter()
  iov_iter.c: convert iov_iter_zero() to iterate_and_advance
  iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds
  iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds
  iov_iter.c: convert iov_iter_npages() to iterate_all_kinds
  iov_iter.c: iterate_and_advance
  iov_iter.c: macros for iterating over iov_iter
  kill f_dentry macro
  dcache: fix kmemcheck warning in switch_names
  new helper: audit_file()
  nfsd_vfs_write(): use file_inode()
  ncpfs: use file_inode()
  kill f_dentry uses
  lockd: get rid of ->f_path.dentry->d_sb
  ...
2014-12-10 16:10:49 -08:00
Jens Axboe d8ead9b763 Merge branch 'for-jens-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-3.19/drivers
Konrad writes:

These are two fixes for Xen blkfront. They harden how it deals with
broken backends.
2014-12-10 13:58:17 -07:00
Jens Axboe 97fe383222 NVMe: fix error return checking from blk_mq_alloc_request()
We return an error pointer or the request, not NULL. Half
the call paths got it right, the others didn't. Fix those up.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-10 13:02:44 -07:00
Sam Bradshaw c87fd5407e NVMe: fix freeing of wrong request in abort path
We allocate 'abort_req', but free 'req' in case of an error
submitting the IO.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-10 13:00:31 -07:00
Vitaly Kuznetsov fdf9b96503 xen/blkfront: remove redundant flush_op
flush_op is unambiguously defined by feature_flush:
    REQ_FUA | REQ_FLUSH -> BLKIF_OP_WRITE_BARRIER
    REQ_FLUSH -> BLKIF_OP_FLUSH_DISKCACHE
    0 -> 0
and thus can be removed. This is just a cleanup.

The patch was suggested by Boris Ostrovsky.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-12-10 12:20:18 -05:00
Vitaly Kuznetsov ad42d391ae xen/blkfront: improve protection against issuing unsupported REQ_FUA
Guard against issuing unsupported REQ_FUA and REQ_FLUSH was introduced
in d11e61583 and was factored out into blkif_request_flush_valid() in
0f1ca65ee. However:
1) This check in incomplete. In case we negotiated to feature_flush = REQ_FLUSH
   and flush_op = BLKIF_OP_FLUSH_DISKCACHE (so FUA is unsupported) FUA request
   will still pass the check.
2) blkif_request_flush_valid() is misnamed. It is bool but returns true when
   the request is invalid.
3) When blkif_request_flush_valid() fails -EIO is being returned. It seems that
   -EOPNOTSUPP is more appropriate here.
Fix all of the above issues.

This patch is based on the original patch by Laszlo Ersek and a comment by
Jeff Moyer.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-12-10 12:20:07 -05:00
Michael S. Tsirkin 51cdc3815f virtio: drop VIRTIO_F_VERSION_1 from drivers
Core activates this bit automatically now,
drop it from drivers that set it explicitly.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-12-09 12:06:32 +02:00
Michael S. Tsirkin 38f37b578f virtio_blk: fix race at module removal
If a device appears while module is being removed,
driver will get a callback after we've given up
on the major number.

In theory this means this major number can get reused
by something else, resulting in a conflict.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2014-12-09 12:05:27 +02:00
Michael S. Tsirkin 393c525b5b virtio_blk: make serial attribute static
It's never declared so no need to make it extern.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
2014-12-09 12:05:27 +02:00
Michael S. Tsirkin 19c1c5a64c virtio_blk: v1.0 support
Based on patch by Cornelia Huck.

Note: for consistency, and to avoid sparse errors,
      convert all fields, even those no longer in use
      for virtio v1.0.

Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2014-12-09 12:05:26 +02:00
Al Viro ba00410b81 Merge branch 'iov_iter' into for-next 2014-12-08 20:39:29 -05:00
James Bottomley dc843ef00e Merge remote-tracking branch 'scsi-queue/core-for-3.19' into for-linus 2014-12-08 07:40:20 -08:00
Keith Busch 9af8785a38 NVMe: Fix command setup on IO retry
On retry, the req->special is pointing to an already setup IOD, but we
still need to setup the command context and callback, otherwise you'll
see false twice completed errors and leak requests.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-12-03 19:10:45 -07:00
Matias Bjorling 709c8667ad null_blk: boundary check queue_mode and irqmode
When either queue_mode or irq_mode parameter is set outside its
boundaries, the driver will not complete requests. This stalls driver
initialization when partitions are probed. Fix by setting out of bound
values to the parameters default.

Signed-off-by: Matias Bjørling <m@bjorling.me>

Updated by me to have the parse+check in just one function.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-26 14:45:48 -07:00
Hannes Reinecke eb846d9f14 scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16
SPC-3 defines SERVICE ACTION IN(12) and SERVICE ACTION IN(16).
So rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 to be
consistent with SPC and to allow for better distinction.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Robert Elliott <elliott@hp.com>
Reviewed-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-11-24 20:01:40 +01:00
Gu Zheng fa573f7279 block/rsxx: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:18 -07:00
Gu Zheng 244808543e drbd: use generic io stats accounting functions to simplify io stat accounting
Use generic io stats accounting help functions (generic_{start,end}_io_acct)
to simplify io stat accounting.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-24 08:05:14 -07:00
Keith Busch c78b47136f NVMe: Update module version major number
It's already near impossible to tell what bits someone is running based on
a 'modinfo nvme', and I don't want to try guessing if someone is running
blk-mq or bio-based. Let's make it obvious with the module version that
the blk-mq conversion is a major change. Future bio-based versions can
increment to 0.10 in a fork if revisions occur.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-21 18:22:35 -07:00
Jens Axboe be7837e89d NVMe: fail pci initialization if the device doesn't have any BARs
The PCI init of NVMe doesn't check for valid bars before proceeding
to map and use BAR 0. If the device is hosed (or firmware is), then
we should catch this case and give up early.

This fixes a:

[ 1662.035778] WARNING: CPU: 0 PID: 4 at arch/x86/mm/ioremap.c:63 __ioremap_check_ram+0xa7/0xc0()

and later badness on such a device.

Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-20 11:10:06 -07:00
Jens Axboe 2c30540b38 NVMe: add ->exit_hctx() hook
If we do teardown and setup of the queue and block related parts
of the driver, then we should clear nvmeq->hctx once we kill the
hardware queue.

Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-20 11:10:03 -07:00
Jens Axboe e32efbfc35 NVMe: make setup work for devices that don't do INTx
The setup/probe part currently relies on INTx being there and
working, that's not always the case. For devices that don't
advertise INTx, enable a single MSIx vector early on and disable
it again before we ask for our full range of queue vecs.

Acked-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-19 12:46:17 -07:00
Al Viro b583043e99 kill f_dentry uses
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19 13:01:25 -05:00
Jens Axboe b352172976 Merge branch 'master' into for-3.19/drivers 2014-11-18 19:43:46 -07:00
Jens Axboe 1397688953 NVMe: enable IO stats by default
Before the blk-mq conversion they were on by default, we should
not change behavior there.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-18 08:45:31 -07:00
Jens Axboe 6dcc0cf6cb NVMe: nvme_submit_async_admin_req() must use atomic rq allocation
We are called for async event notification issues, and the
nvmeq lock is already held. If we fail the request allocation,
we'll just retry next time.

Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-18 08:21:18 -07:00
Jens Axboe 9d135bb8c2 NVMe: replace blk_put_request() with blk_mq_free_request()
No point in using blk_put_request(), since we know we are blk-mq.
This only makes sense in core code where we could be dealing with
either legacy or blk-mq drivers. Additionally, use
blk_mq_free_hctx_request() for the request completion fast path,
where we already know the mapping from request to hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-17 10:43:42 -07:00
Weijie Yang c406515239 zram: avoid kunmap_atomic() of a NULL pointer
zram could kunmap_atomic() a NULL pointer in a rare situation: a zram
page becomes a full-zeroed page after a partial write io.  The current
code doesn't handle this case and performs kunmap_atomic() on a NULL
pointer, which panics the kernel.

This patch fixes this issue.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Weijie Yang <weijie.yang.kh@gmail.com>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13 16:17:05 -08:00
Philipp Reisner e805b983d3 drbd: Remove an useless copy of kernel_setsockopt()
Old backward-compat cruft

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:43 -07:00
Philipp Reisner 9581f97a68 drbd: Fix state change in case of connection timeout
A connection timeout affects all volumes of a resource!
Under the following conditions:

 A resource with multiple volumes
  AND
 ko-count >=1
  AND
 a write request triggers the timeout (ko-count * timeout)

DRBD's internal state gets confused. That in turn may
lead to very miss leading follow up failures. E.g.
"BUG: scheduling while atomic"

CC: stable@kernel.org # v3.17
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:41 -07:00
Lars Ellenberg 3b9d35d744 drbd: merge_bvec_fn: properly remap bvm->bi_bdev
This was not noticed for many years. Affects operation if
md raid is used a backing device for DRBD.

CC: stable@kernel.org # v3.2+
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:39 -07:00
Lars Ellenberg ff8bd88b73 drbd: fix resync throttling initialization
If for some reason DRBD resync was the only activity on a backend
device, drbd_rs_c_min_rate_throttle() would mistakenly decide that it is
still initialization time, and keep throttling the resync.

This patch explicitly initializes ->rs_last_events to the current
backend event counters, and drops the rs_last_events == 0 from the
throttle condition.

Reported-by: Mikhail Sugakov <msugakov@amazon.de>

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:37 -07:00
Philipp Reisner a88215312c drbd: fix race between role change and handshake
Symptoms:
If DRBD was "cleanly shut down" (all in sync, both Secondary before
disconnect, identical data generation uuids), and then one side was
promoted *during* the next connection handshake, the role change
could confuse the handshake.

The Primary would get stuck in WFBitmapS, the Secondary would log
unexpected cstate (Connected) in receive_bitmap
and get stuck in WFBitmapT.

Fix:
The test in is_valid_soft_transition wrong. It works because
the not allowed actions (promote/attach) do not touch the
cstate. The previous condition failed to demand a cstate change
in one clause.

In order to avoid deadlocks give up the state_mutex while waiting
for the transient state to go away.

Conflicts:
	drbd/drbd_state.c
	drbd/drbd_state.h
	drbd/drbd_wrappers.h

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:35 -07:00
Andreas Gruenbacher f221f4bcc5 drbd: Only use drbd_msg_put_info() in drbd_nl.c
Avoid generic netlink calls in other parts of the code base.

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:33 -07:00
Andreas Gruenbacher 179e20b8df drbd: Minor cleanups
. Update comments
 . drbd_set_{in,out_of}_sync(): Remove unused parameters
 . Move common code into adm_del_resource()
 . Redefine ERR_MINOR_EXISTS -> ERR_MINOR_OR_VOLUME_EXISTS

Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:30 -07:00
kbuild test robot a64e6bb4db NVMe: __nvme_submit_admin_cmd() can be static
drivers/block/nvme-core.c:865:5: sparse: symbol '__nvme_submit_admin_cmd' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-10 09:27:01 -07:00
Dan Carpenter 9f173b3384 NVMe: blk_mq_alloc_request() returns error pointers
We recently converted this to blk_mq but the error checks have to be
updated to check for IS_ERR() instead of NULL.

Fixes: a4aea5623d ('NVMe: Convert to blk-mq')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-05 13:47:03 -07:00
Matias Bjørling a4aea5623d NVMe: Convert to blk-mq
This converts the NVMe driver to a blk-mq request-based driver.

The NVMe driver is currently bio-based and implements queue logic within
itself.  By using blk-mq, a lot of these responsibilities can be moved
and simplified.

The patch is divided into the following blocks:

 * Per-command data and cmdid have been moved into the struct request
   field. The cmdid_data can be retrieved using blk_mq_rq_to_pdu() and id
   maintenance are now handled by blk-mq through the rq->tag field.

 * The logic for splitting bio's has been moved into the blk-mq layer.
   The driver instead notifies the block layer about limited gap support in
   SG lists.

 * blk-mq handles timeouts and is reimplemented within nvme_timeout().
   This both includes abort handling and command cancelation.

 * Assignment of nvme queues to CPUs are replaced with the blk-mq
   version. The current blk-mq strategy is to assign the number of
   mapped queues and CPUs to provide synergy, while the nvme driver
   assign as many nvme hw queues as possible. This can be implemented in
   blk-mq if needed.

 * NVMe queues are merged with the tags structure of blk-mq.

 * blk-mq takes care of setup/teardown of nvme queues and guards invalid
   accesses. Therefore, RCU-usage for nvme queues can be removed.

 * IO tracing and accounting are handled by blk-mq and therefore removed.

 * Queue suspension logic is replaced with the logic from the block
   layer.

Contributions in this patch from:

  Sam Bradshaw <sbradshaw@micron.com>
  Jens Axboe <axboe@fb.com>
  Keith Busch <keith.busch@intel.com>
  Robert Nelson <rlnelson@google.com>

Acked-by: Keith Busch <keith.busch@intel.com>
Acked-by: Jens Axboe <axboe@fb.com>

Updated for new ->queue_rq() prototype.

Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:18:52 -07:00
Keith Busch 9dbbfab7d5 NVMe: Do not over allocate for discard requests
Discard requests are often for very large ranges. The discard size is not
representative of the data transfer size so we don't need to allocate
for such a large prp list. This patch requests allocating only enough
for the memory needed for the data transfer and saves a little over 8k
of memory per max discard request.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:18:37 -07:00
Keith Busch 9e60352cf8 NVMe: Do not open disks that are being deleted
It is possible the block layer will request to open a block device after
the driver deleted it. Subsequent releases will cause a double free,
or the disk's private_data is pointing to freed memory. This patch
protects the driver's freed disks from being opened and accessed: the
nvme namespaces are freed only when the device's refcount is 0, so at
that moment there were no active openers and no more should be allowed,
and it is safe to clear the disk's private_data that is about to be freed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Henry Chow <henry.chow@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:18:32 -07:00
Keith Busch 5940c8578f NVMe: Clear QUEUE_FLAG_STACKABLE
The nvme namespace request_queue's flags are initialized to
QUEUE_FLAG_DEFAULT, which currently sets QUEUE_FLAG_STACKABLE. The
device-mapper indicates this flag means the block driver is requset
based, though this driver is bio-based and problems will occur if an nvme
namespace is used with a request based dm device. This patch clears the
stackable flag.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:18:10 -07:00
Keith Busch 387caa5a2b NVMe: Fix device probe waiting on kthread
If we ever do parallel device probing, we need to wake up all processes
waiting for nvme kthread to start, not just one. This is currently
serialized so the bug is not reachable today, but fixing this anyway in
the hopes we implement parallel or asynchronous probe in the future.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:10 -07:00
Keith Busch 7963e52181 NVMe: Passthrough IOCTL for IO commands
The NVME_IOCTL_SUBMIT_IO only works for IO commands with block data
transfers and isn't usable for other NVMe commands like flush,
data set management, or any sort of vendor unique command. The
NVME_IOCTL_ADMIN_CMD, however, can easily be modified to accept arbitrary
IO commands in addition to arbitrary admin commands without breaking
backward compatibility. This patch just adds a new IOCTL to distinguish
if the driver should submit the command on an IO or Admin queue.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch 1b9dbf7fe0 NVMe: Add revalidate_disk callback
This adds a callback to revalidate the disk and change its block size
and capacity if needed. Before, a user would have to remove + rescan
an entire device if they changed the logical block size using an NVMe
Format or other vendor specific command; now they can just run something
that issues the BLKRRPART IOCTL, like

 # hdparm -z /dev/nvmeXnY

This can also be used in response to the 1.2 Spec's Namespace Attribute
Change asynchronous event.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch 7be50e93fb NVMe: Fix nvmeq waitqueue entry initialization
We need to update the nvme queue's wait_queue_t entry during each
initialization since the nvme_thread may be ended and restarted when
the device is reset. If a device reset occurs during a large amount
of buffered IO, it would take a lot longer to complete the outstanding
requests due to the 1 second polling instead of waking up as completions
occur.

Fixes: b9afca3efb

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch b4ff9c8ddb NVMe: Translate NVMe status to errno
This returns a more appropriate error for the "capacity exceeded"
status. In case other NVMe statuses have a better errno, this patch adds
a convience function to translate an NVMe status code to an errno for
IO commands, defaulting to the current -EIO.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch 695a4fe79f NVMe: Fix SG_IO status values
We've only been setting the sg_io_hdr status values on SCSI commands
that require an nvme command to complete the translation. The fields
in the struct are output parameters, so we have to set them, otherwise
user space will see whatever was in memory from before. In the case of
compat SG_IO, this would reveal kernel memory. This fixes the issue by
initializing the sg_io_hdr with successful status.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Acked-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch e179729a82 NVMe: Remove duplicate compat SG_IO code
We can return -ENOIOCTLCMD and the ioctl will be handled by
fs/compat_ioctl.c instead. This removes a lot of duplicate code in the
nvme driver.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Keith Busch a96d4f5c2d NVMe: Reference count pci device
If an nvme device is removed but user space has an open reference,
the nvme driver would have been holding an invalid reference to its pci
device. You may get a general protection fault on x86 h/w when the driver
uses that reference in dma_map_sg(), as is done in nvme_map_user_pages()
from the IOCTL interface.

This patch fixes the fault by taking a reference on the pci device and
holding it even after device removal until all opens on the nvme device
are closed.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:09 -07:00
Andreea-Cristina Bernat 062261be4e nvme: Replace rcu_assign_pointer() with RCU_INIT_POINTER()
The use of "rcu_assign_pointer()" is NULLing out the pointer.
According to RCU_INIT_POINTER()'s block comment:
"1.   This use of RCU_INIT_POINTER() is NULLing out the pointer"
it is better to use it instead of rcu_assign_pointer() because it has a
smaller overhead.

The following Coccinelle semantic patch was used:
@@
@@

- rcu_assign_pointer
+ RCU_INIT_POINTER
  (..., NULL)

Signed-off-by: Andreea-Cristina Bernat <bernat.ada@gmail.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Sam Bradshaw 5905535610 NVMe: Correctly handle IOCTL_SUBMIT_IO when cpus > online queues
nvme_submit_io_cmd() uses smp_processor_id() to pick an IO queue index.
This patch fixes the case where there are more cpus from which the ioctl
call can originate than online queues, which can happen when a device
supports or was allocated fewer interrupt vectors than exist cpu cores.

Thanks to Keith Busch for the implementation suggestion.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch 302c6727e5 NVMe: Fix filesystem sync deadlock on removal
This changes the order of deleting the gendisks so it happens after the
nvme IO queues are freed. If a device is removed while a filesystem has
associated dirty data, the removal will wait on these to complete before
proceeding from del_gendisk, which could have caused deadlock before.

The implication of this is that an orderly removal of a responsive
device won't necessarily wait for dirty data to be written, but we are
not guaranteed the device is even going to respond at this point either.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch f435c2825b NVMe: Call nvme_free_queue directly
Rather than relying on call_rcu, this patch directly frees the
nvme_queue's memory after ensuring no readers exist. Some arch specific
dma_free_coherent implementations may not be called from a call_rcu's
soft interrupt context, hence the change.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reported-by: Matthew Minter <matthew_minter@xyratex.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Dan McLeran 2484f40780 NVMe: Add shutdown timeout as module parameter.
The current implementation hard-codes the shutdown timeout to 2 seconds.
Some devices take longer than this to complete a normal shutdown.
Changing the shutdown timeout to a module parameter with a default
timeout of 5 seconds.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch 7c1b245038 NVMe: Skip orderly shutdown on failed devices
Rather than skipping shutdown only for devices that have been removed,
skip the orderly shutdown on failed devices to avoid the long timeout
handling that inevitably happens when deleting queues on such a device.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch a67394790a NVMe: Whitespace fixes
Fixing tabs inadvertently converted to spaces.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:08 -07:00
Keith Busch c81f49758a NVMe: Use pci_stop_and_remove_bus_device_locked()
Race conditions are theoretically possible between the NVMe PCI device
removal and the generic PCI bus rescan and device removal that can be
triggered via sysfs.

To avoid those race conditions make the NVMe code use
pci_stop_and_remove_bus_device_locked().

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Keith Busch badc34d415 NVMe: Handling devices incapable of I/O
This is a minor refactor for handling devices that are incapable of IO.
The driver previously used special error codes to know that IO queues
are unavailable, but we have an online queue count now.

This also fixes an issue where the driver successfully sets the queue
count, but either is unable to allocate an IO queue or the device can't
create one for some reason.

If the driver can successfully enable the device and get responses to
admin commands, the driver will bring up a character device for managment
but not create block devices.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Dan McLeran 01079522f9 NVMe: Change nvme_enable_ctrl to set EN and manage CC thru ctrl_config.
Change the behavior of nvme_enable_ctrl to set EN.
Clear CC.SH for both nvme_enable_ctrl and nvme_disable_ctrl.
Remove reading of the CC register and manage the state in
dev->ctrl_config.

Signed-off-by: Dan McLeran <daniel.mcleran@intel.com>
[removed an unwanted write to CC]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Keith Busch 1d09062460 NVMe: Mismatched host/device page size support
Adds support for devices with max page size smaller than the host's.
In the case we encounter such a host/device combination, the driver will
split a page into as many PRP entries as necessary for the device's page
size capabilities. If the device's reported minimum page size is greater
than the host's, the driver will not attempt to enable the device and
return an error instead.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Keith Busch 6fccf9383b NVMe: Async event request
Submits NVMe asynchronous event requests, one event up to the controller
maximum or number of possible different event types (8), whichever is
smaller. Events successfully returned by the controller are logged.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Joe Perches 4d51abf9bc block: Use dma_zalloc_coherent
Use the zeroing function instead of dma_alloc_coherent & memset(,0,)

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-11-04 13:17:07 -07:00
Greg Kroah-Hartman a8a93c6f99 Merge branch 'platform/remove_owner' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux into driver-core-next
Remove all .owner fields from platform drivers
2014-11-03 19:53:56 -08:00
Linus Torvalds ce1928da84 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
 "There is a GFP flag fix from Mike Christie, an error code fix from
  Jan, and fixes for two unnecessary allocations (kmalloc and workqueue)
  from Ilya.  All are well tested.

  Ilya has one other fix on the way but it didn't get tested in time"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  libceph: eliminate unnecessary allocation in process_one_ticket()
  rbd: Fix error recovery in rbd_obj_read_sync()
  libceph: use memalloc flags for net IO
  rbd: use a single workqueue for all devices
2014-11-03 15:04:26 -08:00