Stall error handler if attempting resets/aborts while an rport is blocked.
This avoids device offline scenarios due to errors in the error handler.
Background:
Although the transport is using the scsi_timed_out functionality to
restart the timeout if the rport is blocked, if the timeout has already
fired before the block occurs, the eh handler still runs and can take
the device offline. Ultimately, this window cannot be resolved without
significant work in the error handler thread. Christoph pointed out the
first level of these issues when he noted the poor error response handling
by the error thread.
We found, under heavy load and error testing, that the time window from
when scsi_times_out() adds the I/O to the queue to when the
scsi_error_handler gets around to servicing it can be several seconds long.
Such test conditions are highly unusual, but possible.
As a result, we're stalling the error handler in this race window so that
we can avoid the device_offline transitions.
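The stall can be pictured as a simple polling loop. What follows is a
minimal user-space sketch, not the fc transport's code: the toy_rport type,
the state enum, toy_unblock() and toy_stall_while_blocked() are all
hypothetical names, and the poll limit exists only so that the sketch
always terminates.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the rport state; not the transport's types. */
enum toy_port_state { TOY_PORT_ONLINE, TOY_PORT_BLOCKED };

struct toy_rport {
	enum toy_port_state state;
};

/* Simulated "unblock" event, usable as the wait callback below. */
static void toy_unblock(struct toy_rport *rport)
{
	rport->state = TOY_PORT_ONLINE;
}

/*
 * Hold off a reset/abort while the port is blocked. 'wait' is called once
 * per poll (a real handler would sleep there). Returns the number of polls
 * made, or -1 if max_polls was reached with the port still blocked.
 */
static int toy_stall_while_blocked(struct toy_rport *rport,
				   void (*wait)(struct toy_rport *),
				   int max_polls)
{
	int polls = 0;

	while (rport->state == TOY_PORT_BLOCKED) {
		if (polls >= max_polls)
			return -1;
		if (wait != NULL)
			wait(rport);
		polls++;
	}
	return polls;
}
```

A reset or abort routine would call something like
toy_stall_while_blocked() first and only proceed once the port is no
longer blocked.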
Signed-off-by: James Smart <James.Smart@emulex.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Misc Bug Fixes:
- Cap MBX_DOWN_LINK command timeout to 60 seconds
- Fix double free of ndlp object
- Don't free mbox structures on error. The completion handlers expect to do so.
- Clear host attention work items when going offline
- Fixed discovery issues in multi-initiator environments.
Signed-off-by: James Smart <James.Smart@emulex.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
This patch updates the fc transport for the following:
- Addition of a new attribute "system_hostname" which can be
used to set the fully qualified hostname that the fc_host
is attached to. The fc_host can then register this string
as the FDMI-based host name attribute.
Note: for NPIV, a fc_host could be associated with a system which
is not the local system.
- Add the inline function u64_to_wwn(), which is the inverse of the
existing wwn_to_u64() function.
- Slight reorg, just to keep dynamic attributes with each other, etc
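As an illustration of the inverse relationship between the two helpers,
here is a user-space sketch. The sketch_ names are hypothetical, and the
byte order (most significant byte first) is an assumption of this sketch,
not something stated above.

```c
#include <stdint.h>

/* Pack an 8-byte world wide name, most significant byte first, into a u64. */
static inline uint64_t sketch_wwn_to_u64(const uint8_t *wwn)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | wwn[i];
	return v;
}

/* Inverse: unpack a u64 into 8 bytes, most significant byte first. */
static inline void sketch_u64_to_wwn(uint64_t v, uint8_t *wwn)
{
	int i;

	for (i = 7; i >= 0; i--) {
		wwn[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}
```

Round-tripping a name through both functions should give back the
original eight bytes.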
Signed-off-by: James Smart <James.Smart@emulex.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Convert the pci_device_id-table of the megaraid_sas-driver to
the PCI_DEVICE-macro, to save some lines.
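For readers unfamiliar with this kind of conversion, the sketch below
mimics it in user space. The struct, the SKETCH_PCI_DEVICE macro (modeled
on the kernel's PCI_DEVICE, which sets vendor and device and wildcards the
subsystem IDs) and the ID values are placeholders for illustration, not
the driver's actual table.

```c
#include <stdint.h>

#define SKETCH_ANY_ID 0xffffffffu

/* Simplified stand-in for a PCI ID table entry. */
struct sketch_pci_id {
	uint32_t vendor, device;
	uint32_t subvendor, subdevice;
};

/* Match on vendor/device only; subsystem IDs are wildcards. */
#define SKETCH_PCI_DEVICE(vend, dev) \
	.vendor = (vend), .device = (dev), \
	.subvendor = SKETCH_ANY_ID, .subdevice = SKETCH_ANY_ID

/* Placeholder IDs, for illustration only. */
#define SKETCH_VENDOR 0x1234
#define SKETCH_DEV_A  0x0001
#define SKETCH_DEV_B  0x0002

/* Before: every field spelled out in each entry. */
static const struct sketch_pci_id table_long[] = {
	{ SKETCH_VENDOR, SKETCH_DEV_A, SKETCH_ANY_ID, SKETCH_ANY_ID },
	{ SKETCH_VENDOR, SKETCH_DEV_B, SKETCH_ANY_ID, SKETCH_ANY_ID },
	{ 0 }	/* terminating entry */
};

/* After: the macro fills in the wildcards. */
static const struct sketch_pci_id table_short[] = {
	{ SKETCH_PCI_DEVICE(SKETCH_VENDOR, SKETCH_DEV_A) },
	{ SKETCH_PCI_DEVICE(SKETCH_VENDOR, SKETCH_DEV_B) },
	{ 0 }	/* terminating entry */
};
```

Both tables describe the same set of devices; only the spelling differs.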
Signed-off-by: Henrik Kretzschmar <henne@nachtwindheim.de>
Acked-by: "Patro, Sumant" <Sumant.Patro@lsil.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Modify beginning string to be more readable. Remove one trailing newline.
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
kbuild includes this automatically these days.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Some targets may return slight variations of PQ and PDT to indicate
no LUN mapped. USB UFI setting PDT=0x1f but having reserved bits for
PQ is one example, and NetApp targets returning PQ=1 and PDT=0x1f is
another. Both instances seem like reasonable responses according to
SPC-3 and UFI specs.
The current scsi_probe_and_add_lun() code adds a scsi_device
for targets that return PQ=1 and PDT=0x1f. This causes LUNs of type
"UNKNOWN" to show up in /proc/scsi/scsi when no LUNs are mapped.
In addition, subsequent rescans fail to recognize LUNs that may be
added on the target, unless preceded by a write to the delete attribute
of the "UNKNOWN" LUN.
This patch addresses this problem by skipping over the scsi_add_lun()
when PQ=1,PDT=0x1f is encountered, and just returns
SCSI_SCAN_TARGET_PRESENT.
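The check can be sketched as follows. This is a stand-alone illustration
with hypothetical sketch_ names, not the code in scsi_probe_and_add_lun();
it relies only on the layout of byte 0 of standard INQUIRY data (PQ in
bits 7..5, PDT in bits 4..0) and covers only the PQ=1,PDT=0x1f case
described above.

```c
#include <stdint.h>

enum sketch_scan_result {
	SKETCH_LUN_PRESENT,	/* go on and add the device */
	SKETCH_TARGET_PRESENT	/* target answered, but no LUN is mapped */
};

/* Peripheral qualifier: bits 7..5 of INQUIRY byte 0. */
static inline unsigned int sketch_pq(uint8_t byte0)
{
	return byte0 >> 5;
}

/* Peripheral device type: bits 4..0 of INQUIRY byte 0. */
static inline unsigned int sketch_pdt(uint8_t byte0)
{
	return byte0 & 0x1f;
}

/* Skip adding a device when the target reports PQ=1 and PDT=0x1f. */
static enum sketch_scan_result sketch_classify(uint8_t byte0)
{
	if (sketch_pq(byte0) == 1 && sketch_pdt(byte0) == 0x1f)
		return SKETCH_TARGET_PRESENT;
	return SKETCH_LUN_PRESENT;
}
```

For example, a first INQUIRY byte of 0x3f decodes to PQ=1, PDT=0x1f and is
treated as "target present, no LUN here".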
Signed-off-by: Dave Wysochanski <davidw@netapp.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Received from Mark Salyzyn
If the adapter is in a blinkled (Firmware Assert) state when error recovery
timeout actions have been triggered, perform an adapter warm reset and
restart the initialization.
Signed-off-by: Mark Haverkamp <markh@osdl.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Received from Mark Salyzyn
The enclosed patch cleans up some code fragments and adds some paranoia
checks (for unproven causes of potential driver failures).
Signed-off-by: Mark Haverkamp <markh@osdl.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Received from Mark Salyzyn
If the adapter should be in a blinkled (Firmware Assert) state when the
driver loads, we will perform a warm restart of the Adapter Firmware to
see if we can rescue the adapter. A blinkled can be caused by some early
release motherboard BIOSes, transitory PCI bus problems on embedded
systems or non-x86 based architectures, transitory startup failures of
early release drives, or transitory hardware failures, some of which can
bite the adapter later at runtime. Future enhancements will include
recovery during runtime.
Fixed extra whitespace issue.
Signed-off-by: Mark Haverkamp <markh@osdl.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Received from Mark Salyzyn
This patch allows the FSACTL_SEND_LARGE_FIB, FSACTL_SENDFIB and
FSACTL_SEND_RAW_SRB ioctl calls into the aacraid driver to be
interruptible. This is only necessary if the adapter and/or the management
software has gone into some sort of misbehavior and the system is being
rebooted; it permits the user management software applications to
be killed relatively cleanly. The FIB queue resource is held out of the
free queue until the adapter finally, if ever, completes the command.
Signed-off-by: Mark Haverkamp <markh@osdl.org>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Attached is a patch that should limit a possible recursion that can
lead to a stack overflow like the following:
Kernel stack overflow.
CPU: 3 Not tainted
Process zfcperp0.0.d819
(pid: 13897, task: 000000003e0d8cc8, ksp: 000000003499dbb8)
Krnl PSW : 0404000180000000 000000000030f8b2 (get_device+0x12/0x48)
Krnl GPRS: 00000000135a1980 000000000030f758 000000003ed6c1e8 0000000000000005
0000000000000000 000000000044a780 000000003dbf7000 0000000034e15800
000000003621c048 070000003499c108 000000003499c1a0 000000003ed6c000
0000000040895000 00000000408ab630 000000003499c0a0 000000003499c0a0
Krnl Code: a7 fb ff e8 a7 19 00 00 b9 02 00 22 e3 e0 f0 98 00 24 a7 84
Call Trace:
([<000000004089edc2>] scsi_request_fn+0x13e/0x650 [scsi_mod])
[<00000000002c5ff4>] blk_run_queue+0xd4/0x1a4
[<000000004089ff8c>] scsi_queue_insert+0x22c/0x2a4 [scsi_mod]
[<000000004089779a>] scsi_dispatch_cmd+0x8a/0x3d0 [scsi_mod]
[<000000004089f1ec>] scsi_request_fn+0x568/0x650 [scsi_mod]
...
[<000000004089f1ec>] scsi_request_fn+0x568/0x650 [scsi_mod]
[<00000000002c5ff4>] blk_run_queue+0xd4/0x1a4
[<000000004089ff8c>] scsi_queue_insert+0x22c/0x2a4 [scsi_mod]
[<000000004089779a>] scsi_dispatch_cmd+0x8a/0x3d0 [scsi_mod]
[<000000004089f1ec>] scsi_request_fn+0x568/0x650 [scsi_mod]
[<00000000002c5ff4>] blk_run_queue+0xd4/0x1a4
[<000000004089fa9e>] scsi_run_host_queues+0x196/0x230 [scsi_mod]
[<00000000409eba28>] zfcp_erp_thread+0x2638/0x3080 [zfcp]
[<0000000000107462>] kernel_thread_starter+0x6/0xc
[<000000000010745c>] kernel_thread_starter+0x0/0xc
<0>Kernel panic - not syncing: Corrupt kernel stack, can't continue.
This stack overflow occurred during tests on s390 using zfcp.
Recursion depth for this panic was 19.
Usually recursion between blk_run_queue and a request_fn is avoided
using QUEUE_FLAG_REENTER. But this does not help if the scsi stack
tries to flush the starved_list of a scsi_host.
Limit recursion depth when flushing the starved_list
of a scsi_host.
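The general idea of bounding such a recursion can be sketched with a depth
counter. This is only an illustration under assumed names: sketch_queue,
sketch_run_queue() and SKETCH_MAX_DEPTH are hypothetical, and the sketch
does not claim to be how the patch itself limits the depth.

```c
#define SKETCH_MAX_DEPTH 4	/* hypothetical limit */

struct sketch_queue {
	int depth;	/* current nesting of sketch_run_queue() */
	int max_seen;	/* deepest nesting observed */
	int pending;	/* work items that each trigger a nested run */
	int deferred;	/* runs handed off instead of nested */
};

static void sketch_run_queue(struct sketch_queue *q)
{
	if (q->depth >= SKETCH_MAX_DEPTH) {
		/* Too deep: defer the run instead of recursing further. */
		q->deferred++;
		return;
	}

	q->depth++;
	if (q->depth > q->max_seen)
		q->max_seen = q->depth;

	if (q->pending > 0) {
		q->pending--;
		/* Stands in for request_fn -> requeue -> run queue again. */
		sketch_run_queue(q);
	}
	q->depth--;
}
```

With 19 pending items (the depth seen in the panic above), the nesting in
this sketch stops at SKETCH_MAX_DEPTH and the rest of the work is left for
a later, non-nested run.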
Signed-off-by: Andreas Herrmann <aherrman@de.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Reading the Intel VSC and AHCI, it seems that writing 0x302 is incorrect.
The only valid values are 4, 1 and 0. Writing 4 disables the
PHY.
Signed-off-by: Martin Hicks <mort@bork.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
pdc_adma was overlooked and broken by the irq-pio patch:
Only HSM_ST_LAST interrupts should be delivered to this LLDD.
Adding ATA_FLAG_PIO_POLLING to pdc_adma fixes the problem (temporarily),
before we convert the irq handler of pdc_adma to handle all interrupts.
Signed-off-by: Albert Lee <albertcc@tw.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Implement a dummy port which can be requested by setting the appropriate
bit in probe_ent->dummy_port_mask. The dummy port is used as a placeholder
for a stolen legacy port. This allows libata to guarantee that
index_of(ap) == ap->port_no == actual_device_port_no, and thus to
remove the error-prone ap->hard_port_no.
As it's used only when one port of a legacy controller is reserved by
some other entity (e.g. IDE), the focus is on keeping the added *code*
complexity at a minimum, so a dummy port allocates all libata core
resources and acts as a normal port. It just has all dummy port_ops.
This patch only implements the dummy port. The following patch will make
libata use it for stolen legacy ports.
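A rough user-space sketch of how such a mask keeps port indices stable is
given below. The sketch_port structure and both functions are hypothetical
illustrations, not libata code; a real dummy port would additionally be
given dummy port_ops.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified port descriptor. */
struct sketch_port {
	unsigned int port_no;
	bool dummy;
};

/* True if bit 'port_no' is set in the mask, i.e. the port is a placeholder. */
static inline bool sketch_port_is_dummy(unsigned long dummy_port_mask,
					unsigned int port_no)
{
	return ((dummy_port_mask >> port_no) & 1UL) != 0;
}

/*
 * Every slot gets a port, dummy or not, so the array index always equals
 * port_no, which equals the actual device port number.
 */
static void sketch_init_ports(struct sketch_port *ports, unsigned int n_ports,
			      unsigned long dummy_port_mask)
{
	unsigned int i;

	for (i = 0; i < n_ports; i++) {
		ports[i].port_no = i;
		ports[i].dummy = sketch_port_is_dummy(dummy_port_mask, i);
	}
}
```

If the primary legacy port is stolen (mask bit 0 set), the secondary port
still ends up at index 1 with port_no 1.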
Signed-off-by: Tejun Heo <htejun@gmail.com>
Kill host_set->next
Fix simplex support
Allow per platform setting of IDE legacy bases
Some of this can be tidied further later on, in particular all the
legacy port gunge belongs as a PCI quirk/PCI header decode to understand
the special legacy IDE rules in the PCI spec.
Longer term Jeff also wants to move the request_irq/free_irq out of core
which will make this even cleaner.
tj: folded in three followup patches - ata_piix-fix, broken-arch-fix
and fix-new-legacy-handling, and separated per-dev xfermask into
separate patch preceding this one. Folded in fixes are...
* ata_piix-fix: fix build failure due to host_set->next removal
* broken-arch-fix: add missing include/asm-*/libata-portmap.h
* fix-new-legacy-handling:
  * In ata_pci_init_legacy_port(), probe_num was incorrectly
    incremented during initialization of the secondary port and
    probe_ent->n_ports was incorrectly fixed to 1.
  * Both legacy ports ended up having the same hard_port_no.
  * When printing port information, both legacy ports printed
    the first irq.
Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Implement per-dev xfermask. libata used to determine xfermask
per-port - the fastest mode of the slowest device on the port. This
patch enables per-dev xfermask.
Original patch is from Alan Cox <alan@redhat.com>. The following
changes are made by me.
* simplex warning message is added
* remove disabled device handling code which is never invoked
(originally for choosing port-wide lowest PIO mode)
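The difference between the two schemes can be sketched with plain
bitmasks. This is a simplified illustration with hypothetical sketch_
names: it assumes one bit per transfer mode with higher bits meaning
faster modes, which glosses over how libata actually lays out PIO, MWDMA
and UDMA bits.

```c
/* Old scheme: one mask for the whole port, limited by every device on it. */
static unsigned int sketch_port_wide_mask(unsigned int port_mask,
					  const unsigned int *dev_masks,
					  int n_devs)
{
	unsigned int mask = port_mask;
	int i;

	for (i = 0; i < n_devs; i++)
		mask &= dev_masks[i];
	return mask;
}

/* New scheme: each device is limited only by the port and by itself. */
static unsigned int sketch_per_dev_mask(unsigned int port_mask,
					unsigned int dev_mask)
{
	return port_mask & dev_mask;
}

/* Index of the highest set bit (the fastest mode), or -1 if none is set. */
static int sketch_fastest_mode(unsigned int mask)
{
	int mode = -1;

	while (mask != 0) {
		mask >>= 1;
		mode++;
	}
	return mode;
}
```

With device masks 0x1f and 0x07 on a port that supports 0xff, the
port-wide scheme caps both devices at mode 2, while the per-dev scheme
lets the faster device use mode 4.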
Cc: Alan Cox <alan@redhat.com>
Signed-off-by: Tejun Heo <htejun@gmail.com>
s/ata_host_add/ata_port_add/
s/ata_host_init/ata_port_init/
libata naming got stuck in the middle of a Great Renaming:
ata_host -> ata_port
ata_host_set -> ata_host
To eliminate confusion, let's just give up for now, and simply ensure
that things are internally consistent.
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Signed-off-by: Tejun Heo <htejun@gmail.com>
Update ata_host_init() such that it only initializes SCSI host related
stuff and doesn't call into ata_port_init(), and rename it to
ata_port_init_shost().
Signed-off-by: Tejun Heo <htejun@gmail.com>
SCSI EH locks the door if sdev->locked is set. Sometimes the door lock
command fails continuously (e.g. when medium is not present) and, as
libata uses EH to acquire sense data, this easily creates a loop where
a failed lock door invokes EH and EH issues lock door on completion.
This patch clears sdev->locked on door lock failure to break this
loop. This problem has been spotted and diagnosed by Unicorn Chang
<uchang@tw.ibm.com>.
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Fix a sata debug print statement that still uses an old variable name.
Signed-off-by: Keith Owens <kaos@ocs.com.au>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Initial IRQ mask clearing is done by libata-core by freezing all ports
prior to requesting IRQ. Remove redundant IRQ clearing from
init_controller().
Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
The following patch enhances libata to allow SAS device drivers
to utilize libata to talk to SATA devices. It introduces some
new APIs which allow libata to be used without allocating a
virtual scsi host.
New APIs:
ata_sas_port_alloc - Allocate an ata_port
ata_sas_port_init - Initialize an ata_port (probe device, etc)
ata_sas_port_destroy - Free an ata_port allocated by ata_sas_port_alloc
ata_sas_slave_configure - configure scsi device
ata_sas_queuecmd - queue a scsi command, similar to ata_scsi_queuecmd
These new APIs can be used either directly by a SAS LLDD or could be used
by the SAS transport class.
Possible usage for a SAS LLDD would be:
scsi_scan_host
  target_alloc
    ata_sas_port_alloc
  slave_alloc
    ata_sas_port_init
  slave_configure
    ata_sas_slave_configure
Commands received by the LLDD for SATA devices would call ata_sas_queuecmd.
Device teardown would occur with:
slave_destroy
  port_disable
target_destroy
  ata_sas_port_destroy
Signed-off-by: Brian King <brking@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Move ata_probe_ent_alloc to libata-core. It will also be used by
future SAS/SATA integration patches.
Signed-off-by: Brian King <brking@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Separate out the ata_port initialization from ata_host_init
so that it can be used in future SAS patches.
Signed-off-by: Brian King <brking@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
A recent driver base commit,
3e95637a48820ff8bedb33e6439def96ccff1de5,
caused the bus name to be added to dev_printk, so our short SCSI inquiry
messages now print like this:
scsiscsi 2:0:0:0: Direct access IBM-ESXS ST973401SS B519 PQ: 0 ANSI: 5
Just remove the "scsi" from the sdev_printk to compensate.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
- Replace scsi_device_types array API with scsi_device_type function API.
Gets rid of a lot of common code, as well as being easier to use.
- Add the new device types in SPC4 r05a, and rename some of the older ones.
- Reformat the printing of inquiry data; now fits on one line and
includes PQ.
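To show why a function API is easier to use than a bare array, here is a
small user-space sketch. sketch_device_type() and its shortened table are
hypothetical, and the returned strings are illustrative rather than the
exact ones the SCSI layer prints; the point is that the bounds check and
the special values live in one place instead of in every caller.

```c
#include <stddef.h>

/* Shortened table: the first few peripheral device types only. */
static const char *const sketch_types[] = {
	"Direct-Access",	/* 0x00 */
	"Sequential-Access",	/* 0x01 */
	"Printer",		/* 0x02 */
	"Processor",		/* 0x03 */
};

/* Map a peripheral device type to a name, with range checking built in. */
static const char *sketch_device_type(unsigned int type)
{
	if (type == 0x1e)
		return "Well-known LUN";
	if (type == 0x1f)
		return "No Device";
	if (type >= sizeof(sketch_types) / sizeof(sketch_types[0]))
		return "Unknown";
	return sketch_types[type];
}
```

Callers can pass any type value without first checking it against the
table size.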
I think I've addressed all the feedback from the previous versions. My
current test box prints:
scsi 2:0:1:0: Direct access HP 18.2G ATLAS10K3_18_SCA HP05 PQ: 0 ANSI: 2
Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Fix up a logic error in the checking for valid sense data.
Signed-off-by: Brian King <brking@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
The ipr driver currently translates adapter recovered errors
to DID_ERROR. This patch fixes this to translate these
errors to success instead.
Signed-off-by: Brian King <brking@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Add definitions for some SAS error codes that can be
logged by ipr SAS adapters.
Signed-off-by: Brian King <brking@us.ibm.com>
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>