Saturday, January 9, 2016

SCSI driver Error Handling (EH) - how multipath behaves when a few paths of a device fail

Question?
It is a challenge even for the most seasoned system admin to explain how IO will behave when a few paths of a device fail but some paths are still active. What is your answer ….?
.
.
.
.
.
.
.
.
.
.
.
.  
.
.
.
.
.
.
.
.
.
.
Most people answer: IO will continue via the remaining active paths.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
And…….. that is wrong !!


Let me explain it with an extreme case: say one of the remote storage ports (rport) of the storage frame has failed. If the device had 2 active paths, it now has only one good path and one failed path.

As soon as the Linux kernel scsi (sc) driver detects a failed path, it quiesces the HBA (yes, the complete HBA) and waits for all outstanding IO to complete or time out [1]. The SCSI layer then activates the error handler. NO IO WILL BE SUBMITTED through that HBA until error recovery completes, even though only one path has failed and one good path is still available. This design is there to avoid any data corruption.
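To see this quiesce from user space, one option is to watch the state of each SCSI host while a path is failing. The sketch below is only an illustration: it assumes the kernel exposes the per-host `state` attribute under `/sys/class/scsi_host/` and that a host being recovered reports a state containing the word `recovery`. The function name and the `sysfs_root` parameter are my own, added so the code can be pointed at a test directory.

```python
from pathlib import Path


def hosts_in_recovery(sysfs_root="/sys"):
    """Return {host_name: state} for SCSI hosts whose state mentions 'recovery'.

    Assumption: each host has a readable file
    <sysfs_root>/class/scsi_host/<host>/state.
    """
    result = {}
    host_dir = Path(sysfs_root) / "class" / "scsi_host"
    if not host_dir.is_dir():
        return result  # no SCSI hosts visible (or not a Linux system)
    for host in sorted(host_dir.iterdir()):
        try:
            state = (host / "state").read_text().strip()
        except OSError:
            continue  # attribute missing or unreadable, skip this host
        if "recovery" in state:
            result[host.name] = state
    return result
```

Calling `hosts_in_recovery()` in a loop during a path failure should show the affected HBA while the error handler is running; an empty dictionary means no host is currently in recovery.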

Device Recovery Steps

1-      Abort the command after the scsi device timeout defined in /sys/block//device/timeout expires [1]
2-      Wait several seconds in the hope that the remote port comes back online (only if the device is a Fibre Channel device; not applicable to a plain SCSI device)
3-      Activate the error handler, which attempts the following operations in order (until one executes successfully or all options are exhausted):
a.       Reset the device
b.      Reset the bus
c.       Reset the host
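The escalation in step 3 can be pictured with a small toy model. This is not the kernel's implementation, just a sketch of "try each reset in order and stop at the first one that works". The names `STEPS`, `run_error_handler` and the `try_step` callback are hypothetical and exist only for this illustration.

```python
from typing import Callable, List, Tuple

# The three resets listed above, from least to most disruptive.
STEPS = ["device reset", "bus reset", "host reset"]


def run_error_handler(try_step: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Try each reset in order and stop at the first one that succeeds.

    try_step(step) is a caller-supplied stand-in for the real reset and
    returns True when that reset recovered the device.
    Returns ("recovered" or "offline", list of steps attempted).
    """
    attempted = []
    for step in STEPS:
        attempted.append(step)
        if try_step(step):
            return "recovered", attempted
    # Every option was exhausted, so the device ends up offline.
    return "offline", attempted
```

For example, `run_error_handler(lambda step: step == "bus reset")` stops after the second step and reports a recovery, while a callback that always returns `False` walks through all three resets and ends with `"offline"`.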

First, the scsi driver tries to reset the device and then the bus. If that is not successful, and the adapter firmware and device driver decide that the adapter has not completed full recovery, the adapter is hard reset. This means all paths of a disk via that adapter will be unavailable for a few moments, irrespective of whether they have failed or are healthy. A hard reset happens when the I/O is black-holed with a NOP response in the fabric.  Since IO has been frozen by the scsi driver, there is no chance of an IO request being dropped or of data corruption.

Case-1: If all of the above fail, the device is set offline. This means the complete device is not available via any path. It needs manual intervention: look at the system/storage logs, find the problem, fix it, scan the HBA to detect active paths, and bring the device back to the online/running state.
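The last two manual steps of Case-1 are usually done through sysfs. Here is a minimal sketch, assuming the standard `scan` attribute of the SCSI host and the `state` attribute of the SCSI device; the host and device names (`host0`, `sda`) are placeholders, and the helper names and the `sysfs_root` parameter are hypothetical. It needs root on a real system and should only be run after the underlying fault has been fixed.

```python
from pathlib import Path


def rescan_host(host: str, sysfs_root: str = "/sys") -> None:
    """Ask the HBA to rescan all channels, targets and LUNs ('- - -')."""
    scan_file = Path(sysfs_root) / "class" / "scsi_host" / host / "scan"
    scan_file.write_text("- - -\n")


def device_state(dev: str, sysfs_root: str = "/sys") -> str:
    """Return the SCSI device state, e.g. 'running' or 'offline'."""
    state_file = Path(sysfs_root) / "block" / dev / "device" / "state"
    return state_file.read_text().strip()


def set_running(dev: str, sysfs_root: str = "/sys") -> bool:
    """Move an offline device back to 'running'.

    Returns True if the state was changed, False if the device was not
    offline in the first place (nothing is written in that case).
    """
    if device_state(dev, sysfs_root) != "offline":
        return False
    state_file = Path(sysfs_root) / "block" / dev / "device" / "state"
    state_file.write_text("running\n")
    return True
```

A typical sequence would be `rescan_host("host0")` followed by `set_running("sda")` for each path device that is still offline. The check inside `set_running` is a deliberate choice so that healthy devices are never touched.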

Case-2: If recovery succeeds, the path check heuristics of multipath mark the dead paths as failed. IO to the device then continues via the remaining active path.


Reduce device recovery time

To reduce overall recovery time, upgrade the kernel (to version 2.6.18-371.6.1 or higher) and device_mapper (to version 0.4.7-63 or higher) to leverage time-related parameters such as:

1-      scsi driver Error Handling (EH) timeout – eh_timeout (from the default 10 seconds to 5 seconds)  [4]
2-      HBA port reset time - eh_deadline (from disabled/0 to 5 seconds) [5]
3-      Adapter reset time, e.g. Qlogic reset time [2]
Add the following to /etc/modprobe.conf and recreate the initrd
options qla2xxx ql2xextended_error_logging=1 qlport_down_retry=10 ql2xloginretrycount=10
4-      Multipath check_timeout  (reduce to 10 seconds from the default 60 seconds) [3]
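Before changing anything, it helps to check what the first two timers are currently set to. The sketch below assumes a kernel that exposes `timeout` and `eh_timeout` under the device directory in `/sys/block/` and `eh_deadline` under `/sys/class/scsi_host/`; older kernels may not have all of them, in which case the value comes back as `None`. The names `read_timers` and `set_eh_timers`, the `sysfs_root` parameter, and the example device and host (`sda`, `host0`) are hypothetical. Values written this way do not survive a reboot.

```python
from pathlib import Path
from typing import Dict, Optional


def _timer_files(dev: str, host: str, sysfs_root: str) -> Dict[str, Path]:
    root = Path(sysfs_root)
    return {
        "timeout": root / "block" / dev / "device" / "timeout",
        "eh_timeout": root / "block" / dev / "device" / "eh_timeout",
        "eh_deadline": root / "class" / "scsi_host" / host / "eh_deadline",
    }


def read_timers(dev: str, host: str, sysfs_root: str = "/sys") -> Dict[str, Optional[str]]:
    """Return the current timer values as strings, or None if not exposed."""
    values: Dict[str, Optional[str]] = {}
    for name, path in _timer_files(dev, host, sysfs_root).items():
        try:
            values[name] = path.read_text().strip()
        except OSError:
            values[name] = None  # attribute not present on this kernel
    return values


def set_eh_timers(dev: str, host: str, eh_timeout: int, eh_deadline: int,
                  sysfs_root: str = "/sys") -> None:
    """Write new eh_timeout and eh_deadline values in seconds (needs root)."""
    files = _timer_files(dev, host, sysfs_root)
    files["eh_timeout"].write_text(f"{eh_timeout}\n")
    files["eh_deadline"].write_text(f"{eh_deadline}\n")
```

With the values suggested in the list above, the call would be `set_eh_timers("sda", "host0", eh_timeout=5, eh_deadline=5)`, followed by `read_timers("sda", "host0")` to confirm the change took effect.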

Reference 

[3] /usr/share/doc/device-mapper-multipath-0.4.7/multipath.conf.annotated


Additional Reference

3-      Redhat KB  this and this


1 comment:

  1. Thanks for the very useful (and what is more important - very rare) explanation of that complex matter. Could you please explain what happens in the other case.
    Let's set SCSI timeout to 120 seconds. It means SCSI error handler will start only after the very first "unanswered" SCSI command exceeds 120 seconds. What happens to multipathing within those 120 seconds during which all the SCSI commands in the presumably failed path are queued? Suppose there are 8 paths to the SAN storage. One controller goes offline unexpectedly or there is another situation that pauses processing SCSI commands on the storage. So 4 paths in fact do not work properly: they just take commands, attempt to send them out and start EH timeout for each of them. What happens to other paths before EH starts and after it gets started?
