Wednesday, June 11, 2014

dmsetup and multipath did not resume multipath device paths after a LUN failure

Problem

> System logs show that all paths of a multipath LUN failed at some point, for an unknown reason.
> After some time all paths were available and accessible again. The commands below, run against each path, show each device path as running.

#  cat /sys/block/sdyg/device/state
    running
#  cat /sys/block/sdabm/device/state
    running
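
> With eight paths, checking each state file by hand is tedious. A minimal sketch that loops over the path names follows; the function name print_path_states and the SYSFS_ROOT override are hypothetical and not part of the original procedure.

```shell
#!/bin/sh
# Print "<device> <state>" for each path name given as an argument.
# SYSFS_ROOT is a hypothetical override (useful for dry runs); defaults to /sys.
print_path_states() {
    root="${SYSFS_ROOT:-/sys}"
    for dev in "$@"; do
        state_file="$root/block/$dev/device/state"
        if [ -r "$state_file" ]; then
            printf '%s %s\n' "$dev" "$(cat "$state_file")"
        else
            # No state file: the device name is wrong or the path is gone.
            printf '%s unknown\n' "$dev"
        fi
    done
}

# Example:
#   print_path_states sdabm sdadc sdyg sdzb sdach sdzw sdaar sdadx
```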

> The multipath device was used by Oracle ASM. Although all paths were now available, multipath was still showing all paths as failed.

# multipath -l a_multipath_device_01
a_multipath_device_01 (51230050123000b1230001e000212120000) dm-158 HP,HSV210
[size=250G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:1:62  sdabm 134:576 [failed][undef]
 \_ 1:0:3:62  sdadc 8:992   [failed][undef]
 \_ 0:0:1:62  sdyg  129:512 [failed][undef]
 \_ 0:0:2:62  sdzb  130:592 [failed][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:2:62  sdach 135:656 [failed][undef]
 \_ 0:0:3:62  sdzw  131:672 [failed][undef]
 \_ 0:0:4:62  sdaar 132:752 [failed][undef]
 \_ 1:0:4:62  sdadx 66:816  [failed][undef]


Observation

> dmsetup status shows all paths as failed (F next to each path), even though all paths had returned to the running state after the failure.

# dmsetup status a_multipath_device_01
0 524288000 multipath 2 3 0 0 2 1 E 0 4 0 134:576 F 3377 8:992 F 3377 129:512 F 3378 130:592 F 3378 E 0 4 0 135:656 F 3377 131:672 F 3377 132:752 F 3377 66:816 F 3378
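
> The status line is hard to read by eye. A rough sketch that lists the major:minor numbers in a given state (F or A) follows; the helper name list_paths_by_state is hypothetical, and it assumes each path appears as "major:minor" immediately followed by its state letter, as in the output above.

```shell
#!/bin/sh
# Read a "dmsetup status" line on stdin and print the major:minor of every
# path whose state letter equals $1 (A = active, F = failed).
list_paths_by_state() {
    awk -v want="$1" '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+:[0-9]+$/ && $(i + 1) == want)
                print $i
    }'
}

# Example:
#   dmsetup status a_multipath_device_01 | list_paths_by_state F
```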

>> dmsetup info still looks good!

# dmsetup info a_multipath_device_01
Name:              a_multipath_device_01
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      2039343
Major, minor:      253, 158
Number of targets: 1
UUID: mpath-51230050123000b1230001e000212120000

>> The multipath device a_multipath_device_01 is not accessible by the application, and system logs show errors whenever a_multipath_device_01 is accessed.


Solution

> The command below fails with the message "device map in use"

# multipath -f a_multipath_device_01
 device map in use

> A dd read on each path succeeds, which means each path is good

# dd if=/dev/sdabm of=/dev/null  # no error
# dd if=/dev/sdzw of=/dev/null    # no error
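
> The same read test can be repeated over every path with a small wrapper. This is only a sketch: check_path_read is a hypothetical helper name, and bs=1M assumes GNU dd. It reads the whole device, as the commands above do.

```shell
#!/bin/sh
# Read the given device (or file) end to end and report the result.
check_path_read() {
    dev="$1"
    if dd if="$dev" of=/dev/null bs=1M 2>/dev/null; then
        echo "$dev: read OK"
    else
        echo "$dev: read FAILED"
        return 1
    fi
}

# Example:
#   for p in sdabm sdadc sdyg sdzb sdach sdzw sdaar sdadx; do
#       check_path_read "/dev/$p"
#   done
```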

> Generate the table of a working device (dmsetup table prints the loadable table; dmsetup status prints runtime state only)

# dmsetup table  another_mpath_device_02 > /tmp/table.txt

> Replace the major:minor numbers in the file with the major:minor numbers of the faulty device's paths

# vi /tmp/table.txt
# cat /tmp/table.txt
0 524288000 multipath 1 queue_if_no_path 0 2 1 round-robin 0 4 1 134:576 1000 8:992 1000 129:512 1000 130:592 1000 round-robin 0 4 1 135:656 1000 131:672 1000 132:752 1000 66:816 1000
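
> Hand-editing the table is error-prone, so it may be worth checking the file before loading it. The sketch below compares the sector count and the set of major:minor numbers in the edited table against the faulty device's status line; the function names extract_devs and table_matches_status are hypothetical.

```shell
#!/bin/sh
# Print every major:minor token found on stdin, one per line, sorted.
extract_devs() {
    tr ' ' '\n' | grep -E '^[0-9]+:[0-9]+$' | sort
}

# Succeed only if the table line ($2) has the same length in sectors and the
# same set of path devices as the status line ($1).
table_matches_status() {
    status_line="$1"
    table_line="$2"
    status_size=$(printf '%s\n' "$status_line" | awk '{print $2}')
    table_size=$(printf '%s\n' "$table_line" | awk '{print $2}')
    [ "$status_size" = "$table_size" ] || return 1
    [ "$(printf '%s\n' "$status_line" | extract_devs)" = \
      "$(printf '%s\n' "$table_line" | extract_devs)" ]
}

# Example:
#   table_matches_status "$(dmsetup status a_multipath_device_01)" \
#       "$(cat /tmp/table.txt)" && echo "table looks consistent"
```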

> Load the table for the device from the file

# dmsetup reload  a_multipath_device_01 /tmp/table.txt

>> dmsetup status still shows an error

# dmsetup status a_multipath_device_01
0 524288000 error

> dmsetup info shows the new table as INACTIVE: reload only stages it, and it does not take effect until the device is resumed

# dmsetup info a_multipath_device_01
Name:              a_multipath_device_01
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE & INACTIVE
Open count:        1
Event number:      2039343
Major, minor:      253, 158
Number of targets: 1
UUID: mpath-51230050123000b1230001e000212120000

> Resume the device to make its new table active

# dmsetup resume  a_multipath_device_01

> The device is now good to use!

# dmsetup status a_multipath_device_01
0 524288000 multipath 2 0 0 0 2 1 E 0 4 0 134:576 A 0 8:992 A 0 129:512 A 0 130:592 A 0 E 0 4 0 135:656 A 0 131:672 A 0 132:752 A 0 66:816 A 0

# multipath -l a_multipath_device_01
 [size=250G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:1:62  sdabm 134:576 [active][undef]
 \_ 1:0:3:62  sdadc 8:992   [active][undef]
 \_ 0:0:1:62  sdyg  129:512 [active][undef]
 \_ 0:0:2:62  sdzb  130:592 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:2:62  sdach 135:656 [active][undef]
 \_ 0:0:3:62  sdzw  131:672 [active][undef]
 \_ 0:0:4:62  sdaar 132:752 [active][undef]
 \_ 1:0:4:62  sdadx 66:816  [active][undef]

# dmsetup info a_multipath_device_01
Name:              a_multipath_device_01
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        1
Event number:      2039343
Major, minor:      253, 158
Number of targets: 1
UUID: mpath-51230050123000b1230001e000212120000
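
> The reload and resume steps above can be collected into one function. This is a sketch under assumptions, not a tested recovery tool: the name reload_and_resume is hypothetical, it must run as root, and it uses only the dmsetup reload, resume and status subcommands shown earlier.

```shell
#!/bin/sh
# Stage a new table for a device-mapper device, activate it, and print the
# resulting status. $1 = map name, $2 = file containing the table.
reload_and_resume() {
    map="$1"
    table_file="$2"
    if [ ! -r "$table_file" ]; then
        echo "table file not readable: $table_file" >&2
        return 1
    fi
    # reload places the table in the INACTIVE slot; the live table is untouched.
    dmsetup reload "$map" "$table_file" || return 1
    # resume swaps the inactive table in and makes it LIVE.
    dmsetup resume "$map" || return 1
    dmsetup status "$map"
}

# Example:
#   reload_and_resume a_multipath_device_01 /tmp/table.txt
```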


Ideally, once all paths of a device are available again, dmsetup should change the F (failed) status to A (active) and multipath should show all paths as active. Why did this not happen automatically?

1 comment:

  1. Later, I saw reads using dd fail on exactly the same block, regardless of which path of the disk or which node the command was run on. Hence we replaced the LUN with a new LUN, and that resolved the problem.

    dd if=/dev/ of=/dev/null
