How to detect failing storage LUN on Linux – multipath

detect-failing-lun-on-linux-failing-scsi-detection

If you login to server and after running dmesg – show kernel log command you get tons of messages like:

# dmesg

 

end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 0
sd 0:0:0:2: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 0
sd 0:0:0:2: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0x00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0x00010000
end_request: I/O error, dev sda, sector 0

 

In /var/log/messages there are also log messages present like:

# cat /var/log/messages
...

 

Apr 13 09:45:49 sl02785 kernel: end_request: I/O error, dev sda, sector 0
Apr 13 09:45:49 sl02785 kernel: klogd 1.4.1, ———- state change ———-
Apr 13 09:45:50 sl02785 kernel: sd 0:0:0:1: SCSI error: return code = 0x00010000
Apr 13 09:45:50 sl02785 kernel: end_request: I/O error, dev sdb, sector 0
Apr 13 09:45:55 sl02785 kernel: sd 0:0:0:2: SCSI error: return code = 0x00010000
Apr 13 09:45:55 sl02785 kernel: end_request: I/O error, dev sdc, sector 0

 

 

This is a sure sign something is wrong with SCSI hard drives or SCSI controller, to further debug the situation you can use the multipath command, i.e.:


multipath -ll | grep -i failed
_ 0:0:0:2 sdc 8:32 [failed][faulty]
_ 0:0:0:1 sdb 8:16 [failed][faulty]
_ 0:0:0:0 sda 8:0 [failed][faulty]

 

As you can see all 3 drives merged (sdc, sdb and sda) to show up on 1 physical drive via the remote network connectedLUN to the server is showing as faulty. This is a clear sign something is either wrong with 3 hard drive members of LUN – (Logical Unit Number) (which is less probable)  or most likely there is problem with the LUN network  SCSI controller. It is common error that LUN SCSI controller optics cable gets dirty and needs a physical clean up to solve it.

In case you don't know what is storage LUN? – In computer storage, a logical unit number, or LUN, is a number used to identify a logical unit, which is a device addressed by the SCSI protocol or protocols which encapsulate SCSI, such as Fibre Channel or iSCSI. A LUN may be used with any device which supports read/write operations, such as a tape drive, but is most often used to refer to a logical disk as created on a SAN. Though not technically correct, the term "LUN" is often also used to refer to the logical disk itself.

What LUN's does is very similar to Linux's Software LVM (Logical Volume Manager).

Share this on:

More helpful Articles

Download PDFDownload PDF

Tags: , , , , , , , , , , , , , ,

Leave a Reply

CommentLuv badge