April 2024
M T W T F S S
1234567
891011121314
15161718192021
22232425262728
2930  

Categories

April 2024
M T W T F S S
1234567
891011121314
15161718192021
22232425262728
2930  

Detect failing storage LUN on Linux – multipath

f you login to server and after running dmesg – show kernel log command you get tons of messages like:

# dmesg

end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0×00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0×00010000
end_request: I/O error, dev sda, sector 0
sd 0:0:0:2: SCSI error: return code = 0×00010000
end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0×00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0×00010000
end_request: I/O error, dev sda, sector 0
sd 0:0:0:2: SCSI error: return code = 0×00010000
end_request: I/O error, dev sdc, sector 0
sd 0:0:0:1: SCSI error: return code = 0×00010000
end_request: I/O error, dev sdb, sector 0
sd 0:0:0:0: SCSI error: return code = 0×00010000
end_request: I/O error, dev sda, sector 0

In /var/log/messages there are also log messages present like:

# cat /var/log/messages
...

Apr 13 09:45:49 sl02785 kernel: end_request: I/O error, dev sda, sector 0
Apr 13 09:45:49 sl02785 kernel: klogd 1.4.1, ———- state change ———-
Apr 13 09:45:50 sl02785 kernel: sd 0:0:0:1: SCSI error: return code = 0×00010000
Apr 13 09:45:50 sl02785 kernel: end_request: I/O error, dev sdb, sector 0
Apr 13 09:45:55 sl02785 kernel: sd 0:0:0:2: SCSI error: return code = 0×00010000
Apr 13 09:45:55 sl02785 kernel: end_request: I/O error, dev sdc, sector 0

This is a sure sign something is wrong with SCSI hard drives orSCSI controller, to further debug the situation you can use themultipath command, i.e.:

multipath -ll | grep -i failed
_ 0:0:0:2 sdc 8:32 [failed][faulty]
_ 0:0:0:1 sdb 8:16 [failed][faulty]
_ 0:0:0:0 sda 8:0 [failed][faulty]

As you can see all 3 drives merged (sdc, sdb and sda) to show up on 1 physical drive via the remote network connected  LUN to the server is showing as faulty. This is a clear sign something is either wrong with 3 hard drive members of LUN – (Logical Unit Number) (which is less probable)  or most likely there is problem with the LUN network  SCSI controller. It is common error that LUN SCSI controller optics cable gets dirty and needs a physical clean up to solve it.

In case you don’t know what is storage LUN? – In computer storage, a logical unit number, or LUN, is a number used to identify a logical unit, which is a device addressed by the SCSI protocol or protocols which encapsulate SCSI, such as Fibre Channel or iSCSI. A LUN may be used with any device which supports read/write operations, such as a tape drive, but is most often used to refer to a logical disk as created on a SAN. Though not technically correct, the term “LUN” is often also used to refer to the logical disk itself.

What LUN’s does is very similar to Linux’s Software LVM (Logical Volume Manager).

Leave a Reply

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>