We have an exciting new series starting here at The Support Insider. Our senior support engineers sometimes need to deep dive into customer issues, digging into logs some of you might not even know exist. We thought we’d share some examples of these with you. You see the problem presented, the game plan unfold, and how our top escalation engineers investigate tough problems.

We begin today with Nathan Small (Twitter handle: vSphereStorage). Nathan is a Senior Escalation Engineer on the storage team in Global Support Services, and has been employed with VMware since May 2005. He has written and designed many storage technical presentations that have been delivered at various VMware sponsored events including VMUGs and VMworld conferences. He is recognized in the worldwide virtualization community as a storage and infrastructure guru.

History of the issue: This customer performed a firmware upgrade of their IBM SVC however after doing so all hosts in 1 cluster went to an unresponsive state in vCenter, and it was not possible to SSH to the servers. It was still possible to ping the affected hosts. The servers were rebooted to resolve the issue. A host in a different cluster also connected to the IBM SVC did not experience this same issue.



Log review:



The host vmris01-sc stopped logging after 5am. The /var/log/vmksummary file updates at the first minute of every hour stating the uptime of the host, build version of the vmkernel/service console, and the memory usage of the top 3 process:



Mar 1 04:01:13 vmris01-sc logger: (1330592473) hb: vmk loaded, 31076125.88, 31076117.975, 2, 208167, 208167, 136, sfcbd-153328, ftPerl-76692, vmware-h-73592

Mar 1 05:01:14 vmris01-sc logger: (1330596074) hb: vmk loaded, 31079726.01, 31079718.111, 2, 208167, 208167, 136, sfcbd-153336, ftPerl-76736, vmware-h-73188

Mar 1 08:20:13 vmris01-sc vmkhalt: (1330608013) Starting system…

Mar 1 08:20:25 vmris01-sc logger: (1330608025) loaded VMkernel



Since there was no update at 6:01, this means the server stopped logging at this point. This means one of two possibilities: 1. the server hung (due to hardware), or 2. the server was no longer able to write to /var/log. It is highly unlikely that the server hung since multiple hosts were affected simultaneously, so the only possibility is that these hosts were not able to write to /var/log.



This host observed an event beginning at 4:13am in the form of commands failing to complete to the target:



Mar 1 04:13:39 vmris01-sc vmkernel: 359:16:27:43.715 cpu14:5443)VMK_PCI: 739: device 065:00.0 capType 16 capIndex 76

Mar 1 04:29:34 vmris01-sc vmkernel: 359:16:43:38.593 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0a2f00) to NMP device “naa.60050768019102668000000000000328” failed on physical path “vmhba0:C0:T1:L1” H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.



Then we see the remote ports (array controller ports) timeout, specifically the second target (‘1’) on both HBAs:



Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.167 cpu15:4303)<3> rport-2:0-1: blocked FC remote port time out: saving binding

Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.253 cpu25:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding



This causes a host status of H:0x1, which requires us to failover:



Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x41000a0df680) to NMP device “naa.60050768019102668000000000000328” failed on physical path “vmhba0:C0:T1:L1” H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.

Mar 1 04:29:45 vmris01-sc vmkernel: 359:16:43:48.627 cpu30:4126)WARNING: NMP: nmp_DeviceRetryCommand: Device “naa.60050768019102668000000000000328”: awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.



Then failover actually occurs (from target 1 to target 0):



Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu24:4268)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L5 to vmhba1:C0:T0:L5 for device “naa.60050768019102668000000000000335”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu26:8602)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L1 to vmhba1:C0:T0:L1 for device “naa.60050768019102668000000000000328”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa.60050768019102668000000000000335” – issuing command 0x41000a15d380

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa.60050768019102668000000000000328” – issuing command 0x41000a0df680

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L2 to vmhba1:C0:T0:L2 for device “naa.6005076801910266800000000000032b”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu12:8604)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L0 to vmhba1:C0:T0:L0 for device “naa.6005076801910266800000000000031c”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu2:4327)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa.6005076801910266800000000000031c” – issuing command 0x41000a1ec240

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L4 to vmhba1:C0:T0:L4 for device “naa.60050768019102668000000000000334”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.629 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device “naa.60050768019102668000000000000335”

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L6 to vmhba1:C0:T0:L6 for device “naa.60050768019102668000000000000336”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu39:4135)NMP: nmp_CompleteRetryForPath: Retry world recovered device “naa.60050768019102668000000000000328”

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.630 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L7 to vmhba1:C0:T0:L7 for device “naa.60050768019102668000000000000337”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L8 to vmhba1:C0:T0:L8 for device “naa.6005076801910266800000000000033a”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.631 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L9 to vmhba1:C0:T0:L9 for device “naa.6005076801910266800000000000033b”.

Mar 1 04:29:46 vmris01-sc vmkernel: 359:16:43:48.638 cpu37:4361)NMP: nmp_CompleteRetryForPath: Retry world recovered device “naa.6005076801910266800000000000031c”



The statement of “Retry world recovered device” indicates that failover completed successfully.



After fa

ilover,Everything is running fine until 70 minutes later when we see the first target (‘0’) timeout to both HBAs:



Mar 1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.805 cpu0:4300)<3> rport-2:0-0: blocked FC remote port time out: saving binding

Mar 1 05:39:55 vmris01-sc vmkernel: 359:17:53:59.929 cpu30:4301)<3> rport-3:0-0: blocked FC remote port time out: saving binding



We then failover back from target 0 to target 1:



Mar 1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.228 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T0:L10 to vmhba0:C0:T1:L10 for device “naa.60050768019102668000000000000348”.

Mar 1 05:39:56 vmris01-sc vmkernel: 359:17:54:00.229 cpu36:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T0:L11 to vmhba1:C0:T1:L11 for device “naa.6005076801910266800000000000035b”.



This behavior is consistent with a firmware upgrade of an IBM SVC or any array type that supports non-disruptive upgrading of the controllers.



What is odd is that 20 minutes later we see more failover events, target 1 back to target 0, right before the server stopped logging to the vmkernel log:



Mar 1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.379 cpu38:4163)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba0:C0:T1:L10 to vmhba1:C0:T0:L10 for device “naa.60050768019102668000000000000348”.

Mar 1 06:00:37 vmris01-sc vmkernel: 359:18:14:41.381 cpu12:4303)vmw_psp_fixed: psp_fixedSelectPathToActivateInt: Changing active path from vmhba1:C0:T1:L11 to vmhba0:C0:T0:L11 for device “naa.6005076801910266800000000000035b”.

Mar 1 08:20:22 vmris01-sc vmkernel: pol 1

Mar 1 08:20:22 vmris01-sc vmkernel: 0:00:00:03.637 cpu0:4096)Chipset: 497: bus pci, id 4



We can see the timestamp jumps from 359 days to 0 days, which is essentially the hard reboot of this host.



It is odd that the logging suddenly stopped, however we would speculate that logging stopped because the ESX host was not able to communicate with its boot device. The only other possibility would be that /var/log ran out of space on all 3 host, which is highly unlikely. When reviewing the esxcfg-mpath output for this host, there is no local RAID controller, which means that these are likely servers that boot-from-SAN. A quick check of /etc/vmware/esx.conf reveals the following:



/boot/cosvmdk = “/vmfs/volumes/4d0f879c-ccb0a33e-91f7-842b2b5c4f54/esxconsole-4d0f86da-ab5c-79a0-83ee-842b2b5c4f54/esxconsole.vmdk”



# cat tmp/4d0f879c-ccb0a33e-91f7-842b2b5c4f54.vpa.txt

VMFS-3.33 file system spanning 1 partitions.

File system label (if any): Storage1 (26)

Mode: public

Capacity 30870077440 (29440 file blocks * 1048576), 21807235072 (20797 blocks) avail

UUID: 4d0f879c-ccb0a33e-91f7-842b2b5c4f54

Partitions spanned (on “lvm”):

naa.6005076801910266800000000000031c:5



This server is booting off the 5th partition of naa.6005076801910266800000000000031c



naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)

vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5

vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00

vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5

vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00



The output above confirms the host is boot-from-SAN.



Let’s review the server hardware (from dmidecode or smbios):



Manufacturer: Dell Inc.

Product Name: PowerEdge R910



BIOS Information

Vendor: Dell Inc.

Version: 1.2.0

Release Date: 06/22/2010



And the ESX version:



VMware ESX 4.0.0 build-208167



This BIOS version is very old which means it may have been susceptible to the IRQ remapping issue (KB 1030265 — http://kb.vmware.com/kb/1030265), however the ESX version is prior to 4.1 when this feature was introduced. This means that the reason the hosts stopped logging was more than likely related to the fact that they boot-from-SAN and all paths to the array were lost around 6am.



Let’s review another host:



Mar 1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)NMP: nmp_CompleteCommandForPath: Command 0x12 (0x41000a1a3e00) to NMP device “mpx.vmhba2:C0:T0:L0” failed on physical path “vmhba2:C0:T0:L0” H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Mar 1 06:04:04 vmris02-sc vmkernel: 435:15:06:06.885 cpu22:4118)ScsiDeviceIO: 747: Command 0x12 to device “mpx.vmhba2:C0:T0:L0” failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Mar 1 08:17:56 vmris02-sc vmkernel: 497: bus pci, id 3

Mar 1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:A (0) 01-03 trg 2 pol 1

Mar 1 08:17:56 vmris02-sc vmkernel: 0:00:00:03.634 cpu0:4096)Chipset: 507: 0:B (1) 01-14 trg 2 pol 1



This host stopped logging at 6:04 instead of 6:00, which is a difference of 4 minutes. Let’s review when the first remote ports disappeared to help correlate:



Mar 1 04:33:09 vmris02-sc vmkernel: 435:13:35:12.922 cpu33:4307)<3> rport-3:0-0: blocked FC remote port time out: saving binding

Mar 1 04:33:10 vmris02-sc vmkernel: 435:13:35:13.143 cpu24:4294)<3> rport-2:0-0: blocked FC remote port time out: saving binding



This is also off by 4 minutes, which means we can likely conclude that they actually occurred at the exact same time on both hosts, but these hosts are not all setup for NTP.



Finally, let’s review the last host to verify these hosts observed the issue identically. This host observed the same behavior, though the server time seems to be off by 9 minutes (lack of NTP):



Mar 1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.554 cpu18:4313)<3> rport-3:0-1: blocked FC remote port time out: saving binding

Mar 1 04:38:12 vmris03-sc vmkernel: 75:18:52:34.826 cpu12:4307)<3> rport-2:0-1: blocked FC remote port time out: saving binding



Mar 1 05:48:24 vmris03-sc vmkernel: 75:20:02:46.594 cpu6:4304)<3> rport-3:0-0: blocked FC remote port time out: saving binding

Mar 1 05:48:27 vmris03-sc vmkernel: 75:20:02:46.640 cpu18:4309)<3> rport-2:0-0: blocked FC remote port time out: saving binding



Needless to say, all hosts observed the issue at the exact same time, when taking into account the time drift is a direct result of a lack of NTP running. Root cause for why storage paths disappeared again at 6am is not known since the ESX hosts had nowhere to write logs, which suggests that all available paths to the storage were simultaneously lost at that moment.



The other host that logs were upload for was a host that did not experience the same outage as hosts in another cluster:



Mar 1 04:42:38 vmhost16-sc vmkernel: 167:15:49:25.581 cpu14:4295)<3> rport-2:0-0: blocked FC remote port time out: saving binding



Mar 1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.397 cpu12:4313)<3> rport-2:0-1: blocked FC remote port time out: saving binding

Mar 1 05:52:52 vmhost16-sc vmkernel: 167:16:59:37.815 cpu12:4298)<3> rport-3:0-0: blocked FC remote port time out: saving binding



We can see that only 3 out of 4 remote ports went offline compared to other hosts, however a grep for rport shows the port did go down 2.5 hours later:



Mar 1 07:18:23 vmhost16-sc vmkernel: 167:18:25:13.641 cpu12:4301)<3> rport-3:0-1: blocked FC remote port time out: saving binding



*Note: With default HBA driver logging it will only report when a remote disappears, not when it is visible again.



This host logs well past 6am, which likely means this host does not boot-from-SAN:



/boot/cosvmdk = “/vmfs/volumes/4d2b18aa-56df78bf-9bc2-842b2b5c502c/esxconsole-4d2b174e-97a3-3540-1b89-842b2b5c502c/esxconsole.vmdk”



# cat /tmp/4d2b18aa-56df78bf-9bc2-842b2b5c502c.vpa.txt

VMFS-3.33 file system spanning 1 partitions.

File system label (if any): Storage1

Mode: public

Capacity 30870077440 (29440 file blocks * 1048576), 21595422720 (20595 blocks) avail

UUID: 4d2b18aa-56df78bf-9bc2-842b2b5c502c

Partitions spanned (on “lvm”):

naa.600507680191026680000000000001a6:5



naa.600507680191026680000000000001a6 : IBM Fibre Channel Disk (naa.600507680191026680000000000001a6)

vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00

vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5



As we can see, this is actually a boot-from-SAN configuration, so why were 3 hosts affected but not this one?



Going back to the fact that this last host didn’t lose all paths to the storage (only 3 out of 4), we will focus on the fact that it eventually lost the last path 2.5 hours later, after which the host complained constantly about SCSI Reservation conflicts:



Mar 1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.646 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path “vmhba0:C0:T0:L9” state. Status=SCSI reservation conflict

Mar 1 08:03:13 vmhost16-sc vmkernel: 167:19:10:03.652 cpu45:4369)VMW_SATP_SVC: satp_svc_UpdatePath: Failed to update path “vmhba0:C0:T1:L9” state. Status=SCSI reservation conflict



Also, the qlogic driver is aborting commands every 5 minutes exactly, which is the frequency of the disk.patheval code (300 seconds be default):



Mar 1 08:28:33 vmhost16-sc vmkernel: 167:19:35:23.641 cpu3:20943)<6>qla2xxx 0000:41:00.0: scsi(3:1:72): Abort command issued — 1 32bf3c06 2002.

Mar 1 08:33:23 vmhost16-sc vmkernel: 167:19:40:13.637 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:67): Abort command issued — 1 32c01950 2002.

Mar 1 08:38:23 vmhost16-sc vmkernel: 167:19:45:13.641 cpu12:20943)<6>qla2xxx 0000:41:00.0: scsi(3:3:0): Abort command issued — 1 32c0d62a 2002.

Mar 1 08:43:23 vmhost16-sc vmkernel: 167:19:50:13.639 cpu9:20944)<6>qla2xxx 0000:41:00.0: scsi(3:5:74): Abort command issued — 1 32c18b9f 2002.

Mar 1 08:53:43 vmhost16-sc vmkernel: 167:20:00:33.641 cpu6:20944)<6>qla2xxx 0000:41:00.0: scsi(3:1:66): Abort command issued — 1 32c31fbf 2002.

Mar 1 08:58:23 vmhost16-sc vmkernel: 167:20:05:13.644 cpu4:20944)<6>qla2xxx 0000:41:00.0: scsi(3:3:69): Abort command issued — 1 32c3938d 2002.



Eventually we see paths deleted for various LUNs:



Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.923 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=20

Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=22

Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.

924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=23

Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.924 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=25

Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=26

Mar 1 10:24:28 vmhost16-sc vmkernel: 167:21:31:18.925 cpu18:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=27

Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=30

Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=35

Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.535 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=38

Mar 1 10:24:50 vmhost16-sc vmkernel: 167:21:31:40.536 cpu14:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=39

Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.494 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=40

Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.496 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=44

Mar 1 10:25:12 vmhost16-sc vmkernel: 167:21:32:02.497 cpu17:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=45

Mar 1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.208 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=18

Mar 1 10:25:52 vmhost16-sc vmkernel: 167:21:32:43.209 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=19

Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.350 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=51

Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=52

Mar 1 10:26:12 vmhost16-sc vmkernel: 167:21:33:03.351 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=53

Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.700 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=0

Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=1

Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=2

Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.701 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=3

Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.702 cpu12:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=4

Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=8

Mar 1 10:26:33 vmhost16-sc vmkernel: 167:21:33:23.903 cpu15:4149)ScsiPath: 3902: DeletePath : adapter=vmhba1, channel=0, target=0, lun=9



This means the connection to all LUNs were removed for vmhba1 to a particular array. There are still paths from vmhba1 to the other array:



eui.00173800123a001a : IBM Fibre Channel Disk (eui.00173800123a001a)

vmhba1:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:92

vmhba1:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:82

vmhba1:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:72

vmhba1:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:52

vmhba1:C0:T1:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:9a:b3 WWPN: 21:00:00:24:ff:00:9a:b3 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:42

vmhba0:C0:T4:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:80

vmhba0:C0:T3:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:50

vmhba0:C0:T2:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:40

vmhba0:C0:T6:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:90

vmhba0:C0:T5:L67 LUN:67 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:01:73:80:12:3a:00:00 WWPN: 50:01:73:80:12:3a:01:81



We can see all LUNs from the IBM array with WWNN 50:05:07:68:01:00:4c:c5 and 50:05:07:68:01:00:4b:00 has only 2 paths:



naa.600507680191026680000000000002ac : IBM Fibre Channel Disk (naa.600507680191026680000000000002ac)

vmhba0:C0:T1:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5

vmhba0:C0:T0:L30 LUN:30 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN:

21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00



naa.6005076801910266800000000000027c : IBM Fibre Channel Disk (naa.6005076801910266800000000000027c)

vmhba0:C0:T1:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5

vmhba0:C0:T0:L26 LUN:26 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00



naa.6005076801910266800000000000033f : IBM Fibre Channel Disk (naa.6005076801910266800000000000033f)

vmhba0:C0:T1:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5

vmhba0:C0:T0:L45 LUN:45 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00



naa.6005076801910266800000000000027d : IBM Fibre Channel Disk (naa.6005076801910266800000000000027d)

vmhba0:C0:T1:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:40:4c:c5

vmhba0:C0:T0:L27 LUN:27 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:99:e9 WWPN: 21:00:00:24:ff:00:99:e9 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:40:4b:00



Based on this evidence, this could be a zoning issue or simply LUN presentation problem (perhaps that HBA is not logged into the target when it should be?). Regardless of whether or not zoning changes were implemented as well, the focus should remain that all hosts in one cluster stopped logging to /var/log while a host in another cluster did not.



Let’s get back to the original issue, which is why did 3 servers behave differently than 1 server. To do that, let’s compare logged-in targets:



Affected host (vmris01) boot LUN targets:



naa.6005076801910266800000000000031c : IBM Fibre Channel Disk (naa.6005076801910266800000000000031c)

vmhba1:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:10:4c:c5

vmhba1:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:67 WWPN: 21:00:00:24:ff:00:a7:67 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:10:4b:00

vmhba0:C0:T0:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4c:c5 WWPN: 50:05:07:68:01:30:4c:c5

vmhba0:C0:T1:L0 LUN:0 state:active fc Adapter: WWNN: 20:00:00:24:ff:00:a7:90 WWPN: 21:00:00:24:ff:00:a7:90 Target: WWNN: 50:05:07:68:01:00:4b:00 WWPN: 50:05:07:68:01:30:4b:00



The HBAs were connected to the following:



50:05:07:68:01:10:4c:c5

50:05:07:68:01:10:4b:00

50:05:07:68:01:30:4c:c5

50:05:07:68:01:30:4b:00



The host that remained online the entire time is connected to the following ports:



50:05:07:68:01:40:4b:00

50:05:07:68:01:40:4c:c5



These are different ports that the host that stopped logging and required rebooting. Now let’s compare with the other 2 hosts that had to be rebooted (vmris02 and vmris03):



vmris02:



50:05:07:68:01:20:4b:00

50:05:07:68:01:20:4c:c5

50:05:07:68:01:40:4b:00

50:05:07:68:01:40:4c:c5



vmris03:



50:05:07:68:01:10:4c:c5

50:05:07:68:01:30:4b:00

50:05:07:68:01:30:4c:c5

50:05:07:68:01:10:4b:00



While vmris03 is logged into the same physical ports and vmris01, vmris02 is logged into the same ports as the host that remained online. The host that remained online is only communicating on 2 of the 4 ports however, so do they identically match?



vmhost2: (from /proc/scsi/qla2xxx/)



FC Port Information:

scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;

scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;



FC Port Information:

scsi-qla1-port-0=5005076801004cc5:5005076801204cc5:83031e:1;



vmris02: (from /proc/scsi/qla2xxx/)



FC Port Information:

scsi-qla0-port-0=5005076801004b00:5005076801404b00:25021b:81;

scsi-qla0-port-1=5005076801004cc5:5005076801404cc5:25021d:82;



FC Port Information:

scsi-qla1-port-0=5005076801004b00:5005076801204b00:83031b:81;

scsi-qla1-port-1=5005076801004cc5:5005076801204cc5:83031e:82;



Even though vmhost is missing one of the paths, we can see that the targets are the same. Since the hosts were logged into the same targets but had different results when it comes to logged out ports and being able to stay online, this may have to do with the amount of I/O that was in-flight from each host at the time. A host that was not issuing hardly any I/O at the time may have survived the outage because it did not issue any I/O, whereas the service console of the ESX hosts that were doing I/O at the time had a timeout on the local disk, resulting in the EXT3 file systems remounting as read-only. This would have resulted in an inability to write to /var/log. It would also cause some services to stop functioning, such as SSH, while other stacks in memory continue to operate (responds to ICMP ping requests).



The host that remained online (vmhost2) was only running 2 VMs at the time the logs were collected:



# cat /proc/vmware/vm/*/names

vmid=4371 pid=-1 cfgFile=”/vmfs/volumes/4d1a3ce4-d8b96856-0231-842b2b5c4f58/uhmc-riscxr/uhmc-riscxr.vmx” uuid=”42 38 00 37 fc 6e 04 38-a4 cb a9 22 98 f3 ce 9d” displayName=”uhmc-riscxr”

vmid=4376 pid=-1 cfgFile=”/vmfs/volumes/4d2f4d6a-a751b8d0-1109-842b2b5c4f58/uhmc-risrpt/uhmc-risrpt.vmx” uuid=”42 38 18 73 93 c6 d2 89-c5 87 e8 9e 24 5e 34 94″ displayName=”uhmc-risrpt”



A review of the vmkernel logs for that host do not show any VMs being powered on/off before, during, or after the event, so it is safe to assume these were the only VMs running on this host. We can only speculate what these VMs are designed for, however I would speculate that they are low I/O VMs and thus there were no or very few I/O’s issued at the time the event occurred. We would also surmise that hosts vmris01, vmris02, and vmris03 had well more than 2 VMs running and they were all doing I/O at the time the event occurred. This theory explains why vmhost02 remained online while other hosts did not. As far as why all paths were lost at the same moment, which forcing the service consoles EXT partitions to remount as read-only, this would be possible if the paths were unavailable for longer than the disk timeout period, which is around 30 seconds. The question then becomes, what occurred on the SVC at ~6am? Were both SVC’s hard rebooted after the firmware was successfully applied?

We’d love to hear your comments! Use the box below.