Hi, I seem to be having a strange issue with an NFS v4 share mounted on ESXi servers. The share is mounted on 3 cluster nodes. I am using Rockstor 4.6.1-0 on openSUSE Tumbleweed (Linux 6.6.11-1-default), with a data store on a Dell R730XD server with 24 SAS drives in a RAID5 configuration (yes, I am aware it’s not fully supported; that’s why I installed Tumbleweed). One of the three servers starts screaming:
2024-01-17T17:42:30.164Z Wa(180) vmkwarning: cpu36:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #15 retry: IO was aborted
and then the NFS datastore disconnects. It usually comes back after a few minutes, but sometimes I have to reboot the ESXi host. More from the ESXi log:
2024-01-17T17:35:59.692Z Wa(180) vmkwarning: cpu34:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #1 retry: IO was aborted
2024-01-17T17:36:09.412Z In(182) vmkernel: cpu36:2098963 opID=d4471708)World: 12321: VC opID 4892611e-c3-027c maps to vmkernel opID d4471708
2024-01-17T17:36:09.412Z Wa(180) vmkwarning: cpu36:2098963 opID=d4471708)WARNING: NFS41: NFS41FileOpGetFileAttributes:4559: Failed to get file attributes for object 0x430bb72199a0 name ea918613-cc1054d2-0000-000000000000: IO was aborted
2024-01-17T17:36:09.704Z Wa(180) vmkwarning: cpu34:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #2 retry: IO was aborted
2024-01-17T17:36:19.717Z Wa(180) vmkwarning: cpu34:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #3 retry: IO was aborted
2024-01-17T17:36:29.729Z Wa(180) vmkwarning: cpu34:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #4 retry: IO was aborted
2024-01-17T17:36:33.728Z In(182) vmkernel: cpu45:2097389)NetqueueBal: 4422: vmnic5: Cleaned up a NetQ RSS engine, 0 left
2024-01-17T17:36:38.728Z In(182) vmkernel: cpu45:2097389)NetqueueBal: 4422: vmnic4: Cleaned up a NetQ RSS engine, 0 left
2024-01-17T17:36:39.741Z Wa(180) vmkwarning: cpu34:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #5 retry: IO was aborted
2024-01-17T17:36:45.660Z Wa(180) vmkwarning: cpu8:2098960)WARNING: NFS41: NFS41FileOpGetFileAttributes:4559: Failed to get file attributes for object 0x430bb72199a0 name ea918613-cc1054d2-0000-000000000000: IO was aborted
2024-01-17T17:36:47.281Z Wa(180) vmkwarning: cpu5:2097565)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb720c930 (fh 0x4319ea417e88) failed on #0 retry: IO was aborted
2024-01-17T17:36:49.752Z Wa(180) vmkwarning: cpu34:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #6 retry: IO was aborted
2024-01-17T17:36:57.294Z Wa(180) vmkwarning: cpu5:2097565)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb720c930 (fh 0x4319ea417e88) failed on #1 retry: IO was aborted
2024-01-17T17:36:59.764Z Wa(180) vmkwarning: cpu34:2346988)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb75e9f90 (fh 0x4319ea4456c8) failed on #7 retry: IO was aborted
2024-01-17T17:37:07.306Z Wa(180) vmkwarning: cpu5:2097565)WARNING: NFS41: NFS41FileIOSync:766: Synchronous IO on obj 0x430bb720c930 (fh 0x4319ea417e88) failed on #2 retry: IO was aborted
2024-01-17T17:37:09.410Z In(182) vmkernel: cpu9:2098949 opID=966c0be7)World: 12321: VC opID 3aa063a5-23-02ba maps to vmkernel opID 966c0be7
2024-01-17T17:37:09.410Z Wa(180) vmkwarning: cpu9:2098949 opID=966c0be7)WARNING: NFS41: NFS41FileOpGetFileAttributes:4559: Failed to get file attributes for object 0x430bb72199a0 name ea918613-cc1054d2-0000-000000000000: IO was aborted
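To see how long each file handle stays stuck, the NFS41FileIOSync warnings can be grouped by handle. This is only a minimal Python sketch: the regex is a guess based on the sample lines above, not on any documented ESXi log format, and `summarize_retries` is a made-up helper name.

```python
import re

# Assumption: the pattern is derived only from the sample warnings above.
IOSYNC = re.compile(
    r"^(?P<ts>\S+) .*NFS41FileIOSync:\d+: Synchronous IO on obj 0x[0-9a-f]+ "
    r"\(fh (?P<fh>0x[0-9a-f]+)\) failed on #(?P<retry>\d+) retry"
)


def summarize_retries(lines):
    """Return {file_handle: (first_ts, last_ts, highest_retry)}.

    Lines that are not NFS41FileIOSync warnings are ignored.
    """
    summary = {}
    for line in lines:
        m = IOSYNC.search(line)
        if not m:
            continue
        fh, ts, retry = m["fh"], m["ts"], int(m["retry"])
        if fh not in summary:
            # First time this file handle shows up.
            summary[fh] = (ts, ts, retry)
        else:
            first, _, highest = summary[fh]
            summary[fh] = (first, ts, max(highest, retry))
    return summary
```

Feeding it the lines of the ESXi log (any iterable of strings works) gives one entry per file handle, which makes it easier to see whether the retries start at the same moment on all handles or only on one.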
Rockstor, on the other hand, does not show anything in its logs, and the connectivity between Rockstor and the ESXi host stays up, with no disconnections or the like. I have even changed the DAC cables to be sure. On the same host I have TrueNAS with iSCSI configured, which never disconnects. It’s only the NFS that keeps bouncing. vobd.log on ESXi says:
2024-01-17T17:42:27.127Z In(14) vobd[2097707]: [APDCorrelator] 34683441598us: [vob.storage.apd.start] Device or filesystem with identifier [ea918613-cc1054d2-0000-000000000000] has entered the All Paths Down state.
2024-01-17T17:42:27.127Z In(14) vobd[2097707]: [vmfsCorrelator] 34683441530us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 10.1.1.250,10.1.2.250,10.1.3.250,10.1.4.250 mount point NFS-ROCK-250, mounted as ea918613-cc1054d2-0000-000000000000 ("/export/esx_share_1")
2024-01-17T17:42:27.127Z In(14) vobd[2097707]: [APDCorrelator] 34683637679us: [esx.problem.storage.apd.start] Device or filesystem with identifier [ea918613-cc1054d2-0000-000000000000] has entered the All Paths Down state.
2024-01-17T17:42:27.127Z In(14) vobd[2097707]: [vmfsCorrelator] 34683637760us: [esx.problem.vmfs.nfs.server.disconnect] 10.1.1.250,10.1.2.250,10.1.3.250,10.1.4.250 NFS-ROCK-250 ea918613-cc1054d2-0000-000000000000 /export/esx_share_1
2024-01-17T17:42:36.934Z In(14) vobd[2097707]: [vmfsCorrelator] 34693248519us: [vob.vmfs.nfs.server.restored] Restored connection to the server 10.1.1.250,10.1.2.250,10.1.3.250,10.1.4.250 mount point NFS-ROCK-250, mounted as ea918613-cc1054d2-0000-000000000000 ("/export/esx_share_1")
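The disconnect and restored events in vobd.log can be paired up to measure how long each outage lasts. Again just a rough Python sketch: the regex and the timestamp format are assumptions taken from the sample lines above, and `outages` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Assumption: only the [vob.vmfs.nfs.server.*] lines carry "mounted as <id>",
# as in the excerpt above; the esx.problem.* duplicates are skipped.
EVENT = re.compile(
    r"^(?P<ts>\S+) .*\[vob\.vmfs\.nfs\.server\.(?P<kind>disconnect|restored)\]"
    r".*mounted as (?P<ds>[0-9a-f-]+)"
)


def parse_ts(ts):
    # Timestamps in the excerpt look like 2024-01-17T17:42:27.127Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def outages(lines):
    """Yield (datastore_id, disconnect_time, seconds_until_restored)."""
    down = {}
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        ds, when = m["ds"], parse_ts(m["ts"])
        if m["kind"] == "disconnect":
            # Keep the earliest disconnect if several arrive in a row.
            down.setdefault(ds, when)
        elif ds in down:
            start = down.pop(ds)
            yield ds, start, (when - start).total_seconds()
```

For the single disconnect/restored pair in the excerpt above (17:42:27.127 to 17:42:36.934) this works out to roughly 9.8 seconds. Running it over the whole vobd.log would show whether the outages are all that short or whether some last minutes.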
I can’t seem to figure out what to look for. Any pointers that would help me dig into the issue would be appreciated.
Thanks,
Naseer