#Troubleshooting# Sangfor HCI Cluster Storage Latency Causing VM Performance Degradation
  

Humayun Ahmed Lv4 Posted May-07-2026 13:47

1. Issue Description
A customer reported that multiple production virtual machines hosted on the Sangfor HCI cluster were experiencing:
> Slow application response
> High VM disk latency
> Random freezing during file operations
> Delayed database transactions

The issue started after a storage expansion and HCI upgrade maintenance window.
Users specifically noticed:
> Microsoft SQL database response delays
> Slow file server access
> Increased login time on VDI desktops

No hardware alarms were visible from the HCI dashboard, and all cluster nodes appeared healthy.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

2. Product & Version
| Component | Version |
| --- | --- |
| Product | Sangfor HCI |
| HCI Version | HCI 6.11.1_R1 |
| Storage | aSAN Distributed Storage |
| Hypervisor | aSV |
| Deployment Type | 4-node production cluster |
| Storage Network | 25GbE RDMA |

-------------------------------------------------------------------------------------------------------------------------------------------------------------------


3. Root Cause Analysis
After detailed troubleshooting, the root cause was identified as:
RDMA Configuration Was Disabled After Upgrade
Although the storage network interfaces remained operational after the upgrade, the following issues occurred:
> RDMA was no longer enabled on the storage network interfaces
> Switch congestion control configuration was missing
> The cluster automatically fell back to standard TCP storage communication

This caused:
> Increased storage latency
> Higher CPU usage on storage nodes
> Reduced IOPS performance
> Slow VM disk read/write operations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Technical Findings

Symptoms Observed:
> VM disk latency increased from < 5 ms to 35~60 ms
> Storage network throughput became unstable
> CPU usage on storage processes increased significantly
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Investigation Performed:

Step 1 – Health Check
Cluster health status: Healthy
No disk failures or node failures detected.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 2 – Storage Performance Analysis
Checked:
> aSAN latency
> Network throughput
> Storage synchronization status
Result: Abnormally high storage network latency
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 3 – Verify RDMA Status
Using backend diagnostic tools, verified: RDMA service inactive
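
The post does not name the backend diagnostic tools that were used. As a rough illustration only, on a standard Linux host the kernel lists RDMA devices under /sys/class/infiniband, and each port has a state file with a value such as "4: ACTIVE". The sketch below reads those files. It is an assumption that an aSV node exposes this path, and an ACTIVE port only shows that the RDMA device link is up, not that aSAN is actually using RDMA.

```python
from pathlib import Path


def rdma_port_states(sysfs_root="/sys/class/infiniband"):
    """Return {"device/port": state} for every RDMA port found under sysfs_root."""
    root = Path(sysfs_root)
    states = {}
    if not root.is_dir():
        return states  # no RDMA devices registered with the kernel
    for state_file in sorted(root.glob("*/ports/*/state")):
        device = state_file.parts[-4]
        port = state_file.parts[-2]
        # The kernel writes values such as "4: ACTIVE" or "1: DOWN".
        text = state_file.read_text().strip()
        states[f"{device}/{port}"] = text.split(":", 1)[-1].strip()
    return states


def rdma_is_active(states):
    """True only if at least one port exists and every port reports ACTIVE."""
    return bool(states) and all(s == "ACTIVE" for s in states.values())


if __name__ == "__main__":
    found = rdma_port_states()
    for name, state in found.items():
        print(f"{name}: {state}")
    print("RDMA active" if rdma_is_active(found) else "RDMA inactive or not present")
```

An empty result is treated as inactive, so a node where the RDMA devices disappeared after an upgrade is reported the same way as a node with a port that is down.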
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 4 – Switch Configuration Review
It was discovered that after firmware maintenance:
> PFC (Priority Flow Control) was disabled
> ECN configuration was missing
> Jumbo frame MTU reverted to 1500
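
The three findings above amount to configuration drift from a known-good state. A minimal sketch of that comparison is shown here, assuming the switch settings have already been collected into a dictionary by some other means. The key names (pfc_enabled, ecn_enabled, mtu) are hypothetical and do not correspond to any particular switch vendor's output.

```python
# Expected storage-network settings, taken from the values restored in this case.
EXPECTED = {"pfc_enabled": True, "ecn_enabled": True, "mtu": 9000}


def find_drift(observed, expected=EXPECTED):
    """Return {setting: (expected, observed)} for every mismatched or missing setting."""
    drift = {}
    for key, want in expected.items():
        have = observed.get(key)  # None when the setting is absent
        if have != want:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical snapshot matching the findings: PFC off, ECN missing, MTU 1500.
    after_maintenance = {"pfc_enabled": False, "mtu": 1500}
    for key, (want, have) in find_drift(after_maintenance).items():
        print(f"{key}: expected {want!r}, found {have!r}")
```

A missing setting is reported with an observed value of None, which distinguishes "ECN configuration was missing" from a setting that is present but wrong.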
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

4. Solution
Step 1 – Re-enable RDMA
RDMA was reconfigured and enabled on all storage interfaces.
Verified: RDMA status = Active
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 2 – Configure Jumbo Frames
Configured MTU: 9000
Applied on:
> Switch ports
> HCI storage interfaces
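
On the host side, the applied MTU can be double-checked on a standard Linux system by reading /sys/class/net/<interface>/mtu. The sketch below does that for a list of interface names passed on the command line. It assumes the HCI node exposes the usual Linux sysfs layout, and the interface names are whatever the storage NICs are called in the actual deployment (the post does not name them). The switch ports still have to be verified on the switch itself.

```python
import sys
from pathlib import Path


def interface_mtu(name, sysfs_root="/sys/class/net"):
    """Return the MTU of an interface, or None if the interface is not present."""
    mtu_file = Path(sysfs_root) / name / "mtu"
    if not mtu_file.is_file():
        return None
    return int(mtu_file.read_text().strip())


def mtu_mismatches(names, expected=9000, sysfs_root="/sys/class/net"):
    """Return {interface: mtu} for every interface whose MTU differs from expected."""
    result = {}
    for name in names:
        mtu = interface_mtu(name, sysfs_root)
        if mtu != expected:
            result[name] = mtu
    return result


if __name__ == "__main__":
    # Usage: python check_mtu.py <storage-interface> [<storage-interface> ...]
    bad = mtu_mismatches(sys.argv[1:])
    for name, mtu in bad.items():
        print(f"{name}: MTU is {mtu}, expected 9000")
    if not bad:
        print("All listed interfaces are at MTU 9000")
```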
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 3 – Configure Switch Congestion Control
Enabled on all storage network switches:
> Priority Flow Control (PFC)
> ECN (Explicit Congestion Notification)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 4 – Validate Storage Network
Performed:
> Ping with jumbo packets
> Throughput testing
> Storage benchmark verification
Result: No packet fragmentation detected
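
For the jumbo ping test, the ICMP payload has to leave room for the 20-byte IPv4 header and the 8-byte ICMP header, so an MTU of 9000 corresponds to a payload of 8972 bytes. The sketch below builds the matching Linux ping command with the don't-fragment option (-M do), so an oversized packet fails instead of being silently fragmented. The exact command used in this case is not stated in the post, and the target address shown is a documentation placeholder, not a real storage IP.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def jumbo_ping_command(host, mtu=9000, count=3):
    """Build a Linux (iputils) ping command that forbids fragmentation."""
    payload = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-s", str(payload), "-c", str(count), host]


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder from the documentation address range.
    print(" ".join(jumbo_ping_command("192.0.2.10")))
```

If any hop on the storage path is still at MTU 1500, this ping fails while a default-size ping succeeds, which is the same pattern that hid the problem behind a healthy dashboard.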
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 5 – Monitor Cluster Performance
After corrective actions:
| Metric | Before Fix | After Fix |
| --- | --- | --- |
| Storage Latency | 35~60 ms | 2~5 ms |
| VM Response Time | Slow | Normal |
| CPU Usage | High | Stable |
| Database Performance | Delayed | Normal |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------


5. Confirmation of Resolution
The issue was confirmed resolved after:
> VM applications returned to normal performance
> SQL database response normalized
> No additional storage latency alarms appeared
> 48-hour monitoring showed stable performance
Customer confirmed that all business services are operating normally.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

6. Preventive Recommendations
To avoid similar issues in future upgrades:
  • Verify RDMA status after every upgrade
  • Validate switch PFC/ECN configuration
  • Maintain MTU consistency end-to-end
  • Perform storage performance baseline checks
  • Back up and document the switch configuration before maintenance
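
A storage performance baseline check can be as simple as comparing latency samples taken before and after maintenance. The sketch below is one possible approach and is not a Sangfor tool. The sample values are hypothetical numbers chosen inside the ranges reported in this case, and the 2x threshold is an arbitrary assumption that should be tuned per environment.

```python
from statistics import median


def latency_regressed(baseline_ms, current_ms, factor=2.0):
    """True if the median current latency exceeds factor times the median baseline."""
    return median(current_ms) > factor * median(baseline_ms)


if __name__ == "__main__":
    baseline = [3.0, 4.0, 4.5]         # hypothetical pre-upgrade samples (ms)
    after_upgrade = [35.0, 48.0, 60.0]  # hypothetical post-upgrade samples (ms)
    if latency_regressed(baseline, after_upgrade):
        print("Storage latency regressed against baseline: check RDMA, PFC/ECN and MTU")
    else:
        print("Storage latency within baseline")
```

Using the median rather than the mean keeps a single slow sample from triggering a false alarm.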



-------------------------------------------------------------------------------------------------------------------------------------------------------------------

7. Final Conclusion

The incident was caused by RDMA and storage network optimization settings being lost after maintenance activity.
Once RDMA, jumbo frames, and switch congestion control were restored, storage latency returned to normal, and all VM workloads stabilized successfully.

Prosi Lv3 Posted May-09-2026 14:23
  
Thank you for this valuable information.

Sangfor Jojo Lv5 Posted May-09-2026 15:39
  
Thanks for sharing your case. 3500 coins have been delivered to your account. You can check the system message on the homepage.

Sangfor Jojo Lv5 Posted May-09-2026 15:42
  
To all, please do not hesitate to share your likes and comments when reading this article.
Much appreciated!
We are also looking forward to seeing your knowledge and skills here.
