#Troubleshooting# Sangfor HCI Cluster Storage Latency Causing VM Performance Degradation

Posted 2026-May-07 13:47

1. Issue Description

A customer reported that multiple production virtual machines hosted on the Sangfor HCI cluster were experiencing:
> Slow application response
> High VM disk latency
> Random freezing during file operations
> Delayed database transactions

The issue started after a storage expansion and HCI upgrade maintenance window.
Users specifically noticed:
> Microsoft SQL database response delays
> Slow file server access
> Increased login time on VDI desktops

No hardware alarms were visible from the HCI dashboard, and all cluster nodes appeared healthy.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

2. Product & Version

Component	Version
Product	Sangfor HCI
HCI Version	HCI 6.11.1_R1
Storage	aSAN Distributed Storage
Hypervisor	aSV
Deployment Type	4-node production cluster
Storage Network	25GbE RDMA

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

3. Root Cause Analysis

After detailed troubleshooting, the root cause was identified: the RDMA configuration was disabled after the upgrade.

Although the storage network interfaces remained operational after the upgrade, the following issues occurred:
> RDMA was no longer enabled on the storage network interfaces
> Switch congestion control configuration was missing
> The cluster automatically fell back to standard TCP storage communication

This caused:
> Increased storage latency
> Higher CPU usage on storage nodes
> Reduced IOPS performance
> Slow VM disk read/write operations

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Technical Findings

Symptoms Observed:

> VM disk latency increased from < 5ms to 35~60ms
> Storage network throughput became unstable
> CPU usage on storage processes increased significantly
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Investigation Performed:

Step 1 – Health Check

Cluster health status: Healthy. No disk failures or node failures detected.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 2 – Storage Performance Analysis

Checked:
> aSAN latency
> Network throughput
> Storage synchronization status

Result: Abnormally high storage network latency

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 3 – Verify RDMA Status

Backend diagnostic tools showed that the RDMA service was inactive.
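
The post does not name the backend tools that were used. As a rough illustration only, the sketch below reads the standard Linux RDMA sysfs tree (`/sys/class/infiniband/<device>/ports/<port>/state`), where each port reports a line such as `4: ACTIVE`. Whether an aSV/aSAN node exposes this tree, or allows shell access to it, is an assumption; the function names are hypothetical.

```python
from pathlib import Path


def rdma_port_states(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): state} for every RDMA port under sysfs_root.

    Assumes the generic Linux layout <root>/<device>/ports/<port>/state,
    where the file content looks like "4: ACTIVE".
    """
    root = Path(sysfs_root)
    states = {}
    if not root.is_dir():
        return states  # no RDMA devices registered with the kernel
    for dev in sorted(root.iterdir()):
        ports_dir = dev / "ports"
        if not ports_dir.is_dir():
            continue
        for port in sorted(ports_dir.iterdir()):
            state_file = port / "state"
            if state_file.is_file():
                raw = state_file.read_text().strip()
                # Keep only the text after the numeric prefix ("4: ACTIVE" -> "ACTIVE")
                states[(dev.name, port.name)] = raw.split(":", 1)[-1].strip()
    return states


def all_rdma_ports_active(states):
    """True only if at least one port exists and every port is ACTIVE."""
    return bool(states) and all(state == "ACTIVE" for state in states.values())
```

An empty result from `rdma_port_states()` would be consistent with the finding above: no RDMA device is registered, so storage traffic can only use TCP.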

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 4 – Switch Configuration Review

It was discovered that after firmware maintenance:
> PFC (Priority Flow Control) was disabled
> ECN configuration was missing
> Jumbo frame MTU reverted to 1500
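
A review like this can be expressed as a comparison between the expected storage-port settings and what the switch actually reports. The sketch below is a minimal illustration: the dictionary keys (`pfc_enabled`, `ecn_enabled`, `mtu`) and the function name are hypothetical, and collecting the observed values from a particular switch vendor is outside its scope.

```python
# Expected settings for a storage-facing switch port, taken from this case.
# The key names are illustrative, not a vendor schema.
EXPECTED_STORAGE_PORT = {
    "pfc_enabled": True,
    "ecn_enabled": True,
    "mtu": 9000,
}


def audit_port(observed, expected=EXPECTED_STORAGE_PORT):
    """Return a list of human-readable deviations from the expected settings.

    A missing key in `observed` is reported as None, which matches the
    "configuration was missing" finding.
    """
    deviations = []
    for key, want in expected.items():
        found = observed.get(key)
        if found != want:
            deviations.append(f"{key}: expected {want!r}, found {found!r}")
    return deviations
```

Fed with the state found in this incident (`{"pfc_enabled": False, "mtu": 1500}`), the function reports all three deviations, including ECN as missing.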

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

4. Solution

Step 1 – Re-enable RDMA

RDMA was reconfigured and enabled on all storage interfaces.
Verified: RDMA status = Active

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 2 – Configure Jumbo Frames

Configured MTU: 9000
Applied on:
> Switch ports
> HCI storage interfaces
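
On a generic Linux host, the configured MTU of an interface can be read from `/sys/class/net/<interface>/mtu`. The sketch below uses that file to flag interfaces that do not match the expected value. It assumes the HCI node exposes the standard sysfs layout; the function names, and any interface names passed in, are hypothetical.

```python
from pathlib import Path


def interface_mtus(interfaces, sysfs_root="/sys/class/net"):
    """Return {interface: mtu} using the Linux sysfs mtu file.

    An interface that does not exist is reported as None.
    """
    result = {}
    for name in interfaces:
        mtu_file = Path(sysfs_root) / name / "mtu"
        result[name] = int(mtu_file.read_text()) if mtu_file.is_file() else None
    return result


def mtu_mismatches(mtus, expected=9000):
    """Return only the interfaces whose MTU differs from the expected value."""
    return {name: mtu for name, mtu in mtus.items() if mtu != expected}
```

Running the same check on every node helps catch a single interface left at 1500, which is enough to break end-to-end jumbo frames.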

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 3 – Configure Switch Congestion Control

Enabled on all storage network switches:
> Priority Flow Control (PFC)
> ECN (Explicit Congestion Notification)

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 4 – Validate Storage Network

Performed:
> Ping with jumbo packets
> Throughput testing
> Storage benchmark verification
Result: No packet fragmentation detected
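
For the jumbo ping test, the ICMP payload has to be sized so that the whole IPv4 packet equals the MTU: 9000 bytes minus the 20-byte IPv4 header and the 8-byte ICMP header gives 8972 bytes. The sketch below builds such a command for the Linux iputils `ping`, using `-M do` to prohibit fragmentation. It assumes IPv4 without IP options; the function names are hypothetical and the peer address in the usage note is a placeholder, not one from this case.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def jumbo_ping_payload(mtu=9000):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def jumbo_ping_command(peer, mtu=9000, count=3):
    """Build an iputils ping command with the don't-fragment flag set.

    -M do : prohibit fragmentation
    -s    : ICMP payload size in bytes
    -c    : number of echo requests
    """
    return [
        "ping", "-M", "do",
        "-s", str(jumbo_ping_payload(mtu)),
        "-c", str(count),
        peer,
    ]
```

For example, `jumbo_ping_command("192.0.2.10")` yields `ping -M do -s 8972 -c 3 192.0.2.10`, which can be passed to `subprocess.run`. If any hop on the storage path still uses MTU 1500, the ping fails instead of being silently fragmented.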

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Step 5 – Monitor Cluster Performance

After corrective actions:

Metric	Before Fix	After Fix
Storage Latency	35~60ms	2~5ms
VM Response Time	Slow	Normal
CPU Usage	High	Stable
Database Performance	Delayed	Normal

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

5. Confirmation of Resolution

The issue was confirmed resolved after:

> VM applications returned to normal performance
> SQL database response normalized
> No additional storage latency alarms appeared
> 48-hour monitoring showed stable performance

Customer confirmed that all business services are operating normally.

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

6. Preventive Recommendations

To avoid similar issues in future upgrades:

> Verify RDMA status after every upgrade
> Validate switch PFC/ECN configuration
> Maintain MTU consistency end-to-end
> Perform storage performance baseline checks
> Back up and document the switch configuration before maintenance
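
A baseline check can be as simple as comparing latency samples taken before and after a maintenance window. The sketch below is one possible approach, not a Sangfor feature: it compares medians and flags a regression when the post-maintenance median exceeds the baseline by a chosen ratio. The function name and the 1.5x threshold are assumptions; how the samples are collected is left open.

```python
from statistics import median


def latency_regressed(before_ms, after_ms, max_ratio=1.5):
    """Return True if the median latency after maintenance exceeds the
    pre-maintenance median by more than max_ratio.

    before_ms / after_ms: non-empty sequences of latency samples in ms.
    max_ratio: illustrative tolerance, to be tuned per environment.
    """
    return median(after_ms) > max_ratio * median(before_ms)
```

With samples in the ranges seen in this case (below 5ms before, 35~60ms after), the check would have flagged the regression right after the maintenance window rather than after user complaints.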

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

7. Final Conclusion

The incident was caused by RDMA and storage network optimization settings being lost after maintenance activity.
Once RDMA, jumbo frames, and switch congestion control were restored, storage latency returned to normal, and all VM workloads stabilized successfully.

Posted 2026-May-09 14:23

Thank you for this valuable information.


