Sangfor HCI Cluster Storage Performance Degradation Due to Hard Disk Bad Sectors
  

Humayun Ahmed (Lv4), posted 2026-May-20 12:17

# Sangfor HCI Troubleshooting Case Report

# 1. Issue Description

A customer reported intermittent performance issues and virtual machine instability in their production Sangfor HCI environment.

The following symptoms were observed:

* Virtual machines responding slowly
* Random VM freeze events
* High disk read/write latency
* VM backup jobs failing intermittently
* aSAN storage synchronization delays
* Occasional HA migration triggers

Additionally, the HCI dashboard generated storage-related warnings indicating abnormal disk health conditions on one of the cluster nodes.

Users specifically experienced:

* Slow database transactions
* Delayed file server access
* VDI desktop lag during peak hours

The issue gradually worsened over several days.

# 2. Product & Version

| Component    | Version                           |
| ------------ | --------------------------------- |
| Product      | Sangfor HCI                       |
| HCI Version  | HCI 6.10.0_R2                     |
| Storage      | aSAN Distributed Storage          |
| Hypervisor   | aSV                               |
| Cluster Size | 4 Nodes                           |
| Disk Type    | SAS SSD + SATA HDD Hybrid Storage |

# 3. Root Cause Analysis

After detailed investigation, the root cause was identified as:

## Physical Hard Disk Bad Sectors on a Storage Node

One HDD within the aSAN storage pool developed a growing number of bad sectors.

The failing disk caused:

* Read retry operations
* Storage latency spikes
* Delayed replica synchronization
* Increased storage queue depth
* VM I/O performance degradation

Although the disk had not completely failed, the bad sectors significantly impacted cluster storage performance.

## Technical Investigation

### Step 1 – Health Alarm Review

The HCI dashboard displayed the alarm:

```text
Disk health abnormal
```

and, intermittently, the warning:

```text
Storage latency warning
```

### Step 2 – Storage Performance Analysis

Observed:

* Disk latency increased from under 5 ms to as much as 150 ms:

```text
< 5 ms → 80-150 ms
```

* Replica synchronization delays
* Increased storage wait time
### Step 3 – Physical Disk Verification

The suspect disk was checked with backend diagnostic tools and SMART analysis (`/dev/sdX` stands for the affected device):

```bash
smartctl -a /dev/sdX
```

The following indicators were identified:

* Reallocated sector count increasing
* Pending sector count detected
* Uncorrectable sector errors present
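The three indicators above correspond to SMART attributes 5 (reallocated), 197 (pending) and 198 (offline uncorrectable) on ATA/SATA disks. As a rough sketch, their raw values can be pulled out of the attribute table printed by `smartctl -A`. The function names are hypothetical, and the parsing assumes the standard ten-column ATA attribute table; SAS disks report health in a different format and are not covered.

```bash
# Print the raw values of the three bad-sector attributes from
# `smartctl -A` output supplied on stdin (ATA/SATA attribute table format).
smart_bad_sector_counts() {
  awk '
    $1 !~ /^[0-9]+$/ { next }             # skip headers and non-attribute lines
    $1 == 5   { print "reallocated=" $10 }
    $1 == 197 { print "pending=" $10 }
    $1 == 198 { print "uncorrectable=" $10 }
  '
}

# Succeed (exit status 0) when any of the three counters is non-zero.
smart_has_bad_sectors() {
  smart_bad_sector_counts | awk -F= '$2 + 0 > 0 { found = 1 } END { exit !found }'
}
```

Typical use would be `smartctl -A /dev/sdX | smart_bad_sector_counts`, run with sufficient privileges on the node that hosts the disk.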

### Step 4 – RAID / Disk Event Log Review

Storage logs recorded the following errors during VM I/O operations:

```text
Medium error / unrecoverable read error
```
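To see which device such errors point at, the log lines can be counted per disk. This sketch assumes the errors appear in the Linux kernel log in the usual SCSI message style (for example `[sdd] ... Sense Key : Medium Error` or `critical medium error, dev sdd`); the exact log location and wording on a Sangfor node are not stated in this case, and the function name is hypothetical.

```bash
# Count medium-error / unrecovered-read-error lines per device.
# Reads kernel log text on stdin and prints "<device>=<count>" lines.
medium_errors_by_device() {
  awk '
    tolower($0) ~ /medium error|unrecovered read error/ {
      if (match($0, /\[sd[a-z]+\]/))        # e.g. "[sdd]"
        dev = substr($0, RSTART + 1, RLENGTH - 2)
      else if (match($0, /dev sd[a-z]+/))   # e.g. "dev sdd"
        dev = substr($0, RSTART + 4, RLENGTH - 4)
      else
        next                                # error line without a device name
      count[dev]++
    }
    END { for (d in count) print d "=" count[d] }
  ' | sort
}
```

It could be fed from the kernel ring buffer, for instance `dmesg | medium_errors_by_device`. A single device dominating the counts supports the SMART findings from Step 3.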

# 4. Solution

## Step 1 – Identify the Faulty Disk

The affected storage node and disk slot were identified through:

* HCI hardware monitoring
* SMART diagnostics
* RAID controller event logs

## Step 2 – Migrate Critical VM Workloads

Before hardware replacement:

* Critical VMs were migrated to healthy nodes
* Storage rebalance status was verified
* Snapshot consistency was checked

## Step 3 – Replace the Faulty Disk

The faulty HDD was physically replaced with a compatible enterprise-grade disk matching:

* Capacity
* RPM
* Interface type

## Step 4 – Rebuild aSAN Replica Data

After replacement:

* The new disk was added back into the aSAN storage pool
* The replica rebuild process started automatically

Storage synchronization was monitored until completion.

## Step 5 – Verify Cluster Health

The following checks were performed:

### Storage Health

```text
Healthy
```

### Replica Status

```text
Fully synchronized
```

### Disk SMART Status

```text
PASSED
```
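An overall SMART result of PASSED does not by itself rule out bad sectors, so it can be worth checking the replacement disk's sector counters as well. The sketch below assumes ATA-style `smartctl -a` output on stdin; the function name `verify_replacement_disk` is hypothetical.

```bash
# Read `smartctl -a` output on stdin. Succeed only if the overall health
# result is PASSED and attributes 5, 197 and 198 all have a raw value of 0.
verify_replacement_disk() {
  awk '
    /self-assessment test result: PASSED/ { passed = 1 }
    # Attribute table rows have at least 10 columns; column 10 is the raw value.
    NF >= 10 && $1 ~ /^(5|197|198)$/ && $10 + 0 > 0 { bad = 1 }
    END {
      if (passed && !bad) exit 0
      exit 1
    }
  '
}
```

A possible invocation is `smartctl -a /dev/sdX | verify_replacement_disk && echo "disk looks clean"`.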

## Step 6 – Performance Validation

After rebuild completion:

| Metric       | Before Fix             | After Fix  |
| ------------ | ---------------------- | ---------- |
| Disk Latency | 80-150 ms              | 2-5 ms     |
| VM Response  | Slow                   | Normal     |
| Storage Sync | Delayed                | Healthy    |
| Backup Jobs  | Failing intermittently | Successful |

# 5. Confirmation of Resolution

The issue was confirmed resolved after:

* VM performance returned to normal
* No additional storage alarms appeared
* Replica synchronization completed successfully
* Backup jobs completed without errors
* 72-hour monitoring showed stable operation

The customer confirmed:

```text
All production services restored successfully
```

# 6. Preventive Recommendations

To prevent similar incidents:

* Enable regular SMART monitoring
* Configure proactive disk health alerts
* Maintain spare disks onsite
* Perform periodic storage health checks
* Replace disks whose bad sector counts are increasing before they fail outright
* Use only enterprise-grade SSDs and HDDs
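Spotting an increasing bad sector count requires keeping a history of the counter. As a minimal sketch, assume a periodic job appends `<date> <reallocated_count>` lines to a per-disk history file (the file format and the function name `reallocated_trend` are illustrative assumptions, not part of Sangfor HCI). The first and last samples can then be compared:

```bash
# Read "<date> <reallocated_count>" lines (oldest first) on stdin and report
# whether the count grew between the first and the last sample.
reallocated_trend() {
  awk '
    NF >= 2 {
      if (!seen) { first = $2 + 0; seen = 1 }   # remember the oldest sample
      last = $2 + 0                             # keep overwriting with the newest
    }
    END {
      if (!seen)             print "no-data"
      else if (last > first) print "increasing " first " -> " last
      else                   print "stable " first " -> " last
    }
  '
}
```

A result of `increasing` on a pooled disk would be a reasonable trigger for scheduling a replacement during a maintenance window rather than waiting for a hard failure.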

# 7. Final Conclusion

The production issue was caused by a physical hard disk developing bad sectors inside the aSAN storage cluster.

Although the disk had not completely failed, storage read retries and I/O delays severely affected VM performance and cluster synchronization.

After replacing the faulty disk and rebuilding storage replicas, the Sangfor HCI cluster returned to stable and healthy operation.
