[HCI] Health check result shows alert on the Cluster RAM

Issue Description

The HCI health check result shows an alert on the cluster RAM: [The boot memory(917504MB) is smaller than previous boot(108576MB), possibly because memory anomaly exists.]

Error/Warning Information

Handling Process

1. Access the HCI backend and check the number of RAM modules on each node. All nodes were found to have 8 RAM modules.
grep '[0-9]' /sys/devices/system/edac/mc/mc*/dimm*/dimm_ce_count
dmidecode -t memory | grep Locator | grep -v NO
2. Check the /sf/log/checkitem/mem_size_data log.
3. The log shows that the previously recorded total RAM size of the cluster was larger than the current total RAM size. Based on the log, the RAM alert is therefore expected.
4. Verify with the user. In this case, the cluster had more RAM during the earlier POC project than the current HCI cluster setup has.
5. Once it is confirmed that the user changed the HCI cluster RAM size, the alert can be considered normal.
6. On the abnormal node, run the command [mv /sf/log/checkitem/mem_size_data /sf/data/local/dump/].
7. After the command was run on the abnormal node, the health check result returned to normal.
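
The module count check in step 1 can be sketched as a small helper that counts populated slots from the dmidecode output instead of reading the list by eye. This is only a sketch: the function name count_populated_dimms is hypothetical, and it assumes, as the filter in step 1 does, that empty slots report a locator string containing "NO". It also ignores "Bank Locator" lines so that each slot is counted once.

```shell
# Hypothetical helper: count populated DIMM slots from `dmidecode -t memory` output on stdin.
# Assumption: empty slots have a "Locator:" value containing "NO", as in the step 1 filter.
count_populated_dimms() {
    # Keep only "Locator:" lines (skipping "Bank Locator:"), then count those without "NO".
    grep -E '^[[:space:]]*Locator:' | grep -vc 'NO'
}
```

Usage on a node would be: dmidecode -t memory | count_populated_dimms. The result can then be compared across nodes (8 per node in this case).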

Root Cause

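The user had previously reduced the RAM size of the cluster, so the RAM size recorded in /sf/log/checkitem/mem_size_data was larger than the current one.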

Solution

Move the HCI node's RAM information log to another directory (this does not affect production) by running the following command on the abnormal node:
mv /sf/log/checkitem/mem_size_data /sf/data/local/dump/
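
For repeated use, the move can be wrapped in a guarded function that checks both paths before touching anything. This is a minimal sketch, not a vendor tool: the function name archive_mem_size_data and its two optional arguments are hypothetical, and the defaults are the paths from the command above.

```shell
# Hypothetical wrapper around the mv command above.
# $1: log file to move (default: path from this case)
# $2: destination directory (default: path from this case)
archive_mem_size_data() {
    src="${1:-/sf/log/checkitem/mem_size_data}"
    dest_dir="${2:-/sf/data/local/dump}"

    # Refuse to continue if the log is absent (for example, already moved).
    if [ ! -e "$src" ]; then
        echo "nothing to move: $src not found" >&2
        return 1
    fi

    # Refuse to continue if the destination directory does not exist.
    if [ ! -d "$dest_dir" ]; then
        echo "destination directory missing: $dest_dir" >&2
        return 1
    fi

    mv -- "$src" "$dest_dir"/
}
```

Calling archive_mem_size_data with no arguments performs the same move as the command above. Run it only after the RAM reduction has been confirmed with the user.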

Doc ID: 5593
Author: CTI Teoh
Updated: 2021-12-30 09:35
Version: