
ScaleIO Alarm

Cloud Execution Environment

OPERATING INSTRUCTIONS

1/1543-CNA 403 3308/7 Uen J


Copyright

© Ericsson AB 2019-2021. All rights reserved. No part of this document may be
reproduced in any form without the written permission of the copyright owner.

Disclaimer

The contents of this document are subject to revision without notice due to
continued progress in methodology, design and manufacturing. Ericsson shall
have no liability for any error or damage of any kind resulting from the use of
this document.

Trademark List

All trademarks mentioned herein are the property of their respective owners.
These are shown in the document Trademark Information.

1/1543-CNA 403 3308/7 Uen J | 2021-06-25


Contents

1 Introduction
1.1 Alarm Description
1.2 Prerequisites

2 Procedure
2.1 Actions
2.2 Advanced Log Collection

3 Additional Information


1 Introduction

This instruction describes alarm handling for VxSDS-related alarm events in the
Cloud Execution Environment (CEE) when an embedded Software Defined Storage
(SDS) solution is used. With shared SDS, fault management is handled outside of
CEE, so CEE receives no VxSDS alarms. For more information on embedded and
shared SDS, see the document Storage Guide.

Note: This document contains generic instructions for VxSDS alarms. There are
no separate OPIs for each individual VxSDS alarm.

1.1 Alarm Description


Distributed storage alarms are issued for VxSDS alarm events by the Managed
Object (MO) DistributedStorage.

The severity of the alarm is CRITICAL, MAJOR, or MINOR.

Each VxSDS alarm is mapped to this CEE alarm and to the ScaleIO Alert Active
alarm. To identify the relevant VxSDS alarm, use the Additional Text field in
the alarm message; see Table 1.

VxSDS provides the following severities: 2 (Warning), 3 (Error), and 5
(Critical). For VxSDS alarm integration into CEE, and to enable correct alarm
handling, alarms are mapped in the following way:

VxSDS Severity Level | CEE Severity Level
2 (Warning) | MINOR
3 (Error) | MAJOR
5 (Critical) | CRITICAL
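
The severity mapping above can be expressed as a small lookup. The following is a minimal sketch for illustration only; the names VXSDS_TO_CEE_SEVERITY and cee_severity are hypothetical and are not part of CEE or VxSDS.

```python
# Mapping copied from the severity table above: VxSDS level -> CEE severity.
# The constant and function names are hypothetical, chosen for this sketch.
VXSDS_TO_CEE_SEVERITY = {
    2: "MINOR",     # 2 (Warning)
    3: "MAJOR",     # 3 (Error)
    5: "CRITICAL",  # 5 (Critical)
}


def cee_severity(vxsds_level: int) -> str:
    """Return the CEE severity for a VxSDS severity level (2, 3, or 5)."""
    try:
        return VXSDS_TO_CEE_SEVERITY[vxsds_level]
    except KeyError:
        # Levels other than 2, 3, and 5 are not listed in the table.
        raise ValueError(f"unmapped VxSDS severity level: {vxsds_level}") from None
```

Raising an error for unlisted levels, rather than guessing a default, keeps the sketch within what the table states.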

For more information on the different VxSDS alarms, see Section 3.

Any of the following can be consequences for the system if the alarm is not resolved:

— Data or data protection can be lost.

— Redundancy can be lost.

— System performance can be reduced.

— Storage capacity can be degraded.

— VxSDS licenses expire.

The alarm attributes are listed in Table 1.


Table 1 Alarm Attributes


Attribute Name | Attribute Value
Major Type | 193
Minor Type | 2031719-2031813, 2031820-2031869
Managed Object Class | AlarmType
Managed Object Instance | Region=<name_of_the_region>,Equipment=1,DistributedStorage=<storage_system_name>
Specific Problem | ScaleIO Alarm: <scaleio_alarm_name>(1)
Event Type | <alarm_specific_event_type>
Probable Cause | <alarm_specific_probable_cause>
Additional Text | Affected object(s): <object_type>: <object_name>[; <object_type>: <object_name>]
Severity | CRITICAL, MAJOR, or MINOR

Note: The following Minor Type IDs are not valid in VxSDS Version 3.x:
2031740, 2031753, 2031755, 2031776, 2031780–2031782, 2031789, 2031794,
2031798, 2031800, 2031804–2031806, and 2031814–2031819.
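
As an illustration of how the Minor Type ranges in Table 1 and the IDs in the note above fit together, the following sketch checks a given Minor Type. It only restates the numbers from this document; the function and constant names are hypothetical.

```python
# Minor Type ranges listed in Table 1 (inclusive bounds).
SCALEIO_MINOR_TYPE_RANGES = [(2031719, 2031813), (2031820, 2031869)]

# Minor Type IDs that the note lists as not valid in VxSDS Version 3.x.
NOT_VALID_IN_3X = (
    {2031740, 2031753, 2031755, 2031776, 2031789, 2031794, 2031798, 2031800}
    | set(range(2031780, 2031783))   # 2031780-2031782
    | set(range(2031804, 2031807))   # 2031804-2031806
    | set(range(2031814, 2031820))   # 2031814-2031819
)


def is_scaleio_minor_type(minor_type: int) -> bool:
    """True if the Minor Type falls in one of the ranges given in Table 1."""
    return any(low <= minor_type <= high for low, high in SCALEIO_MINOR_TYPE_RANGES)


def is_valid_in_vxsds_3x(minor_type: int) -> bool:
    """True if the Minor Type is in range and not listed as invalid for 3.x."""
    return is_scaleio_minor_type(minor_type) and minor_type not in NOT_VALID_IN_3X
```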

See Table 2 for detailed alarm information and recommended actions.

1.2 Prerequisites
This section provides information on the documents, tools, and conditions that
apply to the procedure.

1.2.1 Documents

Before starting this procedure, ensure that you have read the following documents:

— Dell EMC PowerFlex Version 3.5.x - Security Configuration Guide

— Dell EMC PowerFlex Version 3.5.x - Getting to Know Dell EMC PowerFlex

— Dell EMC PowerFlex Version 3.5.x - Monitor Dell EMC PowerFlex

— Dell EMC PowerFlex Version 3.5.x - CLI Reference Guide

1.2.2 Tools
No tools are required.


1.2.3 Conditions
Before starting this procedure, ensure that the following conditions are met:

— The virtual IP address of the Meta Data Manager (MDM) cluster is known.

— Information about how to connect to the MDM master and how to use the
scli commands is available.

— Administrator user role information is available.

— Superuser privileges are available.

2 Procedure

This section describes the procedure to follow when this alarm is received.

2.1 Actions
Follow this procedure:

1. Run the log collection script for MDM, Storage Data Server (SDS), or Storage
Data Client (SDC):

/opt/emc/scaleio/<mdm|sds|sdc>/diag/get_info.sh -d <output_file.zip>

For example, for MDM the command is:

/opt/emc/scaleio/mdm/diag/get_info.sh -d <output_file.zip>

2. Fetch the <output_file.zip> file.

3. Identify the received alarm from the output and check the VxSDS log files to
get an overview about the fault. Perform the recommended action in Table 2.

4. If the alarm ceases, exit this procedure.

If the alarm remains, continue with the advanced log collection steps
described in Section 2.2.
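
The command in Step 1 differs only in the component directory. The sketch below builds the argument list for a given component. It is an illustration under stated assumptions: the function name and the output path /tmp/mdm_logs.zip are hypothetical, and the sketch builds the command without running it.

```python
import shlex

# Components named in Step 1 of the procedure.
COMPONENTS = ("mdm", "sds", "sdc")


def get_info_command(component: str, output_file: str) -> list[str]:
    """Build the get_info.sh command line from Step 1 for one component."""
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component!r}")
    script = f"/opt/emc/scaleio/{component}/diag/get_info.sh"
    # -d <output_file.zip> is the only flag used in Step 1.
    return [script, "-d", output_file]


if __name__ == "__main__":
    # Hypothetical output path, for illustration only.
    print(shlex.join(get_info_command("mdm", "/tmp/mdm_logs.zip")))
```

Returning a list rather than a single string lets the caller pass the command to a process launcher without extra shell quoting.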

2.2 Advanced Log Collection


This section describes how to collect detailed VxSDS system log files.


1. Retrieve the VxSDS component logs for each component.

For manual log collection, continue with Step 2.

For automated log collection, continue with Step 3.

2. For manual log collection, follow this procedure:

a. Run the following command:

/opt/emc/scaleio/<scaleio_component>/diag/get_info.sh -f

b. If the selected node is the Master MDM, use the flags -u <mdm_user>
and -p <mdm_password> instead of -f.

If the selected node contains more than one VxSDS component,
running any script gathers logs for all components on that node.

c. Verify that you get an output similar to the following, which shows
that the process of log collection has completed successfully:

Archive /tmp/scaleio-getinfo/getInfoDump.tgz created successfully

When the log collection process completes, a ZIP file containing the
logs of all VxSDS components is created in the node.

To check the options for get_info.sh, use the following command:

get_info.sh --help

3. For automated log collection, follow this procedure:

a. Log in to the Installation Manager (IM):

— From Internet Explorer®, browse to https://<im_server_ip>,
where <im_server_ip> is the IP address of the Gateway (GW)
server, with the Gateway or IM package installed on it.

— From the IM welcome screen, enter the IM credentials.

b. From the IM main menu, select Maintain.

c. Enter the login credentials and click Retrieve system topology. The
Maintenance operation screen displays the system topology.

d. Click Collect logs.

e. Enter the MDM admin password and select one of the following
options:


— Copy repositories: In addition to full logs, includes the MDM
repository from installed components. The usual size of the
repository is approximately 306 MB.

— Collect exceptions only: Instead of full logs, collects only the
VxSDS core dumps. By default, VxSDS core dumps are disabled.
Contact customer support if VxSDS core dumps need to be
enabled.

— Last version: Instead of full logs, collects the most recently
created log file from each component.

— Click Collect Logs to start the Get Info operation, which is the
log collection function of the IM. This operation may take some
time to complete, depending on your system topology. Once
started, this operation can be rolled back.

f. Select the Monitor tab to display log collection progress.

g. When the Get Info operation completes, click Download logs to
download the log files. A ZIP file, containing all VxSDS component
logs, is downloaded.

h. Click Mark operation completed to clear the log files from the IM and
make the IM available for other operations.

4. Use the collected information and perform the recommended action in Table
2 to solve the fault.

5. If the alarm ceases, exit this procedure.

The ScaleIO® Alarm is synchronized once every hour. This means that if the
alarm refers to several instances of the same alarm type and not all of them
are resolved, the Additional Text field still shows all instances until the
next synchronization.

Example:

— Additional Text field at the time when the alarm is issued:

Affected object(s): Storage pool: pool1; Storage pool: pool2; Storage pool: pool3

— Additional Text field within an hour after resolving problems for pool3:

Affected object(s): Storage pool: pool1; Storage pool: pool2; Storage pool: pool3

— Additional Text field after an hour:

Affected object(s): Storage pool: pool1; Storage pool: pool2


If the alarm remains, collect troubleshooting data as described in the Data
Collection Guideline and contact the next level of maintenance support.

6. The job is completed.
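
When comparing the Additional Text field before and after the hourly synchronization described in Step 5, it can help to split the field into its affected objects. The following sketch assumes the format given in Table 1, with entries separated by semicolons and each entry written as <object_type>: <object_name>. The function name is hypothetical and the sketch is not part of CEE.

```python
PREFIX = "Affected object(s):"


def parse_affected_objects(additional_text: str) -> list[tuple[str, str]]:
    """Split an Additional Text value into (object_type, object_name) pairs."""
    text = additional_text.strip()
    if not text.startswith(PREFIX):
        raise ValueError("Additional Text does not start with 'Affected object(s):'")
    objects = []
    # Entries are assumed to be separated by semicolons, as in Table 1.
    for entry in text[len(PREFIX):].split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # The first colon separates the object type from the object name.
        object_type, separator, object_name = entry.partition(":")
        if not separator:
            raise ValueError(f"entry without object type: {entry!r}")
        objects.append((object_type.strip(), object_name.strip()))
    return objects
```

Comparing the parsed lists from two consecutive synchronizations, for example as sets, shows which instances have cleared.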

3 Additional Information

Table 2 lists alarms that can be received from the VxSDS system and provides
recommended actions for each alarm. CEE alarms can be matched to the relevant
VxSDS alert using the columns "Alarm Message Created by Watchmen" and
"Alert Message in VxSDS REST API" in the table. For more information on VxSDS
alerts, see the document Dell EMC PowerFlex Version 3.5.x - Monitor Dell EMC
PowerFlex.

Note: Use the GUI only for alarm monitoring. For acting on alarms and
performing the recommended actions described below, use CLI.

Table 2 VxSDS Alarms and Recommended Action


Alarm Message Created by Watchmen | Alert Message in VxSDS REST API | CEE Severity Level | Recommended Action
License Expired | LICENSE_EXPIRED | 5 (Critical) | To resume operational mode, contact the next level of maintenance support for license renewal. If you have already renewed your license, install it.
License about to Expire | LICENSE_ABOUT_TO_EXPIRE | 3 (Error) or 2 (Warning) according to time left and limits | Contact the next level of maintenance support for license renewal. If you have already renewed your license, install it.
Trial License Used | TRIAL_LICENSE_USED | 2 (Warning) | Purchase a license and install it.
Object Has Oscillating Failures | OBJECT_HAS_OSCILLATING_FAILURES | 2 (Warning) | Check oscillating failures of the component and take action accordingly. If the oscillating failure does not indicate a problem, change the settings of the oscillating failure window to suppress this alert.

Object Has Oscillating Network Failures | OBJECT_HAS_OSCILLATING_NETWORK_FAILURES | 2 (Warning) | Check the oscillating failure report that can be accessed from one of the management interfaces. Check if there is a problem with network links, fix it, and restart the counters.
GW Configuration Invalid MDM Credentials | GW_CONFIGURATION_INVALID_MDM_CREDENTIALS | 5 (Critical) | Configure the MDM credentials in the VxSDS gateway using the FOSGWTool.
MDM Credentials Are Not Configured | MDM_CREDENTIALS_ARE_NOT_CONFIGURED | 5 (Critical) | Ensure lockbox is present, and add MDM credentials using /opt/emc/scaleio/gateway/bin/FOSGWTool.sh.
GW User Requires PW Change | GW_USER_REQUIRES_PW_CHANGE | | Configure MDM credentials on the VxSDS gateway using the FOSGWTool.
Upgrade in Progress | UPGRADE_IN_PROGRESS | 3 (Error) | Monitor the upgrade process, and check that it is completed successfully.
GW Too Old | GW_TOO_OLD | 5 (Critical) | Upgrade the VxSDS Gateway to the same version as the rest of your system.
MDM Not Clustered | MDM_NOT_CLUSTERED | 5 (Critical) | MDM cluster was manually set to SINGLE mode. Confirm that this is an expected operation. Working in SINGLE mode is not recommended. Prepare the cluster modules, if needed, and return to the CLUSTER mode.
MDM Fails Over Frequently | MDM_FAILS_OVER_FREQUENTLY | 5 (Critical) or 3 (Error) or 2 (Warning) according to disconnect count and hard coded values (2/3/10) | The MDMs frequently swap ownership. No action required.
FWD Rebuild Stuck | FWD_REBUILD_STUCK | 2 (Warning)–5 (Critical) | Check the system for lack of spare capacity and/or failed capacity, and either fix the problem or add capacity, if necessary.

BKWD Rebuild Stuck | BKWD_REBUILD_STUCK | 2 (Warning)–5 (Critical) | Check the system for lack of spare capacity and/or failed capacity, and either fix the problem or add capacity, if necessary.
Rebalance Stuck | REBALANCE_STUCK | 5 (Critical) or 3 (Error) or 2 (Warning) | Add a physical disk; if this is not possible, reduce the spare policy while maintaining enough spare to sustain a rebuild, if necessary.
Cluster Degraded | CLUSTER_DEGRADED | 3 (Error)–5 (Critical) | Check that all MDM cluster nodes are functioning correctly, and fix and replace faulty nodes, if necessary, to return to full protection.
MDM Not Clustered Volumes Exist | MDM_NOT_CLUSTERED_VOLUMES_EXIST | 5 (Critical) | The MDM cluster was manually set to SINGLE mode. Working in SINGLE mode is not recommended. SINGLE mode means that there is only one copy of the MDM repository. If this copy is lost, all system configurations and all the data on all the existing volumes are lost. To avoid data loss, do the following: • Verify that this is an expected operation. • Prepare the cluster modules, if needed, and return to CLUSTERED mode as soon as possible.
PD Inactive | PD_INACTIVE | 2 (Warning) | Protection Domain was inactivated by a user command. Confirm that this is an expected operation. This is usually done for maintenance. When maintenance is complete, reactivate the Protection Domain.
Storage Pool Has Failed Capacity | STORAGE_POOL_HAS_FAILED_CAPACITY | 5 (Critical) | For the given storage pool, for some blocks, both primary and secondary copies are inaccessible. Check and fix the state of all devices in the storage pool and all holding devices of the server in the storage pool.

Storage Pool Has Degraded Capacity | STORAGE_POOL_HAS_DEGRADED_CAPACITY | 3 (Error) | For the given storage pool, for some blocks, one of the two copies is inaccessible. Check if a server is offline or if there is another server hardware-related issue. Check if a storage device is down.
Capacity Utilization Above Critical Threshold | CAPACITY_UTILIZATION_ABOVE_CRITICAL_THRESHOLD | 5 (Critical) | The capacity utilization of the storage pool is reaching a critical threshold. Remove unneeded volumes and snapshots, if possible, or add physical storage.
Capacity Utilization Above High Threshold | CAPACITY_UTILIZATION_ABOVE_HIGH_THRESHOLD | 3 (Error) or 2 (Warning) | The capacity utilization of the storage pool is reaching a high threshold. Remove unneeded volumes and snapshots, if possible, or add physical storage.
Failure Recovery Capacity below Threshold | FAILURE_RECOVERY_CAPACITY_BELOW_THRESHOLD | 3 (Error) | The capacity available for recovery in a degraded storage event is lower than the predefined threshold. Replace failed hardware or add more physical storage.
Background Scanner Compare Error | BACKGROUND_SCANNER_COMPARE_ERROR | 5 (Critical) | Primary and secondary data copies do not match. Check the volumes running on the storage pool for any consistency issues. Contact the next level of maintenance support for assistance.
Spare Capacity Smaller Than Largest Fault Unit | CONFIGURED_SPARE_CAPACITY_SMALLER_THAN_LARGEST_FAULT_UNIT | 5 (Critical) | Increase the spare percentage configured in the storage pool and reserved for failure recovery, so that it is larger than the largest fault set in the storage pool.
Storage Pool Unbalanced | STORAGE_POOL_UNBALANCED | 3 (Error) | Move some physical disks from the large SDS to the others, or add disks to the smaller SDS in order to approximate the capacity of the large SDS as much as possible.
Not Enough Fault Units in SP | NOT_ENOUGH_FAULT_UNITS_IN_SP | 3 (Error) | Add more SDSs to the storage pool to meet the minimum requirement of three hosts.

Untrusted Certificate | UNTRUSTED_CERTIFICATE | 3 (Error) | Pending for approval certificates can be viewed and approved via System Settings>Connection.
Certificate About to Expire | CERTIFICATE_ABOUT_TO_EXPIRE | 5 (Critical) | Remove the certificate on MDM nodes (path: /opt/emc/scaleio/mdm/cfg/mdm_management_certificate.pem). Restart MDM service on slave and master MDM respectively.
MDM Certificate Expired | MDM_CERTIFICATE_EXPIRED | 5 (Critical) | Remove the certificate on MDM nodes (path: /opt/emc/scaleio/mdm/cfg/mdm_management_certificate.pem). Restart MDM service on slave and master MDM respectively.
MDM Secure Connection Disabled | MDM_SECURE_CONNECTION_DISABLED | 5 (Critical) | Enable secure connections on the MDM to protect your login information.
MDM Self Signed Certificate Not Trusted | MDM_SELF_SIGNED_CERTIFICATE_NOT_TRUSTED | 5 (Critical) | Check the certificate, and proceed at your own risk.
MDM Secure Connection Not Supported | MDM_SECURE_CONNECTION_NOT_SUPPORTED | 5 (Critical) | Check MDM cluster nodes.
MDM Certificate Not yet Valid | MDM_CERTIFICATE_NOT_YET_VALID | 5 (Critical) | The time and date on the computer where the certificate was created is not consistent with the time and date set in the VxSDS system. Replace the certificate or fix the system time.
MDM CA Signed Certificate CA Not Trusted | MDM_CA_SIGNED_CERTIFICATE_CA_NOT_TRUSTED | 5 (Critical) | Proceed at your own risk.
SDS Disconnected | SDS_DISCONNECTED | 3 (Error) | The SDS service can be down or unreachable over the network. Verify that the SDS service is up and running and that the network is properly connected.

SDS Disconnects Frequently | SDS_DISCONNECTS_FREQUENTLY | 3 (Error) or 2 (Warning) according to disconnect count and hard-coded | The SDS connection is fluctuating due to an unstable network connection. Check the SDS data network connection for packet drops, and try to disconnect one of the ports to see if the SDS disconnection issue is resolved by using only one port. If this does not resolve the issue, switch to the other port. If there is still an issue, it can be due to a faulty NIC, faulty switch ports, or a faulty switch. If there is no issue with another switch, the issue was switch-related. Otherwise, the issue can be due to a faulty NIC, which requires NIC replacement.
SDS RMCache Memory Allocation Failed | SDS_RMCACHE_MEMORY_ALLOCATION_FAILED | 2 (Warning) | The system failed to allocate memory to the SDS RAM read cache. For 32 GB RAM or less, up to 50% of the memory can be allocated for caching. From 32 GB or more, up to 75% of the memory can be allocated for caching. Reduce the configured RAM read cache memory to match the allocation conditions.
DRL Mode Non Volatile | DRL_MODE_NON_VOLATILE | 2 (Warning) | DRL mode is configured to Hardened instead of Volatile. Both modes are configurable.
RFCache CARD IO Error | RFCACHE_CARD_IO_ERROR | 2 (Warning) | Disable caching on the device and check the health of the device, because it can be faulty. If necessary, replace the device.
RFCAche Cache Skip Due to Heavy Load | RFCACHE_CACHE_SKIP_DUE_TO_HEAVY_LOAD | 2 (Warning) | Read flash cache is working under a heavy load, and therefore has skipped some IOs. This is a temporary error which can resolve itself. If it persists, try to balance the storage pool contents across more SDSs, or add more cache cards.

RFCAche IO Stuck Error | RFCACHE_IO_STUCK_ERROR | 2 (Warning) | IO has become stuck on the cache device. Disable caching on the device and check the health of the device, because it can be faulty. If necessary, replace the device.
RFCAche Low Resources | RFCACHE_LOW_RESOURCES | 2 (Warning) | There is not enough RAM available on the server for read flash cache optimal operation. Increase the amount of available RAM.
RFCAche Inconsistent Source Configuration | RFCACHE_INCONSISTENT_SOURCE_CONFIGURATION | 2 (Warning) | Check RFcache state of all disks in the pool and adjust them so that all disks have the same caching state.
RFCAche Inconsistent Cache Configuration | RFCACHE_INCONSISTENT_CACHE_CONFIGURATION | 2 (Warning) | Query the system to determine what is not consistent in the configurations of the read flash cache driver and the SDS where the cache device is located.
RFCAche Device Does Not Exist | RFCACHE_DEVICE_DOES_NOT_EXIST | 2 (Warning) | You tried to add a cache device that does not exist. Check and fix read flash cache configuration.
RFCAche API Error Mismatch | RFCACHE_API_ERROR_MISMATCH | 2 (Warning) | The read flash cache xcache driver version and SDS version do not match. Try to upgrade them to the same version. If the problem persists, contact the next level of maintenance support.
SDS in Maintenance | SDS_IN_MAINTENANCE | 2 (Warning) | The SDS is currently in maintenance mode. Exit maintenance mode once maintenance is complete. If a non-Disruptive Upgrade (NDU) is in progress, ignore this warning.
Device Failed | DEVICE_FAILED | 3 (Error) | The SDS device cannot be opened, read from or written to. Validate the device state. Check the cause of the error, and determine if it is a human error or a system malfunction. Check hardware if needed.
Device Pending Activation | DEVICE_PENDING_ACTIVATION | 2 (Warning) | The SDS device has been added and tested. Activate the SDS device.

Fixed Read Error Count Above Warning Threshold | FIXED_READ_ERROR_COUNT_ABOVE_WARNING_THRESHOLD | 3 (Error) if counter > 0 | Read from the SDS device failed. Data was corrected from the other copy. No action is required, but note that the device can be faulty.
Fixed Read Error Count Above Critical Threshold | FIXED_READ_ERROR_COUNT_ABOVE_CRITICAL_THRESHOLD | 5 (Critical) if counter >= 5 | SDS device read failed more than five times. Replace the physical device.
Device Error Error | DEVICE_ERROR_ERROR | 5 (Critical) | Check the device, and, if necessary, replace it.
Device Error Warning | DEVICE_ERROR_WARNING | 3 (Error) | Check the device, and, if necessary, replace it.
Device Error Notice | DEVICE_ERROR_NOTICE | 2 (Warning) | Check the device, and, if necessary, replace it.
Device Error Info | DEVICE_ERROR_INFO | 2 (Warning) | Check the device, and, if necessary, replace it.
Smart Temperature State Failed Now | SMART_TEMPERATURE_STATE_FAILED_NOW | 3 (Error) | Check the temperature in the server and at the data center. Check if a fan alert is raised in the node.
Smart End of Life State Failed Now | SMART_END_OF_LIFE_STATE_FAILED_NOW | 3 (Error) | Replace the disk.
Smart Aggregated State Failed Now | SMART_AGGREGATED_STATE_FAILED_NOW | 3 (Error) | Consider replacing the disk.
SDC Disconnected | SDC_DISCONNECTED | 3 (Error) | Verify that the SDC service is up and running and that the network is properly configured and connected.
SDC Max Count | SDC_MAX_COUNT | 3 (Error) | The maximum number of SDCs in the system (10242) has been reached.
Physical Drive Bad State | PHYSICAL_DRIVE_BAD_STATE | 5 (Critical) | Replace the physical disk.
Physical Drive Hot Spare | PHYSICAL_DRIVE_HOT_SPARE | 3 (Error) | Errors have been detected on a spare physical disk. Replace the physical disk.
Physical Drive Not in Use | PHYSICAL_DRIVE_NOT_IN_USE | 3 (Error) | Access the storage controller and configure the disk.

Physical Drive Endurance Used Above Threshold | PHYSICAL_DRIVE_ENDURANCE_USED_ABOVE_THRESHOLD | 5 (Critical) | Replace the SSD device.
Physical Drive Invalid Temperature | PHYSICAL_DRIVE_INVALID_TEMPERATURE | 5 (Critical) or 3 (Error) | Ensure that the server and its environment are properly cooled. Ensure that the internal chassis fans are working and there is adequate airflow.
Physical Drive Invalid Cache Policy | PHYSICAL_DRIVE_INVALID_PD_CACHE_POLICY | 3 (Error) | Disable the physical drive physical cache. This can cause DI issues in power cycle scenarios.
Device Should Use as DAS Cache But It Is Not | DEVICE_SHOULD_USE_AS_DAS_CACHE_BUT_IT_IS_NOT | 3 (Error) |
Logical Disk Invalid Read Ahead Policy | LOGICAL_DISK_INVALID_READ_AHEAD_POLICY | 3 (Error) | Access the storage controller and change the read policy of the logical disk to "Read-Ahead".
Logical Disk Invalid Write Back Policy | LOGICAL_DISK_INVALID_WRITE_BACK_POLICY | 3 (Error) | Check the battery status of the RAID controller: a fault in the battery impacts the "Write-Back" functionality.
Logical Disk Invalid Access Mode | LOGICAL_DISK_INVALID_ACCESS_MODE | 5 (Critical) | Access the storage controller and change the access mode of the logical disk to "Read-Write".
Logical Disk Invalid RAID Level | LOGICAL_DISK_INVALID_RAID_LEVEL | 3 (Error) | The RAID type set in the storage controller is incorrect. Do the following: • Access the storage controller. • Verify that the logical disk does not contain any data that is not backed up. • Destroy the logical disk. • Recreate the logical disk as RAID0.
Logical Disk Invalid Cache Policy | LOGICAL_DISK_INVALID_CACHE_POLICY | 3 (Error) | Access the storage controller and change the RAID0 caching policy for the logical disk to "DirectIO".

Logical Disk No Longer Cached | LOGICAL_DISK_NO_LONGER_CACHED | 3 (Error) | Disable the physical cache of the physical drive. This can cause DI issues in power cycle scenarios.
Battery Invalid State | BATTERY_INVALID_STATE | 5 (Critical) or 3 (Error) | The backup battery is not fully charged, but it recharges itself while the storage controller is powered on.
Battery Replacement Required | BATTERY_REPLACEMENT_REQUIRED | 5 (Critical) | The backup battery in the storage controller is possibly near the end of its lifecycle. Replace the battery.
Battery Invalid Temperature | BATTERY_INVALID_TEMPERATURE | 3 (Error) | • Check the temperature in the server and at the data center. • Check if a fan alert is raised in the node. • Check the battery and the RAID controller. • Replace faulty items.
Battery Not Present | BATTERY_NOT_PRESENT | 5 (Critical) | Install a backup battery in the storage controller.
Battery Invalid Pack Energy | BATTERY_INVALID_PACK_ENERGY | 3 (Error) | Replace the storage controller battery with one that is compatible with the controller model.
Battery Invalid Voltage | BATTERY_INVALID_VOLTAGE | 3 (Error) | Check the RAID controller battery: it possibly needs to be replaced.
Boot Drive Invalid State | BOOT_DRIVE_INVALID_STATE | 3 (Error) | The drive on which the ESXi® hypervisor is installed has an error. Replace the drive and reinstall the ESXi on the new drive.
Storage Controller Invalid State | STORAGE_CONTROLLER_INVALID_STATE | 5 (Critical) | The storage controller can be faulty and has to be replaced.
Storage Controller Invalid Temperature | STORAGE_CONTROLLER_INVALID_TEMPERATURE | 5 (Critical) or 3 (Error) | Ensure that the server and its environment are properly cooled and all the chassis fans are functional.
Storage Controller Cachecade Not Licensed | STORAGE_CONTROLLER_CACHECADE_NOT_LICENSED | 3 (Error) | Install a CacheCade license on the storage controller.

Storage Controller Not All Slots Full | STORAGE_CONTROLLER_NOT_ALL_SLOTS_FULL | 3 (Error) | Add more disks if needed. If a disk was removed during a Field Replacement Unit (FRU) operation, insert the new disk into the chassis.
Storage Controller BOSS Error State | STORAGE_CONTROLLER_BOSS_ERROR_STATE | 5 (Critical) or 3 (Error) | The BOSS storage controller is faulty and must be replaced.
Storage Controller BOSS Degraded State | STORAGE_CONTROLLER_BOSS_DEGRADED_STATE | 5 (Critical) | The BOSS storage controller may be faulty and must be replaced.
Socket Disabled | SOCKET_DISABLED | 3 (Error) | Enable the CPU socket in the server BIOS.
Socket Not All Cores Enabled | SOCKET_NOT_ALL_CORES_ENABLED | 3 (Error) | Change the physical CPU core configuration in the server BIOS.
Socket Speed Is Not Max Speed | SOCKET_SPEED_IS_NOT_MAX_SPEED | 3 (Error) | Set the CPU clock speed in the server BIOS to the maximum speed.
CPU Socket Invalid Temperature | CPU_SOCKET_INVALID_TEMPERATURE | 5 (Critical) or 3 (Error) | The CPU temperature of the server has exceeded the configured threshold. Make sure that the server is properly cooled and the CPU and internal chassis fans are active.
CPU Socket Invalid VR Temperature | CPU_SOCKET_INVALID_VR_TEMPERATURE | 5 (Critical) or 3 (Error) | Make sure that the server is properly cooled and the CPU and internal chassis fans are active.
CPU Socket Invalid VR Voltage | CPU_SOCKET_INVALID_VR_VOLTAGE | 5 (Critical) or 3 (Error) | • Verify that the power supply is functioning correctly. • Try to replace a port in the power distribution unit or supply an external power source to check it. • Replace the power cable. • Replace the power supply unit module.
CPU Socket Invalid Voltage | CPU_SOCKET_INVALID_VOLTAGE | 5 (Critical) | Check the chassis power supply.

RAM Invalid Temperature | RAM_INVALID_TEMPERATURE | 5 (Critical) or 3 (Error) | The temperature of one or more server RAM modules exceeds the configured threshold. Ensure that the server and its environment are properly cooled and all the chassis fans are functional.
RAM Invalid VR Temperature | RAM_INVALID_VR_TEMPERATURE | 5 (Critical) or 3 (Error) | Ensure that the server and its environment are properly cooled and all the chassis fans are functional.
RAM Invalid VR Voltage | RAM_INVALID_VR_VOLTAGE | 5 (Critical) or 3 (Error) | Ensure that the server and its environment are properly cooled and all the chassis fans are functional. If the system still issues an alert, the DIMM can be faulty and it probably needs to be replaced.
Node Invalid Temperature | NODE_INVALID_TEMPERATURE | 5 (Critical) or 3 (Error) | Ensure that the server and its environment are properly cooled and all the chassis fans are functional.
Node Invalid Voltage | NODE_INVALID_VOLTAGE | 5 (Critical) or 3 (Error) | • Verify that the power supply is functioning correctly. • Try to replace a port in the power distribution unit or supply an external power source to check it. • Replace the power cable. • Replace the power supply module.
Node with No SDC | NODE_WITH_NO_SDC | 3 (Error) | Consider installing an SDC on this node so that it can use VxSDS volumes.
Node Failed to Connect to vCenter | NODE_FAILED_TO_CONNECT_TO_VCENTER | 5 (Critical) | Ensure that the vCenter® configuration is correct and the node Mgmt port is routable to the vCenter IP address.
Xtremcache Invalid State | XTREMCACHE_INVALID_STATE | 3 (Error) | Check the read flash cache pool state. The cache device might be misconfigured. In this case, remove the cache device from the cache pool and add it back again.

Node Invalid CMOS Battery NODE_INVALID_CMOS_BATTERY 3 (Error) Check the CMOS battery of the node. It possibly needs to be replaced.
Node Fan Disabled NODE_FAN_DISABLED 2 (Warning) The fan is disabled or missing. Check the server chassis fans and ensure they are inserted and working properly.
Socket Cache Disabled SOCKET_CACHE_DISABLED 3 (Error) Enable the CPU cache in the server BIOS.
Socket Cache Size Not Max Size SOCKET_CACHE_SIZE_NOT_MAX_SIZE 3 (Error) Check the CPU cache in the server BIOS and set it to maximum size.
Fan Invalid Speed FAN_INVALID_SPEED 5 (Critical) or 3 (Error) Ensure that the server is properly cooled and that the chassis fans are functional.
PSU Invalid Input PSU_INVALID_INPUT 5 (Critical) or 3 (Error)
• Verify that the power supply is functioning correctly.
• Try to replace a port in the Power Distribution Unit or supply an external power source to check.
• Replace the power cable.
• Replace the Power Supply Unit module.
PSU Not Available PSU_NOT_AVAILABLE 3 (Error) Install a new Power Supply Unit, or if there is an existing PSU, verify that it is properly connected.
PSU Power Lost PSU_POWER_LOST 5 (Critical) Check the power outlet or power feed to the server.
ESRS Connectivity Error ESRS_CONNECTIVITY_ERROR 5 (Critical) Check the connectivity of the network to the ESRS Gateway.
ESRS Not Registered ESRS_NOT_REGISTERED 2 (Warning) The system is not registered to ESRS and does not send alerts to EMC® for monitoring. For more information, contact the next level of maintenance support.
ESRS Reached Capacity Limit ESRS_REACHED_CAPACITY_LIMIT 5 (Critical) ESRS accepts up to 200 alerts per 8 hours. The limit has been reached, and no ESRS messages are sent in the following 8 hours.
NIC Port Down NIC_PORT_DOWN 5 (Critical) The network link is down in the network adapter. Verify that the network adapter is enabled in the operating system and the network cable is connected properly at both ends.
Device Media Type Mismatch DEVICE_MEDIA_TYPE_MISMATCH 2 (Warning) Verify the device media type and reset it accordingly, or remove and re-add the disk if the Auto Detected Media Type does not match the actual device media type. If the alert returns, check the hardware.
Automatic Logs Collect Directory Above Threshold AUTOMATIC_LOGS_COLLECT_DIRECTORY_ABOVE_THRESHOLD 2 (Warning) Delete some files from the directory:
• Linux®: /opt/emc/scaleio/gateway/temp/scaleio-auto-collect-logs/
• Windows®: C:\Program Files\EMC\ScaleIO\Gateway\Temp\scaleio-auto-collect-logs\
Automatic Logs Collect Not Enough Disk Space AUTOMATIC_LOGS_COLLECT_NOT_ENOUGH_DISK_SPACE 5 (Critical) Delete some files from the disk.
Automatic Logs Collect Missing ESX Credentials AUTOMATIC_LOGS_COLLECT_MISSING_ESX_CREDENTIALS 2 (Warning) Configure ESX® credentials in lockbox or by using the IM option.
One SDC Disconnected from One SDS ONE_SDC_DISCONNECTED_FROM_ONE_SDS 5 (Critical) Check the network links between the affected SDS and SDC.
One SDC Disconnected from One SDS IP ONE_SDC_DISCONNECTED_FROM_ONE_SDS_IP 2 (Warning) Check the network links between the affected SDC and SDS IP addresses.
One SDC Disconnected from All SDS ONE_SDC_DISCONNECTED_FROM_ALL_SDS 5 (Critical) Check the network links between the affected SDC and all SDSs.
All SDC Disconnected from One SDS ALL_SDC_DISCONNECTED_FROM_ONE_SDS 5 (Critical) Check the network links between all SDCs and the affected SDS.
All SDC Disconnected from One SDS IP ALL_SDC_DISCONNECTED_FROM_ONE_SDS_IP 2 (Warning) Check the network links between all SDCs and the affected SDS IP address.
All SDC Disconnected from All SDS ALL_SDC_DISCONNECTED_FROM_ALL_SDS 5 (Critical) Check the network links between all SDCs and SDSs.
SDC Multiple Disconnections from SDS SDC_MULTIPLE_DISCONNECTIONS_FROM_SDS 5 (Critical) Check the network links between all SDCs and SDSs.
SDC Not Approved SDC_NOT_APPROVED 2 (Warning) Add the IP address for this SDC to the list of approved IP addresses.
SDC Does Not Have Approved IPs SDC_DOES_NOT_HAVE_APPROVED_IPS 2 (Warning) Update the list of approved IP addresses for this SDC.
Snapshot Policy Paused SNAPSHOT_POLICY_PAUSED 2 (Warning) The policy has been paused and can be resumed.
Snapshot Policy Last Auto Snapshot Failure in First Level SNAPSHOT_POLICY_LAST_AUTO_SNAPSHOT_FAILURE_IN_FIRST_LEVEL 2 (Warning)
Physical Drive Removed from OS PHYSICAL_DRIVE_REMOVED_FROM_OS 2 (Warning) The drive is ready to be removed from the node.
Storage Controller Invalid BOSS RAID Level STORAGE_CONTROLLER_INVALID_BOSS_RAID_LEVEL 5 (Critical) Check the BOSS setup configuration.
NVDIMM Interleave Enabled NVDIMM_INTERLEAVE_ENABLED 5 (Critical) Disable NVDIMM interleave in BIOS settings.
NVDIMM Persistent Memory Disabled NVDIMM_PERSISTENT_MEMORY_DISABLED 5 (Critical) Enable Persistent Memory in BIOS settings.
NVDIMM Battery Invalid State NVDIMM_BATTERY_INVALID_STATE 5 (Critical) or 3 (Error) Check the NVDIMM battery status: a fault in the battery impacts the NVDIMM persistent memory. Persistent memory does not function if the battery is faulty.
GPUcard Monitoring Error GPUCARD_MONITORING_ERROR 2 (Warning) Check that the GPU drivers are properly installed on the node.
DIMM in Error State DIMM_IN_ERROR_STATE 3 (Error) Replace the DIMM.
DIMM in Degraded State DIMM_IN_DEGRADED_STATE 2 (Warning) Replace the DIMM as soon as possible.
SDC Has Unapproved IP SDC_HAS_UNAPPROVED_IP 3 (Error) Add the IP address for this SDC to the list of approved IP addresses.
VTree Paused Migration Error VTREE_PAUSED_MIGRATION_ERROR 3 (Error)
VTree Paused Inactive Protection Domain VTREE_PAUSED_INACTIVE_PROTECTION_DOMAIN 2 (Warning) To resume the migration, activate the Protection Domain.
VTree Paused Degraded Capacity in Protection Domain VTREE_PAUSED_DEGRADED_CAPACITY_IN_PROTECTION_DOMAIN 3 (Error) To resume the migration, resolve the issue causing the degraded capacity.
VTree Paused No Space in Destination Storage Pool VTREE_PAUSED_NO_SPACE_IN_DESTINATION_STORAGE_POOL 3 (Error) To resume the migration, add capacity to the destination storage pool or remove a volume from the V-Tree migration.
VTree Paused Unavailable Capacity in Source Storage Pool VTREE_PAUSED_UNAVAILABLE_CAPACITY_IN_SOURCE_STORAGE_POOL 3 (Error) To resume the migration, resolve the issue causing the unavailable capacity in the source storage pool.
VTrees Paused Migration Error VTREES_PAUSED_MIGRATION_ERROR 3 (Error)
VTrees Paused Inactive Protection Domain VTREES_PAUSED_INACTIVE_PROTECTION_DOMAIN 2 (Warning) To resume the migration, activate the Protection Domain.
VTrees Paused Degraded Capacity in Protection Domain VTREES_PAUSED_DEGRADED_CAPACITY_IN_PROTECTION_DOMAIN 3 (Error) To resume the migration, resolve the issue causing the degraded capacity.
VTrees Paused No Space in Destination Storage Pool VTREES_PAUSED_NO_SPACE_IN_DESTINATION_STORAGE_POOL 3 (Error) To resume the migration, add capacity to the destination storage pool or remove a volume from the V-Tree migration.
VTrees Paused Unavailable Capacity in Source Storage Pool VTREES_PAUSED_UNAVAILABLE_CAPACITY_IN_SOURCE_STORAGE_POOL 3 (Error) To resume the migration, resolve the issue causing the unavailable capacity in the source storage pool.
Net Capacity High Usage NET_CAPACITY_HIGH_USAGE 3 (Error) Remove unneeded volumes and snapshots, if possible, or add devices to the storage pool.
Net Capacity Critical Usage NET_CAPACITY_CRITICAL_USAGE 5 (Critical) Remove unneeded volumes and snapshots, if possible, or add devices to the storage pool.
Net Capacity Full Usage NET_CAPACITY_FULL_USAGE 5 (Critical) Remove unneeded volumes and snapshots, if possible, or add devices to the storage pool.
User Data High Usage USER_DATA_HIGH_USAGE 3 (Error) Remove unneeded volumes and snapshots, if possible, or add devices to the storage pool.
User Data Critical Usage USER_DATA_CRITICAL_USAGE 5 (Critical) Remove unneeded volumes and snapshots, if possible, or add devices to the storage pool.
User Data Full Usage USER_DATA_FULL_USAGE 5 (Critical) Remove unneeded volumes and snapshots, if possible, or add devices to the storage pool.
VTAS Critical Usage Capacity Limit VTAS_CRITICAL_USAGE_CAPACITY_LIMIT 2 (Warning) Consider adding capacity to the storage pool or removing some V-Trees from the storage pool.
VTAS Full Usage Capacity Limit VTAS_FULL_USAGE_CAPACITY_LIMIT 2 (Warning) Consider adding capacity to the storage pool or removing some V-Trees from the storage pool.
VTAS Critical Usage Storage Pool Limit VTAS_CRITICAL_USAGE_STORAGE_POOL_LIMIT 2 (Warning) Consider removing some V-Trees from the storage pool.
VTAS Full Usage Storage Pool Limit VTAS_FULL_USAGE_STORAGE_POOL_LIMIT 2 (Warning) Consider removing some V-Trees from the storage pool.
System VTAS Critical Usage SYSTEM_VTAS_CRITICAL_USAGE 2 (Warning) Consider removing some V-Trees from the system.
System VTAS Full Usage SYSTEM_VTAS_FULL_USAGE 3 (Error) Consider removing some V-Trees from the system.
SP Device Media Type Mismatch SP_DEVICE_MEDIA_TYPE_MISMATCH 2 (Warning) Update the storage pool media type or replace the mismatched devices.
SP External Acceleration Without DAS Cache SP_EXTERNAL_ACCELERATION_WITHOUT_DAS_CACHE 2 (Warning) Update the storage pool cache settings.
SP DAS Cache Without External Acceleration SP_DAS_CACHE_WITHOUT_EXTERNAL_ACCELERATION 2 (Warning) Update the storage pool cache settings.
Protection Domain Has Unbalanced Data IPs PROTECTION_DOMAIN_HAS_UNBALANCED_DATA_IPS 2 (Warning) Verify consistent data path configurations for all host SDSs that support this Protection Domain.
Node SVM NIC Port Down NODE_SVM_NIC_PORT_DOWN 5 (Critical) Verify that the network interface is configured and enabled in the SVM operating system.
System Replace SVM Needed SYSTEM_REPLACE_SVM_NEEDED 2 (Warning) The VMs on system nodes have an outdated OS version. Perform SVM replacement.
Node Replace SVM Needed NODE_REPLACE_SVM_NEEDED 2 (Warning) This node has not yet undergone the SVM replacement procedure.
Inconsistent Rfcache Configuration INCONSISTENT_RFCACHE_CONFIGURATION 2 (Warning) One or more of the storage pool devices belong to an SDS that does not have Read Flash Cache devices. Make sure that all SDSs with devices belonging to this storage pool have Read Flash Cache devices.
Log Collection in Progress LOG_COLLECTION_IN_PROGRESS 2 (Warning) Monitor the log collection process. Do not start another log collection before this one is complete.
Device Error Unrecoverable DEVICE_ERROR_UNRECOVERABLE 5 (Critical)
VTree Paused SDS in Maintenance in Protection Domain VTREE_PAUSED_SDS_IN_MAINTENANCE_IN_PROTECTION_DOMAIN 2 (Warning) Exit the relevant SDSs from Maintenance Mode.
VTrees Paused SDS in Maintenance in Protection Domain VTREES_PAUSED_SDS_IN_MAINTENANCE_IN_PROTECTION_DOMAIN 2 (Warning) Exit the relevant SDSs from Maintenance Mode.
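
The lookup that Table 2 supports can be sketched in a few lines of Python. This is only an illustration, not part of CEE or VxSDS: the names TABLE_2_EXCERPT, find_alert, and describe are hypothetical, only a handful of rows are copied from the table, and the sketch assumes that the Additional Text field of the alarm contains the VxSDS REST API alert identifier verbatim.

```python
from typing import NamedTuple, Optional, Tuple

# VxSDS severity codes as listed in this document.
SEVERITY_NAMES = {2: "Warning", 3: "Error", 5: "Critical"}


class AlarmEntry(NamedTuple):
    severities: Tuple[int, ...]  # one or more VxSDS severity codes
    action: str                  # recommended action copied from Table 2


# Hypothetical excerpt: a few rows copied from Table 2, not the full table.
TABLE_2_EXCERPT = {
    "FAN_INVALID_SPEED": AlarmEntry(
        (5, 3),
        "Ensure that the server is properly cooled and that the chassis "
        "fans are functional."),
    "PSU_POWER_LOST": AlarmEntry(
        (5,), "Check the power outlet or power feed to the server."),
    "ONE_SDC_DISCONNECTED_FROM_ONE_SDS": AlarmEntry(
        (5,), "Check the network links between the affected SDS and SDC."),
    "ONE_SDC_DISCONNECTED_FROM_ONE_SDS_IP": AlarmEntry(
        (2,),
        "Check the network links between the affected SDC and SDS IP "
        "addresses."),
    # Table 2 lists no recommended action for this alert.
    "DEVICE_ERROR_UNRECOVERABLE": AlarmEntry((5,), ""),
}


def find_alert(additional_text: str) -> Optional[str]:
    """Return the known alert identifier contained in the text, if any.

    Assumption: the identifier appears verbatim in the Additional Text.
    Longer identifiers are tried first so that ..._ONE_SDS does not
    shadow ..._ONE_SDS_IP.
    """
    for name in sorted(TABLE_2_EXCERPT, key=len, reverse=True):
        if name in additional_text:
            return name
    return None


def describe(additional_text: str) -> str:
    """Build a one-line summary: identifier, severity, recommended action."""
    name = find_alert(additional_text)
    if name is None:
        return "Unknown alert: look it up in Table 2."
    entry = TABLE_2_EXCERPT[name]
    levels = " or ".join(f"{s} ({SEVERITY_NAMES[s]})" for s in entry.severities)
    action = entry.action or "No recommended action is listed."
    return f"{name}: severity {levels}. {action}"


if __name__ == "__main__":
    print(describe("VxSDS alert PSU_POWER_LOST raised"))
```

Trying the longest identifiers first matters because several identifiers in Table 2 are prefixes of others.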
