
Statistics, Monitoring, and Alerting

ClusterVisor makes it easy to keep track of cluster health, which is essential to the cluster's proper operation.

Getting stats from nodes

Stats are a core feature of ClusterVisor; the term refers to any piece of changing data on the devices in the cluster.  One of the jobs of cv-clientd (ClusterVisor's client daemon that runs on each node) is to gather statistics from the node and send them to your ClusterVisor appliance or server.  Stat collection is driven by a plugin architecture, with stats collected from the following domains:

  • CPU – CPU usage, load, frequency
  • Disks – Disk usage, read/write I/O rates
  • Firmware – BIOS/IPMI firmware versions
  • InfiniBand – InfiniBand and Omni-Path performance and error counters, and I/O throughput
  • IPMI temperature – All the temperature sensors throughout the node
  • IPMI fans – RPMs of each fan in the system
  • IPMI voltages – Voltage readings from the power supplies
  • MD (software RAID) – Status of any software RAID arrays on the system
  • MegaRAID – Status of any hardware-based Broadcom/LSI RAID arrays on the system
  • Memory – Memory and swap usage
  • Network – Statistics on each Ethernet interface
  • NFS – Status of any NFS mounts
  • NTP – Details about the system time and whether it is currently synchronized with your time server
  • NVIDIA – Detailed usage, power, and memory statistics for each GPU in the system
  • Power – Node-level power consumption
  • System – Kernel version, process counts, uptime
  • ZFS – Status of any ZFS zpools on the system

Most nodes in your cluster will have more than 300 unique stats collected from them.
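
To make the idea concrete, a stat is simply a named, timestamped value read from the node.  The Python sketch below is not ClusterVisor code; it only illustrates the kind of values a collector plugin might read from /proc for the CPU and memory domains above (the stat names shown are made up for the example, and cv-clientd's real plugin API and transport are not shown).

  import os
  import time

  def collect_cpu_and_memory():
      # Illustrative only: read a few of the CPU and memory values listed
      # above straight from the kernel. The stat names below are placeholders.
      load1, load5, load15 = os.getloadavg()
      meminfo = {}
      with open("/proc/meminfo") as f:
          for line in f:
              key, value = line.split(":", 1)
              meminfo[key] = int(value.strip().split()[0])  # values in kB
      return {
          "timestamp": int(time.time()),
          "cpu.load.1min": load1,
          "cpu.load.5min": load5,
          "memory.total_kb": meminfo["MemTotal"],
          "memory.available_kb": meminfo["MemAvailable"],
      }

  print(collect_cpu_and_memory())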

Gathering stats from devices

For devices that aren’t able to run the cv-clientd daemon, statistics can be gathered via SNMP.  This is most commonly used to gather information from network infrastructure devices.  ClusterVisor currently supports the following devices:

  • Power distribution units (PDUs):  APC, Geist/Vertiv, ServerTech
  • Uninterruptible power supplies (UPSs): APC
  • Air conditioners: APC in-row
  • Switches: Netgear

The list is always growing, and support for new devices can be easily added. Contact us for details.
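
For reference, SNMP polling comes down to reading numeric OIDs from the device.  The Python sketch below uses the standard net-snmp snmpget tool to read a single value from a PDU; the hostname, community string, and OID are placeholders, not ClusterVisor's actual configuration.

  import subprocess

  def snmp_get(host, oid, community="public"):
      # Read one OID from an SNMP-capable device using net-snmp's snmpget.
      # The host, community string, and OID used below are placeholders.
      result = subprocess.run(
          ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
          capture_output=True, text=True, check=True,
      )
      return result.stdout.strip()

  # Example: load reading from a hypothetical APC PDU (OID is a placeholder).
  print(snmp_get("pdu-rack1", "1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1"))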

Custom stats 

ClusterVisor already collects the most common stats, but if a particular stat you are interested in is not available, you can add it!  Simple command line tools are available to inject custom stats, or you can write your own stat plugin.
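
As an example of the kind of site-specific stat you might add, the sketch below computes a value ClusterVisor does not collect out of the box.  The metric name and path are made up, and the actual ClusterVisor injection command and plugin interface are not shown here.

  import os

  def scratch_file_count(path="/scratch"):
      # A hypothetical site-specific metric: number of entries in a local
      # scratch directory. The path and metric name are placeholders.
      return len(os.listdir(path))

  # Print in a simple key=value form that a wrapper script could feed to
  # ClusterVisor's stat injection tooling.
  print(f"scratch.file_count={scratch_file_count()}")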

Stat history and retention

ClusterVisor collects stats every 30 seconds. With a large number of nodes, storing every raw sample indefinitely would quickly consume an unmanageable amount of disk space. To keep this from becoming an issue, ClusterVisor uses a time-series storage approach that rolls older data up into coarser intervals.  The default retention periods are:

  • 30 second interval – keep for 2 weeks
  • 5 minute interval – keep for 2 months
  • 15 minute interval – keep for 3 months
  • 1 hour interval – keep for 5 years
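
The idea behind those tiers is a straightforward roll-up: raw 30-second samples are aggregated into coarser buckets so that long histories stay small.  ClusterVisor's storage engine is not shown here; the Python sketch below only illustrates the averaging step.

  from statistics import mean

  def downsample(samples, interval_s):
      # Average raw (timestamp, value) samples into fixed-width buckets,
      # e.g. 300 seconds to build the 5-minute tier from 30-second data.
      buckets = {}
      for ts, value in samples:
          buckets.setdefault(ts - ts % interval_s, []).append(value)
      return sorted((ts, mean(vals)) for ts, vals in buckets.items())

  # One hour of simulated 30-second load samples rolled up into 5-minute points.
  raw = [(t, 0.5 + (t % 600) / 1000) for t in range(0, 3600, 30)]
  print(downsample(raw, 300))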

Viewing your stats 

Stats can be viewed through the ClusterVisor web interface, placed on any number of customizable dashboards, or queried ad hoc.  They can also be overlaid as a heatmap on rack diagrams to help pinpoint temperature hotspots and other environmental problems in the datacenter.

Stats can also be correlated with SLURM jobs to see what was happening on a node during a job run, helping administrators answer questions like "Why did my job run slowly?"
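
Under the hood, that correlation only needs the job's time window and node list, which SLURM already tracks.  The sketch below shows how those could be looked up with SLURM's own sacct command (the job ID is a placeholder); the stat query for that window would then go through ClusterVisor and is not shown.

  import subprocess

  def job_window(jobid):
      # Ask SLURM's accounting database for the job's start time, end time,
      # and node list so the stat history for exactly that window can be
      # examined. The job ID passed in below is a placeholder.
      result = subprocess.run(
          ["sacct", "-j", str(jobid), "-X", "-n", "-P",
           "--format=Start,End,NodeList"],
          capture_output=True, text=True, check=True,
      )
      start, end, nodelist = result.stdout.strip().split("|")
      return start, end, nodelist

  print(job_window(123456))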

Monitoring rules, alerts, and actions

Any stat can be used as part of a monitoring rule.  The ClusterVisor monitoring rules engine lets administrators create rules for when anything on the cluster is not behaving as it should.  Some examples of rules that could be created:

  • Temperature of the node is above a threshold
  • The firmware versions across a group of nodes are not all the same
  • An InfiniBand interface is down or running at the wrong speed
  • A RAID array has lost a drive
  • A ZFS zpool is degraded
  • And many more

Monitoring rules can fire any number of actions on failure.  An action can be as simple as an email to an administrator, or any script you choose.  With the script engine, you can easily shut down nodes, drain them in SLURM, clean up files, and more.
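
As one concrete example of the kind of action script described above, the sketch below drains a node with SLURM's scontrol command so no new jobs land on it.  How ClusterVisor passes the failing node's name to the script is not shown, so the node name and reason here are placeholders.

  import subprocess

  def drain_node(node, reason):
      # Take the node out of scheduling with SLURM's scontrol; running jobs
      # finish but no new jobs are started on the node.
      subprocess.run(
          ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
           f"Reason={reason}"],
          check=True,
      )

  drain_node("node042", "ClusterVisor rule failure: temperature over threshold")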

A full history of rule failures, with a timeline for each, is available.  When a rule fails, a point-in-time snapshot of all the stats is also saved, allowing you to easily go back in time and see exactly what was happening on the node when the failure occurred.
