
Statistics, Monitoring, and Alerting

ClusterVisor makes it easy to keep track of cluster health, which is essential to the cluster's proper operation.

Getting stats from nodes

Stats are a core feature of ClusterVisor; the term refers to any piece of changing data on the devices in the cluster.  One of the jobs of cv-clientd (ClusterVisor's client daemon that runs on each node) is to gather statistics from the node and send them to your ClusterVisor appliance or server.  Stat collection is driven by a plugin architecture, with stats collected from the following domains:

  • CPU – CPU usage, load, frequency
  • Disks – Disk usage, read/write I/O rates
  • Firmware – BIOS/IPMI firmware versions
  • InfiniBand – InfiniBand and Omni-Path performance and error counters, and I/O throughput
  • IPMI temperature – All the temperature sensors throughout the node
  • IPMI fans – RPMs of each fan in the system
  • IPMI voltages – Voltage readings from the power supplies
  • MD (software RAID) – Status of any software RAID arrays on the system
  • MegaRAID – Status of any hardware-based Broadcom/LSI RAID arrays on the system
  • Memory – Memory and swap usage
  • Network – Statistics on each Ethernet interface
  • NFS – Status of any NFS mounts
  • NTP – Details about the system time and whether it is currently synchronized with your time server
  • NVIDIA – Detailed usage, power, and memory statistics for each GPU in the system
  • Power – Node-level power consumption
  • System – Kernel version, process counts, uptime
  • ZFS – Status of any ZFS zpools on the system

Most nodes in your cluster will have more than 300 unique stats collected from them.
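
To make the idea concrete, a stat is simply a named, timestamped value read from the node.  The Python sketch below is not ClusterVisor code; it only illustrates the kind of values a collector plugin might read from /proc for the CPU and memory domains above (the stat names shown are made up for the example, and cv-clientd's real plugin API and transport are not shown).

  import os
  import time

  def collect_cpu_and_memory():
      # Illustrative only: read a few of the CPU and memory values listed
      # above straight from the kernel. The stat names below are placeholders.
      load1, load5, load15 = os.getloadavg()
      meminfo = {}
      with open("/proc/meminfo") as f:
          for line in f:
              key, value = line.split(":", 1)
              meminfo[key] = int(value.strip().split()[0])  # values in kB
      return {
          "timestamp": int(time.time()),
          "cpu.load.1min": load1,
          "cpu.load.5min": load5,
          "memory.total_kb": meminfo["MemTotal"],
          "memory.available_kb": meminfo["MemAvailable"],
      }

  print(collect_cpu_and_memory())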

Gathering stats from devices

For devices that aren’t able to run the cv-clientd daemon, statistics can be gathered via SNMP.  This is most commonly used to gather information from network infrastructure devices.  ClusterVisor currently supports the following devices:

  • Power distribution units (PDUs):  APC, Geist/Vertiv, ServerTech
  • Uninterruptible power supplies (UPSs): APC
  • Air conditioners: APC in-row
  • Switches: Netgear

The list is always growing, and support for new devices can be easily added. Contact us for details.
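
For reference, SNMP polling comes down to reading numeric OIDs from the device.  The Python sketch below uses the standard net-snmp snmpget tool to read a single value from a PDU; the hostname, community string, and OID are placeholders, not ClusterVisor's actual configuration.

  import subprocess

  def snmp_get(host, oid, community="public"):
      # Read one OID from an SNMP-capable device using net-snmp's snmpget.
      # The host, community string, and OID used below are placeholders.
      result = subprocess.run(
          ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
          capture_output=True, text=True, check=True,
      )
      return result.stdout.strip()

  # Example: load reading from a hypothetical APC PDU (OID is a placeholder).
  print(snmp_get("pdu-rack1", "1.3.6.1.4.1.318.1.1.12.2.3.1.1.2.1"))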

Custom stats 

ClusterVisor already collects the most common stats, but if a particular stat you are interested in is not available, you can add it!  Simple command line tools are available to inject custom stats, or you can write your own stat plugin.
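
As an example of the kind of site-specific stat you might add, the sketch below computes a value ClusterVisor does not collect out of the box.  The metric name and path are made up, and the actual ClusterVisor injection command and plugin interface are not shown here.

  import os

  def scratch_file_count(path="/scratch"):
      # A hypothetical site-specific metric: number of entries in a local
      # scratch directory. The path and metric name are placeholders.
      return len(os.listdir(path))

  # Print in a simple key=value form that a wrapper script could feed to
  # ClusterVisor's stat injection tooling.
  print(f"scratch.file_count={scratch_file_count()}")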

Stat history and retention

ClusterVisor collects stats every 30 seconds. With a large number of nodes, storing every raw sample indefinitely would quickly consume an unmanageable amount of disk space. To keep this from becoming an issue, ClusterVisor uses a time-series storage approach that rolls older data up into coarser intervals.  The default retention periods are:

  • 30 second interval – keep for 2 weeks
  • 5 minute interval – keep for 2 months
  • 15 minute interval – keep for 3 months
  • 1 hour interval – keep for 5 years
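
The idea behind those tiers is a straightforward roll-up: raw 30-second samples are aggregated into coarser buckets so that long histories stay small.  ClusterVisor's storage engine is not shown here; the Python sketch below only illustrates the averaging step.

  from statistics import mean

  def downsample(samples, interval_s):
      # Average raw (timestamp, value) samples into fixed-width buckets,
      # e.g. 300 seconds to build the 5-minute tier from 30-second data.
      buckets = {}
      for ts, value in samples:
          buckets.setdefault(ts - ts % interval_s, []).append(value)
      return sorted((ts, mean(vals)) for ts, vals in buckets.items())

  # One hour of simulated 30-second load samples rolled up into 5-minute points.
  raw = [(t, 0.5 + (t % 600) / 1000) for t in range(0, 3600, 30)]
  print(downsample(raw, 300))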

Viewing your stats 

Stats can be viewed through the ClusterVisor web interface, placed on any number of customizable dashboards, or queried ad hoc.  They can also be overlaid as a heatmap on rack diagrams to help pinpoint temperature hotspots and other environmental problems in the datacenter.

Stats can also be correlated with SLURM jobs to see what was happening on a node during a job run, helping administrators answer questions like "Why did my job run slowly?"
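
Under the hood, that correlation only needs the job's time window and node list, which SLURM already tracks.  The sketch below shows how those could be looked up with SLURM's own sacct command (the job ID is a placeholder); the stat query for that window would then go through ClusterVisor and is not shown.

  import subprocess

  def job_window(jobid):
      # Ask SLURM's accounting database for the job's start time, end time,
      # and node list so the stat history for exactly that window can be
      # examined. The job ID passed in below is a placeholder.
      result = subprocess.run(
          ["sacct", "-j", str(jobid), "-X", "-n", "-P",
           "--format=Start,End,NodeList"],
          capture_output=True, text=True, check=True,
      )
      start, end, nodelist = result.stdout.strip().split("|")
      return start, end, nodelist

  print(job_window(123456))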

Monitoring rules, alerts, and actions

Any stat can be used as part of a monitoring rule.  The ClusterVisor monitoring rules engine lets administrators create rules for when anything on the cluster is not behaving as it should.  Some examples of rules that could be created:

  • Temperature of the node is above a threshold
  • The firmware versions across a group of nodes are not all the same
  • An InfiniBand interface is down or running at the wrong speed
  • A RAID array has lost a drive
  • A ZFS zpool is degraded
  • And many more

Monitoring rules can fire any number of actions on failure.  An action can be as simple as an email to an administrator, or any script you choose.  With the script engine, you can easily shut down nodes, drain them in SLURM, clean up files, and more.
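
As one concrete example of the kind of action script described above, the sketch below drains a node with SLURM's scontrol command so no new jobs land on it.  How ClusterVisor passes the failing node's name to the script is not shown, so the node name and reason here are placeholders.

  import subprocess

  def drain_node(node, reason):
      # Take the node out of scheduling with SLURM's scontrol; running jobs
      # finish but no new jobs are started on the node.
      subprocess.run(
          ["scontrol", "update", f"NodeName={node}", "State=DRAIN",
           f"Reason={reason}"],
          check=True,
      )

  drain_node("node042", "ClusterVisor rule failure: temperature over threshold")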

A full history of rule failures, with a timeline for each, is available.  When a rule fails, a point-in-time snapshot of all the stats is also saved, allowing you to easily go back in time and see exactly what was happening on the node when the failure occurred.
