Case Study: Caltech Implements New ClusterVisor 1.0 to More Effectively Manage HPC Cluster and Workloads

Customer: Caltech

Caltech was already using the original version of ClusterVisor when they were offered an upgrade to ClusterVisor 1.0 while their new cluster was being installed in January 2023.

“The original version of ClusterVisor was already user-friendly. With the new version, what’s great is that you have the dashboards,” explains Ivan Maliyov, a postdoctoral scholar in Professor Marco Bernardi’s group working on theory and computational methods to study the behavior of electrons in materials.

Maliyov says the group is able to configure the dashboards in many ways. “We can create multiple tabs, and for each tab you can create multiple charts or tables or other data to report on the cluster. When I connect to ClusterVisor, at a glance I can see the status of the cluster, that everything works or maybe something doesn’t work. It’s great for me to spend only three seconds and see how we’re doing today.”

Another important feature of ClusterVisor 1.0 is the ability to easily create a rack diagram of the entire cluster and incorporate statistics to give a visual report on the status of the cluster.

“We can see the temperature of the nodes just in case the cooling of one node is going wrong,” Maliyov said. “We can see that.”

The team at Caltech is using ClusterVisor to test a special software development project. “Our software that we are developing – RAM is usually the bottleneck,” Maliyov said. “Memory is crucial with us. Nobody has this great tool. We have been sharing our excitement.”

As Maliyov explains it, the ability to overcome this RAM bottleneck may be the most important contribution ClusterVisor has made to the work being done by Professor Bernardi’s team at Caltech.

“The one feature I really wanted to mention and say how it’s really important to us is for a particular job that we are running on the cluster, we are able to track the RAM memory as a function of time,” Maliyov said.

“This is wonderful, not only to see how our software runs on the cluster, but also to test our software.”

“This is great. In order to have this level of information, you would usually need to compile our software with the compilers,” Maliyov said.

“ClusterVisor shows this right away. We can also change the timestamp with which we have the updates on the RAM memory every 10 seconds or every minute and that’s amazing for us.”

“For the software that we’re developing in the group, the HPC bottleneck is the RAM memory usage. So, we have to use more nodes not only to run our software faster but also just to fulfill our RAM requirements so, memory is crucial for us,” he said.

“That’s why the 512GB per node and controlling this memory is very important to us. We are talking to our peers in the U.S. and Europe and nobody among them has these tools. We share our excitement with other people and say we have this feature. So, that’s great.”

