HPC cluster utilities presented at DellXL conference

April 24, 2015

Gary M Jung

Berkeley Research Computing staff members Yong Qin and Michael Jennings gave talks highlighting their respective software tools, wwibcheck and NHC, at the DellXL High Performance Computing (HPC) conference this week, April 21-23, 2015, in Boulder, Colorado (agenda).

Most HPC systems rely on a high-performance, low-latency interconnect network to connect compute nodes in a way that supports tightly coupled computations, in which the nodes must exchange large amounts of data as part of the computation. Yong’s talk focused on troubleshooting failures in HPC InfiniBand interconnects using his software tool, wwibcheck, which helps system administrators isolate and identify InfiniBand equipment failures or performance problems that affect the execution time of compute jobs.

Michael Jennings gave a talk on his Warewulf Node Health Check (NHC) utility. NHC runs in conjunction with the system’s job scheduler, carrying out a pre-check to detect potential problems with compute nodes before a job starts and optionally marking bad nodes as “offline.” This highly configurable utility works with popular job schedulers, such as SchedMD’s Slurm, as well as Adaptive Computing’s Moab scheduler and TORQUE resource manager.

Yong and Michael are part of BRC’s HPC team, developed through a partnership with Lawrence Berkeley National Laboratory (LBNL), that supports the new SAVIO Linux compute cluster for UC Berkeley faculty.