Craig Ulmer

The Glinda Cluster

2023-10-04 pub hpc smartnics

During the pandemic Sandia procured a new 126-node HPDA cluster named Glinda. While it was a nightmare working through the supply chain issues caused by the global shutdown, the hardware has been quite good: each compute node features a 32-core Zen3 processor, 512GB of RAM, a BlueField-2 InfiniBand SmartNIC, and an Ampere A100 GPU. People like that they can grab a few nodes to run some deep learning experiments before heading over to the DGX boxes for full runs. Users asked for a publication they could reference, so we wrote the tech report below with all the details. The report covers background info on the types of platforms at the labs, details about the hardware and data center, power measurements, and practical installation and operational info for the A100 and BlueField-2.
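In that operational spirit, here is a minimal Python sketch of the kind of sanity check someone might run after landing on a node, using the nvidia-ml-py bindings (imported as pynvml) to confirm the A100 and its device memory are visible. The package choice and the device index are assumptions for illustration, not something prescribed by the report.

    # Minimal sketch: confirm a node's GPU before kicking off a run.
    # Assumes the nvidia-ml-py package is installed and the NVIDIA
    # driver is loaded; device index 0 is assumed to be the A100.
    from pynvml import (
        nvmlInit,
        nvmlShutdown,
        nvmlDeviceGetHandleByIndex,
        nvmlDeviceGetName,
        nvmlDeviceGetMemoryInfo,
    )

    nvmlInit()
    try:
        handle = nvmlDeviceGetHandleByIndex(0)
        name = nvmlDeviceGetName(handle)       # bytes on older pynvml releases
        if isinstance(name, bytes):
            name = name.decode()
        mem = nvmlDeviceGetMemoryInfo(handle)  # .total is reported in bytes
        print(f"{name}: {mem.total / 2**30:.0f} GiB of device memory")
    finally:
        nvmlShutdown()

On a Glinda node this should report an A100 with roughly 40 GiB of HBM2; anything else means the job landed somewhere unexpected.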


The cluster's name is a reference to Glinda from the Wizard of Oz. Glinda's Book of Records really resonated with us: she uses it to record all the important things happening throughout the Land of Oz.



Abstract

Sandia National Laboratories relies on high-performance data analytics (HPDA) platforms to solve data-intensive problems in a variety of national security mission spaces. In a 2021 survey of HPDA users at Sandia, data scientists confirmed that their workloads had largely shifted from CPUs to GPUs and indicated that there was a growing need for a broader range of GPU capabilities at Sandia. While the multi-GPU DGX systems that Sandia employs are essential for large-scale training runs, researchers noted that there was also a need for a pool of single-GPU compute nodes where users could iterate on smaller-scale problems and refine their algorithms.

In response to this need, Sandia procured a new 126-node HPDA research cluster named Glinda at the end of FY2021. A Glinda compute node features a single-socket, 32-core, AMD Zen3 processor with 512GB of DRAM and an NVIDIA A100 GPU with 40GB of HBM2 memory. Nodes connect to a 100Gb/s InfiniBand fabric through an NVIDIA BlueField-2 VPI SmartNIC. The SmartNIC includes eight Arm A72 processor cores and 16GB of DRAM that network researchers can use to offload HPDA services. The Glinda cluster is adjacent to the existing Kahuna HPDA cluster and shares its storage and administrative resources.

This report summarizes our experiences in procuring, installing, and maintaining the Glinda cluster during the first two years of its service. The intent of this document is twofold. First, we aim to help other system architects make better-informed decisions about deploying HPDA systems with GPUs and SmartNICs. This report lists challenges we had to overcome to bring the system to a working state and includes practical information about incorporating SmartNICs into the computing environment. Second, we provide detailed platform information about Glinda's architecture to help Glinda's users make better use of the hardware.

Publication

  • SAND Report: Craig Ulmer, Jerry Friesen, and Joseph Kenny, "Glinda: An HPDA Cluster with Ampere A100 GPUs and BlueField-2 VPI SmartNICs," SAND2023-10451, October 2023.