SmartNICs Project Final Report

2024-04-01 pub smartnics hpc

In December we finished our three-year, ASCR-funded "Offloading Data Management Services to SmartNICs" project. One of our deliverables was a final report that consolidates what we learned into a single document. This 144-page (!) report includes sections from our proposal and previous papers, and examines using SmartNICs from multiple perspectives.

There are three new topics in this report that we haven't covered before:

Apache Arrow vs Kokkos: Previously we've talked about Arrow as a way to write code that scales to multiple cores. In Chapter 4 we port three types of simple analytics to both Arrow and Kokkos and examine how well they scale on Host and SmartNIC processors. Arrow was tedious to write, but was competitive! The code listings are included in Appendix A.
Injection Optimizations: Host-to-NIC transfer performance has always been a problem due to memory addressing problems. In Chapter 8 we cover some optimizations that enable us to use the SmartNIC to gather data from the host's native buffers so that it can be serialized during injection.
Job-local Storage with SmartNICs: As a means of addressing performance issues with using a shared filesystem in a platform, we investigated using SmartNICs to host a private BeeOND filesystem on a job's SmartNICs. Given the limited flash memory of the SmartNIC, we borrowed the host's disk for this work via NVMeoF. Chapter 9 talks about the challenges of getting NVMeoF to work with offloading and covers some early jitter experiments.

Abstract

Modern workflows for high-performance computing (HPC) platforms rely on data management and storage services (DMSSes) to migrate data between simulations, analysis tools, and storage systems. While DMSSes help researchers assemble complex pipelines from disjoint tools, they currently consume resources that ultimately increase the workflow's overall node count. In FY21-23 the DOE ASCR project "Offloading Data Management Services to SmartNICs" explored a new architectural option for addressing this problem: hosting services in programmable network interface cards (SmartNICs). This report summarizes our work in characterizing the NVIDIA BlueField-2 SmartNIC and defining a general environment for hosting services in compute-node SmartNICs that leverages Apache Arrow for data processing and Sandia's Faodel for communication. We discuss five different aspects of SmartNIC use. Performance experiments with Sandia's Glinda cluster indicate that while SmartNIC processors are an order of magnitude slower than servers, they offer an economical and power efficient alternative for hosting services.

Publication

SAND Report Craig Ulmer, Jianshen Liu, Carlos Maltzahn, Aldrin Montana, Matthew L. Curry, Scott Levy, Whit Schonbein, and John Shawger, "Offloading Data Management Services to SmartNICs: Project Summary". SAND2024-03873, April 2024.

Presentations

HPC Initiatives Slides: Presentation I gave at the SNL HPC Initiatives seminar in January.
SRU Slides: Presentation I gave to an undergraduate Computer Engineering class at Slippery Rock University in October.

The Glinda Cluster

2023-10-04 pub hpc smartnics

During the pandemic Sandia procured a new 126-node HPDA cluster named Glinda. While it was a nightmare working through all the supply chain issues with the global shutdown, the hardware has been quite good: compute nodes feature a 32-core Zen3 processor, 512GB of RAM, a BlueField-2 InfiniBand SmartNIC, and an Ampere A100 GPU. People like that they can grab a few nodes to do some deep learning experiments before heading over to the DGX boxes for full runs. We received some requests for a publication that they can reference, so we wrote the below tech report with all the details. The report covers background info for the types of platforms at the labs, details about the hardware and data center, power measurements, and practical installation and operational info for the A100 and BlueField-2.

The Glinda name for this cluster is a reference to the Wizard of Oz. Glinda's Book of Records really resonated with us, as she uses it to record all the important things that are happening throughout the Land of Oz.

Abstract

Sandia National Laboratories relies on high-performance data analytics (HPDA) platforms to solve data-intensive problems in a variety of national security mission spaces. In a 2021 survey of HPDA users at Sandia, data scientists confirmed that their workloads had largely shifted from CPUs to GPUs and indicated that there was a growing need for a broader range of GPU capabilities at Sandia. While the multi-GPU DGX systems that Sandia employs are essential for large-scale training runs, researchers noted that there was also a need for a pool of single-GPU compute nodes where users could iterate on smaller-scale problems and refine their algorithms.

In response to this need, Sandia procured a new 126-node HPDA research cluster named Glinda at the end of FY2021. A Glinda compute node features a single-socket, 32-core, AMD Zen3 processor with 512GB of DRAM and an NVIDIA A100 GPU with 40GB of HBM2 memory. Nodes connect to a 100Gb/s InfiniBand fabric through an NVIDIA BlueField-2 VPI SmartNIC. The SmartNIC includes eight Arm A72 processor cores and 16GB of DRAM that network researchers can use to offload HPDA services. The Glinda cluster is adjacent to the existing Kahuna HPDA cluster and shares its storage and administrative resources.

This report summarizes our experiences in procuring, installing, and maintaining the Glinda cluster during the first two years of its service. The intent of this document is twofold. First, we aim to help other system architects make better-informed decisions about deploying HPDA systems with GPUs and SmartNICs. This report lists challenges we had to overcome to bring the system to a working state and includes practical information about incorporating SmartNICs into the computing environment. Second, we provide detailed platform information about Glinda's architecture to help Glinda's users make better use of the hardware.

Publication

SAND Report Craig Ulmer, Jerry Friesen, and Joseph Kenny, "Glinda: An HPDA Cluster with Ampere A100 GPUs and BlueField-2 VPI SmartNICs". SAND2023-10451, October 2023.

Extracting Ground Truth from Surveillance Video

2023-09-28 pub data webcam seismic

In the early days of the ADAPD project, I served as the "data czar" and helped acquire and organize a large amount of data about a series of underground explosive tests that took place in Nevada during the DAG phase of the Source Physics Experiments venture. While we had high-resolution seismic data from multiple sensors, we did not have much insight into what was going on at the site on any particular day. Fortunately, the DAG researchers had maintained a date-stamped webcam for the worksite that captured a distant view of the site every 10 seconds. After a lot of gritty data engineering work, we were able to extract a good bit of ground truth from the video to help explain what was going on in the seismic data. We wrote the below report in FY20 to cover all the details, but had to wait for a long embargo period to expire before we could release it publicly.

There are a few interesting things in this report:

To verify the webcam's clock was correct, we plugged the camera's coordinates into Google Earth and compared sunrises.
I wrote an OpenCV script that did some ad hoc image processing to extract the date stamp from the bottom of each image.
We manually boxed over 2,200 image samples for 18 types of vehicles.
I used file timestamps to measure how my own annotation rate improved from 5 objects/minute at the start to as many as 17 objects/minute once I had updated the labeling tool.
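As a rough illustration of the timestamp idea in the last item, here is a minimal Python sketch. It is not the script from the report. It assumes (hypothetically) that there is one saved timestamp per annotated object, that idle gaps longer than five minutes separate labeling sessions, and it uses synthetic timestamps rather than real data. The function names are made up for this example.

```python
from datetime import datetime, timedelta


def split_sessions(times, max_gap=timedelta(minutes=5)):
    """Group timestamps into sessions separated by long idle gaps."""
    sessions = []
    for t in sorted(times):
        if sessions and t - sessions[-1][-1] <= max_gap:
            sessions[-1].append(t)   # continue the current session
        else:
            sessions.append([t])     # idle gap too long: start a new session
    return sessions


def objects_per_minute(session):
    """Rate for one session: completed intervals divided by elapsed minutes."""
    if len(session) < 2:
        return None
    elapsed_min = (session[-1] - session[0]).total_seconds() / 60.0
    if elapsed_min == 0:
        return None
    return (len(session) - 1) / elapsed_min


# Synthetic timestamps (hypothetical): a slow session, then a faster one.
start = datetime(2020, 1, 1, 9, 0, 0)
slow = [start + timedelta(seconds=12 * i) for i in range(11)]  # one object every 12 s
later = start + timedelta(hours=1)
fast = [later + timedelta(seconds=4 * i) for i in range(31)]   # one object every 4 s

rates = [objects_per_minute(s) for s in split_sessions(slow + fast)]
print(rates)
```

With real data, the timestamps could come from calling `os.path.getmtime` on each annotation file, assuming the labeling tool writes one file per object.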

Abstract

The Advanced Data Analysis for Proliferation Detection (ADAPD) project is a NNSA NA-22-sponsored Venture that is developing novel data analysis capabilities to detect low-profile nuclear proliferation activities. A key step in the information refinement process for this work is to inspect input sensor datasets and data products produced by our analytics to generate as much "ground truth" as possible about the events that took place during the period of observation. This information helps the team's data scientists improve and validate their algorithms and yields data products that are valuable to analysts and decision makers. In this report we provide information about how we inspected multimodal sensor data from the Source Physics Experiment's Dry Alluvium Geology (DAG) tests and generated ground truth for ADAPD's analysis teams. This work illustrates the front-end data engineering tasks that frequently arise in new studies and documents our efforts to gain greater confidence in the assessments of the data.

Publication

SAND Report Craig Ulmer and Nicole McMahon, "Extracting Ground Truth from Surveillance Video in the Dry Alluvium Geology (DAG) Experiment". SAND2023-14455, September 2023.

Opportunistic Query Execution on SmartNICs

2023-09-26 pub hpc smartnics arrow

In our SmartNIC project we've been using Apache Arrow to represent and process in-transit data that flows between different jobs in a workflow. One of the advantages of using Arrow is that it includes a sophisticated compute engine named Acero that allows you to execute queries on tabular data. Previously we've written some basic queries in C++ to have Acero split entries in a table based on a field. Lately we've been using Acero to execute queries that a user might create at runtime (via tools like DuckDB or Ibis that can generate Substrait query plans). Jianshen and I wrote some client/server code for Faodel that allows a client to transmit a serialized Substrait plan to an endpoint, which then deserializes the requested objects into Arrow tables, applies the plan to the data, and sends the serialized results back to the client. This conduit gives us a handy way to query a remote SmartNIC and inspect its in-transit data.

For this paper (and his dissertation), Jianshen focused on making a decision engine that could quickly estimate whether it would be faster to execute the query at the SmartNIC or simply return the raw data and defer execution to the client. He measured overheads for executing queries and transmitting data, and then used machine learning techniques to make predictions about how long a query would take and how much data it would return. He used Apache DataSketches to rapidly characterize the in-transit data the SmartNIC held. At runtime the decision engine parsed the query syntax and applied probabilities to each clause to estimate how selective a query would ultimately be.
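To make the offload versus push-back tradeoff concrete, here is a toy Python sketch of that kind of decision. It is not Jianshen's engine: it swaps the learned predictors for fixed throughput constants, assumes the query clauses are ANDed together and statistically independent, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    nic_rows_per_s: float     # assumed query throughput on the SmartNIC
    client_rows_per_s: float  # assumed query throughput on the client
    link_bytes_per_s: float   # assumed bandwidth between SmartNIC and client


def estimate_selectivity(clause_probs):
    """Fraction of rows expected to survive ANDed clauses (independence assumed)."""
    sel = 1.0
    for p in clause_probs:
        sel *= p
    return sel


def choose(rows, bytes_per_row, clause_probs, costs):
    """Compare estimated end-to-end times and pick the faster path."""
    sel = estimate_selectivity(clause_probs)
    raw_bytes = rows * bytes_per_row
    # Offload: run the query on the SmartNIC, then ship only the filtered result.
    offload = rows / costs.nic_rows_per_s + sel * raw_bytes / costs.link_bytes_per_s
    # Push-back: ship all the raw data, then run the query on the client.
    pushback = raw_bytes / costs.link_bytes_per_s + rows / costs.client_rows_per_s
    decision = "offload" if offload <= pushback else "push-back"
    return decision, offload, pushback


# Hypothetical numbers: the SmartNIC is 10x slower than the client at running queries.
costs = Costs(nic_rows_per_s=1e6, client_rows_per_s=1e7, link_bytes_per_s=1e8)
selective = choose(1_000_000, 100, [0.1, 0.5], costs)  # keeps about 5% of rows
broad = choose(1_000_000, 100, [0.9], costs)           # keeps about 90% of rows
print(selective[0], broad[0])
```

In this toy model a highly selective query favors offloading because the smaller result saves more transfer time than the slower SmartNIC costs, while a query that returns most of the data is better pushed back to the client.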

Abstract

High-performance computing (HPC) systems researchers have proposed using current, programmable network interface cards (or SmartNICs) to offload data management services that would otherwise consume host processor cycles in a platform. While this work has successfully mapped data pipelines to a collection of SmartNICs, users require a flexible means of inspecting in-transit data to assess the live state of the system. In this paper, we explore SmartNIC-driven opportunistic query execution, i.e., enabling the SmartNIC to make a decision about whether to execute a query operation locally (i.e., "offload") or defer execution to the client (i.e., "push-back"). Characterizations of different parts of the end-to-end query path allow the decision engine to make complexity predictions that would not be feasible by the client alone.

Publication

HPEC Paper Jianshen Liu, Carlos Maltzahn, and Craig Ulmer, "Opportunistic Query Execution on SmartNICs for Analyzing In-Transit Data" in IEEE High Performance Extreme Computing, September 2023.

Anycubic Kobra-2 FDM Printer

2023-06-18 3d print

A few years ago I bought a 3D resin printer so the kids and I could learn a little bit more about modeling and fabricating 3D objects. While it's been a great experience, we haven't printed much in the last year because of all the headaches of dealing with resin. Every time we do a print we have to deal with temperatures, level the plate, put on all the safety gear, and then clean up everything at the end. It's a lot of overhead and dangerous enough that I don't want my kids doing it when I'm not home. I've been thinking it would be nice to have a traditional FDM printer on hand to lower the barrier for printing simple things so that printing will be more accessible to my kids. After a lot of internet wandering, I decided to get the new Anycubic Kobra-2. It's new, works with Linux, and shipped from Amazon with a 1kg spool of filament for $300.

Setup

The kids and I set up the Kobra-2 on my desk in the garage. The assembly wasn't too difficult, although it took us a while to figure out how to hold the frame so we could get some of the machined screws lined up properly. It was also a little unclear how the feeder tube was supposed to go in the header (does this go any farther in?). Once it was set up we ran the auto calibration tool to probe the height of the build plate. Auto calibration was a required feature for me, and one of the reasons why I'm happy to be buying a printer after the technology has had a chance to mature. We then preheated the filament and had it print the famous 3DBenchy boat design. The kids and I watched with wonder as the extruder spun around the plate with robot brrrrr noises. FDM printing is so much more exciting to watch than resin because you really see it happen. With resin the plate moves up and down every few seconds, with an upside-down design that's coated in excess resin. While you add a whole layer at a time, it takes a long time to get through all the pads and supports before you get to your actual design.

Sample Prints

3D Benchy only took 30 minutes to print out. One of the other selling points of this printer is that it can do higher speed prints (150mm/s to 250mm/s, compared to the 60mm/s of the stock Ender printers). I was really tempted to get one of the $600 Bambu printers, which can do up to 500mm/s, but decided we should start with a basic printer and see how much we like it first. Benchy came out looking pretty good, though you can see some pixelation in the windows that I don't think you'd have in resin. That's fine, though. I think I'm more interested in building functional widgets with this printer than detailed figures.

The next thing we printed was a small mesh cup I pulled from Thingiverse. This design came as a plain STL object so I had to load it into a slicer to render to gcode. Anycubic says to use PrusaSlicer, which is a powerful slicer built for Prusa printers. It's free and has a Linux version that worked on my Chromebook's Linux container. I had to download the settings from the Anycubic support site, but they came up fine. For this design I just loaded the cup, hit slice, and saved the gcode. Prusa had a lot of detailed info about how it built the object. I liked that it recognized the interior and autofilled it with a grid to save on material. The scaled down version of the print took about an hour to build (correctly predicted by Prusa). I was impressed that the printer was able to build a thin mesh and have it come out ok (though later I broke it trying to trim some of the base).

Next up was a micro-sd card holder. I found a clever design someone had made that had a radial container with a screw-on lid. The threading is really interesting to me because it gives you a way to connect parts together (someone also modified the design so you could screw together multiple micro-sd containers, though I doubt I'll ever fill this one). The parts I printed screwed together just fine. Two of the slots weren't deep enough, but that's ok. I should have added an up label though, as the slots don't have enough friction to keep cards in place if you open it upside down.

Finally, I printed a baby guardian dragon dice holder from Thingiverse for my niece. This design has a spot for you to put a die. It's a cute design, though the FDM version resulted in a bunch of lines on the angled surfaces.

Issues

We have had a few issues with the Kobra-2 during our first week of use. My son had a few failed prints that we're trying to figure out. The printer would get partway through the base of the design, get stuck, and then go into an endless calibration loop. It's possible this is because we installed a newer version of the slicer than we were previously using. When I went back and sliced the design with my Chromebook it printed fine. Again, it's nice that the setup/cleanup for a print is so easy. The other main issue has been quality. The FDM prints look good, but they're not as detailed as the resin prints. Below are some zoom-ins that show how this results in the FDM prints coming out jagged in certain spots.

Power

One thing I've noticed about the FDM printer is that the motors really take a beating, zig zagging back and forth all the time. Our house doesn't have great wiring, so the lights in the garage (and bathroom) flicker slightly when the printer is bouncing. Also, there's a spike in power when you start up because it needs to warm up the build plate and nozzle. Maybe I'll look into getting a battery or power conditioner for the plug to smooth out the signal.

Overall

Overall, I'm pretty happy with the Kobra-2 so far. After dealing with all the resin printing pains it's been a breeze to get FDM working. I don't think we'll print a ton of things, but it's nice to have the option to design and build stuff when we want.