Craig Ulmer

SmartNICs Project Final Report

2024-04-01 pub smartnics hpc

In December we finished our three-year, ASCR-funded "Offloading Data Management Services to SmartNICs" project. One of our deliverables was a final report that consolidates what we learned into a single document. This 144-page (!) report includes sections from our proposal and previous papers, and examines the use of SmartNICs from multiple perspectives.


There are three new topics in this report that we haven't covered before:

  • Apache Arrow vs Kokkos: Previously we've talked about Arrow as a way to write code that scales to multiple cores. In Chapter 4 we port three types of simple analytics to both Arrow and Kokkos and examine how well they scale on Host and SmartNIC processors. Arrow was tedious to write, but was competitive! The code listings are included in Appendix A.
  • Injection Optimizations: Host-to-NIC transfer performance has always been a problem because of memory addressing issues. In Chapter 8 we cover some optimizations that enable us to use the SmartNIC to gather data from the host's native buffers so that it can be serialized during injection.
  • Job-local Storage with SmartNICs: To address the performance issues of using a platform's shared filesystem, we investigated using a job's SmartNICs to host a private BeeOND filesystem. Given the limited flash memory of the SmartNIC, we borrowed the host's disk for this work via NVMeoF. Chapter 9 talks about the challenges of getting NVMeoF to work with offloading and covers some early jitter experiments.


Abstract

Modern workflows for high-performance computing (HPC) platforms rely on data management and storage services (DMSSes) to migrate data between simulations, analysis tools, and storage systems. While DMSSes help researchers assemble complex pipelines from disjoint tools, they currently consume resources that ultimately increase the workflow's overall node count. In FY21-23 the DOE ASCR project "Offloading Data Management Services to SmartNICs" explored a new architectural option for addressing this problem: hosting services in programmable network interface cards (SmartNICs). This report summarizes our work in characterizing the NVIDIA BlueField-2 SmartNIC and defining a general environment for hosting services in compute-node SmartNICs that leverages Apache Arrow for data processing and Sandia's Faodel for communication. We discuss five different aspects of SmartNIC use. Performance experiments with Sandia's Glinda cluster indicate that while SmartNIC processors are an order of magnitude slower than servers, they offer an economical and power efficient alternative for hosting services.

Publication

  • SAND Report: Craig Ulmer, Jianshen Liu, Carlos Maltzahn, Aldrin Montana, Matthew L. Curry, Scott Levy, Whit Schonbein, and John Shawger, "Offloading Data Management Services to SmartNICs: Project Summary". SAND2024-03873, April 2024.

Presentations

  • HPC Initiatives Slides: Presentation I gave at the SNL HPC Initiatives seminar in January.
  • SRU Slides: Presentation I gave to an undergraduate Computer Engineering class at Slippery Rock University in October.