Craig Ulmer

Processing Particle Data Flows with SmartNICs

2022-09-23 pub smartnics hpc

In my SmartNICs project I've been working with UC Santa Cruz on new software that makes it easier to process particle data streams as they flow through the network. We've been using Apache Arrow to do a lot of the heavy lifting, because Arrow provides an easy-to-use tabular data representation and has excellent serialization, query, and compute functions. For this HPEC paper we converted three particle datasets to an Arrow representation and then measured how quickly Arrow could split data into smaller tables for a log-structured merge (LSM) tree implementation we're developing. Jianshen then dug into getting the BlueField-2's compression hardware to accelerate the unpacking/packing of data with a library he developed named Bitar. After HPEC we wrote an extended version of this paper for arXiv that includes some additional plots that had previously been cut due to page limits.


The datasets for this were pretty fun. I pulled and converted particle data from CERN's TrackML Particle Identification challenge, airplane positions from the OpenSky Network, and ship positions from NOAA and Marine Cadastre. One of the benefits of working with Arrow is that it let us use existing tools to do a lot of the data preparation. I just used Pandas to read the initial data, restructure it, and save it out to compact Parquet files that our tests could quickly load at runtime. Even though each dataset had a different number of columns, our Arrow code could process each one so long as the position and ID columns had the proper labels.

Abstract

Many distributed applications implement complex data flows and need a flexible mechanism for routing data between producers and consumers. Recent advances in programmable network interface cards, or SmartNICs, represent an opportunity to offload data-flow tasks into the network fabric, thereby freeing the hosts to perform other work. System architects in this space face multiple questions about the best way to leverage SmartNICs as processing elements in data flows. In this paper, we advocate the use of Apache Arrow as a foundation for implementing data-flow tasks on SmartNICs. We report on our experiences adapting a partitioning algorithm for particle data to Apache Arrow and measure the on-card processing performance for the BlueField-2 SmartNIC. Our experiments confirm that the BlueField-2's (de)compression hardware can have a significant impact on in-transit workflows where data must be unpacked, processed, and repacked.

Publication