Craig Ulmer

Explicit Reuse Semantics for RDMA Communication

2020-05-28 faodel net pub

Scott did a lot of work over the last year collecting stats about how Lunasa (FAODEL's memory management system) can be used to improve performance in different types of communication scenarios. It turns out there are a lot of dirty secrets hidden in NIC device drivers. For example, simply de-registering memory can carry a pretty significant cost. Scott put a paper together making the case for using explicit memory handles when dealing with network data, instead of letting the communication layer take care of everything. There's a good Kokkos use case in the paper that gives an idea about how HPC is evolving, and he has numbers for both Mutrino (Cray/Gemini) and Stria (ARM/InfiniBand).


Remote Direct Memory Access (RDMA) is an increasingly important technology in high-performance computing (HPC). RDMA provides low-latency, high-bandwidth data transfer between compute nodes. Additionally, it does not require explicit synchronization with the destination processor. Eliminating unnecessary synchronization can significantly improve the communication performance of large-scale scientific codes. A long-standing challenge presented by RDMA communication is mitigating the cost of registering memory with the network interface controller (NIC). Reusing memory once it is registered has been shown to significantly reduce the cost of RDMA communication. However, existing approaches for reusing memory rely on implicit memory semantics. In this paper, we introduce an approach that makes memory reuse semantics explicit by exposing a separate allocator for registered memory. The data and analysis in this paper yield the following contributions: (i) managing registered memory explicitly enables efficient reuse of registered memory; (ii) registering large memory regions to amortize the registration cost over multiple user requests can significantly reduce cost of acquiring new registered memory; and (iii) reducing the cost of acquiring registered memory can significantly improve the performance of RDMA communication. Reusing registered memory is key to high-performance RDMA communication. By making reuse semantics explicit, our approach has the potential to improve RDMA performance by making it significantly easier for programmers to efficiently reuse registered memory.
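To make the abstract's second contribution concrete, here is a minimal sketch of the idea: register one large region up front, then hand out and recycle pieces of it so later requests never pay the registration cost again. This is not Lunasa's API. The `RegisteredPool` class and the `fake_register` stand-in are hypothetical names made up for illustration, and a real allocator would deal with alignment, fragmentation, and actual NIC handles.

```python
class RegisteredPool:
    """Toy allocator that registers one large region and carves it up."""

    def __init__(self, region_bytes, register_fn):
        # One (expensive) registration call covers the entire region.
        self.handle = register_fn(region_bytes)
        self.region_bytes = region_bytes
        self.next_offset = 0
        self.free_lists = {}  # block size -> offsets available for reuse

    def allocate(self, size):
        # Prefer a previously released block of the same size.
        offsets = self.free_lists.get(size)
        if offsets:
            return offsets.pop()
        if self.next_offset + size > self.region_bytes:
            raise MemoryError("registered region exhausted")
        offset = self.next_offset
        self.next_offset += size
        return offset

    def release(self, offset, size):
        # The memory stays registered; it is only marked as reusable.
        self.free_lists.setdefault(size, []).append(offset)


registration_calls = []

def fake_register(nbytes):
    # Hypothetical stand-in for a NIC registration call; it only records the call.
    registration_calls.append(nbytes)
    return len(registration_calls)

pool = RegisteredPool(1 << 20, fake_register)
a = pool.allocate(4096)
b = pool.allocate(4096)
pool.release(a, 4096)
c = pool.allocate(4096)  # reuses the block released above
```

In this toy run, three allocations and one release trigger only a single registration, which is the amortization the abstract describes.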


  • IPDPSW Paper: Scott Levy, Patrick Widener, Craig Ulmer, and Todd Kordenbrock, "The Case for Explicit Reuse Semantics for RDMA Communication", 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 879-888. DOI:10.1109/IPDPSW50202.2020.00148

Mediating Data Center Storage Diversity

2019-12-01 hpc io faodel pub

Patrick Widener put together a paper on using FAODEL to deal with data center storage diversity for ISC 2019. This paper gets into some of the ideas we've had about how to use data services to route to different storage targets, and highlights some of the HDF5/LevelDB interfacing that Patrick's done.


Composition of computational science applications into both ad hoc pipelines for analysis of collected or generated data and into well-defined and repeatable workflows is becoming increasingly popular. Meanwhile, dedicated high performance computing storage environments are rapidly becoming more diverse, with both significant amounts of non-volatile memory storage and mature parallel file systems available. At the same time, computational science codes are being coupled to data analysis tools which are not filesystem-oriented. In this paper, we describe how the FAODEL data management service can expose different available data storage options and mediate among them in both application- and FAODEL-directed ways. These capabilities allow applications to exploit their knowledge of the different types of data they may exchange during a workflow execution, and also provide FAODEL with mechanisms to proactively tune data storage behavior when appropriate. We describe the implementation of these capabilities in FAODEL and how they are used by applications, and present preliminary performance results demonstrating the potential benefits of our approach.


  • ISC HP Paper: Patrick Widener, Craig Ulmer, Scott Levy, Todd Kordenbrock, and Gary Templet, "Mediating Data Center Storage Diversity in HPC Applications with FAODEL", ISC High Performance 2019. Lecture Notes in Computer Science, vol 11887.


Revisiting RoCE on 100GigE

2019-10-01 net pub

A few years ago we stood up the Carnac cluster so our Emulytics users would have a place to do large-scale virtual network experiments. Unlike our HPC clusters, which use InfiniBand or OmniPath, Carnac was built around a 100GigE network fabric based on Mellanox NICs and a large Arista switch. While the 100GigE network was much more expensive than InfiniBand, it provides a more natural conduit for Emulytics experiments. Given that we knew portions of Carnac would be idle between jobs, we wondered if we could borrow some of the nodes from time to time and run MPI jobs efficiently.

Our initial tests using MPI over TCP were abysmal, as expected. TCP's latencies were pretty bad and we found you really had to open up a lot of simultaneous connections to get anywhere near the available bandwidth. About a decade ago there was a lot of interest in using RDMA over Converged Ethernet (RoCE) to improve this performance. RoCE is interesting because it tries to get Ethernet hardware to behave more like HPC hardware. On the host side, RoCE provides an InfiniBand API that HPC comm libs are used to using. RoCE NIC vendors have written OS bypass libs that allow userspace applications to talk directly with message queues on the NIC. In the fabric, RoCE messages are marked in special Ethernet frames so RoCE-aware switches can handle them more efficiently (i.e., use link-level flow control to avoid drops).

RoCE has been out there for a while, but you don't see many people using it these days. Joe Kenny set about trying to configure our switches and NICs to use it. At first it seemed like it was working, but in longer experiments he saw enough lock ups to indicate that there were incompatibilities in the hardware. He pleaded for help in an OFED talk but found no solutions. He borrowed a switch from Mellanox but got nowhere. Just as we were about to call it dead, he stumbled into some settings that fixed things. Things also worked fine back on our core Arista switch. It's frustrating that we don't have a good explanation for why things did/didn't work, but I pushed Joe to document what we went through in a SAND Report. It's SAND2019-13444.


Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) has the potential to provide performance that rivals traditional high performance fabrics. If this potential proves out, significant impacts on system procurement decisions could follow. This work provides a series of small scale performance results which are used to compare and contrast the performance of RoCE-enabled Ethernet with TCP-based Ethernet and an HPC network. Additionally, a discussion of the maturity of RoCE firmware/software stacks and documentation is provided along with useful approaches for probing performance. A detailed description of two experimental setups known to have good RoCE performance is given, including step-by-step configuration and the exact hardware and software revisions employed. At small scales, RoCE is found to have significant performance advantages over "out-of-the-box" TCP protocols and is competitive with state-of-the-art high performance networks. Further examination of RoCE using a wider array of benchmarks and at greater scale is warranted.


  • SAND Report: Joseph Kenny and Craig Ulmer, "RoCE Promising Technology for Ethernet as a High Performance Networking Fabric". SAND2019-13444, October 2019.

100GigE Packet Capture

2019-09-02 net interns pub

This summer we were fortunate to have two undergraduate summer interns come in to help us out with different projects related to the clusters. The first intern to arrive was Haoda Wang from USC. Haoda had a good bit of experience with Linux systems so we had him help us do some experiments with some nodes we just bought to do 100Gb/s packet recording. The nodes each have two AMD Epyc processors, 1TB of RAM, 10x2TB of U.2 NVMe storage, and a 100Gb/s Mellanox VPI NIC. He did a nice write up of the work he did in SAND2019-10319.

The Epyc nodes had a few new features so the first thing we had Haoda do after setting up the hardware was run some benchmarks to get a better idea of how the system should be configured. He tried a few different OSs and hardware configs. One interesting observation was that the system was slightly faster when only half the memory sockets were filled (AMD docs had warnings about this). New Ubuntu kernels had slightly better performance in some benchmarks, but we were stuck with RHEL due to driver issues with Mellanox.

The U.2 storage performed very well. Haoda found that the drives were very fast and that we could get close to 20GB/s of streaming write performance by using Btrfs or XFS RAIDs. Interestingly, ZFS didn't perform very well, possibly due to its complexity and the speed of the drives. Haoda also explored using SPDK to stream data to disk via kernel bypass, but given that we only needed to hit 12GB/s speeds for worst-case network capture, we decided to stick with plain I/O.
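The worst-case capture rate quoted above follows directly from the link speed. A quick back-of-the-envelope check, using only numbers from this post and ignoring Ethernet framing overhead and filesystem overhead (both simplifying assumptions):

```python
LINK_GBITS_PER_S = 100        # NIC line rate
NUM_DRIVES = 10               # U.2 NVMe drives per node
MEASURED_WRITE_GBYTES = 20    # approximate streaming write rate seen with Btrfs/XFS

# Convert bits to bytes to get the rate the disks must sustain.
capture_gbytes_per_s = LINK_GBITS_PER_S / 8

# Share of that rate each drive carries if writes are striped evenly.
per_drive_gbytes_per_s = capture_gbytes_per_s / NUM_DRIVES

# How much faster the measured array is than the capture requirement.
headroom = MEASURED_WRITE_GBYTES / capture_gbytes_per_s

print(capture_gbytes_per_s, per_drive_gbytes_per_s, headroom)
```

That works out to roughly 12.5GB/s for the array and 1.25GB/s per drive, which is why the measured ~20GB/s left enough margin to skip SPDK.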

Haoda spent the rest of the summer fighting Mellanox drivers and tweaking settings to get the Ethernet NICs to run at 100Gbps speeds. After a great deal of searching he realized that while the userspace DPDK library could grab packets from the wire correctly, running the data through libpcap caused multiple memory copies in order to format the data for output. Writing the raw packets to disk and then reading them in a follower application made it possible to keep up with line rates. It was a frustrating journey, but I'm glad we figured it out.
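The split between a fast capture path and a slower follower can be sketched in a few lines. This is only an illustration of the pattern, not the code from the report: the raw record layout (`RAW_RECORD`) and the function names are hypothetical, and the real capture side was DPDK rather than Python. The follower emits the classic pcap file format (24-byte global header, 16-byte per-packet header).

```python
import io
import struct

RAW_RECORD = struct.Struct("<IIH")        # ts_sec, ts_usec, frame length (hypothetical layout)
PCAP_GLOBAL = struct.Struct("<IHHiIII")   # classic pcap global header
PCAP_RECORD = struct.Struct("<IIII")      # classic pcap per-packet header

def append_raw(out, ts_sec, ts_usec, frame):
    """Capture side: write a tiny header plus the untouched frame bytes."""
    out.write(RAW_RECORD.pack(ts_sec, ts_usec, len(frame)))
    out.write(frame)

def raw_to_pcap(raw_in, pcap_out, snaplen=65535):
    """Follower side: read raw records and emit a classic pcap stream."""
    # magic, version 2.4, timezone offset, sigfigs, snaplen, linktype 1 (Ethernet)
    pcap_out.write(PCAP_GLOBAL.pack(0xA1B2C3D4, 2, 4, 0, 0, snaplen, 1))
    count = 0
    while True:
        header = raw_in.read(RAW_RECORD.size)
        if len(header) < RAW_RECORD.size:
            break  # end of the raw stream
        ts_sec, ts_usec, length = RAW_RECORD.unpack(header)
        frame = raw_in.read(length)
        if len(frame) < length:
            break  # truncated final record; stop converting
        included = min(length, snaplen)
        pcap_out.write(PCAP_RECORD.pack(ts_sec, ts_usec, included, length))
        pcap_out.write(frame[:included])
        count += 1
    return count

# Tiny in-memory demonstration with two dummy frames.
raw = io.BytesIO()
append_raw(raw, 1, 500, b"\x00" * 60)
append_raw(raw, 2, 0, b"\xff" * 64)
raw.seek(0)
pcap = io.BytesIO()
converted = raw_to_pcap(raw, pcap)
```

The point of the design is that the capture side does one sequential write per packet and nothing else, while all of the formatting work happens later, off the critical path.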


  • SAND Report: Haoda Wang, Gavin Baker, Joseph Kenny, and Craig Ulmer, "An Initial Investigation of the Design Challenges Associated with Reliable 100GigE Packet Capture". SAND2019-10319, September 2019.

FAODEL 1.1906.1 Released

2019-07-08 faodel code

One of the things that's been missing from FAODEL is a tool to help manage resources and launch services. After the EMPIRE release, we did a lot of work to fix this by building a new CLI tool that does many different things. The faodel tool can start/stop services, set/remove DirMan resource info, and put/get Kelpie objects from resource pools. We've received approval from DOE to release this as version 1.1906.1 (Excelsior!). Here's the changelog:

Release Improvements

  • New faodel-cli tool for manipulating many things
    • Gets build/configure info (replaces faodel-info)
    • Start/stop services (dirman, kelpie)
    • Define/query/remove dirman resources
    • Put/get/list kelpie objects
    • New example/kelpie-cli script shows how to use
  • Support for ARM platform
  • NNTI adds On-Demand Paging capability
  • NNTI adds Cereal as alternative for serialization
  • NNTI has better detection and selection of IB devices
  • Fixes
    • SBL could segfault due to Boost if exit without calling finish
    • FAODEL couldn't be included in a larger project's cmake
    • LDO had a race condition in destructor

Significant User-Visible Changes

  • faodel-info and whookie tools replaced by faodel cli tool
  • Dirman's DirInfo "children" renamed to "members"
  • Faodel now has a package in the Spack develop branch

Known Issues

  • FAODEL's libfabric transport is still experimental. It does not fully implement Atomics or Long Sends. While Kelpie does not require these operations, other OpBox-based applications may break without this support.
  • On Cray machines with the Aries interconnect, FAODEL can be overwhelmed by a sustained stream of sends larger than the MTU. To avoid this problem, the sender should limit itself to bursts of 32 long sends at a time.
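One way to read the Aries workaround above is as a simple burst limiter on the sending side: post at most 32 long sends, wait for them to complete, then post the next group. The sketch below shows that pattern in generic Python. It does not use FAODEL, OpBox, or NNTI calls; `post_send` and `wait_all` are hypothetical callbacks standing in for whatever the application uses to issue a send and to wait on completions.

```python
def send_in_bursts(messages, post_send, wait_all, burst_size=32):
    """Post sends in groups of burst_size, draining each group before the next."""
    bursts = 0
    for start in range(0, len(messages), burst_size):
        pending = [post_send(m) for m in messages[start:start + burst_size]]
        wait_all(pending)  # block until every send in this burst has completed
        bursts += 1
    return bursts


# Fake transport that only tracks how many sends are outstanding.
state = {"in_flight": 0, "max_in_flight": 0}

def fake_post_send(msg):
    state["in_flight"] += 1
    state["max_in_flight"] = max(state["max_in_flight"], state["in_flight"])
    return msg

def fake_wait_all(pending):
    state["in_flight"] -= len(pending)

bursts = send_in_bursts(list(range(100)), fake_post_send, fake_wait_all)
```

With 100 messages this issues four bursts and never has more than 32 sends outstanding at once.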