Scott did a lot of work over the last year collecting stats about how Lunasa (FAODEL's memory management system) can be used to improve performance in different types of communication scenarios. It turns out there are a lot of dirty secrets hidden in NIC device drivers; for example, simply de-registering memory can carry a pretty significant cost. Scott put a paper together making the case for using explicit memory handles when dealing with network data, instead of letting the communication layer take care of everything. There's a good Kokkos use case in the paper that gives an idea of how HPC is evolving, and he has numbers for both Mutrino (Cray/Gemini) and Stria (ARM/InfiniBand).
Remote Direct Memory Access (RDMA) is an increasingly important technology in high-performance computing (HPC). RDMA provides low-latency, high-bandwidth data transfer between compute nodes. Additionally, it does not require explicit synchronization with the destination processor. Eliminating unnecessary synchronization can significantly improve the communication performance of large-scale scientific codes. A long-standing challenge presented by RDMA communication is mitigating the cost of registering memory with the network interface controller (NIC). Reusing memory once it is registered has been shown to significantly reduce the cost of RDMA communication. However, existing approaches for reusing memory rely on implicit memory semantics. In this paper, we introduce an approach that makes memory reuse semantics explicit by exposing a separate allocator for registered memory. The data and analysis in this paper yield the following contributions: (i) managing registered memory explicitly enables efficient reuse of registered memory; (ii) registering large memory regions to amortize the registration cost over multiple user requests can significantly reduce the cost of acquiring new registered memory; and (iii) reducing the cost of acquiring registered memory can significantly improve the performance of RDMA communication. Reusing registered memory is key to high-performance RDMA communication. By making reuse semantics explicit, our approach has the potential to improve RDMA performance by making it significantly easier for programmers to efficiently reuse registered memory.
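To give a feel for contribution (ii), here is a minimal Python sketch of the idea of registering one large region and handing out pieces of it. This is not Lunasa's API or the paper's implementation: `SimulatedNic`, `RegisteredPool`, and both helper functions are hypothetical names made up for illustration, and the "NIC" only counts register and deregister calls rather than talking to real hardware.

```python
class SimulatedNic:
    """Hypothetical stand-in for a NIC driver: it only counts calls."""

    def __init__(self):
        self.register_calls = 0
        self.deregister_calls = 0
        self._next_handle = 1

    def register(self, buf):
        # A real driver would pin pages and program the NIC here.
        self.register_calls += 1
        handle = self._next_handle
        self._next_handle += 1
        return handle

    def deregister(self, handle):
        # A real driver would unpin pages and release NIC resources here.
        self.deregister_calls += 1


class RegisteredPool:
    """Explicit allocator: registers one large region, then sub-allocates."""

    def __init__(self, nic, region_bytes):
        self.nic = nic
        self.region = bytearray(region_bytes)
        self.handle = nic.register(self.region)  # the only registration
        self.free_list = [(0, region_bytes)]     # (offset, length) pairs

    def allocate(self, nbytes):
        # First-fit search over the free list.
        for i, (off, length) in enumerate(self.free_list):
            if length >= nbytes:
                if length == nbytes:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (off + nbytes, length - nbytes)
                return (self.handle, off, nbytes)
        raise MemoryError("registered region exhausted")

    def release(self, block):
        # Return the block and merge adjacent free ranges.
        _, off, nbytes = block
        self.free_list.append((off, nbytes))
        self.free_list.sort()
        merged = [self.free_list[0]]
        for o, l in self.free_list[1:]:
            prev_off, prev_len = merged[-1]
            if prev_off + prev_len == o:
                merged[-1] = (prev_off, prev_len + l)
            else:
                merged.append((o, l))
        self.free_list = merged

    def close(self):
        self.nic.deregister(self.handle)


def per_message_registration(nic, sizes):
    """Baseline: register and deregister a fresh buffer for every message."""
    for n in sizes:
        handle = nic.register(bytearray(n))
        nic.deregister(handle)


def pooled_registration(nic, sizes, region_bytes):
    """Explicit reuse: every message buffer comes from one registered region."""
    pool = RegisteredPool(nic, region_bytes)
    for n in sizes:
        block = pool.allocate(n)
        pool.release(block)
    pool.close()


if __name__ == "__main__":
    sizes = [1024] * 100
    baseline, pooled = SimulatedNic(), SimulatedNic()
    per_message_registration(baseline, sizes)
    pooled_registration(pooled, sizes, 1 << 20)
    print("per-message registrations:", baseline.register_calls)
    print("pooled registrations:", pooled.register_calls)
```

In this toy setup, 100 messages cost 100 register/deregister pairs on the per-message path and a single pair on the pooled path. How much time that saves on a real machine depends on the NIC and driver, which is what the numbers in the paper are for.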