A few years ago we stood up the Carnac cluster so our Emulytics users would have a place to do large-scale virtual network experiments. Unlike our HPC clusters, which use InfiniBand or OmniPath, Carnac was built around a 100GigE network fabric based on Mellanox NICs and a large Arista switch. While the 100GigE network was much more expensive than InfiniBand, it provides a more natural conduit for Emulytics experiments. Given that we knew portions of Carnac would be idle between jobs, we wondered if we could borrow some of the nodes from time to time and run MPI jobs efficiently.
Our initial tests using MPI over TCP were abysmal, as expected. TCP's latencies were pretty bad, and we found you really had to open up a lot of simultaneous connections to get anywhere near the available bandwidth. About a decade ago there was a lot of interest in using RDMA over Converged Ethernet (RoCE) to improve this performance. RoCE is interesting because it tries to get Ethernet hardware to behave more like HPC hardware. On the host side, RoCE provides an InfiniBand API that HPC communication libraries already know how to use. RoCE NIC vendors have written OS-bypass libraries that let userspace applications talk directly to message queues on the NIC. In the fabric, RoCE messages are marked in special Ethernet frames so RoCE-aware switches can handle them more efficiently (i.e., use link-level flow control to avoid drops).
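To put the "lot of simultaneous connections" point in rough numbers, here is a back-of-the-envelope sketch. The per-stream throughput values are purely hypothetical placeholders, not measurements from Carnac, and the model assumes streams scale linearly until the link saturates, which real TCP traffic will not do exactly.

```python
import math

LINK_GBPS = 100.0  # nominal 100GigE line rate


def streams_needed(link_gbps: float, per_stream_gbps: float) -> int:
    """Minimum number of parallel streams needed to fill the link,
    assuming each stream sustains per_stream_gbps and scaling is linear."""
    if per_stream_gbps <= 0:
        raise ValueError("per_stream_gbps must be positive")
    return math.ceil(link_gbps / per_stream_gbps)


def aggregate_gbps(link_gbps: float, per_stream_gbps: float, n_streams: int) -> float:
    """Aggregate throughput of n_streams, capped at the link rate."""
    return min(link_gbps, per_stream_gbps * n_streams)


if __name__ == "__main__":
    # Hypothetical per-stream rates (Gb/s); substitute your own measurements.
    for rate in (5.0, 12.0, 25.0):
        n = streams_needed(LINK_GBPS, rate)
        print(f"{rate:5.1f} Gb/s per stream: {n} streams to fill the link")
```

Plugging in a single-stream number measured on your own hardware gives a quick estimate of how many connections a TCP-based MPI job would need to keep open per node pair.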
RoCE has been out there for a while, but you don't see many people using it these days. Joe Kenny set about trying to configure our switches and NICs to use it. At first it seemed like it was working, but in longer experiments he saw enough lockups to indicate that there were incompatibilities in the hardware. He pleaded for help in an OFED talk but found no solutions. He borrowed a switch from Mellanox but got nowhere. Just as we were about to call it dead, he stumbled into some settings that fixed things. Things also worked fine back on our core Arista switch. It's frustrating that we don't have a good explanation for why things did or didn't work, but I pushed Joe to document what we went through in a SAND Report. It's SAND2019-13444.
Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) has the potential to provide performance that rivals traditional high performance fabrics. If this potential proves out, significant impacts on system procurement decisions could follow. This work provides a series of small scale performance results which are used to compare and contrast the performance of RoCE-enabled Ethernet with TCP-based Ethernet and an HPC network. Additionally, a discussion of the maturity of RoCE firmware/software stacks and documentation is provided along with useful approaches for probing performance. A detailed description of two experimental setups known to have good RoCE performance is given, including step-by-step configuration and the exact hardware and software revisions employed. At small scales, RoCE is found to have significant performance advantages over "out-of-the-box" TCP protocols and is competitive with state-of-the-art high performance networks. Further examination of RoCE using a wider array of benchmarks and at greater scale is warranted.