Craig Ulmer

Hardware Accelerators on the Cray XD1

2005-05-16 fpga pub hpc

One of the interesting things about working at the labs is that sometimes you get early access to cutting-edge hardware. After I started work on a Reconfigurable Computing project, we ran into a company called OctigaBay that was building a new dense computing product that was years ahead of other vendors. The hardware was impressive: a single 3U box held six dual-socket compute blades (with disks), a high-wattage power supply, and a custom InfiniBand-like interconnect that included a switch chip and external ports in each box. The system architects did a lot of interesting work on the network. Rather than use a stock PCIe IB card, they built their own custom network interface chip in an FPGA that spoke HyperTransport on one side and IB packets on the other. This approach gave them a custom communication fabric with high-speed access to main memory, without having to invent all of the hardware needed for the rest of the network. It didn't have the bandwidth of Cray's SeaStar (which also puts the NIC on HT), but the OctigaBay implementation was a lot more practical for cluster users.


Shortly before we finalized a deal to buy one of the first systems, Cray bought OctigaBay and branded the system as their new Cray XD1 product. Cray definitely helped make the system more manufacturable and production-ready (e.g., they put a slick web GUI on the XD1's admin network that let you control the nodes and manage images in one place). Given that we were one of the first places to get a system, Cray sent multiple design and engineering teams out to our site to help us get the system up and working the way we wanted. It was an interesting time, as everyone seemed to realize that the XD1 was a clean design that addressed a lot of integration problems clusters had had for a long time.


FPGA Coprocessors

While the XD1's compute density was appealing, the main reason I was interested in the system was that it was one of the first commodity platforms to integrate FPGA coprocessors into the architecture in a useful way. The XD1 designers wanted to make it possible to double the system's network performance, so they built an add-on board that implemented a second network backplane for the box. Given that a number of people were interested in reconfigurable computing at the time, they decided to attach an additional FPGA to this board that users could program with their own application-specific hardware. The user FPGA connected to the NIC FPGA via a slimmed-down HyperTransport interface. This interface allowed the user logic to run at lower clock rates if needed, as 200MHz was still difficult to hit in FPGA logic at the time. The FPGA also had four banks of high-speed QDR memory that users could control as needed. A common trick was to exploit the fact that QDR is dual-ported to pipeline data transfers between the host and the FPGA logic.
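
As a concrete (if purely software) illustration of that pipelining idea, the sketch below models one QDR bank whose write port is fed by the host while the FPGA logic drains previously written chunks through the read port. The bank and chunk sizes, and the toy "sum the chunk" step, are invented for the example; this is not XD1 code.

    #include <stdio.h>

    #define BANK_WORDS  4096                 /* illustrative QDR bank size */
    #define CHUNK_WORDS 512                  /* transfer granularity */

    static unsigned bank[BANK_WORDS];        /* stand-in for one QDR bank */

    /* Host side: write the next chunk into the bank (via the write port). */
    static void host_write_chunk(unsigned chunk)
    {
        for (unsigned i = 0; i < CHUNK_WORDS; i++)
            bank[(chunk * CHUNK_WORDS) % BANK_WORDS + i] = chunk * CHUNK_WORDS + i;
    }

    /* FPGA side: read a previously written chunk (via the read port). */
    static unsigned fpga_read_chunk(unsigned chunk)
    {
        unsigned sum = 0;
        for (unsigned i = 0; i < CHUNK_WORDS; i++)
            sum += bank[(chunk * CHUNK_WORDS) % BANK_WORDS + i];
        return sum;
    }

    int main(void)
    {
        host_write_chunk(0);                         /* prime the pipeline */
        for (unsigned c = 1; c < 16; c++) {
            /* With separate read and write ports, these two steps can overlap
             * in the same bank; the host stays one chunk ahead of the logic. */
            host_write_chunk(c);
            printf("chunk %2u -> %u\n", c - 1, fpga_read_chunk(c - 1));
        }
        printf("chunk 15 -> %u\n", fpga_read_chunk(15));
        return 0;
    }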


The software support for the XD1 was very good. OctigaBay provided a device driver that could load the FPGA, peek/poke registers, pin host memory, and orchestrate DMAs between the host and FPGA. On the hardware side, OctigaBay provided a thin, timing-optimized interface core you could easily plug into your designs. They also provided a core for exchanging data with the QDR memory that resolved requests in a fixed number of cycles. One of their simple-but-powerful examples implemented a Mersenne Twister PRNG that generated random numbers quickly on the FPGA and then streamed the values up into host memory. Having an endless supply of good-quality random values wound up being a big help for other people's Monte Carlo simulations.
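
To give a feel for the flow, here is a rough host-side sketch of that sequence: open the device, load a bitstream, pin a buffer, kick off a DMA, and peek a status register. Every name in it (fpga_open, fpga_load_bitstream, the /dev/ufp0 node, the register offsets) is a made-up placeholder rather than the real Cray API, and the stubs at the top exist only so the sketch compiles.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Placeholder stubs standing in for the XD1 FPGA driver API; the real
     * call names, signatures, and device node were different. */
    static int   fpga_open(const char *dev)                      { printf("open %s\n", dev); return 3; }
    static int   fpga_load_bitstream(int fd, const char *bit)    { printf("load %s\n", bit); return 0; }
    static void *fpga_pin_buffer(int fd, size_t bytes)           { return malloc(bytes); }
    static void  fpga_write_reg(int fd, uint32_t off, uint64_t v){ printf("poke 0x%x=%llu\n", off, (unsigned long long)v); }
    static uint64_t fpga_read_reg(int fd, uint32_t off)          { printf("peek 0x%x\n", off); return 0; }
    static int   fpga_dma_to_host(int fd, void *dst, size_t n)   { printf("dma %zu bytes\n", n); return 0; }

    int main(void)
    {
        int fd = fpga_open("/dev/ufp0");              /* hypothetical device node */
        fpga_load_bitstream(fd, "prng.bit");          /* program the user FPGA */

        /* Pin host memory so the FPGA's DMA engine can stream results into it. */
        uint64_t *buf = fpga_pin_buffer(fd, 1u << 20);

        fpga_write_reg(fd, 0x00, 1);                  /* poke a "go" register */
        fpga_dma_to_host(fd, buf, 1u << 20);          /* pull a block up to the host */
        printf("status=%llu\n", (unsigned long long)fpga_read_reg(fd, 0x08));

        free(buf);
        return 0;
    }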

A Few Cores of Our Own

It took a while for us to get up to speed on the XD1's hardware. One of the first major things I did was to build a more usable HyperTransport DMA engine for the FPGA that could maximize data transfers on the bus. While OctigaBay provided an API for talking on HT, you still had to speak the protocol correctly and fill the packets to make the transfers efficient. Since we usually needed a way to move big blocks of data at a time, I made a DMA engine that let you write data into BRAM and then have the engine schedule it into individual transactions to the host. This engine scaled pretty well and made it easier to stream data through the FPGA.
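
A software model of that scheduling step is below: it carves a BRAM-staged block into individual host-bound write transactions, each packed to the maximum payload except possibly the last. The 64-byte payload limit is an assumption for the sketch (it matches my memory of HT packet sizes), and the "issue" step is just a printf.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define HT_MAX_PAYLOAD 64u   /* assumed max bytes per HT write packet */

    /* Software model of the scheduler: carve a BRAM-staged block into
     * individual host-bound write transactions, each fully packed except
     * possibly the last one. */
    static void dma_schedule(uint64_t host_addr, const uint8_t *bram, size_t nbytes)
    {
        size_t off = 0;
        while (off < nbytes) {
            size_t len = nbytes - off;
            if (len > HT_MAX_PAYLOAD)
                len = HT_MAX_PAYLOAD;
            printf("HT write: addr=0x%llx len=%zu\n",
                   (unsigned long long)(host_addr + off), len);
            off += len;
        }
        (void)bram;   /* payload bytes would ride along with each packet */
    }

    int main(void)
    {
        uint8_t bram[200];                /* pretend this is the staging BRAM */
        memset(bram, 0xAB, sizeof bram);
        dma_schedule(0x100000, bram, sizeof bram);   /* 200 B -> 3 full + 1 short packet */
        return 0;
    }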

I implemented a core to do MD5 on the XD1's FPGA, naively thinking that MD5 was slow enough that an FPGA would help speed it up. What I didn't think about was that MD5 is intentionally serial and there's no easy way to speed it up. I followed through on the implementation though: you could stream data through the FPGA and it'd crank out the right hash. On the positive side, it'd be easy to drop the MD5 unit into another design and have it compute a hash while the FPGA was doing something else.
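
For reference, the serial structure looks like the sketch below: each 512-bit block is folded into the running digest, so block i can't start until block i-1 has finished, and the 64 rounds inside each block form their own strict dependency chain. The compression body and the padding/finalization steps are omitted here; only the chaining that defeats parallelism is shown.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Sketch of MD5's chaining structure (round logic omitted). */
    static void md5_compress(uint32_t state[4], const uint8_t block[64])
    {
        uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
        (void)block;   /* the real 64 sequential rounds consume the block here */
        state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    }

    static void md5_stream(uint32_t state[4], const uint8_t *msg, size_t nblocks)
    {
        /* Standard MD5 initial values. */
        state[0] = 0x67452301; state[1] = 0xefcdab89;
        state[2] = 0x98badcfe; state[3] = 0x10325476;

        for (size_t i = 0; i < nblocks; i++)          /* strictly serial loop */
            md5_compress(state, msg + 64 * i);        /* needs the previous state */
    }

    int main(void)
    {
        uint8_t msg[128] = {0};                       /* two dummy blocks */
        uint32_t st[4];
        md5_stream(st, msg, 2);
        /* Not a real digest, since the rounds are elided above. */
        printf("%08x %08x %08x %08x\n", st[0], st[1], st[2], st[3]);
        return 0;
    }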

Next, I took a stab at sorting integer data values. I built a systolic array of tiles that sorted data values as you streamed them into the hardware. This work was a lot of fun, because all I had to do was design a simple processing element and then write a big generate statement to chain all the units together. The array worked, but in the end the wiring ate up a lot of chip resources and we could only put a few hundred processing elements in a single chip. From an algorithmic perspective, though, the linear sorting times were appealing to see.
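
A minimal software model of the processing element is below (not our actual HDL): each element holds the largest value it has seen and forwards the loser to its neighbor, so after the whole stream has been fed in, the chain holds the data in descending order. In hardware all of the elements compare in parallel each cycle; the C loop just sweeps the chain once per input to get the same data movement.

    #include <limits.h>
    #include <stdio.h>

    #define N 8   /* number of processing elements in the chain */

    static int pe[N];                        /* one stored value per element */

    static void reset(void)
    {
        for (int i = 0; i < N; i++)
            pe[i] = INT_MIN;                 /* empty cells hold a sentinel */
    }

    /* Stream one value into the head of the chain. */
    static void feed(int value)
    {
        for (int i = 0; i < N; i++) {
            if (value > pe[i]) {             /* keep the larger, pass the smaller */
                int tmp = pe[i];
                pe[i] = value;
                value = tmp;
            }
        }
    }

    int main(void)
    {
        int data[N] = { 42, 7, 19, 3, 88, 55, 21, 60 };
        reset();
        for (int i = 0; i < N; i++)
            feed(data[i]);                   /* one new value enters per cycle */
        for (int i = 0; i < N; i++)
            printf("%d ", pe[i]);            /* prints 88 60 55 42 21 19 7 3 */
        printf("\n");
        return 0;
    }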


Our next step was to tackle some floating-point algorithms. This kind of work is what HPC people want to see, but it's also the most difficult to achieve because FPGAs of that era lacked native floating-point support. Fortunately, some collaborators of mine had been developing a few good floating-point cores that were deeply pipelined. We wound up building a simple 88-stage pipeline to compute the Pythagorean theorem. You had to stream a lot of data to overcome the transfer times and become competitive with the host, but it served as a good starting point for later work where we sequenced together more complicated algorithms.
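
The arithmetic behind that "stream a lot of data" observation is simple enough to model. Assuming the 200MHz clock and 88-stage depth mentioned above, and an invented per-batch transfer overhead (the 50 microseconds below is a placeholder, not a measured XD1 number), the effective rate climbs toward one result per clock only once the stream is long enough to hide the fill latency and the DMA cost.

    #include <stdio.h>

    /* Back-of-the-envelope model for a deep hypot pipeline: once full, it
     * retires one sqrt(a*a + b*b) per clock, so the fill latency and the
     * host<->FPGA transfer time only matter for short streams. */
    int main(void)
    {
        const double clock_hz = 200e6;   /* FPGA user-logic clock */
        const int    depth    = 88;      /* pipeline stages */
        const double xfer_sec = 50e-6;   /* assumed per-batch DMA overhead */

        for (long long n = 100; n <= 100000000LL; n *= 100) {
            double secs = xfer_sec + (depth + n) / clock_hz;
            printf("n=%9lld  effective rate = %6.1f Mresults/s\n",
                   n, (n / secs) / 1e6);
        }
        return 0;
    }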


Floorplanning

One of the things the XD1 got me to do more of as a designer was think about floorplanning. If you blindly let the tools place everything, they spread logic across the chip, which made timing difficult. Below is a picture of how the tools placed one of our floating-point pipelines. The FP cores themselves (red, blue, mauve) were dense because my co-workers built the netlists by hand. However, they weren't placed close to each other, so the routing spilled all over the place. The hard PowerPC cores (the black boxes) also took up a lot of space and made routing hard. It's worth pointing out that the OctigaBay logic for interfacing with HT (and QDR?) sits at the bottom and is tightly packed so they could hit their 200MHz clock rates. Those guys (e.g., Steve Margerm) really knew what they were doing.


The Fall of the XD1

We did a lot of good work with the XD1, some of which I will post about later. Unfortunately, things took a turn for the worse after a few years. Cray overextended itself when it paid cash for OctigaBay and nearly went bankrupt (again?). The XD1 didn't become the big seller that Cray wanted, so the product floundered and eventually Cray fired all of the OctigaBay engineers (who were really top-notch people).


We didn't buy a support contract for the XD1 because Cray support came at Cray prices. Eventually, one of the hard drives on the head node died and we lost the ability to boot and control all the blades in the box. Given that the management software was all proprietary, there was no hope of resurrecting the system on our own. I vowed to avoid proprietary systems that can't be taken apart and rebuilt using commodity tools. They're great when they work, but they're useless when they break.

The XD1 sat powered off for some time before I finally had someone haul it off. Before it left, I took it apart and pulled out some keepsakes for the office, including the nameplate. It was only then that I realized that the massive fan for the first two blades had never been plugged in. It's impressive that the blades lasted as long as they did, but it was a sad ending for an otherwise great machine.

Publications

CUG Paper Craig Ulmer, Ryan Hilles, and David Thompson, "Reconfigurable Aspects of the Cray XD1", Cray User Group (CUG) 2005.

Presentations

CUG Slides Presentation given at CUG. pdf version