The elRoy Systolic Processor Array
For my undergraduate senior design project, Darrell Stogner and I designed and built a systolic processor array that used multiple processing elements to accelerate matrix and vector operations. In addition to simulating the design, we adapted an assembler to our ISA and built assembly code to demonstrate that it could process multiple types of data flows. While the design was too large to fit in the school's Zycad hardware emulator box, we were able to map, partition, and test portions of the design in FPGA hardware.
GaTech's New CompE 4500/4510 Senior Design Class
Midway through my undergraduate CompE degree, Georgia Tech did a complete overhaul of the CompE curriculum. While it would have been shorter to graduate under the old program, I chose to switch over to the new curriculum because the classes covered a broader range of material. One of the requirements in the new program was that all students had to take a two-quarter, senior design class. The course catalog described this series as a "Capstone design experience for computer engineering majors. Design a processor and associated instruction. Testing via simulation models." I signed up for the first offering of CompE4500, which was taught by Dr. Sudhakar Yalamanchili. The class only had about 15 students in it, since there weren't many CompE's at the tail end of the curriculum yet. Sudha was incredibly encouraging and told us that the point of this class was to design a new processor architecture and build all the support software necessary to bring it to life. He would teach us how to design hardware in VHDL, debug the design with EDA simulators, customize an assembler to work with our ISAs, and synthesize the hardware to run in an FPGA-based emulation platform from Zycad. Sudha asked the class to split into teams of two or three, and then scheduled weekly meetings with him to discuss how projects were progressing. Darrell and I knew each other from previous DSP classes, and teamed up without any real ideas of what we should build for the class.
Systolic Processor Arrays
When we asked Sudha for project ideas, he sent us home with some research papers about 2D systolic processor arrays that people had built for image processing. While the papers were a little bit beyond our reading level, they helped us understand that researchers had constructed systolic processor arrays as a way to maximize concurrency in complex dataflows. The idea is that you design a simple processing element (PE) with fixed routing connections that make it easy to tile out the cores in a large grid. After loading a program into the PE, you stream data into and out of the edges of the array. Each PE does a little bit of work on the data as it is pumped through the system. Sensing our concern about getting a design up and running by the end of the course, Sudha suggested that we focus on a 1D design that could implement matrix multiplication and convolution. Darrell and I picked the name "elRoy" because it sounded like something from the space age. We picked funny caps to make it sound edgy.
For the PE part of the design we sketched out an architecture that included one multiplier, one adder, a few registers, and a data path that could be adjusted at run time by software. Realizing that multipliers were expensive and that our operators sometimes had zeros in them, we inserted a configurable-depth fifo in the data flow to allow data to simply bubble through on coefficients that were zero. Next, we tiled several PE's together and created buss logic to route data and control signals into the array. Finally, we had to design a general-purpose processor that would allow us to use software to control the flow of data into and out of the array. The processor was a bad design (ie, no pipelining, minimal ops), but it was good enough we could run basic problems that we wrote in assembly. It was sheer joy the first time I compiled an assembly program, hard coded the program into a RAM simulation module, and saw the data bubbling through all the monitoring points in my simulation.
EDA Hassles
The original goal for the class was to design/simulate in the first quarter and then run synthesize/place in an emulation box the second quarter. Unfortunately, the class got hit by numerous EDA problems midway through. We were using Synopsis to synthesize the designs, but the license unexpectedly expired and didn't get renewed until later in the second quarter. Similarly, the place and route tools for the emulation box weren't fully baked, as nobody had really figured out how to handle designs that had to be sliced into multiple FPGAs at that time. We were skeptical about the tools up front, so we spent a lot of time minimizing the amount of work the synthesis tools would have to do. We basically converted our VHDL design into gate-level components as much as possible, leaving only the multiplier for synthesis. When the emulator box started working again, we discovered it just didn't have the capacity to store multiple PEs. Thus, we focused on doing piece-wise demonstrations where we could test out individual components (eg, a multiplier) on the hardware.
Later CompE 4500 classes backed away from designing exotic architectures and instead focused on building traditional CPU designs the whole way through. My friends did a lot of hard things (eg, superscalar, Booth's algorithm) in RTL and got their designs PAR'd on the emulator box. I'm happy though that my class was given some room to try out new ideas.
Reports
The following are the reports we wrote at the end of the first and final quarters:
- Q1 Progress Report: Darrell Stogner and Craig Ulmer, "elRoy: A Systolic Processor Array" Fall 1994 Report
- Q2 Final Report: Darrell Stogner and Craig Ulmer, "elRoy: A Systolic Processor Array", Winter 1995 Report