Craig Ulmer is a Principal Member of the Technical Staff in the Scalable Modeling and Analysis Systems group at Sandia National Laboratories in Livermore, California. He is the Principal Investigator for I/O work in Sandia's ATDM subprogram, which is now part of the Department of Energy's Exascale Computing Project (ECP). The I/O portion of ATDM is developing new communication software that will enable exascale workflows to route data between applications without having to relay it through the file system. In addition to scientific computing, Craig has had unique research experiences at Sandia involving storage-intensive computing, geospatial data warehousing, the Gov Clouds, cyber security, and custom hardware design.
Prior to joining Sandia, Craig received a Ph.D. in Electrical and Computer Engineering from the Georgia Institute of Technology for his work on low-level communication libraries for cluster computers. This research resulted in a flexible message layer for Myrinet named GRIM that enabled users to efficiently utilize hardware accelerators and multimedia devices distributed throughout a cluster. While attending Georgia Tech, Craig completed internship and co-op assignments at NASA's Jet Propulsion Laboratory, Eastman Kodak's Digital Technology Center, and IBM's EduQuest division.
Current Research Interests
FAODEL: Communication Libraries for Exascale Workflows
The Exascale Computing Project (ECP) is a national effort to overhaul HPC technologies in order to scale scientific computing to new levels. Exascale applications will employ sophisticated workflows that will need to route large amounts of data between different simulation and analysis tools. Current workflow systems largely use the file system as a mechanism for implementing data handoffs and are therefore limited by the performance of storage technologies. While NVMe resources found in today's platforms help, the process of converting application data to file-based representations incurs a high overhead and limits how tightly coupled applications can be.
Our approach to addressing this problem is to develop new data management services that allow applications to interact with each other in a more fluid manner. These services provide simple, object-based abstractions for thinking about distributed datasets and easy-to-use mechanisms for controlling how data migrates between distributed memory, nonvolatile memory, and persistent storage resources. The data management services do not interfere with an application's existing MPI or AMT communication, and they support both intra- and inter-job data exchange. The software is called FAODEL: Flexible, Asynchronous, Object Data-Exchange Libraries. FAODEL is composed of a collection of smaller libraries that together provide these capabilities.
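To make this concrete, here is a minimal sketch of the kind of object-based publish/request interface described above. It is written in Python purely for illustration; the names (ObjectPool, publish, need) and the in-process dictionary behind them are assumptions made for this sketch and are not FAODEL's actual API, which targets distributed and nonvolatile memory rather than a single process.

    # Toy sketch of an object-pool abstraction: producers publish named
    # binary objects and consumers request them by key. In a real service
    # the pool would span distributed/nonvolatile memory across jobs; here
    # a plain dictionary stands in so the example is self-contained.
    class ObjectPool:
        def __init__(self, name):
            self.name = name
            self._objects = {}

        def publish(self, key, blob):
            """Make a binary object visible to other users of the pool."""
            self._objects[key] = bytes(blob)

        def need(self, key):
            """Retrieve an object by name (a real service would block
            until the object has been published)."""
            return self._objects[key]

    # Producer side: a simulation publishes one timestep's field data.
    pool = ObjectPool("dht:/my_workflow/results")
    pool.publish("timestep_0042/pressure", b"\x00" * 1024)

    # Consumer side: an analysis tool asks for the object by name,
    # with no file-based representation ever being created.
    data = pool.need("timestep_0042/pressure")
    print(len(data), "bytes received")

The point of the abstraction is that the producer and consumer only have to agree on object names, not on file formats or paths through the file system.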
High-Performance Data Analytics (HPDA)
Many projects that I have worked on at Sandia have required a way to store, index, and analyze large amounts of data. While software frameworks such as Hadoop and Accumulo have made it easier to accomplish these tasks on commodity hardware, it has always been difficult to scale these frameworks up to take advantage of the high-performance resources commonly found on HPC platforms. After running Sandia's first production Hadoop cluster for many years, we began looking at how we could build a better platform to serve the growing needs of our data analytics users. The result is Kahuna, a high-performance data analytics (HPDA) cluster that mixes HPC and big-data ideas. Kahuna provides 120 compute nodes, each with 256GB of memory, 700GB of local NVMe, and 56Gb/s InfiniBand networking. A separate 8-node Ceph cluster provides users with 1.5PB of centralized storage that can be accessed via POSIX, RADOS, and RBD APIs over 10GigE.
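As a small illustration of the object-level access path, the following sketch uses the python-rados bindings to write and read one object in a Ceph pool. The pool name ("analytics"), object name, and configuration path are placeholders rather than Kahuna's actual settings.

    # Sketch: store and fetch one object through Ceph's RADOS API using
    # the python-rados bindings. Pool name and conf path are placeholders.
    import rados

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("analytics")      # I/O context for one pool
        try:
            ioctx.write_full("example_object", b"some raw records")
            blob = ioctx.read("example_object")      # read the object back
            print(len(blob), "bytes read from RADOS")
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()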
The main challenge of running Kahuna, though, has been developing an environment where different communities can take advantage of its resources in new ways. We decided to abandon the Hadoop ecosystem and return to using Slurm for resource scheduling, and we accepted responsibility for transitioning our big-data users to this environment. For brute-force users, we've had success with GNU Parallel on Slurm as well as with Slurm job arrays. For users who depend on specific frameworks (Spark, Jupyter) or services (PostgreSQL, ephemeral NVMe parallel file systems), we have written scripts that launch these dependencies on demand inside a Slurm allocation. Finally, the most important ingredient for a successful HPDA cluster is data. Part of our work involves finding and curating new, large datasets that would be of use to our analysts.
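The on-demand service launchers follow a simple pattern: discover the hosts in the current Slurm allocation, dedicate one of them to the service, and hand the endpoint back to the user. The sketch below shows that pattern for a PostgreSQL server; it illustrates the approach rather than reproducing our actual scripts, and the data directory and port are placeholders.

    # Sketch: start a PostgreSQL server on the first node of a Slurm
    # allocation. Standard Slurm and PostgreSQL commands are used, but the
    # data directory (which must already be initialized with initdb) and
    # the port are placeholders.
    import os
    import subprocess

    def allocation_hosts():
        """Expand $SLURM_JOB_NODELIST into a list of hostnames."""
        nodelist = os.environ["SLURM_JOB_NODELIST"]
        out = subprocess.run(["scontrol", "show", "hostnames", nodelist],
                             capture_output=True, text=True, check=True)
        return out.stdout.split()

    def start_postgres(host, datadir="/local/nvme/pgdata", port=5432):
        """Run postgres in the foreground on one node; the server lives
        for as long as the allocation (or until the step is cancelled)."""
        proc = subprocess.Popen(["srun", "--nodes=1", "--ntasks=1", "-w", host,
                                 "postgres", "-D", datadir, "-p", str(port)])
        return proc, f"{host}:{port}"

    hosts = allocation_hosts()
    server, endpoint = start_postgres(hosts[0])   # first node hosts the service
    print("PostgreSQL available at", endpoint)
    # hosts[1:] remain free for the analysis tasks that use the database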
Analyzing Airplane Tracks
Disclaimer: This information is based entirely on my own views and not my employer's.
Last modified: March 1, 2018.
Hello! CraigUlmer.com is a personal website that I use to track different technical projects that I've worked on over my career. I'm a Computer Engineer living on the distant edge of the San Francisco Bay Area. My background is in high-performance communication networks for scientific computing, but I also work on data-intensive problems and custom-built computing architectures.