
Work Bio

Craig Ulmer is a Principal Member of the Technical Staff in the Scalable Modeling and Analysis Systems group at Sandia National Laboratories in Livermore, California. He is the Principal Investigator for I/O work in Sandia's ATDM subprogram, which is now part of the Department of Energy's Exascale Computing Project (ECP). The I/O portion of ATDM is developing new communication software that will enable exascale workflows to route data between applications without having to relay it through the file system. In addition to scientific computing, Craig has had unique research experiences at Sandia involving storage-intensive computing, geospatial data warehousing, the Gov Clouds, cyber security, and custom hardware design.

Prior to joining Sandia, Craig received a Ph.D. in Electrical and Computer Engineering from the Georgia Institute of Technology for his work on low-level communication libraries for cluster computers. This research resulted in GRIM, a flexible message layer for Myrinet that enabled users to efficiently utilize hardware accelerators and multimedia devices distributed throughout a cluster. While attending Georgia Tech, Craig completed internship and co-op assignments at NASA's Jet Propulsion Laboratory, Eastman Kodak's Digital Technology Center, and IBM's EduQuest division.


Current Research Interests

FAODEL: Communication Libraries for Exascale Workflows

The Exascale Computing Project (ECP) is a national effort to overhaul HPC technologies in order to scale scientific computing to new levels. Exascale applications will employ sophisticated workflows that will need to route large amounts of data between different simulation and analysis tools. Current workflow systems largely use the file system as a mechanism for implementing data handoffs and are therefore limited by the performance of storage technologies. While NVMe resources found in today's platforms help, the process of converting application data to file-based representations incurs a high overhead and limits how tightly coupled applications can be.

Our approach to addressing this problem is to develop new data management services that allow applications to interact with each other in a more fluid manner. These services provide simple, object-based abstractions for thinking about distributed datasets and easy-to-use mechanisms for controlling how data migrates between distributed memory, nonvolatile memory, and persistent storage resources. The data management services do not interfere with existing MPI or AMT communication, and allow both intra- and inter-job communication. The software is called FAODEL: Flexible, Asynchronous, Object Data-Exchange Libraries. FAODEL is composed of the following libraries:

  • NNTI: NNTI is a low-level RDMA portability library that allows applications to exchange messages and orchestrate remote memory transfers in an event-driven manner. NNTI supports the InfiniBand, OmniPath, and Cray Aries/Gemini fabrics.
  • Lunasa: Lunasa is a library for dynamically managing network-accessible memory. It helps applications limit how much memory is used for communication and improves allocation/deallocation times.
  • OpBox: OpBox is an asynchronous communication engine that orchestrates complex network operations through user-defined state machines. OpBox provides resource management services and allows users to write fire-and-forget network operations.
  • Kelpie: Kelpie is a distributed key/blob service for controlling how contiguous data objects are transferred between application nodes, distributed-memory nodes, and storage resources. Kelpie provides an asynchronous API that allows an operation to be triggered when an object that is currently missing is later generated somewhere in the system (a sketch of this pattern follows this list).
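
To make the data-exchange model more concrete, here is a minimal sketch of the publish/want pattern described in the Kelpie bullet above. FAODEL itself is written in C++; the Python class and method names below (KeyBlobPool, publish, want) are hypothetical illustrations of the pattern, not the actual Kelpie API.

    import threading

    class KeyBlobPool:
        """Toy in-memory stand-in for a distributed key/blob pool (illustration only)."""
        def __init__(self):
            self._lock = threading.Lock()
            self._blobs = {}       # key -> bytes
            self._waiters = {}     # key -> callbacks waiting for that key to appear

        def publish(self, key, blob):
            """Store a blob and fire any callbacks that were waiting for this key."""
            with self._lock:
                self._blobs[key] = blob
                callbacks = self._waiters.pop(key, [])
            for cb in callbacks:
                cb(key, blob)

        def want(self, key, callback):
            """Ask for a blob asynchronously: the callback runs immediately if the
            blob already exists, or later when some producer publishes it."""
            with self._lock:
                if key in self._blobs:
                    blob = self._blobs[key]
                else:
                    self._waiters.setdefault(key, []).append(callback)
                    return
            callback(key, blob)

    # A consumer registers interest before the data exists...
    pool = KeyBlobPool()
    pool.want("timestep_0042/mesh", lambda k, b: print(f"got {k}: {len(b)} bytes"))

    # ...and a producer later publishes the object, triggering the callback
    # without either side relaying data through the file system.
    pool.publish("timestep_0042/mesh", b"\x00" * 1024)

In the real libraries the blobs live in network-accessible Lunasa memory and a pool can span nodes, jobs, and storage tiers; the sketch only captures the triggering semantics.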

High-Performance Data Analytics (HPDA)

Many projects that I have worked on at Sandia have required a way to store, index, and analyze large amounts of data. While software frameworks such as Hadoop and Accumulo have made it easier to complete these objectives on commodity hardware, it has always been difficult to scale these frameworks up to take advantage of the high-performance resources commonly found on HPC platforms. After running Sandia's first production Hadoop cluster for many years, we began looking at how we could build a better platform to serve the growing needs of our data analytics users. The result is Kahuna, a high-performance data analytics (HPDA) cluster that mixes HPC and big-data ideas. Kahuna provides 120 compute nodes, each with 256GB of memory, 700GB of local NVMe, and 56Gb/s InfiniBand networking. A separate 8-node Ceph cluster provides users with 1.5PB of centralized storage that can be accessed via POSIX, RADOS, and RBD APIs over 10GigE.
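
As a taste of the non-POSIX path, the sketch below writes and reads a Ceph object directly through the python-rados bindings. The pool name, object name, and config-file path are placeholder assumptions, not Kahuna's actual configuration.

    import rados

    # Connect to the Ceph cluster; the conffile path and pool name are
    # site-specific assumptions, not Kahuna's real settings.
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx("hpda_scratch")
        try:
            # Write an object straight into RADOS, bypassing the POSIX layer.
            ioctx.write_full("run42/partition-007", b"intermediate results...")

            # Read it back; stat() returns (size_in_bytes, modification_time).
            size, _mtime = ioctx.stat("run42/partition-007")
            data = ioctx.read("run42/partition-007", length=size)
            print(f"read {len(data)} bytes back from RADOS")
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()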

The main challenge of running Kahuna, though, has been developing an environment where different communities can take advantage of its resources in new ways. We decided to abandon the Hadoop ecosystem and return to using Slurm for resource scheduling, and accepted responsibility for transitioning our big-data users over to this environment. For brute-force users we've had success with GNU Parallel on Slurm as well as with Slurm Job Arrays. For users who depend on specific frameworks (Spark, Jupyter) or services (PostgreSQL, ephemeral NVMe parallel file systems), we have written scripts that launch these dependencies on demand inside a Slurm allocation. Finally, the most important ingredient for a successful HPDA cluster is data: part of our work involves finding and curating new, large datasets that would be of use to our analysts.
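
For the brute-force case, a Slurm job array maps each shard of work onto its own array task. The sketch below generates and submits a small array job; the worker script (process_shard.py) and job parameters are hypothetical placeholders rather than anything tied to Kahuna.

    import subprocess

    # Hypothetical sketch: fan a per-shard worker script out over 100 array tasks.
    batch_lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=brute-force-sweep",
        "#SBATCH --array=0-99",
        "#SBATCH --ntasks=1",
        "#SBATCH --output=sweep_%A_%a.out",
        "",
        "# Each array task processes one shard, selected by its array index.",
        'python process_shard.py --shard "$SLURM_ARRAY_TASK_ID"',
    ]
    with open("sweep.sbatch", "w") as f:
        f.write("\n".join(batch_lines) + "\n")

    # One sbatch call submits all 100 tasks; Slurm schedules them independently.
    subprocess.run(["sbatch", "sweep.sbatch"], check=True)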

Analyzing Airplane Tracks

A personal hobby of mine is collecting and analyzing airplane position datasets gathered from public sources. I initially ran an Amazon scraping campaign to gather data from a popular airplane website, but have since switched to using data from community sites. I collect data for the Bay Area with my own RTL-SDR ADS-B receiver and have written tools to convert the point data into tracks and analyze them. Having long-term data has allowed me to pull out interesting events, such as surveillance missions by both the Russians and the US.
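
As a rough illustration of the point-to-track step, the sketch below groups position reports into tracks by aircraft ID, starting a new track whenever an aircraft goes quiet for longer than a chosen gap. The column names (icao, ts, lat, lon, alt) are an assumed schema, not the exact format my tools use.

    import pandas as pd

    MAX_GAP_S = 15 * 60  # start a new track after a 15-minute pause in reports

    def points_to_tracks(points: pd.DataFrame) -> pd.DataFrame:
        """Group ADS-B position reports into per-aircraft tracks.

        Assumes columns: icao (aircraft hex id), ts (unix seconds), lat, lon, alt.
        """
        pts = points.sort_values(["icao", "ts"]).reset_index(drop=True)
        new_aircraft = pts["icao"].ne(pts["icao"].shift())
        long_gap = pts["ts"].diff() > MAX_GAP_S
        pts["track_id"] = (new_aircraft | long_gap).cumsum()
        return pts

    # A few synthetic reports: one aircraft seen twice (with a long gap), one once.
    reports = pd.DataFrame({
        "icao": ["a1b2c3", "a1b2c3", "a1b2c3", "d4e5f6"],
        "ts":   [0, 30, 4000, 10],
        "lat":  [37.70, 37.75, 38.20, 36.90],
        "lon":  [-121.90, -121.85, -121.20, -121.50],
        "alt":  [12000, 12500, 31000, 8000],
    })
    tracks = points_to_tracks(reports)
    print(tracks.groupby("track_id").size())  # three tracks: 2, 1, and 1 points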



Previous Work

  • FPGA Hardware R&D: A significant part of my early Sandia work focused on leveraging field-programmable gate arrays (FPGAs) as computational accelerators. The initial portion of this work involved chaining 50+ SNL-developed floating-point units together to implement custom computational pipelines that accelerate HPC applications. After learning how to use Xilinx's on-chip network transceivers, I built a SNORT-based network intrusion detection system for GigE, as well as a filter that removed malicious HTTP requests based on their similarity to common attack patterns. Other projects have earned me a classified inventor's award and NNSA recognition for discovering an obscure bug in a flight system through brute-force simulation.
  • Communication Software for Resource-Rich Clusters: My Ph.D. work with Dr. Sudhakar Yalamanchili involved the design and implementation of a low-level communication library named GRIM (General-purpose Reliable In-order Messages). GRIM was unique because it provided a robust way of exchanging data between processors, memory, and peripheral cards distributed throughout a cluster. Custom firmware was developed for the Myrinet Network Interface (NI) that allowed it to serve as a communication broker for the various resources in a host system. GRIM featured a rich set of communication primitives (remote DMA, active message, and NI-based multicast), but still managed to deliver low-latency, high-bandwidth performance.
  • Wireless Sensor Networks: During my first summer internship at JPL, my mentor asked me to work through the logistics of deploying a large number of low-power, wireless sensor network (WSN) nodes on Mars. After interviewing a number of domain experts at JPL, I defined a basic set of requirements for a deployment and worked through the logistics for different deployment strategies (atmosphere scatter, tumbleweeds, hoppers). I then constructed WSN simulators to help explore different strategies for creating a routable network from the nodes. This work demonstrated that a campaign-style election system could partition the network in a distributed manner.



Disclaimer: This information is based entirely on my own views and not my employer's.
Last modified: March 1, 2018.

About

Hello! CraigUlmer.com is a personal website that I use to track the different technical projects I've worked on over my career. I'm a Computer Engineer living on the distant edge of the San Francisco Bay Area. My background is in high-performance communication networks for scientific computing, but I also work on data-intensive problems and custom-built computing architectures.

Publications
LinkedIn
GitHub
CraigUlmer.com