Verifying C++ Compile-Time Hashing

2016-01-16 Sat
cpp code

The other day while writing some RPC code for a network library I started wondering if I could use c++11 features to manage my function handlers. Specifically, I was wondering if I could do some compile-time hashing to uniquely identify each RPC by integer value. Stack Overflow had a great post about how to do it, but my initial tests found the hashes didn't always happen at compile time the way I had naively hoped they would. Using nm and objdump I figured out how to make my specific use case work.

The Problem with RPC Names

One of the nitpicky things that bugs me when I write remote procedure calls (RPCs) for a C++ network library of ours is deciding on a way to uniquely label each RPC with an integer value. Some network libraries I've used just punt and tell you to pound define each identifier, which sooner or later results in a collision when you merge code from different people. Other libraries use string labels and figure out how to translate strings to integers at run time (possibly as part of connection handshaking). These are a pain as well, as there's extra work that has to happen at init time.

What I'd really like is to use string labels for RPCs, but have the labels all hash to integer values at compile time. I'd like the hash value to be a constant associated with an RPC class that doesn't need to be set every time you create a new instance of the RPC. The hash can be weak, because I can detect collisions at init time when each RPC is registered and tell users to come up with a different name (or hash).

Prior to C++11 (or maybe C++0x), I don't think it was possible to have the compiler hash a string at compile time. You could plug in macros to hash the name, but as far as I know, you'd get code that at best executed the hash at run time every time you wanted the hash.

Compile-time Hash on Stack Overflow

I'm not the only one that wanted to do compile-time hashes. This stack overflow question asked exactly what I'd been wondering- can C++11 be used to hash a string at compile time. Their answer was yes- under certain circumstances, you can craft a constexpr that did the work. Specifically, one of the examples implemented a simple hash by recursively stepping through an array:

//From stack overflow 2111667
unsigned constexpr const_hash(char const *input) {
  return *input ?         
           static_cast(*input) + 33 * 
                                     const_hash(input + 1) :
           5381;
}

People seemed to agree that this should work, but I wanted proof that the hashing code was being done at compile time (especially since the hash now uses a recursive function). I wrote some simple tests and used objdump to look at the assembly g++ 4.8.2 was generating.

First Failed Test

My first test just looked at what happened if you just plugged a string into the hash and printed it. For example:

int main(){
  unsigned x = const_hash("bozo"); 
  cout << hex << x << endl;
}  

The code prints out the hash 7c9c033f, so the hash is being done. However, nm shows that the hash function is still in there:

$ g++ -std=c++11 bozo1.cpp -o bozo1
$ ./bozo1
7c9c033f

$ nm -g -C bozo1 | grep hash
000000000040098e W const_hash(char const*)

$ nm -g  bozo1 | grep hash
000000000040098e W _Z10const_hashPKc

Looking at the assembly for main, you can see the hash function is getting executed at runtime:

00000000004007e0 
: 4007e0: 55 push %rbp 4007e1: 48 89 e5 mov %rsp,%rbp 4007e4: 48 83 ec 10 sub $0x10,%rsp 4007e8: bf 9b 0a 40 00 mov $0x400a9b,%edi 4007ed: e8 9c 01 00 00 callq 40098e <_Z10const_hashPKc>

Declaring the string as a const or constexpr didn't help.

Progress with a Switch Example

Since it didn't look like I could just naively use the hash function to get what I wanted, I started looking at the original use case the s/o poster was after. The poster asked about compile-time hashes because they wanted to use them for cases in a switch statement. It's a useful thing to do when you write state machines- often you'd like to give each state a printable label, but at the same time, you'd like to boil the labels down to hash values at compile time to speed things up. I updated my example to call to use the hashes in a switch:

int testit(string s){
  switch(const_hash(s.c_str())){
  case const_hash("bozo1"): return 1;
  case const_hash("bozo2"): return 2;
  default: 
    cout <<"Unknown" << endl;
  } 
} 
int main(){
  cout << hex << const_hash("bozo1") << endl;
  cout << hex << const_hash("bozo2") << endl;
  return testit("bozo1");
}

When running the program you see the bozo1 and bozo2 hashes are b886f27 and b889adf. Looking at the assembly for the testit function you see it calls the hash function on the input parameter (because the input could be anything), but each case operation's comparison has been hardwired to use the actual hash values that were generated at compile time:

0000000000400ae0 <_Z6testitSs>:
  400ae0:	55                   	push   %rbp
  400ae1:	48 89 e5             	mov    %rsp,%rbp
  400ae4:	48 83 ec 10          	sub    $0x10,%rsp
  400ae8:	48 89 7d f8          	mov    %rdi,-0x8(%rbp)
  400aec:	48 8b 45 f8          	mov    -0x8(%rbp),%rax
  400af0:	48 89 c7             	mov    %rax,%rdi
  400af3:	e8 d8 fd ff ff       	callq  4008d0 <_ZNKSs5c_strEv@plt>
  400af8:	48 89 c7             	mov    %rax,%rdi
  400afb:	e8 8e 02 00 00       	callq  400d8e <_Z10const_hashPKc>
  400b00:	3d 27 6f 88 0b       	cmp    $0xb886f27,%eax
  400b05:	74 09                	je     400b10 <_Z6testitSs+0x30>
  400b07:	3d df 9a 88 0b       	cmp    $0xb889adf,%eax
  400b0c:	74 09                	je     400b17 <_Z6testitSs+0x37>
  ...

Use as an ID in an RPC Class

The final step for me was just to plug this trick into something that resembled the RPC code I wanted. In this code, I have a base class to define a generic RPC, and derived classes for each RPC. All I want is to have each one to be able to provide a static integer identifier back when asked, and not have anything hash at runtime.

class A {
public:
  virtual unsigned GetID() const = 0;
};

class B : public A {
public:
  unsigned GetID() const { return my_hash_id; }
  const static unsigned my_hash_id;
};
const unsigned B::my_hash_id = const_hash("MyThing");

int main(){
  A *b=new B(); 
  cout << hex << b->GetID()    << endl;
  cout << hex << B::my_hash_id << endl;
}

Grepping through nm's output, I saw that const_hash wasn't included in the executable. Looking at the assembly I see GetID is a function that returns the hash value and B::my_hash_id is hardwired to the hash value. This behavior is sufficient for what I need. I'll use the static my_hash_id when I want to register a new RPC during initialization (which will also check for hash collisions), and then use GetID() if I need to figure out what RPC I really am if I was only handed a base class pointer. It really is overkill for what I need, but it's nice to be able to reference the RPCs without having to generate unique integer IDs myself.