Tuesday, January 16, 2018

Small Size Data Structure Optimization

The usual approach for an expandable "vector" data structure is to use a dynamically allocated array that doubles in size as necessary. That's how C++ std::vector, Java ArrayList, and Go's slice/append work.

But that means the time and space overhead of a heap allocation. And it means indirection and poor locality for CPU caches. (The array itself is contiguous, but you have to chase a pointer to get to it.)

If you have a lot of small lists, one possible optimization is to "embed" a small array inside the list structure. That's not really feasible in Java, but easily done in C++ or Go. One example of this is the small string optimization (SSO) in C++.

But what happens when you grow beyond that small embedded array? The simplest solution is to just stop using it. But that seems wasteful, both in space and in locality.

Another option is to treat the embedded array as the start of the larger overall array, at the cost of complicating the indexing a little. That fixes the space wastage, but we still don't get much benefit from locality since the additions to a vector go at the end.

What if we treat the embedded array as following the used elements of the larger dynamically allocated array? Additions go into the embedded array until it fills up, and then we spill the contents into the larger dynamically allocated array.

If the embedded array holds e.g. 4 elements, then 3/4 of the adds only access the embedded array with good locality. Every 4th addition will block move the entire contents of the embedded array (4 elements) into the larger dynamic array (which will occasionally double in size). So the embedded array acts as a "buffer".
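
To make that concrete, here's a minimal C++ sketch of the idea (the names here, and the use of std::vector for the spill area, are just for illustration, not the code I benchmarked):

#include <vector>

class BufVec {
    static const int BUF = 4;
    int buf[BUF];            // embedded array: additions go here first
    int nbuf = 0;            // how many slots of buf are in use
    std::vector<int> big;    // spilled elements, oldest first
public:
    void add(int x) {
        if (nbuf == BUF) {   // buffer full: block move it into the big array
            big.insert(big.end(), buf, buf + BUF);
            nbuf = 0;
        }
        buf[nbuf++] = x;     // common case: touches only the embedded array
    }
    int size() const { return int(big.size()) + nbuf; }
    int get(int i) const {   // logical order: spilled elements first, then the buffer
        return i < int(big.size()) ? big[i] : buf[i - int(big.size())];
    }
};

Reads have to check which side of the split a logical index falls on, which is the same sort of minor indexing complication as the "embedded array as the start" variant above.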


Merge trees and buffered B-trees have a somewhat similar pattern, in that most of the activity takes place in a small area which periodically spills into the rest of the data structure.

I did some quick benchmarking using Go. My initial result was that it was slightly slower, but did about half as many allocations (because of the embedded small array). The speed was a little disappointing, but I was comparing against a simple Go slice append so my code was admittedly more complex.

Of course, the locality part of it is harder to measure. A small test that fits entirely in cache will not benefit much from locality. Sure enough, if I made the test create many more vectors, then the new version beat a simple Go slice append by about 30%.

The allocation advantage is entirely dependent on how big the vectors are. If they are small enough to fit in the embedded array, then there are zero allocations. If the vectors are very large then the number of allocations approaches the same as the slice append. Of course, the whole reason for the embedded array is that you have a lot of little objects.

One drawback of the embedded array is that it makes the data structure larger. A vector can be as small as a pointer and two integers or pointers, which is reasonable to pass by value. Once you add an embedded array, you probably don't want to pass it by value or store it by value in other vectors or maps. If that means you have to allocate it on the heap, then you won't get as much advantage. However, if you're using it locally or embedding it in a larger data structure, then it's fine.

Sunday, January 07, 2018

A Tale of Two Hash Tables

For some reason I started thinking about hash tables again. I realize only a geek would think so, but I find them pretty cool. How can you not like a data structure whose lookup time is O(1), i.e. independent of size? The other fascinating part is that there are a huge number of different designs for hash tables, each with its own set of tradeoffs.

To beat the critics to the punch ... Did I have a pressing problem that I needed to solve? Nope. Did I benchmark the old code or the new code? Nope. Would it make more sense to just use std::unordered_map and quit reinventing the wheel? Probably. Was it a fun challenge that would give incremental improvements? I think so. I could say it's because I'm semi-retired and can afford to mess around, but I've been doing it all along. What's Suneido but the biggest example of reinventing the wheel? Right or wrong, it's what I enjoy and the way I roll, and I think it's worked out pretty well. I try not to take it too far. Originally I wrote my own garbage collector for cSuneido. I'm not talking about yet-another-reference-counting class; this was a full-blown conservative bit-mapped garbage collector. A fun challenge and probably some of my best work. But when the obviously better Boehm collector came along, I switched over and threw out my own version.

cSuneido uses a pretty standard traditional hash table design - an array of pointers to nodes with separate chaining for collisions. It has worked fine for 20 years. But it has some drawbacks - every insert requires an allocation, there's considerable space overhead (pointers, separately allocated blocks, padding), and all the indirection means it's not cache friendly.

The design of the Go hash table is quite interesting. It uses an array of buckets, each holding e.g. 8 entries. Within a bucket there are separate arrays for keys and values to eliminate padding. Although the initial buckets are allocated in one large array, overflow buckets are chained. It also caches the high byte of hashes to speed up searches.
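
As a rough illustration of that layout (this is my sketch of the design described above, not the actual Go runtime code; the field names are mine):

#include <cstdint>

template <class K, class V>
struct Bucket {
    uint8_t tophash[8];   // cached high byte of each entry's hash
    K keys[8];            // keys grouped together to avoid per-entry padding
    V vals[8];            // values grouped together, likewise
    Bucket* overflow;     // chained overflow bucket, used when this one fills
};

// Simplified lookup along a bucket chain; the real Go code also reserves
// a few tophash values as empty/evacuated markers, which this ignores.
template <class K, class V>
V* find(Bucket<K, V>* b, const K& key, uint8_t top) {
    for (; b != nullptr; b = b->overflow)
        for (int i = 0; i < 8; ++i)
            if (b->tophash[i] == top && b->keys[i] == key)
                return &b->vals[i];
    return nullptr;
}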

I reimplemented the Go hash table when I was working on gSuneido. (I couldn't just use the built-in one because it doesn't allow arbitrary types for keys like Suneido needs.) At that point the Go one was written in C; nowadays it's written in Go (albeit using unsafe).

So I thought I'd rewrite cSuneido's hash table using the Go design. It took me a day or two to code and debug and it worked well.

But I remembered an article I'd read with the provocative title of "I Wrote the Fastest Hashtable". It uses open addressing, linear probing, Robin Hood hashing, prime sizing, and a probe limit. One drawback was that it used a traditional array-of-structs layout, which means wasted space for padding, especially since it includes a one-byte value per slot which is pretty much guaranteed to get padded. But that issue is easy to overcome with an approach similar to the Go design: store the table in "blocks", and within each block use separate arrays for the keys and values (and the one-byte "distance"). However, unlike the Go design, the table is still treated as a single large flat array. The tradeoff is that this slows down indexing slightly.
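
To make the layout concrete, here's a rough sketch (the block size and names are mine, not the actual cSuneido code); presumably the key(i) and distance(i) accessors in the lookup code below resolve to something along these lines:

#include <cstdint>
#include <vector>

const int BLOCK = 64;   // slots per block (illustrative)

template <class K, class V>
struct Block {
    K keys[BLOCK];             // keys, values, and distances in separate arrays
    V vals[BLOCK];             //   within the block, to avoid per-slot padding
    uint8_t distance[BLOCK];   // how far each entry sits from its natural slot
};

// Logical slot i maps to a block and an offset within it; this extra
// divide/modulo is the slight indexing cost mentioned above.
template <class K, class V>
K& key_at(std::vector<Block<K, V>>& blocks, int i) {
    return blocks[i / BLOCK].keys[i % BLOCK];
}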

So I took another day or two and implemented this design. I love the simplicity of the lookup code:

int i = index_for_key(k);                    // start at the key's natural slot
for (int d = 0; d <= distance(i); ++d, ++i)  // Robin Hood: once the resident is closer
    if (HE::keyeq(k, key(i)))                //   to its home than we are, the key
        return i;                            //   can't be any further along
return -1;                                   // not found

You can't get much simpler than that. (Another nice feature of this design is that bounds checks are eliminated by adding extra slots on the end, which is feasible because of the probe limit.)

This design trades off slightly more expensive inserts and deletes for faster lookups. For the usual case of more lookups than updates, that's a good tradeoff. One negative aspect is that resizing the hash table will be slightly more expensive. But that's probably not a large factor because when you're growing a hash table, the new table is sparse and therefore collisions are less frequent.

One feature that makes me a little nervous is that the table grows if an insert exceeds the probe limit, with the probe limit set to log2(n). In most hash tables, if all the keys hash to the same value you end up with a linear search. In this design the table would grow until you ran out of memory or hit a size limit (a hash table denial-of-service vulnerability). But in practice, the prime table sizes make it almost impossible for this to happen.

It took me a bit longer to debug than I expected, given how simple the code is. Insert and delete are a little trickier with Robin Hood hashing because elements get moved around. C++ templates also gave me a bit of grief, as usual.
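
For what it's worth, here's a small self-contained toy (int keys only, no values, fixed table size, no probe limit or resizing) showing the element shuffling that Robin Hood insertion does; it's only meant to illustrate the idea, not to reflect the cSuneido code:

#include <cstdint>
#include <utility>
#include <vector>

const int SIZE = 16;
const uint8_t EMPTY = 0xff;   // sentinel distance marking an unused slot

struct Toy {
    std::vector<int> keys = std::vector<int>(SIZE);
    std::vector<uint8_t> dist = std::vector<uint8_t>(SIZE, EMPTY);

    void insert(int k) {                  // assumes k >= 0 and the table never fills
        int i = k % SIZE;                 // natural slot (trivial "hash")
        uint8_t d = 0;                    // how far k has drifted from home
        for (;;) {
            if (dist[i] == EMPTY) {       // free slot: drop the carried element here
                keys[i] = k;
                dist[i] = d;
                return;
            }
            if (dist[i] < d) {            // resident is closer to home than we are:
                std::swap(k, keys[i]);    //   take its slot and carry it onward
                std::swap(d, dist[i]);
            }
            ++d;
            i = (i + 1) % SIZE;           // linear probing
        }
    }
};

Deletion typically does the reverse, shifting later entries back a slot so the distance invariant still holds.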

Testing with (pseudo) random values I got the expected good results:
- average probes for lookup was 1.3
- average load at resize was .75
- average swaps per insert was 1.3 (although max was 55!)

An average load at resize of .75 means the overall average load is about .56, since the load drops to roughly .375 right after the table doubles and climbs back to .75 before the next resize. Being a bit over half full is fairly typical for hash tables. It's the cost you pay for the performance. On the positive side, this design has only 1 byte of overhead per slot, and no overhead for pointers or padding or separate allocation, so it's better than many hash tables.

Here's a snapshot of the code. It'll end up in the cSuneido repo at some point. Note - it's not a full STL container since that requires a lot of extra code that I don't need. It also assumes that keys and values are simple objects (like integers or pointers) that are fast to assign and copy. See the article linked above for a more general purpose implementation.

Friday, January 05, 2018

A Bittersweet Triumph

I've been fascinated with the game of Go since I was young. As a kid I played a fair bit of chess, never seriously, but I could usually beat my peers. That was balanced by regularly getting thrashed by a 90-year-old in my neighborhood whom I was conscripted to play with. After I discovered Go I pretty much gave up on chess. My business partner and I used to play a lot of Go in the early days of our company. It was a good way to occupy our time on business trips. We were never very good, and almost never played with anyone else, but it was a fun mental challenge. I still play regularly on my iPad against SmartGo Kifu and Crazy Stone (which claims to use deep learning), but only small 9x9 games. (And I'm still not any good!)

One of the things that attracted me to Go was that it wasn't amenable to the same techniques that were used in software to conquer chess. Of course, I don't believe there is any magic in human brains, so presumably there was a way for computers to handle it. Until AlphaGo, most people believed it would be a long time, perhaps a decade, before computers could beat the best human players. So it was a bit of a shock to everyone when AlphaGo recently beat the world's best human players.

I highly recommend the full length (90 min) documentary about AlphaGo - a Go playing computer program developed by DeepMind, a company owned by Google. (Don't be scared off by "AI". The documentary isn't technical, it's more about the people.) It's available on Netflix, iTunes, Google Play, and Amazon.



For more of the technical details, see the Nature articles. They're behind a paywall on the Nature web site, but they seem to be available on Scribd: "Mastering the game of Go with deep neural networks and tree search" and "Mastering the game of Go without human knowledge".

I've never been involved in AI, but it's always fascinated me. It was one of the original reasons I got into computers. In the early days there was a lot of hype about neural networks. But they didn't live up to the hype, and there was a backlash that meant they were downplayed for many years. Recently they've seen a huge comeback and are now living up to many of those original hopes. Partly that's because of advances in hardware like GPUs.

The first AlphaGo was trained partly on human games. Its successor, AlphaGo Zero, learned completely on its own, by playing itself. It surpassed the earlier version in a matter of days. Not long after, the AlphaGo project was ended and the team was disbanded. I guess it's a done deal.