Saturday, August 28, 2010

Fast unsigned byte lexicographical comparator

A common order used in an English dictionary is called the lexicographical order (or the alphabetic order). The same order can be used to sort binary data where bytes are letters and arrays of bytes are words. It is in fact commonly used to sort binary data or sequences of bytes (e.g., a compressed data or binary-encoded data).

Those bytes are usually unsigned. However, there is no standard routine in Java for sorting an array of bytes in the unsigned lexicographical order because Java's byte type is signed. The closest thing is the comparator of java.nio.ByteBuffer. Unfortunately, it is based on the signed byte order.

The other day, with my colleagues, I contributed a fast unsigned byte lexicographical comparator for Java as part of the Guava libraries. It turned out to be up to 3.9 times on x86 and 5.5 times on amd64 faster than a typical Java implementation in our microbenchmark.

The trick is to use sun.misc.Unsafe to compare eight bytes at a time by reading eight elements of a byte array as a single long value, instead of comparing byte by byte. In general, the same technique is often used for string comparison, string copying, substring search, etc. I know it's unsafe to use sun.misc.Unsafe, but when it's used wisely, it gives a major performance gain, which is not possible in plain Java.

What if sun.misc.Unsafe doesn't exist in your JVM? The Guava implementation transparently degrades to a plain (and slow) Java implementation when the underlying JVM does not have sun.misc.Unsafe. No worries.

Thanks to my colleagues, Martin Buchholz and Kevin Bourrillion.