Saturday, August 28, 2010

Fast unsigned byte lexicographical comparator

A common order used in an English dictionary is called the lexicographical order (or the alphabetic order). The same order can be used to sort binary data where bytes are letters and arrays of bytes are words. It is in fact commonly used to sort binary data or sequences of bytes (e.g., a compressed data or binary-encoded data).

Those bytes are usually unsigned. However, there is no standard routine in Java for sorting an array of bytes in the unsigned lexicographical order because Java's byte type is signed. The closest thing is the comparator of java.nio.ByteBuffer. Unfortunately, it is based on the signed byte order.

The other day, with my colleagues, I contributed a fast unsigned byte lexicographical comparator for Java as part of the Guava libraries. It turned out to be up to 3.9 times on x86 and 5.5 times on amd64 faster than a typical Java implementation in our microbenchmark.

The trick is to use sun.misc.Unsafe to compare eight bytes at a time by reading eight elements of a byte array as a single long value, instead of comparing byte by byte. In general, the same technique is often used for string comparison, string copying, substring search, etc. I know it's unsafe to use sun.misc.Unsafe, but when it's used wisely, it gives a major performance gain, which is not possible in plain Java.

What if sun.misc.Unsafe doesn't exist in your JVM? The Guava implementation transparently degrades to a plain (and slow) Java implementation when the underlying JVM does not have sun.misc.Unsafe. No worries.

Thanks to my colleagues, Martin Buchholz and Kevin Bourrillion.

Saturday, June 26, 2010

Joined the OpenJDK Hotspot group

Back in April, I was voted in and joined the OpenJDK HotSpot group. Here's a link to the call-for-votes email thread:

Here's the list of people in various groups of the OpenJDK project:

I thank Chuck Rasbold, who coordinated and hosted the voting process, and those who voted for my membership.

Short/Character.reverseBytes compiler intrinsics

I worked on a small improvement on the Hotspot server compiler.

Here's the mailing list thread:

Here's the committed change:

The Short/Character.reverseBytes() methods swap the bytes within the 2 byte short or char values. This is commonly done when, for a certain reason, one needs to convert a value from a big endian to little endian, or vice versa. The Short.reverseBytes() method looks like:

public static short reverseBytes(short i) {
  return (short) (((i & 0xFF00) >> 8) | (i << 8));

Ordinarily, this method is most likely going to be compiled into machine instruction sequences including an logical AND and two shift instructions. Since there is a machine instruction on x86 that does the swapping (BSWAP), why not let the compiler generate such code as a special code generation rule (called compiler intrinsics). In fact, the Integer/Long.reverseBytes() methods already took advantage of the instruction. This improvement is similar. The intrinsified Short.reverseBytes() looks like on x86 (32 bit):

BSWAP  reg 
SAR    reg, 16

Suppose the register contained four bytes B1 B2 B3 B4 (from the highest to the lowest on 32 bit). Since it was a short value (signed 2-byte integer), it must be either 0x00 0x00 B3 B4 (if the highest bit of B3 is 0) or 0xFF 0xFF B3 B4 (if the highest bit of B3 is 1). What we want eventually is either 0x00 0x00 B4 B3 (if the highest bit of B4 is 0) or 0xFF 0xFF B4 B3 (if B4's highest bit is 1).

The BSWAP instruction reverses the byte order of the word (four bytes) in register reg and produces B4 B3 B2 B1 in the register. Then, the SAR instruction shifts the word to the right by 2 bytes so that the lower two bytes are B4 B3. Since SAR fills the higher bits with the same bit as B4's highest bit, we get what we want.

Character.reverseBytes() are likewise, except that it needs to be treated unsigned.

In a microbenchmark bundled with the JDK, jdk/test/java/nio/Buffer/SwapMicroBenchmark, the char and short byte swap methods did speed up by 32% and 42%, respectively, with this improvement.

Thanks to colleagues, Chuck Rasbold (putting up the webrev on my behalf), Martin Buchholz (who worked on the JDK change to make it effective), Tom Rodriguez (who reviewed the change and finished the SPARC implementation), Christian Thalinger (who also reviewed the change).

Tuesday, February 9, 2010

Shorter long multiplication in Server Compiler on x86 32 bit

The other day, I contributed a patch that lets Hotspot Server compiler emit shorter multiplication sequence for long (64bit) multiplications. Here are some links:

The patch + the commit message:

Sun bug:

Mailing list posts:

It applies well to the BigInteger.mulAdd loop:

  private final static long LONG_MASK = 0xffffffffL;
  static int mulAdd(int[] out, int[] in, int offset, int len, int k) {
    long kLong = k & LONG_MASK;
    long carry = 0;
    offset = out.length-offset - 1;
    for (int j=len-1; j >= 0; j--) {
      long product = (in[j] & LONG_MASK) * kLong +
          (out[offset] & LONG_MASK) + carry;
      out[offset--] = (int)product;
      carry = product >>> 32;
    return (int)carry;

It improved an internal microbenchmark that uses BigInteger.mulAdd by 12%, and crypto.rsa and crypto.signverify in SPECjvm2008 by 7% and 2.3%, respectively in my measurements.

Here's low level details. On x86 (32 bit), a 64 bit long multiplication is implemented as a sequence of multiple 32 bit multiplications as in:
  MOV    EBP,ECX.lo
  MOV    EDX,ECX.hi

This is based on the following idea:

Result = x * y

is implemented as

lo(result) = lo(x_lo * y_lo)

hi(result) = hi(x_lo * y_lo) + lo(x_hi * y_lo) + lo(x_lo * y_hi)

where lo(x) indicates the lower 32 bits of x and hi(x) the higher 32 bits of x.

In the above mulAdd code, the higher 32 bits of both of operands are known to be zero because of the '&' operations. In such cases, some 32 bit multiplications are redundant. In the above formula, if x_hi=0 and y_hi=0,

lo(result) = lo(x_lo * y_lo)
hi(result) = hi(x_lo * y_lo)

because lo(x_hi * y_lo) = 0 and lo(x_lo * y_hi) = 0.

In this case, we need to emit only


which is one instruction instead of 7 and is faster to execute. There are variations of this where only one of x_hi and y_hi is zero.

I thank Chuck Rasbold, Tom Rodriguez, Christian Thalinger, and Vladimir Kozlov who reviewed this change.