Hiroshi Yamauchi: February 2010

The other day, I contributed a patch that lets Hotspot Server compiler emit shorter multiplication sequence for long (64bit) multiplications. Here are some links:

The patch + the commit message: http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/e8443c7be117

Sun bug:
http://bugs.sun.com/view_bug.do?bug_id=6921969

Mailing list posts:http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-January/002639.html

http://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2010-February/002660.html

It applies well to the BigInteger.mulAdd loop:

  private final static long LONG_MASK = 0xffffffffL;
  static int mulAdd(int[] out, int[] in, int offset, int len, int k) {
    long kLong = k & LONG_MASK;
    long carry = 0;
    offset = out.length-offset - 1;
    for (int j=len-1; j >= 0; j--) {
      long product = (in[j] & LONG_MASK) * kLong +
          (out[offset] & LONG_MASK) + carry;
      out[offset--] = (int)product;
      carry = product >>> 32;
    }
    return (int)carry;
  }

It improved an internal microbenchmark that uses BigInteger.mulAdd by 12%, and crypto.rsa and crypto.signverify in SPECjvm2008 by 7% and 2.3%, respectively in my measurements.

Here's low level details. On x86 (32 bit), a 64 bit long multiplication is implemented as a sequence of multiple 32 bit multiplications as in:

  MOV    EBP,ECX.lo
  IMUL   EBP,EDX
  MOV    EDX,ECX.hi
  IMUL   EDX,EAX
  ADD    EBP,EDX
  MUL    EDX:EAX,ECX.lo
  ADD    EDX,EBP

This is based on the following idea:

Result = x * y

is implemented as

lo(result) = lo(x_lo * y_lo)

hi(result) = hi(x_lo * y_lo) + lo(x_hi * y_lo) + lo(x_lo * y_hi)

where lo(x) indicates the lower 32 bits of x and hi(x) the higher 32 bits of x.

In the above mulAdd code, the higher 32 bits of both of operands are known to be zero because of the '&' operations. In such cases, some 32 bit multiplications are redundant. In the above formula, if x_hi=0 and y_hi=0,

lo(result) = lo(x_lo * y_lo)
hi(result) = hi(x_lo * y_lo)

because lo(x_hi * y_lo) = 0 and lo(x_lo * y_hi) = 0.

In this case, we need to emit only

MUL    EDX:EAX,ECX.lo

which is one instruction instead of 7 and is faster to execute. There are variations of this where only one of x_hi and y_hi is zero.

I thank Chuck Rasbold, Tom Rodriguez, Christian Thalinger, and Vladimir Kozlov who reviewed this change.

Hiroshi Yamauchi

Tuesday, February 9, 2010

Shorter long multiplication in Server Compiler on x86 32 bit