Tuesday, September 3, 2013

OpenJDK at Google

At the JVM Language Summit 2013, Jeremy Manson gave a talk on some of the work we did around OpenJDK at Google:

  Slides: http://www.oracle.com/technetwork/java/jvmls2013manson-2013920.pdf
  Video: http://medianetwork.oracle.com/video/player/2630310903001

It covers some of my profiling and garbage collection work (on slides 29, 36 and 37):

- Making the stack trace decoding code more robust and precise, and adding support for native stack frames in the low-overhead AsyncGetCallTrace-based profiler.

- Parallel full GC for the CMS garbage collector. 2-4x pause time improvements in full GC.

- Parallel initial-mark and remark phases in the CMS garbage collector. 2-4x pause time improvements.

- Partial heap defragmentation/compaction for the CMS garbage collector. Reduces full GCs caused by heap fragmentation by up to 90% (or eliminates them completely.)

- Giving back unused RAM to the system. 20-30% RAM savings.

Thursday, August 29, 2013

Parallel initial mark (CMSParallelInitialMarkEnabled) and more-parallel remark (CMSEdenChunksRecordAlways) phases in the CMS garbage collector

The Concurrent Mark Sweep (CMS) garbage collector in the Hotspot JVM has two stop-the-world pauses in its algorithm: the initial mark and the remark phases. The initial mark phase starts the concurrent mark process by marking the objects in the old generation that are directly (as opposed to transitively) reachable from the GC roots (such as local variables, static fields, and the objects in the young generation.) The remark phase finishes the concurrent mark process by rescanning the objects in the GC roots and the objects in the dirtied cards.

One issue that I had with the CMS garbage collector was that the pause times of these two phases often spiked to 500 milliseconds or more, which is rather long, especially when it's desirable to have server request latency that's much shorter than that.

A reason for the long initial mark pauses is that the initial mark code is not parallelized. Given the general trend of increasing Java heap sizes, and because the initial mark phase needs to scan the entire young generation (whose size is typically proportional to the heap size,) the existing single-threaded initial mark phase tends to result in long pauses even on modern processors.

Similarly, one reason for the long remark pauses is that while the remark code is already parallelized (handled by multiple threads and CPU cores), its existing algorithm has a flaw: the workload distribution among the GC worker threads during the young generation scan sometimes becomes unbalanced. When the parallel workload is unbalanced, the worker threads with less work sit idle waiting for the ones with more work, and the overall time to finish the entire workload gets longer.

I recently made OpenJDK contributions that fix those issues. I implemented a parallel version of the initial mark phase (the CMSParallelInitialMarkEnabled flag/option) and a version of the remark phase with a more even workload distribution (the CMSEdenChunksRecordAlways flag/option). With these contributions, the initial mark/remark pause times get shorter by a factor of 5 or more. For example, in a test with a 1 GB young generation (within a 3 GB heap), the pause times stayed below 100 milliseconds, compared to 500 milliseconds or more without these contributions.
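For reference, on a JDK build that includes these changes, the two flags can be enabled on the command line along with GC logging to observe the pause times. This is a sketch: MyApp is a placeholder, and the heap sizes simply match the test above.

```shell
# Enable CMS with the parallel initial mark and the always-on eden chunk
# recording (for better remark workload distribution); log GC details so
# the initial mark/remark pause times are visible.
java -XX:+UseConcMarkSweepGC \
     -XX:+CMSParallelInitialMarkEnabled \
     -XX:+CMSEdenChunksRecordAlways \
     -Xmx3g -Xmn1g \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     MyApp
```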

A more detailed description is here:


After several long email threads that are archived at:

The patches have been accepted into the OpenJDK Hotspot code base in these changesets:

  http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/ca9dedeebdec (parallel initial mark)
  http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/7b06ae405d7b (better parallelized remark)
  http://hg.openjdk.java.net/hsx/hotspot-gc/hotspot/rev/cb9da55b1990 (a bug fix, authored by Jon)

Hopefully, with these changes, your initial/remark pause times will be shorter than before, especially in your big-heap Java applications.

Thanks to Jon Masamitsu, Thomas Schatzl, and Chuck Rasbold for sponsoring and/or reviewing the patches.

Friday, June 21, 2013

Making the JVM release memory

It's well known that a Java application (the JVM) typically won't release much memory once it has warmed up, even when the application later becomes lightly loaded or even idle.

If you plot the memory usage (or the resident set size) of a Java application, it typically looks like a mostly flat line after an upslope at the beginning. At a low level, this corresponds to the memory pages of the Java heap gradually getting allocated, and once all the pages are allocated, the memory usage stays mostly flat even when a large portion of the heap is not used.* **

This can be a problem if a Java application is run on a non-dedicated system (a server or desktop) where it co-exists with other (non-Java) applications. In a non-dedicated system, one application that's not playing nice with others by dominating the memory can slow down the other applications, or prevent them from running.

This is where an experimental JVM feature, DeallocateHeapPages, that I worked on comes in. It causes the underlying memory pages that correspond to the unused (free) parts of the heap to be deallocated (released) and helps reduce the memory usage of a Java application. Internally, it calls the system call madvise(MADV_DONTNEED) for the bodies of free chunks in the old generation without unmapping the heap address space.

Another way to look at this is that this feature makes the memory usage of a Java application behave more like that of a C/C++ application where the process memory usage is more in line with the memory actually used by the application.

This has been very useful for servers and desktop tools that we have at Google and helped save a lot of memory (RAM) usage.

The implementation currently supports the concurrent mark sweep (CMS) collector and the Linux platform.

Here's the email thread on the OpenJDK mailing list and a link to the JVM patch:



The patch hasn't been accepted (yet) because support for all the other OS platforms, which it currently lacks, is deemed necessary for acceptance. I might be able to address that at some point, if I have the time and resources to make it happen.

* For simplicity, I am ignoring memory used for things other than the heap, such as the native C heap and the thread stacks, as the heap usually accounts for by far the largest amount of memory.

** Though the serial garbage collector (-XX:+UseSerialGC) of the JVM can occasionally shrink the heap and return memory, it's almost never used in production for obvious performance reasons. The parallel collector and the concurrent mark sweep (CMS) collector, which are often used in production, almost never shrink the heap and return memory, in my experience.