Thursday, August 29, 2013

Parallel initial mark (CMSParallelInitialMarkEnabled) and more-parallel remark (CMSEdenChunksRecordAlways) phases in the CMS garbage collector

The Concurrent Mark Sweep (CMS) garbage collector in the HotSpot JVM has two stop-the-world pauses in its algorithm: the initial mark and the remark phases. The initial mark phase starts the concurrent mark process by marking the objects in the old generation that are directly (as opposed to transitively) reachable from the GC roots (such as local variables, static fields, and the objects in the young generation). The remark phase finishes the concurrent mark process by rescanning the GC roots and the objects in the dirtied cards.

One issue that I had with the CMS garbage collector was that the pause times of these two phases often spiked to 500 milliseconds or more, which is rather long, especially when the desired server request latency is much shorter than that.

One reason for the long initial mark pauses is that the initial mark code is not parallelized. Java heap sizes keep growing, and the initial mark phase needs to scan the entire young generation (whose size is typically a proportion of the heap size), so the existing single-threaded initial mark phase tends to result in long pauses even on modern processors.
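To make the idea concrete, here is a toy sketch of a single-threaded scan versus a chunked parallel scan. It is not the HotSpot code: the names (InitialMarkSketch, youngRefs, markSerial, markParallel) and the model of the young generation as an array of old-generation indices are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Toy model (hypothetical, not HotSpot code): youngRefs[i] is the index of the
// old-generation object that young object i points to, or -1 if it points to none.
class InitialMarkSketch {

    // Single-threaded scan: one thread walks the whole young generation.
    static int[] markSerial(int[] youngRefs, int oldSize) {
        int[] marks = new int[oldSize];
        for (int ref : youngRefs) {
            if (ref >= 0) marks[ref] = 1;
        }
        return marks;
    }

    // Parallel scan: the young generation is cut into one chunk per worker,
    // and every worker marks the old objects referenced from its own chunk.
    static int[] markParallel(int[] youngRefs, int oldSize, int workers) {
        AtomicIntegerArray marks = new AtomicIntegerArray(oldSize);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            int chunk = (youngRefs.length + workers - 1) / workers;
            for (int w = 0; w < workers; w++) {
                final int start = Math.min(w * chunk, youngRefs.length);
                final int end = Math.min(start + chunk, youngRefs.length);
                tasks.add(pool.submit(() -> {
                    for (int i = start; i < end; i++) {
                        if (youngRefs[i] >= 0) marks.set(youngRefs[i], 1);
                    }
                }));
            }
            for (Future<?> task : tasks) task.get(); // wait for every worker
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel scan failed", e);
        } finally {
            pool.shutdown();
        }
        int[] result = new int[oldSize];
        for (int i = 0; i < oldSize; i++) result[i] = marks.get(i);
        return result;
    }

    // Builds a reproducible random "young generation" for the demo.
    static int[] randomRefs(int youngSize, int oldSize, long seed) {
        Random rnd = new Random(seed);
        int[] refs = new int[youngSize];
        for (int i = 0; i < youngSize; i++) {
            refs[i] = rnd.nextInt(4) == 0 ? -1 : rnd.nextInt(oldSize);
        }
        return refs;
    }

    public static void main(String[] args) {
        int[] refs = randomRefs(1_000_000, 50_000, 42L);
        int[] serial = markSerial(refs, 50_000);
        int[] parallel = markParallel(refs, 50_000, 4);
        System.out.println("same marks: " + java.util.Arrays.equals(serial, parallel));
    }
}
```

Both versions mark the same set of objects. The only difference is that the parallel one spreads the scan over several threads, which is what shortens the pause when the young generation is large.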

Similarly, one reason for the long remark pauses is that, while the remark code is already parallelized (handled by multiple threads and CPU cores), its existing algorithm has a flaw: the workload distribution among the GC worker threads during the young generation scan sometimes becomes unbalanced. When the parallel workload is unbalanced, the worker threads with less work sit idle waiting for the ones with more work, and the overall time to finish the entire workload gets longer.
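The effect of an unbalanced workload can be shown with a little arithmetic: the pause lasts as long as the busiest worker. The sketch below uses made-up work units and hypothetical names (RemarkBalanceSketch, pause, rebalance); it only illustrates the reasoning and is not how HotSpot splits the work.

```java
import java.util.Arrays;

// Toy arithmetic (hypothetical, not HotSpot code): each array entry is the
// amount of scanning work assigned to one GC worker thread.
class RemarkBalanceSketch {

    // All workers must finish before the pause ends, so the pause length
    // is the largest per-worker workload.
    static long pause(long[] workPerWorker) {
        return Arrays.stream(workPerWorker).max().orElse(0L);
    }

    // Spreads the same total work as evenly as possible over the same workers.
    static long[] rebalance(long[] workPerWorker) {
        int n = workPerWorker.length;
        long[] balanced = new long[n];
        if (n == 0) return balanced;
        long total = Arrays.stream(workPerWorker).sum();
        long base = total / n;
        long remainder = total % n;
        for (int i = 0; i < n; i++) {
            balanced[i] = base + (i < remainder ? 1 : 0);
        }
        return balanced;
    }

    public static void main(String[] args) {
        long[] unbalanced = {700, 100, 100, 100}; // one worker got most of the work
        System.out.println("unbalanced pause: " + pause(unbalanced));          // prints 700
        System.out.println("balanced pause: " + pause(rebalance(unbalanced))); // prints 250
    }
}
```

The total work is the same in both cases (1000 units), but the balanced split finishes in 250 units instead of 700 because no worker sits idle.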

I recently made OpenJDK contributions that fix those issues. I implemented a parallel version of the initial mark phase (the CMSParallelInitialMarkEnabled flag/option) and a version of the remark phase that distributes the workload more evenly (the CMSEdenChunksRecordAlways flag/option). With these contributions, the initial mark and remark pause times get shorter by a factor of 5 or more. For example, in a test with a 1 GB young generation (within a 3 GB heap), the pause times stayed below 100 milliseconds, compared to 500 milliseconds or more without these contributions.
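If you want to check whether a given JVM knows these two flags and what they are set to, a small program like the following should work. It is a minimal sketch: the class name CmsFlagCheck and the "unavailable" fallback are my own choices for illustration, and it assumes a HotSpot-based JVM that exposes com.sun.management.HotSpotDiagnosticMXBean.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports the value of a HotSpot flag in the running JVM.
class CmsFlagCheck {

    // Returns the flag's current value as a string, or "unavailable" if this
    // JVM does not recognize the flag.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) return "unavailable";
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            // getVMOption throws this when the option does not exist in this JVM
            return "unavailable";
        }
    }

    public static void main(String[] args) {
        // Example launch on a JVM that includes CMS and these flags:
        //   java -XX:+UseConcMarkSweepGC -XX:+CMSParallelInitialMarkEnabled \
        //        -XX:+CMSEdenChunksRecordAlways CmsFlagCheck
        String[] flags = {
            "UseConcMarkSweepGC",
            "CMSParallelInitialMarkEnabled",
            "CMSEdenChunksRecordAlways"
        };
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

On a JVM without these flags, the program prints "unavailable" for them instead of failing.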

A more detailed description is here:

After several long email threads that are archived at:

The patches have been accepted into the OpenJDK Hotspot code base as in: (parallel initial mark) (better parallelized remark) (a bug fix, authored by Jon)

Hopefully, with these changes, your initial/remark pause times will be shorter than before, especially in your big-heap Java applications.

Thanks to Jon Masamitsu, Thomas Schatzl, and Chuck Rasbold for sponsoring and/or reviewing the patches.