Tuesday, December 29, 2009

GC Threads

There are two JVM flags/options that control how many GC threads are used in the JVM: ParallelGCThreads and ParallelCMSThreads.

ParallelGCThreads
This flag controls the number of threads used in the parallel garbage collector. This includes the young generation collector that is used by default. If the parallel GC is used (-XX:+UseParallelGC), or is turned on by default on a 'server-class' machine, this is the flag that matters with regard to the number of GC threads. Here's the formula that determines how many GC threads are used in the JVM on Linux/x86:

ParallelGCThreads = (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8)

Some examples are:

When ncpus=4, ParallelGCThreads=4
When ncpus=8, ParallelGCThreads=8
When ncpus=16, ParallelGCThreads=13

One rationale I can think of for using fewer GC threads than cores on machines with higher core counts is that the parallel GC does not scale perfectly, so the extra cores would not help and might even degrade performance.
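
For illustration, here is a minimal Java sketch of that default computation (the real logic lives in Hotspot's C++ code; the method name is made up):

static int defaultParallelGCThreads(int ncpus) {
    // Up to 8 cores: one GC thread per core; beyond that, roughly 5/8 of the cores plus 3.
    return (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8);
}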

ParallelCMSThreads
This flag controls the number of threads used for the CMS (concurrent mark-sweep) garbage collector (-XX:+UseConcMarkSweepGC). CMS is often used to minimize server latency by running the old generation GC mostly concurrently with the application threads. Even when CMS is used (for the old gen heap), a parallel GC is used for the young gen heap, so the value of ParallelGCThreads still matters. Here's how the default value of ParallelCMSThreads is computed on Linux/x86:

ParallelCMSThreads = (ParallelGCThreads + 3) / 4

Some examples are:

When ncpus=4, ParallelCMSThreads=1
When ncpus=8, ParallelCMSThreads=2
When ncpus=16, ParallelCMSThreads=4

Typically, when the CMS GC is active, the CMS threads occupy their cores, and the rest of the cores are available for application threads. For example, on an 8-core machine, since ParallelCMSThreads is 2, the remaining 6 cores are available for application threads. (As a side note, because all threads have the same scheduling priority at the POSIX thread level in the JVM under Linux/x86, the CMS threads may not necessarily stay on cores all of the time.)
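
Because ParallelCMSThreads is derived from ParallelGCThreads, overriding the latter on the command line also moves the former. Here is a hedged Java sketch of that relationship (again, the method name is made up; the real code is in Hotspot):

static int defaultParallelCMSThreads(int parallelGCThreads) {
    return (parallelGCThreads + 3) / 4;
}

// For example, on a 16-core machine:
//   defaultParallelCMSThreads(13) == 4    (the default ParallelGCThreads is 13)
//   defaultParallelCMSThreads(8)  == 2    (with -XX:ParallelGCThreads=8)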

Takeaways
Here are the takeaways for GC tuners out there:
  • Since ParallelCMSThreads is computed based on the value of ParallelGCThreads, overriding ParallelGCThreads when using CMS affects ParallelCMSThreads and the CMS performance.
  • Knowing how the default values of these flags are computed helps you better tune both the parallel GC and the CMS GC. Since the Sun JVM engineers probably determined the default values empirically in certain environments, they may not necessarily be the best for your environment.
  • If you have worked around a multithreaded CMS crash bug in older Sun JDKs by running CMS single-threaded (for example this one), the workaround would have caused a tremendous performance degradation on many-core machines. So, if you run a newer JDK and still use the workaround, it's time to get rid of it and let CMS take advantage of multiple cores.

Building OpenJDK faster

A basic build procedure was described here, but the full build takes a long time. Here are the make variables that I use to build OpenJDK faster for everyday builds:
  • NO_DOCS=true. This causes the build not to generate the javadoc docs for the JDK source code, which aren't very useful for daily engineering.
  • NO_IMAGES=true and DEV_ONLY=true. By default, the JDK makefile builds the JDK and JRE images (that is, in a state ready for deployment). For everyday debugging purposes, that's not necessary. Setting these variables to true skips the image creation, which saves build time.
  • HOTSPOT_BUILD_JOBS=[ncpu] (for the JVM) and PARALLEL_COMPILE_JOBS=[ncpu] (for the JDK). These enable parallel builds. Setting them to the number of CPU cores available on the build machine lets the build run in parallel.
There are also ways to build only the JVM or only the JDK part, which is a real build time saver.

Here's how I build only the JVM part when that's all I need:
  • Go to the hotspot/make directory.
  • Build make targets all_fastdebug copy_fastdebug_jdk export_fastdebug_jdk. Replace 'fastdebug' with 'product' for a product build. The JDK part of the build is copied from the import JDK.
  • Look for a java launcher in hotspot/build/.
Here's how I build only the JDK part when that's all I need:
  • Go to the jdk/make directory.
  • Build make target fastdebug. The JVM part of the build is copied from the import JDK.
  • Look for a java launcher in jdk/build/.

JVM process memory

Have you wondered what consumes memory in the JVM process? Here is most of the list:
  • The Java heap. The maximum size is controlled by flag -Xmx. This is where Java objects are allocated.
  • The permanent generation (perm gen) heap. The maximum size is controlled by -XX:MaxPermSize. The default is 64MB on Linux/x86. This is where the JVM-level class metadata objects, interned strings (String.intern), and JVM-level symbol data are allocated. This often fills up unexpectedly when you use dynamic code/class generation in your application.
  • The code cache. The JIT compiled native code is allocated here.
  • The memory mapped .jar and .so files. The JDK's standard class library jar files and the application's jar files are often memory mapped (typically only parts of the files). Various JDK shared library files (.so files) and application shared library files (JNI) are also memory mapped.
  • The thread stacks. The maximum size of a thread's stack is controlled by flag -Xss or -XX:ThreadStackSize. On Linux/x86, 320KB is the default (per thread).
  • The C/malloc heap. Both the JVM itself and any native code (either the JDK's or the application's) typically use malloc to allocate memory from this heap. NIO direct buffers are allocated via malloc on Linux/x86 (see the sketch below).
  • Any other mmap calls. Any native code could call mmap to allocate pages in the address space.
A side note is that most of the above is allocated lazily: the virtual memory is reserved early but committed only on demand. Your application's physical memory use (RSS) may look small under light load but may get substantially higher under heavy load. The takeaway is that it makes sense to consider the above factors when diagnosing memory footprint problems in the JVM.
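
As a small illustration of the native (malloc) side, here is a hedged Java sketch: an NIO direct buffer lives outside the Java heap, so it does not count against -Xmx and will not show up in a Java heap dump, even though it grows the process footprint.

import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // 32MB of native memory (via malloc on Linux/x86, as noted above),
        // allocated outside the Java heap that -Xmx limits.
        ByteBuffer buf = ByteBuffer.allocateDirect(32 * 1024 * 1024);
        System.out.println("direct buffer capacity = " + buf.capacity());
    }
}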

Wednesday, July 29, 2009

More 2D bugfixes

I worked on a patch to fix a bug in the Pisces renderer:

http://hg.openjdk.java.net/jdk7/2d/jdk/rev/fb03586d68b6
http://bugs.openjdk.java.net/show_bug.cgi?id=100030

The bug was that a circle was rendered like a 'C' shape; that is, the rightmost edge of the circle wasn't rendered properly. It turned out that there was an off-by-one error.

And there are a few more bugfixes in progress:


Thanks to Jennifer Godinez.

[Note: at least the 100031 patch has been checked into OpenJDK.]

A server compiler crash fix

Here's my latest contribution to OpenJDK:

http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/fd50a67f97d1

The patch fixes a server compiler crash that happens under a very rare condition, described here. A key invariant of the server compiler IR is that an instruction must be dominated by all of its inputs. However, in the function remix_address_expressions(), this property was violated, which caused a crash in a later split-if transformation.

Thanks to Chuck Rasbold and Tom Rodriguez.

Friday, April 10, 2009

An event dispatch bug in JVMTI

I encountered an event dispatch bug in JVMTI. See here for the communication on the openjdk mailing list. Here's a summary of what happened.

According to the JVMTI spec, no JVMTI events should be sent during the JVMTI dead phase (after the VMDeath event has been sent). However, I observed that CompiledMethodLoad and CompiledMethodUnload events were sent during the dead phase, after the Agent_OnUnload callback had happened. These compile events were actually for the last Java method JIT-compiled. This can cause a nasty memory corruption bug, because Agent_OnUnload is usually where the data structures of a JVMTI agent are deallocated, and the callback handlers for the above compile events then touch the already-deallocated data structures.

After looking into the Hotspot code, I noticed that event dispatch and the JVMTI phase changes are not synchronized at all (i.e., there are race conditions). In theory, this bug can happen not just for the two compile events but for any event. In practice, it would probably show up with the compile events because those are triggered by the compiler threads rather than by application threads. I was able to suppress this bug in two ways.
  1. By not deallocating memory (perhaps only the memory related to the compile events) in Agent_OnUnload. That way, late event callback handlers only touch still-valid memory.
  2. By adding extra synchronization in the Hotspot JVMTI code (the details are in the mailing list log).
Option 1 is the more practical approach when you cannot change the VM code or when you want to stay portable. Option 2 is harder because it's not obvious what the performance implications would be, and because once we start fixing these race conditions, we need to keep fixing more of them.

As far as I can tell, the same race conditions exist in updating the event callback handlers (SetEventCallbacks) and in enabling/disabling individual event callbacks (SetEventNotificationMode). So, what does that mean? It means that, if you are a JVMTI agent writer, your request to change the event callback handlers in the middle of an application run, or to disable an event dispatch temporarily and enable it again later, may not be honored due to the race conditions. Scary? Yes, especially on modern multicore machines.

Monday, March 16, 2009

Java 2D Miter Line Join Decoration Bugfix

I'm not really a Java 2D person, but I contributed a small bugfix to OpenJDK for the miter line join rendering in the Java 2D Rendering Engine:

http://hg.openjdk.java.net/jdk7/2d/jdk/rev/9318628e8eee

What's a miter line join? For a quick tutorial on the basics of Java 2D rendering, see this page from the Java tutorial:

http://java.sun.com/docs/books/tutorial/2d/geometry/strokeandfill.html
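
For illustration only (this is not part of the fix), here is a minimal sketch that draws a sharp corner with a miter join using the standard java.awt.BasicStroke API; with JOIN_MITER, the outer edges of the two joining segments are extended until they meet in a point:

import java.awt.BasicStroke;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.Path2D;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class MiterJoinDemo {
    public static void main(String[] args) throws Exception {
        BufferedImage img = new BufferedImage(200, 200, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 200, 200);

        // A wide stroke with JOIN_MITER produces a sharp, pointed corner.
        g.setStroke(new BasicStroke(20f, BasicStroke.CAP_BUTT, BasicStroke.JOIN_MITER));
        g.setColor(Color.BLACK);

        Path2D.Float path = new Path2D.Float();
        path.moveTo(40, 160);
        path.lineTo(100, 40);   // the corner that gets the miter join
        path.lineTo(160, 160);
        g.draw(path);

        g.dispose();
        ImageIO.write(img, "png", new File("miter.png"));
    }
}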

Sunday, January 25, 2009

Multithreaded CMS crash

I ran across a CMS crash bug in Hotspot 10 b19 (OpenJDK6 b11), which is described here:

http://bugs.sun.com/view_bug.do?bug_id=6722116

The symptom is that the VM crashes in the concurrent GC thread with the following stack trace (with mangled C++ symbols):

_ZN24YieldingFlexibleWorkGang10start_taskEP24YieldingFlexibleGangTask
_ZN12CMSCollector13do_marking_mtEb
_ZN12CMSCollector13markFromRootsEb
_ZN12CMSCollector21collect_in_backgroundEb
_ZN25ConcurrentMarkSweepThread3runEv
_Z10java_startP6Thread

A workaround is to use the JVM option -XX:-CMSConcurrentMTEnabled to disable the multithreaded CMS collection. Upgrading to a JDK with a newer Hotspot is a better idea, though.

Wednesday, January 14, 2009

Mysterious int arrays in a heap dump

This week I have been investigating the issue of mysterious (primitive) int array objects showing up in a heap dump (generated by the jmap utility or the Hotspot MXBean). They are mysterious because nothing references them (they appear to be dead objects) according to the jhat output.

Today I learned that the garbage collectors may fabricate fake int array objects in certain cases. A heap compaction may not fully compact the heap and may leave some 'holes' in it when the amount of memory that would be recovered by de-fragmentation isn't worth the cost of compaction. Those holes are turned into int arrays, perhaps because of an assumption that the heap needs to look fully compacted for some reason, I guess. The int arrays are harmless: they look just like dead objects (unreachable from the roots) and will be garbage collected in a near-future collection.