Tuesday, December 29, 2009

GC Threads

There are two JVM flags/options that control how many GC threads are used in the JVM: ParallelGCThreads and ParallelCMSThreads.

ParallelGCThreads
This flag controls the number of threads used by the parallel garbage collector, including the young generation collector used by default. If the parallel GC is used (-XX:+UseParallelGC), or is turned on by default on a 'server-class' machine, this is the flag you care about with regard to the number of GC threads. Here's the formula that decides the default number of GC threads in the JVM on Linux/x86:

ParallelGCThreads = (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8)

Some examples are:

When ncpus=4, ParallelGCThreads=4
When ncpus=8, ParallelGCThreads=8
When ncpus=16, ParallelGCThreads=13

One rationale I can think of for using fewer GC threads than cores on machines with higher core counts is that parallel GC does not scale perfectly, so the extra threads didn't help or even degraded performance.
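
The formula above can be checked with a few lines of Java. This is only a sketch of the arithmetic as written in this post, not the JVM's own code; the class and method names are made up for illustration.

```java
final class ParallelGcThreadDefaults {
    // Default from the formula in this post: one GC thread per CPU up to
    // 8 CPUs, then 3 + 5/8 of the CPU count (integer division).
    static int parallelGcThreads(int ncpus) {
        return (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8);
    }

    public static void main(String[] args) {
        for (int ncpus : new int[] {4, 8, 16}) {
            System.out.println("ncpus=" + ncpus
                    + " -> ParallelGCThreads=" + parallelGcThreads(ncpus));
        }
        // The CPU count the JVM sees on this machine.
        int local = Runtime.getRuntime().availableProcessors();
        System.out.println("this machine: ncpus=" + local
                + " -> ParallelGCThreads=" + parallelGcThreads(local));
    }
}
```

Because the division is integer division, the value grows in steps rather than smoothly once ncpus goes above 8.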

ParallelCMSThreads
This flag controls the number of threads used by the CMS (concurrent mark and sweep) garbage collector (-XX:+UseConcMarkSweepGC). CMS is often used to minimize server latency by running the old generation GC mostly concurrently with the application threads. Even when CMS is used (for the old gen heap), a parallel GC is used for the young gen heap, so the value of ParallelGCThreads still matters. Here's how the default value of ParallelCMSThreads is computed on Linux/x86:

ParallelCMSThreads = (ParallelGCThreads + 3) / 4

Some examples are:

When ncpus=4, ParallelCMSThreads=1
When ncpus=8, ParallelCMSThreads=2
When ncpus=16, ParallelCMSThreads=4

Typically, when the CMS GC is active, the CMS threads each occupy a core, and the rest of the cores are available for application threads. For example, on an 8-core machine, since ParallelCMSThreads is 2, the remaining 6 cores are available for application threads. (As a side note, because all the threads have the same scheduling priority at the POSIX thread level in the JVM under Linux/x86, the CMS threads may not necessarily be on cores all of the time.)
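
The same kind of sketch works for the CMS default and for the number of cores left to the application. It assumes the two formulas exactly as given in this post and the simplification that each CMS thread keeps one core busy; the class and method names are hypothetical.

```java
final class CmsThreadDefaults {
    // ParallelGCThreads default, as given earlier in this post.
    static int parallelGcThreads(int ncpus) {
        return (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8);
    }

    // ParallelCMSThreads default: a quarter of ParallelGCThreads, rounded up.
    static int parallelCmsThreads(int parallelGcThreads) {
        return (parallelGcThreads + 3) / 4;
    }

    // Simplifying assumption: each CMS thread fully occupies one core.
    static int coresLeftForApplication(int ncpus) {
        return ncpus - parallelCmsThreads(parallelGcThreads(ncpus));
    }

    public static void main(String[] args) {
        for (int ncpus : new int[] {4, 8, 16}) {
            int cms = parallelCmsThreads(parallelGcThreads(ncpus));
            System.out.println("ncpus=" + ncpus + " -> ParallelCMSThreads=" + cms
                    + ", cores left for the application=" + coresLeftForApplication(ncpus));
        }
    }
}
```

Adding 3 before dividing by 4 rounds up, so there is always at least one CMS thread.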

Takeaways
Here are the takeaways for GC tuners out there:
  • Since ParallelCMSThreads is computed based on the value of ParallelGCThreads, overriding ParallelGCThreads when using CMS affects ParallelCMSThreads and the CMS performance.
  • Knowing how the default values of the flags are computed helps you better tune both the parallel GC and the CMS GC. Since the Sun JVM engineers probably determined the default values empirically in certain environments, they may not necessarily be the best for your environment.
  • If you have worked around some multithreaded CMS crash bug in older Sun JDKs by running it single-threaded (for example this one), the workaround would have caused a tremendous performance degradation on many-core machines. So, if you run a newer JDK and still use the workaround, it's time to get rid of it and allow CMS to take advantage of multiple cores.
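
To see which values a running JVM actually ended up with, and whether they came from the defaults or from the command line, one option is the HotSpotDiagnosticMXBean in the com.sun.management package. The sketch below assumes a HotSpot-based JVM; the GcFlagInspector class and its lookup method are names invented for this example, and a flag that the running JVM does not recognize is simply reported as unavailable.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;
import java.util.Optional;

final class GcFlagInspector {
    // Returns the VM option if this JVM knows the flag, otherwise empty.
    static Optional<VMOption> lookup(String flagName) {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return Optional.empty(); // bean not provided by this JVM
            }
            return Optional.of(bean.getVMOption(flagName));
        } catch (IllegalArgumentException e) {
            // Thrown when the flag (or the bean) does not exist in this JVM.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"ParallelGCThreads", "ParallelCMSThreads"}) {
            Optional<VMOption> option = lookup(flag);
            if (option.isPresent()) {
                System.out.println(flag + "=" + option.get().getValue()
                        + " (origin: " + option.get().getOrigin() + ")");
            } else {
                System.out.println(flag + " is not available in this JVM");
            }
        }
    }
}
```

The origin tells you whether a value is a default or was set explicitly, which helps when hunting for a leftover workaround in the startup scripts.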

Building OpenJDK faster

The basic build instructions were described here, but a full build takes a long time. Here are the variables that I use to build OpenJDK faster for everyday builds:
  • NO_DOCS=true. This causes the build not to generate the javadoc docs for the JDK source code, which isn't very useful for daily engineering.
  • NO_IMAGES=true and DEV_ONLY=true. By default, the JDK makefile builds the JDK and the JRE images (that is, the state ready for deployment). For everyday debugging purposes, that's not necessary. Setting these variables to true omits the image creation, which saves build time.
  • HOTSPOT_BUILD_JOBS=[ncpu] (for the JVM) and PARALLEL_COMPILE_JOBS=[ncpu] (for the JDK). These are for parallel builds. Setting them to the number of CPU cores available on the build machine makes the build run in parallel.
There are also ways to build only the JVM part or only the JDK part, which is a real build time saver.
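
As a small illustration, the variables listed above can be assembled into a single make invocation. This sketch only builds and prints the command line; it does not run the build. Passing the variables as make command-line arguments (rather than exporting them in the environment) is an assumption on my part, and the FastBuildCommand class is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

final class FastBuildCommand {
    // Builds the make command line with the speed-up variables from this post.
    static List<String> makeCommand(int ncpu) {
        List<String> command = new ArrayList<>();
        command.add("make");
        command.add("NO_DOCS=true");                   // skip javadoc generation
        command.add("NO_IMAGES=true");                 // skip JDK/JRE image creation
        command.add("DEV_ONLY=true");
        command.add("HOTSPOT_BUILD_JOBS=" + ncpu);     // parallel JVM build
        command.add("PARALLEL_COMPILE_JOBS=" + ncpu);  // parallel JDK build
        return command;
    }

    public static void main(String[] args) {
        int ncpu = Runtime.getRuntime().availableProcessors();
        System.out.println(String.join(" ", makeCommand(ncpu)));
    }
}
```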

When I only need the JVM part, here's how I build it:
  • Go to the hotspot/make directory.
  • Build the make targets all_fastdebug, copy_fastdebug_jdk, and export_fastdebug_jdk. Replace 'fastdebug' with 'product' for a product build. The JDK part of the build is copied from the import JDK.
  • Look for a java launcher in hotspot/build/.
When I only need the JDK part, here's how I build it:
  • Go to the jdk/make directory.
  • Build the make target fastdebug. The JVM part of the build is copied from the import JDK.
  • Look for a java launcher in jdk/build/.

JVM process memory

Have you wondered what consumes memory in the JVM process? Here is most of the list:
  • The Java heap. The maximum size is controlled by flag -Xmx. This is where Java objects are allocated.
  • The permanent generation (perm gen) heap. The maximum size is controlled by -XX:MaxPermSize. The default is 64MB on Linux/x86. This is where the JVM-level class metadata objects, interned strings (String.intern), and JVM-level symbol data are allocated. This often fills up unexpectedly when you use dynamic code/class generation in your application.
  • The code cache. The JIT compiled native code is allocated here.
  • The memory mapped .jar and .so files. The JDK's standard class library jar files and the application's jar files are often memory mapped (typically only part of the files). Various JDK shared library files (.so files) and application shared library files (JNI) are also memory mapped.
  • The thread stacks. The maximum size of a thread's stack is controlled by flag -Xss or -XX:ThreadStackSize. On Linux/x86, 320KB is the default (per thread).
  • The C/malloc heap. Both the JVM itself and any native code (either the JDK's or the application's) typically use malloc to allocate memory from this heap. NIO direct buffers are allocated via malloc on Linux/x86.
  • Any other mmap calls. Any native code can call mmap to allocate pages in the address space.
A side note is that most of the above are allocated lazily. That is, they are reserved as virtual memory early but committed only on demand. Your application's physical memory use (RSS) may look small under light load, but may grow substantially under heavy load. The takeaway is that it makes sense to consider all of the above when diagnosing memory footprint problems in the JVM.
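
Some of the items above can be inspected from inside the JVM with the standard java.lang.management API. The sketch below lists the memory pools the JVM reports (the heap spaces, the code cache and similar), the NIO buffer pools, and the live thread count. It is a partial view under stated assumptions: the pool names depend on the JVM version and the collector in use, and the malloc heap, the thread stacks themselves, and mapped .jar/.so files are not covered, so OS-level tools are still needed for those. JvmMemoryReport is a hypothetical class name.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class JvmMemoryReport {
    static List<String> report() {
        List<String> lines = new ArrayList<>();
        // Heap and non-heap pools; committed can be well below max because
        // memory is committed on demand.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / 1024) + "K";
            lines.add(String.format("%-16s %-32s used=%dK committed=%dK max=%s",
                    pool.getType(), pool.getName(),
                    usage.getUsed() / 1024, usage.getCommitted() / 1024, max));
        }
        // NIO buffer pools (for example direct buffers) as reported by the JVM.
        for (BufferPoolMXBean buffers : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            lines.add(String.format("%-16s %-32s used=%dK",
                    "NIO buffers", buffers.getName(), buffers.getMemoryUsed() / 1024));
        }
        // Each live thread has its own stack, sized by -Xss.
        lines.add("live threads: " + ManagementFactory.getThreadMXBean().getThreadCount());
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report()) {
            System.out.println(line);
        }
    }
}
```

Comparing the committed and max columns under light and heavy load shows the lazy allocation described above.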