Thursday 16 July 2015

Automatically Generating a Weblogic Thread Dump during High CPU (using JRockit)

One of the best ways to gain insight into why a Weblogic Server instance is consuming excessive CPU is to take a thread dump of the JVM while the high CPU incident is occurring.

The Weblogic Admin Console allows you to create a thread dump via the GUI under the Server view. However, sometimes this isn't a good solution because:

  1. The high CPU incident occurs when you do not have access to the Admin Console
  2. The high CPU incident locks up your Admin Console. This can occur when the Admin Console is hosted on the same host as the JVM with high CPU, or when heavy network traffic between the faulty Weblogic JVM and the web server causes the Admin Console to perform poorly.
A better solution is to dump the threads automatically via the system shell. The following instructions are for JRockit on Linux and use some of the JRockit utilities. If you are using Java SE, you will find a good guide to creating a similar script on the middlewaremagic.com blog.

highcpu.sh
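# PID of the target Weblogic Server (replace my_server1 with your server name)
WL_PID=$(ps -ef | grep "weblogic.Name=my_server1" | grep -v grep | awk '{print $2}')
# 24-hour timestamp (%H) so that morning and evening dumps do not share a file name
DATESTAMP=$(date "+%m%d_%H%M")
jrcmd "$WL_PID" print_threads > "/var/logs/oracle/${DATESTAMP}-threaddump.log"
top -H -b -n 1 -p "$WL_PID" > "/var/logs/oracle/${DATESTAMP}-threadids.log"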

The script explained:
  1. Finds the PID of the Weblogic Server process. I assume you know which server is encountering high CPU problems and can substitute its name for the "my_server1" value. If not, you can run 'top' and look for the java process with the highest CPU.
  2. Creates a timestamp for our log files in a preferred format.
  3. Executes the JRockit tool jrcmd, passing it the Weblogic Server PID, and dumps the threads to DATESTAMP-threaddump.log.
  4. Executes top in threads mode for the Weblogic Server PID and writes every thread ID to DATESTAMP-threadids.log, in descending order of CPU consumption.
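If you do not know which server is misbehaving, step 1 can be approximated by picking the java process with the highest CPU figure. The following is only a sketch: the function names are my own, and note that the pcpu value reported by ps is an average over the life of the process, so a reading from top may be more representative of a sudden spike.

```shell
#!/bin/sh
# Hypothetical helpers, not part of highcpu.sh.

# Read "PID %CPU COMMAND" rows on stdin and print the PID of the
# java row with the highest %CPU value.
pick_busiest_java() {
    awk '$3 == "java" && (pid == "" || $2 + 0 > max) { max = $2 + 0; pid = $1 }
         END { if (pid != "") print pid }'
}

# Feed the live process table into the picker (headers suppressed with "=").
busiest_java_pid() {
    ps -eo pid=,pcpu=,comm= | pick_busiest_java
}
```

Calling busiest_java_pid prints a single PID, which could be used in place of the grep on weblogic.Name.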
With these two log files you can now examine which threads are causing the high CPU consumption. Find the thread ID (shown in the PID column) of the busiest thread in threadids.log, and then find the thread with the corresponding "tid" in threaddump.log. Hopefully, from its stack trace you can learn which thread activity and code is causing your high CPU usage.
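The lookup between the two files can also be scripted. This is a rough sketch with function names of my own choosing. It assumes the default top column layout (PID first, %CPU ninth) and that each thread header in the JRockit dump contains a decimal "tid=NNNN" field, so check both against your own logs before relying on it.

```shell
#!/bin/sh
# Hypothetical helpers for correlating the two logs written by highcpu.sh.

# Print "PID %CPU" for the first N thread rows after the column header
# in a `top -H -b` log (the rows are already sorted by CPU).
hottest_threads() {
    awk -v n="${2:-5}" '
        seen && NF >= 9 && shown < n { print $1, $9; shown++ }
        $1 == "PID" { seen = 1 }
    ' "$1"
}

# For each hot thread, print its header line from the thread dump
# plus the 15 lines that follow it (assumed to hold the stack trace).
show_hot_stacks() {
    hottest_threads "$1" "${3:-5}" | while read -r tid cpu; do
        echo "== thread $tid (${cpu}% CPU) =="
        grep -w -A 15 "tid=$tid" "$2"
    done
}
```

For example, show_hot_stacks 0716_1430-threadids.log 0716_1430-threaddump.log 3 would print the stacks of the three busiest threads (the file names here are illustrative).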

There are many options for running this automatically when the high CPU occurs. With just Linux, you can use a cron job that checks top and runs the dump when CPU usage is over a certain threshold. If you have a monitoring tool, you should be able to configure an alert that runs the script. In my case, I can configure Oracle Enterprise Manager to execute the script as a "corrective action" when it detects that CPU Utilization on the host has reached a defined threshold.
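For the cron option, a watcher along the following lines might be enough. Treat it as a sketch rather than a finished tool: the script name cpuwatch.sh, the HIGHCPU_SCRIPT variable, its default path and the threshold of 90 are all placeholders of mine. It takes two top samples because the first sample of a batch run is often not representative, and it assumes the default top layout with %CPU in the ninth column. On a multi-core host top may report more than 100% for a single process, so choose the threshold accordingly.

```shell
#!/bin/sh
# Hypothetical cpuwatch.sh, e.g. run from cron once a minute:
#   * * * * * /usr/local/bin/cpuwatch.sh my_server1 90
HIGHCPU_SCRIPT=${HIGHCPU_SCRIPT:-/usr/local/bin/highcpu.sh}  # placeholder path

# Succeed (exit 0) when the first number is strictly greater than the second.
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Read `top -b` output on stdin and print %CPU for the given PID.
# If the PID appears in several samples, the last one wins.
cpu_from_top() {
    awk -v pid="$1" '$1 == pid { cpu = $9 } END { if (cpu != "") print cpu }'
}

# Find the server process, sample its CPU and run the dump script if needed.
check_and_dump() {
    pid=$(pgrep -f "weblogic.Name=$1" | head -n 1)
    [ -n "$pid" ] || return 1
    cpu=$(top -b -n 2 -d 1 -p "$pid" | cpu_from_top "$pid")
    if exceeds "${cpu:-0}" "${2:-90}"; then
        "$HIGHCPU_SCRIPT"
    fi
}

# Only act when a server name was passed on the command line.
if [ "$#" -gt 0 ]; then
    check_and_dump "$@"
fi
```

Because a sustained spike would trigger a new dump on every run, you may want to add a lock file or a cool-down period before using something like this in production.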