Monday 21 December 2015

Embedded LDAP error

When starting the Weblogic Admin Server, the following error is reported continuously. It means the embedded LDAP store or its changelog index has become corrupted (in my case because the server ran out of disk space during startup).

####<Dec 12, 2015 12:17:00 PM CST> <Error> <EmbeddedLDAP> <myserver.example.edu> <my_admin> <VDE Replication Thread> <<anonymous>> <> <e412251de17330a1:535cadc6:151c23028bb:-8000-0000000000000007> <1450662420044> <BEA-000000> <Error reading changelog entry#: 0>
####<Dec 12, 2015 12:17:00 PM CST> <Critical> <EmbeddedLDAP> <myserver.example.edu> <my_admin> <VDE Replication Thread> <<anonymous>> <> <e412251de17330a1:535cadc6:151c23028bb:-8000-0000000000000007> <1450662420044> <BEA-000000> <java.lang.NullPointerException
at com.octetstring.vde.EntryChanges.readBytes(EntryChanges.java:256)
at com.octetstring.vde.EntryChanges.<init>(EntryChanges.java:72)
at com.octetstring.vde.replication.BackendChangeLog.getChange(BackendChangeLog.java:548)
at com.octetstring.vde.replication.Replicator.run(Replicator.java:180)
at com.octetstring.vde.replication.Replication.run(Replication.java:339)

Steps to Resolve

Shut down the Admin Server and all Managed Servers, then move the corrupted changelog files aside:
cd $WEBLOGIC_DOMAIN/servers/my_admin/data/ldap
mv ldapfiles/changelog.index ldapfiles/_changelog.index
mv ldapfiles/changelog.data ldapfiles/_changelog.data

Unzip a backup from the backup directory:
unzip $WEBLOGIC_DOMAIN/servers/my_admin/data/ldap/backup/EmbeddedLDAPBackup.zip -d /

Start the Admin Server, then the rest of the Managed Servers.
changelog.index and changelog.data will be recreated.

You might need to re-add any customisations made to the internal LDAP since the backup was taken. In the Admin Console, go to Summary of Security Realms > myrealm > Users and Groups and filter on DefaultAuthenticator to check that your users still exist.
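The two mv commands in the steps above could be wrapped in a small helper. It keeps the corrupted files with a timestamp so that a second attempt does not overwrite the first set. This is a sketch only. The function name set_aside_changelog is hypothetical. It assumes the ldapfiles layout shown above and that every server is already shut down.

```bash
#!/bin/bash
# Hypothetical helper: move the embedded LDAP changelog files aside before
# restoring a backup. Run it only while the Admin Server is shut down.
# Usage: set_aside_changelog "$WEBLOGIC_DOMAIN/servers/my_admin/data/ldap"
set_aside_changelog() {
  local ldap_dir="$1"
  local files_dir="$ldap_dir/ldapfiles"
  local suffix
  suffix=$(date "+%Y%m%d%H%M%S")

  if [ ! -d "$files_dir" ]; then
    echo "No ldapfiles directory under $ldap_dir" >&2
    return 1
  fi

  local name
  for name in changelog.index changelog.data; do
    if [ -e "$files_dir/$name" ]; then
      # Same "_" prefix as the manual steps, plus a timestamp
      mv "$files_dir/$name" "$files_dir/_${name}.${suffix}"
      echo "Moved $name to _${name}.${suffix}"
    fi
  done
}
```

The unzip and restart steps stay manual. That way you can check the backup before the servers come back up.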

Monday 9 November 2015

Investigating SOA SCA Composite Faults without Enterprise Manager (EM)

If you're ever facing a problem with SOA instances faulting, and at the same time you're having problems running or accessing Oracle Enterprise Manager for SOA Suite, you can directly access the Composite faults via the SOAINFRA Database.

The name of the table containing the faults is:
COMPOSITE_INSTANCE_FAULT

The following SQL generates a summary of the latest faults, newest first:

select CPST_PARTITION_DATE, ID, COMPOSITE_DN, SERVICE_NAME, ERROR_MESSAGE, STACK_TRACE
from MYPREFIX_SOAINFRA.COMPOSITE_INSTANCE_FAULT
order by created_time desc;

The ERROR_MESSAGE and STACK_TRACE fields will provide the same error message details that are usually visible in Enterprise Manager.

Monday 31 August 2015

Solving NoClassDefFound SOA Suite BPM Errors Post-Patching

A Missing Class during Business Processing

In our development environment a BPEL Fault began occurring with the following error in the SOA Managed Server logs:

<bpelFault><faultType>0</faultType><remoteFault xmlns="http://schemas.oracle.com/bpel/extension"><part name="summary"><summary>oracle.fabric.common.FabricInvocationException: java.lang.NoClassDefFoundError: oracle/tip/adapter/api/MutatorsAsProperties</summary></part><part name="detail"><detail>oracle/tip/adapter/api/MutatorsAsProperties</detail></part><part name="code"><code>null</code></part></remoteFault></bpelFault>

************************

java.lang.NoClassDefFoundError: oracle/tip/adapter/api/MutatorsAsProperties

************************

In this environment the OSB and SOA Managed Servers are both running on the same VM/Domain, so in this case they both use the same ./bin/setDomainEnv.sh script which is magically generated during installation.

Running OSB and SOA on the same node can cause problems when a patch to one or both products changes which classes are available in the libraries used by the managed servers. I suspected that a recent OSB patch had caused setDomainEnv to use an OSB copy of a JAR needed by SOA, but how could we prove it?

First, let's identify the location (jar file) containing the class we want.

We can find it here in the SOA Product home:

[oracle@dev-01 ~]$ find /path/to/products/Oracle_SOA1/ -name "*.jar" | xargs grep MutatorsAsProperties

Binary file ./soa/modules/oracle.soa.adapter_11.1.1/jca-binding-api.jar matches

Running the same search against the OSB home finds no matches (the count is 0):

[oracle@dev-01 ~]$ find /path/to/products/Oracle_OSB1/ -name "*.jar" | xargs grep MutatorsAsProperties | wc -l
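The grep-on-binary trick works because ZIP (and therefore JAR) entry names are stored uncompressed, so the class path appears literally in the archive bytes. A slightly sturdier version is sketched below. The function name find_class_in_jars is hypothetical. It uses find -exec so that paths containing spaces are handled, and it searches for the slash-separated class path instead of the bare class name to avoid false matches.

```bash
#!/bin/bash
# Sketch: list JAR files under a directory that contain a given class entry.
# ZIP entry names are stored uncompressed, so a fixed-string grep finds them.
# Usage: find_class_in_jars /path/to/products/Oracle_SOA1 oracle/tip/adapter/api/MutatorsAsProperties
find_class_in_jars() {
  local search_root="$1"
  local class_path="$2"
  # -l prints only matching file names; -F treats the class path literally
  find "$search_root" -type f -name '*.jar' -exec grep -l -F -- "$class_path" {} +
}
```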

From this investigation we learn that the SOA server needs the jca-binding-api.jar in its classpath. So let's check the server logs to see what's happening. Here’s a copy of the classpath from the SOA1.out log:

CLASSPATH=/app/oracle/product/oracle_common/modules/oracle.jdbc_11.1.1/ojdbc6dms.jar:/app/oracle/product/Oracle_SOA1/soa/modules/user-patch.jar:/app/oracle/product/Oracle_SOA1/soa/modules/soa-startup.jar::/app/oracle/product/Oracle_OSB1/lib/osb-server-modules-ref.jar:/app/oracle/product/patch_wls1036/profiles/default/sys_manifest_classpath/weblogic_patch.jar:/app/oracle/product/patch_ocp371/profiles/default/sys_manifest_classpath/weblogic_patch.jar:/usr/java/latest/lib/tools.jar:/app/oracle/product/wlserver_10.3/server/lib/weblogic_sp.jar:/app/oracle/product/wlserver_10.3/server/lib/weblogic.jar:/app/oracle/product/modules/features/weblogic.server.modules_10.3.6.0.jar:/app/oracle/product/wlserver_10.3/server/lib/webservices.jar:/app/oracle/product/modules/org.apache.ant_1.7.1/lib/ant-all.jar:/app/oracle/product/modules/net.sf.antcontrib_1.1.0.0_1-0b2/lib/ant-contrib.jar:blah blah blah

It’s a long list, and checking through it we don't see jca-binding-api.jar anywhere. That alone proves little. SOA is made up of many JARs, and many of them are pulled in through the Class-Path entries in other JARs' manifests, so they never need to appear explicitly in the classpath that the server prints.

We could try adding it manually to this list, but then another class might turn up missing. Our best bet is to work out which OSB JAR is being loaded in place of its SOA equivalent. Let’s cut the CLASSPATH list down to just the entries coming from the OSB home:

[oracle@dev-ofm-01 logs]$ grep CLASSPATH soa1.out | tr ":" "\n" | grep Oracle_OSB1

/app/oracle/product/Oracle_OSB1/lib/osb-server-modules-ref.jar
/app/oracle/product/Oracle_OSB1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar
/app/oracle/product/Oracle_OSB1/lib/version.jar
/app/oracle/product/Oracle_OSB1/lib/alsb.jar
/app/oracle/product/Oracle_OSB1/3rdparty/classes
/app/oracle/product/Oracle_OSB1/lib/external/log4j_1.2.8.jar

One JAR immediately stands out as the likely culprit, because it references SOA: ./Oracle_OSB1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar
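The grep/tr filtering of the startup classpath can be wrapped in a reusable function. This is a sketch. The name classpath_entries is hypothetical. It assumes the .out file contains a line beginning with "CLASSPATH=", as in the log above. If the file holds several startups, it uses the most recent line and drops the empty entries produced by "::".

```bash
#!/bin/bash
# Sketch: print the CLASSPATH a WebLogic server logged at startup, one entry
# per line, optionally keeping only entries that contain a substring.
# Assumes a line starting with "CLASSPATH=" in the server's .out file.
# Usage: classpath_entries soa1.out Oracle_OSB1
classpath_entries() {
  local logfile="$1"
  local filter="${2:-}"
  grep '^CLASSPATH=' "$logfile" \
    | tail -n 1 \
    | sed 's/^CLASSPATH=//' \
    | tr ':' '\n' \
    | awk -v f="$filter" 'NF && (f == "" || index($0, f) > 0)'
}
```

Running it once with Oracle_OSB1 and once with Oracle_SOA1 gives a quick side-by-side view of which home each entry comes from.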

We know from our earlier searching that the MutatorsAsProperties class was not found in the OSB home. Let’s now check whether there is a corresponding JAR in the SOA home that includes jca-binding-api.jar.

Yes, there is:

/app/oracle/product/Oracle_SOA1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar

Peeking inside the JAR, we find only a manifest file whose Class-Path includes other JARs from the SOA home:

Manifest-Version: 1.0
…
Class-Path: ../oracle.soa.adapter_11.1.1/jca-binding-api.jar ../oracle
 .soa.adapter_11.1.1/adapter_xbeans.jar ../oracle.soa.fabric_11.1.1/bp
 m-infra.jar ../oracle.soa.fabric_11.1.1/oracle-soa-client-api.jar ../
 maverick-all.jar com.bea.alsb.client_1.4.0.0.jar osb_soa_client.jar

We found it: the SOA copy's manifest pulls in jca-binding-api.jar. So our SOA Managed Server should be using ./Oracle_SOA1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar

While our OSB Managed Server should use:

./Oracle_OSB1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar
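JAR manifests wrap lines at 72 bytes, and each continuation line starts with a single space. That is why names in the Class-Path shown above are split mid-word. The filter sketched below joins the continuations before splitting the entries. The function name manifest_classpath is hypothetical. unzip -p is used only to write the manifest to stdout.

```bash
#!/bin/bash
# Sketch: print the Class-Path entries of a JAR manifest, one per line.
# Manifest lines are wrapped at 72 bytes; a line starting with a single space
# continues the previous line, so continuations are joined before splitting.
# Usage: unzip -p some.jar META-INF/MANIFEST.MF | manifest_classpath
manifest_classpath() {
  awk '
    { sub(/\r$/, "") }                          # tolerate CRLF line endings
    /^ / { line = line substr($0, 2); next }    # continuation: drop the space, append
    { if (line != "") print line; line = $0 }   # new header: flush the previous one
    END { if (line != "") print line }
  ' | sed -n 's/^Class-Path: //p' | tr ' ' '\n' | sed '/^$/d'
}
```

Run against the SOA copy of oracle.soa.common.adapters.jar, it should list ../oracle.soa.adapter_11.1.1/jca-binding-api.jar among the entries.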

Update the setDomainEnv.sh with some logic to reflect this.

Checking the setDomainEnv.sh script in the domain home, we find the lines:

POST_CLASSPATH="/u01/app/oracle/product/Oracle_OSB1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar${CLASSPATHSEP}${POST_CLASSPATH}"
export POST_CLASSPATH

Let’s change this as follows (check the server name and the paths against your own environment):

if [ "${SERVER_NAME}" = "dev01_osb1" ] ; then
  # OSB server: OSB copy of the adapters JAR plus adapter_xbeans.jar and jca-binding-api.jar from ${ALSB_HOME}
  POST_CLASSPATH="/u01/app/oracle/product/Oracle_OSB1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar${CLASSPATHSEP}${ALSB_HOME}/soa/modules/oracle.soa.adapter_11.1.1/adapter_xbeans.jar${CLASSPATHSEP}${ALSB_HOME}/soa/modules/oracle.soa.adapter_11.1.1/jca-binding-api.jar${CLASSPATHSEP}${POST_CLASSPATH}"
else
  # Every other server (including SOA): SOA copy of the adapters JAR
  POST_CLASSPATH="/u01/app/oracle/product/Oracle_SOA1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar${CLASSPATHSEP}${POST_CLASSPATH}"
fi
export POST_CLASSPATH
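Matching a single SERVER_NAME stops scaling once the domain has more than one OSB managed server, for example in a cluster. A shell case pattern handles that. This is a sketch. The naming convention *_osb* and the helper name adapters_jar_for are hypothetical. Only the adapters JAR is selected, so the extra ${ALSB_HOME} entries from the OSB branch above would need to be added in the same way.

```bash
#!/bin/sh
# Hypothetical helper: choose the adapters JAR for a managed server name.
# Assumes every OSB server name contains "_osb" (e.g. dev01_osb1, dev01_osb2).
adapters_jar_for() {
  case "$1" in
    *_osb*) echo "/u01/app/oracle/product/Oracle_OSB1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar" ;;
    *)      echo "/u01/app/oracle/product/Oracle_SOA1/soa/modules/oracle.soa.common.adapters_11.1.1/oracle.soa.common.adapters.jar" ;;
  esac
}

# In setDomainEnv.sh, in place of the if/else block:
# POST_CLASSPATH="$(adapters_jar_for "${SERVER_NAME}")${CLASSPATHSEP}${POST_CLASSPATH}"
# export POST_CLASSPATH
```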

Restart both the SOA and OSB managed servers and check the CLASSPATH output during startup. The OSB server should now use the OSB copy of the JAR and the SOA server the SOA copy.

After re-creating the conditions in which the class was originally called, the process now completes without any fault.

Thursday 16 July 2015

Automatically Generating a Weblogic Thread Dump during High CPU (using JRockit)

One of the best ways to find out why a Weblogic Server instance is consuming excessive CPU is to take a thread dump of the JVM while the high CPU incident is happening.

The Weblogic Admin Console lets you create a thread dump from the GUI under the Server view. However, sometimes this isn't a good solution because:

The high CPU incident occurs when you do not have access to the Admin Console
The high CPU incident locks up your Admin Console, which can occur when the Admin Console is hosted on the same host as the JVM with high CPU, or if there is a lot of network traffic between the faulty Weblogic JVM and the Webserver causing your Admin Console to perform poorly.

A better solution is to dump the threads automatically from the system shell. The following instructions are for JRockit on Linux and use JRockit's jrcmd utility. If you are using the HotSpot JVM, you will find a good guide to creating a similar script on the middlewaremagic.com blog.

highcpu.sh

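#!/bin/bash
# Find the PID of the Weblogic Server JVM (replace my_server1 with your server name)
WL_PID=$(pgrep -f 'weblogic.Name=my_server1' | head -n 1)

# 24-hour clock (%H, not %I) so AM and PM dumps do not overwrite each other
DATESTAMP=$(date "+%m%d_%H%M")

# Dump all Java threads with JRockit's jrcmd
jrcmd "$WL_PID" print_threads > "/var/logs/oracle/${DATESTAMP}-threaddump.log"

# One batch-mode snapshot of every thread (-H) in the process
top -H -b -n 1 -p "$WL_PID" > "/var/logs/oracle/${DATESTAMP}-threadids.log"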

The script explained:

Finds the PID of the Weblogic Server process. I assume you know which server is encountering high CPU and can substitute its name for the "my_server1" value. If not, run top and find the java process with the highest CPU.
Creates a timestamp for the log file names in a preferred format.
Executes the JRockit tool jrcmd, passing it the Weblogic Server PID, and writes the threads to DATESTAMP-threaddump.log.
Runs top once in batch mode with -H for the Weblogic Server PID. This lists every thread of the process, by default in descending order of CPU consumption, in DATESTAMP-threadids.log.

With these two log files you can now examine which threads are causing the high CPU consumption. Find the thread ID (shown in the PID column) in threadids.log, then find the matching "tid" in threaddump.log. Hopefully this tells you which thread activity and code is causing your high CPU usage.
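The lookup described above can be scripted. This is a sketch, and the function name hot_threads is hypothetical. It assumes each JRockit thread header line carries a decimal "tid=NNNN" field, which is the "tid" the post refers to. For a HotSpot dump you would instead convert the id to hex (for example with printf '%x') and match nid=0x....

```bash
#!/bin/bash
# Sketch: show the thread-dump entries for the N busiest threads.
#   $1 = DATESTAMP-threadids.log  (output of: top -H -b -n 1 -p PID)
#   $2 = DATESTAMP-threaddump.log (output of: jrcmd PID print_threads)
#   $3 = number of threads to report (default 5)
# Assumes each JRockit thread header line carries a decimal "tid=NNNN" field.
hot_threads() {
  local top_log="$1" dump_log="$2" count="${3:-5}"
  local tid
  # Rows after the "PID USER ..." header; top sorts them by %CPU by default
  awk -v n="$count" '
    $1 == "PID" { in_table = 1; next }
    in_table && NF > 0 && seen < n { print $1; seen++ }
  ' "$top_log" | while read -r tid; do
    echo "== thread $tid =="
    # Exact match on tid=<id>, so 4242 does not also match 42420
    grep -E "tid=${tid}([^0-9]|$)" "$dump_log" || echo "   (no tid=$tid in dump)"
  done
}
```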

Now, in order to have this to run automatically when the high CPU occurs there are many options. With just Linux you can monitor top with a cron job which runs the dump when CPU is detected to be over a certain threshold. Or if you have a monitoring tool you should be able to configure an alert that runs the script. In my case I can configure Oracle Enterprise Manager to execute the script as a "corrective action" when it detects CPU Utilization on the host reaches a defined threshold.

The Oracle Enterprise Manager documentation explains how to set up corrective actions.
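For the plain Linux and cron option, a minimal sketch is shown below. The script paths, the 90% threshold and the function names are hypothetical. It assumes top's default layout, where %CPU is the ninth column. In top's default Irix mode a process's %CPU can exceed 100 on multi-core hosts, so pick the threshold with that in mind.

```bash
#!/bin/bash
# Sketch of a cron-driven trigger for highcpu.sh. Names, paths and the
# threshold are hypothetical; adjust for your environment.

# Succeed (return 0) when VALUE is strictly greater than THRESHOLD.
# awk is used because bash arithmetic cannot compare decimals like 97.3.
cpu_exceeds() {
  local threshold="$1" value="$2"
  awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Sample one process's %CPU with two top iterations 5 seconds apart and keep
# the second, which is measured over the delay interval.
# Assumes %CPU is column 9 (top's default layout).
sample_cpu() {
  local pid="$1"
  top -b -n 2 -d 5 -p "$pid" | awk -v pid="$pid" '$1 == pid { cpu = $9 } END { print cpu }'
}

# Example crontab entry (every 5 minutes), hypothetical script path:
#   */5 * * * * /home/oracle/bin/cpu_watch.sh
# where cpu_watch.sh does roughly:
#   pid=$(pgrep -f 'weblogic.Name=my_server1' | head -n 1)
#   if [ -n "$pid" ] && cpu_exceeds 90 "$(sample_cpu "$pid")"; then
#     /home/oracle/bin/highcpu.sh
#   fi
```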

Thursday 4 June 2015

Understanding Host Memory Utilization Monitoring and Alerts for Java Based Middleware

Have you received an alert like this?

EM Event: Warning:Memory Utilization is 85.004%, crossed warning (85) or critical (90) threshold.

[ALERT] priority of [Medium] for alert [Memory Utilization above 80%] on resource

You’ve probably decided to set up Oracle Enterprise Manager (OEM) or JBoss Operations Network (JON) monitoring for the Linux host machine running your Oracle Middleware/Database software or JBoss middleware.

After running for some time, this alert about the host’s Memory Utilization begins to fire. You check the OEM or JON console and, even more disturbingly, see an upward trend that has continued ever since the software started running.

No problems are occurring within the applications or database itself, so what should be done?

TL;DR

Linux’s memory management is designed to use free RAM for caching. When applications need more memory, the kernel releases cached pages to them. No action should be taken. The alert should either be raised to a much higher threshold (such as 98%) or disabled. See the final section of this post for the metrics you should monitor to manage the health of your platform’s memory.

The Long Explanation

If you still need reassurance that the alert is nothing to worry about we can go through the process of checking ourselves to make sure everything is healthy.

How does Oracle EM Agent determine Memory Utilization on a Linux Host?

The answer to this question is found in the following Oracle Support doc:

https://support.oracle.com/rs?type=doc&id=730104.1

To summarise, OEM calculates memory utilization from the MemTotal and MemFree values in /proc/meminfo:

[oracle@prd-soa1 ~]$ grep Mem /proc/meminfo
MemTotal: 68042432 kB
MemFree: 4403660 kB

Free Memory Percentage = (MemFree / MemTotal) * 100 = (4403660 / 68042432) * 100, currently about 6.5%, which EM reports as roughly 93.5% utilization.

JBoss ON probably uses a similar method for calculating the memory utilization.
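The calculation summarised above can be reproduced with awk. This is an approximation based on the support note, and the agent's exact formula may differ. The function name mem_free_pct is hypothetical. It assumes the standard "Name: value kB" layout of /proc/meminfo and reads that file by default, though another file can be passed for testing.

```bash
#!/bin/bash
# Sketch of the EM-style calculation: free memory as a percentage of MemTotal,
# using only MemTotal and MemFree (so page cache counts as "used").
mem_free_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free = $2 }
    END {
      if (total == 0) exit 1
      printf "free=%.1f%% used=%.1f%%\n", free / total * 100, 100 - free / total * 100
    }
  ' "$meminfo"
}
```

Fed the two values shown above (MemTotal 68042432 kB, MemFree 4403660 kB), it prints free=6.5% used=93.5%.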

How does Linux Manage Memory Utilization?

Using the free command we can see the Linux memory state (pass the -m flag for megabytes, default is kilobytes).

[oracle@prd-soa1 prd-ofm-domain]$ free -m
             total       used       free     shared    buffers     cached
Mem:         66447      62159       4288          0        572      25056
-/+ buffers/cache:      36530      29917
Swap:         1023          0       1023

The second row, labelled “-/+ buffers/cache”, shows the true state of Linux memory when RAM-level caching isn’t counted. Buffers and cache are subtracted from used (62159 - 572 - 25056 ≈ 36530 MB) and added to free (4288 + 572 + 25056 ≈ 29917 MB). In this example the memory actually available is about 30GB, even though the first row shows only 4GB of the 64GB as free.

We can also see that swap used is 0, which is a good sign of a healthy system. Linux uses swap space only when it cannot fit everything into RAM.
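The "-/+ buffers/cache" free figure can be reproduced from /proc/meminfo as MemFree + Buffers + Cached. Kernels from 3.14 onwards also expose MemAvailable, the kernel's own estimate of memory available to new applications. The function name mem_available_mb is hypothetical, and this is a sketch rather than the exact code free uses.

```bash
#!/bin/bash
# Sketch: MemFree + Buffers + Cached in MB (the "-/+ buffers/cache" free value
# that older versions of free -m print), plus MemAvailable when present.
mem_available_mb() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    $1 == "MemFree:"      { free = $2 }
    $1 == "Buffers:"      { buffers = $2 }
    $1 == "Cached:"       { cached = $2 }
    $1 == "MemAvailable:" { avail = $2; has_avail = 1 }
    END {
      printf "free+buffers/cache=%d MB\n", (free + buffers + cached) / 1024
      if (has_avail) printf "MemAvailable=%d MB\n", avail / 1024
    }
  ' "$meminfo"
}
```

Either figure is a better basis for a memory alert than MemFree alone.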

Check Memory Health Using vmstat

The built-in vmstat command gives us more helpful diagnostics of the state of the Linux memory system. Here I pass two arguments: the first is the delay in seconds between samples and the second is the number of samples, so vmstat reports ten samples one second apart. You can adjust these as needed. Finally, the output is piped to column -t to help with formatting and readability.

[oracle@prd-soa prd-ofm-domain]$ vmstat 1 10 | column -t

procs ------------memory-------------  -swap-  ---io--- --system--  -------cpu--------
r  b  swpd  free     buff    cache     si  so  bi  bo   in    cs    us  sy  id  wa  st
5  0  0     4403052  585904  25646240  0   0   1   11   2     0     6   1   93  0   0
3  0  0     4403160  585904  25646240  0   0   0   32   7441  4705  44  3   53  0   0
2  0  0     4403032  585904  25646280  0   0   0   448  7226  4555  41  2   57  0   0
1  0  0     4403032  585904  25646352  0   0   0   144  1993  644   23  0   77  0   0
2  0  0     4403024  585904  25646352  0   0   0   12   8188  4682  47  4   50  0   0
2  0  0     4402972  585908  25646408  0   0   0   792  7094  5075  37  2   61  0   0
3  0  0     4402540  585908  25646916  0   0   0   0    7954  4909  45  3   52  0   0
2  0  0     4402416  585908  25647008  0   0   0   128  6852  5058  40  2   58  0   0
2  0  0     4402208  585908  25647436  0   0   0   2    7934  5010  41  3   56  0   0
1  0  0     4402208  585908  25647564  0   0   0   0    1527  491   20  0   80  0   0

There are many excellent articles online about how to interpret this data. In our case we are mainly interested in the Swap In (si) and Swap Out (so) columns. Non-zero values in the so column mean the operating system is moving memory pages from RAM to disk because it cannot hold everything in RAM. Note that the first data row reports averages since the system booted, not the last second.

In our example above we see no activity in the swap columns, so we can feel confident that memory utilisation is healthy.
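Over longer vmstat runs it helps to total the swap columns instead of reading them by eye. This is a sketch. The function name swap_activity is hypothetical. It assumes vmstat's default column layout, where si and so are columns 7 and 8. It skips the two header lines and the first since-boot sample.

```bash
#!/bin/bash
# Sketch: total the swap-in (si) and swap-out (so) columns of vmstat output.
# Skips the two header lines and the first sample (an average since boot).
# Usage: vmstat 1 10 | swap_activity
swap_activity() {
  awk '
    NR <= 3 { next }
    { si += $7; so += $8 }
    END { printf "si=%d so=%d\n", si, so }
  '
}
```

For the sample above it would print si=0 so=0.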

Of course, running vmstat over a longer period while your system is under load will give the best indication of how swap is being used. However, why should we do that manually?

What Monitoring You Should Have For Linux Hosts Running Java Based Middleware

Because Oracle Weblogic and JBoss Fuse components all run in Java Virtual Machines, we have a lot of control over how much RAM our platform’s components will consume: the maximum heap is set by the Java arguments. As long as the combined total of the Xmx (maximum heap) arguments for all JVMs on the box stays below the physical RAM, with space left for non-heap JVM memory and non-Java processes, our JVMs should never exhaust the host’s memory.

Note: Of course it's possible to set Xms (initial heap size) to a value less than Xmx. That would let the combined total of Xmx exceed the total memory of the system without you realising it until each JVM tries to grow to its Xmx value. It is best practice, however, to set Xms and Xmx to the same value. This avoids the overhead of the heap being resized at runtime and has the handy side-effect of preventing us from running too many JVMs.

With Xmx set, if we try to start a Java process when the operating system memory (minus buffers) is not sufficient, the JVM will fail to start.
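To check the "combined total of Xmx" rule on a running host, the -Xmx values can be summed from the process command lines. This is a sketch. The function name total_xmx_mb is hypothetical. It assumes standard -Xmx syntax with an optional k, m or g suffix. Keep in mind that Xmx caps only the heap, so thread stacks, class metadata and native buffers come on top.

```bash
#!/bin/bash
# Sketch: add up the -Xmx (maximum heap) settings of the command lines read on
# stdin and print the total in MB. If -Xmx appears more than once on a line,
# the last one is used (HotSpot honours the last occurrence).
# Accepts k/m/g suffixes (either case); a bare number is taken as bytes.
total_xmx_mb() {
  awk '
    {
      val = ""
      for (i = 1; i <= NF; i++)
        if ($i ~ /^-Xmx[0-9]+[kKmMgG]?$/) val = substr($i, 5)
      if (val == "") next
      unit = tolower(substr(val, length(val)))
      num = val + 0                      # awk reads the leading digits only
      if (unit == "g")      mb = num * 1024
      else if (unit == "m") mb = num
      else if (unit == "k") mb = num / 1024
      else                  mb = num / 1048576
      total += mb
    }
    END { printf "%d\n", total }
  '
}
```

On Linux, `ps -e -ww -o args= | total_xmx_mb` gives the total for every running JVM. Compare it with MemTotal from /proc/meminfo, leaving headroom for non-heap and non-Java memory.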

Monitoring Correctly For Memory Usage

Use the following monitoring configurations to keep a more accurate eye on your host’s memory health:

Availability of JVM Processes - because if there’s not enough memory to start or run a process, the availability alert will let us know.

OS Swap Space Usage - because if the operating system is starting to consume swap space then we want to know, as that is a sign that there is no actual free RAM to use.

Middleware Process JVM Usage - because we can keep track of how high JVM heap usage is. If heap utilisation is consistently very low and the two checks above raise alerts, it might be reasonable to decrease the JVM heap size to correct the problem.
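If your monitoring tool lacks a built-in swap metric, the percentage can be read from /proc/meminfo. This is a sketch, and the function name swap_used_pct is hypothetical. A host with no swap configured reports 0.0 instead of dividing by zero.

```bash
#!/bin/bash
# Sketch: percentage of swap in use, from SwapTotal and SwapFree in /proc/meminfo.
swap_used_pct() {
  local meminfo="${1:-/proc/meminfo}"
  awk '
    $1 == "SwapTotal:" { total = $2 }
    $1 == "SwapFree:"  { free = $2 }
    END {
      if (total == 0) { print "0.0"; exit }
      printf "%.1f\n", (total - free) / total * 100
    }
  ' "$meminfo"
}
```

A script-based check could then alert when, for example, `[ "$(swap_used_pct | cut -d. -f1)" -gt 10 ]` is true. The 10% threshold is only an illustration.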