Thursday, 4 June 2015

Understanding Host Memory Utilization Monitoring and Alerts for Java Based Middleware

Have you received an alert like one of these?

EM Event: Warning:Memory Utilization is 85.004%, crossed warning (85) or critical (90) threshold.



[ALERT] priority of [Medium] for alert [Memory Utilization above 80%] on resource


You’ve probably set up Oracle Enterprise Manager (OEM) or JBoss Operations Network (JON) monitoring for the Linux host machine running your Oracle Middleware/Database or JBoss middleware software.

After running for some time, this alert about the host’s Memory Utilization begins to fire. You check the OEM or JON console and, even more disturbingly, see an upward trend that has continued ever since the software started running.




No problems are occurring within the applications or database itself, so what should be done?


TL;DR
By design, Linux memory management will always try to use free RAM for caching. If applications need more memory, these buffers and caches are released for them to use. No action is needed, and the alert should either be tuned to a much higher threshold (98%) or disabled. See the final section of this post for the metrics you should monitor to manage the health of your platform’s memory.


The Long Explanation
If you still need reassurance that the alert is nothing to worry about, we can check for ourselves that everything is healthy.


How does the Oracle EM Agent determine Memory Utilization on a Linux Host?
The answer to this question is found in the following Oracle Support doc:


To summarise, OEM calculates memory utilization by comparing MemTotal and MemFree in /proc/meminfo:


[oracle@prd-soa1 ~]$ grep Mem /proc/meminfo
MemTotal:       68042432 kB
MemFree:         4403660 kB


(4403660 / 68042432) * 100 = Free Memory Percentage (currently about 6.5%), which leaves a reported Memory Utilization of roughly 93.5%.
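As a rough sketch, the calculation above can be reproduced in Python. It assumes the agent's metric is simply (MemTotal - MemFree) / MemTotal, as summarised above; the helper names `parse_meminfo` and `naive_memory_utilization` are hypothetical and not part of any Oracle tooling.

```python
"""Sketch: reproduce an OEM-style Memory Utilization figure from /proc/meminfo.

Assumption: utilization = (MemTotal - MemFree) / MemTotal * 100, as summarised
in the post. The function names are hypothetical, not part of any Oracle tool.
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])  # first token is the number
    return values


def naive_memory_utilization(meminfo: dict[str, int]) -> float:
    """Percentage of RAM not reported as MemFree; page cache counts as 'used'."""
    return (meminfo["MemTotal"] - meminfo["MemFree"]) / meminfo["MemTotal"] * 100


if __name__ == "__main__":
    # Figures from the grep output above.
    sample = "MemTotal:       68042432 kB\nMemFree:         4403660 kB\n"
    info = parse_meminfo(sample)
    free_pct = info["MemFree"] / info["MemTotal"] * 100
    print(f"free {free_pct:.1f}%, utilization {naive_memory_utilization(info):.1f}%")
    # prints: free 6.5%, utilization 93.5%
```

On a live host you would pass the contents of /proc/meminfo instead of the sample string. Because page cache is counted as "used" in this formula, the figure climbs as Linux caches more data, which is exactly the trend the alert picks up.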


JBoss ON probably uses a similar method for calculating the memory utilization.


How does Linux Manage Memory Utilization?
Using the free command, we can see the Linux memory state (pass the -m flag for megabytes; the default is kilobytes).


[oracle@prd-soa1 prd-ofm-domain]$ free -m
            total       used       free     shared    buffers     cached
Mem:         66447      62159       4288          0        572      25056
-/+ buffers/cache:      36530      29917
Swap:         1023          0       1023


The second row of output, labelled “-/+ buffers/cache”, represents the true state of Linux memory when RAM used for caching isn’t counted. In this example we can see that the actual available memory is about 30GB (29917 MB), despite the first row showing that only about 4GB (4288 MB) of the 64GB is free.
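The “-/+ buffers/cache” row is simple arithmetic on the first row: buffers and cache are subtracted from used and added to free. Below is a minimal sketch with a hypothetical helper name, using the megabyte figures above. Because free -m converts each value from kilobytes separately, the recomputed numbers can differ from the printed row by 1 MB.

```python
def buffers_cache_adjusted(used: int, free: int, buffers: int, cached: int) -> tuple[int, int]:
    """Return (used, free) with buffers and page cache treated as reclaimable.

    This mirrors the '-/+ buffers/cache' row printed by older versions of `free`.
    """
    return used - buffers - cached, free + buffers + cached


if __name__ == "__main__":
    # 'Mem:' row of `free -m` above, in megabytes.
    adj_used, adj_free = buffers_cache_adjusted(used=62159, free=4288, buffers=572, cached=25056)
    print(adj_used, adj_free)  # 36531 29916
```

Newer procps-ng releases of free drop this row and show an "available" column instead, but the reasoning is the same: cache is memory the kernel can hand back.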


We can also see that used swap is 0. This is a good sign of a healthy system: Linux moves memory pages out to swap space when it cannot fit everything it needs in RAM.
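Swap usage can also be read directly from /proc/meminfo as SwapTotal minus SwapFree, which is the figure an OS swap monitor would track. A small sketch follows; the function name is hypothetical and the sample values are illustrative rather than taken from the host above.

```python
def swap_used_kb(meminfo_text: str) -> int:
    """Return swap in use (kB), computed as SwapTotal - SwapFree.

    `meminfo_text` is text in /proc/meminfo format.
    """
    fields: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens:
            fields[key.strip()] = int(tokens[0])
    return fields["SwapTotal"] - fields["SwapFree"]


if __name__ == "__main__":
    # Illustrative values only (roughly the 1 GB swap shown by `free -m`).
    healthy = "SwapTotal:       1048572 kB\nSwapFree:        1048572 kB\n"
    swapping = "SwapTotal:       1048572 kB\nSwapFree:         786428 kB\n"
    print(swap_used_kb(healthy), swap_used_kb(swapping))  # 0 262144
```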


Check Memory Health Using vmstat
The built-in vmstat command provides some more helpful diagnostics of the state of the Linux memory system. Here I pass two arguments: the first is the delay in seconds between reports and the second is the number of reports, so vmstat reports ten times over roughly ten seconds. You can adjust these as needed. Finally, the output is piped to column -t to help with formatting and readability.


[oracle@prd-soa prd-ofm-domain]$ vmstat 1 10 | column -t
prcs  -----------memory----------  -swap-  --io--  --system--  -----cpu-----
r  b  swpd  free     buff    cache     si  so  bi  bo   in    cs    us  sy  id  wa  st
5  0  0     4403052  585904  25646240  0   0   1   11   2     0     6   1   93  0   0
3  0  0     4403160  585904  25646240  0   0   0   32   7441  4705  44  3   53  0   0
2  0  0     4403032  585904  25646280  0   0   0   448  7226  4555  41  2   57  0   0
1  0  0     4403032  585904  25646352  0   0   0   144  1993  644   23  0   77  0   0
2  0  0     4403024  585904  25646352  0   0   0   12   8188  4682  47  4   50  0   0
2  0  0     4402972  585908  25646408  0   0   0   792  7094  5075  37  2   61  0   0
3  0  0     4402540  585908  25646916  0   0   0   0    7954  4909  45  3   52  0   0
2  0  0     4402416  585908  25647008  0   0   0   128  6852  5058  40  2   58  0   0
2  0  0     4402208  585908  25647436  0   0   0   2    7934  5010  41  3   56  0   0
1  0  0     4402208  585908  25647564  0   0   0   0    1527  491   20  0   80  0   0

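There are many excellent articles online about how to interpret this data. In our case we are mainly interested in activity in the Swap In (si) and Swap Out (so) columns. Non-zero values in the so column indicate that the operating system is moving memory from RAM to disk because it doesn’t have the capacity to keep everything in RAM; non-zero values in si indicate that swapped-out memory is being read back in. Note that the first line of vmstat output is an average since the last boot, while each following line covers one sampling interval.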
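Checking the si and so columns by eye works for ten lines, but a script makes the same check repeatable over longer runs. The sketch below, with hypothetical function names, finds the columns from vmstat's header row and reports whether any interval showed swapping; it skips the first data row because that row is an average since boot. The sample is the first three lines of the output above.

```python
SAMPLE = """\
prcs  -----------memory----------  -swap-  --io--  --system--  -----cpu-----
r  b  swpd  free     buff    cache     si  so  bi  bo   in    cs    us  sy  id  wa  st
5  0  0     4403052  585904  25646240  0   0   1   11   2     0     6   1   93  0   0
3  0  0     4403160  585904  25646240  0   0   0   32   7441  4705  44  3   53  0   0
2  0  0     4403032  585904  25646280  0   0   0   448  7226  4555  41  2   57  0   0
"""


def swap_activity(vmstat_output: str) -> list[tuple[int, int]]:
    """Return (si, so) for each data row of vmstat output.

    The column-name row (the one starting with 'r') is used to locate si
    and so, so the extra spacing added by `column -t` does not matter.
    """
    rows = [line.split() for line in vmstat_output.splitlines() if line.strip()]
    header = next(cols for cols in rows if cols[0] == "r")
    si_idx, so_idx = header.index("si"), header.index("so")
    return [
        (int(cols[si_idx]), int(cols[so_idx]))
        for cols in rows
        if cols[0].isdigit() and len(cols) == len(header)
    ]


def is_swapping(vmstat_output: str) -> bool:
    """True if any sampling interval shows swap-in or swap-out activity.

    The first data row is skipped: vmstat reports it as an average since
    boot, not as activity during the sampling window.
    """
    return any(si or so for si, so in swap_activity(vmstat_output)[1:])


if __name__ == "__main__":
    print(swap_activity(SAMPLE))  # [(0, 0), (0, 0), (0, 0)]
    print(is_swapping(SAMPLE))    # False
```

On a live host the text could come from subprocess.run(["vmstat", "1", "10"], capture_output=True, text=True).stdout, run for as long as you need while the system is under load.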


In our example above there is no activity in the swap columns, so we can feel confident that memory utilisation is healthy.
Of course, running vmstat over a longer period while your system is under load will give the best indication of how swap is being used. However, why should we do that manually?


What Monitoring You Should Have For Linux Hosts Running Java Based Middleware
Because Oracle WebLogic and JBoss Fuse components all run in Java Virtual Machines, we have a great deal of control over how much RAM our platform’s components will consume: the maximum heap size is set by the -Xmx Java argument. So long as the combined total of the Xmx (maximum heap) values for all JVMs on the box stays below the host's physical RAM, leaving some headroom for each JVM's non-heap memory and for non-Java processes, our JVMs should never exhaust the host's memory.
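The Xmx budget described above can be checked with a short sketch. It sums the -Xmx values of every JVM on the host and compares them with physical RAM minus a reserve. The JVM names, sizes and the 8 GiB reserve are hypothetical; the reserve stands in for non-heap JVM memory and non-Java processes, which Xmx does not cover. The parser also assumes that when -Xmx is repeated, the last occurrence wins.

```python
import re

# -Xmx followed by digits and an optional k, m or g unit (either case).
_XMX = re.compile(r"-Xmx(\d+)([kKmMgG]?)")
_UNIT = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
GIB = 1024**3


def xmx_bytes(jvm_args: list[str]) -> int:
    """Return the -Xmx limit in bytes, assuming the last -Xmx given wins."""
    limit = None
    for arg in jvm_args:
        match = _XMX.fullmatch(arg)
        if match:
            number, unit = match.groups()
            limit = int(number) * _UNIT[unit.lower()]
    if limit is None:
        raise ValueError(f"no -Xmx found in {jvm_args!r}")
    return limit


def heap_budget_fits(jvms: dict[str, list[str]], ram_bytes: int, reserve_bytes: int) -> bool:
    """True if the sum of every JVM's -Xmx plus the reserve fits in physical RAM.

    reserve_bytes is a hypothetical allowance for non-heap JVM memory
    (PermGen/Metaspace, thread stacks) and for non-Java processes.
    """
    total_heap = sum(xmx_bytes(args) for args in jvms.values())
    return total_heap + reserve_bytes <= ram_bytes


if __name__ == "__main__":
    ram = 68042432 * 1024  # MemTotal from /proc/meminfo above, kB -> bytes
    jvms = {  # hypothetical servers on the host
        "AdminServer": ["-Xms2g", "-Xmx2g"],
        "soa_server1": ["-Xms16g", "-Xmx16g"],
        "osb_server1": ["-Xms8192m", "-Xmx8192m"],
    }
    print(heap_budget_fits(jvms, ram, reserve_bytes=8 * GIB))  # True: 26 GiB + 8 GiB reserve
```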


Note: It is of course possible to set Xms (initial heap size) to a value lower than Xmx, which would let the combined total of your Xmx values exceed the system's memory without you realising the problem until the JVMs actually grow towards their Xmx values. It is best practice, however, to set Xms and Xmx to the same value. This avoids heap resizing at runtime, which can improve performance and garbage collection behaviour, and has the handy side-effect of preventing us from running too many JVMs.
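A related check flags JVMs whose Xms and Xmx differ. Comparing the argument strings is not enough, because -Xms2048m and -Xmx2g are the same size, so the sketch below normalises both flags to bytes first. The helper names are hypothetical.

```python
import re

# -Xms / -Xmx with an optional k, m or g unit (either case).
_HEAP = re.compile(r"-Xm([sx])(\d+)([kKmMgG]?)")
_UNIT = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def heap_settings(jvm_args: list[str]) -> dict[str, int]:
    """Return {'Xms': bytes, 'Xmx': bytes} for whichever flags are present."""
    found: dict[str, int] = {}
    for arg in jvm_args:
        match = _HEAP.fullmatch(arg)
        if match:
            kind, number, unit = match.groups()
            # Later occurrences overwrite earlier ones.
            found["Xm" + kind] = int(number) * _UNIT[unit.lower()]
    return found


def xms_equals_xmx(jvm_args: list[str]) -> bool:
    """True only when both flags are set and resolve to the same size."""
    heap = heap_settings(jvm_args)
    return "Xms" in heap and "Xmx" in heap and heap["Xms"] == heap["Xmx"]


if __name__ == "__main__":
    print(xms_equals_xmx(["-Xms2048m", "-Xmx2g"]))  # True: same size, different units
    print(xms_equals_xmx(["-Xms512m", "-Xmx4g"]))   # False: heap can grow after startup
```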


By setting the Xmx value (with Xms equal to it), if we try to start a Java process when the operating system memory (minus buffers) is not sufficient, the JVM should fail to start rather than run into trouble later.


Monitoring Correctly For Memory Usage
Use the following monitoring configurations to keep a more accurate eye on your host's memory health:

Availability of JVM Processes - because if there’s not enough memory to start or run a process, the availability alert will let us know.


OS Swap Space Usage - because if the operating system is starting to consume swap space then we want to know, as that is a sign that there is no actual free RAM to use.

Middleware Process JVM Usage - because we can keep track of how high the JVM heap usage is. If heap utilisation is consistently very low and we are receiving either of the two alerts above, it may be sensible to decrease the JVM heap size to correct the problem.
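The rule of thumb in the last item can be written down as a small decision function. This is only a sketch: the 40% cut-off for "very low" heap usage and the reading of "consistently" as every sample being below it are assumptions, and the inputs would come from whatever your OEM or JON metrics expose.

```python
def should_consider_smaller_heap(
    heap_used_pct: list[float],
    swap_in_use: bool,
    jvm_availability_alert: bool,
    low_threshold_pct: float = 40.0,  # hypothetical cut-off for "very low"
) -> bool:
    """Apply the heap-sizing rule of thumb.

    Returns True when heap usage stayed below `low_threshold_pct` in every
    sample AND one of the other two alerts (swap usage, JVM availability)
    has fired.
    """
    if not heap_used_pct:
        return False  # no heap data, no recommendation
    consistently_low = max(heap_used_pct) < low_threshold_pct
    return consistently_low and (swap_in_use or jvm_availability_alert)


if __name__ == "__main__":
    samples = [22.0, 31.5, 28.0]  # hypothetical heap-used percentages
    print(should_consider_smaller_heap(samples, swap_in_use=True, jvm_availability_alert=False))   # True
    print(should_consider_smaller_heap(samples, swap_in_use=False, jvm_availability_alert=False))  # False
```

A production policy would more likely use a percentile over days or weeks than every single sample, but the shape of the decision is the same.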