Followers

Tuesday, February 4, 2014

How Statement Pooling in JDBC affects the Garbage Collection

There are various techniques to improve the performance of your Java application. In this article I will talk about Statement Pooling Configuration and its effect on Garbage Collection process.
Statement Pooling allows to improve the performance of an application by caching SQL statements that are used repeatedly. Such caching mechanism allows to prepare frequently used statements only once and reuse them multiple times, thus reducing the overall number of times the database server has to parse, plan, and optimize these queries. A well-configured number of statements (maxStatements) to be cached can be as good as tuning the Garbage Collection. Now let's see how Statement Pooling can affect the Garbage Collection.

Why Check the Number of Statement in the Pool?

Often the size of the JDBC statement pool is set to the default value. Using the default value, of course does not usually lead to any special issue. But a well-configured maxStatements value can be as effective as GC tuning. If you are using the default maxStatements value and would like to optimize the use of memory, let's think about the correct statement pool value before attempting GC tuning.
As was discussed in Understanding Java Garbage Collection, a weak generational hypothesis (most objects quickly become unreachable and a reference from an old object to a new object is rare) was used as the precondition when creating garbage collector in Java. For the majority of NHN web services there should be a response within 300ms at the latest, unless it is a special case. Therefore, NHN web services are more applicable to the above situations than the general stand-alone type applications.

The GC Process between HTTP Request and Response

When developing a web service using web containers like Tomcat and other frameworks, the lifespan of objects created by a developer tend to be either very short or very long. Web developers usually write codes like Interceptor, Action, BO, or DAO (BO and DAO are generated and used as singletons from applicationContex in Spring, and are not the target of GC). The objects generated from these codes stay alive for a very brief time that exists between the time HTTP is requested and the time it has responded. For this reason, such objects are usually collected during Young GC.
There are also objects, such as singleton objects, that stay alive long enough to exist for the lifecycle of Tomcat. Such objects will be promoted to the old area soon after Tomcat starts running. Yet, when continuously monitoring web applications through jstat and the like, there are always some objects promoted to the old area during Young GC. These objects are usually used after being stored in the cache used for improving the performance of frameworks in most of the containers and projects. Whether the cached objects become the target of GC or not is determined by their cache hit ratio, not their age, so unless the hit ratio is 100%, they cannot avoid being promoted to the old area, even when the Young GC cycle is set to be long.
Among these caches, statement pooling affects the memory usage the most. If you are using iBatis, as iBatis processes all SQLs as preparedStatment, you will be using statement pooling. If the size of statement pooling is smaller than the number of SQLs being used, the cache hit ratio will decrease and result in cache maintenance cost. Objects that are reachable in the old area become the target of GC and will be retrieved, then will be regenerated during the HTTP request process, only to be cached and promoted to the old area. The full GC cycles are affected by this process.

Size of the Statement Objects

It would be safe to say that the size of a single statement object is proportional to the length of the SQL code processed by the same statement. Even for a long and complex SQL, the size of the object should be around 500 bytes. The object's small size would seem to have little effect on the full GC cycles, but such an assumption would be incorrect.
When you look at the JDBC specifications, each connection has its own statement pool (maxStatementsPerConnection), as described in Figure 1 below. So, although a statement object is as small as 500 bytes, if there are many connections, the statements cache may occupy the proportional amount of the heap.
relationship-between-connection-and-statement.png
Figure 1: Relationship between the Connection and the Statement.
(Though the statement has the ResultSet, it should be clarified that ResultSet is not an object for caching. ResultSet is allocated as null when rs.close() is called by iBatis, then retrieved in the young area during young GC.)

The Effect of Statement Pool's Cache Hit Ratio on the Full GC

A simple test program was created to assess the effect of cache hit ratio on the full GC. One cache hit ratio was set to 100% while the other was set to 50%. When the same amount of load was applied, the results presented in Table 1 and 2 were obtained.
In both cases, the occurrences of young GC were very similar but the results for the full GC was different. When the cache hit ratio was 100%, full GC occurred only once, because the number of objects promoted to the old area during young GC was small. When the ratio was 50%, full GC occurred 4 times because the number of statement objects promoted to the old area during young GC was high, as the objects were cached in the statement pool, then removed from the pool in LRU way, then cached again at the next request.
Table 1. Cache hit ratio = 100%.
... OC OU YGC FGC FGCT GCT
... 10688.0 6940.9 532 1 0.190 1.274
... 10688.0 6940.9 532 1 0.190 1.274

Table 2. Cache hit ratio = 50%.
... OC OU YGC FGC FGCT GCT
... 10240.0 7092.7 554 4 0.862 2.253
... 10240.0 7412.0 555 4 0.862 2.255
I would like to add one more thing. When the cache hit ratio is 50%, it violates the 2nd category of weak generational hypothesis I introduced previously. When low cache hit ratio causes frequent pool registration and subsequent removal, it means the statement object generated in the young area is being referenced in the pool from the old area, which leads to additional strain during GC because the card marking technique is used to manage the references separately.

In Conclusion

In Lucy (NHN's internal Java Framework), the maxStatements value for statement pooling in Oracle and MySQL is 500. In most cases, 500 should be enough. However, when more SQL is being used, increasing the default value to meet such demand would be a way to improve the system efficiency (when using $(String replacement) for query on iBatis for the reason of table partitioning and the like, the number of queries must be multiplied by the number of partitioned tables).
However, when the default value is higher than necessary, this leads to a different problem. A higher value means more memory usage and higher likelihood of an Out Of Memory (OOME) occurrence.
In a situation where the number of SQLs are 10,000 and the number of connections are 50, then the total size of statement objects is about 250 MB. (500 byte * 50 * 10,000 = 250 MB). It should be easy to determine the likelihood of OOME occurrence by checking the Xmx configuration for the service in use.

No comments:

Post a Comment