A case study in 10g SQL tuning with histograms

V. J. Jain

One of the many advantages of upgrading to Oracle 10g is that the cost-based optimizer (CBO) has been improved with a cost model that accounts for both I/O and CPU, allowing it to choose better execution plans based on statistics about the data. However, the decisions the CBO makes depend on the statistics available at the time the execution plan is parsed. One of the most common causes of performance issues in 10g databases is bad statistics.

Luckily, Oracle 10g also introduces several extremely useful tools for monitoring database performance and activity, including the Automatic Workload Repository (AWR) and the Automatic Database Diagnostic Monitor (ADDM). These tools can be used to pinpoint where bad statistics might be causing the optimizer to make bad decisions.
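
For reference, both tools ship with reporting scripts under $ORACLE_HOME/rdbms/admin; each prompts for a begin and end snapshot ID when run from SQL*Plus:

SQL> -- AWR report for a chosen snapshot range
SQL> @?/rdbms/admin/awrrpt.sql

SQL> -- ADDM findings for the same range
SQL> @?/rdbms/admin/addmrpt.sql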

The obvious question is, "What constitutes bad statistics?" Unfortunately, the answer is not as obvious as the question. In fact, unless you understand the nuts and bolts of how the CBO chooses its execution plan and how parsed execution plans are used in the library cache, it is unlikely that you will even know when bad statistics are causing your database to fail...

These intermittent failures were a top priority for this client, since each failure carried a substantial cost in resources and lost production time. Adding to the client's frustration, they had recently completed an engagement with a consulting company for performance tuning on that same database.

My general approach to determining the root cause was similar to the client's: find out what changed, then determine the cause of that change. Using the AWR to analyze the database's CPU and I/O, I was able to pinpoint the exact snapshot where the sudden resource spike occurred...
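
A minimal sketch of the kind of query that exposes such a spike, using the AWR history views (the two statistic names shown are standard v$sysstat names; the values are cumulative, so a sudden jump between adjacent snapshots marks the trouble snapshot):

SQL> SELECT s.snap_id,
            s.begin_interval_time,
            st.stat_name,
            st.value
       FROM dba_hist_sysstat  st,
            dba_hist_snapshot s
      WHERE st.snap_id   = s.snap_id
        AND st.stat_name IN ('CPU used by this session',
                             'physical reads')
      ORDER BY st.stat_name, s.snap_id;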

The source of resource consumption on a segment can be ascertained by understanding all of the operations taking place in the database at a particular time. The next step in the diagnosis was to look at the SQL statistics in AWR for the top SQL during the trouble snapshot.

I focused on the major consumers identified in the top SQL of the Statspack report during the trouble period, verified by running an ADDM report. Starting with the greatest consumer, we see that the logical and physical read deltas increased enormously while the number of executions remained fairly consistent.

SQL> SELECT s.snap_id               "BEGIN|SNAPSHOT",
            SQL.executions_delta    "EXEC|DELTA",
            SQL.buffer_gets_delta   "Buffer|Gets|DELTA",
            SQL.ccwait_delta        "Cncr|WAIT|DELTA",
            SQL.cpu_time_delta      "CPU|TIME|DELTA",
            SQL.elapsed_time_delta  "Elpsd|TIME|DELTA"
       FROM dba_hist_sqlstat  SQL,
            dba_hist_snapshot s
      WHERE s.snap_id  = SQL.snap_id
        AND SQL.sql_id = :1
      ORDER BY s.begin_interval_time ASC;

 

BEGIN|     EXEC|   Buffer|Gets|   Cncr|WAIT|   CPU|TIME|       Elpsd|TIME|
SNAPSHOT   DELTA   DELTA          DELTA        DELTA           DELTA
--------   -----   ------------   ----------   -------------   --------------
   39240   3,658          9,854            0         329,982          329,982
   39241   3,473          8,987            0         311,709          311,709
   39242   3,161          8,630            0         293,170          293,170
   39243   3,470          9,414            0         324,012          324,012
   39244   1,014     20,939,367        5,225     717,640,129      859,772,865
   39245   3,857     81,967,204      133,178   2,754,815,969    3,478,521,604

This situation held true for each statement in the top SQL. Furthermore, I noticed that the text of each of the top SQL statements had a similar predicate on the same table. Naturally, the next question was, "What caused the per-execution cost of the same SQL query to increase so dramatically?" This question can be answered by querying the AWR's SQL statistics again for additional information.

SQL> SELECT b.snap_id "BEGIN|SNAPSHOT",
            a.plan_hash_value
       FROM dba_hist_sqlstat  a,
            dba_hist_snapshot b
      WHERE a.sql_id = :1
        AND a.snap_id BETWEEN 39240 AND 39257
        AND a.snap_id = b.snap_id
      ORDER BY a.snap_id ASC;

 

BEGIN|SNAPSHOT   PLAN_HASH_VALUE
--------------   ---------------
         39240        3058762805
         39241        3058762805
         39242        3058762805
         39243        3058762805
         39244           6038888
         39245           6038888

Immediately, we notice that the SQL plan hash value changed. In other words, the execution plan for the same SQL changed during the trouble snapshot. Comparing the two execution plans from AWR, I saw that the original plan used an efficient index (on an ID column) while the new plan used a very inefficient one (on a DATE column).
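
Both historical plans can be pulled straight out of AWR by plan hash value with DBMS_XPLAN.DISPLAY_AWR; a sketch, with &sql_id standing in for the statement's actual SQL_ID:

SQL> -- The original, efficient plan
SQL> SELECT plan_table_output
       FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('&sql_id', 3058762805));

SQL> -- The new, inefficient plan
SQL> SELECT plan_table_output
       FROM TABLE(DBMS_XPLAN.DISPLAY_AWR('&sql_id', 6038888));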

The CREATION_TS column is a DATE field recording when each record was first created, and this particular table has several thousand rows inserted daily. Why would the optimizer choose to range scan a time period containing thousands of records to return one record, rather than use a selective index whose keys match only one to three rows? To any reasonable person viewing the two plans, it is obvious that the old plan is far more efficient. So what was causing the CBO to choose such a poor plan? And what was causing the SQL to be reparsed, allowing the plan to change?

In our case, this invalidation was producing a new execution plan that was extremely inefficient, leading to the failure of the database. Given this situation, I could easily have recommended using plan stability, but that would have been treating the symptom instead of curing the cause.

There are a few ways to examine the optimizer's cost-based decisions; arguably the most effective is a CBO trace. In this particular situation, though, the circumstantial evidence already pointed to the root problem. The optimizer was choosing between two predicates: a DATE column and an ID column. For this table, there were only a few rows for each ID value, while there were thousands of rows for each date. The optimizer had changed execution plans from using an index on the ID column to using an index on the DATE column.
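
For completeness, the CBO trace mentioned above is event 10053; it writes the optimizer's costing arithmetic to a trace file in user_dump_dest whenever a hard parse occurs while the event is set:

SQL> ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';
SQL> -- ... execute the problem SQL here; it must hard parse to be traced ...
SQL> ALTER SESSION SET EVENTS '10053 trace name context off';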

Comparing the two plans, the cost of the plan using the ID was calculated to be equal to the cost of the plan using the DATE. However, after executing each plan, the session statistics showed that the plan using the ID scanned only 17 blocks, while the plan using the DATE scanned over 38,000 blocks. Why was the poor plan being costed no higher than a vastly more efficient one?
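
The block counts above come from the session statistics. A sketch of the check, run from the session executing each plan, before and after the execution (v$mystat reports the current session's counters):

SQL> SELECT n.name,
            s.value
       FROM v$mystat   s,
            v$statname n
      WHERE s.statistic# = n.statistic#
        AND n.name IN ('session logical reads',
                       'physical reads');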

This client has an OLTP environment. Each of the queries identified as a top consumer was querying for the most recent day's data; in other words, the DATE range falls between TRUNC(SYSDATE) and SYSDATE. Given all of this diagnostic data, my suspicion was that a column histogram on the DATE field was causing the CBO to calculate costs from stale statistics. Sure enough, I was able to confirm that there was a height-balanced histogram on the DATE column.

SQL> SELECT histogram
       FROM dba_tab_cols
      WHERE table_name  = 'LEADS'
        AND column_name = 'CREATION_TS';

HISTOGRAM
---------------
HEIGHT BALANCED

With statistics being gathered only every one to two weeks, it was logical that the CBO believed the cost of querying the most recent dates would be very low. In other words, when reparsing the SQL, the CBO believed there were no values in the specified date range, since the column histogram contained no bucket for those dates...
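
The histogram's stale view of the data can be inspected directly: the highest bucket endpoint corresponds to the newest CREATION_TS value seen at the last statistics gathering (DATE endpoints are stored numerically in ENDPOINT_VALUE):

SQL> SELECT endpoint_number,
            endpoint_value
       FROM dba_tab_histograms
      WHERE table_name  = 'LEADS'
        AND column_name = 'CREATION_TS'
      ORDER BY endpoint_number;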

This situation was easy to demonstrate by comparing an old date range with a new one. Every query using a date range prior to the date of the latest statistics produced an execution plan using the index on the ID column, while any date range after the latest statistics used the index on the DATE column. This demonstrated that any plan on these predicates that was reparsed after the latest statistics gathering would favor the index on the DATE column, resulting in a very inefficient execution plan.
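
A sketch of that demonstration (LEADS is the table from this case; the lead_id column name and the literal dates are illustrative):

SQL> -- Date range older than the latest statistics: the ID index is chosen
SQL> EXPLAIN PLAN FOR
     SELECT *
       FROM leads
      WHERE lead_id = :1
        AND creation_ts BETWEEN TO_DATE('2007-01-01', 'YYYY-MM-DD')
                            AND TO_DATE('2007-01-02', 'YYYY-MM-DD');
SQL> SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY);

SQL> -- Date range newer than the latest statistics: the plan flips to the DATE index
SQL> EXPLAIN PLAN FOR
     SELECT *
       FROM leads
      WHERE lead_id = :1
        AND creation_ts BETWEEN TRUNC(SYSDATE) AND SYSDATE;
SQL> SELECT plan_table_output FROM TABLE(DBMS_XPLAN.DISPLAY);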

According to Oracle, column histograms are only appropriate when the values of a column are highly skewed. Obviously, a DATE column with a fairly uniform distribution of values does not fit this prescription. When I asked the client's DBAs why they were gathering histograms on all indexed columns, they said that their developers had requested the option.

Finally, we had the root cause: the use of histograms on DATE columns in an OLTP database where the queries on those columns specify the newest dates. This caused the CBO to miscalculate costs based on its belief that there was no data for the specified dates, and hence very few blocks to scan. The triggering event for the sudden degradation in performance was the reparsing of the SQL well after the date of the latest statistics. By removing the histogram on this column and reparsing the SQL, an immediate database-wide improvement and lasting stability were achieved.
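
One way to drop the histogram while keeping basic column statistics is to regather the column with a single bucket; a sketch, assuming the current user owns the LEADS table:

SQL> BEGIN
       DBMS_STATS.GATHER_TABLE_STATS(
         ownname    => USER,
         tabname    => 'LEADS',
         method_opt => 'FOR COLUMNS CREATION_TS SIZE 1');
     END;
     /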

The really remarkable part of this root cause analysis is that the time of the analysis was completely independent of the time of the actual failure. In other words, Oracle 10g's AWR and ADDM have made the once-crucial need to analyze a problem between crash and recovery a thing of the past. These tools are instrumental in analyzing database issues while keeping downtime to a minimum.

 

 

   
