One of the many
advantages of upgrading to Oracle 10g is
that the cost-based optimizer (CBO) has
been improved to allow for better
performance by using a
cost model that accounts for I/O and CPU.
This cost model allows the optimizer to
choose the best execution plans based on
statistics of the data. However, the
decisions made by the CBO depend on the
statistics that are available at the
time that the execution plan is parsed.
One of the most common causes of
performance issues in 10g databases is
bad statistics.
Luckily, Oracle 10g
also introduces several extremely useful
tools for monitoring database
performance and activity, including the
Automatic Workload Repository (AWR) and the
Automatic Database Diagnostic Monitor
(ADDM). These tools can be used to
pinpoint where bad statistics might be
causing the optimizer to make bad
decisions.
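For reference, both reports can be generated interactively with the scripts shipped in $ORACLE_HOME/rdbms/admin; a quick sketch of the typical SQL*Plus invocation (each script prompts for the report type and snapshot range):
-- Sketch: generate the AWR report, then the ADDM report, for the
-- snapshot range of interest (both scripts prompt for the range)
SQL> @?/rdbms/admin/awrrpt.sql
SQL> @?/rdbms/admin/addmrpt.sql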
The obvious question is, "What
constitutes bad statistics?"
Unfortunately, the answer is not as
obvious as the question. In fact, unless
you understand the nuts and bolts of how
the CBO chooses its execution plan and
how parsed execution plans are used in
the library cache, it is unlikely that
you will even know when bad statistics
are causing your database to fail. . .
These intermittent failures were a
top priority for this client since each
failure had a substantial cost in
resources and lost production time.
Adding to the client's frustration, they
had recently completed an engagement
with a consulting company for
performance tuning of that same database.
My general approach to determining a
root cause was similar to the client's
approach. I wanted to find out what
changed and then determine the cause for
that change. Using the AWR to analyze
the database CPU and I/O, I was able to
pinpoint the exact snapshot where the
sudden resource spike occurred. . .
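A query along the following lines against the AWR history views makes that kind of spike easy to locate (a sketch, assuming a single instance; DB CPU is just one of several time-model statistics worth trending this way):
-- Sketch: per-snapshot DB CPU, in seconds, computed as the delta of the
-- cumulative time-model statistic between consecutive snapshots
SQL> SELECT s.snap_id,
            s.begin_interval_time,
            (t.value - LAG(t.value) OVER (ORDER BY s.snap_id)) / 1e6 db_cpu_secs
     FROM   dba_hist_sys_time_model t,
            dba_hist_snapshot s
     WHERE  t.snap_id = s.snap_id
     AND    t.stat_name = 'DB CPU'
     ORDER BY s.snap_id;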
The source of resource consumption on
a segment can be ascertained by
understanding all of the operations
taking place in the database at a
particular time. The next action in the
diagnosis was to look at the SQL stats
in AWR for the top SQL during the
trouble snapshot.
I focused on the major consumers
identified in the top SQL of the
Statspack report during the trouble
period, verified by running an ADDM
report. Starting with the greatest consumer, we can see that the logical and physical read deltas increased dramatically while the executions remained fairly consistent.
SQL> SELECT s.snap_id "BEGIN|SNAPSHOT",
            q.executions_delta "EXEC|DELTA",
            q.buffer_gets_delta "Buffer|Gets|DELTA",
            q.ccwait_delta "Cncr|WAIT|DELTA",
            q.cpu_time_delta "CPU|TIME|DELTA",
            q.elapsed_time_delta "Elpsd|TIME|DELTA"
     FROM   dba_hist_sqlstat q,
            dba_hist_snapshot s
     WHERE  s.snap_id = q.snap_id
     AND    q.sql_id = :1
     ORDER BY s.begin_interval_time ASC;
BEGIN|SNAPSHOT  EXEC|DELTA  Buffer|Gets|DELTA  Cncr|WAIT|DELTA  CPU|TIME|DELTA  Elpsd|TIME|DELTA
--------------  ----------  -----------------  ---------------  --------------  ----------------
         39240       3,658              9,854                0         329,982           329,982
         39241       3,473              8,987                0         311,709           311,709
         39242       3,161              8,630                0         293,170           293,170
         39243       3,470              9,414                0         324,012           324,012
         39244       1,014         20,939,367            5,225     717,640,129       859,772,865
         39245       3,857         81,967,204          133,178   2,754,815,969     3,478,521,604
This situation held true for each SQL statement in the top SQL. Furthermore, I noticed that the text of each of the top SQL statements contained a similar predicate on the same table. Naturally, the next question was, "What caused the marginal cost of the same SQL query to increase so dramatically?" This question could be answered by querying the AWR's SQL stats again for additional information.
SQL> SELECT b.snap_id "BEGIN|SNAPSHOT", a.plan_hash_value
     FROM   dba_hist_sqlstat a, dba_hist_snapshot b
     WHERE  a.sql_id IN (:1)
     AND    a.snap_id BETWEEN 39240 AND 39257
     AND    a.snap_id = b.snap_id
     ORDER BY a.snap_id ASC;
BEGIN|SNAPSHOT  PLAN_HASH_VALUE
--------------  ---------------
         39240       3058762805
         39241       3058762805
         39242       3058762805
         39243       3058762805
         39244          6038888
         39245          6038888
Immediately, we notice that the SQL plan hash value had changed. In other words, the execution plan for the same SQL had changed during the trouble snapshot. Comparing the two execution plans using AWR, I saw that the original plan used an efficient index (on an ID column) while the new execution plan used a very inefficient index (on a DATE column). The field CREATION_TS is a date field that records the time when the row was first created. This particular table has several thousand rows inserted daily. Why would the optimizer choose to range scan a time period covering thousands of rows in order to return one row, rather than use a selective index whose keys match only one to three rows? To any reasonable person viewing these two plans, it is obvious that the old plan is much more efficient. So what was causing the CBO to choose such a poor plan? And what was causing the SQL to be reparsed so that the plan could change at all?
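The before-and-after comparison can be pulled straight out of AWR with DBMS_XPLAN; a sketch, with a placeholder in place of the real sql_id:
-- Sketch: display both historical plans recorded in AWR for the statement
-- ('0abcd1234efgh5' stands in for the real sql_id)
SQL> SELECT * FROM TABLE(dbms_xplan.display_awr('0abcd1234efgh5', 3058762805));
SQL> SELECT * FROM TABLE(dbms_xplan.display_awr('0abcd1234efgh5', 6038888));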
In our case, this invalidation was producing a new execution plan that was extremely inefficient, ultimately leading to the failure of the database. Given this situation, I could easily have recommended using plan stability, but that would have treated the symptom instead of curing the cause.
There are a few ways to examine the optimizer's cost-based decisions. Arguably, the most effective is a CBO trace (a sketch of enabling one follows below). With this particular situation, however, the root problem was revealed by more circumstantial data. The optimizer was basing its decisions on two predicates -- a DATE column and an ID column. For this table, there were only a few rows for each ID value, while there were thousands of rows for each date. The optimizer changed execution plans from using an index on the ID column to using an index on the DATE column.
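For completeness, this is how a CBO trace is typically captured; a minimal sketch using the 10053 event for the current session:
-- Sketch: write a CBO (10053) trace for the next hard parse in this session
SQL> ALTER SESSION SET EVENTS '10053 trace name context forever, level 1';
SQL> -- hard parse the problem statement here (e.g. via EXPLAIN PLAN)
SQL> ALTER SESSION SET EVENTS '10053 trace name context off';
The resulting trace file in user_dump_dest lists the costs the optimizer computed for each access path it considered.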
Comparing the two plans, the cost of the plan using the ID was being calculated as equal to the cost of the plan using the DATE. However, after executing each plan, the session statistics showed that the plan using the ID scanned only 17 blocks while the plan using the DATE scanned over 38,000 blocks. Why was the cost of the poor plan being evaluated as no higher than that of a much more efficient plan?
This client has an OLTP environment. The queries identified as the top consumers were all requesting the most recent day's data; in other words, the DATE range was always between TRUNC(SYSDATE) and SYSDATE.
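A representative shape of these statements looked something like the following (a sketch; the LEAD_ID column and bind name are hypothetical stand-ins, while the CREATION_TS predicate pattern is the real one):
-- Sketch: hypothetical column/bind names, real predicate pattern
SQL> SELECT *
     FROM   leads
     WHERE  lead_id = :id
     AND    creation_ts BETWEEN TRUNC(SYSDATE) AND SYSDATE;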
Given all of this diagnostic data, my
suspicion was that there was a column
histogram on the DATE field causing the
CBO to calculate costs based on old
statistics. Sure enough, I was able to
confirm that there was a height balanced
column histogram on the DATE field.
SQL> SELECT histogram FROM dba_tab_cols
     WHERE  table_name = 'LEADS' AND column_name = 'CREATION_TS';

HISTOGRAM
---------------
HEIGHT BALANCED
With statistics being gathered only every one to two weeks, it was logical that the CBO believed the cost of using the most recent dates would be very low. In other words, when reparsing the SQL, the CBO believed that there were no values in the specified date range, since the column histogram showed no values in any bucket for that range. . .
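This can be verified directly against the histogram's bucket endpoints (a sketch; for a DATE column the endpoints are stored as numeric encodings, so the highest endpoint roughly corresponds to the newest CREATION_TS value seen when statistics were last gathered):
-- Sketch: the highest endpoint reflects the newest date known to the statistics
SQL> SELECT MAX(endpoint_value)
     FROM   dba_tab_histograms
     WHERE  table_name  = 'LEADS'
     AND    column_name = 'CREATION_TS';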
This situation was easy to demonstrate by comparing an old date range with a new one. All queries using a date range prior to the date of the latest statistics produced execution plans using the index on the ID column, while any date range after the latest statistics used the index on the DATE field. This demonstration showed that any plan using these predicates that was reparsed after the date of the latest statistics would favor the index on the DATE field, resulting in a very inefficient execution plan.
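The demonstration amounts to explaining the same statement with an old literal date range versus a current one (a sketch; the literal dates, ID value, and LEAD_ID column are illustrative):
-- Sketch: a date range already covered by the existing statistics
SQL> EXPLAIN PLAN FOR
     SELECT * FROM leads
     WHERE  lead_id = 12345
     AND    creation_ts BETWEEN DATE '2006-01-01' AND DATE '2006-01-02';
SQL> SELECT * FROM TABLE(dbms_xplan.display);
     -- the plan here used the index on the ID column

-- ...versus a range newer than the latest statistics
SQL> EXPLAIN PLAN FOR
     SELECT * FROM leads
     WHERE  lead_id = 12345
     AND    creation_ts BETWEEN TRUNC(SYSDATE) AND SYSDATE;
SQL> SELECT * FROM TABLE(dbms_xplan.display);
     -- the plan here flipped to the index on CREATION_TS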
According to Oracle, it is only
proper to use column histograms when the
values of the column are highly skewed.
Obviously, a DATE field with a fairly consistent distribution of values does not fit this prescription. When I asked the client's DBAs why they were gathering histograms for all indexed columns, they said that their developers had requested this option.
Finally, we had the root cause: the use of a histogram on a DATE field in an OLTP database where the queries on this field specify the newest dates. This caused the CBO to miscalculate costs based on its belief that there was no data for the specified dates and, hence, very few blocks to scan. The triggering event for the sudden degradation in performance was the reparsing of the SQL well after the date of the latest statistics. Removing the histogram on this column and reparsing the SQL brought an immediate, database-wide improvement and lasting stability.
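The fix itself is a single DBMS_STATS call; a sketch, assuming the LEADS table statistics are otherwise gathered with default options (re-gathering the column with SIZE 1 drops its histogram, and no_invalidate => FALSE forces dependent cursors to be reparsed):
-- Sketch: drop the histogram on CREATION_TS and invalidate cached cursors
SQL> BEGIN
       dbms_stats.gather_table_stats(
         ownname       => user,
         tabname       => 'LEADS',
         method_opt    => 'FOR COLUMNS CREATION_TS SIZE 1',
         cascade       => TRUE,
         no_invalidate => FALSE);
     END;
     /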
The truly remarkable part of this root cause analysis is that the time of the analysis was completely independent of the time of the actual failure. In other words, Oracle 10g's AWR and ADDM have made the once-crucial need to perform analysis between crash and recovery a thing of the past. These tools are instrumental in analyzing database issues while keeping downtime to a minimum.