Symmetric Difference in SQL
By Vadim Tropashko - Vadim is the author of
"SQL
Design Patterns", the experts guide to SQL programming.
How to compare the content of two tables is
one of the most popular topics on Ask Tom forum. In this
article, we'll investigate performance of three different table
comparison techniques. Among the other things, this case study
is an entertaining journey into oracle optimizer performance
arena.
The problem
Suppose there are two tables
A
and B
with the same columns, and we would like to know if there is any
difference in their contents.
In the set theory a similar question is
well known as a symmetric difference operation.
A reader with an Object Oriented programming
background might wonder why the Relational model
in general, and SQL in particular, doesn’t have
an equality operator. You'd think the first
operator you'd implement for a data type,
especially a fundamental data type, would be
equality! Simple: this operator is not
relationally closed, as the result is Boolean
value and not a relation.
|
SHAPE
\* MERGEFORMAT   


Figure 1: Symmetric difference of two sets
A and B can be expressed as either (A \ B)
∪ (B \ A),
or (A ∪ B)
\ (A ∩ B).
Set Solution
The figure 1 with the expressions (A \ B)
∪ (B \ A)
and (A ∪
B) \ (A ∩
B) pretty much exhaust all the theory involved. It is
straightforward to translate them into SQL queries. Here is the
first one:
(
select * from A
minus
select * from B
) union all (
select * from B
minus
select * from A
)
In practice, however, this query is a
sluggish performer. With a naïve evaluation strategy, the
execution flow and the operators are derived verbatim from the
SQL which we have written. First, each table has to be scanned
twice. Then, four sort operators are applied in order to exclude
duplicates. Next, the two set differences are computed, and,
finally, the two results are combined together with the union
operator.
From the performance perspective, this
approach appears to be far from optimal. First, each table has
to be scanned twice. Second, four sorts have to be done on top
of that.
Let's quickly verify this guess. For this
test purpose any sufficiently large tables would do:
create table A as
select obj# id, name from sys.obj$
where rownum < 100000;
create table B as
select obj# id, name from sys.obj$
where rownum < 100010;
After executing the symmetric difference
query, and capturing the rowsource execution statistics (from
V$SQL_PLAN_STATISTICS_ALL dictionary view) we get the following
result:

If confirms our intuition on the number of
table scans and sorts. It also reveals that the time scanning a
single table (about 100 msec) is small compared to the time
spent when sorting each table (about 450 msec). Therefore we
could try to aim lower than 2.4 sec of total execution time.
This is not the only possible execution
plan for the symmetric difference query. Oracle has quite
sophisticated query rewrite capabilities, and transforming set
operation into join is one of them. In our example, both minus
operators could be rewritten as anti-joins. As of version 10.2,
however, this transformation is not enabled by default, and used
as part of view merging/ query unnesting framework only. It can
be enabled with
alter session set "_convert_set_to_join"=
true;
The other alternative is to rewrite the
query manually as
select * from A
where (col1,col2,…) not in (select
col1,col2,… from B)
union all
select * from B
where (col1,col2,…) not in (select
col1,col2,… from A)
The new execution statistics

evidences about 30% improvement in
execution time, which is considered small by optimizer
standards.
The prevalent perception in the performance
optimization field is that Hash Join (and Sort-Merge
Join as well, for that matter) is almost always inferior to
indexed Nested Loops. Would nested loops be better than
hash join in this case? In order to enable indexed nested loops
access path we have to create indexes, first:
CREATE INDEX A_id_name ON A(id, name);
CREATE INDEX B_id_name ON B(id, name);
Note that in more realistic case, tables
would probably have more columns, but most likely they might
already have an index associated with primary key.
Unfortunately, index appearance doesn't
affect the execution plan. We have to enforce it. Quick and
dirty fix to the plan is disabling hash and merge join
altogether:
alter session set "_hash_join_enabled" =
false;
alter session set "_optimizer_sortmerge_join_enabled" = false;
The less intrusive way to influence an
access path is hinting. But how could we hint the join method
when there is no join in the query as written in the first
place? OK, this might be a convenient opportunity to discuss
hint enhancements in 10g.
Let's get bird's-eye view into query
optimization architecture in oracle, first. The basic query
structure in oracle RDBMS engine is query block. A
subquery, an inner view, an outermost select statement that
contain those, all are query blocks. The particular query that
we are studying in this section has 5 query blocks: 2 pairs of
selects from the tables A and B, and one set operation. After
this query is transformed, it would consist of only 3 query
blocks which we might witness from the QBLOCK_NAME column of the
PLAN_TABLE.

Those query block names allow directing the
hints to the query blocks where they are supposed to be applied.
In our case, the joint method can be hinted as
/*+ use_nl(@"SEL$74086987" A)
use_nl(@"SET$D8486D66" B)*/
which produces the following rowsource
level execution statistics:

I included the LAST_STARTS column into
result set in order to double-check that index rowsource node
has been started exactly once per each row scan in the outer
table. Specifically, B_ID_NAME index node has been started
99999 times and processed 99999 rows total, therefore each index
scan probed 99999/99999=1 row in the inner table B only, as if
the index were unique.
From the statistics output we see
significant increase in buffer gets which is not offset by any
noticeable improvement in the execution time. It is fair to
conclude that indexed nested loops didn't meet our performance
expectations in this case.
Symmetric Difference via Aggregation
Symmetric difference can be expressed via
aggregation:
select * from (
select id, name,
sum(case when src=1 then 1 else 0 end) cnt1,
sum(case when src=2 then 1 else 0 end) cnt2
from (
select id, name, 1 src from A
union all
select id, name, 2 src from B
)
group by id, name
)
where cnt1 <> cnt2;
This is appeared as rather elegant solution
where each table has to be scanned once only, until we capture
execution statistics:

Although we succeeded decreasing buffer
gets count in half, the execution time still remained around 2
sec.
Hash Value based Table Comparison
When comparing data in two tables with
identical signatures there actually are two questions that one
might want to ask:
- Is there any difference? The expected
answer is boolean.
- What are the rows that one table
contain, and the other doesn't.
Answering the second question implies that
we can answer the first, yet, it is possible that the first
query might have better performance. We’ll approach it with the
hash-based technique in mind.
The standard disclaimer of any hash-based
method is that it is theoretically possible to get a wrong
result. The rhetorical question, however, is how often did the
reader experienced a problem due to hash value collision? I
never did. Would a user accept a tradeoff: significant
performance boost of the table difference query at the expense
of remote possibility of getting an incorrect result?
The idea of hash based method is
associating a single hash value with a table. This hash value
behaves like aggregate, and therefore, it can be calculated
incrementally: if a row is added into a table, then a new hash
value is a function of the old hash value and the added row.
Incremental evaluation means good performance, and the two hash
values comparison is done momentarily.
Figure 2: A method to “weight” a table into a single
value involves concatenating all the fields per each row first.
Then a concatenated string is mapped into a number -- hash
value. Finally all hash values are aggregated into a single one.
Had a single field in a single row have change, the table weight
would change as well.
A naive implementation is as simple as
select
sum(DBMS_UTILITY.get_hash_value(id||'|'||name,
1,
POWER(2,16)-1))
from A
union all
select sum(DBMS_UTILITY.get_hash_value(id||'|'||name,
1,
POWER(2,16)-1))
from B;
We concatenate all the row fields (with a
distinct separator, in order to guarantee uniqueness), then
translate this string into a hash value. Alternatively, we could
have calculated hash values for each field, and then apply some
asymmetric function in order for the resulting hash value to be
sensitive to column permutations:
select sum(DBMS_UTILITY.get_hash_value(id
1,
POWER(2,16)-1)
+ 2*DBMS_UTILITY.get_hash_value(name
1,
POWER(2,16)-1))
from A
union all
select sum(DBMS_UTILITY.get_hash_value(id
1,
POWER(2,16)-1)
+ 2*DBMS_UTILITY.get_hash_value(name
1,
POWER(2,16)-1))
from B;
Row hash values are added together with
ordinary sum
aggregate, but we could have written a modulo
216-1
user-defined aggregate
hash_sum in the spirit of CRC (Cyclic Redundancy Check)
technique.
Acknowledgements
Many thanks to Rafi Ahmed for patiently
answering my numerous questions about query transformation in
Oracle RDBMS. The aggregation solution has been suggested by
Marco Stefanetti in an exchange at Ask Tom forum.
 |
Vadim Tropashko graduated from Moscow
Institute of Physics and Technology in 1984. Tropashko
researched Petri Nets for five years following
graduation. In the early 90s, his interests switched to
OOP. Tropashko translated "The C++ Programming Language"
by B.Stroustrup into Russian. In the mid 90s,
Tropashko's interest switched from OOP to databases. He
published numerous research and programming articles and
authored the book "SQL Design Patterns" (Rampant Tech
Press). |
|