Oracle Consulting Oracle Training Oracle Support Development
Home
Catalog
Oracle Books
SQL Server Books
IT Books
Job Interview Books
eBooks
Rampant Horse Books
911 Series
Pedagogue Books

Oracle Software
image
Write for Rampant
Publish with Rampant
Rampant News
Rampant Authors
Rampant Staff
 Phone
 800-766-1884
Oracle News
Oracle Forum
Oracle Tips
Articles by our Authors
Press Releases
SQL Server Books
image
image

Oracle 11g Books

Oracle tuning

Oracle training

Oracle support

Remote Oracle

STATSPACK Viewer

Privacy Policy

 

 
 

Symmetric Difference in SQL

By Vadim Tropashko - Vadim is the author of "SQL Design Patterns", the experts guide to SQL programming.

How to compare the content of two tables is one of the most popular topics on Ask Tom forum. In this article, we'll investigate performance of three different table comparison techniques. Among the other things, this case study is an entertaining journey into oracle optimizer performance arena.

The problem

Suppose there are two tables A and B with the same columns, and we would like to know if there is any difference in their contents.

In the set theory a similar question is well known as a symmetric difference operation.

Relation Equality

A reader with an Object Oriented programming background might wonder why the Relational model in general, and SQL in particular, doesn’t have an equality operator. You'd think the first operator you'd implement for a data type, especially a fundamental data type, would be equality! Simple: this operator is not relationally closed, as the result is Boolean value and not a relation.

 

 SHAPE  \* MERGEFORMAT Text Box: A\B

 

Text Box: B\A

 

Text Box: A∩B

 

 

 

 

Figure 1: Symmetric difference of two sets A and B can be expressed as either (A \ B) (B \ A), or (A B) \ (A B).

 

Set Solution

 

The figure 1 with the expressions (A \ B) (B \ A) and (A B) \ (A B) pretty much exhaust all the theory involved. It is straightforward to translate them into SQL queries. Here is the first one:

 

(

 select * from A 

 minus

 select * from B

) union all ( 

 select * from B 

 minus

 select * from A

)

 

In practice, however, this query is a sluggish performer. With a naïve evaluation strategy, the execution flow and the operators are derived verbatim from the SQL which we have written. First, each table has to be scanned twice. Then, four sort operators are applied in order to exclude duplicates. Next, the two set differences are computed, and, finally, the two results are combined together with the union operator.

 

From the performance perspective, this approach appears to be far from optimal. First, each table has to be scanned twice. Second, four sorts have to be done on top of that.

 

Let's quickly verify this guess. For this test purpose any sufficiently large tables would do: 

 

create table A as
select obj# id, name from sys.obj$
where rownum < 100000;

create table B as
select obj# id, name from sys.obj$
where rownum < 100010;

 

After executing the symmetric difference query, and capturing the rowsource execution statistics (from V$SQL_PLAN_STATISTICS_ALL dictionary view) we get the following result:

 

If confirms our intuition on the number of table scans and sorts. It also reveals that the time scanning a single table (about 100 msec) is small compared to the time spent when sorting each table (about 450 msec). Therefore we could try to aim lower than 2.4 sec of total execution time. 

 

This is not the only possible execution plan for the symmetric difference query. Oracle has quite sophisticated query rewrite capabilities, and transforming set operation into join is one of them. In our example, both minus operators could be rewritten as anti-joins. As of version 10.2, however, this transformation is not enabled by default, and used as part of view merging/ query unnesting framework only. It can be enabled with

 

alter session set "_convert_set_to_join"= true;

 

The other alternative is to rewrite the query manually as

 

select * from A 

where (col1,col2,…) not in (select col1,col2,… from B)

   union all  

select * from B 

where (col1,col2,…) not in (select col1,col2,… from A)

 

The new execution statistics

evidences about 30% improvement in execution time, which is considered small by optimizer standards.

 

The prevalent perception in the performance optimization field is that Hash Join (and Sort-Merge Join as well, for that matter) is almost always inferior to indexed Nested Loops. Would nested loops be better than hash join in this case? In order to enable indexed nested loops access path we have to create indexes, first:

 

CREATE INDEX A_id_name ON A(id, name);
CREATE INDEX B_id_name ON B(id, name);

 

Note that in more realistic case, tables would probably have more columns, but most likely they might already have an index associated with primary key.

 

Unfortunately, index appearance doesn't affect the execution plan. We have to enforce it. Quick and dirty fix to the plan is disabling hash and merge join altogether:

 

alter session set "_hash_join_enabled" = false;
alter session set "_optimizer_sortmerge_join_enabled" = false;

 

The less intrusive way to influence an access path is hinting. But how could we hint the join method when there is no join in the query as written in the first place? OK, this might be a convenient opportunity to discuss hint enhancements in 10g.

 

Let's get bird's-eye view into query optimization architecture in oracle, first. The basic query structure in oracle RDBMS engine is query block. A subquery, an inner view, an outermost select statement that contain those, all are query blocks. The particular query that we are studying in this section has 5 query blocks: 2 pairs of selects from the tables A and B, and one set operation. After this query is transformed, it would consist of only 3 query blocks which we might witness from the QBLOCK_NAME column of the PLAN_TABLE.

 

 

Those query block names allow directing the hints to the query blocks where they are supposed to be applied. In our case, the joint method can be hinted as

 

/*+ use_nl(@"SEL$74086987" A)
    use_nl(@"SET$D8486D66" B)*/

 

which produces the following rowsource level execution statistics:

 

 

I included the LAST_STARTS column into result set in order to double-check that index rowsource node has been started exactly once per each row scan in the outer table. Specifically,  B_ID_NAME index node has been started 99999 times and processed 99999 rows total, therefore each index scan probed 99999/99999=1 row in the inner table B only, as if the index were unique.

 

From the statistics output we see significant increase in buffer gets which is not offset by any noticeable improvement in the execution time. It is fair to conclude that indexed nested loops didn't meet our performance expectations in this case.

 

Symmetric Difference via Aggregation

 

Symmetric difference can be expressed via aggregation:

 

select * from (
  select id, name,
    sum(case when src=1 then 1 else 0 end) cnt1,
    sum(case when src=2 then 1 else 0 end) cnt2
  from (
    select id, name, 1 src from A
    union all
    select id, name, 2 src from B
  )
  group by id, name
)
where cnt1 <> cnt2;

 

This is appeared as rather elegant solution where each table has to be scanned once only, until we capture execution statistics:

 

 

 

Although we succeeded decreasing buffer gets count in half, the execution time still remained around 2 sec.

 

Hash Value based Table Comparison

 

When comparing data in two tables with identical signatures there actually are two questions that one might want to ask:

  • Is there any difference? The expected answer is boolean.
  • What are the rows that one table contain, and the other doesn't.

Answering the second question implies that we can answer the first, yet, it is possible that the first query might have better performance. We’ll approach it with the hash-based technique in mind.

 

The standard disclaimer of any hash-based method is that it is theoretically possible to get a wrong result. The rhetorical question, however, is how often did the reader experienced a problem due to hash value collision? I never did. Would a user accept a tradeoff: significant performance boost of the table difference query at the expense of remote possibility of getting an incorrect result?

 

The idea of hash based method is associating a single hash value with a table. This hash value behaves like aggregate, and therefore, it can be calculated incrementally: if a row is added into a table, then a new hash value is a function of the old hash value and the added row. Incremental evaluation means good performance, and the two hash values comparison is done momentarily.

 

 

 

 

 

 

 

 

 

 

 

 

 

 


 

Figure 2: A method to “weight” a table into a single value involves concatenating all the fields per each row first. Then a concatenated string is mapped into a number -- hash value. Finally all hash values are aggregated into a single one. Had a single field in a single row have change, the table weight would change as well.

 

A naive implementation is as simple as

 

select sum(DBMS_UTILITY.get_hash_value(id||'|'||name,
                                       1,
                                       POWER(2,16)-1))
from A
 union all
select sum(DBMS_UTILITY.get_hash_value(id||'|'||name,
                                       1,
                                       POWER(2,16)-1))
from B;

 

We concatenate all the row fields (with a distinct separator, in order to guarantee uniqueness), then translate this string into a hash value. Alternatively, we could have calculated hash values for each field, and then apply some asymmetric function in order for the resulting hash value to be sensitive to column permutations:

 

 

 

select sum(DBMS_UTILITY.get_hash_value(id
                                       1,
                                       POWER(2,16)-1)
       + 2*DBMS_UTILITY.get_hash_value(name
                                       1,
                                       POWER(2,16)-1))
from A
 union all
select sum(DBMS_UTILITY.get_hash_value(id
                                       1,
                                       POWER(2,16)-1)
       + 2*DBMS_UTILITY.get_hash_value(name
                                       1,
                                       POWER(2,16)-1))
from B;

 

Row hash values are added together with ordinary sum aggregate, but we could have written a modulo 216-1 user-defined aggregate hash_sum in the spirit of CRC (Cyclic Redundancy Check) technique.

 

Acknowledgements

 

Many thanks to Rafi Ahmed for patiently answering my numerous questions about query transformation in Oracle RDBMS. The aggregation solution has been suggested by Marco Stefanetti in an exchange at Ask Tom forum.

 

Vadim Tropashko graduated from Moscow Institute of Physics and Technology in 1984. Tropashko researched Petri Nets for five years following graduation. In the early 90s, his interests switched to OOP. Tropashko translated "The C++ Programming Language" by B.Stroustrup into Russian. In the mid 90s, Tropashko's interest switched from OOP to databases. He published numerous research and programming articles and authored the book "SQL Design Patterns" (Rampant Tech Press).

 

 

   

 Copyright © 1996 -2017 by Burleson. All rights reserved.


Oracle® is the registered trademark of Oracle Corporation. SQL Server® is the registered trademark of Microsoft Corporation. 
Many of the designations used by computer vendors to distinguish their products are claimed as Trademarks