Oracle RAC Cache Coherency
by Donald K. Burleson
As we noted in the first installment of this RAC series,
cache coherency is the mechanism that keeps multiple RAM
data caches (as defined by the db_cache_size and
db_block_buffers parameters) synchronized. This is
especially critical when dozens of
Oracle 10g instances (SGA regions) share a single copy
of the Oracle 10g Grid database.
In an Oracle RAC system, concurrency and consistency
are maintained as if it were a single-image system. Even
though the same set of blocks (read from the I/O
device or storage) may be brought into the cache of each
instance, data integrity must be maintained.
From the Ault and Tumma book, Oracle RAC and Grid, we
see this excellent description of the cache coherency
mechanism:
“In a RAC system, users can connect with multiple
instances to run database queries. Typically, users will
be connected to different nodes but access the same set
of data or data blocks. This situation demands that the
data consistency, formerly confined to a single
instance, be effectively extended to multiple instances.
Therefore, buffer cache coherence from multiple
instances must be maintained.
Instances require three main types of concurrency:
- Concurrent reads on multiple instances: When users on two
  different instances need to read the same set of blocks.
- Concurrent reads and writes on different instances: A user
  intends to read a data block that was recently modified, and
  the read can be for either the current version of the
  block, or for a read-consistent previous version.
- Concurrent writes on different instances: When the same set
  of data blocks are modified by different users on different
  instances”
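The three cases in the quoted list can be sketched as a small classifier. This is only an illustration of the taxonomy; the Access enum and the classify_concurrency function are hypothetical names, not part of any Oracle interface.

```python
from enum import Enum


class Access(Enum):
    """Kind of access an instance makes to a data block (illustrative)."""
    READ = "read"
    WRITE = "write"


def classify_concurrency(first: Access, second: Access) -> str:
    """Classify two accesses to the same block made from different instances."""
    if first is Access.READ and second is Access.READ:
        return "concurrent reads"
    if first is Access.WRITE and second is Access.WRITE:
        return "concurrent writes"
    # One instance reads while the other writes.
    return "concurrent reads and writes"
```

Only the mixed and write/write cases force the cluster to decide which instance holds the current version of the block; two readers can each keep their own copy.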
Cache coherency demands that even though there are multiple
instances (each with a separate db_cache_size data
buffer region) in which data blocks can reside or be brought
in, block consistency must be maintained. Oracle RAC
achieves this through inter-instance block
transfers using the Cache Fusion mechanism. The Global Cache
Service (GCS), which is implemented as a set of processes,
coordinates this facility. GCS also ensures that only one
instance modifies a block at any given time. Even when the
same data block is cached in different instances at the same
time, global consistency is maintained.
Let’s take a closer look at the data block writing mechanism
in Oracle RAC.
Data Block Writing Method
Oracle relies on two concepts here: the dirty block and the
past image of a block. Both are described below.
“Whenever a server process changes or modifies a
data block, it becomes a dirty block. Once a server
process makes changes to the data block, the user may
commit transactions, or transactions may not be
committed for quite some time. In either case, the dirty
block is not immediately written back to disk.
Writing dirty blocks to disk takes place under the
following two conditions:
- When a server process cannot find a clean, reusable
  buffer after scanning a threshold number of buffers,
  then the database writer process writes the dirty
  blocks to disk.
- When the checkpoint takes place the database writer
  process writes the dirty blocks to disk”
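The two write conditions in the quote can be illustrated with a toy buffer cache. This is a minimal sketch, not Oracle's algorithm: the BufferCache class, its method names, the scan_threshold parameter and the eviction order are all assumptions made for the example.

```python
class BufferCache:
    """Toy buffer cache; names and the threshold value are illustrative only."""

    def __init__(self, size: int, scan_threshold: int):
        self.size = size                      # assumed to be at least 1
        self.scan_threshold = scan_threshold
        self.buffers = {}                     # block_id -> (data, dirty flag)
        self.disk = {}                        # stands in for the datafiles

    def _write_dirty(self) -> int:
        """Database-writer step: flush every dirty buffer and mark it clean."""
        written = 0
        for block_id, (data, dirty) in list(self.buffers.items()):
            if dirty:
                self.disk[block_id] = data
                self.buffers[block_id] = (data, False)
                written += 1
        return written

    def _free_buffer(self) -> None:
        """Condition 1: scan for a clean buffer; flush if none is found in time."""
        scanned = 0
        for block_id, (_, dirty) in list(self.buffers.items()):
            if not dirty:
                del self.buffers[block_id]    # reuse the clean buffer
                return
            scanned += 1
            if scanned >= self.scan_threshold:
                break
        self._write_dirty()
        # After the flush every buffer is clean, so reuse the oldest one.
        oldest = next(iter(self.buffers))
        del self.buffers[oldest]

    def modify(self, block_id, data) -> None:
        """A server process changes a block: the buffer becomes dirty."""
        if block_id not in self.buffers and len(self.buffers) >= self.size:
            self._free_buffer()
        self.buffers[block_id] = (data, True)

    def checkpoint(self) -> int:
        """Condition 2: a checkpoint writes all dirty blocks to disk."""
        return self._write_dirty()
```

Note that modify never writes to disk on its own, whether or not the change is committed; a write happens only when a free buffer cannot be found or a checkpoint occurs.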
As noted above, a data block is not written to
disk immediately, even after it becomes dirty as the
result of an update. When the same dirty data block is
requested by another instance for write or read purposes, the
owning instance keeps an image of the block in its own cache
and ships the current block to the requesting instance. This
retained image of the block is called the past image (PI) and
is kept in memory.
In the event of instance failure, Oracle can reconstruct the
current version of the block by reading the PIs from RAM. It
is also possible to have more than one past image in
memory, depending on how many times the data block was
requested while in the dirty state. The process of writing the
blocks back to the I/O device (disk storage unit) depends on
the checkpoint schedule defined by the DBA for the RAC
cluster. Once the checkpoint interval is reached, Oracle’s
Database Writer (DBWR) process initiates an asynchronous
write of the dirty blocks to disk.
When the write takes place, a message is sent across Cache
Fusion to change the status of the block in the other
instances, and the past images (PIs) on all other instances
are invalidated and discarded.
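The life cycle of a past image can be sketched with a toy model of a single block. This is an illustration under simplifying assumptions, not Oracle's implementation: the Cluster class and its methods are hypothetical, only write requests are modeled, and the disk write is reduced to one method call.

```python
class Cluster:
    """Toy model of one data block shared by several instances."""

    def __init__(self, instance_names, initial_value):
        self.disk_value = initial_value
        self.current_holder = None            # instance holding the current block
        self.current_value = initial_value
        self.dirty = False
        self.past_images = {name: [] for name in instance_names}

    def modify(self, instance, new_value):
        """Ship the block to `instance` if needed, then change it there."""
        holder = self.current_holder
        if holder is not None and holder != instance and self.dirty:
            # The owning instance keeps a past image of the dirty block
            # and ships the current block to the requester.
            self.past_images[holder].append(self.current_value)
        self.current_holder = instance
        self.current_value = new_value
        self.dirty = True

    def write_to_disk(self):
        """Write the current block, then discard every past image."""
        self.disk_value = self.current_value
        self.dirty = False
        for images in self.past_images.values():
            images.clear()
```

If the block moves from instance A to B to C while still dirty, both A and B are left holding a past image, which is how more than one PI can exist at once. A single write_to_disk call then clears them all.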
For more details, refer to Oracle Metalink Note
139436.1, “Understanding 9i Real Application Clusters
Cache Fusion.”
Internal Lock Messaging in RAC
Remember, Oracle uses a lock escalation mechanism to
maintain cache coherency. Only one copy of a given block can
be buffered in the exclusive current (“xcur”) state in the
cluster at any one time, and to modify a block, an instance
must hold the buffer containing the block in the xcur
state.
For example, if another instance requests to read
the same block in its most current version, Oracle
sends a message to change the access mode from exclusive to
shared, keeps a past image (PI) buffer if the buffer
contained a dirty (changed) block, and ships a “current read”
version of the block to the requesting instance. The original
instance keeps a copy in current mode, but the overall status
of the block becomes global. Unlike xcur, multiple copies of
the block in shared current (scur) mode can be cached at any
time.
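The xcur and scur transitions described above can be sketched as a small state machine for one block. This is a toy model, not Oracle's locking code: the BlockStates class, its method names and the pi_holders set are assumptions made for illustration, and the messaging between instances is omitted.

```python
class BlockStates:
    """Tracks the buffer state of one block on each instance (toy model)."""

    def __init__(self, instances):
        # None means the instance does not have a current copy cached.
        self.state = {name: None for name in instances}
        self.pi_holders = set()               # instances keeping a past image
        self.dirty = False

    def request_write(self, instance):
        """Grant xcur to `instance`; every other current copy is given up."""
        for other, st in list(self.state.items()):
            if other == instance or st is None:
                continue
            if st == "xcur" and self.dirty:
                self.pi_holders.add(other)    # dirty holder keeps a past image
            self.state[other] = None
        self.state[instance] = "xcur"
        self.dirty = True

    def request_current_read(self, instance):
        """Downgrade any xcur holder to scur and give `instance` an scur copy."""
        if self.state[instance] == "xcur":
            return                            # already holds the current block
        for other, st in list(self.state.items()):
            if other != instance and st == "xcur":
                self.state[other] = "scur"
                if self.dirty:
                    self.pi_holders.add(other)
        self.state[instance] = "scur"

    def xcur_count(self):
        """Number of instances holding the block in exclusive current mode."""
        return sum(1 for st in self.state.values() if st == "xcur")
```

Whatever sequence of requests is applied, xcur_count never exceeds one, while any number of instances may hold the block in scur mode at the same time.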
In early versions of Oracle Parallel Server (OPS), one master
instance kept track of the lock status, so if the master
instance crashed, the entire OPS system went down. Obviously,
this was a serious shortcoming, remedied in RAC. In later
versions of OPS and RAC, only the uncommitted transactions on
the instance that goes down are lost. The other instances stay
active.
In RAC there is still a master node, but while the first
node to start up becomes the “master” node, it is strictly a
bookkeeping method, and there are no repercussions to the
cluster if the master node dies. The Cache Fusion mechanisms
for Global Cache Service (GCS) and Global Enqueue Service
(GES) are global resources, running on all nodes in the
cluster, serving to maintain the global resource
directory.
Now that we understand the RAC block updating process, we
are ready to move even deeper into RAC internals. Our next
installment will examine RAC invalidation mechanisms.