Oracle RAC Global Block Management
by Donald K. Burleson
As a review of the last installment of this series: we learned about the evolution and the internal mechanisms of data cache management in multi-instance Oracle10g Grid databases. As you may recall, Oracle instances require three main types of concurrency locking (a sample query after the list shows how such shared blocks appear in the GV$ views):
- Concurrent reads on multiple instances — When users on two different instances need to read the same set of blocks.
- Concurrent reads and writes on different instances — When a user intends to read a data block that was recently modified on another instance; the read can be for either the current version of the block or for a read-consistent previous version.
- Concurrent writes on different instances — When the same set of data blocks is modified by different users on different instances.
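In a running RAC database, this kind of multi-instance block sharing is visible in the GV$BH view (the global buffer-header view), which lists every buffered block in every instance's cache. The query below is only a sketch, assuming the standard inst_id, file#, block#, and status columns documented for Oracle9i and Oracle10g:
-- blocks currently cached by more than one instance
select
   file#,
   block#,
   count(distinct inst_id)   instance_count
from
   gv$bh
where
   status <> 'free'
group by
   file#,
   block#
having
   count(distinct inst_id) > 1;
Any rows returned identify blocks that are currently cached on more than one instance and are therefore subject to the coordination described above.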
The Global Cache Service (GCS) is the RAC process mechanism for maintaining cache coherency. Next, let’s dive into the internals of the multi-instance cache invalidation and cache coherency mechanisms.
RAC Invalidation Mechanism
One important
aspect of the cache coherency and cache fusion
architectures is the concept of block invalidation.
In general, block invalidation is the process by
which in-memory blocks are flagged as “invalid.”
Block invalidation occurs in RAC under the following
conditions:
- Block changes on other nodes — As blocks are changed, the Global Cache Service sends messages via Cache Fusion to change the status of the block.
- Failure of nodes — Upon instance failure, Oracle RAC manages recovery from the surviving instances, maintaining the status of updated blocks to ensure that no updates are lost.
However, regardless
of the reason, invalidation only happens at a data
block level, not at the cache level. Essentially,
block invalidation involves “status changes” in the
data block buffers of each RAC instance. These
status changes are based on the messages transmitted
across the interconnect via the cache fusion
processes.
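These status changes can be observed directly in the buffer headers. The GV$BH view reports a status for every cached copy of a block: xcur (exclusive current), scur (shared current), cr (consistent-read clone), pi (past image), and free (unused or invalidated). The sketch below checks the status of one block across all instances; the file and block numbers are placeholders, and the columns assume the standard 9i/10g GV$BH layout:
-- cached copies of a single block across all instances
-- (file 5, block 100 are placeholder values)
select
   inst_id,
   status,
   dirty
from
   gv$bh
where
   file#  = 5
and block# = 100;
The same block may show as xcur in the instance that last modified it while appearing as cr, pi, or free copies elsewhere, which is exactly the kind of status change driven by the Cache Fusion messages.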
In Oracle OPS, the
DLM and IDLM processes handled the invalidation of
blocks by controlling the latches placed on those
blocks. The latches were controlled via messages
sent over the high-speed interconnect to the DLM or
IDLM processes. Remember, in OPS, an invalid block had to be re-read from disk by the instance to become valid again, and this extra disk I/O was a major bottleneck.
As we already know, with Cache Fusion (8.1.6 and later versions of OPS, and Oracle9i RAC), blocks can be passed back and forth between the nodes, and the images of the blocks can be merged to maintain the correct image, which reduces the need to totally invalidate a block and force a re-read from disk.
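The effect of Cache Fusion shows up in the system statistics. In Oracle10g the interconnect transfers are counted in the 'gc cr blocks received' and 'gc current blocks received' statistics (Oracle9i uses statistic names prefixed with 'global cache'), and comparing them with 'physical reads' gives a rough feel for how many block requests were satisfied across the interconnect rather than from disk. Treat the following as a sketch, since statistic names vary by release:
-- interconnect block transfers versus disk reads (10g statistic names)
select
   inst_id,
   name,
   value
from
   gv$sysstat
where
   name in ('gc cr blocks received',
            'gc current blocks received',
            'physical reads')
order by
   inst_id,
   name;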
As individual nodes make changes to blocks and issue commits, each node is responsible for writing the blocks back to disk and for flagging the appropriate latches, via semaphore signals, to indicate the state changes.
Before Oracle wrote its own lock manager software (for the Windows and Linux platforms), each hardware vendor was responsible for implementing a layer of software that allowed cluster database processing: the Operating System-Dependent (OSD) layer. These layers provide the communication links between the operating system and the Real Application Clusters software. Each vendor’s OSD is proprietary code, independent of Oracle, that coordinates the activities of the cluster, and Oracle depends on it to track node status in the cluster.
RAC Data Block Transfer Mechanism
Oracle treats data blocks as resources, and access to them is synchronized while database processing activity is in progress. This coordination of concurrent tasks is called synchronization.
Resources such as data blocks (managed by the GCS) and enqueues (managed by the GES) are synchronized as nodes within a cluster acquire and release ownership of blocks. The synchronization provided by Real Application Clusters maintains cluster-wide concurrency of the resources and, in turn, ensures the integrity of the shared data.
The following is a
description of the RAC lock mechanism, taken from
the Ault and Tumma book, Oracle9i RAC:
“The data block
(or GCS resource) can be held in different
resource modes, depending on whether a resource
holder intends to modify the data or read the
data. The modes are as follows:
- Null (N) mode — Holding a resource at this level conveys that there are no access rights. Null mode is usually held as a placeholder, even if the resource is not actively used. [Steve Adams]
- Shared (S) mode — When a resource is held at this level, it will ensure that the data block is not modified by another session, but will allow concurrent shared access.
- Exclusive (X) mode — This level grants the holding process exclusive access. Other processes cannot write to the resource. It may have consistent read blocks.”
The resource mode is an important mechanism for maintaining data integrity, performing lock escalation, and avoiding data corruption issues.
Within RAC, the GCS resources are allowed to have global
roles or local roles. These roles are mutually exclusive and
serve very different purposes:
1. When a block is first read
into the cache of an instance and other instances have not
read the same block, then the block is said to be locally
managed and is therefore assigned a local role.
2. After the block
has been modified by the local instance and
transmitted to another instance, it is considered to
be globally managed, and is therefore assigned a
global role.
Thus, Oracle treats the data buffer (data block) as a resource and coordinates, among all of the instances (within each node’s db_cache_size), the shipping of data buffers to other instances. The resulting combination of mode and role is sometimes referred to as the buffer state, and it is an important way to categorize a block.
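The buffer state is exposed through the status column of GV$BH, and the mode and role just described can be inferred from it: xcur corresponds to a current block held exclusive (X), scur to a current block held shared (S), cr to a consistent-read copy, and pi to a past image, which exists only under a global role. The decode below is a sketch of that mapping, using the status values documented for 9i and 10g:
-- summarize buffer states per instance in GCS mode/role terms
select
   inst_id,
   buffer_state,
   count(*)   buffers
from
  (select
      inst_id,
      decode(status,
             'xcur', 'current, exclusive (X)',
             'scur', 'current, shared (S)',
             'cr',   'consistent read copy',
             'pi',   'past image (global role)',
             'free', 'free / invalidated',
             status)   buffer_state
   from
      gv$bh)
group by
   inst_id,
   buffer_state
order by
   inst_id,
   buffer_state;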
Internally, lock escalation and data block transfer are accomplished by sending lock messages to the Cache Fusion layer via the GCS processes on each node. There is no shipping of instructions or code to another instance to perform a given task; instead, the Cache Fusion layer manages the state of every data block in every instance.
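The GCS work itself is carried out by the LMS background processes, and their block-serving activity is summarized in GV$CR_BLOCK_SERVER. The following sketch assumes the columns documented for 9i/10g (cr_requests, current_requests, data_requests, undo_requests); exact columns can differ between releases:
-- block-serving workload of the GCS (LMS) processes on each instance
select
   inst_id,
   cr_requests,
   current_requests,
   data_requests,
   undo_requests
from
   gv$cr_block_server;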
As we will see in the following example, the overall functioning of data block movement, and the provision of a single system image to the application user connections, is based on a series of escalations of buffer states. Let’s take a closer look at the block transfer steps (a query after the steps shows the resulting past images):
1. When an instance needs a data block, it makes a request to the GCS, which keeps track of the resources, their location, and their status. In this example, Instance 1 intends to modify (update) the data block and submits its request to the GCS.
2. The GCS then sends a message, forwarding the request to the instance that currently holds the block.
3. The holding instance (Instance 2) then transmits a copy of the block to the requesting instance. Because the requestor intends to modify the block, Instance 2 first downgrades its own resource to null mode and keeps the changed (dirty) buffer as a past image (PI); the role therefore changes to global (G), because the block is dirty. Along with the block, Instance 2 informs the requestor that it retained a PI copy and a null resource, and the same message specifies that the requestor can take the block held in exclusive mode and with a global role (X, G).
4. The receiving instance now informs the GCS of its own resource disposition (X, G) and also that of the instance that sent the block (N, G).
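The past image (PI) retained by the sending instance in step 3 is visible as a buffer with status 'pi' in GV$BH. A quick sketch, again assuming the standard 9i/10g status values:
-- past-image buffers retained after shipping dirty blocks to other instances
select
   inst_id,
   count(*)   past_image_buffers
from
   gv$bh
where
   status = 'pi'
group by
   inst_id;
A nonzero count shows dirty blocks that an instance has shipped to another instance and is still holding under a global role.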
As these steps show, the block transfer involves no disk I/O; the transfer takes place entirely over the high-speed private interconnect. This is a key feature of the Oracle RAC system and a major reason OPS was renamed RAC. Next, let’s explore the Global Resource Directory and see how it manages the state of all data blocks.
Inside the Global Resource Directory
As we noted, a major shortcoming of OPS was the requirement that “master” instances manage the global locks for the cluster. This changed in RAC: a global view of the blocks in the clustered cache is maintained in the RAC Global Resource Directory (GRD). This is an important management feature that keeps track of all the data block resources of multiple instances. The following is a description of how the GCS and GES control data block information, also taken from the Ault and Tumma book, Oracle9i RAC:
“The GES and GCS
together maintain a global resource directory (GRD)
to record information about resources and
enqueues. The GRD remains in the memory and is
stored on all the instances. Each instance
manages a portion of the directory. The
distributed nature of the GRD is a key point for
the fault tolerance of RAC.
The GRD is an
internal database that records and stores the
current status of the data blocks. Whenever a
block is transferred out of a local cache to
another instance’s cache, the GRD is updated.
The following resource information is available
in GRD:
- Data Block Addresses (DBA).
- Location of most current versions.
- Modes of the data blocks ((N) Null, (S) Shared, (X) Exclusive).
- The Roles of the data blocks (local or global).
The GRD is similar
to the previous version of the lock directory
from a functional perspective, but it has been
expanded with more components. It contains an
accurate inventory of resource status and
location.”
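Because each instance masters a portion of the directory, a rough picture of how GRD mastership is spread across the cluster can be taken from GV$GES_RESOURCE. This is only a sketch: it assumes the master_node column documented for 9i/10g, and it covers the enqueue side of the directory (the GCS block resources are tracked separately):
-- approximate distribution of GRD resource mastership across the cluster
select
   inst_id,
   master_node,
   count(*)   resources_mastered
from
   gv$ges_resource
group by
   inst_id,
   master_node
order by
   inst_id,
   master_node;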
Using this directory, Oracle RAC implementations always keep the global view of all the caches involved in the cluster. In OPS, Oracle9i RAC, and Oracle10g Grid, nodes may join and leave the cluster at will (in Oracle10g, via the Grid Control mechanism in OEM), but the global view of the data blocks participating in multi-instance caches is maintained through the GRD.
Now that we
understand the mechanism for lock escalation and
global block management in RAC, we are ready to
conclude this series with a discussion of
inter-instance node management. This is a core
feature of Oracle10g Grid computing and an
area that every DBA must understand at an intimate
level.