Sunday, August 2, 2015

Oracle Data Pump Architecture 11g

Understanding Data Pump Architecture will let you take advantage of its speed and features. Also, knowing the intricacies of a tool does lend you the powers to identify the use cases when that tool would be a better/worse fit than the alternatives.

So without further ado, here we go:

The MASTER TABLE

  This is a table created in the schema of the user who runs the data pump.
  It contains following details about the job:

-        current state of every object being imported/exported
-        locations of objects in the dumpfile set
-        status of every worker process
-        current set of dumpfiles
-        restart information etc

  During export, the master table is built as the first step of export job and written to the dump set as the last step.
  During import, the master table is loaded first into the schema from the dumpset


The PROCESSES

·         Client Process

The expdp and impdp sessions invoked by users are the client processes
A client is not required once a job is underway.
Multiple clients may attach/detach from a job as necessary for monitoring and control 

·         Shadow Process

A standard shadow (foreground) process is created when any client logs into Oracle database. Same is true for the Data Pump client processes.

It services Data Pump API requests.

In other words :
o   Upon receipt of DBMS_DATAPUMP.OPEN request, shadow process creates a job, which consists of creating Master table, Advanced Queuing (AQ) queues for communication among processes and master control process.
o   Once Data Pump job is running, main task of the shadow process is to service GET_STATUS requests from the client.

If client process detaches, shadow process also goes away.

·         Master Control Process (MCP)

This is the Captain of the Ship, and as there is only one captain for a ship, there is only one MCP for each Data Pump job.

It takes care of following:
o   job state
o   job description
o   restart information in master table
o   dumpfile info in master table
o   starting/stopping worker processes

Job is divided into various phases of metadata and data loading/unloading and each phase has associated worker processes.
MCP hands out work requests to worker processes appropriate for current phase. Majority of MCP processing is spent in this work dispatch loop.

MCP also has file management duties, maintaining active dumpfile list, handing out file pieces as requested by process unloading/loading data.

MCP process is of form: <instance>_DMnn_<pid>               

·         Worker Process

These are the petty officers and subordinates that actully do the work on Master's direction. (cruel world eh?)

Upon receipt of START_JOB request, MCP creates worker processes based on the PARALLEL parameter. Worker process performs tasks assigned by MCP (primarily loading/unloading metadata and data) and maintain object rows that make up bulk of master table.

As object rows are loaded/unloaded, status is updated in the master table, pending/completed/failed etc.

Workers also maintain TYPE COMPLETION ROWS which track the type of objects being worked on: tables/views/indexes etc. These rows are used during restart.

A worker process has name of form : *DWnn*

·         Parallel Query Process -

If External table data access method is chosen for loading/unloading table/partition, worker process creates parallel query processes which carry out actual load/unload. Worker process act as query coordinators. These are standard parallel query execution slaves.

In RAC, to take advantage of load sharing opportunity, the PQ processes may be created on a different instance and all other processes may operate on the initial instance.


Now let's move a little beyond the meta information and look at the bird's eye view of the actual work

The DATA MOVEMENT

There are four data movement methods
·         Data File Copying (Transportable Tablespace)
·         Direct Path Load
·         External Tables
·         Conventional Path

Data file copying has some limitations as certain data types cannot be transported.
Also, Character sets must be same at both the systems.
Some types of data, some types of tables can't be moved. e.g. encrypted data cannot be moved.

              ___________________________________________________
              EXPDP Conventional/Direct path loads write the dump
              in a binary stream that is at least 15% smaller
              than original EXP representation
              ___________________________________________________


The METADATA MOVEMENT

DBMS_METADATA package is used by worker processes.

EXP utility used to store object definitions like SQL DDL.
EXPDP on the other hand writes object definitions to dump files as XML docs.

XML might take more space but allows for more features (object's ownership, storage characteristics, tablespace etc. can be changed during import). COMPRESSED parameter might be added to reduce size.

This also allows more flexibility in remapping the data into different schemas, changing the object names during import etc.


The INTERPROCESS COMMUNICATION

AQ mechanism is used by Data Pump processes to communicate.

There are 2 queues:

1.       Command and Control Queue:
All processes except clients subscribe to this queue. All API commands, work requests and responses, file requests and log messages are processed by this queue.

2.       Status Queue:
Only MCP writes to this queue and only shadow process subscribes to this. Recieves work in progress and error message info.

The SECURITY

All the IO related to the data pump job is handled by the Oracle background server process.
Hence, the OS user doing the IO is 'oracle'.
This is a security risk as 'oracle' is a privileged account.

Hence, all directory specifications are made using Oracle directory objects, with the directory read/write grants established by the DBA.


The INITIALIZATION PARAMETERS

DISK_ASYNCH_IO should be set to TRUE to allow for asynchronous IO
DB_BLOCK_CHECKSUM should ideally be FALSE but impact is minimal (5%) otherwise

STREAMS_POOL_SIZE should be set sufficiently