DB2/6000 Parallel Edition
*************************

by Gilles Fecteau

This article describes IBM's plans for a parallel DB2/6000 database server using the shared-nothing hardware model and function shipping. It also shows how this approach provides a scalable database server that meshes with the IBM POWERParallel architecture. The DB2/6000 Parallel Edition will allow customers to grow databases by adding nodes, with near-linear performance improvement. Beta testing is underway, and availability has been announced for later in 1994.

Parallel processing is a way to support large UNIX-based applications. Parallel processing is the concurrent use of two or more processors working as a single unit. Most relational database management systems are being enhanced to support parallel processing to take advantage of low-cost workstations.

IBM has studied parallel technology since 1989. This work started with prototypes developed by IBM Research, which built a parallel database system using multiple PS/2 workstations and the OS/2 operating system. This prototype was demonstrated at COMDEX in 1990. Since then, IBM has been working on an architecture that allows DB2/6000 to run on several independently connected workstations managed as a single database by DB2/6000 parallel code. This product will operate on a range of hardware from LAN-connected RISC System/6000 systems to the IBM POWERParallel system.

Parallel query execution alone is not sufficient--large database users also need the ability to maintain very large databases. Achieving an effective parallel database means focusing on both database management utilities and efficient parallel queries.

Parallel Architectures
======================

There are several possible architectures that can exploit multiple processors, large memory, and many disks. To develop the best parallel system, IBM considered several. The most common architectures are:

o Shared-nothing : uses multiple processors, each with its own memory and disk storage.

o Shared-disks : uses multiple processors, each with its own memory, and shared disk storage.

o Symmetric Multiprocessors (SMP) : uses multiple processors that have common memory and share disk storage.

IBM is extending DB2/6000 to support the shared-nothing architecture, where each processor has its own memory and disks. The database manager (the component handling database requests from an application) uses a communication link to coordinate the work of the multiple processors. We selected this architecture for two reasons:

o Scalability : it can scale to hundreds of processors (ref: DeWitt and Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM, June 1992, Vol. 35, No. 6).

o Portability : since it only requires a communication link between processors, we can port the design to any platform that has communications. This is not true for SMP or shared-disk.

Although performance varies with the efficiency of the communication protocol, the high-speed switch in IBM POWERParallel systems (as well as many forms of UNIX sockets) ensures good performance. The DB2/6000 Parallel Edition will support Symmetric Multiprocessors; in a parallel database configuration, each node could be an SMP.

Shared-nothing

Figure 1 shows an example of a DB2 parallel database running on three RISC System/6000s supporting a network of clients. A data node is the disk storage plus a portion of the database engine that manages that disk storage.

In the DB2 parallel implementation, a database is stored across a network of processors that provide separate buffers, lock structures, logs, and disks for each processor. This prevents the cache contention that results when all processors share one set of resources--as with an SMP implementation. Because each processor has separate logs, applications are not limited to the I/O bandwidth of a single log and can perform recovery from failure in parallel.
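To make the per-node resource separation concrete, here is a small illustrative C sketch. The structure and field names are invented for this article (they are not DB2/6000 internals); the point is that every resource in the structure is private to one node, so nodes can interact only by exchanging messages.

   #include <stdio.h>

   #define N_NODES 3                /* e.g., three RISC System/6000s */

   /* Invented model of a shared-nothing data node: every resource
      below is private to the node; nothing is shared across nodes. */
   struct data_node {
       int  node_id;
       char buffer_pool[4096];      /* private cache: no cross-node cache contention */
       int  lock_table[64];         /* locks cover only this node's own data */
       char log_path[64];           /* separate log: no single-log I/O bottleneck,
                                       and recovery can proceed in parallel */
   };

   int main(void)
   {
       struct data_node node[N_NODES];

       for (int i = 0; i < N_NODES; i++) {
           node[i].node_id = i + 1;
           snprintf(node[i].log_path, sizeof node[i].log_path,
                    "/db2/node%d/logs", node[i].node_id);
           printf("node %d: private buffers, locks, and log at %s\n",
                  node[i].node_id, node[i].log_path);
       }
       return 0;
   }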
Function Shipping
=================

To minimize communication between processors, whenever possible, relational operators are executed on the processor containing the data. Therefore, a central lock manager is not required, as it would be with most shared-disk implementations. This process of sending the work to where the data is located is known as function shipping.

Let's assume an EMPLOYEE table is distributed over multiple processors, and the following SQL statement is executed:

   SELECT EMP_NO
   FROM EMPLOYEE
   WHERE SALARY > 100000

The database manager on the coordinating processor (the processor from which the statement was issued) issues a request to every other processor in the group to select its subset of rows that meet the condition of having a SALARY value over 100,000. Each processor in the group then returns its answer set to the coordinating processor for final processing. Because each processor does the initial extraction from its own part of the EMPLOYEE table, no central lock manager is needed. At regular intervals, a global deadlock detector analyzes the locks held to determine if a deadlock is present. It then selects a victim to resolve the deadlock. This avoids sending thousands of lock requests between systems--a quick deadlock detection takes place every few seconds.

A less efficient alternative to DB2's function shipping is I/O shipping, where one or more of the group's processors are arbitrarily selected to run a query. All processors must send their data for a given query to the executing processors. There is more data movement between processors (all data pages of the EMPLOYEE table must be sent) than with function shipping. With function shipping, only the qualifying rows (employees with salary over 100,000) are sent from the processor owning the data to the processor coordinating the query.

The system configuration for the I/O-shipping model differs from the function-shipping model:

o I/O shipping : nodes that specialize in I/O must be configured with large numbers of disks, and other nodes without disks handle user queries.

o Function shipping : the configuration must provide a moderate number of disks on each processor, and the data must be partitioned over many processors.

The number of disks required for either model is generally the same for large databases, from 50 gigabytes to terabytes. Databases smaller than 50 GB are not likely to use parallel technology, although I/O shipping may be less expensive for them.

DB2/6000 Parallel Edition does not limit the function-shipping model to queries only. It can also be used for database updates and for running utilities. When a row is to be inserted into a table, it is sent to the appropriate node, where it is inserted. (The appropriate node is determined by the hashing algorithm used for table partitioning.) The index entries are updated, and the row information is logged at the same node. Index maintenance, locking, and logging are thus distributed across processors.
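The flow of such a query can be pictured with a toy, single-process C simulation. The data and names are invented for illustration--a real system runs the slave tasks on separate machines over a communication link--but the shape is the same: each "node" sees only its own partition of EMPLOYEE, evaluates the predicate locally, and returns only qualifying EMP_NO values to the coordinator.

   #include <stdio.h>

   #define N_NODES 3

   struct employee { int emp_no; double salary; };

   /* Each node's local partition of the EMPLOYEE table (toy data). */
   static struct employee part0[] = { { 101,  90000 }, { 104, 120000 } };
   static struct employee part1[] = { { 102, 150000 }, { 105,  80000 } };
   static struct employee part2[] = { { 103, 101000 } };

   static struct employee *partition[N_NODES] = { part0, part1, part2 };
   static int part_rows[N_NODES] = { 2, 2, 1 };

   /* Function shipping: the scan and the predicate run at the node that
      owns the rows; only qualifying EMP_NO values travel back. */
   static int scan_local(int node, double min_salary, int *out)
   {
       int n = 0;
       for (int i = 0; i < part_rows[node]; i++)
           if (partition[node][i].salary > min_salary)
               out[n++] = partition[node][i].emp_no;
       return n;
   }

   int main(void)
   {
       int answer[16];
       int total = 0;

       /* Coordinator: ship the request to every node and merge the
          answer sets for final processing. */
       for (int node = 0; node < N_NODES; node++)
           total += scan_local(node, 100000.0, answer + total);

       for (int i = 0; i < total; i++)
           printf("EMP_NO = %d\n", answer[i]);
       return 0;
   }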
Data Placement
==============

In the context of large databases, data placement can be complex and system administration can be difficult. Therefore, appropriate Data Definition Language statements and administration utilities are needed to manage data partitioning. The DB2 parallel implementation is easier to administer than a nonpartitioned database of the same size. For example, without partitioning, some functions, like backup, would take too long. Tools are provided that efficiently manage the large tables that occur in large databases. The design of DB2 ensures that database design decisions are separated from load-balancing decisions.

Two important features of the DB2/6000 parallel database design that help in data partitioning and database administration are the partitioning key and nodegroups.

Partitioning key

For a database consisting of many tables, an application developer can define a partitioning key for each table. Frequently joined tables should be partitioned on their respective join columns. The partitioning key information is specified as part of the CREATE TABLE statement, as shown below.

   CREATE TABLE ACCOUNT
      (C_BRANCH       INTEGER,
       CUST_NO        INTEGER,
       CUSTOMER_NAME  VARCHAR(50),
       LAST_DATE      DATE,
       BALANCE        DEC(8,2),
       . . .)
      IN SMALLPOOL
      PARTITION BY HASHING(C_BRANCH)

   CREATE TABLE BRANCH
      (BRNO           INTEGER NOT NULL,
       ADDRESS        VARCHAR(200),
       . . .
       PRIMARY KEY(BRNO))
      IN SMALLPOOL
      PARTITION BY HASHING(BRNO)

The first statement specifies that the ACCOUNT table is partitioned on the C_BRANCH column, and the second specifies that the BRANCH table is partitioned on the BRNO column. More than one column can be specified as the partitioning key. A system-defined hashing function is applied to the partitioning key value to determine the processor where a particular row will reside.

The C_BRANCH and BRNO columns were chosen as the partitioning keys for these two tables because they best suit the application. The suitability of these columns as partitioning keys remains valid regardless of table size and the number of processors on which the tables are partitioned. Tables can be designed without designers having prior knowledge of the configuration of the parallel system and the load on the system. Tuning for load balancing can be done after database definition time.

In the example above, the processors on which the tables are partitioned are not specified directly. The CREATE TABLE statement has been extended to include an IN clause that provides this information. (A nodegroup is an arbitrary name given to a set of nodes that will be used for tables. Tables in the same nodegroup will be partitioned over the same processors.) The identifier SMALLPOOL, following the keyword IN, is the name of a nodegroup.

Nodegroup

In a shared-nothing hardware configuration (such as IBM's SPx family or LAN-connected workstations), each processor runs the equivalent of a single-node DB2/6000 database system. Thus, the database storage capabilities of each processor are the same as those provided by DB2/6000, including segmented table support that can implement tables of up to 64 GB. With the DB2/6000 Parallel Edition, however, the storage capability of a system with N nodes is N times that of a uniprocessor DB2/6000 system.

Database tables can be defined across a set of nodes by first defining a nodegroup and then creating tables in it. A nodegroup can be created as follows:

   CREATE NODEGROUP SMALLPOOL
      ON NODES (DBMACH1, DBMACH2, DBMACH6)

This statement defines a nodegroup called SMALLPOOL, consisting of three processors: DBMACH1, DBMACH2, and DBMACH6. Tables created in nodegroup SMALLPOOL are partitioned across these three processors.
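The mapping from a partitioning key value to a node can be sketched in a few lines of C. The hashing function below is an invented stand-in (the actual DB2 hashing function is system-defined and internal), but it shows the essential property: rows with the same key value always land on the same node.

   #include <stdio.h>

   /* The nodegroup SMALLPOOL from the example above, as node numbers
      (DBMACH1, DBMACH2, DBMACH6). */
   static int nodegroup[] = { 1, 2, 6 };
   #define NODES (sizeof nodegroup / sizeof nodegroup[0])

   /* Invented stand-in for the system-defined hashing function: map a
      partitioning key value (C_BRANCH or BRNO) to one node of the group. */
   static int node_for_key(int key)
   {
       return nodegroup[(unsigned)key % NODES];
   }

   int main(void)
   {
       int c_branch = 4711;

       /* An ACCOUNT row and the BRANCH row it joins with carry the same
          key value, so both hash to the same node--the property that
          makes co-located joins (described later) possible. */
       printf("ACCOUNT row with C_BRANCH=%d -> node %d\n",
              c_branch, node_for_key(c_branch));
       printf("BRANCH row with BRNO=%d -> node %d\n",
              c_branch, node_for_key(c_branch));
       return 0;
   }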
A nodegroup can contain one or more processors, and a processor can be a member of more than one nodegroup in the same database or across databases. As the size of tables or the number of processors in the system increases, the ALTER NODEGROUP statement can be used to add processors to an existing nodegroup. Once a processor is added to a nodegroup, data belonging to tables in the nodegroup can be redistributed using the rebalance utility.

The following characteristics make the DB2/6000 Parallel Edition easy to manage:

o The application view of a table is separate from the physical placement of data.

o Data can be distributed over multiple disks to reduce I/O bottlenecks.

o The number of processors can be increased as the workload or the size of the database grows.

Parallel Query Processing
=========================

The DB2/6000 Parallel Edition generates a parallel execution strategy for all SQL statements using a cost-based relational database optimizer. The cost-based optimizer compares several parallel execution strategies for each SQL statement and selects the most efficient one.

An SQL statement (such as SELECT, INSERT, UPDATE, or DELETE) is divided into a number of separate tasks. One of these tasks, the coordinator, runs at the node where the application connects. It fetches input data from the application and returns the answer set to the application. Subordinate tasks, called slave tasks, perform the bulk of the activity required for the query and cooperate with each other when necessary. While there can be only one instance of a coordinator task for each application, there may be multiple instances of each of the slave tasks.

The DB2/6000 Parallel Edition does not impose any new restrictions on SQL statements, thereby protecting the investment made in existing applications. The generation of a parallel execution strategy for a given SQL statement is automatic. A DB2/6000 application program does not have to be recompiled to exploit parallel execution. When the application program is bound to a parallel database, the appropriate parallel execution strategy is generated and, if required, stored.

The optimization of an SQL statement in a parallel DB2/6000 environment is based on two primary factors:

o The distribution of the data across nodes : DB2/6000 Parallel Edition supports data partitioning by hashing the values of a set of columns across a set of nodes.

o The cost of functions associated with different operations : The DB2/6000 optimizer has cost formulas for the different tasks that need to be done. New formulas are added to account for parallel operations and for the messages (rows) exchanged between nodes.

The database manager generates the optimal parallel plan, rather than simply executing the best serial plan in parallel. This optimization is performed by the database manager without the need for any external information. The repertoire of parallel strategies used by DB2/6000 Parallel Edition includes:

o For tables: parallel table scans and parallel index scans

o For joins: co-located, hash-redistributed, or broadcast joins

o Local and global aggregates, including the GROUP BY clause

The full power of parallel processing is also used for other SQL constructs, such as subqueries, set operations (union/difference/intersect), and UPDATE, INSERT, and DELETE operations.
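To give a feel for the cost-based choice among parallel strategies, consider the deliberately crude C sketch below. Its "cost" counts only rows shipped between nodes, and the numbers and formulas are invented for illustration--the real optimizer's formulas also model I/O, CPU, and message costs in detail. The join strategies it chooses among (co-located, redirected, broadcast, redistributed) are described in the examples that follow.

   #include <stdio.h>

   /* Invented, drastically simplified cost model: price each viable
      parallel join strategy and keep the cheapest. */
   struct strategy { const char *name; double msg_rows; };

   int main(void)
   {
       double acct_rows = 1.0e6, branch_rows = 1.0e3;
       int nodes = 16;
       int acct_colocated = 0, branch_colocated = 0; /* partitioned on join cols? */

       struct strategy s[4];
       int n = 0;

       if (acct_colocated && branch_colocated)      /* no data movement needed */
           s[n++] = (struct strategy){ "co-located", 0.0 };
       if (branch_colocated)                        /* ship only ACCOUNT rows */
           s[n++] = (struct strategy){ "redirected", acct_rows };
       s[n++] = (struct strategy){ "broadcast", branch_rows * nodes };
       s[n++] = (struct strategy){ "redistributed", acct_rows + branch_rows };

       int best = 0;
       for (int i = 1; i < n; i++)
           if (s[i].msg_rows < s[best].msg_rows)
               best = i;

       printf("chosen strategy: %s (%.0f rows exchanged)\n",
              s[best].name, s[best].msg_rows);
       return 0;
   }

With the invented numbers above, broadcasting the small BRANCH table (16,000 rows shipped) beats redistributing both tables (over a million rows shipped), which is the kind of trade-off the optimizer weighs.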
Example of parallel queries

Using a database consisting of the ACCOUNT and BRANCH tables, this section describes three parallel execution strategies.

Parallel Relation Scan : The following query is divided into two task types: the coordinator task, which returns the answer set to the application, and slave tasks, which, on a given partition of the ACCOUNT table, select the matching rows and pipe them to the coordinator. The slave task has instances on all the nodes where the ACCOUNT table resides.

   SELECT CUSTOMER_NAME
   FROM ACCOUNT
   WHERE BALANCE > 20000

The execution snapshot for this query is shown in Figure 3. The data predicates are evaluated as soon as possible, at the nodes where the data resides, minimizing the amount of data exchanged using messages. Expressions and scalar functions are also "pushed down" and evaluated as soon as possible.

Parallel Aggregation, including the GROUP BY clause : This example shows how the DB2/6000 Parallel Edition forces aggregation tasks to be executed at the site of the data, minimizing the sequential work at the coordinator task.

   SELECT C_BRANCH, COUNT(*)
   FROM ACCOUNT
   WHERE BALANCE > 20000
   GROUP BY C_BRANCH

The execution strategy involves two task types:

o The slave tasks select matching rows on a given partition, group them according to the C_BRANCH value, perform local aggregation (that is, a local count for each group), and return the grouping column and the local count to the coordinator task.

o The coordinator task performs the global aggregation (that is, a global count from the local count values) by merging the grouping values and local counts returned from the different slave tasks, and returns the answer to the application.
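A toy C version of this two-level counting makes the local/global split concrete. The data and names are invented, and real slave tasks run on separate nodes and communicate by messages; here each "node" simply counts its own partition, and the coordinator merges the local counts.

   #include <stdio.h>
   #include <string.h>

   #define N_NODES   3
   #define MAX_GROUP 8   /* toy limit on distinct C_BRANCH values */

   /* Toy partitions: each int is the C_BRANCH value of one row that
      satisfied BALANCE > 20000 on that node. */
   static int part0[] = { 1, 2, 1 };
   static int part1[] = { 2, 3 };
   static int part2[] = { 1, 3, 3 };
   static int *part[N_NODES] = { part0, part1, part2 };
   static int rows[N_NODES] = { 3, 2, 3 };

   /* Slave task: local aggregation at the node that owns the rows. */
   static void local_count(int node, long count[MAX_GROUP])
   {
       memset(count, 0, MAX_GROUP * sizeof count[0]);
       for (int i = 0; i < rows[node]; i++)
           count[part[node][i]]++;
   }

   int main(void)
   {
       long global[MAX_GROUP] = { 0 }, local[MAX_GROUP];

       /* Coordinator: merge the per-node counts into global counts. */
       for (int node = 0; node < N_NODES; node++) {
           local_count(node, local);
           for (int g = 0; g < MAX_GROUP; g++)
               global[g] += local[g];
       }
       for (int g = 0; g < MAX_GROUP; g++)
           if (global[g])
               printf("C_BRANCH %d: COUNT(*) = %ld\n", g, global[g]);
       return 0;
   }

Only one small (group, count) pair per group travels from each slave task to the coordinator, rather than every qualifying row--the same reason the real strategy minimizes sequential work at the coordinator.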
Co-located Joins : In the following example, both the ACCOUNT and BRANCH tables are partitioned on their joining columns.

   SELECT BRANCH_NAME, CUSTOMER_NAME
   FROM ACCOUNT, BRANCH
   WHERE ACCOUNT.C_BRANCH = BRANCH.BRNO
   AND ACCOUNT.LAST_DATE > CURRENT DATE - 2 YEARS

One possible execution strategy consists of the following tasks:

o The slave task scans its partition of the ACCOUNT table, applying the predicate, and then joins the resulting relation with its partition of the BRANCH table. The slave task then sends the joined rows to the coordinator task.

o The coordinator task merges the results and sends the answer set to the application.

Typical decision-support applications include SQL queries that join relations in a usually predictable way. In those cases, it may be beneficial to partition the relations on their joining columns. This may reduce the number of data exchange messages required for join processing.

Redirected Join : If the ACCOUNT table is partitioned using the CUST_NO attribute rather than the C_BRANCH attribute, a co-located join strategy will generate incorrect results for the SQL query shown above, because a row of the ACCOUNT table may join with a row of the BRANCH table that resides on a different partition. In this case, DB2/6000 could generate a redirected join strategy by hashing each selected ACCOUNT table row on its C_BRANCH value and redirecting the row to the location of the corresponding BRANCH table rows.

Broadcast Join : DB2/6000 Parallel Edition also uses a broadcast join, where the selected rows of the ACCOUNT or BRANCH table are broadcast to the nodes containing the partitions of the other table. This strategy is useful for an index-based join or when one of the joining tables is quite small.

Redistributed Join : Where neither the BRANCH nor the ACCOUNT table is partitioned on the joining columns, a redistributed join strategy is used:

o ACCOUNT table rows are redirected by hashing on the C_BRANCH attribute.

o BRANCH table rows are redirected by hashing on the BRNO attribute.

o The redistributed partitions of the ACCOUNT and BRANCH tables can then be joined locally.

One or more of the above join strategies may be viable, based on the data partitioning. The optimizer selects the strategy that minimizes the overall execution cost. The optimization strategy extends elegantly when more than two relations are being joined. For example, a multi-join query execution strategy might involve a mix of co-located, redirected, broadcast, and redistributed joins.

The parallel execution strategies in DB2/6000 Parallel Edition are executed asynchronously. Coordinator involvement is restricted to initializing slave tasks and to collecting final result sets. Specifically, in a multi-join query, there is no coordinator involvement between joins--all processing is data-flow driven between slave tasks.

Parallel utilities
==================

The DB2/6000 Parallel Edition provides linear scaleup and speedup for all utilities. Two important utilities that help in managing a database are the Data Load and Rebalance utilities.

Data Load

Data can be loaded into database tables using the DB2/6000 import utility or by using fast loading utilities such as Bridge Fastload or the DB2 optional fast-load program.

The import utility loads data by performing INSERT operations. DB2/6000 provides an enhanced INSERT in which rows are batched before being sent to the destination processor. With the DB2/6000 INSERT, rows are batched by their destination processor, and each batch is sent to its processor when the buffer is full. This new INSERT is available to most application programs that use SQL INSERT and coordinate restart points using COMMIT. It is implemented as an option at bind time, and most programs can use it without any modification.

Another option for loading data is to directly create database files without going through the INSERT mechanism. In the parallel database, the best approach is to partition the input data file based on the defined partitioning key values, thereby creating one file for each processor where the table will be stored. Next, each partition is loaded in parallel across the processors. An API facilitates this process: given a row's partitioning key value and its nodegroup name, it determines the processor on which the row should be stored. This API can be used to partition a data file into several files, one per processor, as sketched below.
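The following C program sketches that pre-load partitioning step. The function node_for_row() and the output file names are invented stand-ins for illustration (the real API also takes the nodegroup name, and the actual hashing function is internal to DB2); the program splits an input file, one row per line with the partitioning key as the leading column, into one file per node.

   #include <stdio.h>
   #include <stdlib.h>

   #define N_NODES 3

   /* Invented stand-in for the partitioning API described above: given
      a row's partitioning key value, return the node where it belongs. */
   static int node_for_row(int partition_key)
   {
       return (unsigned)partition_key % N_NODES;
   }

   int main(void)
   {
       FILE *out[N_NODES];
       char  line[512];
       char  name[32];

       /* One output file per node in the nodegroup. */
       for (int i = 0; i < N_NODES; i++) {
           snprintf(name, sizeof name, "account.node%d", i);
           out[i] = fopen(name, "w");
           if (!out[i]) { perror(name); return 1; }
       }

       /* Input rows arrive on stdin as "C_BRANCH,rest-of-row" lines;
          route each line to the file for its destination node. */
       while (fgets(line, sizeof line, stdin)) {
           int key = atoi(line);            /* leading column is the key */
           fputs(line, out[node_for_row(key)]);
       }

       for (int i = 0; i < N_NODES; i++)
           fclose(out[i]);
       return 0;    /* each output file can now be loaded in parallel */
   }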
The Rebalance Utility

A system-defined hashing function is applied to the partitioning key value of a row in a table to determine the processor on which the row is stored. The processor selected is one of the processors in the nodegroup where the table was created. A uniform hashing function distributes data evenly across the set of processors in the nodegroup.

Data may have to be moved between processors when a new processor is added to a nodegroup or when the data distribution across processors is not uniform (because of skew in the data values in the partitioning key column). The rebalance utility can be used to achieve a uniform distribution of data. The rebalance utility is specified at the nodegroup level; rebalancing a nodegroup results in the rebalancing of all tables in that nodegroup. Rebalance is an online operation--neither the database system nor the database needs to be shut down. Database administrators can specify what data should be moved and where it should be moved.

For example, suppose a nodegroup consists of 10 processors (numbered 1 to 10), and two processors, numbered 15 and 16, are added. The rebalance utility can move some data to take advantage of the new processors: data can be moved from processors 1-5 to processor 15, and from processors 6-10 to processor 16, to achieve a uniform distribution of data across all processors in the nodegroup.

In another example, suppose a nodegroup consists of four processors numbered 1 to 4. Processor 1 has 40% of the rows, and the other three each have 20%. The rebalance utility can be used to move approximately 15% of the rows from processor 1 and send about 5% each to the other three.

It is possible to break up a large rebalancing operation into smaller steps by rebalancing only part of the data each time. For example, if 50% of the rows need to be moved from one processor to another, the move could be done in five separate rebalancing operations, each moving 10% of the data. Thus, if large tables need to be rebalanced, the work can be spread over several intervals of low activity.
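The arithmetic behind the four-processor example above can be sketched in a few lines of C (the row counts are invented to match that example; this is a planning calculation, not the rebalance utility itself):

   #include <stdio.h>

   #define NODES 4

   int main(void)
   {
       /* Processor 1 holds 40% of the rows; the others hold 20% each. */
       long rows[NODES] = { 400000, 200000, 200000, 200000 };
       long total = 0;

       for (int i = 0; i < NODES; i++)
           total += rows[i];

       long target = total / NODES;        /* uniform share: 25% each */

       for (int i = 0; i < NODES; i++) {
           long excess = rows[i] - target;
           if (excess > 0)
               printf("processor %d ships %ld rows (%.0f%% of the table)\n",
                      i + 1, excess, 100.0 * excess / total);
           else if (excess < 0)
               printf("processor %d receives %ld rows (%.0f%% of the table)\n",
                      i + 1, -excess, 100.0 * -excess / total);
       }
       return 0;
   }

Run as written, the sketch reports that processor 1 ships 15% of the table and each of the other three processors receives 5%, matching the example above.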
Other Utilities

This section describes how several other functions and utilities are executed in parallel.

Index creation : DB2/6000's parallel capabilities support the creation of unique indexes (where the key columns must include the partitioning key columns) as well as non-unique indexes. Index creation is performed in parallel across processors: a local index is created for each partition of the table stored on a processor.

Reclustering utility (REORG) : The REORG utility reclusters the rows of a table on disk. Clustering each individual table partition at a processor is sufficient to exploit clustered index scan performance. The REORG operation executes in parallel across all processors and can provide linear speedup with the number of processors.

Backup/Restore : The DB2/6000 backup utility is used at each processor to back up the segment of the database that is resident at that processor. The backup image includes the processor number. All processors do not have to be backed up at the same time. Backups can be taken in parallel across processors, but this method requires sufficient backup resources, such as tape drives, at each processor. All logs must be available at each processor to recover the entire database to a consistent point in time. DB2/6000 parallel functions interface with the IBM ADSM product, which manages backups from multiple RISC System/6000 machines. The ADSM interface keeps track of backups and their associated logs; thus, ADSM enhances the manageability of DB2/6000 in parallel environments.

Forward recovery : Except for the resolution of the final in-doubt transactions, all processors can perform forward recovery in parallel by re-applying log entries after a system failure or a restore from backup. DB2/6000 ensures that all nodes are recovered to the latest log entries.

Parallel transaction processing
===============================

DB2/6000 is a full-function parallel relational database management system. It is not limited to queries and can be used to extend the capacity of an online transaction workload beyond a single RISC System/6000. Combined with IBM's HACMP/6000, the DB2/6000 Parallel Edition provides concurrent access to a database from all processors in an HACMP/6000 cluster of RISC System/6000s. DB2/6000's function-shipping implementation is not limited to a single HACMP cluster--multiple clusters can be joined to support larger databases or higher throughput.

Conclusion
==========

The IBM DB2/6000 parallel database server can manipulate large amounts of data efficiently by partitioning data over several nodes and by executing queries in parallel. It provides the following advantages:

o Cost-based optimization

o The best parallel processing strategies

o Efficient, asynchronous execution of subtasks

DB2/6000 also provides efficient transaction processing capabilities and a suite of parallel utilities for database management.

About the author
++++++++++++++++

Gilles Fecteau , IBM Software Solutions Laboratory, 1150 Eglinton Avenue East, North York, Ontario M3C 1H7, Canada. Internet: gfecteau@vnet.ibm.com. Mr. Fecteau is one of the key designers of the parallel version of the DB2 workstation products (DB2/2 and DB2/6000). He has been involved in several advanced technology efforts to bring parallel databases to market. Mr. Fecteau has a Bachelor of Engineering Science degree from Laval University in Quebec City, Quebec.

Copyright and TradeMarks
++++++++++++++++++++++++