DB2/6000 Parallel Edition
*************************

by Gilles Fecteau

This article describes IBM's plans for a parallel DB2/6000 database server using the shared-nothing hardware model and function shipping. It also shows how this approach provides a scalable database server that meshes with the IBM POWERParallel architecture. The DB2/6000 Parallel Edition will allow customers to grow databases by adding nodes, with near-linear performance improvement. Beta testing is underway, and availability has been announced for later in 1994.

Parallel processing is a way to support large UNIX-based applications. Parallel processing is the concurrent use of two or more processors working as a single unit. Most relational database management systems are being enhanced to support parallel processing to take advantage of low-cost workstations.

IBM has studied parallel technology since 1989. This work started with prototypes developed by IBM Research, which built a parallel database system using multiple PS/2 workstations and the OS/2 operating system. This prototype was demonstrated at COMDEX in 1990. Since then, IBM has been working on an architecture that allows DB2/6000 to run on several independently connected workstations managed as a single database by DB2/6000 parallel code. This product will operate on a range of hardware from LAN-connected RISC System/6000 systems to the IBM POWERParallel system.

Parallel query execution alone is not sufficient--large database users also need the ability to maintain very large databases. Achieving an effective parallel database means focusing on both database management utilities and efficient parallel queries.

Parallel Architectures
======================

There are several possible architectures that can exploit multiple processors, large memory, and many disks. To develop the best parallel system, IBM considered several. The most common architectures are:

o Shared-nothing : uses multiple processors, each with its own memory and disk storage.

o Shared-disks : uses multiple processors, each with its own memory, and shared disk storage.

o Symmetric Multiprocessors (SMP) : uses multiple processors that have common memory and share disk storage.

IBM is extending DB2/6000 to support the shared-nothing architecture, where each processor has its own memory and disks. The database manager (the component handling database requests from an application) uses a communication link to coordinate the work of the multiple processors. We selected this architecture for two reasons:

o Scalability : it can scale to hundreds of processors (ref: DeWitt and Gray, "Parallel Database Systems: The Future of High Performance Database Systems," Communications of the ACM, June 1992, Vol. 35, No. 6).

o Portability : since it only requires a communication link between processors, we can port the design to any platform that has communications. This is not true for SMP or shared-disk.

Although performance varies with the efficiency of the communication protocol, the high-speed switch in IBM POWERParallel systems (as well as many forms of UNIX sockets) ensures good performance. The DB2/6000 Parallel Edition will support Symmetric Multiprocessors; in a parallel database configuration, each node could be an SMP.

Shared-nothing

Figure 1 shows an example of a DB2 parallel database running on three RISC System/6000s supporting a network of clients. A data node is the disk storage plus a portion of the database engine that manages that disk storage.

In the DB2 parallel implementation, a database is stored across a network of processors that provide separate buffers, lock structures, logs, and disks for each processor. This prevents the cache contention that results when all processors share one set of resources--as with an SMP implementation. Because each processor has separate logs, applications are not limited to the I/O bandwidth of a single log and can perform recovery from failure in parallel.
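To make the per-node resource separation concrete, here is a small illustrative C sketch. The structure and field names are invented for this article (they are not DB2/6000 internals); the point is that every resource in the structure is private to one node, so nodes can interact only by exchanging messages.

   #include <stdio.h>

   #define N_NODES 3                /* e.g., three RISC System/6000s */

   /* Invented model of a shared-nothing data node: every resource
      below is private to the node; nothing is shared across nodes. */
   struct data_node {
       int  node_id;
       char buffer_pool[4096];      /* private cache: no cross-node cache contention */
       int  lock_table[64];         /* locks cover only this node's own data */
       char log_path[64];           /* separate log: no single-log I/O bottleneck,
                                       and recovery can proceed in parallel */
   };

   int main(void)
   {
       struct data_node node[N_NODES];

       for (int i = 0; i < N_NODES; i++) {
           node[i].node_id = i + 1;
           snprintf(node[i].log_path, sizeof node[i].log_path,
                    "/db2/node%d/logs", node[i].node_id);
           printf("node %d: private buffers, locks, and log at %s\n",
                  node[i].node_id, node[i].log_path);
       }
       return 0;
   }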
Function Shipping
=================

To minimize communication between processors, whenever possible, relational operators are executed on the processor containing the data. Therefore, a central lock manager is not required, as it would be with most shared-disk implementations. This process of sending the work to where the data is located is known as function shipping.

Let's assume an EMPLOYEE table is distributed over multiple processors, and the following SQL statement is executed:

   SELECT EMP_NO
   FROM EMPLOYEE
   WHERE SALARY > 100000

The database manager on the coordinating processor (the processor from which the statement was issued) issues a request to every other processor in the group to select its subset of rows that meet the condition of having a SALARY value over 100,000. Each processor in the group then returns its answer set to the coordinating processor for final processing. Because each processor does the initial extraction from its own part of the EMPLOYEE table, no central lock manager is needed. At regular intervals, a global deadlock detector analyzes the locks held to determine if a deadlock is present. It then selects a victim to resolve the deadlock. This avoids sending thousands of lock requests between systems--a quick deadlock detection takes place every few seconds.

A less efficient alternative to DB2's function shipping is I/O shipping, where one or more of the group's processors are arbitrarily selected to run a query. All processors must send their data for a given query to the executing processors. There is more data movement between processors (all data pages of the EMPLOYEE table must be sent) than with function shipping. With function shipping, only the qualifying rows (employees with salary over 100,000) are sent from the processor owning the data to the processor coordinating the query.

The system configuration for the I/O-shipping model differs from the function-shipping model:

o I/O shipping : nodes that specialize in I/O must be configured with large numbers of disks, and other nodes without disks handle user queries.

o Function shipping : the configuration must provide a moderate number of disks on each processor, and the data must be partitioned over many processors.

The number of disks required for either model is generally the same for large databases, from 50 gigabytes to terabytes. Databases smaller than 50 GB are not likely to use parallel technology, although I/O shipping may be less expensive for them.

DB2/6000 Parallel Edition does not limit the function-shipping model to queries only. It can also be used for database updates and for running utilities. When a row is to be inserted into a table, it is sent to the appropriate node, where it is inserted. (The appropriate node is determined by the hashing algorithm used for table partitioning.) The index entries are updated, and the row information is logged at the same node. Index maintenance, locking, and logging are thus distributed across processors.
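The flow of such a query can be pictured with a toy, single-process C simulation. The data and names are invented for illustration--a real system runs the slave tasks on separate machines over a communication link--but the shape is the same: each "node" sees only its own partition of EMPLOYEE, evaluates the predicate locally, and returns only qualifying EMP_NO values to the coordinator.

   #include <stdio.h>

   #define N_NODES 3

   struct employee { int emp_no; double salary; };

   /* Each node's local partition of the EMPLOYEE table (toy data). */
   static struct employee part0[] = { { 101,  90000 }, { 104, 120000 } };
   static struct employee part1[] = { { 102, 150000 }, { 105,  80000 } };
   static struct employee part2[] = { { 103, 101000 } };

   static struct employee *partition[N_NODES] = { part0, part1, part2 };
   static int part_rows[N_NODES] = { 2, 2, 1 };

   /* Function shipping: the scan and the predicate run at the node that
      owns the rows; only qualifying EMP_NO values travel back. */
   static int scan_local(int node, double min_salary, int *out)
   {
       int n = 0;
       for (int i = 0; i < part_rows[node]; i++)
           if (partition[node][i].salary > min_salary)
               out[n++] = partition[node][i].emp_no;
       return n;
   }

   int main(void)
   {
       int answer[16];
       int total = 0;

       /* Coordinator: ship the request to every node and merge the
          answer sets for final processing. */
       for (int node = 0; node < N_NODES; node++)
           total += scan_local(node, 100000.0, answer + total);

       for (int i = 0; i < total; i++)
           printf("EMP_NO = %d\n", answer[i]);
       return 0;
   }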
Data Placement
==============

In the context of large databases, data placement can be complex and system administration can be difficult. Therefore, appropriate Data Definition Language statements and administration utilities are needed to manage data partitioning. The DB2 parallel implementation is easier to administer than a nonpartitioned database of the same size. For example, without partitioning, some functions, like backup, would take too long. Tools are provided that efficiently manage the large tables that occur in large databases. The design of DB2 ensures that database design decisions are separated from load-balancing decisions.

Two important features of the DB2/6000 parallel database design that help in data partitioning and database administration are the partitioning key and nodegroups.

Partitioning key

For a database consisting of many tables, an application developer can define a partitioning key for each table. Frequently joined tables should be partitioned on their respective join columns. The partitioning key information is specified as part of the CREATE TABLE statement, as shown below.

   CREATE TABLE ACCOUNT
      (C_BRANCH       INTEGER,
       CUST_NO        INTEGER,
       CUSTOMER_NAME  VARCHAR(50),
       LAST_DATE      DATE,
       BALANCE        DEC(8,2),
       . . .)
      IN SMALLPOOL
      PARTITION BY HASHING(C_BRANCH)

   CREATE TABLE BRANCH
      (BRNO           INTEGER NOT NULL,
       ADDRESS        VARCHAR(200),
       . . .
       PRIMARY KEY(BRNO))
      IN SMALLPOOL
      PARTITION BY HASHING(BRNO)

The first statement specifies that the ACCOUNT table is partitioned on the C_BRANCH column, and the second specifies that the BRANCH table is partitioned on the BRNO column. More than one column can be specified as the partitioning key. A system-defined hashing function is applied to the partitioning key value to determine the processor where a particular row will reside.

The C_BRANCH and BRNO columns were chosen as the partitioning keys for these two tables because they best suit the application. The suitability of these columns as partitioning keys remains valid regardless of table size and the number of processors on which the tables are partitioned. Tables can be designed without designers having prior knowledge of the configuration of the parallel system and the load on the system. Tuning for load balancing can be done after database definition time.

In the example above, the processors on which the tables are partitioned are not specified directly. The CREATE TABLE statement has been extended to include an IN clause that provides this information. (A nodegroup is an arbitrary name given to a set of nodes that will be used for tables. Tables in the same nodegroup will be partitioned over the same processors.) The identifier SMALLPOOL, following the keyword IN, is the name of a nodegroup.

Nodegroup

In a shared-nothing hardware configuration (such as IBM's SPx family or LAN-connected workstations), each processor runs the equivalent of a single-node DB2/6000 database system. Thus, the database storage capabilities of each processor are the same as those provided by DB2/6000, including segmented table support that can implement tables of up to 64 GB. With the DB2/6000 Parallel Edition, however, the storage capability of a system with N nodes is N times that of a uniprocessor DB2/6000 system.

Database tables can be defined across a set of nodes by first defining a nodegroup and then creating tables in it. A nodegroup can be created as follows:

   CREATE NODEGROUP SMALLPOOL
      ON NODES (DBMACH1, DBMACH2, DBMACH6)

This statement defines a nodegroup called SMALLPOOL, consisting of three processors: DBMACH1, DBMACH2, and DBMACH6. Tables created in nodegroup SMALLPOOL are partitioned across these three processors.
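The mapping from a partitioning key value to a node can be sketched in a few lines of C. The hashing function below is an invented stand-in (the actual DB2 hashing function is system-defined and internal), but it shows the essential property: rows with the same key value always land on the same node.

   #include <stdio.h>

   /* The nodegroup SMALLPOOL from the example above, as node numbers
      (DBMACH1, DBMACH2, DBMACH6). */
   static int nodegroup[] = { 1, 2, 6 };
   #define NODES (sizeof nodegroup / sizeof nodegroup[0])

   /* Invented stand-in for the system-defined hashing function: map a
      partitioning key value (C_BRANCH or BRNO) to one node of the group. */
   static int node_for_key(int key)
   {
       return nodegroup[(unsigned)key % NODES];
   }

   int main(void)
   {
       int c_branch = 4711;

       /* An ACCOUNT row and the BRANCH row it joins with carry the same
          key value, so both hash to the same node--the property that
          makes co-located joins (described later) possible. */
       printf("ACCOUNT row with C_BRANCH=%d -> node %d\n",
              c_branch, node_for_key(c_branch));
       printf("BRANCH row with BRNO=%d -> node %d\n",
              c_branch, node_for_key(c_branch));
       return 0;
   }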
A nodegroup can contain one or more processors, and a processor can be a member of more than one nodegroup in the same database or across databases. As the size of tables or the number of processors in the system increases, the ALTER NODEGROUP statement can be used to add processors to an existing nodegroup. Once a processor is added to a nodegroup, data belonging to tables in the nodegroup can be redistributed using the rebalance utility.

The following characteristics make the DB2/6000 Parallel Edition easy to manage:

o The application view of a table is separate from the physical placement of data.

o Data can be distributed over multiple disks to reduce I/O bottlenecks.

o The number of processors can be increased as the workload or the size of the database grows.

Parallel Query Processing
=========================

The DB2/6000 Parallel Edition generates a parallel execution strategy for all SQL statements using a cost-based relational database optimizer. The cost-based optimizer compares several parallel execution strategies for each SQL statement and selects the most efficient one.

An SQL statement (such as SELECT, INSERT, UPDATE, or DELETE) is divided into a number of separate tasks. One of these tasks, the coordinator, runs at the node where the application connects. It fetches input data from the application and returns the answer set to the application. Subordinate tasks, called slave tasks, perform the bulk of the activity required for the query and cooperate with each other when necessary. While there can be only one instance of a coordinator task for each application, there may be multiple instances of each of the slave tasks.

The DB2/6000 Parallel Edition does not impose any new restrictions on SQL statements, thereby protecting the investment made in existing applications. The generation of a parallel execution strategy for a given SQL statement is automatic. A DB2/6000 application program does not have to be recompiled to exploit parallel execution. When the application program is bound to a parallel database, the appropriate parallel execution strategy is generated and, if required, stored.

The optimization of an SQL statement in a parallel DB2/6000 environment is based on two primary factors:

o The distribution of the data across nodes : DB2/6000 Parallel Edition supports data partitioning by hashing the values of a set of columns across a set of nodes.

o The cost of functions associated with different operations : The DB2/6000 optimizer has cost formulas for the different tasks that need to be done. New formulas are added to account for parallel operations and for the messages (rows) exchanged between nodes.

The database manager generates the optimal parallel plan, rather than simply executing the best serial plan in parallel. This optimization is performed by the database manager without the need for any external information. The repertoire of parallel strategies used by DB2/6000 Parallel Edition includes:

o For tables: parallel table scans and parallel index scans

o For joins: co-located, hash-redistributed, or broadcast joins

o Local and global aggregates, including the GROUP BY clause

The full power of parallel processing is also used for other SQL constructs, such as subqueries, set operations (union/difference/intersect), and UPDATE, INSERT, and DELETE operations.
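To give a feel for the cost-based choice among parallel strategies, consider the deliberately crude C sketch below. Its "cost" counts only rows shipped between nodes, and the numbers and formulas are invented for illustration--the real optimizer's formulas also model I/O, CPU, and message costs in detail. The join strategies it chooses among (co-located, redirected, broadcast, redistributed) are described in the examples that follow.

   #include <stdio.h>

   /* Invented, drastically simplified cost model: price each viable
      parallel join strategy and keep the cheapest. */
   struct strategy { const char *name; double msg_rows; };

   int main(void)
   {
       double acct_rows = 1.0e6, branch_rows = 1.0e3;
       int nodes = 16;
       int acct_colocated = 0, branch_colocated = 0; /* partitioned on join cols? */

       struct strategy s[4];
       int n = 0;

       if (acct_colocated && branch_colocated)      /* no data movement needed */
           s[n++] = (struct strategy){ "co-located", 0.0 };
       if (branch_colocated)                        /* ship only ACCOUNT rows */
           s[n++] = (struct strategy){ "redirected", acct_rows };
       s[n++] = (struct strategy){ "broadcast", branch_rows * nodes };
       s[n++] = (struct strategy){ "redistributed", acct_rows + branch_rows };

       int best = 0;
       for (int i = 1; i < n; i++)
           if (s[i].msg_rows < s[best].msg_rows)
               best = i;

       printf("chosen strategy: %s (%.0f rows exchanged)\n",
              s[best].name, s[best].msg_rows);
       return 0;
   }

With the invented numbers above, broadcasting the small BRANCH table (16,000 rows shipped) beats redistributing both tables (over a million rows shipped), which is the kind of trade-off the optimizer weighs.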
Example of parallel queries

Using a database consisting of the ACCOUNT and BRANCH tables, this section describes three parallel execution strategies.

Parallel Relation Scan : The following query is divided into two task types: the coordinator task, which returns the answer set to the application, and slave tasks, which, on a given partition of the ACCOUNT table, select the matching rows and pipe them to the coordinator. The slave task has instances on all the nodes where the ACCOUNT table resides.

   SELECT CUSTOMER_NAME
   FROM ACCOUNT
   WHERE BALANCE > 20000

The execution snapshot for this query is shown in Figure 3. The data predicates are evaluated as soon as possible, at the nodes where the data resides, minimizing the amount of data exchanged using messages. Expressions and scalar functions are also "pushed down" and evaluated as soon as possible.

Parallel Aggregation, including the GROUP BY clause : This example shows how the DB2/6000 Parallel Edition forces aggregation tasks to be executed at the site of the data, minimizing the sequential work at the coordinator task.

   SELECT C_BRANCH, COUNT(*)
   FROM ACCOUNT
   WHERE BALANCE > 20000
   GROUP BY C_BRANCH

The execution strategy involves two task types:

o The slave tasks select matching rows on a given partition, group them according to the C_BRANCH value, perform local aggregation (that is, a local count for each group), and return the grouping column and the local count to the coordinator task.

o The coordinator task performs the global aggregation (that is, a global count from the local count values) by merging the grouping values and local counts returned from the different slave tasks, and returns the answer to the application.
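A toy C version of this two-level counting makes the local/global split concrete. The data and names are invented, and real slave tasks run on separate nodes and communicate by messages; here each "node" simply counts its own partition, and the coordinator merges the local counts.

   #include <stdio.h>
   #include <string.h>

   #define N_NODES   3
   #define MAX_GROUP 8   /* toy limit on distinct C_BRANCH values */

   /* Toy partitions: each int is the C_BRANCH value of one row that
      satisfied BALANCE > 20000 on that node. */
   static int part0[] = { 1, 2, 1 };
   static int part1[] = { 2, 3 };
   static int part2[] = { 1, 3, 3 };
   static int *part[N_NODES] = { part0, part1, part2 };
   static int rows[N_NODES] = { 3, 2, 3 };

   /* Slave task: local aggregation at the node that owns the rows. */
   static void local_count(int node, long count[MAX_GROUP])
   {
       memset(count, 0, MAX_GROUP * sizeof count[0]);
       for (int i = 0; i < rows[node]; i++)
           count[part[node][i]]++;
   }

   int main(void)
   {
       long global[MAX_GROUP] = { 0 }, local[MAX_GROUP];

       /* Coordinator: merge the per-node counts into global counts. */
       for (int node = 0; node < N_NODES; node++) {
           local_count(node, local);
           for (int g = 0; g < MAX_GROUP; g++)
               global[g] += local[g];
       }
       for (int g = 0; g < MAX_GROUP; g++)
           if (global[g])
               printf("C_BRANCH %d: COUNT(*) = %ld\n", g, global[g]);
       return 0;
   }

Only one small (group, count) pair per group travels from each slave task to the coordinator, rather than every qualifying row--the same reason the real strategy minimizes sequential work at the coordinator.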
Co-located Joins : In the following example, both the ACCOUNT and BRANCH tables are partitioned on their joining columns.

   SELECT BRANCH_NAME, CUSTOMER_NAME
   FROM ACCOUNT, BRANCH
   WHERE ACCOUNT.C_BRANCH = BRANCH.BRNO
   AND ACCOUNT.LAST_DATE > CURRENT DATE - 2 YEARS

One possible execution strategy consists of the following tasks:

o The slave task scans its partition of the ACCOUNT table, applying the predicate, and then joins the resulting relation with its partition of the BRANCH table. The slave task then sends the joined rows to the coordinator task.

o The coordinator task merges the results and sends the answer set to the application.

Typical decision-support applications include SQL queries that join relations in a usually predictable way. In those cases, it may be beneficial to partition the relations on their joining columns. This may reduce the number of data exchange messages required for join processing.

Redirected Join : If the ACCOUNT table is partitioned using the CUST_NO attribute rather than the C_BRANCH attribute, a co-located join strategy will generate incorrect results for the SQL query shown above, because a row of the ACCOUNT table may join with a row of the BRANCH table that resides on a different partition. In this case, DB2/6000 could generate a redirected join strategy by hashing each selected ACCOUNT table row on its C_BRANCH value and redirecting the row to the location of the corresponding BRANCH table rows.

Broadcast Join : DB2/6000 Parallel Edition also uses a broadcast join, where the selected rows of the ACCOUNT or BRANCH table are broadcast to the nodes containing the partitions of the other table. This strategy is useful for an index-based join or when one of the joining tables is quite small.

Redistributed Join : Where neither the BRANCH nor the ACCOUNT table is partitioned on the joining columns, a redistributed join strategy is used:

o ACCOUNT table rows are redirected by hashing on the C_BRANCH attribute.

o BRANCH table rows are redirected by hashing on the BRNO attribute.

o The redistributed partitions of the ACCOUNT and BRANCH tables can then be joined locally.

One or more of the above join strategies may be viable, based on the data partitioning. The optimizer selects the strategy that minimizes the overall execution cost. The optimization strategy extends elegantly when more than two relations are being joined. For example, a multi-join query execution strategy might involve a mix of co-located, redirected, broadcast, and redistributed joins.

The parallel execution strategies in DB2/6000 Parallel Edition are executed asynchronously. Coordinator involvement is restricted to initializing slave tasks and to collecting final result sets. Specifically, in a multi-join query, there is no coordinator involvement between joins--all processing is data-flow driven between slave tasks.

Parallel utilities
==================

The DB2/6000 Parallel Edition provides linear scaleup and speedup for all utilities. Two important utilities that help in managing a database are the Data Load and Rebalance utilities.

Data Load

Data can be loaded into database tables using the DB2/6000 import utility or by using fast loading utilities such as Bridge Fastload or the DB2 optional fast-load program.

The import utility loads data by performing INSERT operations. DB2/6000 provides an enhanced INSERT in which rows are batched before being sent to the destination processor. With the DB2/6000 INSERT, rows are batched by their destination processor, and each batch is sent to its processor when the buffer is full. This new INSERT is available to most application programs that use SQL INSERT and coordinate restart points using COMMIT. It is implemented as an option at bind time, and most programs can use it without any modification.

Another option for loading data is to directly create database files without going through the INSERT mechanism. In the parallel database, the best approach is to partition the input data file based on the defined partitioning key values, thereby creating one file for each processor where the table will be stored. Next, each partition is loaded in parallel across the processors. An API facilitates this process: given a row's partitioning key value and its nodegroup name, it determines the processor on which the row should be stored. This API can be used to partition a data file into several files, one per processor, as sketched below.
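The following C program sketches that pre-load partitioning step. The function node_for_row() and the output file names are invented stand-ins for illustration (the real API also takes the nodegroup name, and the actual hashing function is internal to DB2); the program splits an input file, one row per line with the partitioning key as the leading column, into one file per node.

   #include <stdio.h>
   #include <stdlib.h>

   #define N_NODES 3

   /* Invented stand-in for the partitioning API described above: given
      a row's partitioning key value, return the node where it belongs. */
   static int node_for_row(int partition_key)
   {
       return (unsigned)partition_key % N_NODES;
   }

   int main(void)
   {
       FILE *out[N_NODES];
       char  line[512];
       char  name[32];

       /* One output file per node in the nodegroup. */
       for (int i = 0; i < N_NODES; i++) {
           snprintf(name, sizeof name, "account.node%d", i);
           out[i] = fopen(name, "w");
           if (!out[i]) { perror(name); return 1; }
       }

       /* Input rows arrive on stdin as "C_BRANCH,rest-of-row" lines;
          route each line to the file for its destination node. */
       while (fgets(line, sizeof line, stdin)) {
           int key = atoi(line);            /* leading column is the key */
           fputs(line, out[node_for_row(key)]);
       }

       for (int i = 0; i < N_NODES; i++)
           fclose(out[i]);
       return 0;    /* each output file can now be loaded in parallel */
   }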
The Rebalance Utility

A system-defined hashing function is applied to the partitioning key value of a row in a table to determine the processor on which the row is stored. The processor selected is one of the processors in the nodegroup where the table was created. A uniform hashing function distributes data evenly across the set of processors in the nodegroup.

Data may have to be moved between processors when a new processor is added to a nodegroup or when the data distribution across processors is not uniform (because of skew in the data values in the partitioning key column). The rebalance utility can be used to achieve a uniform distribution of data. The rebalance utility is specified at the nodegroup level; rebalancing a nodegroup results in the rebalancing of all tables in that nodegroup. Rebalance is an online operation--neither the database system nor the database needs to be shut down. Database administrators can specify what data should be moved and where it should be moved.

For example, suppose a nodegroup consists of 10 processors (numbered 1 to 10), and two processors, numbered 15 and 16, are added. The rebalance utility can move some data to take advantage of the new processors: data can be moved from processors 1-5 to processor 15, and from processors 6-10 to processor 16, to achieve a uniform distribution of data across all processors in the nodegroup.

In another example, suppose a nodegroup consists of four processors numbered 1 to 4. Processor 1 has 40% of the rows, and the other three each have 20%. The rebalance utility can be used to move approximately 15% of the rows from processor 1 and send about 5% each to the other three.

It is possible to break up a large rebalancing operation into smaller steps by rebalancing only part of the data each time. For example, if 50% of the rows need to be moved from one processor to another, the move could be done in five separate rebalancing operations, each moving 10% of the data. Thus, if large tables need to be rebalanced, the work can be spread over several intervals of low activity.
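The arithmetic behind the four-processor example above can be sketched in a few lines of C (the row counts are invented to match that example; this is a planning calculation, not the rebalance utility itself):

   #include <stdio.h>

   #define NODES 4

   int main(void)
   {
       /* Processor 1 holds 40% of the rows; the others hold 20% each. */
       long rows[NODES] = { 400000, 200000, 200000, 200000 };
       long total = 0;

       for (int i = 0; i < NODES; i++)
           total += rows[i];

       long target = total / NODES;        /* uniform share: 25% each */

       for (int i = 0; i < NODES; i++) {
           long excess = rows[i] - target;
           if (excess > 0)
               printf("processor %d ships %ld rows (%.0f%% of the table)\n",
                      i + 1, excess, 100.0 * excess / total);
           else if (excess < 0)
               printf("processor %d receives %ld rows (%.0f%% of the table)\n",
                      i + 1, -excess, 100.0 * -excess / total);
       }
       return 0;
   }

Run as written, the sketch reports that processor 1 ships 15% of the table and each of the other three processors receives 5%, matching the example above.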
Other Utilities

This section describes how several other functions and utilities are executed in parallel.

Index creation : DB2/6000's parallel capabilities support the creation of unique indexes (where the key columns must include the partitioning key columns) as well as non-unique indexes. Index creation is performed in parallel across processors: a local index is created for each partition of the table stored on a processor.

Reclustering utility (REORG) : The REORG utility reclusters the rows of a table on disk. Clustering each individual table partition at a processor is sufficient to exploit clustered index scan performance. The REORG operation executes in parallel across all processors and can provide linear speedup with the number of processors.

Backup/Restore : The DB2/6000 backup utility is used at each processor to back up the segment of the database that is resident at that processor. The backup image includes the processor number. All processors do not have to be backed up at the same time. Backups can be taken in parallel across processors, but this method requires sufficient backup resources, such as tape drives, at each processor. All logs must be available at each processor to recover the entire database to a consistent point in time. DB2/6000 parallel functions interface with the IBM ADSM product, which manages backups from multiple RISC System/6000 machines. The ADSM interface keeps track of backups and their associated logs; thus, ADSM enhances the manageability of DB2/6000 in parallel environments.

Forward recovery : Except for the resolution of the final in-doubt transactions, all processors can perform forward recovery in parallel by re-applying log entries after a system failure or a restore from backup. DB2/6000 ensures that all nodes are recovered to the latest log entries.

Parallel transaction processing
===============================

DB2/6000 is a full-function parallel relational database management system. It is not limited to queries and can be used to extend the capacity of an online transaction workload beyond a single RISC System/6000. Combined with IBM's HACMP/6000, the DB2/6000 Parallel Edition provides concurrent access to a database from all processors in an HACMP/6000 cluster of RISC System/6000s. DB2/6000's function-shipping implementation is not limited to a single HACMP cluster--multiple clusters can be joined to support larger databases or higher throughput.

Conclusion
==========

The IBM DB2/6000 parallel database server can manipulate large amounts of data efficiently by partitioning data over several nodes and by executing queries in parallel. It provides the following advantages:

o Cost-based optimization

o The best parallel processing strategies

o Efficient, asynchronous execution of subtasks

DB2/6000 also provides efficient transaction processing capabilities and a suite of parallel utilities for database management.

About the author
++++++++++++++++

Gilles Fecteau , IBM Software Solutions Laboratory, 1150 Eglinton Avenue East, North York, Ontario M3C 1H7, Canada. Internet: gfecteau@vnet.ibm.com. Mr. Fecteau is one of the key designers of the parallel version of the DB2 workstation products (DB2/2 and DB2/6000). He has been involved in several advanced technology efforts to bring parallel databases to market. Mr. Fecteau has a Bachelor of Engineering Science degree from Laval University in Quebec City, Quebec.

Copyright and TradeMarks
++++++++++++++++++++++++