NPAC Spring 1995 Seminar Series:
Gang Cheng, Marek Podgorny
NPAC, Syracuse University
February 3, SCI/TECH 1-218, 1pm

ABSTRACT

An overview of NPAC's parallel database hardware and software infrastructure and of our current activities in parallel database technology is presented. Our experience with Decision Support System (DSS) benchmarking is highlighted, followed by a technical discussion of the parallel Oracle 7 software on the nCUBE2 and IBM SP2. Issues in benchmarking DSS workloads using a parallel Relational Database Management System (RDBMS) are also presented.
. Conventional (mainframe) systems are unable to support
  . time-critical queries
  . simultaneous OLTP and decision support
  . gigantic scientific and commercial databases
  . complex queries (approximate matching, pattern recognition, ...)
. In the commercial business world, MPP systems
  . are mature and highly available NOW ("prime time")
  . have better cost/performance than mainframe systems
  . can find more profitable applications than in the scientific world
. Market demand
  . most commercial corporations are re-engineering their businesses to be more efficient with a smaller workforce
  . need better system scalability and price/performance
  . need more computing capacity to grow the business, and fast response to provide better services
. Appeal of parallel database technology
  . Scalability
    . system expandability alleviates migration problems
    . business branches of different sizes can use the same technology
  . Power and price/performance
    . simultaneous OLTP and complex decision support queries (single enterprise image)
    . nCUBE + Oracle capable of >100 tps at ~$2,000/tps (TPC-B benchmark; compare with 425 tps at $16,500/tps for a 4*VAX cluster)
  . High availability
  . Integrability
    . a parallel Oracle RDBMS integrates as easily as a conventional RDBMS
    . does not require re-investment in application software
    . industry-standard database language: SQL (Structured Query Language)
    . parallelization is carried out on the RDBMS server side
    . migration to a parallel RDBMS is painless (compare with scientific applications)
. Massive amounts of data
  . EOS, HEP, weather, census, global environment models, oil exploration, medical history files, multimedia support, ...
. OLTP (On-Line Transaction Processing) --- performing many small independent requests, submitted by many clients simultaneously and operating on a shared database (parallel server)
. DSS (Decision Support System) ---
. Data Warehouse --- operations mixing DSS and OLTP
. Data mining --- uncover/extract meaningful information from mass data sources

Application areas: commercial, administrative, scientific
. High transaction rate applications
  . banks, airlines, telecommunications, security brokers, retailers, ...
. Real-time applications
  . stock trading, C3I (military, civil defense), air traffic, process control (manufacturing, accelerators, reactors/generators), real-time insurance claim analysis, ...
. Complex queries
  . decision support (market analysis), molecular biology, chemistry/pharmacology, ...
Objectives:
. understand customer buying habits and product preferences
. predict market tendencies
. make efficient, productive, and profitable decisions
These must be based on answers to questions over ALL historical business transaction data (lots of it!)

Intelligent Business Systems
. Typical questions:
  . "Give me a breakdown of all customers likely to default in the coming year"
  . "Do three things: predict year-end demand, tell me which customers will fuel that demand, and tell me why"
  . "Define a highly concentrated micromarket (make special offers to customer segments that meet very select criteria) within my database"
  . "Create a model that really explains why some customers renew their subscriptions and others don't"
  . "Suggest ways of regrouping our customers into new market segments (for direct marketing)"
  . "Find some customer-preference pattern I might not be aware of"
. Major technology challenges in this application
  . lots of historical data (and it is growing daily!)
  . high-performance computing system and DBMS
  . intelligent data analysis (optimization)
  . not possible on a mainframe system, in terms of both performance and capacity
. Solutions:
  . MPP system
  . parallel Relational Database Management System
  . intelligent algorithms
. Oracle (Oracle7 with parallel server and parallel query options)
. Sybase (Navigation Server on AT&T 3600, SUN SMPs, and IBM SP2 (planned), SQL Server 10)
. Informix (INFORMIX-OnLine Dynamic Server)
. IBM (Oracle7, DB2/6000, DB2/2, SP2, IBM 390 Parallel Sysplex)
. SGI (Oracle7 on SMP Challenge database server)
. Teradata with AT&T and NCR
. Tandem (Himalaya K10000)
. Cray Research (SuperServer)
. SUN (Oracle7, DB2 on SUN's SMPs)
. AT&T and NCR
. TMC (CM5, Darwin, Oracle7, Decision/SQL, Parasort)
. KSR (Query Decomposer, Oracle7)
. Amdahl (Oracle7)
. nCUBE (Oracle7, nCUBE2)
. data-CACHE Corp (SQL dataMANAGER)
. Data General Corp (Oracle7, DSO)
. DEC (POLYCENTER Manager)
. HP (Oracle7, IMAGE/SQL, HP 3000)
. Encore Computer Corp (Oracle7 on Infinity 90)
. Kmart
. American Express
. AAA Airline
. Citibank
. Prudential Securities
. Bell Atlantic Corp
. BellSouth
. BT (British Telecom)
. Criminal Investigative Technology Inc.
. shared-memory (also called Symmetric Multiprocessing, SMP) --- all processors share direct access to a common global memory and to all disks; example: Oracle7 on the SGI Challenge
. shared-disks --- each processor has a private memory but has direct access to all disks; example: Oracle 7 on the IBM SP2
. shared-nothing --- each memory and disk is owned by some processor that acts as a server for that data; mass storage is distributed among processors by connecting one or more disks to each (good for system scalability and portability); example: Oracle 7 on the nCUBE2
. shared-data --- combines features of the above three models. Data sharing technology directly interfaces with the individual processors to let each processor cache shared data in its own local memory with full read/write capability (as in the shared-memory model). Data sharing maintains data integrity by ensuring cache coherency across all processors and provides efficient cross-system locking mechanisms (as in the shared-nothing model). All disk devices are accessible to each processor, so a workload may be scheduled to run on any available processor (as in the shared-disks model), instead of only on processors with data affinity (as in the shared-nothing model); example: DB2 on the IBM System/390 Parallel Sysplex
(four figures illustrating the four architectures)
(with "Scalability" in mind) . Shared-Memory and Shared-Disk . interference limits scalability . complicated and relatively expensive . Shared Nothing . minimal interference between processors . moves only questions and answers over the newtwork . can scale up to hundreds (maybe thousands) of processors . composed of cheap, stock parts and a fast interconnecting network . achieve near-linear improvements
. Pipelined: "assembly line" . average query does not have that many steps . some operational stages cannot be pipled (e.g. sort) . one operator is longer than the other (a kind of skew) . Partitioned: "divide and conquer" . fairly straight forward to implement . "tough" operations are divided, parts run in parallel . Combination: "best of both world" . several partitions running parallel . each partition is, where possible, a short pipeline
. pipelined parallelism (inter-operator parallelism)
. query parallelism
. inter-query parallelism
. intra-operator parallelism
. partitioned (declustered) parallelism (I/O parallelism)
. transaction parallelism
How to divide data among multiple disks?
. a common practice on systems with large numbers of disks and on mass storage systems (file striping)
. provides the high I/O bandwidth of multiple disks and parallel I/O
. no need for specialized hardware
. important for load balance on a shared-nothing system
. major approaches (see the sketch following this list):
  . Round Robin
    . data not scattered
    . cannot locate specific records
  . Hashing
    . scatters the data
    . can locate specific records
    . danger of data skew
  . Range Partitioning
    . data scattered
    . can locate specific records
    . danger of data skew
    . "related" data can be clustered
(figures)
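As an illustrative sketch (the notation is ours, not from the slides), the three placement rules for declustering a relation across n disks can be written as:

    % round robin: the i-th tuple in load order goes to disk i mod n
    disk(t_i) = i \bmod n
    % hashing: a hash function h applied to the partitioning attribute A
    disk(t) = h(t.A) \bmod n
    % range partitioning: boundaries v_0 < v_1 < ... < v_n split A's domain;
    % tuples with nearby values of A cluster on the same disk
    disk(t) = j \quad \text{such that} \quad v_{j-1} \le t.A < v_j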
. operation/application dependent
. the danger of data skew arises on shared-nothing systems with data partitioning
. data skew is the focus of a number of research projects
. automatic OS file striping vs. manual table striping
. speedup (degree of parallelism)
. scaleup (scalability) --- transaction scaleup and batch scaleup
. sizeup (both)
(definitions sketched below)
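These are the standard metrics of the parallel database literature; in symbols, with T(p, s) denoting elapsed time on p processors for a job of size s:

    \mathrm{speedup}(N) = \frac{T(1, S)}{T(N, S)}
    \qquad \text{(linear speedup: } \mathrm{speedup}(N) = N\text{; same problem, bigger system)}

    \mathrm{scaleup}(N) = \frac{T(1, S)}{T(N, N \cdot S)}
    \qquad \text{(linear scaleup: } \mathrm{scaleup}(N) = 1\text{; problem grows with the system)}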
. startup (latency)
. interference (communication, locking)
. load imbalance (execution skew --- all the execution occurs in one partition)
. data partitioning (data skew --- all the data is placed in one partition)
. imbalance of system configuration (CPUs vs. I/O channels)
. CPU-intensive vs. I/O-intensive queries
(a crude cost model follows)
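A crude illustrative model (ours, not from the slides) of how these barriers enter the elapsed time of a job of total work W on N processors:

    T(N) \approx T_{\mathrm{startup}}(N) + \frac{W}{N}
           + T_{\mathrm{interference}}(N) + T_{\mathrm{skew}}(N)

The useful term W/N shrinks with N while the startup and interference terms typically grow, so speedup eventually flattens; skew further bounds T(N) from below by the completion time of the largest partition.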
Data Structure
. relations (files, tables)
. tuples (records, rows)
. attributes (fields, columns)
(figure)
. scan (select-project) --- given a relation R, a predicate P, and an attribute list L, return the tuples of R that satisfy P, projected onto L (see the example below)
. sort (reorder)
. aggregate operators (SUM, AVG, MAX, MIN, ...)
. insert/delete/update
. set operators (union, intersection, difference)
. join, merge, and division
. embedded operators --- the source of the data-flow execution model
. uniformity of the data and operators
(figure showing an example of a join)
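For instance, the scan operator corresponds directly to a basic SQL query; the table and column names below are hypothetical:

    -- scan (select-project): relation R = orders, predicate P = (price > 100),
    -- attribute list L = (order_id, customer)
    SELECT order_id, customer     -- project onto L
    FROM   orders                 -- relation R
    WHERE  price > 100;           -- predicate P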
. Data Access --- SQL, transactions, PL/SQL, data integrity
. Data Concurrency and Consistency --- concurrency, locking
. Data Security --- users and schemas, roles and privileges, profiles and auditing
. Database Backup and Recovery --- redo log, rolling forward/back
. Distributed Processing
. a non-procedural language (4GL): specify only "what to do", not "how to do it"
. SQL applications need not be rewritten to run in parallel, in contrast to the scientific areas
. built on the relational model
. data-independent --- two-phase database schema (logical and physical database)
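A one-line illustration of the "what, not how" property, using a hypothetical table: the statement names only the desired result, and the RDBMS chooses access paths, join methods, and (in a parallel server) the degree of parallelism:

    -- Nothing here mentions indexes, scan order, or parallelism;
    -- the optimizer picks the execution strategy.
    SELECT   customer, SUM(price)
    FROM     orders
    GROUP BY customer;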
. Distributed Lock Manager (DLM)
  . the implementation vehicle of Parallel Cache Management
  . supports transaction parallelism for multiple OLTP streams
  . a simple and effective approach for porting a sequential RDBMS to MPP and loosely coupled systems with shared-nothing architecture
  . supports parallel loading, parallel indexing, parallel insert/update, and parallel recovery
  . major functions:
    . keeps track of the current "ownership" of a resource
    . accepts requests for resources from application processes
    . notifies the requesting process when a resource is available
    . grants a requesting process exclusive access to a resource
. Parallel Query Optimizer
  . builds a query execution plan employing all system resources; on the basis of data-partition knowledge, it can easily employ a few tens of processors to execute a single query with a high degree of parallelism; both data-parallel and data-flow paradigms are used to achieve this goal
  . supports complex CPU-intensive queries for DSS
  . near-linear or even super-linear speedup can be achieved
  . a major research focus --- optimized parallel algorithms are needed
  . parallelization is carried out by a query coordinator on the RDBMS server: the coordinator generates an optimal parallel plan rather than simply executing the best serial plan in parallel, and performs this optimization without the need for any external information (DB2 and Oracle7 use different approaches)
  . most relational operators are parallelizable, including full-table scan, sort, aggregate operators, join, and merge
  . system/server tuning plays an important role in achieving optimized performance in a parallel RDBMS --- benchmarking is an effective means (a plan-inspection example follows)
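One standard Oracle 7 facility for seeing what plan the optimizer chose is EXPLAIN PLAN; a minimal sketch, assuming a hypothetical orders table and the PLAN_TABLE created by the utlxplan.sql script shipped with Oracle (exact PLAN_TABLE columns vary by release):

    -- Record the optimizer's plan for a query under a chosen statement id.
    EXPLAIN PLAN
      SET STATEMENT_ID = 'dss_q1'
      FOR SELECT customer, SUM(price) FROM orders GROUP BY customer;

    -- Inspect the recorded plan steps.
    SELECT operation, options, object_name
    FROM   plan_table
    WHERE  statement_id = 'dss_q1'
    ORDER  BY id;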
nCUBE2
. 64 computing nodes, 2 GB RAM
. 32 disk I/O nodes, 256 MB RAM, 32 SCSI2 channels
. 96 x 2 GB disks, 192 GB total
. OS 3.3 (GFS) and 3.4 beta (PFS, NFS)
. Oracle 7.1.3 Parallel Server with PDQ
. Oracle Parallel Text Server 3.0 beta
. Video-On-Demand software
upcoming upgrade (Summer '95):
. nCUBE3, 96 nodes, 6 GB RAM
. 32 SCSI Fast/Wide channels, 1 TB on-line disk space
. 8 ATM DS3 network connections
IBM SP2
. 8 Thin and 4 Wide nodes, 256 MB RAM each
. 13/21 GB SCSI2 disks per node, 188 GB total
. TB2 high-performance switch
. parallel I/O supported via VSD software
database software:
. IBM parallel DB2 (beta)
. Oracle 7.1.3.2 Parallel Server (beta)
upcoming upgrade: switched Ethernet/FDDI/ATM hook-up to the NPAC backbone
. Planning, design, implementation, and interpretation of Decision Support benchmarks using parallel database technology
. An InfoMall product, supported by technology vendors
. Standard and customer-defined benchmarks
. NPAC provides state-of-the-art infrastructure, MPP and parallel database expertise, and trained personnel
. InfoMall assumes most of the risks and hazards involved in the benchmarking process
. Non-disclosure arrangements available
InfoMall project
NPAC teams with a small software vendor, a major software vendor, and a hardware vendor to provide an integrated Decision Support system for a number of Fortune 500 companies
NPAC role:
. help parallelize the code kernel of the Decision Support package
. interface the package to a parallel database system, and provide capacity planning based on customer performance requirements and NPAC parallel database expertise
. provide infrastructure and system support for the project duration
Project deliverable: an integrated turnkey DS system with all elements implemented on the same MPP platform
. Beta test sites with Oracle, nCUBE, and IBM for their parallel RDBMS products
. Direct and close contacts with parallel RDBMS developers at Oracle and IBM
. Hands-on expertise in large-scale (up to 100 GB database size) DSS benchmarking
. Expertise in parallel computing and applications, parallel system integration, and data visualization
. A high-performance RDBMS for loosely coupled and MPP systems
. currently supported systems: IBM SP2, nCUBE2, SGI SMP, SUN SMP, HP SMP, Amdahl, Encore, KSR, CM5
. also known as multi-instance Oracle --- multiple instances running against a single, shared database
. key technology: Parallel Cache Management and Parallel Query Processing
. an evolutionary approach built on the existing sequential Oracle RDBMS
  . first introduced with Oracle 6.2
  . version 6.2 = version 6.0 (sequential) + Parallel Cache Management
  . version 7.1 = version 6.2 + Parallel Query Processing
  . newest version 7.1.3 = sequential Oracle + Parallel Server Option + Parallel Query Option
  . the production release on the nCUBE2 shipped just a week ago
. all parallel implementation is transparent to users
. key components: a Distributed Lock Manager and a Parallel Query Optimizer
(fig.)
Oracle 7 Database Files
. Data Files --- contain all the database data; logical structures, such as tables and indexes, are physically stored in the data files
. Redo Log Files --- transaction logs that record all changes made to database information, for recovery
. Control Files --- record the physical structure of the database, including all data files and redo log files
. Every time the Oracle 7 Server is started, the System Global Area (SGA) is allocated and the Oracle 7 Server background processes are started. The combination of memory buffers and background processes is called an Oracle Server instance.
SGA (fig.)
. The SGA is a group of shared memory buffers allocated by the Oracle 7 Server for an instance
. The background processes asynchronously perform distinct tasks on behalf of all database users of the instance
(fig.)
. start an Oracle Server instance on each node
. each instance has its own set of redo log files
. all instances can concurrently execute transactions against the same database, and each instance can have multiple users executing transactions
. Multiple instances on different nodes share a single physical database
. The data is accessed by all nodes through a parallel file subsystem
. All instances share common data and control files, and each has its individual redo log files
. All instances can simultaneously execute transactions against the single database, and each instance can have multiple users executing transactions
. Data is kept consistent using PCM (Parallel Cache Management) technology, implemented by the Distributed Lock Manager
. Concurrency control of buffer caches among instances
. lets instances maintain independent, yet consistent, caches of blocks
. lets multiple nodes simultaneously read from and write to the database (disks)
. maximizes concurrent data access and minimizes (disk) I/O activity
. block-level PCM locks, with row-level transaction locking preserved
. a fault-tolerant DLM: if a node goes down, the surviving DLMs automatically recover and reconstruct the lock database on behalf of the failed DLM
(fig.)
Consider the following cache-coherency scenario without PCM:
Time 1: Node 1 reads block 1 from disk into its memory cache; Node 2 reads block 1 from disk into its memory cache
Time 2: Node 1 writes information to block 1; Node 2 writes information to block 1
Time 3: Node 1 writes block 1 to disk
Time 4: Node 3 reads block 1 from disk into its memory cache
In this situation, Nodes 1 and 3 hold the same version of block 1 in their caches, but Node 2's cache is incoherent.

Using the Oracle 7 Server's PCM locks:
. an instance acquires a PCM lock on a resource for update purposes
. all buffer caches must contain the same copy of a resource

PCM locks:
. ensure cache coherency by allowing instances to change blocks only while holding the lock
. work independently of transaction locks (row-level locking)
. ensure cache coherency using a minimum amount of communication
. are allocated to files when an instance is started

Consider the following scenario with PCM:
Time 1: Instance X attempts to update row 1 in data block n; X becomes owner of the PCM lock for block n; X updates row 1
Time 2: Instance Y attempts to update row 2 in data block n; Y requests the PCM lock from X; X releases the lock; Y becomes owner of the PCM lock for block n; Y updates row 2
Time 3: Instance X attempts to update row 3 in data block n; X requests the PCM lock from Y; Y releases the lock; X becomes owner of the PCM lock for block n; X updates row 3

. PCM locking is handled separately from transaction locking --- when instances wish to update the same data (e.g., instances X and Y both attempt to update row 1), normal transaction locking is involved.
. Parallel Query Processing
  . enables the parallel processing of long and complex queries with parallelizable operators such as scan, join, and sort
  . speeds up query response time (vs. throughput in the Parallel Server Option)
. Parallel Data Load
  . loads data using both direct-path and parallel methods
  . speeds up the loading of large data sets
. Parallel Index Creation
  . speeds up the creation of large indexes (see the sketch below)
. Parallel Recovery
  . enables concurrent recovery sessions
Best suited to data-intensive applications such as decision support systems and data warehouses
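A hedged sketch of parallel index creation (the Oracle 7.1 PARALLEL clause as we understand it; the table, index name, and degree are hypothetical):

    -- Build the index with 8 query server processes;
    -- DEGREE controls the parallelism within one instance.
    CREATE INDEX orders_cust_idx
      ON orders (customer)
      PARALLEL (DEGREE 8);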
Data flow (fig.):
1. A client process issues a query
2. The statement is parsed, and a shared server process executes the statement
3. The shared server process determines whether the statement can use parallel processing, and then enlists two or more query servers to execute the statement in parallel, partitioning the work of each operation equally among the query servers
4. The query servers process certain operations in the query independently and return sub-results to the shared server process
5. The shared server process reassembles the results from the query servers and returns the query's results to the client
1. Parse --- the query optimizer determines the optimal execution plan and parallelizes the query if a full scan is included in the plan (note that the query is parallelized dynamically at execution time, automatically adapting to changes in data partitioning and server configuration)
2. Execute --- perform the operations
3. Fetch (queries) --- return rows from the query server processes
The execute and fetch phases can be performed in parallel (pipelined parallelism) by multiple server processes
(fig., fig.)
. the shared server process is known as the Query Coordinator, which is responsible for parallelizing operations and interpreting the results from the query servers
. not all operators can be parallelized in Oracle 7.1.3 (see the list and sketch below)
. Table scans
. Sorts (for ORDER BY, GROUP BY, joins, DISTINCT)
. Joins (sort-merge, nested loops)
. Aggregates (GROUP BY, MIN, MAX, AVG, etc.)
. SET operations (e.g., UNION, INTERSECT, or MINUS) are NOT parallelized
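A sketch of a query exercising several of these operations at once; the /*+ PARALLEL */ hint syntax is Oracle's, while the table and degree are hypothetical (recall that TPC-D runs explicitly forbid optimizer hints):

    -- Parallel full scan of orders (degree 8), feeding a parallelized
    -- GROUP BY aggregate and an ORDER BY sort.
    SELECT /*+ FULL(o) PARALLEL(o, 8) */
           customer, AVG(price)
    FROM   orders o
    GROUP  BY customer
    ORDER  BY customer;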
. Oracle 7 on the nCUBE2 and IBM SP2 is actually implemented on shared-disk (vs. shared-nothing) architectures
. the database is shared by sharing disks through a parallel file system or parallel I/O system: on the nCUBE2, the GFS (Global File System) over a Gigcache Subcube and the nCUBE2's parallel I/O system; on the IBM SP2, the VSD (Virtual Shared Disk) parallel I/O system over the HPS
. automatic file striping and disk partitioning by the operating system on the nCUBE2
  . general OS and server level; less flexible, but relatively transparent to the DBA
  . data skew is less likely, due to the Gigcache
. manual file striping at the database table level (see the sketch below)
  . application-specific and flexible, but difficult to manage with a large number of files, and the optimal striping length is impossible to determine
  . data skew arises easily
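A minimal sketch of manual table-level striping (Oracle 7 DDL; the paths, sizes, and names are hypothetical): the tablespace is built from one data file per disk, and equal extent sizes encourage extent allocation to rotate across the files:

    -- One data file per physical disk.
    CREATE TABLESPACE stripe_ts DATAFILE
      '/disk1/stripe1.dbf' SIZE 500M,
      '/disk2/stripe2.dbf' SIZE 500M,
      '/disk3/stripe3.dbf' SIZE 500M,
      '/disk4/stripe4.dbf' SIZE 500M;

    -- Equal INITIAL/NEXT extents approximate round-robin striping by hand.
    CREATE TABLE lineitem_striped (
      order_id NUMBER,
      price    NUMBER
    )
    TABLESPACE stripe_ts
    STORAGE (INITIAL 100M NEXT 100M PCTINCREASE 0);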
. Transaction Processing Performance Council benchmarks: TPC-A, TPC-B, TPC-C; TPC-D (not yet finalized)
. Wisconsin Benchmark (aka DeWitt)
. Set Query
. The Neal Nelson Database Benchmark (tm)
. SPEC
. The Customer --- or, worse, The Consultant
. few insert/update/delete transactions; read-only queries
. non-selective and data-intensive; few in number, large in size
. may return a large volume of data, e.g., a mailing list
. queries frequently scan a large percentage of the data in the database
. often used to mine data in a recursive, drill-down fashion
A common suite for DSS benchmarking makes it possible to compare:
. different software and hardware systems (such as Oracle 7 on the nCUBE2 vs. DB2 on the IBM SP2)
. different software products on the same machine (such as Oracle7 and DB2 on the IBM SP2)
. different machines in a compatible family (such as Oracle7 on the IBM SP1 and SP2, or on an RS/6000)
. different releases of a product on the same machine (such as Oracle 7.1.0, 7.1.2, and 7.1.3 on the nCUBE2)
Our TPC-D benchmarking project has covered most of the above cases
. The TPC-D definition requires that either one query stream (power test) or many (scale-up test) be run and timed against at least three sizes of a predefined database, with sizes growing exponentially (e.g., the 1 GB, 10 GB, 100 GB sequence)
. A query stream consists of 17 complex queries plus two concurrently running update operations
. No extensive database tuning is allowed --- optimizer hints are explicitly forbidden
. Timings for individual queries are not part of the TPC-D metric
. The database structure represents the most common DSS databases in applications well, i.e., the single largest table takes ~70% of the total DB size
Queries
. model complex queries against a relatively large and complex database
. all currently parallelizable relational operators (Oracle 7 Server) are used, including:
  Aggregate (Group By) - Q2, Q6, Q11, Q14, Q15
  Merge Join - Q8
  Nested Loops - all except Q1
  Sort (Group By) - all except Q2
  Sort (Join) - Q8, Q12
  Sort (Order By) - Q3, Q5, Q10, Q11, Q16
  Sort (Unique) - Q4
  Table Access (Full) - all
Problems with the TPC-D design
. Query stream design: Query 17 takes more than 95% of the time needed to execute the stream
. The overall design unnecessarily mimics the TPC OLTP benchmarks
. The metrics are more suitable for Data Warehouse applications than for ad hoc DS queries
. The benchmark is expensive to run (huge disk space requirements)
Is TPC-D suitable for parallel DBMS evaluation?
. No scalability performance metrics suitable for parallel data query evaluation, i.e., query time vs. number of instances (CPUs), and/or query time vs. available I/O resources (I/O channels, drivers, disks, and disk partitions)
. Scalability of a parallel database system relies on both CPU resources and I/O resources. By using multiple instances (CPUs), the parallel server/query decreases query response time and increases throughput. By using partitioned data on multiple disks (with multiple I/O channels), the parallel server/query achieves higher I/O bandwidth for large DBs. The two characteristics are equally important and must be benchmarked together. The current TPC-D specification defines neither performance metric.
. For DS applications, query stream execution time is largely irrelevant. It is far more important to understand the performance and scalability of elementary SQL operations and of typical queries encountered in customer applications.
. Loading and indexing may take a large portion of the overall time needed to prepare a large DB. Parallel loading and indexing benchmarks for large DSS databases should be defined and run together with the query benchmarks. Currently, neither is defined or even mentioned in the TPC-D specification.
. The TPC-D design contains flaws that make its use questionable even for sequential database systems
. Given the complexity of parallel database systems, attempts to reduce their performance estimate to a single number are most likely futile
. TPC-D defines no metrics useful for understanding the behavior of parallel RDBMSes; the TPC-D designers should re-examine academic research work pertinent to the field
. Future research work toward the definition of a DS benchmark for parallel databases seems important
. Data partitioning must be very carefully designed to balance I/O and CPU
. Parallel loading is great! --- almost linear speedup
. Backup may be important to save time (we did not have one)
. The most difficult and trouble-prone phase is database preparation/loading/indexing, which always involves storage management parameters
. You must be both a DBA and a sysadmin to get the job done quickly!
. In the worst case, you may need to bring down the whole parallel system a couple of times a day to get a clean system
. Benchmarking relies critically on system stability
. Be prepared for negative results (nothing may be obtained!)
. You need at least twice the disk space of your original database size
. Performance tuning is extremely difficult with the Oracle 7 parallel server
We regret that no performance data is given here, in compliance with the non-disclosure agreements with our sponsors and vendors.