Given by Gang Cheng Marek Podgorny Chao-Wei Ou at ARL Database Tutorial on February 98. Foils prepared 7 October 97
Outside Index
Summary of Material
This presentation contains the first two sections
|
Of the full ARL Parallel Database Module |
The first section sets the scene by motivating the need for paraalel databases while |
The second section reviews both Sequentional and Parallel Relational Databases looking at explicit examples
|
We also discuss database system architectures and review |
The SQL Query language |
Outside Index
Summary of Material
Click here for subtitle |
Gang Cheng, Geoffrey C. Fox, Nancy McCracken, Marek Podgorny, Chao-Wei Ou |
October 1997 |
Northeast Parallel Architectures Center |
at Syracuse University |
This presentation contains the first two sections
|
Of the full ARL Parallel Database Module |
The first section sets the scene by motivating the need for paraalel databases while |
The second section reviews both Sequentional and Parallel Relational Databases looking at explicit examples
|
We also discuss database system architectures and review |
The SQL Query language |
This Presentation has five Sections |
Parallel Database Technology in Commercial Applications and Industry |
Parallel Database Technology and Theory |
Parallel Database Projects at NPAC |
Parallel Oracle7 RDBMS -- A Case Study |
Parallel Database Benchmarking |
Conventional (mainframe) systems unable to support
|
In commercial business world, MPP systems
|
High transaction rate applications
|
Real-time applications
|
Complex queries
|
Massive amount of data
|
OLTP (On Line Transaction Processing) --- consists of performing many small independent requests submitted by many clients simultaneously and operating
|
DSS (Decision Support System)
|
Data Warehouse --- operations mixing DSS and OLTP |
Data mining --- uncover/extract meaningful information from massive data sources |
understand customer buying habits and product preference |
predict market tendency |
make efficient, productive and profitable decisions |
must be based on answers to questions on ALL historical business transaction data (Lots of !) |
"Give me a breakdown of all customers likely to default in the coming year" |
"Do three things: predict year-end demand, tell me which customers will fuel that demand, and tell me why" |
"Define a highly concentrated micromarket (make special offers to customer segments that meet very select criteria) within my database" |
"Create a model that really explains why some customers renew their subscriptions and others don't" |
"Suggest ways of regrouping our customers into new market segments (for Direct-Marketing)" |
"Find some customer-preference pattern I might not be aware of" |
Lots of Historical Data (and they are growing daily !) |
High Performance Computing System and DBMS |
Intelligent Data Analysis (Optimization) |
Not possible on a mainframe system in terms of both performance and capacity |
MPP system |
Parallel Relational Database Management System |
Intelligent Datamining Algorithms
|
Oracle (Oracle7 with parallel server and parallel query options) |
Sybase (Navigation Server on AT&T 3600, SUN SMPs, and IBM SP2 (planned),SQL Server 10) |
Informix (INFORMIX-OnLine Dynamic Server) |
IBM (Oracle7, DB2/6000, DB2/2, SP2, IBM 390 Parallel Sysplex) |
SGI (Oracle7 on SMP Challenge database server) |
Taradata with AT&T and NCR |
Tandem (Himalaya K10000 |
Cray Research (SuperServer) |
SUN (Oracle7, DB2 on SUN's SMP) |
AT&T and NCR |
TMC (CM5, Darwin, Oracle7, Decision/SQL, Parasort) |
KSR (Query Decomposor, Oracle7) |
Amdahl (Oracle7) |
nCUBE (Oracle7, nCUBE2) |
data-CACHE Corp (SQL dataMANAGER) |
Data General Corp (Oracle7, DSO) |
DEC (POLYCENTER Manager) |
HP (Oracle7, IMAGE/SQL, HP 3000) |
Encore Computer Corp (Oracle7 on Infinity 90) |
Kmart |
American Express |
AAA Airline |
Citibank |
Prudential Securities |
Bell Atlantic Corp |
BellSouth |
BT (British Telecom) |
Criminal Investigative Technology Inc. |
Check http://greatwall.npac.syr.edu:1963 for more info about parallel database technology in industry |
A typical hardware setup is: |
shared-memory (also called Symmetric Multiprocessors SMP)
|
shared-disks
|
shared-nothing
|
Shared-Memory and Shared-Disk
|
Shared Nothing
|
Shared Data
|
On a shared-nothing architecture, one or more disks are connectedto each processor, depending on machine's (parallel) I/O architecture, via direct connecting or some I/O channels. |
The shared-nothing does not focus on the physical I/O connection architectures, rather, it refers tohow the data is partitioned on the disk arrays and how the data are movedinto the procesor buffers for the parallel query processing. |
In this architecture, data locations determines where the data will be processed and how data is shared by processors other than its local processor (ie. the processor which has the local disk holding the data if direct connection like DB2 on SP2, or the shortest path from disks to the process if other I/O architectures like Oracle on ncube). |
Unlike shared-disk where data may be shared via the interconnection network when it is first readfrom a disk to a remote processor (buffer), if the data local to processorA is required by a remote processor B to perfom some parallel query processing, A sends the data to B via the communication network. |
That is the only shared media for shared-nothing is the communication network. |
In most case, data placement has determined how and where (even when) a parallel query processing is decomposed or partitioned. |
Data (or I/O) load balance determines the CPU load balance in the system. |
This is built in the query decomposer. eg. |
if Processor A gets 10% data local to its disk, while B gets remaining 90%. |
Then in a shared-nothing system, A spends 10% CPU and then is idle while B takes most of CPU time. |
On the otherhand for a shared-disk system, A and B will process similiar amount of data with some overhead of sending data from B to A. |
Pipelined: "assembly line"
|
Partitioned: "divide and conquer"
|
Combination: "best of both world"
|
pipelined parallelism (inter-operator parallelism) |
query parallelism
|
partitioned (declustered) parallelism (I/O parallelism) |
transaction parallelism |
a common practice on large number of disks and mass storage systems (file striping) |
high I/O bandwidth of multiple disks and parallel I/O |
no need for specialized hardware |
important for query load balance on a shared-nothing system |
this is of course a central problem in scientific computing and much of High Performance Fortran is devoted to data partitioning |
Round Robin ( cyclic in HPF/Scientific Computing Notation)
|
Hashing ( scattered decomposition in HPF style)
|
Range Partitioning (block cyclic in HPF style)
|
operations/application dependent |
danger of data skew arises on shared-nothing with data partitioning |
data skew is the focus of a number of research projects |
automatic OS file striping vs manual (application-specific) table striping |
speedup (degree of parallelism)
|
scaleup (scalability) -- transaction scaleup, batch scaleup, processor scaleup, I/O scaleup
|
sizeup (both)
|
startup (latency) |
interference (communication, locking) |
load balance (execution skew - all the execution occurs in one partition) |
data-partition (data skew - all the data is placed in one partition) |
imbalance of system resources (CPUs and I/O channels) |
CPU-intensive or I/O intensive queries |
Data Structure
|
Relation operators
|
embedded operators
|
SQL: Select A.A1,A.A3,B.B1,B.B2 from A,B where A.A1=B.B1; |
a database language specifies the semantics of various components of a DBMS: structures and operation of a data model, data definition, data access, security, programming language interface, and data administration |
Industry accepted Standard, first introduced by ANSI in 1986, current SQL92 by ANSI and ISO, new standard SQL3 with enhancements in object-oriented data management is undergoing. Portable to all RDBMS systems. |
built on relational model, simple and powerful |
non-procedural 4GL language, only specify "what-to-do", not "how-to-do" , extended to object-oriented programming interface
|
applications in SQL do not need to be rewritten to run on parallel server (all parallelizations of SQL programs are carried out by the server), compared to scientific areas where both parallel application algorithms and a parallel compiler are required |
data-independent --- two-phase database schema (logical and physical database) |
SQL in Oracle 7
|
Data Access --- SQL, Transactions, PL/SQL, Data Integrity |
Data Concurrency and Consistency --- Concurrency, Locking |
Data Security --- Users and Schemes, Roles and Privileges, Profiles and Auditing |
Database Backup and Recovery --- Redo log, Rolling Forward/Back |
Distributed Processing --- Client-Server environment |
Historians of programming languages and database people tend to classify programming languages by generations: |
1GL --- machine language (in 0/1 binary) |
2GL --- assembly language (in symbolic machine language) |
3GL --- high-level programming language (structural, modular and usually procedural) |
4GL --- non-procedural (declarative). Most AI languages are 4GL. |
PL/SQL illustrates some low-level data access tools/languages that are available in most RDBMS. |
Basically, there are three levels of database access interfaces: |
1. highest level --- SQL
|
2. Middle-level --- PL/SQL
|
3. Embedded SQL in C (or f77,Cobol) --- programming language interfaces
|
A way to enforce error-checking and formating in data-types. Usually the rules for data integrity is created when a table is specified. Examples:
|
Data integrity is the major way to enforce the data-consistency in the database. |
Schemes are used to define scope of different data objects to enforce data secuity and naming issues in a DBMS. |
A Scheme in a DB is a logical hierarchy of data storage structures. |
Major scheme objects are: users, tablespaces, tables, views, index, cluster, synonyms, columns. |
Access to Different Schemes require different roles and privileges for different users. |
A way to classify level of database users. Like sysadm usually has superuser privileges, roles are used to group users with different privileges. |
Because a DBMS system has many levels of operations which set many levels(kinds) of operation privilages, |
Role is another logical level of grouping to define/specify who will have what system privileges. |
It makes system management easier and more flexible. |
Profiles and Auditing are the way to monitor (system logging) and record user database actions |
They are used for:
|
These are just a DB way of saying logging. |
As in other cases, Database field sometimes use its own different terms for the same thing in other fields. |
For a RDBMS there are two levels of abstractions of how data being stored and represented in the system: |
1) Logical database consists of the conceptual view of a DB system which is described by an ER (entity-relationship) model, defined and access by SQL and represented by tables,columns,views and other abstractions of data-object; |
2) The physical database consists of the actual phyical view of the DB which is represented by files,blocks,indexes,clusters,partitions,etc. |
End-user and developers only need to deal with the logical level, |
While the DBMS and a DBA (database administrator) define and perform the mapping between the two levels. |
This is the reason why SQL achieves portability (oompared to f77), from a viewpoint of a data independent model. |
Implementation of Parallel Cache Management |
Support transaction parallelism of multiple OLTP |
A simple and effective approach to port sequential RDBMS to MPP and loosely coupled systems with shared-nothing architecture |
support parallel loading, parallel indexing, parallel insert/update and parallel recovery |
Major functionality
|
Note this can work fine in OLTP with many uncorrelated queries but will not work in scientific computing where ALL updates are correclated and reserving a resource introduces a sequential bottleneck |
Parallel Query Optimizer
|