Given by Gang Cheng and Marek Podgorny (with Geoffrey Fox) at the HPDC-4 Tutorial/CPS616 Spring 95 on August 1, 1995, Pentagon City, Virginia. Foils prepared July 23, 1995
Abstract
HPDC4 |
Washington DC |
August 1,1995 |
NPAC |
Gang Cheng, Marek Podgorny with Geoffrey Fox |
Syracuse University |
111 College Place |
Syracuse |
NY 13244-4100 |
Commercial Applications and Motivation for large Parallel Databases |
Overview of base Technology including parallel architecture and review of sequential SQL |
NPAC's Parallel Database Installation |
Parallel Oracle on SP2 and nCUBE |
DR-LINK high level frontend from Textwise |
Some experiences with benchmarking --TPC-D |
Linking of relational databases to the Web
|
Gang Cheng, Marek Podgorny |
April 1995 |
Northeast Parallel Architectures Center |
at Syracuse University |
This presentation contains the first two sections
|
Of the full CPS616 Parallel Database Module |
The first section sets the scene by motivating the need for parallel databases while |
The second section reviews both Sequential and Parallel Relational Databases looking at explicit examples
|
We also discuss database system architectures and review |
The SQL Query language |
This Presentation has five Sections |
Parallel Database Technology in Commercial Applications and Industry |
Parallel Database Technology and Theory |
Parallel Database Projects at NPAC |
Parallel Oracle7 RDBMS -- A Case Study |
Parallel Database Benchmarking |
Conventional (mainframe) systems are unable to support:
|
In the commercial business world, MPP systems are increasingly used to handle:
|
High transaction rate applications
|
Real-time applications
|
Complex queries
|
Massive amount of data
|
OLTP (On Line Transaction Processing) --- consists of performing many small independent requests submitted simultaneously by many clients, each operating on a small part of a shared database
|
DSS (Decision Support System)
|
Data Warehouse --- operations mixing DSS and OLTP |
Data mining --- uncover/extract meaningful information from massive data sources |
understand customer buying habits and product preference |
predict market tendency |
make efficient, productive and profitable decisions |
must be based on answers to questions over ALL historical business transaction data (lots of it!) |
"Give me a breakdown of all customers likely to default in the coming year" |
"Do three things: predict year-end demand, tell me which customers will fuel that demand, and tell me why" |
"Define a highly concentrated micromarket (make special offers to customer segments that meet very select criteria) within my database" |
"Create a model that really explains why some customers renew their subscriptions and others don't" |
"Suggest ways of regrouping our customers into new market segments (for Direct-Marketing)" |
"Find some customer-preference pattern I might not be aware of" |
Lots of Historical Data (and it is growing daily!) |
High Performance Computing System and DBMS |
Intelligent Data Analysis (Optimization) |
Not possible on a mainframe system in terms of both performance and capacity |
MPP system |
Parallel Relational Database Management System |
Intelligent Datamining Algorithms
|
Oracle (Oracle7 with parallel server and parallel query options) |
Sybase (Navigation Server on AT&T 3600, SUN SMPs, and IBM SP2 (planned),SQL Server 10) |
Informix (INFORMIX-OnLine Dynamic Server) |
IBM (Oracle7, DB2/6000, DB2/2, SP2, IBM 390 Parallel Sysplex) |
SGI (Oracle7 on SMP Challenge database server) |
Teradata with AT&T and NCR |
Tandem (Himalaya K10000) |
Cray Research (SuperServer) |
SUN (Oracle7, DB2 on SUN's SMP) |
AT&T and NCR |
TMC (CM5, Darwin, Oracle7, Decision/SQL, Parasort) |
KSR (Query Decomposor, Oracle7) |
Amdahl (Oracle7) |
nCUBE (Oracle7, nCUBE2) |
data-CACHE Corp (SQL dataMANAGER) |
Data General Corp (Oracle7, DSO) |
DEC (POLYCENTER Manager) |
HP (Oracle7, IMAGE/SQL, HP 3000) |
Encore Computer Corp (Oracle7 on Infinity 90) |
Kmart |
American Express |
AAA Airline |
Citibank |
Prudential Securities |
Bell Atlantic Corp |
BellSouth |
BT (British Telecom) |
Criminal Investigative Technology Inc. |
Check http://greatwall.npac.syr.edu:1963 for more info about parallel database technology in industry |
A typical hardware setup is: |
shared-memory (also called Symmetric Multiprocessors SMP)
|
shared-disks
|
shared-nothing
|
On a shared-nothing architecture, one or more disks are connected to each processor, either directly or through I/O channels, depending on the machine's (parallel) I/O architecture. |
The shared-nothing label does not refer to the physical I/O connection architecture; rather, it refers to how the data is partitioned on the disk arrays and how the data is moved into the processor buffers for parallel query processing. |
In this architecture, data location determines where the data will be processed and how it is shared with processors other than its local processor (i.e. the processor owning the local disk holding the data, for direct connections like DB2 on SP2, or the processor on the shortest path from the disks, for other I/O architectures like Oracle on nCUBE). |
Unlike shared-disk, where data may be shared via the interconnection network when it is first read from a disk into a remote processor's buffer, in shared-nothing the data local to processor A is sent by A over the communication network whenever a remote processor B needs it for parallel query processing. |
That is, the only shared medium in a shared-nothing system is the communication network. |
In most cases, data placement determines how and where (even when) a parallel query is decomposed or partitioned. |
Data (or I/O) load balance determines the CPU load balance in the system. |
This is built into the query decomposer. E.g., |
suppose processor A holds 10% of the data on its local disk, while B holds the remaining 90%. |
Then in a shared-nothing system, A uses 10% of the CPU time and then sits idle while B does most of the work. |
In a shared-disk system, on the other hand, A and B process similar amounts of data, with some overhead for sending data from B to A. |
Combining features of the above three models, data-sharing technology interfaces directly with the individual processors so that each processor can cache shared data in its own local memory with full read/write capability (as in the shared-memory model). Data sharing maintains data integrity by ensuring cache coherency across all processors and provides efficient cross-system locking mechanisms (as in the shared-nothing model). All disk devices are accessible to each processor, so a workload may be scheduled on any available processor (as in the shared-disk model), instead of only on the processors with data affinity (as in the shared-nothing model). Example of a shared-data system: DB2 on IBM System/390 Parallel Sysplex. |
It is difficult to characterise Oracle7 on nCUBE2, as it mixes features of both the shared-disk and shared-nothing architectures by using an additional subcube as a giga-cache, which lies between the nCUBE I/O system (I/O nodes, multiple I/O channels and disk drives) and the compute-node subcube(s). |
Before data is read into a compute processor's buffer, it is first cached in a Giga-cache node. |
So if you look just at the compute-subcube <--> Giga-cache subcube, it looks like a shared-nothing system, |
but if you look at the whole compute-subcube <--> Giga-cache subcube <--> disk arrays, it looks like a shared-disk system. |
Strictly speaking, and compared to DB2 on SP2, it is not a shared-nothing system, as the data placement on the disk arrays has little to do with how query processing is parallelized or decomposed. |
This is also the reason why we found data partition schemes have less I/O performance impact for Oracle7 on nCUBE than on SP2 (the latter is a shared-disk system). |
Shared-Memory and Shared-Disk
|
Shared Nothing
|
Shared Data
|
Pipelined: "assembly line"
|
Partitioned: "divide and conquer"
|
Combination: "best of both worlds"
|
pipelined parallelism (inter-operator parallelism) |
query parallelism
|
partitioned (declustered) parallelism (I/O parallelism) |
transaction parallelism |
a common practice on large number of disks and mass storage systems (file striping) |
high I/O bandwidth of multiple disks and parallel I/O |
no need for specialized hardware |
important for query load balance on a shared-nothing system |
this is of course a central problem in scientific computing and much of High Performance Fortran is devoted to data partitioning |
Round Robin ( cyclic in HPF/Scientific Computing Notation)
|
Hashing ( scattered decomposition in HPF style)
|
Range Partitioning (block cyclic in HPF style)
|
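As a minimal sketch (purely illustrative -- Oracle 7 had no declarative table partitioning, and the function below is hypothetical), the three schemes can be viewed as different ways of mapping a row to one of n_disks partitions: |
-- hypothetical PL/SQL sketch: map a row to one of n_disks partitions
CREATE OR REPLACE FUNCTION partition_of (row_seq IN NUMBER,   -- load order of the row
                                         key     IN NUMBER,   -- partitioning key value
                                         n_disks IN NUMBER,
                                         scheme  IN VARCHAR2)
  RETURN NUMBER IS
BEGIN
  IF scheme = 'ROUND_ROBIN' THEN                   -- cyclic: i-th row goes to disk i mod n
    RETURN MOD(row_seq, n_disks);
  ELSIF scheme = 'HASH' THEN                       -- scattered: hash of the key value
    RETURN MOD(key * 2654435761, n_disks);         -- simple multiplicative hash
  ELSE                                             -- 'RANGE': contiguous key ranges per disk
    RETURN LEAST(TRUNC(key / 1000), n_disks - 1);  -- assumes keys spread over 0..1000*n_disks
  END IF;
END;
/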
operations/application dependent |
danger of data skew arises on shared-nothing with data partitioning |
data skew is the focus of a number of research projects |
automatic OS file striping vs manual (application-specific) table striping |
speedup (degree of parallelism)
|
scaleup (scalability) -- transaction scaleup, batch scaleup, processor scaleup, I/O scaleup
|
sizeup (both)
|
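For reference, the usual definitions are: |
speedup(N) = elapsed time on the original system / elapsed time on an N-times larger system, for the SAME problem (ideal: speedup = N)
scaleup(N) = elapsed time for the original problem on the original system / elapsed time for an N-times larger problem on an N-times larger system (ideal: scaleup = 1)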
startup (latency) |
interference (communication, locking) |
load balance (execution skew - all the execution occurs in one partition) |
data-partition (data skew - all the data is placed in one partition) |
imbalance of system resources (CPUs and I/O channels) |
CPU-intensive or I/O intensive queries |
Data Structure
|
Relation operators
|
embedded operators
|
SQL: Select A.A1,A.A3,B.B1,B.B2 from A,B where A.A1=B.B1; |
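For concreteness, a hedged sketch of how the two tables in this query might be declared (column types are assumptions), followed by the same equi-join written out: |
CREATE TABLE A (A1 NUMBER, A2 VARCHAR2(20), A3 VARCHAR2(20));
CREATE TABLE B (B1 NUMBER, B2 VARCHAR2(20));

SELECT A.A1, A.A3, B.B1, B.B2     -- project four columns
FROM   A, B                       -- from the two relations
WHERE  A.A1 = B.B1;               -- equi-join condition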
a database language specifies the semantics of various components of a DBMS: structures and operation of a data model, data definition, data access, security, programming language interface, and data administration |
Industry-accepted standard, first introduced by ANSI in 1986; current SQL92 by ANSI and ISO; a new standard, SQL3, with enhancements for object-oriented data management is under development. Portable across RDBMS systems. |
built on relational model, simple and powerful |
non-procedural 4GL language: specifies only "what to do", not "how to do it"; extended to object-oriented programming interfaces
|
applications in SQL do not need to be rewritten to run on parallel server (all parallelizations of SQL programs are carried out by the server), compared to scientific areas where both parallel application algorithms and a parallel compiler are required |
data-independent --- two-phase database schema (logical and physical database) |
SQL in Oracle 7
|
Data Access --- SQL, Transactions, PL/SQL, Data Integrity |
Data Concurrency and Consistency --- Concurrency, Locking |
Data Security --- Users and Schemes, Roles and Privileges, Profiles and Auditing |
Database Backup and Recovery --- Redo log, Rolling Forward/Back |
Distributed Processing --- Client-Server environment |
Historians of programming languages and database people tend to classify programming languages by generations: |
1GL --- machine language (in 0/1 binary) |
2GL --- assembly language (in symbolic machine language) |
3GL --- high-level programming language (structural, modular and usually procedural) |
4GL --- non-procedural (declarative). Most AI languages are 4GL. |
PL/SQL is an example of the lower-level data access tools/languages that are available in most RDBMSs. |
Basically, there are three levels of database access interfaces: |
1. highest level --- SQL
|
2. Middle-level --- PL/SQL
|
3. Embedded SQL in C (or f77,Cobol) --- programming language interfaces
|
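A hedged illustration of the first two levels, using the hypothetical table A from the earlier sketch (level 3 would embed similar statements in a C host program): |
-- level 1: a single declarative SQL statement
SELECT A1, A3 FROM A WHERE A1 > 100;

-- level 2: the same access wrapped in procedural PL/SQL
DECLARE
  CURSOR c IS SELECT A1, A3 FROM A WHERE A1 > 100;
BEGIN
  FOR r IN c LOOP
    NULL;   -- application logic (formatting, further processing) would go here
  END LOOP;
END;
/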
A way to enforce error-checking and formatting of data types. Usually the rules for data integrity are created when a table is specified. Examples:
|
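A hedged example (table and column names are made up) of integrity rules declared together with the table: |
CREATE TABLE employee (
  emp_id   NUMBER(6)      PRIMARY KEY,               -- entity integrity: unique, not null
  name     VARCHAR2(40)   NOT NULL,                  -- required value
  salary   NUMBER(9,2)    CHECK (salary > 0),        -- value rule checked on insert/update
  dept_id  NUMBER(4)      REFERENCES dept (dept_id)  -- referential integrity to a dept table
);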
Data integrity is the major way to enforce the data-consistency in the database. |
Schemes are used to define the scope of different data objects, to enforce data security and to handle naming issues in a DBMS. |
A Scheme in a DB is a logical hierarchy of data storage structures. |
Major scheme objects are: users, tablespaces, tables, views, index, cluster, synonyms, columns. |
Access to Different Schemes require different roles and privileges for different users. |
A way to classify levels of database users. Just as sysadm usually has superuser privileges, roles are used to group users with different privileges. |
Because a DBMS has many levels of operations, which imply many levels (kinds) of operation privileges, |
a role is another logical level of grouping that defines/specifies who has which system privileges. |
It makes system management easier and more flexible. |
Profiles and Auditing are the way to monitor (system logging) and record user database actions |
They are used for:
|
These are just the DB way of saying logging. |
As in other cases, the database field sometimes uses its own terms for things known under different names in other fields. |
For an RDBMS there are two levels of abstraction of how data is stored and represented in the system: |
1) The logical database is the conceptual view of the DB system: it is described by an ER (entity-relationship) model, defined and accessed by SQL, and represented by tables, columns, views and other data-object abstractions; |
2) The physical database is the actual physical view of the DB, represented by files, blocks, indexes, clusters, partitions, etc. |
End-users and developers only need to deal with the logical level, |
while the DBMS and the DBA (database administrator) define and perform the mapping between the two levels. |
From the viewpoint of this data-independence model, this is why SQL achieves portability (compared, say, to f77). |
Implementation of Parallel Cache Management |
Support transaction parallelism of multiple OLTP |
A simple and effective approach to port sequential RDBMS to MPP and loosely coupled systems with shared-nothing architecture |
support parallel loading, parallel indexing, parallel insert/update and parallel recovery |
Major functionality
|
Note this can work fine in OLTP with many uncorrelated queries, but will not work in scientific computing, where ALL updates are correlated and reserving a resource introduces a sequential bottleneck |
Parallel Query Optimizer
|
Gang Cheng |
Marek Podgorny |
(and a bit of Geoffrey Fox) |
gcheng@npac.syr.edu,marek@npac.syr.edu |
NPAC |
111 College Place |
Syracuse |
New York 13244-4100 |
This discusses the last three parts of presentation |
What is situation at NPAC including hardware, InfoMall activities and the DR-LINK product of Textwise (Liz Liddy) |
Detailed Discussion of Parallel Oracle on SP2 and nCUBE |
Results of NPAC's Benchmarking Activities
|
Beta test sites with Oracle, nCUBE and IBM for their parallel RDBMS products |
Direct contacts with PRDBMS developers at Oracle and IBM |
Hands-on Expertise in large-scale (up to 100 GB DB size) DSS benchmarking |
Experience in parallel computing and applications in industry, parallel system integration and data visualization |
nCUBE 2
|
IBM SP2
|
Upcoming upgrade:
|
Planning, design, implementation, and interpretation of Decision Support benchmarks using parallel database technology |
InfoMall product, supported by technology vendors |
Standard and customer defined benchmarks |
NPAC provides state-of-the-art infrastructure, MPP and parallel database expertise, and trained personnel |
InfoMall assumes most of the risks and hazards involved in the benchmarking process |
Non-disclosure arrangements available |
InfoMall project
|
This research activity of Professor E. Liddy of the School of Information Studies at Syracuse University is being commercialized by an InfoMall member -- the start-up company Textwise. |
The surface realizations, i.e. the particular strings of words in documents or in a database query, do not reflect the breadth, richness or precision of meaning intended by either the writer of the document or the poser of the query
|
Simple word-level representation cannot produce the recall and precision possible when the full conceptual intent of queries and documents is the basis of retrieval
|
DR-LINK achieves this deeper conceptual representation of texts by utilizing all the levels of language at which meaning is conveyed:
|
Accepts lengthy ambiguous complex Natural Language Queries |
Translates Natural Language queries into precise Boolean representations of the user's implied requirements for relevance of a document to query |
Produces summary-level semantic SFC (Subject Field Codes) vector representations of query and documents for quick filtering of large databases |
Captures Text Structure dimensions of time, source, state of completion, credibility, definiteness, intentionality and causality |
Fulfills Proper Noun requirements of queries by:
|
Provides both high recall and high precision via controlled expansion of Complex Nominals |
Evaluates Implicit and Explicit Semantics to assign similarity between Query and Documents using its Integrated Matcher |
Predicts how many documents need to be viewed to achieve the user-specified level of recall via the Cut-Off Criterion |
Raw Documents |
A high-performance RDBMS for loosely coupled and MPP systems |
Currently supported systems: IBM SP2, nCUBE2, SGI SMP, SUN SMP, HP SMP, Amdahl, Encore, KSR, CM5, |
Also known as multi-instance Oracle: Multiple instances running against a single, shared database |
An evolutionary approach built on the existing sequential Oracle RDBMS |
Two independent options
|
Parallel Server option available for most Unix platforms |
Parallel Data Query available for a few SMPs |
combination of PS and PDQ mostly in beta, available for a handful of SMP clusters and MPP machines |
parallelizes execution of a single SQL query |
parallel execution carried out by concurrent Unix processes communicating via shared memory |
single Oracle instance sufficient - no PS necessary |
uses multiple CPUs on SMP architectures but can run on single CPU machines as well with modest performance benefits |
essential for Decision Support |
irrelevant for OLTP |
lets multiple Oracle instances access the same database |
requires a cluster of single CPU workstations, a cluster of SMPs, or an MPP to run |
requires a Distributed Lock Manager to ensure data consistency |
requires all disk storage accessible by each Oracle node |
supports PDQ distributed across multiple instances |
essential for OLTP and "application partitioning" |
relevant for scalability of ad hoc DS queries |
First introduced with Oracle 6.2 |
version 6.2 = version 6.0 (sequential) + Parallel Cache Management |
version 7.1 = version 6.2 + parallel query processing |
newest version 7.1.3 = sequential Oracle + Parallel Server Option + Parallel Query Option |
production release on nCUBE shipped in Jan. 1995 |
All parallel implementation is transparent to users |
Key technology: Parallel Cache Management and Parallel Query Processing |
Key components: a Distributed Lock Manager and a Parallel Query Optimizer |
Oracle 7 Server Architecture
|
Oracle 7 Database Files:
|
Every time the Oracle 7 Server is started, the System Global Area (SGA) is allocated and the Oracle 7 Server background processes are started. The combination of memory buffers and background processes is called an Oracle Server Instance. |
The SGA is a group of shared memory buffers allocated by the Oracle 7 Server for an instance |
The background processes asynchronously perform distinct tasks on behalf of all database users of the instance |
Start an Oracle Server instance on each node |
Each instance has its own set of redo log files |
All instances can concurrently execute transactions against the same database and each instance can have multiple users executing transactions |
Each node of a parallel system runs an instance of the Oracle7 Server that accesses a shared database |
Multiple Instances on different nodes share a single physical database |
The data is accessed by all nodes through a parallel file subsystem |
All instances have common data & control files and their individual log files |
All instances can simultaneously execute transactions against the single database, and each instance can have multiple users executing transactions |
Data is kept consistent using PCM Technology |
Implemented by the Distributed Lock Manager |
Oracle Components
|
IBM Components
|
Concurrency control of buffer caches among instances |
Maintain independent, yet consistent cache of blocks for all instances |
Allow multiple nodes to simultaneously read/write to the database (disks) |
Maximizes concurrent data access and minimizes (disk) I/O activity |
Block-level PCM locks, row-level locking preserved |
A fault-tolerant DLM: if a node goes down, the DLMs will automatically recover and reconstruct the lock database on behalf of the failed DLM |
Three possible failures |
Time Steps:
|
In this situation, Nodes 1 and 3 have the same information in their memory caches, but Node 2 has an inconsistent cache. |
The example on the previous and following foils illustrates why concurrency control is needed in a parallel DB server environment. |
It shows how contention for shared resources (data, disk access) is handled in a multi-node parallel system sharing a single database. |
The blackened boxes in the figure are the same blocks (not rows!) requested by different nodes (instances) and located in different SGA buffers. |
Concurrency control needs to be implemented to ensure a consistent database, |
i.e. the same block from the disk must be reflected as exactly the same copy in the different buffers. |
This is achieved by PCM locks. |
3 Oracle Instances Accessing Shared database |
An instance acquires a PCM lock as a resource for update purposes |
All buffer caches must contain the same copy of a resource |
PCM blocks:
|
Time
|
PCM locks are handled separately from transaction locks -- when instances wish to update the same data (e.g. Instance X and Y attempt to update Row 1), normal transaction locking is involved. |
Parallel Query Processing
|
Parallel Data Load
|
Parallel Index Creation
|
Parallel Recovery
|
Typical Data Flow with 5 steps:
|
1 Parse --- a query optimizer determines the optimal execution plan and parallelizes the query if a full scan is included in the plan (note the query is parallelized dynamically at execution time, automatically adapting to changes in data partitioning and server configuration) |
2 Execute --- perform operations |
3 Fetch (Queries) --- return rows from query server processes |
The execution and fetch phases can be performed in parallel ( pipelined parallelism) by multiple server processes |
The shared query server is known as the Query Coordinator, which is responsible for parallelizing operations and interpreting the results returned by the query servers |
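As a hedged illustration (the table name and degree are made up; the PARALLEL hint/clause syntax is assumed from the Oracle 7.1 Parallel Query Option of the time), a data-intensive query over a large table can be handed to multiple query servers: |
ALTER TABLE lineitem PARALLEL (DEGREE 8);      -- default degree of parallelism for this table

SELECT /*+ FULL(l) PARALLEL(l, 8) */           -- request a parallel full table scan
       l.status, SUM(l.quantity)
FROM   lineitem l
GROUP BY l.status;
-- the scan and the group/sort work are split across query servers and merged by the Query Coordinator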
Not all operators can be parallelized in Oracle 7.1.3 |
Oracle Server Parallelizable Operations
|
See following two figures |
Full Table Scan without Parallel Query Option |
Multiple Query Servers Performing a Full Table Scan |
Oracle 7 on nCUBE2 and IBM SP2 is actually implemented as a shared-disk (vs. shared-nothing) architecture
|
Shared database by sharing disks through parallel file system or parallel I/O system:
|
Automatic file striping and disk partitions by the operating system on nCUBE2
|
Manual file striping at the database table level on IBM SP2
|
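A hedged sketch of the manual approach (file names, sizes and the table are made up): build a tablespace from datafiles placed on different disks/nodes and size the table's extents so they spread across those files: |
CREATE TABLESPACE ts_stripe
  DATAFILE '/disk01/oradata/stripe01.dbf' SIZE 500M,
           '/disk02/oradata/stripe02.dbf' SIZE 500M,
           '/disk03/oradata/stripe03.dbf' SIZE 500M;

CREATE TABLE big_table
  ( b_key NUMBER, b_quantity NUMBER, b_status CHAR(1) )        -- columns abbreviated
  TABLESPACE ts_stripe
  STORAGE (INITIAL 400M NEXT 400M MINEXTENTS 3 PCTINCREASE 0); -- three large extents, roughly one per datafile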
Transaction Processing Council
|
Wisconsin Benchmark (aka Dewitt)
|
Set Query
|
The Neal Nelson Database Benchmark (tm)
|
SPEC (Standard Performance Evaluation Corporation)
|
The Customer -- or worse >> The Consultant
|
Few insert/update/delete transactions, read-only queries. |
Non-selective & data-intensive; few in number, large in size |
May return a large volume of data: e.g.: mailing list |
Queries frequently scan a large percentage of the data in the database |
Often used to mine data in a recursive, drill-down fashion |
A common suite for DSS benchmarking and comparing:
|
Our TPC-D benchmarking project has gone through most of the above examples |
The TPC-D definition requires that either one (power test) or many (scale-up test) query streams are run and timed against at least three sizes of a predefined database, with sizes growing exponentially (e.g., a 1GB, 10GB, 100GB sequence) |
A query stream consists of 17 complex queries plus two concurrently running update operations |
No extensive database tuning is allowed -- optimizer hints are explicitly forbidden |
timings for individual queries are not part of the TPC-D metric |
The database structure represents most common DSS DBs in applications well, i.e. the single largest table takes ~70% of the total DB size. |
model complex queries against a relatively large and complex database |
All currently parallelizable relational operators (Oracle 7 Server) are used, including:
|
Query stream design: Query 17 takes more than 95% of the time needed to execute the stream |
Overall design unnecessarily mimics TPC OLTP benchmarks |
Metrics more suitable for Data Warehouse applications than for ad hoc DS queries |
Benchmark expensive to run (huge disk space requirements) |
Is TPC-D suitable for parallel DBMS evaluation?
|
Scalability of a parallel database system relies on both CPU resources and I/O resources. |
By using multiple instances (CPUs), parallel server/query decreases query response time and increases throughput. |
By using partitioned data on multiple disks (with multiple I/O channels), parallel server/query achieves higher I/O bandwidth for large DBs. |
The two characteristics are equally important and must be benchmarked together. In the current TPC-D specification, neither performance metric is defined. |
For DS applications, query stream execution time is largely irrelevant. It is far more important to understand performance and scalability of elementary SQL operations and of typical queries encountered in customer application. |
Loading and indexing may take a large portion of the overall time in preparing a large DB. Both parallel loading and indexing become important functionality of a parallel RDBMS. Their performance for large DSS databases should be defined and benchmarked, together with the query benchmarks. Currently none of them is defined or mentioned in the TPC-D specification. |
TPC-D design contains flaws that make its use questionable even for sequential database systems |
Given the complexity of the parallel database systems, attempts to reduce their performance estimate to one number are most likely futile |
TPC-D defines no metrics useful for understanding behavior of parallel RDBMSes. TPC-D designers should re-examine academic research work pertinent to the field |
Future research work towards definition of DS benchmark for parallel databases seems important |
Data partitioning must be very carefully designed to balance I/O and CPU |
Parallel Loading is great ! --- almost linear speedup |
Backup may be important to save time (we don't have) |
The most difficult and trouble-making phase is the database preparation/loading/indexing phase which always involves storage management parameters |
you must be both a DBA and sysadm and benchmarking relies critically on system stability |
Be prepared to bring down the whole parallel system a couple of times a day |
Be prepared that the results are negative (nothing may be obtained !) |
you need at least twice the disk space of your original database size |
performance tuning is extremely difficult with Oracle 7 parallel server |
Gang Cheng |
NPAC |
111 College Place |
Syracuse NY 13244-4100 |
These foils were prepared by Gang Cheng and presented by Geoffrey Fox in CPS600 |
They contrast the strengths and weaknesses of Web and Database Information models |
They describe Web, Conventional and DB-WEb linked Client Servers Models |
They describe Oracle's WOW Oracle to Web Interface |
They describe one of our experimental implementations which links mh mail to Oracle |
Internet as the major vehicle in networking industry |
Driving force: information access and dissemination |
WWW: Internet navigation and multimedia information access --- a killer networking application |
Main WWW components today:
|
Client-server as the communication model |
(Flat hierarchical UNIX) File system as the major file (data) management system |
Focus on transparent networking navigation and a graphical user-interface for hypertext-based information dissemination |
Relational database management system (RDBMS) as the major data model in database industry |
Driving force: information management, retrieval and processing |
Database server: store, manipulate and manage data in a reliable and powerful information system |
Main RDBMS components
|
database techniques used in Web technology
|
networking techniques used in distributed database technology
|
static, browser-oriented client |
data (information in a narrow sense) repository |
tedious to manage and update (HTML) files on the server side |
difficult to keep the data up to date automatically --- modifications are usually done by a human using a word processor or editing tools |
A plain UNIX file system only supports primitive functions such as open,read/write and close.
|
a non-universal, non-portable remote networking access to database server |
plain-text oriented, no multimedia |
nonstandard front-end browser (most are line-based) |
a powerful backend text searching engine supporting complex queries |
an efficient, reliable and sophisticated data management system supporting representation and organization of logical relationships among information entities |
a dynamic data processing engine, provided with opportunity for optimized searching performance in large information systems |
a time-dimension added HTML programming engine (see case study examples later) |
a transparent networking interface for remote database server access |
a universal GUI-based search interface for hypertext-added data retrieval |
a truly globally-shared data system for maximum information dissemination |
an integrated multi-media information system |
Basic features in HTTP protocol -- a client-server model:
|
client & server open the connection ---> |
client sends request messages to server ---> |
server locates requested information and sends |
response back to client ----> client & server close the connection (note each URL represents a separate connection process) |
Fill-out forms are currently used for the server to accept clients' run-time input from an HTML page |
Two different strategies for WWW client-server interactions:
|
SQL is the standard way to extract (query) data from the database. |
Low-level SQL-based interfaces are available on different vendors' RDBMSs |
most commonly used one: Embedded SQL in C |
SQL programs: static (compiled) SQL and dynamic (interpretive) SQL |
clients send SQL statements to server ---> |
server parses and executes the SQL against the current database ---> |
returned rows are buffered in server's main memory ---> |
server fetches buffered rows and sends them to clients |
Embedded SQL in C + CGI (C program) |
Form-based relational querying and text searching |
Different approaches for different vendor's RDBMS systems (Sybase,Oracle,Informix, DB2, etc) |
more details can be found at "http://greatwall.npac.syr.edu:1963/PDB/web-db.html" |
Note: the Web server and the DB client must be on the same machine, |
while the DB client and DB server can be on different machines, connected by a proprietary networking protocol (e.g., SQL*NET for Oracle7). |
Inter-process Communication (IPC) is used for the Web server to communicate with the DB client on the same machine |
Data and SQL engine
|
use PL/SQL for data manipulation and processing. |
PL/SQL is
|
in a specific environment: PL/SQL stored programs |
major features:
|
A small shell-script to set environment-variables |
A C stub called wowstub implemented as a normal CGI program |
A PL/SQL package called wow |
An optional, but strongly recommended set of PL/SQL packages called HTP and HTF. These encapsulate HTML formatting, and will make migration to future HTML versions easy |
A stand-alone PL/SQL compiler with CGI enhancements. The compiler is a stand-alone gateway that allows you to execute PL/SQL code directly from within an HTML document |
Note: the gateway wowstub program simply passes the PL/SQL program name and the input parameters gathered from forms to the DB server. |
The DB server does both the SQL query and the HTML processing/formatting |
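A hedged sketch of such a stored procedure (the procedure, table and column names are made up; htp.print stands for the HTML-formatting calls in the HTP package named above): |
CREATE OR REPLACE PROCEDURE show_folders IS
  CURSOR c IS SELECT folder_name, msg_count FROM folder ORDER BY folder_name;
BEGIN
  htp.print('<HTML><HEAD><TITLE>Mail Folders</TITLE></HEAD><BODY>');
  htp.print('<H1>Mail Folders</H1><UL>');
  FOR r IN c LOOP                                   -- one list item per folder row
    htp.print('<LI>' || r.folder_name || ' (' || r.msg_count || ' messages)');
  END LOOP;
  htp.print('</UL></BODY></HTML>');
END;
/
The wowstub gateway would then invoke such a procedure by name from a URL or form, returning the HTML it prints to the browser. |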
implemented using wow |
another Web interface to MH mailbox (similar to xmh but under web) |
an enhanced backend mail archive server with advanced search capability |
basic components
|
Note: the field definitions of the tables are not shown here; |
inserting a parsed MH mail into the database affects the tables: |
folder, msg header, msg body and people |
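A hedged sketch of the inserts for one parsed message (all table and column names here are assumptions about the demo schema): |
INSERT INTO folder     (folder_name)           VALUES ('cps616');
INSERT INTO people     (userid, full_name)     VALUES ('gcheng', 'Gang Cheng');
INSERT INTO msg_header (folder_name, msgid, sender, sent_date, subject, msg_size)
     VALUES ('cps616', 42, 'gcheng', SYSDATE, 'benchmark results', 2048);
INSERT INTO msg_body   (folder_name, msgid, body_text)
     VALUES ('cps616', 42, '...text of the message body...');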
home page --- generated by a PL/SQL procedure which first stores information about the current connection in a 'visitor' table and then generates links for the query options; clients' history can easily be pulled from the database. (fig. 1) |
query the mailbox --- find out the summary information about the mailbox (fig. 2) |
query the folder --- choose a folder from all folders currently available in the database (fig. 3), and find out summary information of the chosen folder (fig. 4) |
All the above queries return summary information, i.e. meta-data (accumulated or statistical data about the basic tables) produced at the time the client requests it (by simple SQL calculation statements) |
Client chooses all folders or a folder currently available in the database (fig. 5) |
browse all mail headers (show msgid,date,sender,size in bytes and subject) (fig. 6) |
show chosen mail's body after the mail header browsing (fig. 14) |
optionally show the complete original MH mail in UNIX mailbox (fig. 15) |
Note that only a full table scan of the msg header table is required to list all the mail headers in a folder (the msg body table is not touched during header browsing). |
This optimizes browsing performance compared to MH header browsing, where header and body are stored in a single, un-indexed file |
The five most recent mails currently in the DB |
mail related to now (the exact time as clients click the query-by-date page) in all folders:
|
mail related to today (the same day as clients click the query-by-date page) in selected folders:
|
mail at any specified time or within any specified time range. Choose from hour(1-12), am/pm, day(1-31), month(1-12), year (fig. 8) |
the system clock is consulted to form the above queries |
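A hedged sketch of the "related to today" query (table and column names are assumptions; SYSDATE supplies the server's clock): |
SELECT msgid, sender, sent_date, subject
FROM   msg_header
WHERE  folder_name = 'cps616'                 -- or drop this line to search all folders
AND    TRUNC(sent_date) = TRUNC(SYSDATE)      -- same calendar day as the request
ORDER BY sent_date DESC;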
Query by a keyword --- return mails containing the entered keyword (non-indexed pattern matching, future option: context searching using Oracle TextServer search engine) (fig. 9, 10) |
the keyword can be typed in or chosen from a table of previously entered keywords |
search in all folders or a selected folder from currently available folders in the DB |
search keyword occurrence in:
|
case-insensitive or case-sensitive search for the keyword |
a typed-in keyword may optionally be added to the DB table for future use, and a keyword already in the table may optionally be removed from the DB |
database insert/delete operations are used to add/remove keywords in the table --- the power of remembering clients' interactions on the Web |
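A hedged sketch of the keyword query and of remembering the keyword (names assumed; &keyword stands for the value taken from the form): |
SELECT h.msgid, h.sender, h.subject
FROM   msg_header h, msg_body b
WHERE  h.folder_name = b.folder_name
AND    h.msgid       = b.msgid
AND    UPPER(b.body_text) LIKE '%' || UPPER('&keyword') || '%';   -- case-insensitive pattern match

INSERT INTO keyword_table (keyword) VALUES ('&keyword');          -- keep the keyword for future reuse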
Query all mail sent from a selected sender, based on all senders gathered in the current mail database (fig. 11) |
sender is uniquely identified by his/her userid |
a specified folder can be used to narrow the searching, as in many other query options |
Note the sender list is not pre-entered or edited. It is automatically built when a new mail is being incorporated into the database from the MH mail folder by the parser --- another dynamic feature of web-db integration |
For security and database-consistency reasons, a userid/password at the database level (not at the Mosaic page level) is required to update the database (fig. 12) |
Two kinds of folders: MH folder in UNIX file system and folder table in the database system. They hold the same information but in different places. |
Original MH mail (in MH format) in a folder must be converted into the database tables (in Oracle table format) to enable all the above query options on the Oracle db server |
Each database folder is assumed to correspond to a unique MH folder in the UNIX file system. The database folder is created when the first mail in the UNIX MH folder is incorporated into the database |
Before incorporating new mail from a UNIX folder, the PL/SQL procedure first finds the current maximum msg id in the corresponding DB folder and only tries to parse/incorporate those MH mails with a msg id greater than that maximum (fig. 13) |
If the MH folder contains subfolders, new mail in all subfolders is considered for incorporation; this is a recursive process. |
Implemented by both a perl script (for parsing) and PL/SQL procedures |
A time-intensive process, as insertions may occur on up to four indexed DB tables. But this process can be done in batch! |
Incorporation can run at the same time as other clients on the Web request query options. The DB server ensures query concurrency and data consistency --- a reliable information system! |
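A hedged sketch of the first step (names assumed): the loader asks the database for the largest message id already stored for the folder, then parses and inserts only MH messages beyond it: |
DECLARE
  max_id NUMBER;
BEGIN
  SELECT NVL(MAX(msgid), 0) INTO max_id
  FROM   msg_header
  WHERE  folder_name = 'cps616';
  -- the perl parser / PL/SQL insert procedures then handle only MH files with msgid > max_id
END;
/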
List of mail headers satisfying the query (show msgid,date,sender,size in bytes and subject) (fig. 6) |
Show chosen mail's body after the mail header browsing (fig. 14) |
Optionally show the complete original MH mail in UNIX mailbox (fig. 15) |
Query the People |
Query By Mail Type |
Query By Date and Sender |
Query By Date and Subject |
Query By Date and Sender and Subject |
Interface the mail database with WebMail tools (CPS600 server) |
Another example using WOW for a similar database can be found at http://kayak.npac.syr.edu:1963 under the link TravelVenture Demo for a "travel agency" application