Oracle Data Warehousing Unleashed


Chapter 2

Data Warehouse Architecture


The architecture of a data warehouse is necessarily complex and includes many elements, because a data warehouse is an amalgamation of many different systems. Integration of these diverse elements is its primary concern, and accomplishing that integration requires many different systems and processes.

Most software development projects require selection of the technical infrastructure, and this is true for the warehouse as well. The basic technical infrastructure includes the operating system, hardware platform, database management system, and network. DBMS selection is more complicated than for a straightforward operational system because of the unusual challenges of the data warehouse, especially the need to support very complex queries that cannot be predicted in advance. This is explored in more detail later in the chapter.

How does the data get into the data warehouse? The warehouse requires ongoing processes to feed it; these processes require their own infrastructure. Many times, IS shops overlook this aspect when they plan for the data warehouse. Data layers need to be understood and planned for. Data cleansing usually involves several steps; where will the "staging area" be stored? How will ongoing data loads, cleansing, and summarizing be accomplished?

Backup and recovery are interesting challenges in the data warehouse, mainly because data warehouses are usually so large.

How will users get information out of the warehouse? The choice of query tool becomes very important, and depends upon a multiplicity of factors. This chapter will provide an overview of the infrastructure and architecture needed for the data warehouse project.

Six Steps to Develop the Architecture

It is important to discuss the steps you must follow to develop this architecture. Every one of these steps needs to be performed to give the project the best chance of succeeding. The six steps to effective data warehouse architecture development are as follows:

1. The most important step in developing an effective data warehouse architecture is to enlist an executive of the company as the project sponsor, with his or her full support and commitment.

2. Next, you must staff the architecture team with strong personnel. It is not so much the technology you choose for your architecture as the people designing and developing it that make the project successful.

3. Prototype/benchmark all the technologies you are interested in using. Design and develop a prototype that can be used to test all of the different technologies that are being considered.

4. Give the architecture team enough time to build the architecture infrastructure before development begins. For a large organization, this can be anywhere from six months to a year or more.

5. Make sure you train the development staff on the use of the architecture before development begins. Spend time letting the development team get full exposure to the capabilities and components of the architecture.

6. Provide the architecture team an opportunity to enhance and improve the architecture as the project moves forward. No matter how much time is spent up front developing an architecture, it will not be perfect the first time around.

As we examine the architecture of a data warehouse, we will look at it from three views: the overall data warehouse infrastructure, data layer components, and ongoing maintenance infrastructure.

The Data Warehouse Infrastructure

The data warehouse consists of the following architectural components, which together compose the data warehouse infrastructure: the system infrastructure (the platform), metadata, data discovery, data acquisition, data distribution, and user access.

Figure 2.1 shows the overall view of a data warehouse with each of these components.

FIGURE 2.1. Data warehouse architectural components.

Data Warehouse System Infrastructure

The technical architecture of a data warehouse is one of its most important components, if not the most important. The reason is that the technical architecture is the base upon which all the other data warehouse components are built. This is why the technical architecture is called the infrastructure.

The infrastructure foundation upon which the data warehouse is built is often called the platform. It is made up of the hardware platform, the operating system, the network, and the database management system.


WAR STORY: I have been involved in the development of three large client/server architectures for major corporations. All three of them utilized the same major components. On two of the three, I had some of the best architects and database personnel in the industry. However, on the other project, I had some good people, but they had very little experience or knowledge of the technology. You can probably guess which of the three architectures were successful. The two with the strong personnel were delivered with a robust architecture that is still running strong today. The other architecture was scrapped, and a third-party vendor system was purchased to perform the business functions instead. Also, this third project was missing some of the other key factors in an architecture. (See the summary at the end of this chapter.)

Database Management System (DBMS)

The database management system (DBMS) is a critical part of the system infrastructure. The data warehouse is generally made up of at least three separate databases: the interim data store, the metadata repository, and the production data warehouse database. Sometimes a fourth is added: an analytical database employing OLAP, ROLAP, or DOLAP (or some other variation of online analytical processing). These analytical database options are discussed later, in the "User Access" section.

The marketplace offers many different database management options for the data warehouse, because the warehouse must support a wide variety of unpredictable queries in very large environments, often measured in terabytes. The options discussed here include relational databases, multidimensional (OLAP) databases, relational/OLAP hybrids such as ROLAP and DOLAP, and specialized bitmapped structures such as Sybase IQ.

The standard means of data storage for most operational systems, relational databases, are often considered first for the data warehouse. Relational databases are easier to link with the operational environment, and there is a well-defined standard interface language for these databases (SQL). There is a plethora of tools available for most common relational databases. They are well understood, and talent is more commonly available for these systems than some of the others.

However, relational databases sometimes do not lend themselves well to the types of queries users want to run. Performance may suffer, and users may have a difficult time expressing those queries in relational terms. Other technologies can help in these areas. Often, relational technology is better suited to the extremely large data store that makes up the warehouse itself; users can then select the parts relevant to the types of queries they want to concentrate on, and these subsets become data marts. Each data mart can employ a special analytical storage and retrieval mechanism (such as OLAP and its variants) to optimize its particular queries. See the "User Access" section for a short description of OLAP technology; for a more detailed treatment, see Chapter 23, "Front-End Tools."

Choosing a DBMS

Which DBMS is right for your needs? Each enterprise should analyze its volumetrics, the type of analysis required, and so on, and determine which technology type and products make the most sense. This book approaches data warehousing as implemented on an Oracle relational database. Oracle is a good choice because of its many features that facilitate the management of very large databases (VLDB), such as partitioned tables. However, many of the concepts in this book apply equally to other database technologies.
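For example, a large fact table can be range-partitioned by date so that loads, backups, and purges can work on one partition at a time. The following is only a sketch; the SALES table, its columns, and the partition names are hypothetical.

    -- A hypothetical fact table, range-partitioned by month.
    CREATE TABLE sales (
        sale_date    DATE          NOT NULL,
        product_id   NUMBER        NOT NULL,
        store_id     NUMBER        NOT NULL,
        region_code  VARCHAR2(10),
        quantity     NUMBER,
        sale_amount  NUMBER(12,2)
    )
    PARTITION BY RANGE (sale_date)
    (
        PARTITION sales_1998_01 VALUES LESS THAN (TO_DATE('1998-02-01', 'YYYY-MM-DD')),
        PARTITION sales_1998_02 VALUES LESS THAN (TO_DATE('1998-03-01', 'YYYY-MM-DD')),
        PARTITION sales_1998_03 VALUES LESS THAN (TO_DATE('1998-04-01', 'YYYY-MM-DD'))
    );

Queries that restrict on SALE_DATE touch only the relevant partitions, and an old month can be archived or dropped without disturbing the rest of the table.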

Metadata

Metadata is pivotal to the data warehouse's success. Metadata has two critical roles: it serves as documentation that helps technicians maintain the warehouse over time, and it helps business users by providing business definitions of data items and how they relate to other data items and to business policies and rules. These two roles roughly correspond to the two types of metadata: control metadata and business metadata. See Chapter 7, "Metadata," for more information about metadata and how you go about building this component of the warehouse.
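As a simple illustration of business metadata, a repository might hold one row per warehouse data element, recording its business definition and its source. This is only a sketch; the table and column names below are hypothetical, and Chapter 7 covers real metadata repositories and the tools that manage them.

    -- A minimal business-metadata table: one row per warehouse data element.
    CREATE TABLE meta_business_definitions (
        table_name      VARCHAR2(30)   NOT NULL,
        column_name     VARCHAR2(30)   NOT NULL,
        business_name   VARCHAR2(100),           -- friendly name shown to users
        business_def    VARCHAR2(2000),          -- plain-language definition
        source_system   VARCHAR2(30),            -- legacy system that feeds it
        scrub_rule      VARCHAR2(2000)           -- business rule applied during cleansing
    );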

Data Discovery

Data discovery is probably the most expensive component of the warehouse, because it is so resource-intensive and time-consuming. It takes the involvement of several different technical analysts, as well as the subject matter experts. Data discovery is all about deciphering what is in the current operational systems: what data elements exist and what they mean. It is all about uncovering business rules both past and present, and reconciling them with how the data should appear in the data warehouse. It is an exercise in semantics. It involves lots and lots of intensive research. Some estimates say that data discovery can take as much as 60 percent or more of the total warehouse building time. I call this phase "sleuthing" because you feel like Sherlock Holmes ferreting out obscure information! Chapter 8, "Data Quality and Scrubbing," covers this subject in detail, because you cannot create the scrub rules for the warehouse without this research. But because this research is so important to the warehouse, I believe it merits its own architectural component.
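Much of this sleuthing boils down to simple profiling queries run against the operational data. The following is a sketch against a hypothetical legacy ORDERS table; it simply counts the distinct values in a status column to see which codes are actually in use and how often.

    -- Which status codes does the legacy system really contain?
    SELECT order_status, COUNT(*) AS occurrences
      FROM orders
     GROUP BY order_status
     ORDER BY occurrences DESC;

The results almost always turn up undocumented codes that the subject matter experts must then explain.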

Data Acquisition

As stated in the preceding section, data discovery is a prerequisite to data acquisition. Data acquisition itself can be broken up into several major sequential steps: extracting the data from the source systems, transporting it to the staging area, scrubbing and transforming it, summarizing it, and loading it into the warehouse.

The details of these steps and how they are performed can vary based on many factors. The tools chosen to perform these steps influence how each step is performed, as well as the architecture of the data layers (see the "Data Management Architecture: Data Layers" section, later in this chapter). The scrubs and loads chapters (8 and 19, respectively) discuss some of these issues.
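As one illustration of the extraction and transport steps, data can be pulled from a legacy Oracle system into the staging area over a database link; flat-file extracts moved with FTP and loaded with SQL*Loader are just as common. The link name and tables here are hypothetical, and the staging table is assumed to mirror the source layout.

    -- Pull yesterday's orders from the legacy system into the staging area.
    INSERT INTO stage_orders
    SELECT *
      FROM orders@legacy_prod        -- database link to the operational system
     WHERE order_date >= TRUNC(SYSDATE) - 1;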

Data Distribution

Some users may query the data warehouse directly; others may be geographically distributed and smaller segments of the warehouse can be exported directly to their location for their specific needs. Others may be located in the same area as the warehouse, but for performance reasons a special technology may be used for their unique query needs. These special subsets of the warehouse distributed directly to the users are called data marts. The dissemination and propagation of data for the data marts is a separate architectural component, and should be carefully planned. There are tools that can populate data marts automatically from the warehouse and schedule them. See Chapter 26, "Data Marts," for more information.
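A minimal sketch of this idea: a regional data mart can be refreshed by exporting just that region's slice of a warehouse table. The table and column names are hypothetical, and the dedicated mart-population tools mentioned above add scheduling and incremental refresh on top of something like this.

    -- Build the western region's slice of the sales data for its data mart.
    CREATE TABLE sales_mart_west AS
    SELECT *
      FROM sales
     WHERE region_code = 'WEST';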

User Access

The user access component of any data warehouse includes special tools required to formulate user queries and analysis. Most of these tools require an administrator to set up the user environment in some way, which involves building a user layer in the tool. This layer allows users to formulate queries in a more user-friendly way, and hides anomalies of relational data structures such as associative tables and join fields.
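A database view can serve as a simple user layer of this kind: it hides the join fields and surrogate keys behind business-friendly names. The tables and columns below are hypothetical.

    -- Present sales with readable names; users never see the keys or the join.
    CREATE VIEW v_sales_by_product AS
    SELECT p.product_name,
           s.store_name,
           f.sale_date,
           f.sale_amount
      FROM sales f, products p, stores s
     WHERE f.product_id = p.product_id
       AND f.store_id   = s.store_id;

Full-featured query tools build a much richer layer than this, but the principle is the same.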

The user access data warehouse component also includes any additional data structures that might be required for decision support, such as online analytical processing (OLAP) cubes.

Figure 2.2 illustrates the user access to the data warehouse.

FIGURE 2.2. User access to the warehouse.


CAUTION: Don't get into the mindset that one end-user query tool will fit all the end users at your company. You will most likely have to choose more than one query tool. For instance, you may want to support the sophisticated business analyst with a data-mining type of tool, and less computer-literate users with an easy, customized interface built with a product such as Andyne's GQL.

OLAP

Online analytical processing (OLAP) tools are more commonly seen today as a layer on top of a relational database, and are built in conjunction with the query tool. An example of a tool like this is PowerPlay by Cognos. However, some vendors (including Oracle) offer databases that store the data in an OLAP format. Oracle's offering is called Express.

The advantage of these tools is that they can respond more easily and quickly to drill-down/roll-up kinds of queries (see the chapters on dimensions and query hierarchies and on front-end tools for more information). However, because they store the data in an unusual format, other tools may not work well with them.
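To make "roll up" concrete, here is what such a question looks like in plain SQL against a relational store, rolling daily sales detail up to monthly totals; an OLAP tool answers the same question from a precomputed cube rather than scanning the detail rows. The table and columns are the hypothetical ones used earlier.

    -- Roll daily sales detail up to monthly totals per product.
    SELECT product_id,
           TO_CHAR(sale_date, 'YYYY-MM') AS sale_month,
           SUM(sale_amount)              AS total_sales
      FROM sales
     GROUP BY product_id, TO_CHAR(sale_date, 'YYYY-MM');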

ROLAP, DOLAP, Etc.

ROLAP stands for relational online analytical processing and usually refers to a hybrid of relational and OLAP technologies, with the OLAP engine often sitting on top of the relational database engine. An example of this is Red Brick Warehouse. This technology attempts to offer the best of both relational and OLAP approaches without either one getting in the other's way. DOLAP is desktop online analytical processing and refers to an OLAP cube built either directly on the user's PC or on a local server.

Sybase IQ

There are some new developments within DBMS choices for the warehouse environment, such as Sybase IQ. It uses a bitmapped structure, compressing every field and effectively placing an index on every single field. This enables any query to combine any dimensions and still be extremely fast. In a similar vein, some companies are using Oracle's new bitmap indexes to perform fast queries. IQ is a little different from a selective bitmap index here and there, however; it organizes the entire data set this way.
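Oracle's bitmap indexes are created like any other index and pay off on low-cardinality columns such as status flags or region codes, where ad hoc queries combine several such predicates. The table and column are the hypothetical ones used earlier.

    -- A bitmap index on a low-cardinality column of the sales table.
    CREATE BITMAP INDEX sales_region_bix ON sales (region_code);

Sybase IQ, by contrast, applies this kind of structure to every column rather than to a selected few.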


NOTE: It is possible to have a different DBMS for your analytical environment than your warehouse, and yet still another database for the operational environment. For instance, you may be using DB2 for your OLTP systems, Oracle for your data warehouse, and Sybase IQ for the analytical environment/data marts.

Data Management Architecture: Data Layers

Figure 2.3 illustrates the overall high-level data architectural components required for the typical data warehouse effort. When you build a data warehouse, you typically are building at least two separate databases: an interim "staging area" and the warehouse itself.

When the data is loaded into the warehouse environment, if it comes over from the legacy system as is, with no transformation, it is considered Level 0.


FIGURE 2.3. Data warehouse data components.


WAR STORY: One system I worked on simply used FTP to bring data from the legacy system as pipe-delimited flat files to the data warehouse environment. The data then got loaded as is into the interim staging database. This is an example of a Level 0 data store.

Alternatively, some scrubbing can take place on the legacy system (if the load on that system allows it). Some tools enable you to enter scrub rules; the tool then generates code, which can be executed in various ways.

If some initial scrubbing occurs on the legacy side, then the data coming over to the interim staging area is called Level 1, because it has already been processed once. Each time data is scrubbed, the data level is incremented.

The data is placed in the interim environment so scrubbing can take place. Often, primary keys must be resolved. Sometimes one row of data in a warehouse table is sourced from more than one legacy system, and the primary key is pieced together in the interim environment. This must take place before any other scrubbing can occur; you must have a proper identifier for each row before proceeding. Often, more than one iteration of scrubbing occurs in the staging area.
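A sketch of that key-resolution step, under hypothetical assumptions (the staging table, its columns, and the key format are all invented here): the warehouse key is assembled from a source system code plus the legacy identifier so that rows arriving from different systems cannot collide.

    -- Level 1: build a single warehouse key from the source system and its native key.
    UPDATE stage_customers
       SET customer_key = source_system_code || '-' || legacy_customer_id;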

Figure 2.4 shows a data warehouse with four data levels. Level 0 is the straight extract, taken as is from the legacy environment. Level 1 takes place in the interim staging area; its main purpose is to create a primary key, a single field that serves as the unique identifier for each row. Level 2 then cleans up miscellaneous data anomalies, such as replacing non-standard project codes with the approved values. Another set of scrub routines then produces the summarized information that will be stored in the data warehouse. This summarization is Level 3. Some warehouses may have one or more additional levels of summarization; Figure 2.4 shows additional summaries calculated at another level of granularity, represented as Level 4.
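The following sketch shows what Level 2 and Level 3 routines might look like; the tables, the project codes, and the summary grain are all hypothetical.

    -- Level 2: replace non-standard project codes with the approved value.
    UPDATE stage_projects
       SET project_code = 'A100'
     WHERE project_code IN ('A-100', 'a100', '100A');

    -- Level 3: store a monthly summary alongside the detail in the warehouse.
    INSERT INTO sales_monthly_summary (store_id, sale_month, total_sales)
    SELECT store_id,
           TO_CHAR(sale_date, 'YYYY-MM'),
           SUM(sale_amount)
      FROM sales
     GROUP BY store_id, TO_CHAR(sale_date, 'YYYY-MM');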

Most real-world data warehouses don't have all of the levels shown here. As stated previously, if scrubbing takes place before the data is shipped to the interim environment, Level 0 is not even present; it is never stored anywhere in the warehouse environment. A warehouse with some scrubbing taking place on the legacy system is shown in Figure 2.5.

FIGURE 2.4. Four data levels.

FIGURE 2.5. Data warehouse with scrubbing on the legacy side.

Figure 2.6 illustrates how each level of data builds on the one before it.

FIGURE 2.6. Data levels.

Ongoing Maintenance: Warehouse Infrastructure

Data warehouses are fed periodically, and repeatable processes must be in place for this to occur. The following processes are part of this iterative cycle: extracting the changed data from the source systems, transporting it to the staging area, scrubbing and transforming it, loading it into the warehouse, refreshing the summaries, and backing up the newly loaded data.

This cycle is repeated every time a load is performed. Figure 2.7 shows these essential architectural components required for ongoing maintenance of the data warehouse.

FIGURE 2.7. Ongoing maintenance architecture.


TIP: Don't forget to consider backup strategies. If the warehouse goes down, how much recovery time can you allow? How long will a backup take? You may want to consider special hardware and software to facilitate backups of very large databases. See Chapter 18, "Backup and Recovery," for more information.
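One common tactic for very large Oracle warehouses, sketched here with a hypothetical tablespace name, is to place each historical partition in its own tablespace and make that tablespace read-only once its data is loaded and verified; a read-only tablespace needs to be backed up only once and then drops out of the routine backup cycle.

    -- After the January 1998 load is complete and verified:
    ALTER TABLESPACE sales_1998_01 READ ONLY;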

Summary

The design and development of a data warehouse architecture is one of the most important focus areas in the entire process. Building a data warehouse is extremely complex, and the architecture, coupled with the methodology (see Chapter 3, "Methodology and Project Management"), will provide you with a roadmap for this exciting and challenging journey.

