Intelligent Business Systems

How Well Do You Know Your Customers?
Intelligent Business Systems: A Three-Level Approach
Intelligent Data Analysis
The Intelligent Processing Difference
Intelligent Processing Asks Better Questions, Offers Better Answers
The Power Behind the Intelligence
The Scalable Disk Array: The Heart of Every Intelligent Business System
SuperSPARC Processing Nodes
Fast Access to Other Corporate Systems
A Powerful Array of Standards-Based Software
Oracle
Decision/SQL
Parasort
Integrating Familiar Client Applications
Summary

1.0 How Well Do You Know Your Customers?

This question, more than any other, crystallizes the challenge of doing business in the 1990s. Understanding the preferences, habits, and trends of the world in which you operate is critical -- because these behaviors will determine your future success.

Getting close to customers means anticipating what they will do next. It means asking a lot of questions -- questions whose answers once depended on somebody's hunch, a committee consensus, or an outdated rule-of-thumb. Who will be next year's big spenders? Which clients will default on their accounts before the end of this year? Which product will capture the greatest market share in the third quarter? Why are some people buying a product while others are not?

Thinking Machines now offers you comprehensive answers to these questions from the most objective source possible: your data. If you have a lot of data, you have the answers to these and many other business questions buried deep inside it. Good answers. Answers that come from the actual transactions of your business. Answers that can make you efficient, productive, and profitable.

So how do you get them out? Don't count on mainframes or warehouse environments: mainframes lack the power and warehouses lack the analytic capabilities necessary to find subtle patterns in billions of pieces of information.
What you need is a system that can tell you where your customers are headed next week, next month, next year. Right now. What you need is an Intelligent Business System from Thinking Machines Corporation.

2.0 Intelligent Business Systems: A Three-Level Approach

Intelligent Business Systems from Thinking Machines combine the expandable power of massively parallel processing, the compatibility of standards-based relational database technology, and the predictive capabilities of intelligent algorithms. This combination provides a return on information-system investments that grows with your data, because that data will yield answers whose business value rises faster than its storage and processing costs.

1. Our expandable parallel hardware is essential to business computing because it keeps your information system affordable at each growth step. Intelligent Business Systems grow with your company's data. Adding capacity and performance is as easy as plugging in a module. These modules are cost-effective because they use mass-produced components such as microprocessors and small disks -- the same components that are in your workstations.

2. To manage your data, Thinking Machines business systems offer standard database environments, including Oracle. Our systems interconnect with existing platforms using standard hardware and software protocols. For example, you can easily feed transaction data from dedicated OLTP systems into our system for intelligent decision support.

3. Intelligent algorithms make the transition from retrospective to predictive uses of data. Predictive power comes from data set size. The larger the data set, the more telling the patterns that are embedded in it. But traditional algorithms can't discover these patterns. They were designed in the mainframe era to recapitulate comparatively small amounts of data. Thinking Machines intelligent algorithms can predict what your customers will do tomorrow.
That's what we mean when we say that our Intelligent Business Systems get you closer to your customers.

3.0 Intelligent Data Analysis

Early on, Thinking Machines recognized that transaction databases would require, and benefit from, a new generation of intelligent prediction tools. Since 1984, we have been developing and testing just such a tool set. Back when large datasets were rare, we sought out the biggest available, such as those at the U.S. Census Bureau. We have worked with some of the most successful companies in America, refining our techniques for intelligent database processing. Every new experience honed the accuracy and predictive capabilities of these tools.

Today, as the data-rich business environment becomes the norm, Thinking Machines is ready. Our suite of tools recognizes subtle shifts among billions of pieces of data, then uses these patterns to predict product preferences, buying habits, and market tendencies. These are not black-box tools that give you conclusions without explanations. Our tools are engineered to give you both the answer and the reasons why. Because of our ten-year head start, we are the only company in the industry to offer these capabilities.

3.1 The Intelligent Processing Difference

Traditional statistical methods of data analysis work well with small amounts of data. Given files with a dozen or so fields per record, these mainframe-era techniques can find important correlations efficiently. But when the amount of data goes up -- as it already has in virtually every business -- traditional methods bog down. Circumventing this problem is generally left to the interpreter, or data analyst, who picks and chooses a few fields to analyze. Out of one thousand fields, an analyst might have to discard 980, then try to work with the remaining twenty. That's not very intelligent.

Intelligent methods adapt to the data, without preconceived notions about what the answer will look like.
Instead of discarding 980 fields before it even begins, for example, an intelligent algorithm keeps all thousand in play, and gradually learns which ones contribute positively to the analysis. Only when it understands which fields of data really matter does it rule out the rest. And it does this for every individual question, because fields that are irrelevant to answering one business question may hold the key to answering the next one.

The quality of the answer you get from an intelligent algorithm is based on the amount of data you analyze: the more clues you have, the better it gets. The quality you get from a traditional approach depends much more directly on the analyst doing the work.

3.2 Intelligent Processing Asks Better Questions, Offers Better Answers

Thinking Machines intelligent processing is breakthrough technology that leverages large, rapidly expanding databases to bring companies close to their customers. Our intelligent processing toolkit, Darwin, incorporates sophisticated prediction and classification algorithms, some inspired by neurobiology.

Darwin currently includes four tools: StarMatch, StarNet, StarTree, and StarGene. StarMatch, which uses memory-based reasoning (MBR) technology, compares in parallel the characteristics of one database record to all others in the database to find similar situations which can be used to predict outcomes. StarNet uses artificial neural network (ANN) technology to create the rules for defining a record group; StarTree uses a parallel implementation of techniques similar to Classification and Regression Trees (CART) approaches to perform this function. StarGene uses evolutionary techniques drawn from genetics to optimize existing prediction algorithms.

The power of Darwin's techniques lies in their generality. If the answer is in the data, Darwin will find it. Here are just a few examples of questions that Thinking Machines clients have asked us to answer, using actual business data.
They are questions that traditional approaches were not designed to address. These are the kinds of questions that put you close to your customers, not just close to your business.

"Give me a breakdown of all customers likely to default in the coming year."

Predicting which customers are likely to default, before they actually do, is a concern for every business. That's why a leading consumer company asked Thinking Machines to use its data to attack this problem.

From the baseline of the number of defaulters that could be picked out by just guessing, Darwin was able to predict twice as many additional defaulters as traditional statistical techniques.

Darwin analysis started with a historical database of customers already known to have defaulted or not defaulted. Our task was to predict which customers would default during the next six months, based on this data.

We applied Darwin's StarTree tool to the data. StarTree, which uses technology similar to Classification and Regression Trees (CART) techniques, surveyed the large, historical database. It considered every field of data as potentially important, choosing the most relevant fields to compose a system of rules for predicting future defaulters.

After creating rules for finding likely defaulters, Darwin then applied StarMatch. This tool, which uses techniques comparable to the K Nearest Neighbor (KNN) model, is able to apply rules to individual records in ways that are too complex to carry out on a mainframe. StarMatch compared in parallel the characteristics of each new customer to all those in the historical database. When the nearest matching customer turned out to be a defaulter, the system had powerful evidence that the new customer might also default.

It took Darwin just an hour to build the rules for predicting defaulters -- and even less time to make its predictions.
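The nearest-match idea described above can be sketched in a few lines. This is only an illustration of a generic K Nearest Neighbor vote, not StarMatch itself: the function names, the three normalized fields, the sample records, and the use of Euclidean distance are all assumptions made for the example, and the sketch runs serially rather than in parallel.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_default(new_customer, history, k=1):
    """Majority vote among the k most similar historical customers.

    history is a list of (features, defaulted) pairs; defaulted is a bool.
    """
    nearest = sorted(history, key=lambda rec: distance(new_customer, rec[0]))[:k]
    votes = sum(1 for _, defaulted in nearest if defaulted)
    return votes * 2 > len(nearest)

# Hypothetical, already-normalized fields:
# (balance-to-limit ratio, late-payment rate, relative account tenure)
history = [
    ((0.9, 0.8, 0.1), True),
    ((0.8, 0.9, 0.2), True),
    ((0.2, 0.1, 0.9), False),
    ((0.1, 0.0, 0.7), False),
    ((0.3, 0.2, 0.8), False),
]

risky = predict_default((0.85, 0.7, 0.15), history)     # closest match defaulted
safe = predict_default((0.2, 0.1, 0.8), history, k=3)   # three closest did not
```

With k=1 the prediction simply copies the outcome of the single closest historical customer, which is the case the text describes; larger values of k trade sensitivity for stability.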
When all was said and done, Darwin's tools were twice as effective at zeroing in on likely defaulters when compared with standard statistical techniques. What's more, the Intelligent Business System provided a 192X performance advantage and a 55X price/performance advantage over the company's mainframe.

"Do three things: predict year-end demand, tell me which customers will fuel that demand, and tell me why."

Companies with seasonal sales patterns are always trying to develop strategies to even out their revenue curves and make more efficient use of their production capacity. Recently, Thinking Machines helped the division of a Fortune 100 client get behind the peaks and valleys of its sales charts and see how customer subgroups affected its bottom line.

Based on its predictions of year-end demand, Darwin discovered three distinct classes of customer behavior.

The division's business was quite seasonal -- sales were high in the first quarter, but they dropped at mid-year, before growing back in the third and fourth quarters. Giving us its January-November sales data, the division asked Thinking Machines to predict December revenue, and then to identify the customers who would drive that year-end demand.

StarTree scoured the data for subtle patterns. It worked its way back through the historical records, evaluating every field along the way. It examined customer spending as many as six months prior to December to refine its end-of-year predictions.

Once it had forecast December revenue, Darwin used these predictions to generate complete annual spending profiles for all the division's customers. In so doing, it uncovered multiple distinct buying patterns. The top ten percent of all customers bought at a rate consistent with, though less pronounced than, the division's overall seasonal variations. Of the remaining customers, most bought steadily through August. But Darwin isolated one large subgroup whose sales were quite different from the rest.
Why did this segment virtually stop buying for three months? What other tendencies did this group exhibit? Was there any way to boost their spending? Darwin's intelligent processing had taken the company from the realm of hypothesis testing into that of true data exploration. To verify the accuracy of its results, the client then applied Darwin's model to a new, untested dataset; its predictions proved accurate within five percent.

"Define a highly concentrated micromarket within our database."

Many businesses want to perform more tightly targeted marketing, often referred to as "micromarketing." Micromarketing involves making special offers to customer segments that meet very select criteria. To make this approach effective, companies need a fast, efficient way of identifying unique customer subgroups within their databases. Intelligent Business Systems help make that possible.

Late last year, a nationwide company asked Thinking Machines to find and define highly concentrated subgroups within its database that would respond favorably to future targeted mailings. Its goal was straightforward: to reach a higher percentage of customers likely to respond, while minimizing mailings to those unlikely to do so. The Intelligent Business System had to produce a quantitatively defensible profile of each micromarket it defined.

We applied Darwin's StarTree tool against a large database with over 1,000 fields per record. The tool's task was to create rules that could identify responders and non-responders to future mailings. StarTree's search was exhaustive -- it scoured every field in the record, considering millions of combinations that might characterize customer behavior.

By analyzing historical records, Darwin zeroed in on a concentrated 5% segment of the database that would account for nearly sixty percent of all potential responses. Targeting only these customers cut mailing costs by a factor of twenty, and yielded a six-fold improvement in response rate.
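To give a feel for how a rule for separating responders from non-responders can be found by examining every field, here is a minimal sketch of one step of a CART-style search. It is not StarTree: the Gini impurity measure, the function names, the three fields, and the six sample records are assumptions chosen for illustration, and a real tree would repeat this step recursively on each resulting group.

```python
def gini(labels):
    """Gini impurity of a list of boolean labels (0.0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(records, labels):
    """Try every field and every observed threshold.

    Returns (weighted_impurity, field_index, threshold) for the rule
    'field <= threshold' that best separates the two classes.
    """
    best = None
    n = len(records)
    for field in range(len(records[0])):
        for threshold in sorted({r[field] for r in records}):
            left = [lab for r, lab in zip(records, labels) if r[field] <= threshold]
            right = [lab for r, lab in zip(records, labels) if r[field] > threshold]
            if not left or not right:
                continue  # a split must put records on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, field, threshold)
    return best

# Hypothetical fields: (age, prior purchases, months since last order)
records = [(34, 5, 2), (51, 4, 10), (29, 0, 14), (46, 1, 9), (38, 6, 3), (60, 0, 20)]
responded = [True, True, False, False, True, False]

rule = best_split(records, responded)
```

On this toy data the search settles on field 1 (prior purchases) with threshold 1, meaning customers with more than one prior purchase are the predicted responders. Nobody had to nominate that field in advance; it emerged from checking all of them.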
"Create a model that really explains why some customers renew their subscriptions and others don't."

Why do good customers suddenly take their business elsewhere? Every year, companies around the world confront this question, and struggle to find cost-effective methods for accurately predicting customer behavior. Intelligent processing now makes such predictions possible.

[Figure: Because of their limited capacity for data, traditional methods (left) often operate on only 1-2% of the data available in each record. Darwin techniques (right) look at every bit of data in the record.]

Thinking Machines worked with a major U.S. service bureau to build a system for predicting customer attrition, or non-renewal of membership. There were significant challenges: to salvage an account, the company had to know in advance it might lose it; and to reach only those it was in danger of losing, the attrition prediction had to be painstakingly precise.

Thinking Machines used StarNet to implement a neural network. Our massively parallel processing made for quick modification and improvement of the neural net's behavior and predictions. After analyzing all the records in the database, Darwin identified a set of subscribers ten times more likely than the average customer to cancel their accounts within the next few months.

Using conventional programming techniques, this investigation could easily have taken a year, without the accuracy delivered by Darwin. Because Darwin's modules train themselves, we were able to produce superior results in just two weeks.

"Suggest ways of regrouping our customers into new market segments."

Intelligent Darwin technology is a powerful aid for companies engaged in direct marketing. Darwin supports the design and targeting, for example, of unique customer catalogs, each with its own customer subset and products known to appeal to that particular group.
Darwin can cluster individual customers into market segments based on actual similarities in the data, rather than using guesswork or last year's segmentation.

Conceptually, we start by creating a custom catalog for each customer. These individual catalogs provide the starting point for optimization. The system then randomly clusters pairs of customers into the same catalog. When predicted net revenue increases from the clustering, the system keeps the pair together, and eliminates a catalog. If the pair has nothing in common, Darwin returns each customer to its original catalog. The system then applies this approach again and again -- millions and millions of times -- until it discovers the best balance of catalogs and overall net revenue.

Using this method, Darwin can find profitable groupings that no one knew existed, because the new market-segment definitions emerge from the data itself.

"Find some customer-preference patterns I might not be aware of."

Intelligent Business Systems cannot make people buy your product, but they can discover undetected, and perhaps unusual, purchasing patterns. The generality of Darwin and the sheer power of its underlying hardware let you make the most open-ended kinds of requests possible: "Go find me something interesting."

Darwin highlights patterns and correlations that it discovers by performing an unstructured analysis of the data. With a few rules that define a minimum acceptable correlation, Darwin can investigate every field in every database record and compare them to all others. You might, for example, run such analyses overnight, using the full power of the system without interrupting prime shift processing. Or you can analyze a database on a time-shared basis, using system resources only when they are available.

You can then decide which correlations to explore further. You may discard a purported correlation between a person's middle initial and their account balance.
But you may want to follow up on an unexpected relationship between their account balance and the dates of their monthly payments, or differences between their mid-week and weekend buying patterns. Only Intelligent Business Systems from Thinking Machines let you explore this full range of opportunities.

"I need this answer in an hour."

Not all questions are discovery-type questions. When you know exactly what you want, the system's Decision/SQL facility can provide the exact answer 50 to 500 times faster than a traditional mainframe. Decision/SQL means you can ask more complex questions about bigger datasets, and still get the answer fast.

Consider these four businesses: a credit union, a department store chain, a regional telephone company, and a nationwide credit-card firm. Each company uses more data than the previous one, and the complexity of their queries grows along with the size of their datasets.

The credit union, for example, may ask a very simple question, "How many members have written over $1,000 worth of bad checks in any given month?" Questions like these can be answered in about an hour on a workstation.

Meanwhile, at the department store, the level of query complexity increases. A manager might ask, "How many customers made at least one third of their purchases in a single department?" In a typical retail environment, answering this question entails searching through roughly 30 million transaction records, or a 3-Gbyte database, and takes an hour to answer using a mainframe computer.

At a regional telephone company, a company official might ask, "How many customers made more than $100 worth of phone calls from Minneapolis to Chicago in November?" The phone traffic between the two cities could easily generate 50 Gbytes of data. A small Intelligent Business System running Decision/SQL could process this query in an hour.
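As a rough illustration, the telephone-company question might be phrased in standard SQL as shown below. The table name `calls`, its columns, the five sample rows, and the year 1993 are hypothetical. SQLite stands in for the database engine so that the example runs anywhere, and nothing here reflects Decision/SQL's actual dialect or performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE calls (customer_id INTEGER, origin TEXT, destination TEXT, "
    "call_date TEXT, charge REAL)"
)
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Minneapolis", "Chicago", "1993-11-03", 70.0),
        (1, "Minneapolis", "Chicago", "1993-11-17", 45.0),
        (2, "Minneapolis", "Chicago", "1993-11-09", 60.0),
        (2, "Minneapolis", "Denver", "1993-11-12", 90.0),   # wrong destination
        (3, "Minneapolis", "Chicago", "1993-10-28", 150.0),  # wrong month
    ],
)

# Total each customer's November Minneapolis-to-Chicago charges,
# keep customers whose total exceeds $100, then count them.
query = """
SELECT COUNT(*) FROM (
    SELECT customer_id
    FROM calls
    WHERE origin = 'Minneapolis'
      AND destination = 'Chicago'
      AND call_date BETWEEN '1993-11-01' AND '1993-11-30'
    GROUP BY customer_id
    HAVING SUM(charge) > 100
)
"""
heavy_callers = conn.execute(query).fetchone()[0]
conn.close()
```

The query is a filtered full-table scan followed by a GROUP BY, so the work grows with the number of call records, not with the number of customers in the answer.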
Of the four businesses, the credit-card firm would have the most data, as every customer purchase would create a new transaction record. At this level, queries become very complex. A marketing manager might ask, "How many accounts have total charges of $10,000 or more with airlines in calendar years 1990-1993? Exclude accounts that were not active for the entire period, break the results down by state, and give me the average airline charges for all accounts." This kind of query involves about 100 Gbytes of transactional data and 1.2 Gbytes of customer data. Yet processing all this information using Decision/SQL on a large Intelligent Business System takes just one hour.

"Predict how external changes like currency rates might affect my business."

While working to better understand their customers, companies must also keep an eye on all the other variables that affect their business. The power of intelligent processing lies in its generality. Whether it's fashion trends or exchange rates, if you have a lot of data about the past, Darwin can analyze it for the telltale patterns that hold clues to the future.

In a competition sponsored by the Santa Fe Institute, Darwin out-predicted all other entrants.

The Santa Fe Institute recently held a Time Series Prediction and Analysis Competition. The contest included leading computing firms and universities from around the world, and challenged participants to predict an unspecified business trend; in fact, contestants did not even know that one of the files contained exchange-rate data. Each participant received a set of 30,000 data points. Contestants then had to predict six future values of this data.

Using Darwin's StarNet tool, Thinking Machines constructed an artificial neural network for making our predictions. The network employed a back propagation algorithm that used the 30,000 data points to learn about historical data patterns and behaviors.
The CM-5 sped through this learning process, updating hundreds of millions of linkages per second. Developed in just two weeks, the Darwin model received first prize in the competition.

4.0 The Power Behind the Intelligence

Answers to tough questions like those on the previous pages illustrate how Darwin can take large datasets and turn them into clear answers and trustworthy predictions. The more data detail there is, the clearer the answers and the more trustworthy the predictions.

Darwin tools can do all this because they operate in a very advanced hardware and software environment. Indeed, the job of the underlying Intelligent Business System hardware and software is to provide an economical and expandable home for the data, and an equally powerful environment for analyzing it.

4.1 The Scalable Disk Array: The Heart of Every Intelligent Business System

At the heart of every Intelligent Business System is an advanced form of the RAID (Redundant Arrays of Inexpensive Disks) storage architecture pioneered by Thinking Machines (our patents go back to 1985) and since adopted by every major manufacturer of commercial data processing systems.

As you know, RAID uses large numbers of mass-produced workstation disks to provide storage that is more reliable, more expandable, and higher performance than traditional drives. Data is always recorded redundantly across multiple drives. If an individual drive fails, it is electronically replaced by a spare. The redundancy information is used to reconstruct every bit of data that was on the disk that failed and copy it to the spare.

We were the first company in the industry to ship a high-performance RAID disk system, our DataVault, which we announced in 1987. Among our DataVault customers, Dow Jones & Company has been in continuous 7-by-24 production operation for over five years.

Our Scalable Disk Array uses mass-produced, 3-1/2" disk drives that can be removed and replaced without powering down the system.
Each disk holds approximately 2 Gbytes of data. An array of 256 drives, which fits within a single cabinet, holds half a terabyte. Thinking Machines routinely operates systems with multiple cabinets; we are experts in the support of very large system configurations. In fact, of the 100 most powerful computers currently installed in the world, we have installed 31 -- more than anyone else, including IBM.

The system is designed to mix multiple generations of disk technology in the same cabinetry. Those who use 2-Gbyte drives today, for example, can expand with the next generation of 5-Gbyte drives when they move into volume production.

A single expandable data network interconnects all Intelligent Business System components: disk modules, processing modules, and input-output interfaces. The capacity of this network grows with the total number of modules connected to it; in large configurations it routinely sustains total data rates of many Gbytes/sec. So-called Symmetric Multi-processor (SMP) systems do not expand in this critical way -- their processors and disks attach to a fixed bus whose performance does not expand.

The Scalable Disk Array offers unique flexibility as you grow. You can add a combined storage module and network interface to the system, or simply attach a storage module to an existing interface. Either way, the data capacity increases. But when you add a new interface, the performance of the disk system increases as well. Only Thinking Machines business systems offer this ability to tune capacity and performance independently. Our installed base includes systems that sustain disk-transfer rates from 9 Mbytes/sec to 250 Mbytes/sec, the fastest in the computer industry.

A second unique advantage of the Scalable Disk Array is that you can flexibly specify the level of RAID redundancy you need. An individual file system can use hundreds of drives, or just a few. Each structure has its own site-specified complement of parity and spare drives.
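This brochure does not say how the parity drives encode their redundancy information. As a minimal sketch, assuming the single-parity XOR scheme common in RAID designs, the following shows how the contents of one failed drive can be rebuilt from the surviving drives plus parity. The function names and the four-byte blocks are invented for the example.

```python
from functools import reduce

def parity_block(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity):
    """Recover the single missing block from the survivors plus parity.

    XOR is its own inverse, so XOR-ing parity with every surviving
    block leaves exactly the bytes of the block that was lost.
    """
    return parity_block(list(surviving_blocks) + [parity])

# One stripe spread over three data drives (hypothetical contents).
data = [b"ACCT", b"0042", b"PAID"]
parity = parity_block(data)      # written to a fourth, parity drive

# Pretend the second drive fails; rebuild its block onto a spare.
recovered = rebuild([data[0], data[2]], parity)
```

One parity block protects a stripe against the loss of any single drive. Tolerating more simultaneous failures requires additional redundancy, which is one reason a site might choose a different complement of parity and spare drives for different file systems.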
The file structure heals itself if there is a component failure by reconstructing the information and writing it to a spare drive. You can then remove the failed drive for repair or replacement.

4.2 SuperSPARC Processing Nodes

Every processing node inside an Intelligent Business System is, in essence, a SuperSPARC workstation, compatible with the other SPARC workstations in your organization. Rated at 64 Mips apiece, these processing nodes connect to the same data network that the disks connect to.

Because processing nodes and storage nodes attach to the same data network, their numbers can grow independently in any proportion. As the amount of data grows, additional storage nodes accommodate it. As the daily volume, or complexity, of the questions grows, additional processing nodes accommodate them.

Thinking Machines is the industry leader in scalable computing. Our current customer base represents a performance spectrum of more than a factor of sixty, all using the same hardware components and the same applications software. The scalability of Intelligent Business Systems makes capacity planning easy. You buy only what you need, knowing the expansion headroom is there when you need it.

4.3 Fast Access to Other Corporate Systems

A centralized Intelligent Business System continually receives data from other corporate systems and disburses extracts of its data to them. Powerful input/output subsystems move this data at very high speeds.

The CM-5 business system provides mainframe connections using standard mainframe channels. Alternatively, you can connect to a high-performance gateway such as an RS/6000, communicating using industry-standard interfaces with peak rates of up to 100 Mbytes/sec. You can also use FDDI, Ethernet, and Token Ring connections.
4.4 A Powerful Array of Standards-Based Software

Intelligent Business Systems feature Oracle7 relational database management software; optimized Decision/SQL query facilities; and Parasort by MRJ, Inc., a high-speed sorting utility. They also offer links to familiar, standard client applications.

4.4.1 Oracle

Oracle's open architecture integrates Oracle and non-Oracle DBMSs and one of the industry's most comprehensive collections of tools, applications, and third-party software into an industry standard environment. Oracle Corporation was a pioneer in offering a true relational database management system commercially, and has continually led innovations in the database field.

Oracle is portable. Applications developed for Oracle can be ported to other platforms with little or no modification.

Oracle is compatible with industry standards, including most industry standard operating systems.

Oracle is connectable. It allows different types of computers and operating systems to share information across networks.

The capabilities of Oracle result in a comprehensive and powerful system for information storage and retrieval.

Oracle7 Version 7.1 establishes Oracle as a clear leader in parallel processing, programmable server technology, and large database support. It introduces the parallel query option, which significantly improves performance for lengthy data-intensive operations. The parallel query option allows Oracle7 to split up query execution, data loading, and index creation tasks, and execute them concurrently on multiple processing nodes. The combination of CM-5 computing power and Oracle7 parallel features extends Oracle capability into the multi-terabyte range.

The CM-5 is the ideal parallel environment for Oracle7, because both processing nodes and disk storage nodes connect as peers to the same expandable network. Any processing node can fetch data from any disk storage node without having to go through any other processors.
The network grows in performance to match the total number of nodes (processing and storage) in the system.

4.4.2 Decision/SQL

For users processing high volumes of very complex queries on the non-transaction versions of their data, Thinking Machines offers Decision/SQL. Decision/SQL provides a query-only version of the Structured Query Language (SQL) standard and a set of load utilities to optimize load performance. It takes full advantage of the CM-5 architecture: all system processors work cooperatively on a query.

Knowing the data will not be changing on the fly, the system organizes it in a manner that is optimized for retrieval. Decision/SQL uses few locking mechanisms or indices. Instead, it employs a vertical-partitioning approach and advanced parallel algorithms to accelerate processing.

Decision/SQL fully utilizes the underlying I/O bandwidth of the hardware. It accesses the data in bulk for queries that do full-table scans and for loads. Performance on GROUP BY, JOIN, and ORDER BY operations is exceptionally fast.

Decision/SQL can process complex queries on databases ranging in size from tens of gigabytes to terabytes. It outperforms traditional DB2 systems by factors of 50 to 500, depending on the number of processing and disk nodes in the system.

4.4.3 Parasort

To deliver the full processing capabilities of our Intelligent Business Systems, Thinking Machines offers Parasort, a parallel sorting package developed by MRJ, Inc. Parasort supports databases of hundreds of gigabytes in size, and offers performance that grows with the size of a CM-5 business system.

Parasort features include support for multiple keys of differing data types, the ability to process variable- and fixed-length records, checkpointing, and the ability to handle database merge operations.
4.4.4 Integrating Familiar Client Applications

Intelligent Business Systems let you employ familiar third-party applications, including graphical user interfaces (GUIs), fourth-generation languages, cross-tabulation programs, application generators, and data analysis packages.

Decision/SQL supports the Sybase Open Client/Open Server protocol, Microsoft's ODBC protocols, and such popular tools as GQL from Andyne Computing, Powersoft's PowerBuilder, Impromptu from KnowledgeWare, and Trinzic's Forest & Trees. Similarly, Oracle provides access to client applications through the SQL*Net protocol, providing access to all tools that conform to the standard.

5.0 Summary

Back in the mainframe era, when data was scarce, companies could gain competitive advantage simply by having data when their competitors didn't. In today's data-rich business environment, the mere possession of lots of data no longer gives a competitive edge, because everybody has lots of data. Today, competitive advantage comes from what you do with your data:

Companies whose computer systems grow smoothly and economically to take in the most recent data have an advantage over those whose systems are slow and painful to expand.

Companies that use their data to predict the future have an advantage over those whose systems only report the past.

Thinking Machines is the only company in the industry with the breadth of systems and applications experience to deliver both of these competitive advantages in a single system. To some, the combination will seem a breakthrough. To others it is simply intelligent.

Thinking Machines Corporation. All rights reserved. Thinking Machines and Connection Machine are trademarks of Thinking Machines Corporation. All other trademarks are the property of their respective owners.

Photography: Steve Grohe