Excerpt from "Data Mining Your Website" by Jesus Mena | Digital Press | 1999 | 320 pp. ISBN #1-55558-222-2
WEB MINING: CREATING, ENHANCING, MINING AND ACTING ON WEB DATA
In their frenzy to be the next Amazon.com, companies of all sizes and types are scrambling to set up their e-commerce sites. They often concentrate on the mechanics of transaction processing -- setting up their inventory and shopping carts -- but usually fail to plan for the vast amount of customer data their sites will generate. Most companies fail to see that in e-commerce, success will depend on how this web data is leveraged to convert visitors into customers. The web data generated by a single sale is of more value than the sale itself, since it can lead to a long and profitable relationship with that customer.
Every visit to a retailing site generates important consumer behavioral data, regardless of whether a sale is made. Every visitor action is a digital gesture exhibiting habits, preferences and tendencies. These interactions reveal important trends and patterns that can help a company design a website that effectively communicates and markets its products and services. Companies can aggregate, enhance and mine web data in order to learn what sells, what works and what doesn't, who is buying and who is not.
CREATING WEB DATA
Since every visit to your website signals a consumer's interest in your product or service, it is vital that you closely scrutinize every interaction. However, web data is diverse and voluminous. Thus, to analyze e-commerce data you must assemble the divergent data components -- captured via server log files, form databases and e-mails generated by visitors -- into a cohesive, integrated and comprehensive view. If you plan ahead of time how you will capture important customer information, you can more easily integrate and mine web data.
By planning strategically before you implement your e-commerce website, you can capture important information about your visitors' preferences and online behavior. By taking the time to consider the overall design of your site, such as which prompts and links you position on your home page, you can map the movements of your visitors. In addition, by prompting for a quick and short registration at the onset of a visit, an inquiry or a purchase, you can also capture important personal information which you can later enhance and mine.
One key to compiling and capturing this shopper information is a unique identifier: a visitor ID number. A proven strategy is having visitors register initially at the site by enticing them with a special service or incentive. Offer access to a special section of your site. Have contests or door prizes. The point is that you need them to register in order to set a cookie, which can serve as the unique ID number. From that point on, the unique key enables the retailer to track every interaction with that visitor. This unique key will allow the site to link log files and the forms database with the company's data warehouse and other demographic and household information, ad server networks or collaborative filtering engines.
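As a rough illustration of how such a unique key ties the data sources together, the following sketch joins log-file hits to registration-form records on a visitor-ID cookie value. All field names and sample data here are invented for illustration, not taken from the book:

```python
# Sketch: linking server-log hits and registration-form records on a
# hypothetical visitor-ID cookie. Field names and sample data are invented.

log_hits = [
    {"visitor_id": "v001", "page": "/products/printers", "time": "10:02"},
    {"visitor_id": "v001", "page": "/checkout", "time": "10:09"},
    {"visitor_id": "v002", "page": "/products/scanners", "time": "11:15"},
]

registrations = {
    "v001": {"gender": "M", "age": 27, "zip": "87109"},
    "v002": {"gender": "F", "age": 34, "zip": "10001"},
}

def link_visits(hits, forms):
    """Join each log hit with that visitor's registration record."""
    joined = []
    for hit in hits:
        profile = forms.get(hit["visitor_id"], {})
        joined.append({**hit, **profile})
    return joined

linked = link_visits(log_hits, registrations)
```

The same join key would also be the handle for appending third-party demographic and household data later on.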
ENHANCING WEB DATA
Server log files provide domain types, time of access, keywords, and the search engine used by visitors, and can offer some insight into how a visitor arrived at a website and what keywords they used to locate it. Cookies dispensed from the server can track browser visits and pages viewed, and can offer some insight into how often a visitor has been to the site and what sections they wander into. Forms can provide important personal information about visitors, such as gender, age, and ZIP code. This is probably the most important customer view, since it contains information that can be used to append additional data, such as that from a data warehouse. You can also append demographic and household data to visitors' form information, including a visitor's probable income, the type of auto they drive and the number of children they have.
This external information can be linked to website data and enable additional insight into the identity, attributes, lifestyle and behavior of visitors. It's available from various vendors, including Acxiom, Equifax, Experian, MetroMail, Polk and others. There is an entire industry devoted to segmenting, classifying and reselling consumer behavior information to companies, including of course those with websites.
In addition, new providers of 'webographics' have recently emerged who sell software, services, or both for collaborative filtering, relational marketing and visitor profiling. These new data providers represent a whole new genre of web companies seeking to capture and generate information about Internet users' behavior and preferences. They include such firms as DoubleClick, Engage Technologies, Net.Perceptions, Firefly, and others. These new players use a myriad of solutions to track and profile visitors -- everything from proprietary software and databases to commingling cookies via server networks.
All of this internal and external information can be written to an Oracle table, or a flat file, which can then be linked or imported into a data mining tool. These include automated tools, which have principally been used in data warehouses to extract patterns, trends and relationships, and new easy-to-use data mining tools with GUI interfaces that are designed for business and marketing personnel. These data mining analyses can provide actionable solutions in many formats, which can be shared with the individuals responsible for the design, maintenance and marketing of an e-commerce site.
MINING WEB DATA
So far, most analyses of web data have involved log traffic reports, most of which provide cumulative accounts of server activity but do not provide any true business insight into customer demographics and online behavior. Most of the current traffic analysis software, including NetIntellect, Bazaar Analyzer Pro, HitList, NetTracker, Surf Report, WebTrends, and others, offers predefined reports about server activity based on the analysis of log files. This basically limits the scope of these tools to statistics about domain names, IP addresses, cookies, browsers and other TCP/IP-specific machine-to-machine activity.
On the other hand, the mining of web data for an e-commerce site yields visitor behavior analyses and profiles, rather than server statistics. An e-commerce site needs to know about the preferences and lifestyles of its visitors. Data mining in this context is about addressing such business questions as, "Who is buying what items, and at what rates?" You would also like to know what is selling so you can adjust your inventory and plan your orders and shipping. You need to know how to sell, what incentives, offers and ads work, and how you should design your site to optimize your profits.
Data mining is a "hot" new technology for one of the oldest processes of human endeavor: pattern recognition. Our hairy ancestors relied on their ability to recognize the patterns of predators, paths, prey, and the seasons to survive. Today, sites inundated with data -- generated daily by customer visits -- face the same challenge of recognizing the patterns of opportunity and threat to their survival. One of the common traits of firms that have traditionally used data mining is that they have mountains of transactional data and find themselves competing for customer loyalty and dollars in crowded markets -- where it costs customers little to switch. Which, if you stop and think about it, is a good description of the evolving electronic commerce landscape.
A fast, competitive marketplace where millions of online transactions are being generated (and captured) in log files and registration forms every hour of every day -- a marketplace that doubles every 100 days. A marketplace where online shoppers browse retailing sites with their fingers poised over their mice, ready to buy or move on should they not find what they are looking for -- should the content, wording, incentive, promotion, product or service of that site not meet their preferences. A marketplace where browsers are attracted and retained based on how well the retailer remembers the customers' needs and whims. Where the goal is to know and serve every customer, one at a time, and to build long-term, mutually beneficial relationships.
Data mining is the key to customer knowledge and intimacy in this type of competitive and crowded marketplace. In hyper-competitive markets, the strategic use of customer information is critical to survival. As such, AI, in the form of data mining, has become a mainstay of doing business in fast-moving markets. In a networked electronic environment, the margins and profits go to the quick and responsive players who are able to leverage predictive models to anticipate customer behavior and preferences. Data mining of customer information is required in order to make decisions about which clients are the most profitable and desirable, and what their characteristics are, in order to find more customers just like them. Electronic retailers and advertisers are beginning to expect such customer profiling and business knowledge from the Web after years of heavy investments and marginal ROIs.
The information that a merchant gathers from its site and mines can reveal which products have cross-selling opportunities, or what information and incentives the merchant should provide to its visitors based on their gender, age, demographics and lifestyle interests. The process involves capturing important visitor attributes from server logs, cookies and forms, appending household and demographic information to them, and then, using powerful pattern-recognition technologies such as neural networks, machine learning and genetic algorithms, profiling customers in order to predict their propensity to buy.
Data mining solutions come in many types, such as association, segmentation, clustering, classification (prediction), visualization, and optimization. For example, using a data mining tool incorporating a machine-learning algorithm, a website database can be segmented into unique groups of visitors, each with individual behavior. These same tools perform statistical tests on the data and partition it into multiple market segments independently of the analyst or marketer. These types of data mining tools can autonomously identify key intervals and ranges in the data which distinguish good prospects from bad ones. They generally output their results in the form of graphical decision trees or IF/THEN rules. This type of 'web' mining allows a merchant to make projections about the profitability potential of its visitors in the form of business rules, which can be extracted directly from the web data:
IF search keyword is "PC_software"
AND gender male
AND age 24-29
THEN average projected sale amount is $267.26 <= Low
Or,
IF search keyword is "math_software"
AND search engine YAHOO
AND subdomain .AOL
THEN average projected sale amount is $379.95 <= High
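The rules above come from the book's own example. As a loose illustration of how such IF/THEN segment rules can be derived, this sketch groups visit records by attribute, compares each segment's average sale to the overall average, and labels the segment High or Low. The data, attributes and thresholds below are hypothetical, not the book's:

```python
# Minimal sketch of deriving IF/THEN segment rules: group records by
# attribute values and label each segment's average sale High or Low
# relative to the overall average. All data here is invented.

from collections import defaultdict

visits = [
    {"keyword": "PC_software",   "gender": "M", "sale": 250.0},
    {"keyword": "PC_software",   "gender": "M", "sale": 284.5},
    {"keyword": "math_software", "gender": "F", "sale": 380.0},
    {"keyword": "math_software", "gender": "M", "sale": 379.9},
]

def segment_rules(records, keys=("keyword", "gender")):
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[k] for k in keys)].append(r["sale"])
    overall = sum(r["sale"] for r in records) / len(records)
    rules = []
    for seg, sales in groups.items():
        avg = sum(sales) / len(sales)
        label = "High" if avg >= overall else "Low"
        conds = " AND ".join(f"{k} is {v}" for k, v in zip(keys, seg))
        rules.append(f"IF {conds} THEN average projected sale amount is ${avg:.2f} <= {label}")
    return rules

for rule in segment_rules(visits):
    print(rule)
```

A real decision-tree tool would also choose the splitting attributes and intervals itself, rather than taking them as fixed keys.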
On the other hand, predicting customers' propensity to purchase can also be done using a data mining tool incorporating a back-propagation neural network. Neural networks can be used to construct customer behavior models that can predict who will buy, or how much they are likely to buy. The ability to learn is one of the defining features of neural networks: they are not programmed so much as trained. A neural network trains on samples and can construct predictive models for "scoring" visitors' propensity to purchase. Typically, a neural network is "trained" on observations about data relationships, for example, "AOL sub-domains purchase printers but not scanners." A net can gradually learn to detect this relationship and the features of these types of consumers. Neural networks are basically computing memories whose operations are association and similarity. They can learn when sets of events go together -- such as when one product is sold, another is likely to sell as well -- based on patterns observed over time.
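As a toy stand-in for the back-propagation networks described here, the following sketch trains a single logistic neuron by gradient descent to score a propensity to buy. The features and training data are invented for illustration; a real network would have hidden layers and far more inputs:

```python
# Toy sketch of "training" a propensity-to-buy scorer: a single logistic
# neuron fitted by gradient descent, a one-layer stand-in for the
# back-propagation networks described above. Data is invented.

import math
import random

# Each row: (visited_printers_page, is_aol_subdomain) -> bought_printer
data = [
    ((1.0, 1.0), 1), ((1.0, 1.0), 1), ((1.0, 0.0), 1),
    ((0.0, 1.0), 0), ((0.0, 0.0), 0), ((0.0, 0.0), 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
weights = [random.uniform(-0.1, 0.1) for _ in range(2)]
bias = 0.0
rate = 0.5

for _ in range(2000):                      # training epochs
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        err = p - y                        # gradient of the log-loss
        weights = [w - rate * err * xi for w, xi in zip(weights, x)]
        bias -= rate * err

def score(x):
    """Propensity-to-buy score in [0, 1]."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

After training, `score((1.0, 1.0))` is high and `score((0.0, 0.0))` is low, which is the "scoring" of visitors the text describes.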
TEN STEPS TO MINING YOUR WEB DATA
1. DEFINE YOUR OBJECTIVE
Before you start to mine your data, you must define your objective and what information you will need to capture to achieve it. For example, you may need to issue visitor identification cookies when visitors complete registration forms at your website. This will enable you to match the information captured from your forms, such as the visitor's ZIP code, with the transaction information generated from your cookies. It will also allow you to merge your cookie information, which details the locations your visitors go to while in your website, with specific attributes like age and gender from your forms. Additionally, a ZIP code or visitor address will allow you to match your cookie and form data with demographics and household information from third-party data resellers.
2. SELECT THE DATA
You will likely need to scrub and prepare the data from your website before you begin any sort of data mining analysis. Log files, for example, can be fairly redundant, since a single "hit" generates a record not only of the HTML page but also of every graphic on that page. However, once a template, script, or procedure has been developed for generating the proper record of a single visit, the data can be put into a database format from which additional manipulations and refinements can take place. If you are using a site traffic analyzer tool, this data may already be format-ready for additional mining analysis. Keep in mind that several steps may be required prior to undertaking your analysis, including the following ones, which are discussed more fully in the book "Data Mining Your Website."
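A minimal sketch of the scrubbing just described, dropping graphic hits so each page view is recorded once. The log lines are invented examples in roughly Common Log Format:

```python
# Sketch of the "scrub" step: drop image/graphic hits from raw log lines
# so each page view is counted once. The log lines are invented examples.

raw_log = [
    '10.0.0.1 - - [12/Mar/1999:10:02:01] "GET /products.html HTTP/1.0" 200 5120',
    '10.0.0.1 - - [12/Mar/1999:10:02:01] "GET /images/logo.gif HTTP/1.0" 200 900',
    '10.0.0.1 - - [12/Mar/1999:10:02:02] "GET /images/banner.jpg HTTP/1.0" 200 1200',
    '10.0.0.2 - - [12/Mar/1999:10:05:40] "GET /order.html HTTP/1.0" 200 4096',
]

GRAPHIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")

def page_views(lines):
    """Keep only requests for pages, not the graphics embedded in them."""
    views = []
    for line in lines:
        path = line.split('"')[1].split()[1]   # the requested URL
        if not path.lower().endswith(GRAPHIC_SUFFIXES):
            views.append(path)
    return views
```

Here `page_views(raw_log)` keeps only the two HTML pages, which is the "proper recording of a single visit" the text refers to.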
-Is the data adequate to describe the phenomena the data mining analysis is attempting to model?
-Is there a common field in your web data being used for linking to other databases?
-Can the data from your website be consolidated with your data warehouse?
-Will the data being mined be the same and available after the analysis?
-What internal and external information is available for the analysis?
-How current and relevant is the data to the business objective?
-Are the data sets being merged consistent with each other?
-Who is knowledgeable about the data being gathered?
-Is there redundancy in the data sets being merged?
-What joins are needed for the various databases?
-Is there lifestyle or demographic data available?
3. PREPARE THE DATA
Once the data has been assembled and visually inspected, you must decide which attributes to exclude and which attributes need to be converted into usable formats. Here is another checklist:
-What condition is the data in, and what steps are needed to prepare it for analysis?
-What conversions and mapping of the data are required prior to the analysis?
-Are these processes acceptable to the users and the deliverable solution?
-How skewed is the data; are log and/or square transformations needed?
-Do you need to do 1-of-N conversion for categorical fields?
-How will you handle missing data and noise or outliers?
-Normalize dollar fields by dividing them by 1000?
-Convert purchase dates to continuous values?
-Convert addresses to sectors?
-Convert Yes/No fields to 1/0?
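Several of the conversions in this checklist can be sketched in a few lines. The field names and values below are illustrative, not from the book:

```python
# Sketch of three conversions from the checklist: 1-of-N encoding of a
# categorical field, Yes/No to 1/0, and normalizing a dollar field by
# dividing by 1000. Field names and the sample record are invented.

def one_of_n(value, categories):
    """1-of-N (one-hot) encoding of a categorical value."""
    return [1 if value == c else 0 for c in categories]

def yes_no(value):
    """Map a Yes/No field to 1/0."""
    return 1 if value.strip().lower() == "yes" else 0

def normalize_dollars(amount):
    """Normalize a dollar field by dividing by 1000."""
    return amount / 1000.0

record = {"region": "west", "repeat_buyer": "Yes", "sales": 2670.0}
prepared = (
    one_of_n(record["region"], ["east", "west", "south"])
    + [yes_no(record["repeat_buyer"]), normalize_dollars(record["sales"])]
)
```

The `prepared` vector is the kind of all-numeric row that tools such as neural networks expect as input.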
4. EVALUATE THE DATA
You should evaluate your data's structure to determine what type of data mining tools to use for your analysis. Here is a checklist:
-What is the ratio of categorical/binary attributes in the database?
-What is the nature and structure of the database?
-What is the overall condition of the data set?
-What is the distribution of the data set?
-How skewed is the data set?
As a general rule, neural networks work best on data sets with a large number of numeric attributes. Machine-learning algorithms incorporated in most decision tree and rule-generating data mining tools work best with data sets with a large number of records and a large number of attributes. Empirical studies have shown that the structure of the data critically impacts the accuracy of a data mining tool. For example, data sets with extreme distributions (skew > 1 and kurtosis > 7) and with many binary/categorical attributes (> 38%) tend to favor machine-learning-based data mining tools.
Often, derived ratios of input fields may be required in order to capture the impact or the true value of the inputs -- to capture the "velocity" of a client value, such as profit or propensity to buy. For example, a common derived ratio is debt-to-income: rather than simply using the debt and income attributes as inputs, more can be gained from the ratio than from the individual values. In your web analysis, the number of purchases or site visits made over time may provide better insight into the true value of a website customer:
# of purchases / # of visits: 7/9 = .78 Propensity-to-Purchase Ratio
Amount of sales / # of visits: $39/5 = 7.8 Profit Ratio
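Computed directly, with the numbers from the text:

```python
# The two derived ratios above, computed from the figures in the text.

purchases, visits = 7, 9
propensity_ratio = purchases / visits       # 7/9, roughly 0.78

sales_amount, sale_visits = 39.0, 5
profit_ratio = sales_amount / sale_visits   # 7.8 dollars per visit
```

Either ratio could then be fed to a mining tool as a single derived input field in place of its two raw components.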
5. FORMAT THE SOLUTION
As previously mentioned, there are a number of web mining formats or solutions. When you evaluate your web data and set your business objectives, you must select the format of your e-commerce solution. Here is yet another checklist:
-What is the desired format of your solution: decision tree, rules, C code, graph, map?
-What is the goal of the solution: classification, regression, clustering, segmentation?
-How will you distribute the knowledge gained by the data mining process?
-What are the available format options from the data mining process?
-What does management really need, insight or sales?
-What do you need from the data mining process?
You may need to use multiple tools in order to come up with the ideal mining format for your website. For example, you may need to extract rules from a clustering analysis. To do so, you will first need to perform the clustering analysis using a Self-Organizing Map, or Kohonen network. Next, you will need to run the identified clusters through a machine-learning algorithm in order to generate the descriptive IF/THEN rules which "profile" the extracted clusters. Alternatively, you may need to first run a machine-learning algorithm on a data set with a large number of attributes in order to compress it -- to identify a few significant attributes -- and then run those significant attributes through a neural network for the final classification model.
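As a rough sketch of the first two-stage pipeline, the following uses simple stand-ins: a hand-rolled two-cluster k-means in place of a Kohonen self-organizing map, and a min/max range rule as the "profile" of each cluster. The visitor data and the two features are invented:

```python
# Two-stage sketch: cluster visitors, then profile each cluster with a
# descriptive range rule. k-means stands in for a self-organizing map;
# the data and features are invented.

def dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans_2(points, iters=20):
    """Two-cluster k-means on (visits, avg_sale) pairs."""
    centers = [points[0], points[-1]]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [
            0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assign

def profile(points, assign, k):
    """Describe cluster k as an IF-style range rule over its features."""
    members = [p for p, a in zip(points, assign) if a == k]
    lo_v, hi_v = min(p[0] for p in members), max(p[0] for p in members)
    lo_s, hi_s = min(p[1] for p in members), max(p[1] for p in members)
    return f"IF visits in [{lo_v}, {hi_v}] AND avg_sale in [{lo_s}, {hi_s}]"

# (visits per month, average sale in dollars) for eight visitors
visitors = [(1, 20), (2, 25), (1, 30), (2, 22),
            (9, 310), (8, 290), (10, 330), (9, 305)]
labels = kmeans_2(visitors)
print(profile(visitors, labels, 1))   # range rule for the high-value cluster
```

The printed rule is the kind of descriptive "profile" a machine-learning pass over SOM clusters would deliver, here reduced to simple min/max ranges.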
6. SELECT THE TOOLS
To choose the right mining tool, you must select not only the right technology but also consider the characteristics and structure of your data. Here is a checklist of data-related issues you should consider when selecting a data mining tool:
-Number of continuous value fields
-Number of dependent variables
-Number of categorical fields
-Length and type of records
-"Skewness" of the data set
As a rule, machine-learning algorithms perform better on skewed data sets with a high number of categorical attributes and a high number of fields per record. Neural networks, on the other hand, do better with numeric data.
7. CONSTRUCT THE MODELS
It is not until this stage that you actually begin mining your website's files. During the mining process you search for patterns in a data set, generate classification rules, decision trees, clusterings, scores, and weights, and evaluate and compare error rates. Here is a quick checklist of items to consider:
-What are the model error rates, and are they acceptable or can they be improved?
-Is additional data available which could help the performance of the models?
-Is a different methodology necessary to improve model performance?
-How many models do you require for your entire web site?
-Train and test models using a random number seed?
-Output SQL syntax for distribution to end-users?
-Supervised learning or unsupervised learning?
-Incorporate C code into a production system?
-Integrate rules in a decision support system?
-Purge noisy and redundant data attributes?
-Classification, prediction or clustering?
-Monitor and evaluate results?
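The train-and-test discipline from this checklist can be sketched as follows. The "model" here is a trivial majority-class stand-in and the records are invented; the point is the seeded split and the held-out error rate:

```python
# Sketch of train/test model construction: split the records with a fixed
# random seed, "train" on one part, measure the error rate on the rest.
# The model is a trivial majority-class stand-in; the data is invented.

import random

records = [{"bought": i % 3 == 0} for i in range(30)]   # 10 buyers of 30

random.seed(42)                 # fixed seed -> reproducible split
random.shuffle(records)
split = int(len(records) * 0.7)
train, test = records[:split], records[split:]

# "Train": predict whichever outcome is the majority in the training set.
majority = sum(r["bought"] for r in train) > len(train) / 2

# Evaluate: error rate on the held-out test records.
errors = sum(1 for r in test if r["bought"] != majority)
error_rate = errors / len(test)
```

A real tool would fit a tree or network instead of taking the majority class, but the seeded split and held-out error rate are the same discipline.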
8. VALIDATE THE FINDINGS
As previously mentioned, a data mining analysis of your website will most likely involve individuals from several departments, such as Information Systems, Marketing, Sales, and Inventory. It will most definitely involve the administrators, designers, analysts, managers, and engineers responsible for designing and maintaining the day-to-day operations of your website. After you have completed your data mining analysis, it is important that you share and discuss your findings with all of them. Domain experts, the people who are the specialists in their areas, need to be briefed on the results of the analysis to ensure the findings are correct and appropriate to your site's business objectives. This is the sanity-check step. You need to be objective and focused on your initial goal for mining your website. If your data mining results are faulty, whether due to the data, tool or methodology, you may need to do another analysis and construct a new set of models with your domain experts' participation and input.
9. DELIVER THE FINDINGS
A report should be prepared documenting the entire web mining process, including the steps you took in selecting and preparing your data, the tools you used and why, the tool settings, your findings, and an explanation of what any generated code is supposed to do. As with any business process, you need to establish both baselines and procedures for your web mining initiative. In your analysis report you need to comment on the results of the data mining analysis, stating whether it meets the business objective of your website. If for some reason it doesn't, you should state why not. You may want to include in your report how the data mining analysis results can be improved, such as by the addition of different or new data. You might merge external demographic and household information, or capture better information via newly designed registration forms or cookies.
10. INTEGRATE THE SOLUTIONS
This final step is really a commitment to continue the process of learning from your firm's online transactions. It involves incorporating the findings into your firm's business practices, marketing efforts, and strategic planning. Web mining is a pattern-recognition process involving hundreds, thousands or perhaps millions of daily transactions on your website. This final step of your web mining analysis also involves monitoring the performance of the models you have generated. All models age and their performance deteriorates, so you must monitor the accuracy of your web mining models and be prepared to re-train and test new ones. Because today's business environment, especially the web and the data it generates, is highly dynamic, economic conditions change, and the models you build or the analyses you perform will likely need to be readjusted or redone over time.
Clearly, not all of these ten steps are required, but you should consider them prior to starting any in-depth analysis. They certainly do not always follow this exact sequence, but in most assignments I've undertaken these steps represent the issues that needed to be resolved before we could complete the project. In most of my previous data mining projects, I analyzed customer information files, datamarts, and data warehouses from retailers, banks, insurers, phone companies, and credit card companies, but they typically dealt with the same client-centered issues or questions: Who are the customers? What are their features? And how are they likely to behave? Electronic retailers face the same questions today.
ACTING AND ASSESSING WEB MINING FINDINGS
Most likely you will need to do your web mining on a separate server dedicated to analysis. After your analysis you will need to validate your results through some sort of production system such as a marketing test e-mail campaign. Note that the costs involved with email versus physical mail or phone calls allow for a very rapid assessment of your web mining and marketing efforts. It is certainly a very economical way to evaluate your web mining project: it only costs about five cents to e-mail a potential customer, compared with as much as five dollars for direct mail and eight to twelve dollars for a phone sales call. Planning and executing a traditional marketing campaign used to take months; today on the Web an e-mail campaign can take hours. The Web has accelerated the trend toward one-to-one marketing and the validation of web mining results by allowing the rapid evaluation of predictive models.
It is not difficult to assess the benefits of web mining and its return on investment (ROI). Simply compare the click-through counts on ads or banners before and after your web mining analysis. Consider the percentage of sales or requests for product information, as well as the amounts of purchases made as a result of a web mining analysis, and compare the rates before your data mining efforts and afterwards. If you initiate a marketing e-mail campaign on the basis of your data mining analysis, measure the rate of responses by splitting your e-mails between individuals targeted via your analysis and those excluded from the targeting, then compare the response and sales rates of the targeted group against the untargeted one.
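The targeted-versus-excluded comparison reduces to a simple lift calculation. The counts below are hypothetical:

```python
# Sketch of assessing a mined e-mail campaign: response rates for the
# group targeted by the analysis versus a hold-out group, and the lift.
# All counts are hypothetical.

targeted = {"sent": 1000, "responses": 85}
holdout  = {"sent": 1000, "responses": 20}

def response_rate(group):
    return group["responses"] / group["sent"]

# Lift: how many times better the targeted group responded.
lift = response_rate(targeted) / response_rate(holdout)
```

With these invented counts the targeted group responds at 8.5% versus 2% for the hold-out, a lift of 4.25, which is the kind of before/after evidence an ROI argument needs.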
The dynamics of your industry and marketplace will dictate how often you should mine your website data. The intervals for mining your data will depend on how often the attributes of your customers change. For example, a bank may have a cross-selling model for its call center that can be quite effective for months. The intervals at which the bank's models are created may be quarterly or monthly and still be relevant to the business questions they are trying to answer, such as cross-selling opportunities for financial products like CDs, bankcards, loans, etc. For a portal, such as a search engine, models may need to be refreshed on a weekly basis, because the dynamics of its content, visitors, and features change more quickly than those of a bank's customers. The end products the portal is trying to predict are also subject to change more frequently: for a bank it is a loan, for a portal it is a banner or ad.
For an Internet company, which exists completely on the Web, the web mining process represents a biofeedback system to its entire supply chain. Web mining can identify for electronic retailers key market segments, which can impact directly on its overall website design and inventory control systems. As with physical retailers, by leveraging data mining web retailers can position the right message, product, and service in front of the right customers at the right time in the right format.
WEB MINING AND CUSTOMER RELATIONSHIP MARKETING
Web mining is not an isolated process carried out in a vacuum; it must be integrated into the entire electronic retailing and marketing process. This is especially true with virtual storefronts, because everything -- selections, transactions, orders, customer communications -- is accelerated to "Internet time." For a website entirely supported by advertising, data mining is even more critical, since it can quickly discover and measure the effectiveness of a multitude of banners and ads on a continuous stream of visitors.
Electronic retailing changes not only the distribution and marketing of products; more importantly, it also alters the process of consumption and the related transactions of buying and selling. The data that is the aftermath of every product and service purchased on the web is the core ore -- which can be mined to develop customized products, forecast demand, profile customers and improve relational marketing. Because of the interactive nature of electronic retailing, consumers not only order and buy products online; in some venues (auctions) they can also indicate the price points they are willing to pay.
The act of retailing on the web is an interactive one in which consumers can negotiate, exchange information, and specify and customize the products and services they want from the retailer. For the electronic retailer, it is of paramount importance to analyze what consumers are doing and saying. Web mining can serve retailers by providing the technology to segment, model and predict how to sell more, learn what's working and what's not, and quickly adjust their marketing, pricing, inventory and communications.
As billions of business interactions evolve and organize themselves into revenue streams, subtle transformations occur between consumers and retailers in this dynamic marketplace. The mining of website data -- with AI-based tools, like neural networks and machine-learning and genetic algorithms, themselves programs designed to mimic human functions -- is an attempt to recognize, anticipate and learn the buying habits and preferences of customers in this new evolutionary, mutating business environment.
It is of paramount importance that retailers in a networked economy such as this be adaptive and receptive to the needs of their customers. In this expansive, competitive, and volatile environment web mining will be a critical process impacting every retailer's long-term success, where failure to quickly react, adapt, and evolve can translate into customer "churn" with the click of a mouse. Electronic retailing represents a growing exchange of data between consumers and retailer, evolving and changing -- much as an organism develops a nervous system.
Excerpted from "Data Mining Your Website" by Jesus Mena. Copyright © 1999 by Jesus Mena. ISBN # 1-55558-222-2. Excerpted by permission of Digital Press (http://www.bh.com/digitalpress), a division of Butterworth Heinemann. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.