
Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused:
- Web data, e-commerce
- Purchases at department/grocery stores
- Bank/credit card transactions

Computers have become cheaper and more powerful.

Competitive pressure is strong:
- Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

Data is collected and stored at enormous speeds (GB/hour):
- Remote sensors on a satellite
- Telescopes scanning the skies
- Microarrays generating gene expression data
- Scientific simulations generating terabytes of data

Traditional techniques are infeasible for raw data.

Data mining may help scientists:
- in classifying and segmenting data
- in hypothesis formation

Examples: What is (not) Data Mining?

What is not Data Mining?
- Look up a phone number in a phone directory
- Query a Web search engine for information about "Amazon"

What is Data Mining?
- Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
- Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Overview:
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information, that is, information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

What is data mining?


Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

What can data mining do?

Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics. It also enables them to determine the impact of these factors on sales, customer satisfaction, and corporate profits. Finally, it enables them to "drill down" into summary information to view detailed transactional data.

With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history. By mining demographic data from comment or warranty cards, the retailer could develop products and promotions to appeal to specific customer segments.

How does data mining work?


While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:

- Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
- Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
- Associations: Data can be mined to identify associations. The often-cited beer-diaper example (shoppers who buy diapers also tending to buy beer) is an example of associative mining.
- Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.
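As a rough illustration of the association idea, the sketch below counts how often pairs of items appear together in a basket (support) and estimates how often one item accompanies another (confidence). The baskets, item names, and function names are invented for this example, and real association mining tools use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets; the items are invented for illustration.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

def pair_support(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    both = sum(1 for b in with_antecedent if consequent in b)
    return both / len(with_antecedent)

support = pair_support(transactions)
print(support[("beer", "diapers")])
print(confidence(transactions, "diapers", "beer"))
```

A pair with high support and high confidence is a candidate association; whether it is useful to the business still has to be judged by an analyst.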

Data mining consists of five major elements:


- Extract, transform, and load transaction data onto the data warehouse system.
- Store and manage the data in a multidimensional database system.
- Provide data access to business analysts and information technology professionals.
- Analyze the data by application software.
- Present the data in a useful format, such as a graph or table.

Data Mining Technologies:

The analytical techniques used in data mining are often well-known mathematical algorithms and techniques. What is new is the application of those techniques to general business problems made possible by the increased availability of data and inexpensive storage and processing power. Also, the use of graphical interfaces has led to tools becoming available that business experts can easily use.

Some of the tools used for data mining are:


- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset.
- Rule induction: The extraction of useful if-then rules from data based on statistical significance.
- Genetic algorithms: Optimization techniques based on the concepts of genetic combination, mutation, and natural selection.
- Nearest neighbor: A classification technique that classifies each record based on the records most similar to it in a historical database.
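To make the nearest neighbor idea concrete, here is a minimal sketch that classifies a new record by majority vote among the k most similar historical records. The records, the two features, and the choice of Euclidean distance and k=3 are assumptions made for illustration; in practice the features would usually be rescaled first so that no single one dominates the distance.

```python
import math
from collections import Counter

# Hypothetical historical records: (age, income in $1000s) -> value label.
history = [
    ((25, 30), "low"),
    ((30, 35), "low"),
    ((45, 80), "high"),
    ((50, 90), "high"),
    ((40, 75), "high"),
]

def classify(record, labeled, k=3):
    """Return the majority label among the k records closest to `record`."""
    nearest = sorted(labeled, key=lambda item: math.dist(record, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((42, 70), history))
print(classify((27, 32), history))
```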

Data Mining Concepts (Analysis Services - Data Mining):


Data mining is the process of discovering actionable information from large sets of data. Data mining uses mathematical analysis to derive patterns and trends that exist in data. Typically, these patterns cannot be discovered by traditional data exploration because the relationships are too complex or because there is too much data. These patterns and trends can be collected and defined as a data mining model. Mining models can be applied to specific business scenarios, such as:

- Forecasting sales
- Targeting mailings toward specific customers
- Determining which products are likely to be sold together
- Finding sequences in the order that customers add products to a shopping cart

Building a mining model is part of a larger process that includes everything from asking questions about the data and creating a model to answer those questions, to deploying the model into a working environment. This process can be defined by using the following six basic steps:

Defining the Problem:


The first step in the data mining process is to clearly define the business problem and consider ways to provide an answer to it. This step includes analyzing business requirements, defining the scope of the problem, defining the metrics by which the model will be evaluated, and defining specific objectives for the data mining project. These tasks translate into questions such as the following:

- What are you looking for?
- What types of relationships are you trying to find?
- Does the problem you are trying to solve reflect the policies or processes of the business?
- Do you want to make predictions from the data mining model, or just look for interesting patterns and associations?

To answer these questions, you might have to conduct a data availability study, to investigate the needs of the business users with regard to the available data. If the data does not support the needs of the users, you might have to redefine the project.

Preparing Data:
The second step in the data mining process is to consolidate and clean the data that was identified in the Defining the Problem step. Data can be scattered across a company and stored in different formats, or may contain inconsistencies such as incorrect or missing entries. For example, the data might show that a customer bought a product before the product was offered on the market, or that the customer shops regularly at a store located 2,000 miles from her home. Data cleaning is not just about removing bad data, but about finding hidden correlations in the data, identifying sources of data that are the most accurate, and determining which columns are the most appropriate for use in analysis. For example, should you use the shipping date or the order date? Is the best sales influencer the quantity, total price, or a discounted price? Incomplete data, wrong data, and inputs that appear separate but are in fact strongly correlated can influence the results of the model in ways you do not expect. Therefore, before you start to build mining models, you should identify these problems and determine how you will fix them.
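The kinds of checks described above can be sketched in a few lines. The example below flags records with missing entries and orders dated before the product's release; the field names, sample records, and release date are hypothetical, and a real project would add many more rules.

```python
from datetime import date

# Hypothetical release dates and order records; all names are assumptions.
release_dates = {"widget": date(2020, 3, 1)}
orders = [
    {"customer": "A", "product": "widget", "order_date": date(2020, 4, 2), "quantity": 2},
    {"customer": "B", "product": "widget", "order_date": date(2019, 12, 25), "quantity": 1},
    {"customer": "C", "product": "widget", "order_date": None, "quantity": 1},
]

def find_problems(rows, releases):
    """Return (row index, description) for each record that looks suspicious."""
    problems = []
    for i, row in enumerate(rows):
        missing = [key for key, value in row.items() if value is None]
        if missing:
            problems.append((i, "missing: " + ", ".join(missing)))
            continue  # cannot run the date check on an incomplete record
        released = releases.get(row["product"])
        if released is not None and row["order_date"] < released:
            problems.append((i, "ordered before release"))
    return problems

for index, issue in find_problems(orders, release_dates):
    print(index, issue)
```

Flagging rather than deleting keeps the decision about how to fix each problem with the analyst.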

Exploring Data:
The third step in the data mining process is to explore the prepared data. You must understand the data in order to make appropriate decisions when you create the mining models. Exploration techniques include calculating the minimum and maximum values, calculating the mean and standard deviation, and looking at the distribution of the data. For example, you might determine by reviewing the maximum, minimum, and mean values that the data is not representative of your customers or business processes, and that you therefore must obtain more balanced data or review the assumptions that are the basis for your expectations. Standard deviations and other distribution values can provide useful information about the stability and accuracy of the results. A large standard deviation can indicate that adding more data might help you improve the model. Data that strongly deviates from a standard distribution might be skewed, or might represent an accurate picture of a real-life problem but make it difficult to fit a model to the data.

By exploring the data in light of your own understanding of the business problem, you can decide if the dataset contains flawed data, and then you can devise a strategy for fixing the problems or gain a deeper understanding of the behaviors that are typical of your business.
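A minimal sketch of these exploration statistics, using only the Python standard library, is shown here. The order totals are invented sample data; the single large value is included to show how one extreme record pulls the mean away from the median and inflates the standard deviation.

```python
import statistics

# Hypothetical order totals; the last value is a deliberate outlier.
order_totals = [12.5, 14.0, 13.2, 15.1, 12.9, 98.0]

def summarize(values):
    """Compute the basic statistics used to get a first look at a column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

summary = summarize(order_totals)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean far above the median, as here, is one simple hint that the column is skewed or contains records worth a closer look.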

Building Models:
The fourth step in the data mining process is to build the mining model or models. You will use the knowledge that you gained in the Exploring Data step to help define and create the models. You define which data you want to use by creating a mining structure. The mining structure defines the source of data, but does not contain any data until you process it. When you process the mining structure, Analysis Services generates aggregates and other statistical information that can be used for analysis. This information can be used by any mining model that is based on the structure.

It is important to remember that whenever the data changes, you must update both the mining structure and the mining model. When you update a mining structure by reprocessing it, Analysis Services retrieves data from the source, including any new data if the source is dynamically updated, and repopulates the mining structure. If you have models that are based on the structure, you can choose to update them, which means they are retrained on the new data, or you can leave the models as is. For more information, see Processing Data Mining Objects.

Exploring and Validating Models:


The fifth step in the data mining process is to explore the mining models that you have built and test their effectiveness. Before you deploy a model into a production environment, you will want to test how well the model performs. Also, when you build a model, you typically create multiple models with different configurations and test all models to see which yields the best results for your problem and your data.
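One simple way to compare several model configurations is to score each on records that were held back from training. The sketch below does this for three deliberately trivial candidate models, each of which predicts a purchase when monthly spend reaches a threshold. The held-out records, the thresholds, and the use of plain accuracy as the metric are assumptions for illustration, not a description of how Analysis Services tests models.

```python
# Hypothetical held-out records: (monthly_spend, actually_bought).
holdout = [(10, False), (25, False), (40, True), (55, True), (30, True), (20, False)]

def make_threshold_model(threshold):
    """Build a toy model that predicts a purchase when spend >= threshold."""
    return lambda spend: spend >= threshold

def accuracy(model, labeled):
    """Fraction of held-out records the model labels correctly."""
    correct = sum(1 for spend, bought in labeled if model(spend) == bought)
    return correct / len(labeled)

# Three candidate configurations of the same model family.
candidates = {t: make_threshold_model(t) for t in (15, 28, 45)}
scores = {t: accuracy(model, holdout) for t, model in candidates.items()}
best = max(scores, key=scores.get)

print(scores)
print("best threshold:", best)
```

Accuracy is only one possible metric; the metrics chosen in the Defining the Problem step should drive the comparison.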

Deploying and Updating Models:

The last step in the data mining process is to deploy the models that performed best to a production environment. After the mining models exist in a production environment, you can perform many tasks, depending on your needs. For example:

- Use Integration Services to create a package in which a mining model is used to intelligently separate incoming data into multiple tables. For example, if a database is continually updated with potential customers, you could use a mining model together with Integration Services to split the incoming data into customers who are likely to purchase a product and customers who are likely not to purchase a product.
- Update the models dynamically as more data comes into the organization. Making constant changes to improve the effectiveness of the solution should be part of the deployment strategy.
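The splitting idea can be sketched independently of any particular product. The example below is plain Python, not Integration Services: it routes incoming prospects into two lists according to a score from a model. The `toy_score` function, the `visits` field, and the 0.5 cutoff are all hypothetical stand-ins for a deployed mining model.

```python
def split_prospects(prospects, score, cutoff=0.5):
    """Route each prospect to 'likely' or 'unlikely' based on a model score."""
    likely, unlikely = [], []
    for prospect in prospects:
        target = likely if score(prospect) >= cutoff else unlikely
        target.append(prospect)
    return likely, unlikely

def toy_score(prospect):
    """Hypothetical stand-in for a trained model: more visits, higher score."""
    return min(1.0, prospect["visits"] / 10)

incoming = [
    {"name": "P1", "visits": 8},
    {"name": "P2", "visits": 2},
    {"name": "P3", "visits": 5},
]

likely, unlikely = split_prospects(incoming, toy_score)
print([p["name"] for p in likely])
print([p["name"] for p in unlikely])
```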

How Data Mining Works:


How is data mining able to tell you important things that you didn't know or what is going to happen next? The technique that is used to perform these feats is called modeling. Modeling is simply the act of building a model (a set of examples or a mathematical relationship) based on data from situations where the answer is known and then applying the model to other situations where the answers aren't known. Modeling techniques have been around for centuries, of course, but it is only recently that the data storage and communication capabilities required to collect and store huge amounts of data, and the computational power to automate modeling techniques to work directly on the data, have been available.

As a simple example of building a model, consider the director of marketing for a telecommunications company. He would like to focus his marketing and sales efforts on segments of the population most likely to become big users of long distance services. He knows a lot about his customers, but it is impossible to discern the common characteristics of his best customers because there are so many variables. From his existing database of customers, which contains information such as age, sex, credit history, income, zip code, occupation, etc., he can use data mining tools, such as neural networks, to identify the characteristics of those customers who make lots of long distance calls. For instance, he might learn that his best customers are unmarried females between the ages of 34 and 42 who make in excess of $60,000 per year. This, then, is his model for high value customers, and he would budget his marketing efforts accordingly.
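The "learn where the answer is known, apply where it is not" pattern can be sketched with something far simpler than a neural network. The example below swaps in a plain segment-rate model: it computes, for each combination of attributes, the share of known customers who were heavy long distance users, then looks up that share for a new prospect. The attribute names, bands, and customer records are invented for illustration.

```python
from collections import defaultdict

# Hypothetical customers with a known answer: (attributes, is_heavy_user).
known = [
    ({"age_band": "34-42", "income_band": "60k+"}, True),
    ({"age_band": "34-42", "income_band": "60k+"}, True),
    ({"age_band": "34-42", "income_band": "60k+"}, False),
    ({"age_band": "18-33", "income_band": "under 60k"}, False),
    ({"age_band": "18-33", "income_band": "under 60k"}, False),
    ({"age_band": "43+", "income_band": "60k+"}, True),
    ({"age_band": "43+", "income_band": "60k+"}, False),
]
KEYS = ("age_band", "income_band")

def fit(labeled, keys):
    """Learn the share of heavy users in each attribute segment."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [heavy count, total count]
    for attrs, heavy in labeled:
        segment = tuple(attrs[k] for k in keys)
        totals[segment][0] += int(heavy)
        totals[segment][1] += 1
    return {segment: h / n for segment, (h, n) in totals.items()}

def predict(model, attrs, keys, default=0.0):
    """Apply the learned rates to a record whose answer is not known."""
    return model.get(tuple(attrs[k] for k in keys), default)

model = fit(known, KEYS)
prospect = {"age_band": "34-42", "income_band": "60k+"}
print(predict(model, prospect, KEYS))
```

Segments that were never seen in the known data fall back to the default rate, which is one of the limitations a richer model would address.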

What Are Some Problems that Data Mining Solves?


- Anomaly detection: Commonly, fraud detection in the financial industry means looking for that one transaction or one customer among thousands who might be committing fraud. Data mining can find a single observation among even millions which might be different.
- Recommendation generation: After a customer chooses one or more products, data mining suggests another product.
- Churn analysis: The term churn refers to losing a repeat customer or client, and knowing what early indicators might indicate someone is ready to switch can be important.
- Risk management: Credit ratings are often based on multivariate formulas which help predict levels of risk.
- Customer segmentation: Grouping customers or clients together, even by their own self-determined characteristics, can allow large organizations to manage marketing campaigns or even just organize their service professionals around similar groupings.
- Targeted ads: Marketers use data mining to deliver customized ads online, but organizations always want to know how to tailor any communications to be based on what they already know about their customers or clients.
- Forecasting: Time-series analysis takes data from the past, and provides a look into the future, even when there are seasonal increases or declines.
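As one small illustration of the anomaly detection item above, the sketch below flags values that sit far from the median, measured in units of the median absolute deviation (a modified z-score). The transaction amounts are invented, and the 0.6745 scaling constant and 3.5 cutoff are a commonly used convention rather than anything prescribed by the text; production fraud detection relies on far richer models.

```python
import statistics

# Hypothetical transaction amounts; one value is clearly unusual.
amounts = [20, 22, 19, 21, 23, 20, 500]

def flag_anomalies(values, cutoff=3.5):
    """Return indices whose modified z-score exceeds `cutoff`."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against; nothing can be flagged
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > cutoff
    ]

print(flag_anomalies(amounts))
```

Using the median rather than the mean keeps the one extreme value from masking itself by dragging the baseline toward it.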

The Future of Data Mining:


In the short term, the results of data mining will be in profitable, if mundane, business-related areas. Micro-marketing campaigns will explore new niches. Advertising will target potential customers with new precision. In the medium term, data mining may be as common and easy to use as e-mail. We may use these tools to find the best airfare to New York, root out the phone number of a long-lost classmate, or find the best prices on lawn mowers. The long-term prospects are truly exciting. Imagine intelligent agents turned loose on medical research data or on sub-atomic particle data. Computers may reveal new treatments for diseases or new insights into the nature of the universe. There are potential dangers, though, as discussed below.

Conclusion:
Data mining is an active research field, and you could spend years reading peer-reviewed articles and textbooks on different aspects of the topic. The field has historically been dominated by academics, and there is much careful thought behind not only the algorithms but also the statistical philosophies of analysis and synthesis. Though I have provided data mining training, and teach at the university level, I consider myself a lifelong student of this topic. You might be or become an important part of that story. I encourage you to share what you know and learn.