Data mining has attracted a great deal of attention in the information industry and in society as a whole in recent years, due to the availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. Today, as more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool for transforming these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

INTRODUCTION

Data mining is the exploration and analysis of large data sets in order to discover meaningful patterns and rules. The key idea is to find effective ways to combine the computer's power to process data with the human eye's ability to detect patterns. The techniques of data mining are designed for, and work best with, large data sets.

Data mining is the process of extracting interesting (nontrivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. It is the set of activities used to find new, hidden, unexpected or unusual patterns in data. Using information contained within a data warehouse, data mining can often provide answers to questions about an organization that a decision maker has not previously thought to ask:
• Which products should be promoted to a particular customer?
• What is the probability that a certain customer will respond to a planned promotion?
• Which securities will be most profitable to buy or sell during the next trading session?
• What is the likelihood that a certain customer will default or pay back on schedule?
• What is the appropriate medical diagnosis for this patient?
These types of questions can be answered surprisingly easily if the information hidden among the data in your databases can be located and utilized.
The importance of collecting data that reflect your business or scientific activities to achieve competitive advantage is now widely recognized. Powerful systems for collecting data and managing it in large databases are in place in most large and mid-range companies. However, the bottleneck in turning this data into success is the difficulty of extracting knowledge about the system you study from the collected data. Human analysts with no special tools can no longer make sense of the enormous volumes of data that must be processed in order to make informed business decisions. Data mining automates the process of finding relationships and patterns in raw data and delivers results that can be either utilized in an automated decision support system or assessed by a human analyst.

EVOLUTION

Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently, generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. From the user's point of view, the following four steps were revolutionary because they allowed new business questions to be answered accurately and quickly.
Fig 2: Evolutionary Stages of Data Mining (Data Collection, 1960s; Data Access, 1980s; Data Warehousing & Decision Support, 1990s; Data Mining, emerging today)
Data Collection (1960s):
• Business question: "What was my total revenue in the last five years?"
• Enabling technologies: Computers, tapes, disks.
• Product providers: IBM, CDC.
• Characteristics: Retrospective, static data delivery.

Data Access (1980s):
• Business question: "What were unit sales in New England last March?"
• Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC.
• Product providers: Oracle, Sybase, Informix, IBM, Microsoft.
• Characteristics: Retrospective, dynamic data delivery at record level.

Data Warehousing & Decision Support (1990s):
• Business question: "What were unit sales in New England last March? Drill down to Boston."
• Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, and data warehouses.
• Product providers: Pilot, Comshare, Arbor, Cognos, MicroStrategy.
• Characteristics: Retrospective, dynamic data delivery at multiple levels.

Data Mining (Emerging Today):
• Business question: "What's likely to happen to Boston unit sales next month? Why?"
• Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases.
• Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry).
• Characteristics: Prospective, proactive information delivery.

The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehouse environments.

THE PRESENT AND THE FUTURE

The field of data mining has been growing in leaps and bounds, and has shown great potential for the future. What is the future of data mining? The field has made great strides in past years, and many industry analysts and research firms have projected a bright future for the entire data mining area and its related area of CRM (customer relationship management). The growth in the CRM analytic application market had approached 54.1% per year through 2003. In addition, data mining projects had grown by more than 300% by the year 2002. By 2003, over 90% of consumer-based industries with an e-commerce orientation had utilized some kind of data mining model. As mentioned previously, the field of data mining is very broad, and there are many methods and technologies which have become dominant in the field.
THE SCOPE OF DATA MINING

Data mining derives its name from the similarities between searching for valuable business information in a large database and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
• Automated prediction of trends and behaviors: Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
• Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

TECHNIQUES OF DATA MINING

The most commonly used techniques in data mining are:
• Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
• Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
• Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
• Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
• Rule induction: The extraction of useful if-then rules from data based on statistical significance.
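The nearest neighbor method described above can be illustrated with a minimal sketch. It assumes numeric features, Euclidean distance and a simple majority vote; the function name `knn_classify` and the sample records are hypothetical and serve only as an illustration.

```python
from collections import Counter
import math


def knn_classify(history, query, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    `history` is a list of (features, label) pairs with numeric feature tuples.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Order the historical records by Euclidean distance to the query.
    by_distance = sorted(history, key=lambda rec: math.dist(rec[0], query))
    # Majority vote over the labels of the k closest records.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset:
# (monthly_visits, average_basket_value) -> responded to promotion?
history = [
    ((1.0, 10.0), "no"),
    ((2.0, 12.0), "no"),
    ((1.5, 11.0), "no"),
    ((8.0, 40.0), "yes"),
    ((9.0, 45.0), "yes"),
    ((7.5, 38.0), "yes"),
]

print(knn_classify(history, (8.5, 42.0), k=3))  # yes
print(knn_classify(history, (1.2, 10.5), k=3))  # no
```

In practice the features would usually be scaled to comparable ranges first, since a feature with large values otherwise dominates the distance.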
Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms.

1.5 THE TEN STEPS OF DATA MINING

Here is a process for extracting hidden knowledge from your data warehouse, your customer information file, or any other company database.

1. Identify The Objective -- Before you begin, be clear on what you hope to accomplish with your analysis. Know in advance the business goal of the data mining. Establish whether or not the goal is measurable. Some possible goals are to
• Find sales relationships between specific products or services
• Identify specific purchasing patterns over time
• Identify potential types of customers
• Find product sales trends.
2. Select The Data -- Once you have defined your goal, your next step is to select the data to meet this goal. This may be a subset of your data warehouse or a data mart that contains specific product information. It may be your customer information file. Narrow the scope of the data to be mined as much as possible. Here are some key issues.
• Are the data adequate to describe the phenomena the data mining analysis is attempting to model?
• Can you enhance internal customer records with external lifestyle and demographic data?
• Are the data stable: will the mined attributes be the same after the analysis?
• If you are merging databases, can you find a common field for linking them?
• How current and relevant are the data to the business goal?

3. Prepare The Data -- Once you've assembled the data, you must decide which attributes to convert into usable formats. Consider the input of domain experts, the creators and users of the data.
• Establish strategies for handling missing data, extraneous noise, and outliers.
• Identify redundant variables in the dataset and decide which fields to exclude.
• Decide on a log or square transformation, if necessary.
• Visually inspect the dataset to get a feel for the database.
• Determine the distribution frequencies of the data.
You can postpone some of these decisions until you select a data-mining tool. For example, if you need a neural network or polynomial network you may have to transform some of your fields.

Fig: Steps of Data Mining (1. Identify the objective; 2. Select the data; 3. Prepare the data; 4. Audit the data; 5. Select the tools; 6. Format the solution; 7. Construct the solution; 8. Validate the findings; 9. Deliver the findings; 10. Integrate the solution)

4. Audit The Data -- Evaluate the structure of your data in order to determine the appropriate tools.
• What is the ratio of categorical/binary attributes in the database?
• What is the nature and structure of the database?
• What is the overall condition of the dataset?
• What is the distribution of the dataset?
Balance the objective assessment of the structure of your data against your users' need to understand the findings. Neural nets, for example, don't explain their results.

5. Select The Tools -- Two concerns drive the selection of the appropriate data-mining tool: your business objectives and your data structure. Both should guide you to the same tool. Consider these questions when evaluating a set of potential tools.
• Is the data set heavily categorical?
• What platforms do your candidate tools support?
• Are the candidate tools ODBC-compliant?
• What data format can the tools import?
No single tool is likely to provide the answer to your data-mining project. Some tools integrate several technologies into a suite of statistical analysis programs, a neural network, and a symbolic classifier.

6. Format The Solution -- In conjunction with your data audit, your business objective and the selection of your tool determine the format of your solution. The key questions are:
• What is the optimum format of the solution: decision tree, rules, C code, SQL syntax?
• What are the available format options?
• What is the goal of the solution?
• What do the end-users need: graphs, reports, code?
7. Construct The Model -- It is at this point that the data mining process begins. Usually the first step is to use a random number seed to split the data into a training set and a test set, and then to construct and evaluate a model. The generation of classification rules, decision trees, clustering sub-groups, scores, code, weights and evaluation data/error rates takes place at this stage. Resolve these issues:
• Are error rates at acceptable levels? Can you improve them?
• What extraneous attributes did you find? Can you purge them?
• Is additional data or a different methodology necessary?
• Will you have to train and test a new data set?
8. Validate The Findings -- Share and discuss the results of the analysis with the business client or domain expert. Ensure that the findings are correct and appropriate to the business objectives.
• Do the findings make sense?
• Do you have to return to any prior steps to improve results?
• Can you use other data mining tools to replicate the findings?
9. Deliver The Findings -- Provide a final report to the business unit or client. The report should document the entire data mining process including data preparation, tools used, test results, source code, and rules. Some of the issues are:
• Will additional data improve the analysis?
• What strategic insight did you discover and how is it applicable?
• What proposals can result from the data mining analysis?
• Do the findings meet the business objective?
10. Integrate The Solution -- Share the findings with all interested end-users in the appropriate business units. You might wind up incorporating the results of the analysis into the company's business procedures. Some of the data mining solutions may involve
• SQL syntax for distribution to end-users
• C code incorporated into a production system
• Rules integrated into a decision support system.
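The first part of step 7 above (a seeded split into a training set and a test set, followed by an error rate on the test set) can be sketched as follows. This is only an illustration under stated assumptions: the "model" is a deliberately trivial majority-class baseline standing in for whatever tool was selected in step 5, and the function names, the 30% test fraction and the customer records are all hypothetical.

```python
import random
from collections import Counter


def split_data(records, test_fraction=0.3, seed=42):
    """Shuffle with a fixed seed, then split into (training, test) lists."""
    rng = random.Random(seed)   # fixed seed makes the split reproducible
    shuffled = list(records)    # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_majority_model(training):
    """Stand-in model: always predict the most common training label."""
    most_common_label = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: most_common_label


def error_rate(model, test):
    """Fraction of test records that the model labels incorrectly."""
    wrong = sum(1 for features, label in test if model(features) != label)
    return wrong / len(test)


# Hypothetical records: (customer_id, responded_to_promotion)
records = [(i, "yes" if i % 4 == 0 else "no") for i in range(100)]

training, test = split_data(records)
model = train_majority_model(training)
print(len(training), len(test))  # 70 30
print(f"test error rate: {error_rate(model, test):.2f}")
```

A baseline of this kind is useful when answering the first question under step 7: an error rate is only "acceptable" if it is clearly better than what always predicting the majority class achieves.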
Although data mining tools automate database analysis, they can lead to faulty findings and erroneous conclusions if you're not careful. Bear in mind that data mining is a business process with a specific goal: to extract a competitive insight from historical records in a database.

DATA MINING APPLICATIONS
For Financial data analysis
Most banks and financial institutions offer a wide variety of banking services (such as checking, savings, and business and individual customer transactions), credit (such as business, mortgage, and automobile loans), and investment services (such as mutual funds). Some also offer insurance services and stock services. Financial data collected in the banking and financial industry is often relatively complete, reliable and of high quality, which facilitates systematic data analysis and data mining. For example, data mining can help in fraud detection by identifying a group of people who stage accidents to collect insurance money.
For Retail Industry
The retail industry collects huge amounts of data on sales, customer shopping history, goods transportation, consumption and service records, and so on. The quantity of data collected continues to expand rapidly, especially due to the increasing ease, availability and popularity of business conducted on the web, or e-commerce. The retail industry therefore provides a rich source for data mining. Retail data mining can help identify customer behavior, discover customer shopping patterns and trends, improve the quality of customer service, achieve better customer retention and satisfaction, enhance goods consumption ratios, design more effective goods transportation and distribution policies, and reduce the cost of business.
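As a small illustration of discovering shopping patterns, the sketch below counts how often pairs of products appear in the same basket and keeps the pairs that meet a minimum count. It is a simplified co-occurrence count rather than a full association-rule algorithm; the function name `frequent_pairs`, the `min_support` threshold and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=2):
    """Count the transactions containing each unordered pair of products."""
    pair_counts = Counter()
    for basket in transactions:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    # Keep only the pairs seen in at least `min_support` transactions.
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical point-of-sale baskets
transactions = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
    ["diapers", "beer"],
]

for pair, count in sorted(frequent_pairs(transactions).items(), key=lambda kv: -kv[1]):
    print(pair, count)
```

Here the pair (bread, milk) appears in three baskets, while (beer, diapers) and (diapers, milk) each appear in two; all other pairs fall below the threshold.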
For Telecommunication Industry
The telecommunication industry has quickly evolved from offering local and long distance telephone services to providing many other comprehensive communication services including voice, fax, pager, cellular phone, images, e-mail, computer and web data transmission and other data traffic. The integration of telecommunication, computer networks, the Internet and numerous other means of communication and computing is underway. Moreover, with the deregulation of the telecommunication industry in many countries and the development of new computer and communication technologies, the telecommunication market is rapidly expanding and highly competitive. This creates a great demand for data mining in order to help understand the business involved, identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve the quality of services.
Text Mining and Web Mining
Text mining is the process of searching large volumes of documents for certain keywords or key phrases. By searching literally thousands of documents, various relationships between the documents can be established. Applied to free-text survey comments, for example, text mining can reveal patterns that help identify a common set of customer perceptions not captured by the other survey questions. An extension of text mining is web mining. Web mining is an exciting new field that integrates data and text mining within a website. It enhances the web site with intelligent behavior, such as suggesting related links or recommending new products to the consumer. Web mining is especially exciting because it enables tasks that were previously difficult to implement. Web mining tools can be configured to monitor and gather data from a wide variety of locations and can analyze the data across one or multiple sites. Search engines, for example, work on the principle of data mining.
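The keyword search at the heart of text mining can be sketched in a few lines. The example below builds a simple index from keywords to the documents that mention them, so that documents sharing a keyword can be related to one another. The tokenization is deliberately crude (lowercase ASCII letters only), and the function name `keyword_index` and the survey comments are hypothetical.

```python
import re
from collections import defaultdict


def keyword_index(documents, keywords):
    """Map each keyword to the set of document ids whose text contains it."""
    wanted = {k.lower() for k in keywords}
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Lowercase the text and split on anything that is not a letter.
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        for word in wanted & tokens:
            index[word].add(doc_id)
    return dict(index)


# Hypothetical free-text survey comments
documents = {
    "c1": "Delivery was late and the packaging was damaged.",
    "c2": "Great price, but delivery took two weeks.",
    "c3": "Packaging was fine; price was too high.",
}

index = keyword_index(documents, ["delivery", "packaging", "price"])
for word in sorted(index):
    print(word, sorted(index[word]))
```

Documents that appear under the same keyword (here c1 and c2 under "delivery") are candidates for a shared customer perception.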
An important challenge that higher education faces today is predicting the paths of students and alumni. Which students will enroll in particular course programs? Who will need additional assistance in order to graduate? Meanwhile, additional issues such as enrollment management and time-to-degree continue to exert pressure on colleges to search for new and faster solutions. Institutions can better address these students and alumni through the analysis and presentation of data. Data mining has quickly emerged as a highly desirable tool for using current reporting capabilities to uncover and understand hidden patterns in vast databases.
The past decade has seen an explosive growth in biomedical research, ranging from the development of new pharmaceuticals and cancer therapies to the identification and study of the human genome through the discovery of large scale sequencing patterns and gene functions. Recent research in DNA analysis has led to the discovery of genetic causes for many diseases and disabilities as well as approaches for disease diagnosis, prevention and treatment.
The diversity of available data types and of data mining approaches poses many challenging research issues in data mining. The design of a standard data mining language, the development of effective and efficient data mining methods and systems, the construction of interactive and integrated data mining environments, and the application of data mining to solve large application problems are important tasks for data mining researchers and for data mining system and application developers. Here we discuss some of the trends in data mining that reflect the pursuit of these challenges:
• Application exploration: Earlier, data mining was mainly used to help businesses gain a competitive edge. As data mining becomes more popular, it is gaining wide acceptance in other fields as well, such as biomedicine, the stock market, fraud detection, telecommunication and many more, and many new applications are being explored. In addition, data mining for business continues to expand as e-commerce and marketing become mainstream elements of the retail industry. As generic data mining systems may have limitations in dealing with application-specific problems, we may see a trend toward the development of more application-specific data mining systems.
• Scalable data mining methods: Current data mining methods are capable of handling only a particular type of data and a limited amount of data, but as data is expanding at a massive rate, there is a need to develop new data mining methods that are scalable and can handle different types of data and large volumes of data. Data mining methods should also be more interactive and user friendly. One important direction towards improving the overall efficiency of the mining process while increasing user interaction is constraint-based mining. This gives the user more control by allowing the specification and use of constraints to guide data mining systems in their search for interesting patterns.
• Combination of data mining with database systems, data warehouse systems, and web database systems: Database systems, data warehouse systems, and the WWW are loaded with huge amounts of data and have thus become the major information processing systems. It is important to make sure that data mining serves as an essential data analysis component that can be easily included in such an information-processing environment. The desired architecture for a data mining system is tight coupling with database and data warehouse systems. Transaction management, query processing, online analytical processing and online analytical mining should be integrated into one unified framework.
• Standardization of data mining language: Today a few data mining systems are commercially available, such as Microsoft SQL Server 2005, IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, DBMiner and many more, but a standard data mining language or other standardization efforts would support the orderly development of data mining solutions and improve interoperability among multiple data mining systems and functions.
• Visual data mining: It is rightly said that a picture is worth a thousand words. If the results of mining can be shown in visual form, their worth is further enhanced. Visual data mining is an effective way to discover knowledge from huge amounts of data. The systematic study and development of visual data mining techniques will promote the use of data mining for data analysis.
• New methods for mining complex types of data: Complex types of data such as geospatial, multimedia, time series, sequence and text data pose an important research area in the field of data mining. There is still a huge gap between the needs of these applications and the available technology.
• Web mining: The World Wide Web is a huge, globally distributed collection of news, advertisements, consumer records, and financial, education, government, e-commerce and many other services. The WWW also contains a huge and dynamic collection of hyperlinked information, providing a huge source for data mining. For the same reasons, the Web also poses great challenges for efficient resource and knowledge discovery.
• Biological data mining: Although biological data mining can be considered under "application exploration", the unique combination of complexity, richness, size, and importance of biological data warrants special attention in data mining. Mining DNA and protein sequences and mining high-dimensional microarray data are some of the interesting topics for biological data mining research.
• Data mining and software engineering: As software programs become increasingly bulky in size and sophisticated in complexity, and tend to originate from the integration of multiple components developed by different software teams, it is an increasingly challenging task to ensure software robustness and reliability. The analysis of the executions of a buggy software program is essentially a data mining process: tracing the data generated during program executions may disclose important patterns and outliers that may lead to the eventual automated discovery of software bugs.
• Distributed data mining: Traditional data mining methods, designed to work at a centralized location, do not work well in many of the distributed computing environments present today (e.g., the Internet, intranets, LANs). Advances in distributed data mining methods are expected.
• Real time data mining: Many applications involving stream data (such as e-commerce, web mining, stock analysis) require dynamic data mining models to be built in real time. Additional development is needed in this area.
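To give a flavor of what building a model in real time involves, the sketch below updates summary statistics one value at a time without storing the stream, using Welford's incremental method for the mean and variance. It is only a minimal building block under stated assumptions, not a complete stream mining system; the class name `RunningStats` and the sample price stream are hypothetical.

```python
import math


class RunningStats:
    """Incrementally track count, mean and variance of a stream (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Sample standard deviation; 0.0 until at least two values are seen."""
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


stats = RunningStats()
for price in [10.0, 12.0, 11.0, 13.0, 14.0]:  # hypothetical stream of prices
    stats.update(price)

print(stats.n, stats.mean, round(stats.std, 3))
```

Because each update touches only three numbers, the same object can keep up with an unbounded stream, and a new value far from the running mean (measured in standard deviations) can be flagged as soon as it arrives.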
Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users. Since data mining is a young discipline with wide and diverse applications, there is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications. A few application domains of data mining (such as finance, the retail industry and telecommunication) have been discussed, along with trends in data mining, which include further efforts towards the exploration of new application areas and new methods for handling complex data types, algorithm scalability, constraint-based mining and visualization methods, the integration of data mining with data warehousing and database systems, the standardization of data mining languages, and data privacy protection and security.