Data Mining and Data Visualization

Published by Dr Singh


Categories:Types, School Work
Published by: Dr Singh on Feb 13, 2012
Copyright:Attribution Non-commercial







Data Mining and Data Visualization

Prof. Rushen Chahal

A Picture is Worth a Thousand Words
Data mining is the set of activities used to find new, hidden, or unexpected patterns in data. These techniques are often grouped under the name knowledge discovery in databases (KDD) and include statistical analysis, neural networks, fuzzy logic, intelligent agents, and data visualization. KDD techniques not only discover useful patterns in the data but can also be used to develop predictive models.

Verification Versus Discovery
In the past, decision support activities were primarily based on the concept of verification. This required a great deal of prior knowledge on the decision-maker's part in order to verify a suspected relationship. With the advance of technology, the concept of verification began to turn into discovery.

Data Mining's Growth in Popularity
Three reasons stand out. First, the volume of data keeps growing, and we need tools to understand it. Second, the human brain has trouble processing multidimensional data. Third, machine learning techniques are becoming more affordable and more refined at the same time.

Making Accurate Predictions with Data Mining
Although the literature contains statements such as "data mining will allow us to predict who will buy a particular product," predicting the behavior of a specific individual with that kind of certainty runs against human nature. In situations where data mining is used to predict response to a marketing campaign, only about 5% of the people selected as "likely respondents" actually do respond.

Making Accurate Predictions with Data Mining (cont.)
Although the accuracy of predicting individual behavior is modest, it is better than it seems, since direct marketing efforts often have "hit rates" of only about 1% without data mining.
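
The comparison between the two hit rates is often summarized as a lift ratio. The following is a minimal sketch of that arithmetic, using the approximate 5% and 1% figures quoted above. The function name `lift` and the mailing size of 10,000 are illustrative assumptions, not taken from the slides.

```python
def lift(targeted_rate: float, baseline_rate: float) -> float:
    """Ratio of the response rate among model-selected prospects to the baseline rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return targeted_rate / baseline_rate


# Approximate figures from the text: 5% with data mining, 1% without.
improvement = lift(0.05, 0.01)

# Hypothetical mailing size, chosen only to make the difference concrete.
mailing_size = 10_000
expected_with_mining = round(mailing_size * 0.05)
expected_without_mining = round(mailing_size * 0.01)

print(f"lift: {improvement:.1f}x")
print(f"expected responses: {expected_with_mining} vs {expected_without_mining}")
```

Under these assumptions, the same mailing would be expected to draw roughly five times as many responses, even though 95% of the selected people still do not respond.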

Online Analytical Processing (OLAP)
Codd developed a set of 12 rules for the development of multidimensional databases:
1. Multidimensional view
2. Transparent to user
3. Accessible
4. Consistent reporting
5. Client-server architecture
6. Generic dimensionality
7. Dynamic sparse matrix handling
8. Multiuser support
9. Cross-dimensional ops
10. Intuitive manipulation
11. Flexible reporting
12. Unlimited dimension and aggregation

OLAP as Implemented
To date, it does not appear that any implementation exists that satisfies all 12 rules. Some people argue it might not even be possible to attain all of them. More recently, the term OLAP has come to represent the broad category of software technology that enables multidimensional analysis of enterprise data.

Multidimensional OLAP (MOLAP)
Data can be viewed across several dimensions. Here sales are arrayed by region and product. A fourth dimension could be added by using several graphs -- perhaps at different time points. Most analyses have many more dimensions than this. MOLAP handles data as an n-dimensional hypercube.





[Figure: sales arrayed by region and product]
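
One way to picture the hypercube idea is as a mapping from coordinate tuples to a measure. The sketch below is an illustration only, not how any particular MOLAP product stores data. The dimension names, members, and sales figures are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, quarter) -> sales amount.
# Only populated cells are stored, so the cube is sparse.
DIMENSIONS = ("region", "product", "quarter")
cube = {
    ("East", "Widgets", "Q1"): 120.0,
    ("East", "Gadgets", "Q1"): 80.0,
    ("West", "Widgets", "Q1"): 100.0,
    ("West", "Gadgets", "Q2"): 60.0,
    ("East", "Widgets", "Q2"): 140.0,
}


def roll_up(cells, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    idx = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for coords, value in cells.items():
        totals[tuple(coords[i] for i in idx)] += value
    return dict(totals)


def slice_cube(cells, dimension, member):
    """Fix one dimension to a single member, keeping the other coordinates."""
    i = DIMENSIONS.index(dimension)
    return {coords: v for coords, v in cells.items() if coords[i] == member}


by_region = roll_up(cube, keep=("region",))
by_region_product = roll_up(cube, keep=("region", "product"))
q1_only = slice_cube(cube, "quarter", "Q1")

print(by_region)
print(by_region_product)
print(q1_only)
```

Adding a further dimension only lengthens the coordinate tuple, which is why the same operations extend to the n-dimensional case.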


Relational OLAP (ROLAP)
A large relational database server replaces the multidimensional one. The database contains both detailed and summarized data, allowing "drill down" techniques to be applied. SQL interfaces allow vendors to build tools that are both portable and scalable. This approach does require databases with many relational tables, which may lead to substantial processor overhead on complex joins.
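
As a rough sketch of drill-down over relational tables, the example below uses Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and figures are hypothetical. A production ROLAP server would be far larger and would typically also store precomputed summary tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE region  (region_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        region_id  INTEGER REFERENCES region(region_id),
        product_id INTEGER REFERENCES product(product_id),
        amount     REAL
    );
    INSERT INTO region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO product VALUES (1, 'Widgets'), (2, 'Gadgets');
    INSERT INTO sales   VALUES (1, 1, 120), (1, 2, 80), (2, 1, 100),
                               (2, 2, 60), (1, 1, 140);
    """
)

# Summary level: total sales per region.
summary = conn.execute(
    """
    SELECT r.name, SUM(s.amount)
    FROM sales s JOIN region r ON r.region_id = s.region_id
    GROUP BY r.name ORDER BY r.name
    """
).fetchall()

# Drill down: break one region's total out by product.
drill_down = conn.execute(
    """
    SELECT p.name, SUM(s.amount)
    FROM sales s
    JOIN region  r ON r.region_id  = s.region_id
    JOIN product p ON p.product_id = s.product_id
    WHERE r.name = ?
    GROUP BY p.name ORDER BY p.name
    """,
    ("East",),
).fetchall()

print(summary)
print(drill_down)
conn.close()
```

Note that the drill-down query already needs two joins for two dimensions, which hints at the join overhead mentioned above as the number of tables grows.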

A Typical Relational Schema

Data Mining Technologies
• Statistics: the most mature of the data mining technologies, but often not applicable because they need clean data. In addition, many statistical procedures assume linear relationships, which limits their use.
• Neural networks, genetic algorithms, fuzzy logic: these technologies are able to work with complicated and imprecise data. Their broad applicability has made them popular in the field.

Data Mining Technologies (cont.)
• Decision trees: these technologies are conceptually simple and have gained in popularity as better tree-growing software was introduced. Because of the way they are used, they are perhaps better called "classification" trees.
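
To show why such trees are conceptually simple, here is a minimal sketch of the core step in growing one: choosing a single split that best separates the classes. It uses Gini impurity as the split criterion, which is one common choice and is an assumption here, since the slides do not name a criterion. The customer ages and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels, feature):
    """Return (threshold, impurity) for the split on one numeric feature
    that minimizes the size-weighted Gini impurity of the two groups."""
    best = (None, float("inf"))
    values = sorted({row[feature] for row in rows})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical training data: did the customer respond to an offer?
rows = [{"age": 22}, {"age": 25}, {"age": 31}, {"age": 47}, {"age": 52}, {"age": 58}]
labels = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(rows, labels, "age")


def classify(row):
    """One-level tree: in this toy data, everyone above the threshold responded."""
    return "yes" if row["age"] > threshold else "no"


print(threshold, impurity)
print(classify({"age": 40}))
```

A full tree repeats this step recursively on each resulting group until the groups are pure enough or too small to split further.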

The Knowledge Discovery Search Process
• Define the business problem and obtain the data to study it.
• Use data mining software to model the problem.
• Mine the data to search for patterns of interest.

The Knowledge Discovery Search Process (cont.)
• Review the mining results and refine them by respecifying the model.
• Once validated, make the model available to other users of the data warehouse (DW).

New Applications for Data Mining
As the technology matures, new applications emerge, especially in two new categories: text mining and web mining. Some text mining examples are:
• Distilling the meaning of a text
• Accurate summarization of a text
• Explication of the text theme structure
• Clustering of texts

Web mining
Web mining is a special case of text mining where the mining occurs over a website. It enhances the website with intelligent behavior, such as suggesting related links or recommending new products. It allows you to unobtrusively learn the interests of the visitors and modify their user profiles in real time. It also allows you to match resources to the interests of the visitor.
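
A very simple way to suggest related links is to count which pages are visited together in the same session. The sketch below assumes that approach. It is only one of many possible techniques and is not described in the slides. The session data and page paths are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical clickstream: each inner list is the pages seen in one visit.
sessions = [
    ["/home", "/laptops", "/laptop-bags"],
    ["/laptops", "/laptop-bags", "/mice"],
    ["/home", "/phones"],
    ["/laptops", "/mice"],
]

# Count how often each unordered pair of pages appears in the same session.
co_visits = Counter()
for pages in sessions:
    for a, b in combinations(sorted(set(pages)), 2):
        co_visits[(a, b)] += 1


def related_links(page, top_n=2):
    """Pages most often visited in the same session as `page`."""
    scores = Counter()
    for (a, b), count in co_visits.items():
        if a == page:
            scores[b] += count
        elif b == page:
            scores[a] += count
    return [p for p, _ in scores.most_common(top_n)]


print(related_links("/laptops"))
```

Because the counts can be updated as each new session arrives, the suggestions can adapt to visitor behavior over time.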

Current Limitations and Challenges to Data Mining
Despite its potential power and value, data mining is still a new field. Some things that thus far have limited advancement are:
• Identification of missing information: not all knowledge gets stored in a database
• Data noise and missing values: future systems need better ways to handle this
• Large databases and high dimensionality: future applications need ways to partition data into more manageable chunks
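
As a small illustration of the missing-values problem, the sketch below fills gaps with the mean of the observed values. Mean imputation is a basic, widely used stopgap and is an assumption here, not a technique the slides recommend. It can distort the data's variance, which is part of why better approaches are needed. The sample values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical customer ages with two missing entries.
ages = [30, None, 40, 50, None]
filled = impute_mean(ages)

print(filled)
```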

3-6: Data Visualization: "Seeing" the Data

Visual Presentation
For any kind of high-dimensional data set, displaying predictive relationships is a challenge. In the accompanying figure, shading is used to represent relative degrees of thunderstorm activity, with the darkest regions showing the heaviest activity.

A Bit of History
An early effort used sequences of two-dimensional graphs to add depth. Current virtual reality programs allow the user to step through a data set. Try going to a realtor's website and taking a tour of a house up for sale.

Human Visual Perception and Data Visualization
Data visualization is so powerful because the human visual cortex converts objects into information so quickly. The next three slides show (1) usage of global private networks, (2) flow through natural gas pipelines, and (3) a risk analysis report that permits the user to draw an interactive yield curve. All three use height or shading to add additional dimensions to the figure.

Global Private Network Activity
[Figure legend: shading ranges from High Activity to Low Activity]

Natural Gas Pipeline Analysis

Note: Height shows total flow through compressor stations.

An "Enlivened" Risk Analysis Report

Geographical Information Systems
A GIS is a special-purpose database that contains a spatial coordinate system. A comprehensive GIS requires:
1. Data input from maps, aerial photos, etc.
2. Data storage, retrieval and query
3. Data transformation and modeling
4. Data reporting (maps, reports and plans)

The Special Capabilities of a GIS
In general, a GIS contains two types of data:
• Spatial data: these elements correspond to a uniquely defined location on Earth. They could be in point, line, or polygon form.
• Attribute data: these are the data that will be portrayed at the geographic references established by spatial data.
Example: Data from an opinion poll are displayed for multiple regions in the United States. Clicking on an area allows the user to drill down to the results for smaller areas.
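
The pairing of spatial and attribute data can be sketched as a lookup: find which polygon contains a clicked point, then return the attributes stored for that region. The example below uses a standard ray-casting point-in-polygon test on flat, abstract coordinates. The region shapes, names, and poll figures are hypothetical, and a real GIS would work with geographic coordinates and spatial indexes.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: cast a horizontal ray to the right of (x, y) and
    toggle `inside` each time it crosses a polygon edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Spatial data: hypothetical regions as polygons in abstract map units.
regions = {
    "North": [(0, 5), (10, 5), (10, 10), (0, 10)],
    "South": [(0, 0), (10, 0), (10, 5), (0, 5)],
}

# Attribute data: hypothetical poll results keyed by region name.
poll = {
    "North": {"approve": 0.54},
    "South": {"approve": 0.47},
}


def poll_result_at(x, y):
    """Return (region name, attributes) for a clicked point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(x, y, polygon):
            return name, poll[name]
    return None


print(poll_result_at(3, 7))
print(poll_result_at(3, 2))
print(poll_result_at(20, 20))
```

Drilling down to smaller areas would amount to repeating the same lookup against a finer set of polygons nested inside the selected region.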

Telephone Polling Results

Note: On the "live" map, clicking on an area allows the user to drill down and see results for smaller areas.
