
DATA MINING & DATA WAREHOUSING
Lecture – 1, 2 & 3

Presented By
Dr. S.Kanungo
Data Mining (DM)
• Rapid advances in data collection and storage
technology have enabled organizations to
accumulate vast amounts of data. However,
extracting useful information from that data has
proven extremely challenging.
• Traditional data analysis tools and techniques
often cannot be used because of the massive size
of the data sets.
• Data mining is a technology that blends traditional
data analysis methods with sophisticated algorithms
for processing large volumes of data.
• Applications: Business, Medicine, Science &
Engineering
Why Data Mining?
•The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific
simulation, …
• Society and everyone: news, digital cameras, YouTube
•We are drowning in data, but starving for knowledge!
•“Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
Evolution of Database Technology
• 1960s:
- Data collection, database creation, IMS and network DBMS
• 1970s:
- Relational data model, relational DBMS implementation
• 1980s:
- RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
- Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
- Data mining, data warehousing, multimedia databases, and
Web databases
• 2000s:
- Stream data management and mining
- Data mining and its applications
- Web technology (XML, data integration) and global
information systems
What is Data Mining?
•It is the process of automatically discovering useful information in
large data repositories.
•Not all information discovery tasks are considered to be data mining;
for example, looking up individual records with a database query is
information retrieval, not data mining.
•It refers to extracting or mining knowledge from large amounts of data.
•Data mining techniques have been used to enhance information retrieval
systems.
•Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amounts of data
•Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
DM as a step in the Knowledge Discovery process
• Typical Framework of a Data Warehouse
• What are Data Warehouses?
- A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and
usually residing at a single site.

• Example: A multidimensional data cube

• DM systems for transactional data can identify frequent
itemsets, i.e. sets of items that are frequently sold together.
• Advanced Database Systems and Applications
- Various kinds of advanced data and information
systems have emerged and are undergoing
development to address the requirements of new
applications for handling:
1. Spatial data (maps)
2. Hypertext and multimedia data (including text, image, video
and audio data)
3. Stream data (video surveillance and sensor data, where
data flow in and out like streams)
4. World Wide Web (a huge, widely distributed information
repository made available by the Internet)
5. Engineering design data (design of buildings, system
components, or ICs)
6. Time-related data (historical records or stock exchange data)
• Object-Relational Databases: handle complex objects and
structures
• Text Databases: contain word descriptions for objects
• Multimedia Databases: store image, audio and video data
• Heterogeneous Databases: consist of a set of
interconnected, autonomous component databases
• Legacy Databases: a group of heterogeneous databases
that combines different kinds of database systems
• Temporal Databases: store relational data that includes
time-related attributes
• Sequence Databases: store sequences of ordered
events, with or without a concrete notion of time
• Time-Series Databases: store sequences of values or
events obtained from repeated measurements over time
• Spatial Databases: contain spatial-related information
• Spatiotemporal Databases: spatial databases that store spatial
objects that change with time
• WWW: Capturing user access patterns in distributed information
environments is called Web usage mining (or web log mining)

• Data Mining Functionalities – What kinds of patterns can be mined?


- Data Mining Tasks: Descriptive (characterize the general properties of
the data in the database) or Predictive (perform inference on the
current data in order to make predictions)
- Data Mining Systems:
1. Should be able to mine multiple kinds of patterns to accommodate
different user expectations or applications.
2. Should be able to discover patterns at various granularities (i.e.
different levels of abstraction).
3. Should allow users to specify hints to guide or focus the search for
interesting patterns.
• Concept/Class Description: Characterization and
Discrimination
- Data can be associated with classes or concepts.
E.g. classes of items for sale include computers and printers;
concepts of customers include big spenders and budget spenders.
- It can be useful to describe individual classes and concepts in
summarized, concise and yet precise terms.
- Data characterization: summarizing the data of the class
under study (the target class); output forms include pie charts
and bar charts.
- Data discrimination (contrasting classes): comparing
the target class with one or a set of comparative classes,
e.g. comparing two groups of customers, such as those who shop for
computer products regularly vs. those who rarely shop for such
products.
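Characterization and discrimination can be illustrated with a minimal sketch. The customer records, field names (`group`, `age`, `yearly_spend`) and helper functions below are hypothetical and only serve the illustration: one function summarizes a target class, the other contrasts it with a comparative class.

```python
from statistics import mean

# Hypothetical customer records; field names and values are invented for illustration.
customers = [
    {"group": "regular", "age": 25, "yearly_spend": 1200},
    {"group": "regular", "age": 35, "yearly_spend": 1800},
    {"group": "rare", "age": 50, "yearly_spend": 200},
    {"group": "rare", "age": 60, "yearly_spend": 400},
]

def characterize(records, group):
    """Summarize one class (the target class) by its size and attribute means."""
    rows = [r for r in records if r["group"] == group]
    return {
        "count": len(rows),
        "mean_age": mean(r["age"] for r in rows),
        "mean_spend": mean(r["yearly_spend"] for r in rows),
    }

def discriminate(records, target, contrast):
    """Contrast the target class with a comparative class, attribute by attribute."""
    t = characterize(records, target)
    c = characterize(records, contrast)
    return {key: t[key] - c[key] for key in t}

regular_summary = characterize(customers, "regular")
difference = discriminate(customers, "regular", "rare")
print(regular_summary)
print(difference)
```

In practice the summary would feed a chart (pie chart, bar chart) rather than a printed dictionary.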
• Mining Frequent Patterns: patterns that occur frequently in data
- Frequent itemset: a set of items that frequently appear together in
a transactional data set, such as milk & bread
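As a rough sketch of the idea, frequent item pairs can be found by counting how often each pair occurs across transactions. The toy transactions, the `min_support` threshold and the function name are assumptions made for illustration; practical systems use more efficient algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (not from the lecture).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
]

def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (fraction of transactions) is >= min_support."""
    counts = Counter()
    for basket in transactions:
        # Sort so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(transactions, min_support=0.6)
print(pairs)
```

With this data, only milk and bread appear together in at least 60% of the transactions.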
• Association (association rule): based on the measures
support and confidence (certainty)
- buys(X, “Computer”) => buys(X, “Software”), where X is a customer
[support = 1%, confidence = 50%]. Here, a support of 1% means that
1% of all the transactions under analysis showed computer and
software purchased together; a confidence of 50% means that a
customer who buys a computer has a 50% chance of also buying software.
This association rule involves a single attribute or predicate
(i.e., buys), so it is called a single-dimensional association rule.
- By dropping the predicate, this rule can be written as:
computer => software [1%, 50%]
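The two measures can be computed directly from a transaction list. The sketch below uses four hypothetical transactions chosen so the arithmetic is easy to check by hand; the function names are illustrative, not taken from any library.

```python
# Hypothetical transactions, chosen so the numbers are easy to verify by hand.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(transactions, set(antecedent) | set(consequent)) / base

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(rule_support, rule_confidence)  # 0.25 0.5
```

Here one of four transactions contains both items (support 25%), and one of the two computer purchases also includes software (confidence 50%).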
• Correlation
- statistical correlation between associated attribute-value pairs
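One common way to quantify correlation between two itemsets is lift; the slide does not name a specific measure, so this choice is an assumption. Lift compares how often A and B occur together with what would be expected if they were independent: a value of 1 suggests independence, above 1 positive correlation, below 1 negative correlation. The counts below are hypothetical.

```python
def lift(n_total, n_a, n_b, n_ab):
    """Lift = P(A and B) / (P(A) * P(B)), computed from raw transaction counts."""
    if n_a == 0 or n_b == 0:
        raise ValueError("both itemsets must occur at least once")
    return (n_ab * n_total) / (n_a * n_b)

# Hypothetical counts: 1000 transactions, 600 contain A, 750 contain B, 400 contain both.
value = lift(n_total=1000, n_a=600, n_b=750, n_ab=400)
print(round(value, 3))
```

With these counts the lift is below 1, so A and B occur together less often than independence would predict, even though the rule A => B has a confidence of about 67%.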
• Classification and Prediction
- The process of finding a model (or function) that describes and
distinguishes data classes and concepts, in order to predict the class of
objects whose class label is unknown.
- The derived model is based on a set of training data (i.e., data
objects whose class label is known).
- The derived model may be represented in various forms such
as classification (IF-THEN) rules, decision trees, mathematical
formulae, or neural networks.
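To make the training-then-prediction idea concrete, here is a minimal sketch of a one-level decision tree (a "decision stump"), which amounts to a single IF-THEN rule learned from labelled data. The income values, labels and function names are hypothetical, and the sketch assumes at least two distinct attribute values in the training set.

```python
def learn_stump(samples):
    """Learn 'IF value <= threshold THEN below ELSE above' with the fewest training errors.

    samples: list of (value, label) pairs with at least two distinct values.
    """
    values = sorted({v for v, _ in samples})
    labels = sorted({label for _, label in samples})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    (below if v <= threshold else above) != label
                    for v, label in samples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    return best[1:]

def predict(stump, value):
    """Apply the learned rule to an object whose class label is unknown."""
    threshold, below, above = stump
    return below if value <= threshold else above

# Hypothetical training data: (income in thousands, whether the customer buys a computer).
training = [(20, "no"), (25, "no"), (30, "no"), (45, "yes"), (50, "yes"), (60, "yes")]
stump = learn_stump(training)
print(stump, predict(stump, 40), predict(stump, 22))
```

The learned rule reads: IF income <= 37.5 THEN "no" ELSE "yes".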
•Cluster Analysis
- Analyzes data objects without consulting a known class label
- Clustering can be used to generate class labels
- The objects are clustered or grouped based on the principle of
maximizing the intraclass similarity and minimizing the interclass
similarity
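A minimal one-dimensional k-means sketch shows the grouping principle: points are assigned to the nearest center, and centers move to the mean of their group. k-means is only one of many clustering methods and is used here as an assumed example; the data points and starting centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: returns the final centers and the last cluster assignment."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assignment step: each point joins the cluster with the nearest center.
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster (unchanged if empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical unlabelled data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[1.0, 12.0])
print(centers, clusters)
```

No class labels are consulted; the cluster index of each point can afterwards serve as a generated class label.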
• Outlier Analysis
- A database may contain data objects that do not comply with the
general behaviour or model of the data. These data objects are
outliers.
- Most DM methods discard outliers as noise or exceptions
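One simple way to flag such objects, assumed here purely for illustration, is a z-score test: values lying more than a chosen number of standard deviations from the mean are reported as outliers. The readings and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
outliers = zscore_outliers(readings)
print(outliers)
```

Whether the flagged values are then discarded or studied further depends on the application.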
• Evolution Analysis
- Data evolution analysis describes and models regularities or trends
for objects whose behaviour changes over time.
- It may help predict future trends in stock market prices, contributing to
our decision-making regarding stock investment.
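A very simple form of trend modelling is a least-squares line fitted to a time series and extended one step ahead. The sketch below is only an illustration under strong assumptions: the prices are hypothetical and perfectly linear, which real stock prices are not.

```python
def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...; needs at least two points."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return slope, intercept

# Hypothetical closing prices for five consecutive periods.
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(prices)
forecast = intercept + slope * len(prices)  # extrapolate to the next period
print(slope, intercept, forecast)
```

The slope captures the regularity (here a rise of 2 per period), and the forecast simply extends that trend.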
• Classification of Data Mining Systems
- Techniques from other disciplines may be applied, such as neural
networks, fuzzy and/or rough set theory, knowledge representation,
inductive logic programming, or high-performance computing.
- A DM system may also integrate techniques from spatial data analysis,
information retrieval, pattern recognition, image analysis, signal
processing, computer graphics, web technology, economics,
business, bioinformatics or psychology.
