
Data Mining: Introduction

BIF 515
Neeru Redhu
CCS HAU
Data mining: finding hidden information in a
database
Also called exploratory data analysis, data-driven
discovery, and deductive learning

SQL is used for traditional database queries


Data mining access vs. traditional access

Query: might not be well formed or precisely stated


Data: has been cleansed and modified to better
support the mining process
Output: might not be a subset of the
database
Algorithms attempt to fit a model to the data
o Examine the data
o Determine a model that is closest to the characteristics of the data
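
As a minimal sketch of fitting a model to data, the snippet below fits a straight line by least squares. The straight-line model, the function name `fit_line`, and the sample values are all assumptions chosen for illustration; the slides do not prescribe a particular model.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x and co-variation of x with y around their means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: months as a customer vs. monthly spend
months = [1, 2, 3, 4, 5]
spend = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, spend)
print(f"spend = {slope:.1f} * months + {intercept:.1f}")
```

Here "examine the data" corresponds to computing the means and spreads, and "closest to the characteristics of the data" is read as minimising squared error, which is only one possible notion of closeness.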

Data mining algorithms


o Model
o Preference
o Search

E.g. Credit card companies
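
One way to read the three parts, sketched on a made-up credit card example: the model is a simple score threshold, the preference is "fewer misclassified customers is better", and the search tries each candidate threshold in turn. The records, function names, and the threshold rule are hypothetical and only illustrate how the parts fit together.

```python
# Hypothetical records: (credit score, did the customer default?)
history = [(520, True), (580, True), (610, True), (650, False),
           (700, False), (720, False), (760, False)]

def predict(threshold, score):
    """Model: flag a customer as risky when the score is below the threshold."""
    return score < threshold

def errors(threshold, records):
    """Preference: count misclassified records; fewer is better."""
    return sum(predict(threshold, score) != defaulted
               for score, defaulted in records)

def search(records):
    """Search: try a threshold just above each observed score, keep the best."""
    candidates = [score + 1 for score, _ in records]
    return min(candidates, key=lambda t: errors(t, records))

best = search(history)
print(best, errors(best, history))
```

Real algorithms differ mainly in how rich the model is and how cleverly the search avoids trying every candidate.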


Data Mining
o Predictive: Classification, Regression, Time series analysis, Prediction
o Descriptive: Clustering, Summarization, Association rules, Sequence discovery
Predictive model
o Makes prediction about values of data using known
results found from the data

o E.g., use of credit card history
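
A small sketch of a predictive model in this sense: the outcome for a new applicant is predicted from the known outcomes of past customers. It uses a 1-nearest-neighbour rule, which is an assumption made for illustration; the customer records and feature choice are hypothetical, and the features are left unscaled to keep the example short.

```python
# Hypothetical past customers: ((late payments, credit utilisation), outcome)
known = [((0, 0.10), "approve"), ((1, 0.30), "approve"),
         ((4, 0.85), "reject"), ((6, 0.95), "reject")]

def predict_outcome(applicant, records):
    """Return the outcome of the closest known customer (1-nearest neighbour)."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    _, outcome = min(records, key=lambda rec: squared_distance(rec[0], applicant))
    return outcome

print(predict_outcome((5, 0.80), known))
print(predict_outcome((0, 0.20), known))
```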

Descriptive Model
o Identifies patterns and relationships in data

o Serves as a way to explore the properties of the data
examined, not to predict new properties
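
As one illustration of a descriptive model, the sketch below groups values with a plain k-means loop: it describes structure already present in the data rather than predicting anything new. The one-dimensional setting, the starting centers, and the spend figures are assumptions picked to keep the example small.

```python
def k_means_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers (plain k-means)."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend figures with two visible groups
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = k_means_1d(spend, centers=[0, 50])
print(centers, clusters)
```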
Basic Data Mining Tasks
Classification
o Maps data into predefined classes
o Pattern recognition
Regression
o Maps a data item to a real-valued prediction variable
Time series analysis
o Value of an attribute is examined as it varies over time
Prediction
Clustering
o Unsupervised learning
Summarization
Association rules
o Affinity analysis
Sequence discovery
o Used to determine sequential patterns in data
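
To make the association rules task concrete, here is a minimal sketch that scores the rule "bread implies butter" with the two usual measures, support and confidence. The transactions and helper names are hypothetical; the slides name the task but do not define these measures.

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets with the antecedent, the fraction that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```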
Knowledge Discovery in Databases (KDD)
KDD is the process of finding useful information and patterns in
data
Data mining is the use of algorithms to extract the information
and patterns derived by the KDD process

Steps of KDD process


Selection
Preprocessing
Transformation
Data mining
Interpretation / evaluation
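
The five steps can be sketched as a chain of small functions, one per step. Everything here is an illustrative assumption: the records, the choice to drop incomplete rows, min-max scaling as the transformation, and a simple group average standing in for the data mining step.

```python
# Hypothetical raw records; None marks a missing value
raw = [
    {"id": 1, "age": 25, "spend": 100.0},
    {"id": 2, "age": 40, "spend": None},
    {"id": 3, "age": 45, "spend": 300.0},
    {"id": 4, "age": 52, "spend": 200.0},
]

def selection(records):
    """Keep only the attributes relevant to the question."""
    return [{"age": r["age"], "spend": r["spend"]} for r in records]

def preprocessing(records):
    """Drop records with missing values (one of several possible strategies)."""
    return [r for r in records if None not in r.values()]

def transformation(records):
    """Rescale spend to the range 0..1 (min-max scaling)."""
    low = min(r["spend"] for r in records)
    high = max(r["spend"] for r in records)
    return [{**r, "spend": (r["spend"] - low) / (high - low)} for r in records]

def data_mining(records):
    """Summarise: mean scaled spend for customers under 35 and for the rest."""
    groups = {"under_35": [], "35_plus": []}
    for r in records:
        groups["under_35" if r["age"] < 35 else "35_plus"].append(r["spend"])
    return {name: sum(v) / len(v) for name, v in groups.items() if v}

def interpretation(summary):
    """Turn the mined summary into readable lines for the analyst."""
    return [f"{name}: mean scaled spend {value:.2f}"
            for name, value in sorted(summary.items())]

report = interpretation(data_mining(transformation(preprocessing(selection(raw)))))
print("\n".join(report))
```

Note that data mining is only the fourth link in the chain; the earlier steps decide what the algorithm ever gets to see.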
Development of Data Mining
Induction: used to proceed from very specific knowledge to
general information (AI)
Compression
Querying
Approximation
Search
History of data mining development

Time                    Contribution
Late 1700s              Bayes theorem of probability
Early 1900s             Regression analysis
Early 1920s             Maximum likelihood estimate
Early 1940s to 1950s    Neural networks, nearest neighbor, perceptron, jackknife estimator
1960s                   ML started, decision trees, clustering, relational data model
1970s                   SMART IR systems, genetic algorithms, k-means clustering
1980s                   Kohonen self-organizing maps
1990s                   Association rules, data warehousing, Online Analytic Processing (OLAP)
Data Mining Issues
Human Interaction
Overfitting
Outliers
Interpretation of results
Visualization of results
Large Datasets
High dimensionality
Multimedia data
Missing data
Irrelevant data
Noisy data
Integration
Application
Implementation issues
Scalability
Real world data
Update
Ease of use
END
Questions?