Data Mining Notes (AutoRecovered)
CS 1435
Syllabus:
Structured data: tables
Statistics: a branch of science dealing with the collection of structured data in huge amounts.
Apart from AI, ML, and statistics, data mining also draws on maths, DBMS, etc.
De-duplication of records – redundant data is removed and at times replaced by a key (according to
the need of the user)
Data selection – extract patterns and knowledge; domain experts, team members, and managers
decide what to select
Data transformation – modification of data so that it can be evaluated. E.g. for currency, some may
write Rs/INR/etc.
Missing data – e.g. I won't give my income to a salesperson, or the entry time may not be recorded.
Binning method – sort the data, partition it into bins (equal-width (distance) or equal-depth
(frequency) bins), then smooth by bin means.
Equal width – divides the range into n intervals of equal size. Equal depth – each bin contains approx. the same number of samples.
Assignment – download one noisy data set. Perform binning for data smoothing; check which method
is better in terms of time.
Bin data: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70. Use a suitable data structure to place the values into the bins (like bucket sort).
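The binning steps above can be sketched as follows, using the sample data from the assignment. The function names are my own, and the three-bin split is just one choice:

```python
# Equal-width and equal-depth binning with bin-mean smoothing (a sketch).
data = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
               30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of roughly equal frequency."""
    size, extra = divmod(len(values), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(values[start:end])
        start = end
    return bins

def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width (range / n_bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the max value
        bins[idx].append(v)
    return bins

def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) if b else [] for b in bins]

depth = equal_depth_bins(data, 3)
smoothed = smooth_by_bin_means(depth)
```

With 27 values and 3 equal-depth bins, each bin holds 9 values, and the first bin (13 through 22) is smoothed to its mean of 18.0.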
Assignment 2: tools – explore whether they are open source or commercial; explore all the tools on
salesforce.com and classify them as open source or commercial, stating one line on each. Take the data
from the previous assignment and use any 2 tools. Observation – write the features available, then show
the output and tell which tool is better.
Data integration – convert the same data to some unique standard, e.g. 1 hr vs 60 min, or cm to m and
vice versa.
Correlation test for nominal data (chi-square test): for attributes A and B over n data tuples,
chi^2 = sum_i sum_j (o_ij - e_ij)^2 / e_ij, where o = observed frequency, e = expected frequency,
i runs from 1 to c (the distinct values of A), and j runs from 1 to r (the distinct values of B).
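A minimal sketch of the chi-square computation, assuming the observed counts are given as a contingency table (the function name and the 2x2 counts are my own illustration):

```python
def chi_square(observed):
    """chi^2 = sum_ij (o_ij - e_ij)^2 / e_ij over a contingency table,
    where e_ij = (row total * column total) / n."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

# Independent attributes give chi^2 = 0; larger values suggest correlation.
table = [[250, 200], [50, 1000]]  # hypothetical 2x2 counts for A x B
print(chi_square(table))
```

The result is then compared against the chi-square distribution with (c-1)(r-1) degrees of freedom to decide whether A and B are correlated.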
Attribute construction (feature construction) – e.g. a college-level student DB. Identify what the data
problem is and select attributes accordingly.
Aggregation – summarization. E.g. monthly attendance, monthly income: calculate per month, then
calculate for the year.
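A tiny sketch of that monthly-to-yearly roll-up (the month names are real, the amounts are hypothetical):

```python
# Aggregation: summarize monthly values into a yearly figure.
monthly_income = {"Jan": 300, "Feb": 320, "Mar": 310, "Apr": 295,
                  "May": 330, "Jun": 305, "Jul": 340, "Aug": 315,
                  "Sep": 325, "Oct": 300, "Nov": 310, "Dec": 350}

yearly_income = sum(monthly_income.values())          # roll up to year level
average_month = yearly_income / len(monthly_income)   # summary statistic
```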
Concept hierarchy generation – dept, NIT Silchar, Assam, India; instead of the full chain we can write
Silchar, Assam, India.
Min-max normalization – v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A,
where v = observed value, A = attribute, min_A/max_A = minimum/maximum value of attribute A, and
new_min_A/new_max_A = the range we want to normalize to.
Min-max normalization preserves the relationships among the original values, but it will encounter an
out-of-bounds error if a future input falls outside the original range of A (so it can detect such values).
Eg3.4
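The formula above can be sketched directly; the function name and the income values are my own illustration, with min_A = 12,000 and max_A = 98,000:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each v to (v - min_A)/(max_A - min_A) * (new_max - new_min) + new_min."""
    min_a, max_a = min(values), max(values)
    return [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 73600, 98000]  # hypothetical values of attribute "income"
normalized = min_max_normalize(incomes)
```

With the default [0, 1] range, 12,000 maps to 0.0, 98,000 maps to 1.0, and 73,600 maps to about 0.716.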
z-score normalization – v' = (v - mean_A) / std_A (also called zero-mean normalization); useful when
min_A and max_A are unknown or when outliers dominate the min-max.
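The z-score formula can be sketched the same way (function name is my own; this uses the population standard deviation):

```python
def z_score_normalize(values):
    """Map each v to (v - mean_A) / std_A."""
    mean_a = sum(values) / len(values)
    std_a = (sum((v - mean_a) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean_a) / std_a for v in values]

z = z_score_normalize([10, 20, 30])  # hypothetical attribute values
```

The normalized values always have mean 0 and standard deviation 1.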
Data reduction strategies: they should closely maintain the integrity of the original data. Wavelet
transforms: for continuous data.
Parametric methods – a model is fitted and only its parameters are stored, not the full data (e.g.
regression). Non-parametric methods – histograms, clustering, sampling, etc.
CLASSIFICATION: to analyze the measurements of an object to identify the category to which it belongs.
Building a decision tree: to select a branch, choose the most suitable attribute, then create the data
structure.
Entropy: E(X) = -sum_i p_i log2(p_i). Since log2(p_i) <= 0 for p_i <= 1, each term -p_i log2(p_i) >= 0,
so entropy is always non-negative.
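The entropy formula can be checked with a short sketch (function name is my own):

```python
import math

def entropy(probs):
    """E(X) = -sum_i p_i * log2(p_i), in bits; 0 * log2(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # a fair coin: maximum entropy for two outcomes
```

A certain outcome ([1.0]) has entropy 0, a fair coin has entropy 1 bit, and a uniform 4-way split has entropy 2 bits; decision-tree algorithms pick the attribute whose split reduces this value the most.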
Overfitting: the learning system fits the training data so closely that it cannot predict well on
untrained data. To avoid it: pre-pruning (the partitioning/construction is halted early) or post-pruning
(the fully grown tree is pruned in bottom-up fashion).
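As an illustration (assuming scikit-learn is available; this is one common realization of pre- and post-pruning, not necessarily the exact algorithm from the course):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: halt partitioning early by capping the tree depth.
pre = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Post-pruning: grow the full tree, then prune it bottom-up with
# cost-complexity pruning (a larger ccp_alpha prunes more aggressively).
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(pre.get_depth(), post.get_depth(), full.get_depth())
```

Both pruned trees end up shallower than the unrestricted one, which is the point: a smaller tree generalizes better to unseen data.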