
Data Mining notes: 7th semester.

CS 1435
Structured: tables

Semi-structured: tables, images

Unstructured: videos etc.

Statistics: a branch of science dealing with the collection of structured data in huge amounts.

Hypothesis: a study of the data received; a statement and its outcome (see def.).

Null hypothesis: one variable doesn’t affect the other.

Alternative hypothesis: one variable may affect the other.

Apart from AI, ML and stats, it also includes maths, DBMS, etc.

KDD – knowledge discovery in databases

Data cleaning – to remove noise/irrelevant data

Make data uniform in terms of key and data attributes

De-duplication of records – redundant data is removed and at times replaced by a key (according to the need of the user)

Data selection – select the data from which patterns and knowledge are extracted; involves domain experts, team members and managers

Data transformation – modification of data so that it can be evaluated. Eg. Some may write

Data discretization – deals with numerical data

Missing data – e.g. I won’t give my income details to a salesperson, or an entry time may be missing.

Clean – fill missing values, identify outliers

Ignore the tuple – only when a certain % of its data is missing

Filling missing data not always feasible

Use a global constant to fill in, like unknown, NA, or ?

Fill in the most probable value, using a decision tree or Bayesian theory
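
A minimal Python sketch of two of the options above (ignoring a tuple and filling with a global constant). It assumes missing entries are stored as None; the function names, the 50% threshold and the sample rows are made up for illustration. The decision tree/Bayesian option is not shown.

```python
def drop_sparse_tuples(rows, max_missing_fraction=0.5):
    """Ignore a tuple only when more than the given fraction of its fields is missing."""
    kept = []
    for row in rows:
        missing = sum(1 for v in row if v is None)
        # Empty rows carry no data at all, so they are dropped as well.
        if row and missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


def fill_with_constant(values, constant="unknown"):
    """Replace every missing entry (None) with one global constant."""
    return [constant if v is None else v for v in values]


# Hypothetical customer tuples: (name, income, entry_time)
rows = [
    ("asha", 30000, "09:15"),
    ("ravi", None, "10:40"),
    (None, None, "11:05"),   # two of three fields missing -> ignored
]

kept = drop_sparse_tuples(rows)
filled = [tuple(fill_with_constant(row)) for row in kept]
print(filled)
```

The threshold is a judgment call: a low value throws away more tuples, a high value keeps tuples that are mostly placeholders.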

Binning method – sort the data, partition it into bins (equal-width (distance) or equal-depth (frequency) bins), then smooth by bin means.

Clustering – detect and remove the outliers

Combined computer and human inspection – detect suspicious values and have a human check them

Regression – smooth the data by fitting it to regression functions

Equal width: W = (B - A)/N, where A is the lowest value, B is the highest value, N is the number of intervals and W is the interval width

Equal depth – divides the range into N intervals, each containing approx. the same no. of samples

Assignment – download 1 noisy data set. Perform binning for data smoothing, check which method
is good in terms of time

Bin data: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70. Use a data structure to place the values into the chosen number of bins (like bucket sort).
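
One way the binning steps could look in Python, using the data listed above. This is only a sketch: the choice of 3 bins, the function names, and the rule that the maximum value goes into the last equal-width bin are assumptions, not part of the notes.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of width W = (B - A) / N.

    Assumes the values are not all identical (otherwise W would be 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value would index one past the end, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins with roughly the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) if b else [] for b in bins]


data = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth_bins = equal_depth_bins(data, 3)
width_bins = equal_width_bins(data, 3)
print(depth_bins)
print(smooth_by_bin_means(depth_bins))
print(width_bins)
```

With this data the equal-depth bins each hold 9 values, while the equal-width bins are uneven because the single large value 70 stretches the range.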


Data cleaning overall:

• Discrepancy detection – causes: instrumentation errors, human errors, outdated data, inconsistency

Unique rule – each value of a particular attribute must be different; consecutive rule – there
can be no missing value between the min and max values; null rule – specifies how
blanks/?/special characters/... (not available) are recorded; use a global constant.
• Tools – data scrubbing tools (useful in classification), data auditing tools, ETL tools (transform
data through a graphical user interface)
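
The unique, consecutive and null rules above could be checked with a few lines of Python. This is a rough sketch: the set of markers treated as "not available" and all function names are assumptions, and the consecutive rule is shown for integer attributes only.

```python
NULL_MARKERS = {None, "", "?", "NA"}  # assumed markers for "not available"


def violates_unique_rule(values):
    """Unique rule: every value of the attribute must be different."""
    return len(set(values)) != len(values)


def missing_for_consecutive_rule(values):
    """Consecutive rule: list the integers missing between the min and max values."""
    present = set(values)
    return [v for v in range(min(values), max(values) + 1) if v not in present]


def apply_null_rule(values, constant="unknown"):
    """Null rule: map every blank or special marker to one global constant."""
    return [constant if v in NULL_MARKERS else v for v in values]


roll_numbers = [1, 2, 4, 6, 6]          # hypothetical attribute values
print(violates_unique_rule(roll_numbers))
print(missing_for_consecutive_rule(roll_numbers))
print(apply_null_rule(["silchar", "?", None, ""]))
```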

Assignment 2: tools – explore all the tools and classify each as open source or commercial. Write one
line on each. Take the data from the previous assignment and use any 2 tools. Observation – write the
features available, then describe the output and say which tool is better.

Data integration: convert the same data to one standard, unique representation. E.g. 1 hr and 60 min; cm to m and vice versa.

Handling redundant data: chi-square test

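Correlation test for nominal data: A, B are attributes of the data tuples, o = observed count, e = expected count, n = no. of tuples; i runs from 1 to c and j from 1 to r, where c and r are the numbers of distinct values of A and B.

$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$, with $e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$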

Eg. 3.1, Han and Kamber, Data Mining book
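
A small Python sketch of the chi-square computation for two nominal attributes, using only the o, e, n definitions above. The 2x2 counts are illustrative numbers chosen for the example, and the function name is made up.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two attributes were independent.
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    return chi2


# Illustrative counts for 1500 tuples:
# rows = values of attribute A, columns = values of attribute B
table = [[250, 200],
         [50, 1000]]

chi2 = chi_square(table)
degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
print(round(chi2, 2), degrees_of_freedom)
```

A statistic far above the critical value for the given degrees of freedom suggests the two attributes are correlated, so one of them may be redundant.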

Data transformation: smoothing (regression, clustering, binning)

Attribute construction (feature construction) – e.g. a college-level student db: identify what the data problem is and select attributes accordingly

Aggregation – summarization. E.g. monthly attendance, monthly income: calculate monthly, then
calculate for the year

Normalization – scale it down acc to size of data. Acceptance of paper

Discretization – place at some range level. Like 1st yr students age 17-18.

Concept hierarchy generation – dept, NIT, Silchar, Assam, India. Instead we can write Silchar, Assam,

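Min-max normalization – $v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$, where v = observed value, A = attribute, min_A and max_A = minimum and maximum values of attribute A, and new_min_A, new_max_A = bounds of the range we want to normalize to

Min-max normalization preserves the relationships among the original values. It runs into an out-of-bounds error if a later input falls outside the original range [min_A, max_A]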
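
Min-max normalization in Python, as a minimal sketch. The function name, the default target range [0, 1] and the income figures are assumptions for illustration; the out-of-bounds case is reported as a ValueError.

```python
def min_max_scale(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from the range [min_a, max_a] to [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("min_a and max_a are equal; the scaling is undefined")
    if not min_a <= v <= max_a:
        # A value outside the original range cannot be mapped into the new range.
        raise ValueError(f"{v} is out of bounds for [{min_a}, {max_a}]")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Hypothetical income attribute with min 12000 and max 98000
scaled = min_max_scale(73600, 12000, 98000)
print(round(scaled, 3))

try:
    min_max_scale(120000, 12000, 98000)
except ValueError as err:
    print("out-of-bounds:", err)
```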


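z-score normalization – $v' = \frac{v - \bar{A}}{\sigma_A}$, where $\bar{A}$ and $\sigma_A$ are the mean and standard deviation of attribute A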
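
A short Python sketch of z-score normalization: each value is shifted by the attribute mean and divided by its standard deviation. Using the population standard deviation (`statistics.pstdev`) is an assumption; the sample values and the function name are made up.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mean) / std for every value, using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are equal; the z-score is undefined")
    return [(v - mu) / sigma for v in values]


marks = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical attribute values
z = z_score_normalize(marks)
print(z)
```

Unlike min-max normalization, this needs no fixed min and max, so a later value outside the original range still gets a score.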

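Data reduction strategies: the reduced data should closely maintain the integrity of the original data. Wavelet transform: for
continuous data.

Parametric – a model is fitted and only its parameters are stored, not the data itself (e.g. regression). Non-parametric – no model is assumed; clustering, histograms, sampling etc. are used.

Lossy vs lossless: whether or not data is lost when the original data is reconstructed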

CLASSIFICATION: to analyze the measurements of an object to identify the category to which it belongs

Learning step: a model is built that describes a set of predetermined classes

Classification step – the model is used for classifying future or unknown objects

E.g. whether a customer is good to give a loan to or not, i.e. whether the customer is safe or not

Decision tree induction: a classifier in the form of a tree.

The decision tree is learned from class-labelled training data, so a greedy strategy is needed.

ID3: Iterative Dichotomiser 3 – decision tree algo

C4.5: successor of ID3

CART: Classification and Regression Trees.

Build decision tree algo: to select a branch, choose the most suitable attribute, then create the data structure

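Entropy: $E(X) = -\sum_i p_i \log_2 p_i = \sum_i p_i \log_2 \frac{1}{p_i}$, where $p_i$ is the probability of class i. Each $\log_2 p_i$ is negative or zero (since $p_i \le 1$), so $\sum_i p_i \log_2 p_i$ is a negative value and the leading minus sign makes the entropy non-negative.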
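
The entropy formula and the "choose the most suitable attribute" step could be sketched in Python as below. Taking "most suitable" to mean highest information gain (the measure ID3 uses) is an assumption here, and the loan data, attribute layout and function names are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute_index):
    """Entropy before the split minus the weighted entropy after splitting on one attribute."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical loan applicants: (income, age_group) -> safe / risky
rows = [("high", "young"), ("high", "old"), ("low", "young"), ("low", "old")]
labels = ["safe", "safe", "risky", "risky"]

gains = [information_gain(rows, labels, i) for i in range(2)]
best_attribute = gains.index(max(gains))
print(entropy(labels), gains, best_attribute)
```

In this toy data income separates the classes completely while age group tells nothing, so the greedy step would branch on income first.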

Overfitting: the learning system fits the training data so closely that it cannot predict well for
unseen data. To avoid it: pre-pruning (the partitioning/construction of the tree is halted early) or
post-pruning (subtrees are removed from the fully grown tree in a bottom-up fashion).
