
Data Mining notes: 7th semester.

CS 1435
Syllabus:
Structured: tables

Semi-structured: tables, images

Unstructured: videos and other free-form content

Statistics: a branch of science dealing with the collection and analysis of structured data in huge amounts.

Hypothesis: a statement made about the data under study, together with its outcome (see formal definition).

Null hypothesis: one variable does not affect the other.

Alternative hypothesis: one variable may affect the other.

Apart from AI, ML and statistics, data mining also draws on maths, DBMS, etc.

KDD – Knowledge Discovery in Databases

Data cleaning – to remove the noise/ irrelevant data

Keep the data uniform in terms of key and data attributes.

De-duplication of records – redundant data is removed and at times replaced by a key (according to the user's need).

Data selection – extract patterns and knowledge; domain experts, team members and managers decide what to select.

Data transformation – modification of data so that it can be evaluated, e.g. some may write Rs/INR/etc. for the same currency.

Data discretisation – dealing with numerical data by placing values into discrete intervals.

Missing data – e.g. a customer may not disclose their income to a salesperson, or the entry time may not be recorded.

Clean – fill missing values, identify outliers.

Ignore the tuple – only when a large percentage of its values are missing.

Filling missing data is not always feasible.

Use a global constant to fill with, like "unknown", "NA" or "?".

Fill with the most probable value, using a decision tree or Bayesian theory.
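A minimal sketch of these fill strategies, assuming a pandas DataFrame; the column names and values are made up for illustration only:

    import pandas as pd

    # Hypothetical records with missing values; names and values are made up.
    df = pd.DataFrame({
        "income":     [30000.0, None, 45000.0, None],
        "entry_time": ["09:10", None, None, "09:45"],
    })

    # Ignore the tuple: drop rows where income is missing.
    dropped = df.dropna(subset=["income"])

    # Fill with a global constant such as "unknown" / "NA" / "?".
    df["entry_time"] = df["entry_time"].fillna("unknown")

    # Fill with a probable value, here simply the attribute mean.
    df["income"] = df["income"].fillna(df["income"].mean())

    print(dropped)
    print(df)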

Binning method – sort the data, partition the values into bins (equal-width (distance) or equal-depth (frequency) bins), then smooth by bin means (a sketch on the sample bin data appears below).

Clustering – detect and remove the outliers

Combined computer and human inspection – detect suspicious values.


Regression – fit regression functions to smooth the data.

Equal width: W = (B - A) / N, where B is the highest value, A is the lowest value and N is the number of intervals.

Equal depth – divides the range into N intervals, each containing approximately the same number of samples.
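A quick worked example with made-up numbers: if A = 10, B = 70 and N = 3, then W = (70 - 10) / 3 = 20, giving intervals [10, 30), [30, 50) and [50, 70].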

Assignment – download one noisy data set. Perform binning for data smoothing and check which method is better in terms of time.

Bin data: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Use a data structure to place the values into the required number of bins (like bucket sort).
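A rough sketch of equal-depth (frequency) binning with smoothing by bin means on the data above; the choice of 3 bins is arbitrary, for illustration only:

    # Equal-depth (frequency) binning with smoothing by bin means.
    data = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
                   30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

    n_bins = 3                      # arbitrary choice for illustration
    size = len(data) // n_bins      # approx. the same number of samples per bin

    bins = [data[i:i + size] for i in range(0, len(data), size)]

    # Smooth each bin by replacing its values with the bin mean.
    smoothed = []
    for b in bins:
        mean = sum(b) / len(b)
        smoothed.extend([round(mean, 2)] * len(b))
        print(b, "-> bin mean", round(mean, 2))

    print(smoothed)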


Data cleaning overall:

 Discrepancy detection – instrumentation errors, human errors, outdated data, inconsistency


Unique rule – each value must be different for a particular attribute.
Consecutive rule – there can be no missing value between the minimum and maximum values.
Null rule – specifies how blanks, "?", special characters, etc. mean "not available"; use a global constant to fill them (a small check of these rules is sketched after the tools line below).
 Tools – data scrubbing tools (useful in classification), data auditing tools, and ETL tools, which transform data through a graphical user interface.
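A minimal sketch of checking the unique rule and the null rule with pandas; the column names and records are hypothetical:

    import pandas as pd

    # Hypothetical table; the roll_no column should obey the unique rule.
    df = pd.DataFrame({"roll_no": [101, 102, 102, 104],
                       "grade":   ["A", None, "?", "B"]})

    # Unique rule: every value of the attribute must be different.
    duplicates = df[df["roll_no"].duplicated(keep=False)]

    # Null rule: treat blanks / "?" as "not available",
    # then replace them with a global constant.
    df["grade"] = df["grade"].replace(["", "?"], None).fillna("unknown")

    print(duplicates)
    print(df)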

Assignment 2: tools – explore whether each tool is open source or commercial; explore all the tools on salesforce.com, classify them as open source or commercial, and state one line on each. Take the data from the previous assignment and use any 2 tools. Observation – write the features available, then show the output and tell which tool is better.

Data integration – convert the same data to a common standard and keep it unique, e.g. 1 hr vs 60 min, or cm to m and vice versa.
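A tiny sketch of standardizing units during integration; the unit names and conversion factors here are illustrative assumptions:

    # Convert all durations to minutes and all lengths to metres.
    TIME_FACTORS = {"hr": 60, "min": 1}
    LENGTH_FACTORS = {"cm": 0.01, "m": 1}

    def to_minutes(value, unit):
        return value * TIME_FACTORS[unit]

    def to_metres(value, unit):
        return value * LENGTH_FACTORS[unit]

    print(to_minutes(1, "hr"))    # 60
    print(to_metres(150, "cm"))   # 1.5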

Handling redundant data: chi-square test

Correlation test for nominal data: for data tuples over attributes A and B, chi-square = sum over i and j of (o_ij - e_ij)^2 / e_ij, where o_ij = observed frequency, e_ij = expected frequency = count(A = a_i) x count(B = b_j) / n, n = number of tuples, i runs from 1 to c (the distinct values of A) and j from 1 to r (the distinct values of B).
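A small sketch computing this statistic from a contingency table of observed counts; the counts below are made up, not taken from the textbook example:

    import numpy as np

    # Made-up contingency table of observed counts o_ij for attributes A and B.
    observed = np.array([[90.0, 60.0],
                         [30.0, 120.0]])

    n = observed.sum()
    row_totals = observed.sum(axis=1, keepdims=True)   # count(A = a_i)
    col_totals = observed.sum(axis=0, keepdims=True)   # count(B = b_j)

    expected = row_totals * col_totals / n              # e_ij
    chi2 = ((observed - expected) ** 2 / expected).sum()

    print("chi-square =", chi2)   # 50.0 for these made-up counts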

Example 3.1, Han & Kamber data mining book.

Data transformation: smoothing (regression, clustering, binning)

Attribute construction (feature construction) – e.g. a college-level student database; check what the data problem is and select/construct attributes accordingly.
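As a tiny illustration (the columns are made up), a new attribute can be constructed from existing ones:

    import pandas as pd

    # Hypothetical student records; the marks columns are made up.
    df = pd.DataFrame({"internal_marks": [18, 25, 30],
                       "exam_marks":     [55, 60, 48]})

    # Construct a new attribute from the existing ones.
    df["total_marks"] = df["internal_marks"] + df["exam_marks"]
    print(df)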

Aggregation – summarization, e.g. monthly attendance or monthly income: calculate the monthly values, then aggregate them for the year.
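A rough sketch of aggregating monthly values into a yearly figure; the columns and numbers are illustrative:

    import pandas as pd

    # Hypothetical monthly income records.
    df = pd.DataFrame({"year":   [2023] * 3 + [2024] * 3,
                       "month":  [1, 2, 3, 1, 2, 3],
                       "income": [100, 120, 110, 130, 125, 140]})

    # Summarize: total income per year from the monthly values.
    yearly = df.groupby("year")["income"].sum()
    print(yearly)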

Normalization – scale the data down according to its size/range, e.g. acceptance of a paper.


Discretization – place values at some range/level, e.g. 1st-year students aged 17-18.

Concept hierarchy generation – dept, NIT Silchar, Assam, India; instead of the full detail we can write just Silchar, Assam, India.

Min-max normalization – v' = ((v - min_A) / (max_A - min_A)) x (new_max_A - new_min_A) + new_min_A, where v is the observed value of attribute A, min_A and max_A are the minimum and maximum values of A, and new_min_A, new_max_A are the bounds of the range we want to normalize to.

Min-max normalization can detect out-of-bounds values: a future input that falls outside the original range of A raises an "out-of-bounds" error.

Example 3.4.

z-score normalization – v' = (v - mean_A) / std_dev_A, where mean_A and std_dev_A are the mean and standard deviation of attribute A.
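A short sketch of both normalizations on a made-up attribute, assuming a target range of new_min_A = 0 and new_max_A = 1 for min-max:

    # Min-max and z-score normalization of a made-up attribute.
    values = [12000, 15000, 18000, 30000, 98000]

    min_a, max_a = min(values), max(values)
    new_min, new_max = 0.0, 1.0                      # target range (assumed)

    min_max = [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
               for v in values]

    mean_a = sum(values) / len(values)
    std_a = (sum((v - mean_a) ** 2 for v in values) / len(values)) ** 0.5

    z_score = [(v - mean_a) / std_a for v in values]

    print([round(x, 3) for x in min_max])
    print([round(x, 3) for x in z_score])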

Data reduction strategies: the reduced data should closely maintain the integrity of the original data. Wavelet transform: for continuous data.

Parametric – only the parameters of the data are stored, not the whole data; non-parametric – clustering, regression, etc. are used.

Lossy vs lossless: whether or not data is lost when reconstructing the original data.

CLASSIFICATION: to analyze the measurements of an object in order to identify the category to which it belongs.

Learning step: a model is built by describing the training data in terms of a predetermined set of classes.

Classification step – the model is used for classifying future or unknown objects.

E.g. whether a customer is good to give a loan to, i.e. the customer is "safe" or "risky".

Decision tree induction: a classifier in the form of a tree.

Learning of a decision tree from class-labelled training tuples; a greedy strategy is needed.

ID3: Iterative Dichotomiser – a decision tree algorithm.

C4.5: successor of ID3.

CART: Classification and Regression Trees.

Building the decision tree: to select a branch, choose the most suitable splitting attribute, then create the data structure (subtree) for it.
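A minimal sketch of building such a tree with scikit-learn (an outside library, assumed here only for illustration); the loan-style data is made up:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Made-up loan data: [age, income]; class label: 1 = safe, 0 = risky.
    X = [[25, 30000], [40, 80000], [35, 60000], [22, 20000], [50, 90000], [28, 25000]]
    y = [0, 1, 1, 0, 1, 0]

    # Choose the splitting attribute by entropy (information gain).
    clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
    clf.fit(X, y)

    print(export_text(clf, feature_names=["age", "income"]))
    print(clf.predict([[30, 70000]]))   # classify an unknown customer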

Entropy: E(X) = -sum_i p_i log2(p_i) = sum_i p_i log2(1/p_i).

Since 0 <= p_i <= 1, log2(p_i) is negative (or zero), so each term -p_i log2(p_i) is non-negative and E(X) >= 0.
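A small sketch computing the entropy formula above for a made-up class distribution, showing that the result is non-negative:

    import math

    def entropy(probabilities):
        # E(X) = -sum(p_i * log2(p_i)); terms with p_i = 0 contribute 0.
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    # Made-up class distribution: 9 "safe" and 5 "risky" tuples.
    p_safe, p_risky = 9 / 14, 5 / 14
    print(round(entropy([p_safe, p_risky]), 3))   # about 0.940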
