Data Mining Notes (AutoRecovered)
CS 1435
Syllabus:
Structured data: tables
Statistics: a branch of science dealing with the collection of structured data in huge amounts.
Apart from AI, ML, and statistics, data mining also draws on maths, DBMS, etc.
De-duplication of records – redundant data is removed and at times replaced by a key (according to
the need of the user)
Data selection – extract patterns and knowledge; domain experts, team members, and managers
decide what to select
Data transformation – modification of data so that it can be evaluated. E.g. for currency, some may
write Rs/INR/etc.
Missing data – e.g. I won't give my income to a salesperson, or the entry time may not be recorded.
Binning method – sort the data, partition it into bins (equal-width (distance) or equal-depth
(frequency) bins), then smooth by bin means.
Equal width – divides the range into n intervals of equal size. Equal depth – each bin contains approx. the same number of samples.
Assignment – download one noisy data set. Perform binning for data smoothing; check which method
is better in terms of time.
Bin data: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46,
52, 70. Use a suitable data structure to place the values into the bins (like bucket sort).
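The binning steps above can be sketched as follows, using the sample data from the assignment. The function names are my own, and the three-bin split is just one choice:

```python
# Equal-width and equal-depth binning with bin-mean smoothing (a sketch).
data = sorted([13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
               30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70])

def equal_depth_bins(values, n_bins):
    """Partition sorted values into n_bins bins of roughly equal frequency."""
    size, extra = divmod(len(values), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(values[start:end])
        start = end
    return bins

def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width (range / n_bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the max value
        bins[idx].append(v)
    return bins

def smooth_by_bin_means(bins):
    """Replace every value in a bin by that bin's mean."""
    return [[sum(b) / len(b)] * len(b) if b else [] for b in bins]

depth = equal_depth_bins(data, 3)
smoothed = smooth_by_bin_means(depth)
```

With 27 values and 3 equal-depth bins, each bin holds 9 values, and the first bin (13 through 22) is smoothed to its mean of 18.0.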
Assignment 2: tools – explore whether they are open source or commercial; explore all the tools on
salesforce.com and classify them as open source or commercial, stating one line on each. Take the data
from the previous assignment and use any 2 tools. Observation – write the features available, then show
the output and tell which tool is better.
Data integration – convert the same data to some unique standard, e.g. 1 hr vs 60 min, or cm to m and
vice versa.
Correlation test for nominal data (chi-square test): for attributes A and B over n data tuples,
chi^2 = sum_i sum_j (o_ij - e_ij)^2 / e_ij, where o = observed frequency, e = expected frequency,
i runs from 1 to c (the distinct values of A), and j runs from 1 to r (the distinct values of B).
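A minimal sketch of the chi-square computation, assuming the observed counts are given as a contingency table (the function name and the 2x2 counts are my own illustration):

```python
def chi_square(observed):
    """chi^2 = sum_ij (o_ij - e_ij)^2 / e_ij over a contingency table,
    where e_ij = (row total * column total) / n."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2

# Independent attributes give chi^2 = 0; larger values suggest correlation.
table = [[250, 200], [50, 1000]]  # hypothetical 2x2 counts for A x B
print(chi_square(table))
```

The result is then compared against the chi-square distribution with (c-1)(r-1) degrees of freedom to decide whether A and B are correlated.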
Attribute construction (feature construction) – e.g. a college-level student DB. Identify what the data
problem is and select attributes accordingly.
Aggregation – summarization. E.g. monthly attendance, monthly income: calculate per month, then
calculate for the year.
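A tiny sketch of that monthly-to-yearly roll-up (the month names are real, the amounts are hypothetical):

```python
# Aggregation: summarize monthly values into a yearly figure.
monthly_income = {"Jan": 300, "Feb": 320, "Mar": 310, "Apr": 295,
                  "May": 330, "Jun": 305, "Jul": 340, "Aug": 315,
                  "Sep": 325, "Oct": 300, "Nov": 310, "Dec": 350}

yearly_income = sum(monthly_income.values())          # roll up to year level
average_month = yearly_income / len(monthly_income)   # summary statistic
```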
Concept hierarchy generation – dept, NIT Silchar, Assam, India; instead of the full chain we can write
Silchar, Assam, India.
Min-max normalization – v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A,
where v = observed value, A = attribute, min_A/max_A = minimum/maximum value of attribute A, and
new_min_A/new_max_A = the range we want to normalize to.
Min-max normalization preserves the relationships among the original values, but it will encounter an
out-of-bounds error if a future input falls outside the original range of A (so it can detect such values).
Eg3.4
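The formula above can be sketched directly; the function name and the income values are my own illustration, with min_A = 12,000 and max_A = 98,000:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each v to (v - min_A)/(max_A - min_A) * (new_max - new_min) + new_min."""
    min_a, max_a = min(values), max(values)
    return [(v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
            for v in values]

incomes = [12000, 73600, 98000]  # hypothetical values of attribute "income"
normalized = min_max_normalize(incomes)
```

With the default [0, 1] range, 12,000 maps to 0.0, 98,000 maps to 1.0, and 73,600 maps to about 0.716.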
z-score normalization – v' = (v - mean_A) / std_A (also called zero-mean normalization); useful when
min_A and max_A are unknown or when outliers dominate the min-max.
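The z-score formula can be sketched the same way (function name is my own; this uses the population standard deviation):

```python
def z_score_normalize(values):
    """Map each v to (v - mean_A) / std_A."""
    mean_a = sum(values) / len(values)
    std_a = (sum((v - mean_a) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean_a) / std_a for v in values]

z = z_score_normalize([10, 20, 30])  # hypothetical attribute values
```

The normalized values always have mean 0 and standard deviation 1.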
Data reduction strategies: they should closely maintain the integrity of the original data. Wavelet
transforms: for continuous data.
Parametric methods – a model is fitted and only its parameters are stored, not the full data (e.g.
regression). Non-parametric methods – histograms, clustering, sampling, etc.
CLASSIFICATION: to analyze the measurements of an object to identify the category to which it belongs.
Building a decision tree: to select a branch, choose the most suitable attribute, then create the data
structure.
Entropy: E(X) = -sum_i p_i log2(p_i). Since log2(p_i) <= 0 for p_i <= 1, each term -p_i log2(p_i) >= 0,
so entropy is always non-negative.
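The entropy formula can be checked with a short sketch (function name is my own):

```python
import math

def entropy(probs):
    """E(X) = -sum_i p_i * log2(p_i), in bits; 0 * log2(0) is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # a fair coin: maximum entropy for two outcomes
```

A certain outcome ([1.0]) has entropy 0, a fair coin has entropy 1 bit, and a uniform 4-way split has entropy 2 bits; decision-tree algorithms pick the attribute whose split reduces this value the most.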
Overfitting: the learning system fits the training data so closely that it cannot predict well on
untrained data. To avoid it: pre-pruning (the partitioning/construction is halted early) or post-pruning
(the fully grown tree is pruned in bottom-up fashion).
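As an illustration (assuming scikit-learn is available; this is one common realization of pre- and post-pruning, not necessarily the exact algorithm from the course):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: halt partitioning early by capping the tree depth.
pre = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Post-pruning: grow the full tree, then prune it bottom-up with
# cost-complexity pruning (a larger ccp_alpha prunes more aggressively).
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(pre.get_depth(), post.get_depth(), full.get_depth())
```

Both pruned trees end up shallower than the unrestricted one, which is the point: a smaller tree generalizes better to unseen data.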