Data mining

© All Rights Reserved


data if it were a signal.

Evans' Rule (conservative): n/p > 10 (at least 10 observations per predictor)

Doane's Rule (relaxed): n/p > 5 (at least 5 observations per predictor)

Standardize the data

Training Partition (typically the largest partition) contains the data used to build the various models we are examining. The same training partition is generally used to develop multiple models.

Validation Partition (sometimes called the test partition) is used to assess the performance of each model so that you can compare models and pick the best one.

Test Partition (sometimes called the holdout or evaluation partition) is used if we need to assess the performance of the chosen model with new data.
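The three partitions can be illustrated with a minimal sketch. The function name `partition`, the 50/30/20 proportions, and the fixed seed are assumptions chosen for illustration; the notes do not prescribe them.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=0):
    """Shuffle the records, then cut them into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)      # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                 # used to build the models
    valid = shuffled[n_train:n_train + n_valid]  # used to compare models
    test = shuffled[n_train + n_valid:]        # used to assess the chosen model
    return train, valid, test

train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Every record lands in exactly one partition, so the test partition stays unseen while models are built and compared.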

Simple linear regression: Regression analysis involving one independent variable (X) and one dependent variable (Y) in which the relationship between the variables is approximated by a straight line.

Mean square error (MSE) = SSE/(n − K − 1); the Residual Standard Error (#8 on the table on the next page) is $\sqrt{MSE}$. Mean square regression: MSR = SSR/K (K = # of predictors)

$R^2 = SSR/SST$: the proportion of the variability in the response variable that is explained by the estimated regression equation.

To test for a significant regression relationship, conduct a hypothesis test to determine whether the value of $\beta_1$ is 0.

H0: $\beta_1 = 0$    H1: $\beta_1 \neq 0$

Reject H0 if p-value < $\alpha$.

Test statistic for hypothesis tests about the slope: $t = b_1 / se(b_1)$, where $se(b_1)$ is the standard error of $b_1$. Note: the degrees of freedom are n − 2.
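As a rough sketch of the slope test, the code below fits a least squares line and computes $t = b_1 / se(b_1)$. The function name `slope_t_test` and the five data points are made up for illustration; the resulting t value would then be compared with a t distribution on n − 2 degrees of freedom.

```python
import math

def slope_t_test(x, y):
    """Return (b0, b1, se_b1, t) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))        # residual standard error, n - 2 degrees of freedom
    se_b1 = s / math.sqrt(sxx)          # standard error of the slope
    return b0, b1, se_b1, b1 / se_b1

# Hypothetical data, for illustration only
b0, b1, se_b1, t = slope_t_test([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b1, 3), round(se_b1, 3), round(t, 3))
```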

The coded variables are called dummy variables. If a categorical variable has m levels, then we have to introduce m − 1 dummy variables in the model.

Interpretation of the $\beta$s (levels A, B, C with B as the base level): $\beta_0 = \mu_B$ (mean of the base level); $\beta_1 = \mu_A - \mu_B$; $\beta_2 = \mu_C - \mu_B$
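A small sketch of dummy coding, assuming a single categorical predictor with levels A, B, C and B as the base level. The helper `dummy_code` and the response values are hypothetical. With only this one categorical predictor, the least squares coefficients equal the group-mean differences computed here.

```python
from statistics import mean

def dummy_code(levels, base):
    """Return the non-base level names and one row of m-1 indicators per record."""
    others = sorted(set(levels) - {base})
    rows = [[1 if lv == o else 0 for o in others] for lv in levels]
    return others, rows

# Hypothetical data, for illustration only
levels = ["A", "B", "C", "A", "B", "C"]
y = [10.0, 4.0, 7.0, 12.0, 6.0, 9.0]

others, rows = dummy_code(levels, base="B")   # m = 3 levels, so 2 dummy columns

group_mean = {lv: mean(v for v, l in zip(y, levels) if l == lv) for lv in set(levels)}
beta0 = group_mean["B"]                                   # mean of the base level
betas = {o: group_mean[o] - beta0 for o in others}        # difference from the base mean
print(others, rows[0], beta0, betas)
```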

1. Backward elimination (Backward stepwise)

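Stepwise procedures can commit Type I errors (including some unimportant independent variables in the model) or Type II errors (eliminating some important independent variables).

$R_a^2$: the adjusted coefficient of determination penalizes the inclusion of useless predictors.

Regression selection criteria: $R^2$, $R_a^2$, and $C_p$. We want the smallest $C_p$.

$R^2 = SSR/SST = 1 - SSE/SST = 1 - MSE\,(n-k-1)/SST$

$R_a^2 = 1 - (n-1)(MSE/SST) = 1 - \frac{n-1}{n-k-1}(1 - R^2)$

$C_p = SSE_k/MSE_L + 2(k+1) - n$, where $SSE_k$ is the error sum of squares of the candidate model with k predictors and $MSE_L$ is the mean square error of the largest model considered.

$R^2 \ge R_a^2$, and for poor-fitting models $R_a^2$ may be negative. Choose the model with the highest adjusted R squared.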
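The relationship between $R^2$ and the adjusted $R^2$ can be checked numerically. This is a minimal sketch; the function name `fit_statistics` and the input values are assumptions for illustration.

```python
def fit_statistics(sse, sst, n, k):
    """Return (MSE, R squared, adjusted R squared) for a model with k predictors."""
    mse = sse / (n - k - 1)                 # error mean square
    r2 = 1 - sse / sst                      # share of variability explained
    r2_adj = 1 - (n - 1) * mse / sst        # penalizes extra predictors
    return mse, r2, r2_adj

# Hypothetical sums of squares, for illustration only
mse, r2, r2_adj = fit_statistics(sse=20.0, sst=100.0, n=25, k=4)
print(mse, r2, r2_adj)

# The second form of the adjusted R squared gives the same value
alt = 1 - (25 - 1) / (25 - 4 - 1) * (1 - r2)
```

The adjusted value is never larger than $R^2$, and it drops when a predictor adds too little explanatory power to offset the lost degree of freedom.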

If we reject the null hypothesis, there is enough evidence to support that at least one of the coefficients is nonzero. The overall model appears to be statistically useful for predicting y.

If we cannot reject the null hypothesis, there is not enough evidence to support that at least one of the coefficients is nonzero. The overall model does not appear to be statistically useful for predicting y.

In k-fold cross-validation, all observations in the dataset are eventually used for both training and testing, and each observation is used for validation exactly once.
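The "validation exactly once" property can be seen in a short sketch of fold assignment. The function name `k_fold_indices` is hypothetical, and for simplicity the records are assigned to folds in order rather than shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs; each index appears in exactly one validation fold."""
    folds = [list(range(i, n, k)) for i in range(k)]   # fold i holds i, i+k, i+2k, ...
    for i, valid in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, valid

splits = list(k_fold_indices(10, 5))
for train_idx, valid_idx in splits:
    print(valid_idx, train_idx)
```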

The idea in k-nearest neighbor methods is to identify k records in the training dataset that are similar to the new record that we wish to classify. We then use these similar (neighboring) records to classify the new record into a class, assigning the new record to the predominant class among these neighbors.

If we choose k too low, we may be fitting to the noise of the data.

If we choose k too high, we will miss out on the method's ability to capture the local structure in the data.

If k = n, we simply assign all records to the majority class in the training data.

Typically, values of k fall in the range of 1-20.

An odd number is normally chosen to avoid ties.

We partition the data into training data and validation data. For example, 18 data points for the training data and 6 data points for the validation data.

Use the training data to classify the records in the validation data, then compute the error rates for various choices of k.

Perform k-fold cross-validation and record the error rate for various choices of k.
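The procedure for choosing k can be sketched as follows. This is an illustrative implementation that assumes Euclidean distance and a simple majority vote; the function names and the tiny dataset are made up, and ties are broken by whichever class is encountered first among the neighbors.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, new_x, k):
    """Assign new_x to the predominant class among its k nearest training records."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], new_x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def validation_error(train_X, train_y, valid_X, valid_y, k):
    """Share of validation records that are misclassified for a given k."""
    wrong = sum(knn_classify(train_X, train_y, x, k) != y
                for x, y in zip(valid_X, valid_y))
    return wrong / len(valid_X)

# Hypothetical data, for illustration only
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (6, 7), (7, 6)]
train_y = ["a", "a", "a", "b", "b", "b"]
valid_X = [(2, 2), (7, 7)]
valid_y = ["a", "b"]

errors = {k: validation_error(train_X, train_y, valid_X, valid_y, k) for k in (1, 3, 5)}
print(errors)
```

The k with the lowest validation error would be the one chosen; with real data the errors typically differ across k.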

Notes - The default cutoff value is 0.5, but it can be set differently. k-NN can be used to classify a response with more than two classes. It can also be applied to a numerical response.

Steps

The first step, determining neighbors by computing distances, remains unchanged.

The second step is modified: we take the average response value of the k nearest neighbors as the prediction.

The best k is chosen using a measure other than the misclassification rate, such as the root mean squared error (RMSE).
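For a numerical response, the two steps might look like the sketch below. Euclidean distance and RMSE as the performance measure are assumptions for illustration, as are the function names and data.

```python
import math

def knn_predict(train_X, train_y, new_x, k):
    """Predict a numerical response as the average over the k nearest neighbors."""
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], new_x))[:k]   # step 1: distances
    return sum(y for _, y in neighbors) / k                              # step 2: average

def rmse(actual, predicted):
    """Root mean squared error, one possible measure for comparing values of k."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical data, for illustration only
train_X = [(1,), (2,), (3,), (10,)]
train_y = [1.0, 2.0, 3.0, 10.0]
prediction = knn_predict(train_X, train_y, (2.2,), k=3)
print(prediction)
```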

Advantages - It is simple. It does not require parametric assumptions. It performs well with a large enough training set.

Shortcomings - The time to find the nearest neighbors in a large training set can be prohibitive. The number of records required in the training set to qualify as large increases exponentially with the number of predictors p.

k-NN only works with quantitative variables.

Misclassification rate (error) = number of wrongly classified records divided by the total number of records = (b+c)/(a+b+c+d)

Accuracy = 1 − error = (a+d)/(a+b+c+d)
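A quick numeric check of the two formulas, where a and d are the correctly classified counts and b and c are the misclassified counts of the confusion matrix. The function name and the counts are hypothetical.

```python
def classification_rates(a, b, c, d):
    """Return (error, accuracy) from the four cells of a 2x2 confusion matrix."""
    total = a + b + c + d
    error = (b + c) / total        # misclassified records over all records
    accuracy = (a + d) / total     # correctly classified records over all records
    return error, accuracy

# Hypothetical counts, for illustration only
error, accuracy = classification_rates(a=50, b=10, c=5, d=35)
print(error, accuracy)
```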

If the dataset is too small, partitioning it may yield unstable results.

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

If the p-value is greater than $\alpha$ (the level of significance), we fail to reject H0; if it is less than $\alpha$, we reject H0.

1-Residuals = The residuals are the differences between the actual values of the variable you're predicting and the predicted values.

2-Significance Stars = Shorthand for significance levels.

3-Estimated Coefficient = The estimated coefficient is the value of the slope calculated by the regression.

4-Standard Error of the Coefficient Estimate = Measure of the variability in the estimate for the coefficient. Lower means better, but this number is relative to the value of the coefficient.

5-T value of coefficient estimate = Score that measures whether or not the coefficient for this variable is meaningful for the model. You probably won't use this value itself, but know that it is used to calculate the p-value and the significance levels.

6-Variable P Value = Probability of seeing a t value at least this extreme if the true coefficient were zero (that is, if the variable were NOT relevant). You want this number to be as small as possible.

8-Residual Std Error/Degrees of Freedom = The Residual Std Error is just the standard deviation of your residuals. The Degrees of Freedom is the difference between the number of observations included in your training sample and the number of variables used in your model (the intercept counts as a variable).

9-R2 = Metric for evaluating the goodness of fit of your model. Higher is better, with 1 being the best. Corresponds to the amount of variability in what you're predicting that is explained by the model.

10-F-Statistic & resulting p-value = Performs an F-test on the model. This takes the parameters of our model (in our case we only have 1) and compares it to a model that has fewer parameters. In theory the model with more parameters should fit better. If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value (the improvement is NOT significant). If the model with more parameters is better than the model with fewer parameters, you will have a lower p-value.

N=number of records

CP is the complexity parameter. Any split that does not decrease the overall lack of fit by a factor of CP is not attempted (default is 0.01).

Relative Error is found from the whole dataset and it is always the same for different runs.

Xerror can change for different runs

Error rate (whole dataset) = root node error × relative error

Root node error = (number misclassified at the root, i.e., the smaller of the class counts) / (total records)

10-fold CV error rate = root node error × xerror
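The conversion from the relative figures to absolute error rates is simple arithmetic, sketched here with made-up numbers. The function name and all input values are hypothetical, and the sketch assumes a two-class problem in which the root node predicts the larger class.

```python
def tree_error_rates(n_total, n_minority, rel_error, xerror):
    """Convert relative error and xerror into absolute error rates."""
    root_node_error = n_minority / n_total            # share misclassified before any split
    whole_data_error = root_node_error * rel_error    # error rate on the whole dataset
    cv_error = root_node_error * xerror               # cross-validated error rate
    return root_node_error, whole_data_error, cv_error

# Hypothetical values, for illustration only
root, whole, cv = tree_error_rates(n_total=200, n_minority=80, rel_error=0.5, xerror=0.6)
print(root, whole, cv)
```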

Prune Tree

Step 1: Find the best subtree of each size (1, 2, 3, …).

Step 2: Pick the tree in the sequence that gives the smallest misclassification error in the validation set.

The idea behind pruning is to recognize that a very large tree is likely to be overfitting the training data and that the weakest branches, which hardly reduce the error rate, should be removed.

Regression Tree

The tree method can also be used for a numerical response variable (a regression tree). Both the principle and the procedure are the same. There are three details that are different from the classification tree.

(i) Prediction

(ii) Impurity measures

(iii) Evaluating performance

Logistic Regression

A logistic regression model explains the relationship between a binary response and predictors using a logit link function.

Y is used to represent the binary response.

P(Y = 1) or p is the probability of belonging to class 1

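$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$

$$\text{Odds} = \frac{p}{1-p} = e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}$$

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k}} = \frac{\text{Odds}}{1 + \text{Odds}}$$

If $x_j$ increases by 1 unit, then the odds change by $(e^{\beta_j} - 1)(100)\%$ (holding all other predictors constant).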

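Odds ratio

$$\frac{e^{\beta_0 + \dots + \beta_{CD}(1) + \dots + \beta_k x_k}}{e^{\beta_0 + \dots + \beta_{CD}(0) + \dots + \beta_k x_k}} = e^{\beta_{CD}}$$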
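The links between the linear predictor, the odds, and the probability can be traced in a short sketch. The function names and coefficient values are hypothetical; the final lines check that the odds ratio for a 0/1 predictor equals e raised to its coefficient.

```python
import math

def logistic_probability(betas, xs):
    """betas = [b0, b1, ..., bk], xs = [x1, ..., xk]; return (p, odds)."""
    linear = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))   # logit(p)
    odds = math.exp(linear)
    return odds / (1 + odds), odds

def percent_change_in_odds(beta_j):
    """Percent change in the odds when x_j increases by one unit."""
    return (math.exp(beta_j) - 1) * 100

# Hypothetical coefficients, for illustration only
betas = [-1.0, 0.7]
p1, odds1 = logistic_probability(betas, [1])   # dummy predictor equal to 1
p0, odds0 = logistic_probability(betas, [0])   # dummy predictor equal to 0
odds_ratio = odds1 / odds0                     # should match exp(0.7)
print(round(odds_ratio, 4), round(percent_change_in_odds(0.7), 2))
```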

Association Rules

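Support

$$s = \frac{\text{no. of transactions that include both condition and result item sets}}{\text{total number of records}}$$

Confidence

$$c = \frac{\text{no. of transactions that include both condition and result item sets}}{\text{no. of transactions with condition item sets}} = P(\text{result} \mid \text{condition}) = \frac{P(\text{condition and result})}{P(\text{condition})}$$

Lift ratio

$$\text{Lift ratio} = \frac{\text{confidence}}{P(\text{result})} = \frac{P(\text{condition and result})}{P(\text{condition})\,P(\text{result})}$$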

A lift ratio greater than 1.0 suggests that there is some usefulness to the rule: the level of association between the condition and result item sets is higher than would be expected if they were independent. The larger the lift ratio, the greater the strength of the association.

The support indicates its impact in terms of overall size. If only a small number of transactions are affected, the rule may be of little use (unless the consequent is very valuable and/or the rule is very efficient in finding it).
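Support, confidence, and lift ratio can be computed directly from transaction counts, as in this sketch. The function name `rule_metrics` and the five transactions are invented for illustration; each transaction is represented as a Python set of items.

```python
def rule_metrics(transactions, condition, result):
    """Return (support, confidence, lift ratio) for the rule condition -> result."""
    n = len(transactions)
    n_cond = sum(condition <= t for t in transactions)             # contain the condition set
    n_res = sum(result <= t for t in transactions)                 # contain the result set
    n_both = sum((condition | result) <= t for t in transactions)  # contain both sets
    support = n_both / n
    confidence = n_both / n_cond
    lift = confidence / (n_res / n)       # confidence relative to P(result)
    return support, confidence, lift

# Hypothetical transactions, for illustration only
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"butter"},
]
support, confidence, lift = rule_metrics(transactions, {"bread"}, {"milk"})
print(support, round(confidence, 3), round(lift, 3))
```

In this toy example the lift ratio is slightly above 1.0, so bread and milk appear together a little more often than independence would predict.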

Cluster Analysis

$d_{ij}$ is a distance metric, or dissimilarity measure, between records i and j.

$(x_{i1}, x_{i2}, \dots, x_{ip})$ is the vector of p measurements for record i.

$(x_{j1}, x_{j2}, \dots, x_{jp})$ is the vector of p measurements for record j.

The following properties are required.

Nonnegative: $d_{ij} \ge 0$.

Self-Proximity: $d_{ii} = 0$.

Symmetry: $d_{ij} = d_{ji}$.

Triangle Inequality: $d_{ij} \le d_{ik} + d_{kj}$.
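As an illustration, the four properties can be checked numerically for a candidate distance function. Euclidean distance is used here as an assumed example of a metric (the notes above do not name one), and the helper `satisfies_metric_properties` and the sample records are hypothetical. A small tolerance absorbs floating-point rounding.

```python
import math
from itertools import product

def euclidean(xi, xj):
    """Euclidean distance between two records with p measurements each."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def satisfies_metric_properties(d, records, tol=1e-12):
    """Check the four required properties on every triple of the given records."""
    for xi, xj, xk in product(records, repeat=3):
        if d(xi, xj) < 0:                              # nonnegative
            return False
        if d(xi, xi) != 0:                             # self-proximity
            return False
        if abs(d(xi, xj) - d(xj, xi)) > tol:           # symmetry
            return False
        if d(xi, xj) > d(xi, xk) + d(xk, xj) + tol:    # triangle inequality
            return False
    return True

# Hypothetical records, for illustration only
records = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 7.0)]
print(euclidean(records[0], records[1]), satisfies_metric_properties(euclidean, records))
```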

2

.

std deviation

Interpretation

We explore the characteristics of each cluster by

a. Obtaining summary statistics from each cluster on each measurement that was used in the cluster analysis

b. Examining the clusters for the presence of some common feature (variable) that was not used in the cluster analysis

c. Cluster labeling: based on the interpretation, trying to assign a name or label to each cluster.

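AIC = L + 2k, where L is the residual deviance (−2 × log-likelihood) and k = # of parameters. The residual deviance is usually given in the output and already includes the factor of 2, so it is not doubled again.

Lower AIC and BIC are better.

AIC reflects how well the fitted estimates maximize the chance of obtaining the observed data.

AIC gives a penalty for a higher # of predictors.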
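A brief sketch of the AIC arithmetic, assuming L is the residual deviance reported for a logistic regression on ungrouped binary data (so that it equals −2 times the log-likelihood). The function name and deviance values are hypothetical.

```python
def aic(residual_deviance, k):
    """AIC = L + 2k, with L the residual deviance and k the number of parameters."""
    return residual_deviance + 2 * k

# Hypothetical models, for illustration only
small_model = aic(residual_deviance=104.0, k=3)   # fewer parameters, slightly worse fit
large_model = aic(residual_deviance=103.0, k=6)   # more parameters, slightly better fit
print(small_model, large_model)
```

Here the larger model reduces the deviance by only 1 while adding 3 parameters, so its AIC is higher and the smaller model would be preferred.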

- Holding the other variables constant, the (response variable) is more/less likely to be in class 1 if the variable is 1.

Root node error × xerror = cross-validation error rate

function of the combined set^then lowest of the new numbers after function

Splitting values: order the values from lowest to highest, then take the halfway point (midpoint) of each consecutive pair going down the list.

(Different Note)

Rel. Error = relative error, or misclassification, for the tree at that stage. To convert to absolute error, multiply by the root node error.

Yes, take left number add all..no takes right
