Concepts
• If the confidence interval for β1 is [0.042, 0.053], then for each $1,000 increase in television advertising, there will be an average increase in sales of between 42 and 53 units.
• Is at least one of the predictors X1,X2, . . . , Xp useful in predicting the response?
• That is, we test whether β1 = β2 = · · · = βp = 0. As in the simple linear regression setting, we use a hypothesis test to answer this question.
We test the null hypothesis,
• H0 : β1 = β2 = · · · = βp = 0, versus the alternative
• Ha : at least one βj is non-zero.
• This hypothesis test is performed by computing the F-statistic, F = [(TSS − RSS)/p] / [RSS/(n − p − 1)].
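A minimal sketch of this computation, using simulated data rather than the advertising data from the notes:

```python
import numpy as np

# Simulated regression with n = 100 observations and p = 3 predictors.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Fit OLS with an intercept column.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta

TSS = np.sum((y - y.mean()) ** 2)
RSS = np.sum(resid ** 2)

# F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]
F = ((TSS - RSS) / p) / (RSS / (n - p - 1))
print(round(F, 1))
```

With a real fit, F would be compared against the F(p, n − p − 1) distribution to obtain a p-value.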
• Adjusted R² and RSE penalize model size: since RSE = √(RSS/(n − p − 1)), models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.
• Interpretation with dummy variables: the average credit card debt for males is estimated to be $509.80, whereas females are estimated to carry $19.73 in additional debt, for a total of $509.80 + $19.73 = $529.53. Hands-on with the Dummy() function in the DescTools package.
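DescTools::Dummy() is an R helper; as a language-neutral sketch of the same idea, a regression on a single 0/1 dummy recovers the two group means. The figures below are simulated to mimic the credit example:

```python
import numpy as np

# Simulate debts: males ~ $509.80 on average, females ~ $19.73 more.
rng = np.random.default_rng(1)
n = 400
female = rng.integers(0, 2, size=n)     # dummy: 1 = female, 0 = male
debt = 509.80 + 19.73 * female + rng.normal(0, 5, size=n)

# OLS on [intercept, dummy].
X = np.column_stack([np.ones(n), female])
b0, b1 = np.linalg.lstsq(X, debt, rcond=None)[0]

# b0 estimates the male mean; b0 + b1 estimates the female mean.
print(round(b0, 2), round(b0 + b1, 2))
```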
• There is a simple remedy for symptoms of heteroscedasticity: fit the model by weighted least squares, with weights proportional to the inverse variances, i.e. wi = ni in this case.
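A sketch of this setup: each response y_i is the average of n_i raw observations, so Var(ε_i) = σ²/n_i and the appropriate weights are w_i = n_i. Simulated data, closed-form WLS:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200
ni = rng.integers(1, 50, size=m)             # group sizes n_i
x = rng.uniform(0, 10, size=m)
# y_i is an average of n_i raw responses: noise sd shrinks as 1/sqrt(n_i)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=m) / np.sqrt(ni)

X = np.column_stack([np.ones(m), x])
W = np.diag(ni.astype(float))                # weights proportional to n_i

# Closed-form WLS: beta = (X'WX)^{-1} X'Wy
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(np.round(beta, 2))
```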
• Collinearity inflates the standard errors of the coefficient estimates, which widens their confidence intervals and makes interpretation more uncertain. Since the t-statistic divides each estimate by its standard error, collinearity causes the t-statistic to decline. As a result, in the presence of collinearity, we may fail to reject H0 : βj = 0.
• It is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity. Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF).
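A common rule of thumb is that a VIF above 5 or 10 signals problematic collinearity. A minimal sketch of computing VIF from its definition, VIF_j = 1/(1 − R_j²), on simulated data where X2 is nearly a linear combination of X0 and X1:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
X0 = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = X0 + X1 + rng.normal(0, 0.1, size=n)   # nearly a linear combination
X = np.column_stack([X0, X1, X2])

def vif(X, j):
    # Regress predictor j on all the others; VIF_j = 1 / (1 - R_j^2).
    others = np.delete(X, j, axis=1)
    Xd = np.column_stack([np.ones(len(X)), others])
    beta, *_ = np.linalg.lstsq(Xd, X[:, j], rcond=None)
    resid = X[:, j] - Xd @ beta
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

print([round(vif(X, j), 1) for j in range(3)])
```

No pairwise correlation here needs to be extreme, yet every VIF is large, which is exactly the multicollinearity case described above.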
• When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic
variables from the regression. The second solution is to combine the collinear variables together into a single predictor
• For logistic regression, if β1 = 0.0055, a one-unit increase in balance is associated with an increase in the log odds of default of 0.0055 units; equivalently, the odds are multiplied by e^0.0055.
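A quick numerical check of this interpretation; the intercept below is illustrative (roughly the value reported for the ISLR Default data) and should be treated as an assumption:

```python
import math

b0, b1 = -10.65, 0.0055     # illustrative intercept and slope

def prob_default(balance):
    # Logistic model: p = 1 / (1 + exp(-(b0 + b1 * balance)))
    z = b0 + b1 * balance
    return 1 / (1 + math.exp(-z))

odds = lambda p: p / (1 - p)
p1, p2 = prob_default(1000), prob_default(1001)

# One-unit increase in balance multiplies the odds by exp(b1).
print(round(odds(p2) / odds(p1), 4))
```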
Selection methods vs Shrinkage Methods
• The tuning parameters for ridge and lasso vary over a continuous range, while selection methods give a one-stop solution, which may be sub-optimal.
• The lasso sets some coefficients exactly to zero; ridge only shrinks their magnitude, making them less important; selection methods drop variables outright.
• Within selection methods, backward selection does not work for singular covariance matrices. In other words, when the number of variables p is very large and exceeds the number of observations n, the full model cannot be fit, so backward selection cannot even begin. For a detailed understanding, a sound knowledge of linear algebra is required.
• Elastic-net models strike a path between lasso and ridge. A common industry practice is to run a sensitivity analysis over the mixing parameter alpha ∈ {0 (ridge), 0.2, 0.4, 0.6, 0.8, 1 (lasso)}.
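The alpha above is glmnet's mixing parameter; scikit-learn calls the same quantity l1_ratio (its own alpha is the overall penalty strength). A sketch of the sensitivity sweep on simulated data, counting how many coefficients each mix zeroes out:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=n)   # only 2 true signals

zeros = {}
for l1_ratio in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]:
    # l1_ratio = 0 is pure ridge (sklearn warns to prefer Ridge there;
    # fine for a sketch), l1_ratio = 1 is pure lasso.
    model = ElasticNet(alpha=0.5, l1_ratio=l1_ratio).fit(X, y)
    zeros[l1_ratio] = int(np.sum(model.coef_ == 0))
    print(l1_ratio, zeros[l1_ratio])
```

As the lasso share grows, more coefficients are set exactly to zero, which is the contrast with ridge described in the bullet above.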
Clustering
• Shape of Clusters
K-means works well when the clusters are hyper-spherical (circular in 2 dimensions). If the natural clusters occurring in the dataset are non-spherical, K-means is probably not a good choice.
• Repeatability
K-means starts from a random choice of cluster centers, so it may yield different clustering results on different runs of the algorithm; the results may not be repeatable and can lack consistency. Hierarchical clustering, by contrast, is deterministic: the same data always yield the same clustering.
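A sketch showing how fixing the seed (and using multiple restarts) makes K-means repeatable in practice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two simulated blobs.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Same seed and data: the random initialization is reproduced exactly.
km1 = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
km2 = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(bool(np.array_equal(km1.labels_, km2.labels_)))
```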
• Divisive vs agglomerative: the former is preferred when you need broader groups/segments, and hence proves to be less complex and more efficient. The latter is less preferred in business domains because it is often redundant to study how individual observations (e.g. customers) are grouped together one by one.
Clustering Contd….
• Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the
elements in cluster 1 and the elements in the second cluster and considers the largest value (i.e.,
maximum value) of these dissimilarities as the distance between the two clusters. It tends to
produce more compact clusters.
• Minimum or single linkage clustering: It computes all pairwise dissimilarities between the
elements in cluster 1 and the elements in the second cluster and considers the smallest of these
dissimilarities as a linkage criterion. It tends to produce long, "loose" clusters.
• Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements
in cluster 1 and the elements in the second cluster and considers the average of these
dissimilarities as the distance between the two clusters.
• Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p) and the centroid for the second cluster.
• Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the
pair of clusters with minimum between-cluster distance are merged.
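The linkage rules above map directly onto the method argument of scipy.cluster.hierarchy.linkage; a sketch on two well-separated simulated blobs:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

labels = {}
for method in ["complete", "single", "average", "centroid", "ward"]:
    # ward and centroid require raw observations (Euclidean distance).
    Z = linkage(X, method=method)
    labels[method] = fcluster(Z, t=2, criterion="maxclust")
    print(method, len(set(labels[method])))
```

On clean, well-separated data all five rules recover the same two groups; their differences (compact vs long "loose" clusters) show up on harder shapes.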
Computing Distance in Clustering
Silhouette Coefficient
• Assume the data have been clustered, via any technique such as k-means, into k clusters. For each datum i, let a(i) be the average dissimilarity of i with all other data within the same cluster. Any measure of dissimilarity can be used, but distance measures are the most common. We can interpret a(i) as how well i is assigned to its cluster (the smaller the value, the better the assignment). We then define the average dissimilarity of point i to a cluster C as the average of the distances from i to the points in C.
• Let b(i) be the lowest average dissimilarity of i to any cluster of which i is not a member. The cluster with this lowest average dissimilarity is said to be the "neighbouring cluster" of i, because it is the next best fit for point i. The silhouette is then s(i) = (b(i) − a(i)) / max{a(i), b(i)}.
• For s(i) to be close to 1 we require a(i) ≪ b(i). As a(i) is a measure of how dissimilar i is to its own cluster, a small value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighbouring cluster. Thus an s(i) close to one means that the datum is appropriately clustered. If s(i) is close to negative one, then by the same logic i would be more appropriately placed in its neighbouring cluster. An s(i) near zero means that the datum is on the border of two natural clusters.
• The average of s(i) over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. Thus the average of s(i) over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k is used in the k-means algorithm, some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and averages may be used to determine the natural number of clusters within a dataset.
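A sketch of choosing k by the average silhouette, using scikit-learn on simulated data with three true groups:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated blobs around (0,0), (4,4), (8,8).
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

scores = {}
for k in range(2, 6):
    lab = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, lab)   # average s(i) over the dataset

best_k = max(scores, key=scores.get)
print(best_k)
```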
Fuzzy Clustering (fanny)
• Dunn’s coefficient ranges between 0 and 1. A low value of Dunn’s
coefficient indicates a very fuzzy clustering, whereas a value close to 1
indicates a near-crisp clustering.
• memb.exp (r) must be greater than 1, with 2 as the default. Larger r leads to fuzzier clusters, while values closer to 1 form crisper clusters.
E-M Algorithm
• The "expectation step": we treat the current values of the parameters as fixed and compute, for each observation, the probability (likelihood) that it belongs to each of the latent component distributions; these responsibilities are then used to recompute the expected log-likelihood.
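The E-step responsibilities described above are what scikit-learn's GaussianMixture (itself fit by EM) returns from predict_proba; a sketch on a simulated two-component mixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# 1-D mixture: two Gaussian components centered at -3 and +3.
rng = np.random.default_rng(8)
X = np.concatenate([rng.normal(-3, 1, 200),
                    rng.normal(3, 1, 200)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
resp = gm.predict_proba(X)   # per-observation membership probabilities

# Each row is a soft assignment to the two latent components and sums to 1.
print(bool(np.allclose(resp.sum(axis=1), 1.0)))
```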
• LDA assumes that the observations are drawn from a Gaussian distribution with a common
covariance matrix in each class, and so can provide some improvements over logistic
regression when this assumption approximately holds. Conversely, logistic regression can
outperform LDA if these Gaussian assumptions are not met
• KNN is a completely non-parametric approach: no assumptions are made about the shape of
the decision boundary. Therefore, we can expect this approach to dominate LDA and logistic
regression when the decision boundary is highly non-linear. On the other hand, KNN does not
tell us which predictors are important; we don’t get a table of coefficients
Continued….
• Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of
problems than can the linear methods. Though not as flexible as KNN, QDA can perform
better in the presence of a limited number of training observations because it does make
some assumptions about the form of the decision boundary.
• When the true decision boundaries are linear, then the LDA and logistic regression
approaches will tend to perform well. When the boundaries are moderately non-linear, QDA
may give better results. Finally, for much more complicated decision boundaries, a non-
parametric approach such as KNN can be superior. But the level of smoothness for a non-
parametric approach must be chosen carefully
• we could create a more flexible version of logistic regression, non-linear logistic regression, by
including X^2, X^3, and even X^4 as predictors. We could do the same for LDA. If we added all
possible quadratic terms and cross-products to LDA, the form of the model would be the
same as the QDA model, although the parameter estimates would be different. This model
lies between LDA & QDA, since we have polynomials of x but covariance matrix is assumed to
be common.
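A sketch of the LDA-vs-QDA trade-off: when the two classes share a mean but differ in covariance, LDA's linear boundary cannot separate them while QDA's quadratic boundary can (simulated data):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(9)
n = 500
X0 = rng.normal(0, 1.0, (n, 2))   # class 0: tight blob
X1 = rng.normal(0, 3.0, (n, 2))   # class 1: same mean, wider spread
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# LDA assumes a common covariance matrix; QDA estimates one per class.
lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)
print(round(lda_acc, 2), round(qda_acc, 2))
```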
Transfer Learning