
• Naïve estimate for a basic regression model: ŷ = ȳ, the sample mean of the response.

• β1 is the slope—the average change in Y associated with a one-unit change in X.


• The 95% confidence interval for β1 approximately takes the form β1 ± 1.96 · SE(β1), computed from the estimated slope and its standard error. The estimate varies from sample to sample, but roughly 95% of intervals constructed this way will contain the true β1.

• If the 95% confidence interval for β1 is [0.042, 0.053], then for each $1,000 increase in television advertising there will be an average increase in sales of between 42 and 53 units.
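A minimal R sketch of extracting this interval, assuming a hypothetical data frame advertising with columns sales and TV:

```r
# Fit a simple linear regression and extract a 95% CI for the slope.
# 'advertising' (columns sales, TV) is a hypothetical data frame.
fit <- lm(sales ~ TV, data = advertising)

coef(summary(fit))["TV", ]        # estimate, std. error, t value, p value
confint(fit, "TV", level = 0.95)  # approx. estimate +/- 1.96 * SE
```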
• Is at least one of the predictors X1,X2, . . . , Xp useful in predicting the response?
• i.e. whether β1 = β2 = · · · = βp = 0. As in the simple linear regression setting, we use a hypothesis test to answer this question.
We test the null hypothesis,
• H0 : β1 = β2 = · · · = βp = 0, versus the alternative
• Ha : at least one βj is non-zero.
• This hypothesis test is performed by computing the F-statistic, F = [(TSS − RSS)/p] / [RSS/(n − p − 1)].
• Adjusted R² and RSE: since RSE = √(RSS/(n − p − 1)), models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p; adjusted R² penalizes extra variables in the same spirit.
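The F-statistic, RSE and adjusted R² can be read off an lm() fit, or recomputed by hand from TSS and RSS; the advertising data frame and its columns remain hypothetical:

```r
# Multiple regression: test H0 that all slopes are zero via the F-statistic.
fit <- lm(sales ~ TV + radio + newspaper, data = advertising)  # hypothetical columns
s   <- summary(fit)

s$fstatistic        # F value and its degrees of freedom (p, n - p - 1)
s$sigma             # RSE = sqrt(RSS / (n - p - 1))
s$adj.r.squared     # adjusted R^2

# Recompute F by hand from TSS and RSS
y   <- advertising$sales
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
p   <- length(coef(fit)) - 1
n   <- length(y)
(tss - rss) / p / (rss / (n - p - 1))
```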
• Interpretation with dummy variables: the average credit card debt for males is estimated to be $509.80, whereas females are estimated to carry $19.73 in additional debt, for a total of $509.80 + $19.73 = $529.53. Hands-on with the Dummy() function in the DescTools package (see the sketch below).
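A hedged sketch of the hands-on step, assuming a hypothetical credit data frame with balance and gender columns, and that DescTools::Dummy() produces 0/1 treatment coding as documented:

```r
# Dummy coding by hand vs. letting lm() handle a factor directly.
library(DescTools)

d <- Dummy(credit$gender)          # 0/1 indicator column(s) for the factor levels
head(d)

fit <- lm(balance ~ gender, data = credit)
coef(fit)  # intercept = baseline group mean; second coefficient = additional debt
           # (e.g. ~509.80 and ~19.73 in the example above)
```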
• There is a simple remedy for symptoms of heteroscedasticity: fit the model by weighted least squares, with weights proportional to the inverse variances (e.g. wi = ni when the ith response is an average of ni raw observations).
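A minimal weighted-least-squares sketch, assuming a hypothetical data frame dat with columns y, x and n_i (the number of raw observations behind each averaged response):

```r
# Weighted least squares: weights proportional to the inverse variances, here w_i = n_i.
fit_wls <- lm(y ~ x, data = dat, weights = n_i)
summary(fit_wls)
```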
• Collinearity inflates the standard errors of the coefficient estimates, which widens their confidence intervals and makes interpretation more uncertain. It also shrinks the t-statistics, so in the presence of collinearity we may fail to reject H0 : βj = 0.
• It is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity. Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF); see the sketch below.
• When faced with the problem of collinearity, there are two simple solutions. The first is to drop one of the problematic variables from the regression. The second is to combine the collinear variables into a single predictor.
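A sketch of both the diagnosis (VIF) and the first remedy, assuming the car package and a hypothetical credit data frame:

```r
# VIF_j = 1 / (1 - R^2 of x_j regressed on the other predictors).
# A common rule of thumb flags VIF values above 5 or 10.
library(car)                                               # provides vif()

fit <- lm(balance ~ age + limit + rating, data = credit)   # hypothetical columns
vif(fit)

# Remedy 1: drop one of the problematic variables.
fit_drop <- lm(balance ~ age + limit, data = credit)
```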
• For logistic regression, if β1 = 0.0055, a one-unit increase in balance is associated with an increase in the log odds of default of 0.0055 units (see the sketch below).
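A short sketch of reading the coefficient on the log-odds scale, assuming the Default data from the ISLR package:

```r
# Logistic regression: coefficients are on the log-odds scale.
library(ISLR)

fit <- glm(default ~ balance, data = Default, family = binomial)

coef(fit)["balance"]        # ~0.0055: change in log odds per one-unit increase in balance
exp(coef(fit)["balance"])   # multiplicative change in the odds per unit increase
```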
Selection methods vs Shrinkage Methods
• The tuning parameters for ridge and lasso vary over a continuous range, whereas selection methods give a single discrete solution, which may be sub-optimal.
• The lasso shrinks some coefficients exactly to zero; ridge only reduces their magnitude, making them less influential; selection methods drop variables outright.
• Within selection methods, backward selection does not work when the design matrix is singular: when all variables are taken together and their number is very large (possibly exceeding the number of observations), the full least-squares fit that backward selection starts from cannot be computed. For a detailed understanding, a sound knowledge of linear algebra is required; a minimal sketch follows.
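A backward-selection sketch with the leaps package, on a hypothetical data frame dat with response y (it requires n > p so the full model can be fit first):

```r
# Backward stepwise selection: start from the full model and drop one variable at a time.
library(leaps)

fit_bwd <- regsubsets(y ~ ., data = dat, nvmax = 10, method = "backward")
summary(fit_bwd)$adjr2   # adjusted R^2 of the best model of each size
```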
• Elastic-net models strike a balance between lasso and ridge. A common industry practice is to show a sensitivity analysis over α ∈ {0 (ridge), 0.2, 0.4, 0.6, 0.8, 1 (lasso)}, as in the sketch below.
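A sketch of that sensitivity analysis with glmnet (alpha = 0 is ridge, alpha = 1 is lasso), on a hypothetical data frame dat with response y:

```r
# Elastic net: sweep alpha over a grid, choosing lambda by cross-validation each time.
library(glmnet)

x <- model.matrix(y ~ . - 1, data = dat)   # predictor matrix (hypothetical data)
y <- dat$y

alphas <- c(0, 0.2, 0.4, 0.6, 0.8, 1)
cv_err <- sapply(alphas, function(a) {
  cv <- cv.glmnet(x, y, alpha = a)         # for a strict comparison, reuse the same foldid
  min(cv$cvm)                              # best CV error for this alpha
})
data.frame(alpha = alphas, cv_error = cv_err)
```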
Clustering
• Shape of Clusters
K-means works well when the clusters are hyper-spherical (circular in two dimensions). If the natural clusters in the dataset are non-spherical, K-means is probably not a good choice.
• Repeatability
K-means starts with a random choice of cluster centers, therefore it may yield different
clustering results on different runs of the algorithm. Thus, the results may not be
repeatable and lack consistency. However, with hierarchical clustering, you will most
definitely get the same clustering results.
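A sketch of making K-means reproducible in R (fixed seed, multiple random starts), with dat a hypothetical numeric data frame:

```r
# Fix the seed and keep the best of several random starts.
set.seed(123)
km <- kmeans(scale(dat), centers = 3, nstart = 25)
km$cluster   # reproducible given the seed; nstart = 25 keeps the best of 25 starts
```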
• Divisive vs agglomerative: the former (top-down) is preferred when you need broader groups/segments, and in that setting is less complex and more efficient. The latter (bottom-up) is less preferred in a business context because studying how individual customers (observations) are grouped one by one is often redundant. A sketch of both appears below.
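A sketch of both strategies with the cluster package (agnes for agglomerative, diana for divisive), again on a hypothetical numeric data frame dat:

```r
# Agglomerative (bottom-up) vs divisive (top-down) hierarchical clustering.
library(cluster)

d   <- dist(scale(dat))
agg <- agnes(d, method = "ward")    # agglomerative
div <- diana(d)                     # divisive
cutree(as.hclust(agg), k = 4)       # cut either tree into k broader segments
```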
Clustering Contd….
• Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the
elements in cluster 1 and the elements in the second cluster and considers the largest value (i.e.,
maximum value) of these dissimilarities as the distance between the two clusters. It tends to
produce more compact clusters.
• Minimum or single linkage clustering: It computes all pairwise dissimilarities between the
elements in cluster 1 and the elements in the second cluster and considers the smallest of these
dissimilarities as a linkage criterion. It tends to produce long, "loose" clusters.
• Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements
in cluster 1 and the elements in the second cluster and considers the average of these
dissimilarities as the distance between the two clusters.
• Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a
mean vector of length p variables) and the centroid for the second cluster.
• Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step the
pair of clusters with minimum between-cluster distance are merged.
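All of these linkages map onto the method argument of hclust(); a sketch on a hypothetical numeric data frame dat:

```r
# The linkage criterion is the 'method' argument of hclust().
d <- dist(scale(dat))

hc_complete <- hclust(d, method = "complete")  # maximum linkage
hc_single   <- hclust(d, method = "single")    # minimum linkage
hc_average  <- hclust(d, method = "average")   # mean linkage
hc_centroid <- hclust(d, method = "centroid")  # centroid linkage (usually with squared Euclidean distances)
hc_ward     <- hclust(d, method = "ward.D2")   # Ward's minimum variance
plot(hc_complete)                              # dendrogram
```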
Computing Distance in Clustering
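A sketch of the usual distance options in R, assuming a hypothetical numeric data frame dat (scaling the variables first usually matters):

```r
# Pairwise dissimilarities are the input to hierarchical clustering and the silhouette.
x     <- scale(dat)
d_euc <- dist(x, method = "euclidean")
d_man <- dist(x, method = "manhattan")
d_cor <- as.dist(1 - cor(t(x)))      # correlation-based dissimilarity between observations
```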
Silhouette Coefficient
• Assume the data have been clustered via any technique, such as k-means, into k clusters. For each datum i, let a(i) be the average dissimilarity of i with all other data within the same cluster. Any measure of dissimilarity can be used, but distance measures are the most common. We can interpret a(i) as how well i is assigned to its cluster (the smaller the value, the better the assignment). We then define the average dissimilarity of point i to a cluster C as the average of the distances from i to the points in C.
• Let b(i) be the lowest average dissimilarity of i to any other cluster of which i is not a member. The cluster with this lowest average dissimilarity is said to be the "neighbouring cluster" of i, because it is the next best fit cluster for point i. The silhouette is s(i) = (b(i) − a(i)) / max{a(i), b(i)}.
• For s(i) to be close to 1 we require a(i) ≪ b(i). As a(i) is a measure of how dissimilar i is to its own cluster, a small value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighbouring cluster. Thus an s(i) close to one means that the datum is appropriately clustered. If s(i) is close to negative one, then by the same logic i would be more appropriately placed in its neighbouring cluster. An s(i) near zero means that the datum is on the border of two natural clusters.
• The average s(i) over all data of a cluster measures how tightly grouped the data in that cluster are. Thus the average s(i) over all data of the entire dataset measures how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k is used in the k-means algorithm, some of the clusters will typically display much narrower silhouettes than the rest. Silhouette plots and averages may therefore be used to determine the natural number of clusters within a dataset.
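A sketch of using the average silhouette width to pick k, with cluster::silhouette and a hypothetical numeric data frame dat:

```r
# Average silhouette width for several k; the peak suggests a natural number of clusters.
library(cluster)

x <- scale(dat)
d <- dist(x)
avg_sil <- sapply(2:8, function(k) {
  km  <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)          # s(i) for every observation
  mean(sil[, "sil_width"])
})
data.frame(k = 2:8, avg_sil = avg_sil)
```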
Fuzzy Clustering (fanny)
• Dunn’s coefficient ranges between 0 and 1. A low value of Dunn’s
coefficient indicates a very fuzzy clustering, whereas a value close to 1
indicates a near-crisp clustering.
• memb.exp (r) is typically varied between 1 and 2, with 2 as the default. Larger r leads to fuzzier clusters, while values closer to 1 give crisper clusters (see the sketch below).
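A sketch with cluster::fanny on a hypothetical numeric data frame dat, showing where Dunn's coefficient and the memberships live:

```r
# Fuzzy clustering with fanny(); Dunn's coefficient summarizes how crisp the partition is.
library(cluster)

fz <- fanny(scale(dat), k = 3, memb.exp = 1.5)  # smaller memb.exp -> crisper clusters
fz$coeff                                         # Dunn's coefficient and its normalized version
head(fz$membership)                              # per-observation membership probabilities
```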
E-M Algorithm
• The "expectation step" (E-step): we treat the current values of the parameters as fixed and compute the expected log-likelihood. [A probability (likelihood-based responsibility) is computed for each observation, indicating the chance that the observation belongs to each latent component distribution.]

• The "maximization step" (M-step): (re)-computes new parameter values by maximizing the expected log-likelihood found in the E-step. We repeat until the values converge. [Given the probabilities from the E-step, we use them when maximizing the log-likelihood to re-estimate the parameters; see the sketch below.]
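A sketch of EM for a Gaussian mixture via the mclust package (one common implementation, not the only one), on a hypothetical numeric data frame dat:

```r
# Gaussian mixture fitted by EM; each observation gets a posterior probability per component.
library(mclust)

gm <- Mclust(dat)
gm$G                  # chosen number of latent components (selected by BIC)
head(gm$z)            # E-step output: posterior membership probabilities
gm$parameters$mean    # M-step output: re-estimated component means
```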
DTs
• Objective: maximize the purity (equivalently, minimize the impurity) of nodes.
• Top-down approach: identify attributes that reduce impurity; feature selection based on entropy (a measure of homogeneity) → information gain → split ratio (gain ratio) → Gini index (a direct measure of impurity).
• "→" indicates the flow of concepts from one measure to another.
• Tree pruning: pre-pruning (a chi-square statistic is used to check whether a candidate split has any significant distinguishing power) vs post-pruning (a bottom-up approach: remove branches whose nodes already have high purity). A sketch with rpart follows.
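A sketch with rpart on a hypothetical data frame dat with a class label churn, showing a pre-pruning control and cost-complexity post-pruning:

```r
# Classification tree: grow with mild pre-pruning controls, then post-prune on cp.
library(rpart)

tree <- rpart(churn ~ ., data = dat, method = "class",
              control = rpart.control(minsplit = 20, cp = 0.001))
printcp(tree)                                  # cross-validated error for each cp value
pruned <- prune(tree,
                cp = tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"])
```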
LDA vs QDA debate
• By assuming that the K classes share a common covariance matrix, the LDA
model becomes linear in x, which means there are Kp linear coefficients to
estimate. Consequently, LDA is a much less flexible classifier than QDA, and so
has substantially lower variance. This can potentially lead to improved prediction
performance.
• But there is a trade-off: if LDA’s assumption that the K classes share a common
covariance matrix is badly off, then LDA can suffer from high bias. Roughly
speaking, LDA tends to be a better bet than QDA if there are relatively few
training observations and so reducing variance is crucial.
• In contrast, QDA is recommended if the training set is very large, so that the
variance of the classifier is not a major concern, or if the assumption of a
common covariance matrix for the K classes is clearly untenable.
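A sketch fitting both on hypothetical train/test data frames with a factor column class:

```r
# LDA (shared covariance matrix) vs QDA (class-specific covariance matrices).
library(MASS)

fit_lda <- lda(class ~ ., data = train)
fit_qda <- qda(class ~ ., data = train)

mean(predict(fit_lda, test)$class == test$class)   # compare test accuracy
mean(predict(fit_qda, test)$class == test$class)
```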
Comparing techniques for classification
• Both logistic regression and LDA produce linear decision boundaries: each models the log odds, log(p1(x)/(1 − p1(x))), as a linear function of x, β0 + β1x for logistic regression and c0 + c1x for LDA. The only difference between the two approaches lies in the fact that β0 and β1 are estimated using maximum likelihood, whereas c0 and c1 are computed using the estimated mean and variance from a normal distribution.

• LDA assumes that the observations are drawn from a Gaussian distribution with a common
covariance matrix in each class, and so can provide some improvements over logistic
regression when this assumption approximately holds. Conversely, logistic regression can
outperform LDA if these Gaussian assumptions are not met
• KNN is a completely non-parametric approach: no assumptions are made about the shape of
the decision boundary. Therefore, we can expect this approach to dominate LDA and logistic
regression when the decision boundary is highly non-linear. On the other hand, KNN does not
tell us which predictors are important; we don’t get a table of coefficients
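A sketch with class::knn on hypothetical predictor matrices train_x/test_x and labels train_y/test_y:

```r
# KNN: purely non-parametric; the only tuning knob is k (plus distance/scaling of features).
library(class)

ctr <- scale(train_x)                                   # scale test data with training means/sds
tst <- scale(test_x, center = attr(ctr, "scaled:center"),
             scale  = attr(ctr, "scaled:scale"))
pred <- knn(train = ctr, test = tst, cl = train_y, k = 5)
mean(pred == test_y)                                    # accuracy; no coefficient table exists
```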
Continued….
• Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of
problems than can the linear methods. Though not as flexible as KNN, QDA can perform
better in the presence of a limited number of training observations because it does make
some assumptions about the form of the decision boundary.
• When the true decision boundaries are linear, then the LDA and logistic regression
approaches will tend to perform well. When the boundaries are moderately non-linear, QDA
may give better results. Finally, for much more complicated decision boundaries, a non-
parametric approach such as KNN can be superior. But the level of smoothness for a non-
parametric approach must be chosen carefully
• We could create a more flexible version of logistic regression, non-linear logistic regression, by
including X^2, X^3, and even X^4 as predictors. We could do the same for LDA. If we added all
possible quadratic terms and cross-products to LDA, the form of the model would be the
same as the QDA model, although the parameter estimates would be different. This model
lies between LDA & QDA, since we have polynomials of x but covariance matrix is assumed to
be common.
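A sketch of that more flexible logistic regression, adding polynomial terms with poly() on a hypothetical data frame dat with a 0/1 response y:

```r
# Non-linear logistic regression: include polynomial terms of the predictor.
fit_poly <- glm(y ~ poly(x, 3), data = dat, family = binomial)
summary(fit_poly)
```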
Transfer Learning
• One can also retrain the network by extending the architecture.
Multi-Task Learning
• Difference with softmax regression.
