
Statistics Summaries

Statistics: the science that deals with the collection, organization, classification, presentation and
interpretation of a set of data, aiming at a better understanding of the phenomenon that the data
represents.

Measurement Scales

Outliers

Upper outlier threshold = P75 + 1.5*Riq

Lower outlier threshold = P25 - 1.5*Riq

Interquartile range (Riq) = P75 - P25 = Q3 - Q1
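For illustration, a minimal sketch in Python (using NumPy, with hypothetical data) that computes the quartiles, the interquartile range and the two outlier thresholds defined above:

```python
import numpy as np

# Hypothetical sample of a metric variable
x = np.array([12, 15, 14, 10, 18, 22, 13, 16, 95, 11, 17, 14])

q1, q3 = np.percentile(x, [25, 75])   # P25 and P75
riq = q3 - q1                         # interquartile range (Riq)
lower_threshold = q1 - 1.5 * riq      # lower outlier threshold
upper_threshold = q3 + 1.5 * riq      # upper outlier threshold

outliers = x[(x < lower_threshold) | (x > upper_threshold)]
print(riq, lower_threshold, upper_threshold, outliers)
```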


Kurtosis

Skewness: Positive Asymmetry

Median < Mean (the longer tail is on the right, pulling the mean above the median)

Skewness: Negative Asymmetry

Median > Mean (the longer tail is on the left, pulling the mean below the median)
A note on Standardization

In Data Analysis we often deal with standardized variables Z. The standardized values are simply
obtained by subtracting the Mean from the observed value and then dividing this difference by the
standard deviation.

Standardization transforms different scales into a “standardized scale” with zero mean and unit
standard deviation
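As a quick illustration (a minimal sketch assuming a NumPy array x with hypothetical observed values):

```python
import numpy as np

x = np.array([4.0, 7.0, 6.0, 9.0, 5.0])    # hypothetical observed values

z = (x - x.mean()) / x.std(ddof=1)         # subtract the mean, divide by the standard deviation
print(z.mean(), z.std(ddof=1))             # mean ~ 0 (up to floating point), standard deviation = 1
```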

How to interpret Standardized measures


Measures of Association between two variables

The use of measures of association makes it possible to quantify the strength (and, where applicable, the
direction) of relationships that can be spotted in various descriptive analyses (tables, graphics,
univariate statistics)

Cramer's V and ETA

Spearman Rs and Pearson R
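A minimal sketch with SciPy (hypothetical paired data) showing how the Pearson R and Spearman Rs coefficients can be obtained:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations of two metric variables
x = np.array([2, 4, 5, 7, 9, 12, 15])
y = np.array([1, 3, 4, 8, 10, 11, 20])

r, _ = stats.pearsonr(x, y)     # Pearson R: linear association between metric variables
rs, _ = stats.spearmanr(x, y)   # Spearman Rs: monotonic association, based on ranks
print(r, rs)
```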

Sampling Procedures

Simple Random Sampling (S.R.S)

 Each element of the population is randomly selected and is equally
likely to be included in the sample
 This type of sampling can avoid bias caused by a personal and subjective choice of sample
elements.
 Most known inferential methods rely on s. r. s. and on infinite populations. In practical
applications, ”infinite” can be seen as “large sized populations when compared to sample
size”
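A minimal sketch of drawing a simple random sample without replacement from a finite population, using NumPy (the population and sample size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1, 10001)                           # hypothetical population of 10,000 elements

# Each element is equally likely to be included in the sample
sample = rng.choice(population, size=200, replace=False)
print(sample[:10])
```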

Descriptive Statistics

 Descriptive statistics is based solely on the sample

Inferential Statistics

 The analysis aims at generalizing some conclusions based on the sample to the population.

The CI for the mean using the normal distribution
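The usual interval is mean ± z * s/sqrt(n); a minimal sketch (NumPy/SciPy, hypothetical sample) under the assumption that the normal approximation applies:

```python
import numpy as np
from scipy import stats

x = np.array([23.1, 19.8, 25.4, 22.0, 24.7, 21.5, 20.9, 23.8])   # hypothetical sample

mean = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))     # standard error of the mean
z = stats.norm.ppf(0.975)                # ~1.96 for a 95% confidence level

ci = (mean - z * se, mean + z * se)      # mean +/- z * standard error
print(ci)
```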

Bootstrapping

 When we cannot resort to known theoretical sampling distributions (e.g. the Normal), even
by approximation, to build confidence intervals for parameters, we can (for example) resort
to multiple samples with replacement drawn from the only sample we have, i.e.
bootstrap samples.
 Using these multiple s.r.s. with replacement we can obtain the (empirical) bootstrap
distribution for diverse statistics which enables building bootstrap confidence intervals for
diverse parameters

Bootstrapping for building CI

Step 1: Create a random sample with replacement from the original sample, with the same size as the
original sample

Step 2: Calculate the sample statistic of interest, for example the median of the sample

Step 3: Repeat steps 1 and 2 Nb times to obtain the bootstrap distribution

Step 4: Use this bootstrap distribution to calculate confidence intervals, standard errors, etc.
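A minimal sketch of these steps in Python (NumPy, hypothetical data, Nb = 2000 resamples), building a percentile bootstrap confidence interval for the median:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = np.array([7, 12, 9, 15, 11, 8, 20, 14, 10, 13])   # original sample (hypothetical)

Nb = 2000
boot_stats = np.empty(Nb)
for b in range(Nb):
    # Step 1: resample with replacement, same size as the original sample
    resample = rng.choice(x, size=len(x), replace=True)
    # Step 2: compute the statistic of interest (here, the median)
    boot_stats[b] = np.median(resample)

# Step 4: percentile bootstrap 95% confidence interval for the median
ci = np.percentile(boot_stats, [2.5, 97.5])
print(ci)
```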
Principal Components Analysis

General goal of principal components analysis (PCA)

 It is possible for the analyst to replace the original variables by the set of PCs, which have the
advantage of being non-correlated (which may be very relevant for some data analysis
techniques, such as segmentation based on Euclidean distances or multiple linear regression)
 We can obtain a reduced set of PCs that can be used as surrogates of the original variables,
with a minimum loss of information, thus attaining the reduction of data dimensionality.
 Some of the original variables, those more correlated with the PCs, can also be selected
(another way of attaining the reduction of data dimensionality)
 From a set of original (usually standardized) metric variables, we aim to obtain a set of non-
correlated latent variables (Principal Components), which are linear combinations of the
original variables and account for (a part of) the “total variation” (sum of all variances) of the
same variables.

Is PCA Adequate?

1. The P original variables must be metric


2. The number of observations in the data set considered for PCA should be higher than 5 times
the number of variables
3. The variables must be correlated with each other
4. The Kaiser-Meyer-Olkin (KMO) measure should indicate “suitability”

Principal components analysis

PCA makes it possible to explain the total variance of a set of variables using a set of new variables, the
principal components, which are linear combinations of the original (metric) variables and are uncorrelated
with each other.

How many components to retain?

1. Kaiser’s criterion: extract the PCs with variance (/eigenvalue) ≥ 1


2. Scree plot criterion: look at the “elbow” of the scree plot
3. Percentage of total variance explained: extract the first m components accounting for a
minimum of 70% to 80% of the total variance
4. Components’ interpretation: select a solution that is interpretable (eventually also taking into
account 3.). In the component matrix we find the Pearson correlation values, measuring the
association between the original variables and the extracted PCs. The PC interpretation can
rely on these values (also called loadings).
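A minimal sketch with scikit-learn (hypothetical data, generated at random only to make the code runnable) showing the quantities used by criteria 1, 3 and 4:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))             # hypothetical data: 200 observations, 6 metric variables
Z = StandardScaler().fit_transform(X)     # PCA is usually run on standardized variables

pca = PCA().fit(Z)
eigenvalues = pca.explained_variance_                   # Kaiser criterion: keep PCs with eigenvalue >= 1
cum_pct = pca.explained_variance_ratio_.cumsum()        # cumulative proportion of total variance explained

# Loadings: correlations between the (standardized) original variables and the PCs
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(eigenvalues, cum_pct, loadings.round(2))
```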

How to evaluate the quality of the PCA solution?

 Proportion of total variance explained: shouldn’t be below 60%


 Communalities: the proportion of the variation of each original variable that is explained by the
model; it shouldn’t be below 0.5
 Note: it is possible to obtain each original variable’s communality by summing the squares of
the corresponding loadings
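As a small illustration of this note (with a hypothetical loadings matrix for 4 variables and 2 retained components):

```python
import numpy as np

# Hypothetical loadings: correlations between 4 original variables and 2 retained PCs
loadings = np.array([
    [0.84,  0.21],
    [0.79,  0.35],
    [0.18,  0.88],
    [0.05,  0.91],
])

communalities = (loadings ** 2).sum(axis=1)   # sum of squared loadings, per original variable
print(communalities)                          # values below 0.5 signal a poorly explained variable
```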

Also contributing to the quality of the PCA solution:

 Interpretability of PCs: the set of components should provide an interpretable solution based
on the loadings’ values.
 Dimension reduction: the number of original variables vs the number of components should be
considered. If needed:
• Reduce the number of PCs if a PC’s interpretation relies on a single original
variable
• Increase the number of PCs to comply with the other quality criteria

How to interpret principal components?

 Mainly based on correlations between original variables and PC (component matrix)


 When the component matrix does not provide an easy interpretation:
• Use varimax rotation
• Eventually, rethink the number of components

Rotation of principal components

 The extracted components can be rotated to make them more interpretable


 The rotation angle in the varimax method is chosen so that the absolute loadings
(absolute correlation coefficients) get closer to 1 or 0, making the component matrix easier to
interpret.
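scikit-learn's PCA does not rotate components itself, but a varimax rotation can be applied to the loadings matrix afterwards; the sketch below is a common textbook-style implementation (the loadings array is hypothetical):

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Rotate a p x k loadings matrix so that absolute loadings move towards 0 or 1."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        )
        rotation = u @ vt
        new_var = s.sum()
        if new_var < var * (1 + tol):   # stop when the criterion no longer improves
            break
        var = new_var
    return loadings @ rotation

# Hypothetical unrotated loadings for 4 variables and 2 components
L = np.array([[0.7, 0.5], [0.6, 0.6], [0.5, -0.6], [0.6, -0.7]])
print(varimax(L).round(2))
```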

General goal of clustering analysis/segmentation

To structure the data so as to obtain a partition of entities in which we verify:

 Homogeneity within groups


 Heterogeneity between groups
Hierarchical clustering

 In the agglomerative process, the initial partition is formed by n clusters (singleton clusters,
each with one entity), and at each stage of the process the current partition is altered by the fusion of
two of its clusters
 In the last iteration, a unique cluster is obtained including all entities in the sample.

Stages of hierarchical clustering analysis

1. Select the entities to cluster


2. Select a subset of relevant entities’ features, available in the data, for clustering
• Standardization/Normalization is a pertinent question
3. Select a (dis)similarity or distance measure and compute it for all pairs of entities
4. Select an agglomeration method (we will resort to the Ward method only)
5. Decide on the number of clusters
6. Profile the clusters

A note on standardization (for metric variables)

When the variables are measured on different scales, prior standardization of the variables is
advised so that the distance values do not unduly reflect mainly the variables with greater ranges

A commonly used standardization resorts to the mean and standard deviation of each variable

Thus, the standardized variables Z have zero mean and unit standard deviation, which ensures a similar
role for all (standardized) variables in the clustering process

Some distance measures (for metric variables)
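The most common choices here are the Euclidean distance and the squared Euclidean distance (the latter is used by the Ward method below); a minimal sketch with hypothetical standardized profiles of two entities:

```python
import numpy as np

a = np.array([0.3, -1.2, 0.8])    # hypothetical standardized profile of entity A
b = np.array([-0.5, 0.4, 1.1])    # hypothetical standardized profile of entity B

squared_euclidean = ((a - b) ** 2).sum()      # squared Euclidean distance (used by Ward)
euclidean = np.sqrt(squared_euclidean)        # Euclidean distance
print(euclidean, squared_euclidean)
```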


Ward algorithm

 It is a hierarchical agglomerative method. The initial partition is constituted by n clusters
(singleton clusters, each with one entity), and at each stage of the process, the current partition is
altered by the fusion of two of its clusters. In the last iteration, a unique cluster is obtained
including all entities in the sample.
 The intra-group variation of the initial clusters is zero
 At each step of the analysis, the choice of the pair of groups to be joined falls on the pair
whose union results in the minimum increment of intra-cluster variation (the intra-cluster
variation is measured by the squared Euclidean distance)

How many clusters?

The decision is based on:

 Dendrogram: a graphical representation of the agglomeration distances/coefficients (which
express heterogeneity) associated with the agglomerative clustering process
 A line chart of the number of clusters vs the agglomeration coefficients.
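A minimal sketch with SciPy and matplotlib (hypothetical standardized data) running the Ward agglomeration and producing both decision aids:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
Z_data = rng.normal(size=(50, 4))        # hypothetical standardized clustering base variables

link = linkage(Z_data, method="ward")    # Ward agglomerative clustering

# Dendrogram: the heights are the agglomeration distances/coefficients
dendrogram(link)
plt.show()

# Line chart of the number of clusters vs the agglomeration coefficients (last 10 fusions)
coefficients = link[-10:, 2]             # heights of the last 10 fusions
n_clusters = np.arange(10, 0, -1)        # 10, 9, ..., 1 clusters remaining after each fusion
plt.plot(n_clusters, coefficients, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("agglomeration coefficient")
plt.show()
```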

Some disadvantages of hierarchical methods.

 At each stage of a hierarchical technique, the decisions made cannot be reversed, and the “best”
decision at the next step is always conditioned by the previous one, which limits the chances of
obtaining the overall best solution
 Also, the fact that one needs to rely on a (dis)similarity matrix and allocate the
corresponding memory in the computer is a computational disadvantage.

Non-Hierarchical clustering with K-means

 K means, including its multiple variants, is the most popular process in segmentation.
 Its goal is to minimize a sum of quadratic errors, defined as the squared Euclidean distance
between an observation and a reference point in each segment (the sample mean, in the
original version).
 Algorithms involving iterative reassignment of objects to groups essentially identified by
their centroids are generally referred to as K-means

Some advantages of K-means

 K-means computes dissimilarities iteratively. It operates on the original data, being
computationally more efficient and thus allowing the application of cluster analysis to larger
databases
 Allocation decisions can be reversed enabling the improvement of the objective function
 Non-hierarchical methods are generally less sensitive to the presence of irrelevant attributes
and the presence of outliers.
In outline:
1. Starting (non-random) centres
2. Allocate all elements to a centre/cluster
3. Recalculate centres

K-means procedure

1. Initialization: initialize the procedure by choosing K reference entities or centroids for the K
clusters to be constituted
2. Assignment and centroids: determine the squared Euclidean distances between each
observation and the clusters’ centroids; assign each observation to the nearest cluster
(minimum distance to the centroid)
• In general, averages/centroids are recalculated after all observations are allocated
to the nearest cluster
• If averages/centroids are recalculated each time an observation is allocated to a
cluster, we adopt running means
3. Repeat step 2 until all clusters/centroids are stabilized or until a maximum
computation time or number of iterations threshold is attained.
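A minimal sketch with scikit-learn (hypothetical standardized data), using several alternative random initializations and a maximum number of iterations:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
Z_data = rng.normal(size=(300, 4))      # hypothetical standardized clustering base variables

km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0).fit(Z_data)

labels = km.labels_                     # cluster assigned to each observation
centroids = km.cluster_centers_         # final cluster centroids
sse = km.inertia_                       # sum of squared Euclidean distances to the nearest centroid
print(centroids.round(2), sse)
```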
K-means initialization

K-means is sensitive to the initial selection of reference entities (centres or centroids) around which
the groups will be constituted. Thus, it is important to use alternative initializations and to compare
the results

Profiling the clusters should be based on the most discriminant features between all clusters

We can use the ETA measure of association to rank the quantitative features for profiling according
to their discriminant power.

We can use the Cramer's V measure of association to rank the qualitative features for profiling
according to their discriminant power
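A minimal sketch of both measures (hypothetical cluster labels and features): ETA as the correlation ratio for a quantitative feature, and Cramer's V from the chi-square statistic of the clusters-by-categories contingency table:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
clusters = rng.integers(0, 3, size=500)                   # hypothetical cluster labels (3 clusters)
quant = rng.normal(size=500) + clusters                   # hypothetical quantitative feature
qual = rng.choice(["low", "medium", "high"], size=500)    # hypothetical qualitative feature

# ETA: sqrt(between-cluster sum of squares / total sum of squares)
grand_mean = quant.mean()
ss_total = ((quant - grand_mean) ** 2).sum()
ss_between = sum(
    quant[clusters == c].size * (quant[clusters == c].mean() - grand_mean) ** 2
    for c in np.unique(clusters)
)
eta = np.sqrt(ss_between / ss_total)

# Cramer's V from the clusters x categories contingency table
table = pd.crosstab(clusters, qual)
chi2 = chi2_contingency(table)[0]
n = table.values.sum()
v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(round(eta, 3), round(v, 3))
```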

Some disadvantages of K-means

 Implicitly it is considered that the clustering base variables have a normal distribution with
identical variance
 Allocation decisions can be reversed, enabling the improvement of the objective function
but not all possible partitions are considered (number of all possible partitions is too large)
and although convergence to a local optimum is ensured, attaining a global optimum is not
guaranteed.

General goal of linear regression

 To verify if the dependent variable can be explained through the knowledge of other
(independent) variables.
 To predict the value of the dependent variable for a new sample observation
 To identify a subset of features, among many available, that is most effective in estimating
the dependent variable.

The multiple linear regression model (MLRM)
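In its usual form, with p independent variables, the model can be written as

Y = β0 + β1*X1 + β2*X2 + ... + βp*Xp + ε

where the βj are the regression coefficients to be estimated and ε is a random error term with zero mean.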


The choice of independent variables/ explanatory variables or predictors

 Must be supported by the theory: omitting other important predictor variables in multiple
linear regression models results in model specification errors.
 The automatic selection of a subset of explanatory variables according to their predictive
capacity can be conducted through several procedures: forward, backward, stepwise…
quantifying marginal increase or decrease of the model quality measures resulting from
including or deleting a variable.

Coefficient of determination (R squared) indicates the quality of model fit to the data

0 ≤ R2 ≤ 1

R2 summarizes how well Y is explained by X1, X2, X3, based on the regression residuals.

Ex:

R2= 0.391: only 39.1% of the total variability of number of days absent from work is explained by X1,
X2, X3 based on the regression.
Adjusted R2

Adjusted R2 penalizes the R2 in the interests of parsimony: given the inclusion of one more
independent variable in the model, the adjusted R2 will increase only if the improvement in model´s
fit caused by the inclusion of the additional variable overcomes the loss of degrees of freedom.
Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - P - 1)
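A minimal sketch with statsmodels (hypothetical data with three predictors) showing where R2 and the adjusted R2 are read from a fitted model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                    # hypothetical predictors X1, X2, X3
y = 2 + X @ np.array([0.5, -0.3, 0.8]) + rng.normal(scale=1.5, size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared)       # R2: proportion of the total variability of y explained by the predictors
print(model.rsquared_adj)   # adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - P - 1)
```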

How to interpret the regression coefficient estimate?


Multicollinearity

 Multicollinearity refers to the existence of linear relationships between the independent
variables X1, …, Xp
 These linear relationships mask the relationship between each of them and Y and make the
model interpretation difficult.

Measures of Multicollinearity
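One commonly used measure is the Variance Inflation Factor, VIF_j = 1/(1 - R2_j), where R2_j comes from regressing X_j on the remaining predictors; a minimal sketch with statsmodels (hypothetical predictors, one of them deliberately built to be strongly related to another):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)    # hypothetical predictor strongly related to x1
x3 = rng.normal(size=100)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF for each predictor (skipping the constant in column 0)
vif = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print(np.round(vif, 2))    # large values (often above 5 or 10) signal problematic multicollinearity
```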

A good regression model can be used for:

 Better understanding the relationship between X and Y
 Predicting Y values based on X observations (provided they are similar to the ones that
originated the model in the first place)
Example of cluster profiling: most discriminant clustering base variables (among rPC1-rPC6) and additional discriminant variables ranked using ETA or Cramer's V.

Clusters obtained without running means
Most discriminant clustering base variables: Rpc2, satisfaction with professionalism (eta = 0.352); rPC4, satisfaction with technical clarification (eta = 0.652); rPC5, satisfaction with elegancy (eta = 0.65)
Additional discriminant variables: C1s, Welcoming of customers; C1u, Promotions and sale; C1v, Aspects of the uniform of the collaborators; C2, Overall satisfaction level with employees (V = 0.368)

Cluster 1 (dimension 817): Rpc2 = 0.204; rPC4 = 0.3; rPC5 = 0.207; C1s = 5.77; C1u = 5.44; C1v = 5.62; C2: 43.5% satisfied, 28.1% very satisfied
Cluster 2 (dimension 157): Rpc2 = -0.771; rPC4 = -1.611; rPC5 = 0.532; C1s = 4.63; C1u = 4.49; C1v = 4.59; C2: 36.8% somewhat satisfied, 28.4% neutral
Cluster 3 (dimension 160): Rpc2 = -0.285; rPC4 = 0.05; rPC5 = -1.579; C1s = 4.94; C1u = 4.55; C1v = 4.47; C2: 32.1% somewhat satisfied, 28.9% satisfied

Clusters obtained with running means
Most discriminant clustering base variables: Rpc3, satisfaction with personal service (eta = 0.634); rPC5, satisfaction with elegancy (eta = 0.550); rPC6, satisfaction with courtesy (eta = 0.341)
Additional discriminant variables: C1s, Welcoming of customers; C1u, Promotions and sale; C1v, Aspects of the uniform of the collaborators; C2, Overall satisfaction level with employees (V = 0.368)

Cluster 1 (dimension 82): Rpc3 = -2.168; rPC5 = 0.4038; rPC6 = -1.48
Cluster 2 (dimension 998): Rpc3 = 0.124; rPC5 = 0.098; rPC6 = -0.07
Cluster 3 (dimension 54): Rpc3 = 1.00; rPC5 = -2.4327; rPC6 = 1.524

Satisfaction with professionalism and technical clarification is below average for cluster 2

Example of interpreting a standardized regression coefficient estimate (0.565 for experience):

A unit increase in standardized experience yields a 0.565 increase in predicted standardized
absence, all other predictors remaining fixed.

An increase of one standard deviation in experience yields a 0.565 standard deviation increase
in predicted absence.
