Statistics Summaries
Statistics: the science that deals with the collection, organization, classification, presentation and
interpretation of a set of data, aiming at a better understanding of the phenomenon that the data
represents.
Measurement Scales
Outliers
Skewness
Positive asymmetry: Median < Mean
Negative asymmetry: Median > Mean
A note on Standardization
In Data Analysis we often deal with standardized variables Z. The standardized values are simply
obtained by subtracting the Mean from the observed value and then dividing this difference by the
standard deviation.
Standardization transforms different scales into a “standardized scale” with null average and unit
standard deviation
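A minimal sketch in Python (not part of the original notes; numpy only, with illustrative values) of how the standardized values are obtained:

```python
import numpy as np

x = np.array([12.0, 15.0, 9.0, 20.0, 14.0])   # illustrative observed values

# Subtract the mean from each observed value, then divide by the standard deviation
z = (x - x.mean()) / x.std(ddof=1)

print(z.mean())       # ~0: null average
print(z.std(ddof=1))  # 1: unit standard deviation
```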
The use of measures of association makes it possible to quantify the strength (and, where applicable, the direction) of relationships that can be spotted in various descriptive analyses (tables, graphics, univariate statistics).
Sampling Procedures
Each element of the population is randomly selected and is equally likely to be included in the sample.
This type of sampling can avoid bias caused by a personal and subjective choice of sample
elements.
Most known inferential methods rely on s.r.s. and on infinite populations. In practical applications, “infinite” can be seen as “populations that are large when compared to the sample size”.
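For illustration (not from the notes; numpy, with a hypothetical population and sample size), a simple random sample without replacement can be drawn as follows:

```python
import numpy as np

rng = np.random.default_rng(42)

population = np.arange(10_000)                             # hypothetical population of 10,000 elements
sample = rng.choice(population, size=200, replace=False)   # each element equally likely to be selected

print(sample[:10])
```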
Descriptive Statistics
Inferential Statistics
The analysis aims at generalizing conclusions drawn from the sample to the population.
Bootstrapping
When we cannot resort to known theoretical sampling distributions (e.g. the Normal), even by approximation, to build confidence intervals for parameters, we can (for example) resort to multiple samples drawn with replacement from the only sample we have, i.e. bootstrap samples.
Using these multiple s.r.s. with replacement we can obtain the (empirical) bootstrap distribution of diverse statistics, which enables building bootstrap confidence intervals for diverse parameters.
Step 1: Create a random sample with replacement from the original sample, with the same sample size as the original sample.
Step 2: Calculate the sample statistic of interest, for example the median of the sample.
Step 3: Repeat steps 1 and 2 Nb times to obtain the bootstrap distribution.
Step 4: Use this bootstrap distribution to calculate confidence intervals, standard errors, etc. (see the sketch below).
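A minimal sketch of these four steps in Python (numpy only; the original sample, Nb and the 95% level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
original_sample = rng.exponential(scale=3.0, size=100)   # hypothetical original sample
Nb = 5000                                                 # number of bootstrap samples

boot_medians = np.empty(Nb)
for b in range(Nb):
    # Step 1: resample with replacement, same size as the original sample
    boot_sample = rng.choice(original_sample, size=original_sample.size, replace=True)
    # Step 2: calculate the sample statistic (here, the median)
    boot_medians[b] = np.median(boot_sample)
# Step 3: boot_medians now holds the (empirical) bootstrap distribution of the median

# Step 4: percentile bootstrap 95% confidence interval and bootstrap standard error
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
std_error = boot_medians.std(ddof=1)
print(ci_low, ci_high, std_error)
```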
Principal Components Analysis
It is possible for the analyst to replace the original variables by the set of PCs, which have the advantage of being non-correlated (which may be very relevant for use in some data analysis techniques such as segmentation based on Euclidean distances or multiple linear regression).
We can obtain a reduced set of PCs that can be used as surrogates of the original variables,
with a minimum loss of information, thus attaining the reduction of data dimensionality.
Some of the original variables, those more correlated with the PCs, can also be selected
(another way of attaining the reduction of data dimensionality)
From a set of original (usually standardized) metric variables, we aim to obtain a set of non-
correlated latent variables (Principal Components), which are linear combinations of the
original variables and account for (a part of) the “total variation” (sum of all variances) of the
same variables.
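A minimal sketch of this in Python (scikit-learn; the data are hypothetical and only illustrate the mechanics):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))          # hypothetical metric variables (300 observations, 6 variables)

Z = StandardScaler().fit_transform(X)  # standardize the original variables

pca = PCA()
scores = pca.fit_transform(Z)          # principal component scores: non-correlated linear combinations

# Loadings (approximate correlations between the standardized variables and the components)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
print(loadings.round(2))
```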
Is PCA Adequate?
PCA makes it possible to explain the total variance of a set of variables using a set of new variables, the principal components, which are linear combinations of the original (metric) variables and are non-correlated.
How many components to retain?
Interpretability of the PCs: the set of components should provide an interpretable solution based on the loadings' values.
The dimension reduction achieved (number of original variables vs. number of components) should also be considered. If needed (see the sketch after this list):
• Reduce the number of PCs if a PC's interpretation relies on a single original variable
• Increase the number of PCs to comply with the other quality criteria
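A minimal sketch of the usual numeric retention criteria in Python (scikit-learn, hypothetical data as in the sketch above; the 80% threshold and the Kaiser eigenvalue-above-1 rule are common choices, not prescribed by these notes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))                        # hypothetical metric variables

pca = PCA().fit(StandardScaler().fit_transform(X))

explained = pca.explained_variance_ratio_            # share of the "total variation" per component
cumulative = explained.cumsum()

n_kaiser = int((pca.explained_variance_ > 1).sum())  # Kaiser criterion: eigenvalues above 1
n_80pct = int((cumulative < 0.80).sum()) + 1         # smallest number of PCs explaining at least 80%

print(explained.round(3), cumulative.round(3), n_kaiser, n_80pct)
```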
In the agglomerative process, the initial partition is formed by n clusters (singleton clusters, each with one entity), and at each stage of the process the current partition is altered by the fusion of two of its clusters.
In the last iteration, a unique cluster is obtained including all entities in the sample.
When the variables are defined in different measurement scales, and to avoid distance values unduly reflecting mainly the variables with greater ranges, prior standardization of the variables is advised.
A commonly used standardization resorts to the mean and standard deviation of each variable.
Thus, the standardized variables Z have null average and unit standard deviation, which ensures a similar role for all (standardized) variables in the clustering process.
At each stage of a hierarchical technique, the decisions made cannot be reversed, and the next step's “best” decision is always conditioned by the previous one, which limits obtaining the overall best solution.
Also, the fact that one needs to rely on a (dis)similarity matrix and allocate the corresponding memory in the computer is a computational disadvantage.
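A minimal sketch of an agglomerative run in Python (scipy; the data, the Euclidean distance and the Ward linkage are illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))                       # hypothetical entities x variables

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # prior standardization, as advised above

d = pdist(Z, metric="euclidean")                   # condensed (dis)similarity matrix (the memory cost noted above)
tree = linkage(d, method="ward")                   # fusions from n singleton clusters down to a single cluster

labels = fcluster(tree, t=3, criterion="maxclust") # cut the hierarchy into, e.g., 3 clusters
print(labels)
```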
K-means, including its multiple variants, is the most popular process in segmentation.
Its goal is to minimize a sum of quadratic errors, defined as the squared Euclidean distance
between an observation and a reference point in each segment (the sample mean, in the
original version).
Algorithms involving iterative reassignment of objects to groups essentially identified by
their centroids are generally referred to as K-means
K-means procedure
K-means is sensitive to the initial selection of reference entities (centres or centroids) around which the groups will be constituted. Thus, it is important to use alternative initializations and to compare the results.
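A minimal K-means sketch in Python (scikit-learn; K = 3 and the data are illustrative; n_init runs several alternative initializations and keeps the best, as recommended above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))                          # hypothetical clustering base variables

Z = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=3, n_init=25, random_state=0)   # 25 alternative initializations; lowest SSE kept
labels = km.fit_predict(Z)

print(km.inertia_)          # sum of squared Euclidean distances to the centroids (the objective being minimized)
print(km.cluster_centers_)  # final centroids
```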
Profiling the clusters should be based on the most discriminant features between all clusters
We can use the eta measure of association to rank the quantitative features for profiling according to their discriminant power.
We can use the Cramer's V measure of association to rank the qualitative features for profiling according to their discriminant power.
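A minimal sketch (numpy/pandas/scipy; the data are illustrative) of how both measures could be computed for profiling:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def eta(y, clusters):
    """Correlation ratio: sqrt(between-cluster sum of squares / total sum of squares)."""
    y, clusters = np.asarray(y, float), np.asarray(clusters)
    sst = ((y - y.mean()) ** 2).sum()
    ssb = sum((clusters == g).sum() * (y[clusters == g].mean() - y.mean()) ** 2
              for g in np.unique(clusters))
    return np.sqrt(ssb / sst)

def cramers_v(x, clusters):
    """Cramer's V between a qualitative feature and cluster membership."""
    table = pd.crosstab(pd.Series(x), pd.Series(clusters))
    chi2 = chi2_contingency(table)[0]
    n = table.to_numpy().sum()
    return np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Illustrative use: rank quantitative features by eta and qualitative features by Cramer's V
rng = np.random.default_rng(4)
clusters = rng.integers(1, 4, size=300)
print(eta(rng.normal(size=300) + clusters, clusters))
print(cramers_v(rng.choice(["low", "high"], size=300), clusters))
```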
Implicitly, it is assumed that the clustering base variables have a normal distribution with identical variances.
Allocation decisions can be reversed, enabling the improvement of the objective function. However, not all possible partitions are considered (the number of all possible partitions is too large), and although convergence to a local optimum is ensured, attaining a global optimum is not guaranteed.
To verify if the dependent variable can be explained through the knowledge of other
(independent) variables.
To predict the value of the dependent variable for a new sample observation
To identify a subset of features, among the many available, which is most effective in estimating the dependent variable.
The model must be supported by theory: omitting important predictor variables in multiple linear regression models results in model specification errors.
The automatic selection of a subset of explanatory variables according to their predictive capacity can be conducted through several procedures (forward, backward, stepwise…), all quantifying the marginal increase or decrease of the model quality measures resulting from including or deleting a variable.
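A minimal forward-selection sketch in Python (scikit-learn; the data and the stopping point are illustrative assumptions, and the quality measure here is cross-validated R2 rather than the classical stepwise F tests):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 8))                            # 8 candidate explanatory variables
y = 2 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=200)   # hypothetical dependent variable

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,   # illustrative stopping point
    direction="forward",      # "backward" is also available
    scoring="r2",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))                # indices of the selected predictors
```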
The coefficient of determination (R squared) indicates the quality of the model's fit to the data: 0 ≤ R2 ≤ 1.
R2 summarizes how well Y is explained by X1, X2, X3, based on the regression residuals.
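In terms of sums of squares (standard definition, added here for completeness): R2 = 1 − SS_residual / SS_total, where SS_residual is the sum of squared regression residuals and SS_total is the total sum of squares of Y around its mean.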
Ex:
R2 = 0.391: only 39.1% of the total variability of the number of days absent from work is explained by X1, X2, X3 based on the regression.
Adjusted R2
Adjusted R2 penalizes the R2 in the interests of parsimony: given the inclusion of one more independent variable in the model, the adjusted R2 will increase only if the improvement in the model's fit caused by the inclusion of the additional variable overcomes the loss of degrees of freedom.
Adjusted R2 = 1 − (1 − R2) × (n − 1) / (n − p − 1)
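For illustration (n and p are hypothetical; only the R2 value comes from the example above): with R2 = 0.391, n = 100 observations and p = 3 predictors, Adjusted R2 = 1 − (1 − 0.391) × 99/96 ≈ 0.372.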
Measures of Multicollinearity
| Clusters (no running means) | Cluster dimension | rPC2 – Satisfaction with professionalism (eta = 0.352) | rPC4 – Satisfaction with technical clarification (eta = 0.652) | rPC5 – Satisfaction with elegancy (eta = 0.65) | C1s – Welcoming of customers (eta = ) | C1u – Promotions and sale (eta = ) | C1v – Aspects of the uniform of the collaborators (eta = ) | C2_Overall satisfaction level with employees (V = 0.368) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster 1 | 817 | 0.204 | 0.3 | 0.207 | 5.77 | 5.44 | 5.62 | 43.5% satisfied; 28.1% very satisfied |
| Cluster 2 | 157 | -0.771 | -1.611 | 0.532 | 4.63 | 4.49 | 4.59 | 36.8% somewhat satisfied; 28.4% neutral |
| Cluster 3 | 160 | -0.285 | 0.05 | -1.579 | 4.94 | 4.55 | 4.47 | 32.1% somewhat satisfied; 28.9% satisfied |
Column groups: most discriminant clustering base variables (rPC1–rPC6), followed by additional discriminant variables (use eta or Cramer's V).

| Clusters (with running means) | Cluster dimension | rPC3 – Satisfaction with personal service (eta = 0.634) | rPC5 – Satisfaction with elegancy (eta = 0.550) | rPC6 – Satisfaction with courtesy (eta = 0.341) | C1s – Welcoming of customers (eta = ) | C1u – Promotions and sale (eta = ) | C1v – Aspects of the uniform of the collaborators (eta = ) | C2_Overall satisfaction level with employees (V = 0.368) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cluster 1 | 82 | -2.168 | 0.4038 | -1.48 |  |  |  |  |
| Cluster 2 | 998 | 0.124 | 0.098 | -0.07 |  |  |  |  |
| Cluster 3 | 54 | 1.00 | -2.4327 | 1.524 |  |  |  |  |
Satisfaction with professionalism and technical clarification is below average for cluster 2
An increase of one standard deviation in experience yields a 0.565 standard deviation increase in predicted absence (the interpretation of a standardized, beta, regression coefficient).
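A minimal sketch in Python (scikit-learn; the experience/absence data are hypothetical) of how such a standardized (beta) coefficient can be obtained by regressing the standardized dependent variable on the standardized predictor:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
experience = rng.normal(10, 4, size=150)                  # hypothetical predictor (years of experience)
absence = 20 - 1.1 * experience + rng.normal(0, 3, 150)   # hypothetical dependent variable (days absent)

Zx = StandardScaler().fit_transform(experience.reshape(-1, 1))
Zy = StandardScaler().fit_transform(absence.reshape(-1, 1)).ravel()

beta = LinearRegression().fit(Zx, Zy).coef_[0]
print(beta)   # SD change in predicted absence per one-SD change in experience
```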