Advanced Statistics
A Portfolio
Presented to
In Partial Fulfillment
of the Requirements for the subject
Advanced Statistics
By
Christian B. Manginsay
November 2021
Introduction
In this course, I have learned about the different concepts in Advanced Statistics and applied what I know in every lesson. This portfolio is a collection of the summaries, reflections, and formative exams produced over several weeks. I believe this course has helped me establish knowledge that I can use in the future when I am teaching, and that it will greatly help me in my future career as an educator.
Preliminary Statement
I created this portfolio to help me assess myself, gain new knowledge, and discover my strengths and weaknesses in all the discussed topics, so that I can improve where needed. Through this progressive portfolio, I have recorded the details of my thoughts and what I felt during the whole course.
Goals
To deepen my knowledge of the concepts discussed in Advanced Statistics; and to develop and self-reflect on how the discussed topics could help me in my future career as an educator.
Acknowledgment
Apart from my own efforts, the success of any project depends largely on the encouragement and guidance of others. I take this opportunity to express my gratitude to the people who have been instrumental in the successful completion of this portfolio.

I would like to show my greatest gratitude and appreciation to Dr. Bernie Rivas. I cannot thank him enough for his tremendous support and help. I feel motivated and encouraged every time I attend his lecture. Without his encouragement and guidance, this portfolio would not have been possible.

The guidance and support I received from all my classmates were also vital to the success of this portfolio, and I am grateful for their constant help. Last but not least, I would like to extend my deepest gratitude to all those who have directly and indirectly guided me in this work.
Table of Contents
Summaries
1. Measures of Central Tendency
Data can be classified in various forms. One way to distinguish between data is in terms of grouped and ungrouped data. What is ungrouped data? When the data have not been placed in any categories and no aggregation or summarization has taken place, the data are known as ungrouped data. Ungrouped data are also known as raw data. What is grouped data? When raw data have been sorted into different classes, they are said to be grouped data. Before we study grouped and ungrouped data further, it is important to understand what we mean by "central tendency". As the name suggests, central tendency has something to do with the center: it is the central location of a probability distribution. Common measures of central tendency are the mean, mode, and median. For ungrouped data they are defined as follows. MODE: the most frequently occurring value in a data set. A data set is called bimodal when two values tie for the highest frequency, and multimodal when more than two values share the same highest frequency. MEDIAN: the middlemost value in the ordered arrangement of the values in the data set. MEAN: also known as the arithmetic average, calculated as the sum of all values divided by the number of values.
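As a quick illustration, Python's standard library computes all three measures directly; the data set below is a made-up example of ungrouped (raw) data:

```python
from statistics import mean, median, multimode

# Hypothetical ungrouped (raw) data set
scores = [7, 9, 7, 3, 9, 7, 5]

avg = mean(scores)        # sum of values / number of values
mid = median(scores)      # middle value of the sorted data
modes = multimode(scores) # most frequent value(s); more than one if tied
```

Here the sorted data are [3, 5, 7, 7, 7, 9, 9], so the median is 7 and the mode is 7 (it occurs three times).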
2. Fractiles
Fractiles are measures of location or position which include not only central location but also any
position based on the number of equal divisions in a given distribution. If we divide the
distribution into four equal divisions, then we have quartiles denoted by Q1, Q2, Q3, and Q4.
The most commonly used fractiles are the quartiles, deciles, and percentiles. QUARTILES
divide a distribution into four equal parts. DECILES are values that divide a distribution into 10
equal parts. PERCENTILES are values that divide the distribution into 100 equal parts.
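A short sketch using Python's `statistics.quantiles` makes the cut-point counts concrete (the data are hypothetical; the function's default "exclusive" interpolation is one of several common conventions):

```python
from statistics import quantiles

data = [15, 20, 35, 40, 50, 55, 60, 70, 80, 90]

# Quartiles: three cut points dividing the distribution into four parts
q1, q2, q3 = quantiles(data, n=4)

# Deciles: nine cut points dividing it into ten parts
deciles = quantiles(data, n=10)

# Percentiles: ninety-nine cut points dividing it into one hundred parts
percentiles = quantiles(data, n=100)
```

Note that Q2 is the median of the data, here 52.5.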
3. Measures of Dispersion
A measure of dispersion indicates the scattering of data. It explains the disparity of data from one
another, delivering a precise view of their distribution. The measure of dispersion displays and
gives us an idea about the variation and the central value of an individual item.
In other words, dispersion is the extent to which values in a distribution differ from the average
of the distribution. It gives us an idea about the extent to which individual items vary from one
another, and from the central value.
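Three common measures of dispersion can be sketched with the standard library (the data are a hypothetical example; `pvariance`/`pstdev` treat the data as a whole population rather than a sample):

```python
from statistics import pvariance, pstdev

data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

range_ = max(data) - min(data)  # simplest measure: largest minus smallest
var = pvariance(data)           # population variance: mean squared deviation
sd = pstdev(data)               # standard deviation: square root of variance
```

For these values the mean is 5.2, the range is 7, the variance is 5.76, and the standard deviation is 2.4.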
4. Introduction to Correlation
Correlation describes the relationship between two variables. One way to examine a possible correlation between two variables is with a scatter diagram: a graph of the ordered pairs (x, y), where x is the independent variable and y is the dependent variable. Every correlation has two properties: strength and direction. Strength refers to the numerical value of the coefficient, while direction refers to whether the relationship is positive or negative. Plotted data may therefore show a positive correlation, a negative correlation, or no correlation at all.
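The direction of a relationship can be read off from the sign of the covariance of the paired (x, y) values; the small data set below is hypothetical:

```python
def covariance(xs, ys):
    # Sample covariance: positive when y tends to rise with x,
    # negative when y tends to fall as x rises
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]        # independent variable x
scores = [50, 55, 65, 70, 80]  # dependent variable y

direction = "positive" if covariance(hours, scores) > 0 else "negative or none"
```

Here the covariance is positive, so a scatter diagram of these pairs would slope upward.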
5. Spearman Rank Correlation
When should you use Spearman's rank-order correlation? It is the nonparametric version of the Pearson product-moment correlation. Spearman's correlation coefficient (ρ, also denoted rs) measures the strength and direction of the association between two ranked variables; specifically, it measures the strength and direction of the monotonic association between two variables. Monotonicity is a less restrictive condition than linearity. A monotonic relationship is not strictly an assumption of Spearman's correlation: you can run a Spearman's correlation on a non-monotonic relationship to determine whether there is a monotonic component to the association. Normally, however, you would pick a measure of association, such as Spearman's correlation, that fits the pattern of the observed data.
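Spearman's ρ is simply Pearson's r computed on the ranks of the data, which the following sketch implements from scratch (using average ranks so that ties share a rank; the example data are hypothetical):

```python
def average_ranks(values):
    # Rank of each value in its data set; tied values share the average rank
    sv = sorted(values)
    return [sv.index(v) + (sv.count(v) + 1) / 2 for v in values]

def pearson(xs, ys):
    # Standard product-moment correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def spearman(xs, ys):
    # Spearman's rho is Pearson's r computed on the ranks
    return pearson(average_ranks(xs), average_ranks(ys))

# Perfectly monotone increasing but non-linear data: rho is exactly 1
rho = spearman([1, 2, 3, 4, 5], [2, 4, 8, 16, 32])
```

Because the second list doubles at each step, Pearson's r on the raw values would be below 1, but the relationship is perfectly monotonic, so ρ = 1.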
6. Pearson Product-Moment Correlation
What does this test do? The Pearson product-moment correlation coefficient (or Pearson correlation coefficient, for short) is a measure of the strength of a linear association between two variables and is denoted by r. Essentially, a Pearson product-moment correlation attempts to draw a line of best fit through the data of the two variables, and the Pearson correlation coefficient, r, indicates how far all the data points lie from this line of best fit (i.e., how well the data points fit the line). What values can the Pearson correlation coefficient take? The coefficient r ranges from +1 to -1. A value of 0 indicates no linear association between the two variables. A value greater than 0 indicates a positive association; that is, as the value of one variable increases, so does the value of the other. A value less than 0 indicates a negative association; that is, as the value of one variable increases, the value of the other decreases.
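The formula behind r is the covariance of x and y divided by the product of their standard deviations; a minimal sketch with hypothetical age/height data:

```python
def pearson_r(xs, ys):
    # r = covariance(x, y) / (sd(x) * sd(y)); always between -1 and +1
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

age = [3, 5, 7, 9, 11]             # hypothetical ages in years
height = [95, 108, 122, 133, 145]  # hypothetical heights in cm
r = pearson_r(age, height)  # close to +1: strong positive linear association
```

Since the points here lie almost exactly on a straight rising line, r comes out just below +1.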
7. Multiple Correlation
The multiple correlation coefficient, R, generalizes the standard correlation coefficient. It is used in multiple regression analysis to assess the quality of the prediction of the dependent variable: R is the correlation between the predicted and the actual values of the dependent variable, and its square, R², can be interpreted as the proportion of the variance of the dependent variable explained by the independent variables. When the independent variables (used for predicting the dependent variable) are pairwise orthogonal, R² is equal to the sum of the squared correlations between each independent variable and the dependent variable; this relation does not hold when the independent variables are not orthogonal. The significance of a multiple correlation coefficient can be assessed with an F ratio. The sample value of R tends to overestimate the magnitude of the population correlation, but it is possible to correct for this overestimation.
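The orthogonal case can be checked numerically: in the made-up data below, the two predictors are uncorrelated (their covariance is zero), so R² is just the sum of the two squared simple correlations with y:

```python
def pearson_r(xs, ys):
    # Product-moment correlation between two variables
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Two pairwise-orthogonal (uncorrelated) predictors, hypothetical data
x1 = [-1, -1, 1, 1]
x2 = [-1, 1, -1, 1]
y = [-5, -1, 1, 6]

# With orthogonal predictors, R^2 is the sum of the squared simple correlations
r_squared = pearson_r(x1, y) ** 2 + pearson_r(x2, y) ** 2
R = r_squared ** 0.5  # the multiple correlation coefficient itself
```

With correlated predictors this shortcut would overcount the shared variance, which is why the relation holds only in the orthogonal case.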
8. Partial Correlation
Partial correlation is a method used to describe the relationship between two variables while removing the effects of another variable, or several other variables, on this relationship. Partial correlation analysis aims to find the correlation between two variables after removing the effects of other variables. This type of analysis helps to spot spurious correlations (i.e., correlations explained by the effect of other variables) as well as to reveal hidden correlations (i.e., correlations masked by the effect of other variables).
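For a single control variable z, the partial correlation of x and y has a closed form built from the three pairwise correlations; the coefficients below are hypothetical:

```python
def partial_corr(r_xy, r_xz, r_yz):
    # First-order partial correlation of x and y, controlling for z
    return (r_xy - r_xz * r_yz) / ((1 - r_xz**2) * (1 - r_yz**2)) ** 0.5

# Hypothetical zero-order correlations
strong = partial_corr(0.8, 0.6, 0.7)   # association survives controlling for z

# A spurious correlation: x and y each correlate 0.7 with z, and their
# apparent 0.5 correlation almost vanishes once z is controlled for
spurious = partial_corr(0.5, 0.7, 0.7)
```

The second call illustrates the "spurious correlation" case from the text: the 0.5 zero-order correlation drops to about 0.02 after removing z's effect.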
Reflections
1. Measures of Central Tendency
Measures of central tendency are often used in research to get an idea of where most data
values lie. Other data measures that are closely related to measures of central tendency are
variance and standard deviation. The most commonly used measures of central tendency
are mean, mode and median. These measures are mostly used by primary researchers
during data analysis.
2. Fractiles
Fractiles are important in engineering and scientific applications, and they are among the first real-life exposures many of us get to statistics, as our parents look up the growth percentiles of our baby siblings and we look up the percentile our SAT scores fall in.
3. Measures of Dispersion
4. Introduction to Correlation
5. Spearman Rank Correlation
The sign of the Spearman correlation indicates the direction of association between X (the
independent variable) and Y (the dependent variable). If Y tends to increase when X
increases, the Spearman correlation coefficient is positive. If Y tends to decrease when X
increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero
indicates that there is no tendency for Y to either increase or decrease when X increases.
The Spearman correlation increases in magnitude as X and Y become closer to being
perfectly monotone functions of each other. When X and Y are perfectly monotonically
related, the Spearman correlation coefficient becomes 1. A perfectly monotone increasing
relationship implies that for any two pairs of data values Xi, Yi and Xj, Yj, that Xi − Xj
and Yi − Yj always have the same sign. A perfectly monotone decreasing relationship
implies that these differences always have opposite signs.
6. Pearson Product-Moment Correlation
To test its significance, we assume normality of both variables. Pearson's correlation coefficient measures the effect of change in one variable when the other variable changes. For example: up to a certain age, (in most cases) a child's height will keep increasing as his or her age increases.
7. Multiple Correlation
In statistics, the coefficient of multiple correlation is a measure of how well a given variable can be predicted using a linear function of a set of other variables. It is the correlation between the variable's values and the best predictions that can be computed linearly from the predictive variables. The coefficient of multiple correlation takes values between 0 and 1. Higher values indicate higher predictability of the dependent variable from the independent variables, with a value of 1 indicating that the predictions are exactly correct and a value of 0 indicating that no linear combination of the independent variables is a better predictor than the fixed mean of the dependent variable. The coefficient of multiple correlation can be computed as the square root of the coefficient of determination, but only under the particular assumptions that an intercept is included and that the best possible linear predictors are used; the coefficient of determination, by contrast, is defined for more general cases, including nonlinear prediction and predicted values that have not been derived from a model-fitting procedure.
8. Partial Correlation
Partial correlation measures the strength of a relationship between two variables, while
controlling for the effect of one or more other variables. For example, you might want to
see if there is a correlation between amount of food eaten and blood pressure, while
controlling for weight or amount of exercise. It's possible to control for multiple variables (called control variables or covariates), but controlling for more than one or two is usually not recommended, because the more control variables you add, the less reliable the test becomes. Partial correlation has one continuous independent variable (the x-value) and one continuous dependent variable (the y-value); this is the same as in regular correlation analysis. In the
blood pressure example above, the independent variable is “amount of food eaten” and the
dependent variable is “blood pressure”. The control variables — weight and amount of
exercise — should also be continuous.
Multiple regression is a statistical technique that can be used to analyze the relationship between a single dependent variable and several independent variables. The objective of multiple regression analysis is to use the independent variables, whose values are known, to predict the value of the single dependent variable. Each predictor is weighted, the weights denoting its relative contribution to the overall prediction. Often, you'll want to use some nominal variables in your multiple regression. For example, if you're doing a multiple regression to try to predict blood pressure (the dependent variable) from independent variables such as height, weight, age, and hours of exercise per week, you'd also want to include sex as one of your independent variables. This is easily done by creating a dummy variable in which every female is coded 0 and every male 1, and treating that variable as if it were a measurement variable.
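The weighting described above is ordinary least squares; the following sketch fits a small multiple regression (one continuous predictor plus a 0/1 dummy for sex) by solving the normal equations directly. All data values and coefficients are hypothetical:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small linear system
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def fit(X, y):
    # Ordinary least squares via the normal equations: (X'X) b = X'y
    cols = list(zip(*X))
    XtX = [[sum(a * b for a, b in zip(c1, c2)) for c2 in cols] for c1 in cols]
    Xty = [sum(a * b for a, b in zip(c, y)) for c in cols]
    return solve(XtX, Xty)

# Hypothetical rows: (intercept, weight in kg, sex dummy: 0 = female, 1 = male)
rows = [(1, 60, 0), (1, 70, 0), (1, 80, 1), (1, 90, 1)]
bp = [110.0, 120.0, 135.0, 145.0]  # blood pressure, the dependent variable

coefs = fit(rows, bp)  # [intercept, weight coefficient, sex coefficient]
```

Because these toy data satisfy bp = 50 + 1.0·weight + 5·sex exactly, the fitted coefficients recover those values; the dummy variable's coefficient is then read as the predicted difference between the two groups at equal weight.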
Simple linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables: One variable, denoted x, is
regarded as the predictor, explanatory, or independent variable. The other variable,
denoted y, is regarded as the response, outcome, or dependent variable. Because the other
terms are used less frequently today, we'll use the "predictor" and "response" terms to
refer to the variables encountered in this course. The other terms are mentioned only to
make you aware of them should you encounter them. Simple linear regression gets its
adjective "simple," because it concerns the study of only one predictor variable.
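With only one predictor, the least-squares line y = a + bx has a closed form; a minimal sketch with hypothetical data:

```python
def simple_linear_regression(xs, ys):
    # Least-squares slope and intercept for the line y = a + b*x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx  # the line passes through the point of means
    return a, b

x = [1, 2, 3, 4, 5]              # predictor (explanatory) variable
y = [2.1, 4.0, 6.2, 7.9, 10.0]   # response variable
a, b = simple_linear_regression(x, y)
```

For these values the slope is 1.97 and the intercept 0.13, i.e. the response rises by roughly 2 units for each unit increase in the predictor.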
An aggregate price index compares the total price of a group of commodities (called a market basket) in a given period to the price paid for that same group of commodities at a particular point in time in the past. The base period is the point in the past against which all comparisons are made. In selecting the base period for a particular index, you should, if possible, choose a period of economic stability rather than one at or near the peak of an expanding economy or the bottom of a recession or declining economy. In addition, the base period should be relatively recent so that comparisons are not greatly affected by changing technology and consumer attitudes and habits.
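A simple (unweighted) aggregate index divides the basket's total current price by its total base-period price and multiplies by 100; the basket and prices below are hypothetical:

```python
def unweighted_aggregate_index(base_prices, current_prices):
    # Total current cost of the basket relative to its base-period cost, times 100
    return 100 * sum(current_prices) / sum(base_prices)

# Hypothetical market basket of four commodities
base = [2.00, 1.50, 3.00, 0.50]  # base-period prices
now = [2.50, 1.80, 3.30, 0.40]   # current-period prices

index = unweighted_aggregate_index(base, now)  # > 100 means prices rose overall
```

Here the basket costs 8.00 now versus 7.00 in the base period, giving an index of about 114, i.e. roughly a 14% overall price increase since the base period.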