
Analyzing Data 2

M-16 Transformation Scores and


Correlation Coefficients

Chapter 7 Kothari
Types of Distributions
• In this chapter we shall learn about the shape
of the distribution.
1. Normal Distributions
• When a distribution of scores is very large, it
often tends to approximate a pattern called a
normal distribution
• When plotted as a frequency polygon, a
normal distribution forms a symmetrical, bell-
shaped pattern often called a normal curve

• The normal distribution is a theoretical frequency distribution that has certain special
characteristics
Characteristics of Normal Distribution
1. First, it is bell-shaped and symmetrical: the right half is a mirror image of the left half.
2. Mean, median, and mode are equal and are located at the center of the distribution.
3. Third, the normal distribution is unimodal, that is, it has only one mode.
4. Fourth, most of the observations are clustered around the center of the distribution with
far fewer observations at the ends, or tails, of the distribution.
5. Finally, when standard deviations are plotted on the x-axis, the percentage of scores
falling between the mean and any point on the x-axis is the same for all normal curves.
• Another feature to consider when talking about a distribution is the shape of the tails of
the distribution on the far left and the far right.
• Kurtosis is the measure of the thickness or heaviness of the tails of a distribution.
• It also refers to how flat or peaked a normal distribution is.
• In other words, kurtosis refers to the degree of dispersion among the scores, that is,
whether the distribution is tall and skinny or short and fat.
• There are three types of distributions based on kurtosis:
• Mesokurtic
• Leptokurtic
• Platykurtic
• Mesokurtic curves have peaks of medium height, and the distribution is moderate in width.
• Leptokurtic curves are tall and thin with only a few scores in the middle of the
distribution having a high frequency.
• Platykurtic curves are short and more dispersed (broader). In a platykurtic curve there
are many scores around the middle score that all have relatively high frequency.

(Figure: mesokurtic curve)
Positively Skewed Distributions
• Most distributions do not approximate a normal or bell-shaped
curve; instead, they are skewed or lopsided.
• In a skewed distribution scores tend to cluster at one end or the
other of the x-axis with the tail of the distribution extending in
the opposite direction.
• In a positively skewed distribution the peak is to the left of the
center point and the tail extends toward the right or in the
positive direction (see Figure)
• A few individuals have extremely high scores that pull the
distribution in that direction.
• Notice also what this skewing does to the mean, median, and
mode.
• These three measures do not have the same value, nor are they all located at the center of
the distribution as they are in a normal distribution.
• The mode, the score with the highest frequency, is the high point of the distribution.
• The median divides the distribution in half.
• The mean is pulled in the direction of the tail of the distribution because the few
extreme scores pull the mean toward them and inflate it.
Negatively Skewed Distributions
• In a negatively skewed distribution, the peak is to the right of the
center point and the tail extends toward the left, or in the
negative direction.
• It is the opposite of a positively skewed distribution; the term
negative refers to the direction of the skew.
• As the figure shows, in a negatively skewed distribution the mean is
pulled toward the left by the few extremely low scores in the
distribution. As in all distributions, the median divides the
distribution in half, and the mode is the most frequently
occurring score in the distribution.
• The shape of a distribution provides valuable information about the distribution.
• For example, should the distribution of marks in an exam be positively skewed or negatively skewed?
• One can see that in a negatively skewed distribution the mode is to the right of the mean, which
means that most of the students have marks above the mean.
• Similarly, the median, the middle of the scores, is higher than the mean; this is another indicator
that the middle of the scatter is above the mean. So the class trend is high if the distribution is negatively skewed.
• Another example of the value of knowing the shape of a distribution is provided by
Harvard paleontologist Stephen Jay Gould (1985).
• Gould was diagnosed in 1982 with a rare form of cancer. He immediately began
researching the disease and learned that it was incurable and had a median
mortality of only 8 months after discovery.
• Rather than immediately assuming that he would be dead in 8 months, Gould realized
this meant that half of the patients lived longer than 8 months.
• Because he was diagnosed with the disease in its early stages and was receiving high-
quality medical treatment, he reasoned that he could expect to be in the half of the
distribution who lived beyond 8 months.
• The other piece of information that Gould found encouraging was the shape of the
distribution.
• Look again at the two distributions and decide which you would prefer in this situation.
• With a positively skewed distribution, the cases to the right of the median could stretch
out for years; this is not true for a negatively skewed distribution.
• The distribution of life expectancy for Gould's disease was positively skewed, and Gould
was obviously in the far right-hand tail of the distribution because he lived and remained
professionally active for another 20 years.
z-Scores
• A z-score, or standard score, is a measure of how many standard deviation units an
individual raw score falls from the mean of the distribution.
• We can convert each exam score to a z-score and then compare the z-scores because they
are then in the same unit of measurement.
• We can think of z-scores as a translation of raw scores into scores of the same language
for comparative purposes.
z = \frac{X - \bar{X}}{S}   (for a sample)

and

z = \frac{X - \mu}{\sigma}   (for a population)
• Conversion to a z-score is a statistical technique that is appropriate for use with data on an
interval or ratio scale of measurement (scales for which means are calculated).
Example:
We want to know how an individual's exam score in Psychology compares with his score in
another class, English.
The individual in the Psychology exam distribution example scored 86 on the exam, where the
class mean was 74.00 with a standard deviation (S) of 13.64.
On the English exam he scored 91, and the class mean was 85 with a standard deviation
of 9.58.
On which exam did the student do better?
z for Psychology = (86 − 74.00)/13.64 ≈ 0.88, and z for English = (91 − 85)/9.58 ≈ 0.63.
Conclusion: The student stands in a better position in the Psychology class than in
English, although he scored higher marks in English.
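A minimal Python sketch (an illustration added to these notes, not part of the original slides) of the same comparison, using the sample formula z = (X − X̄)/S:

```python
# Minimal sketch: compare one student's standing in two classes by converting
# each raw exam score to a z-score with the sample formula z = (X - mean) / S.

def z_score(x, mean, sd):
    """How many standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Values from the example: Psychology 86 (mean 74.00, S 13.64),
# English 91 (mean 85, S 9.58).
z_psych = z_score(86, 74.00, 13.64)
z_english = z_score(91, 85, 9.58)

print(f"Psychology z = {z_psych:.2f}, English z = {z_english:.2f}")
# Psychology z = 0.88, English z = 0.63 -> relatively better in Psychology
```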
Note
• Both z-scores were above the mean, and we decided that the student did better in Psychology,
since his z-score was further above the mean in Psychology than in English.
• When a score is below the mean, the z-score is negative, indicating that the individual's score
is lower than the mean of the distribution.
Example
Suppose you administered a test to a large sample of people and computed the mean and
standard deviation of the raw scores with the following results:
\bar{X} = 45
S = 4
Suppose the scores of several individuals are as follows:

Solution on board
Everyone is close to the mean; no one did exceptionally well.
Henry is the worst, with the lowest marks (−1.5σ); Palm is slightly better (−1.0σ); both are below average.
Debbie is exactly at the average, and Rich is at 1.0σ, slightly better than Debbie. He is the best among the group in
consideration, but not in the whole class.
• Therefore,
• a z-score tells whether an individual raw score is above the mean (a positive z-score) or
below the mean (a negative z-score), and
• how many standard deviations the raw score is above or below the mean.
• Thus z-scores are a means of transforming raw scores to standard scores for purposes of
comparison in both normal and skewed distributions
z-Scores, the Standard Normal Distribution, Probability, and Percentile Ranks
• If the distribution of scores for which we are calculating transformations (z-scores) is
normal, i.e. symmetrical and unimodal, then the distribution of the resulting z-scores is the
Standard Normal Distribution.
• The standard normal distribution has a mean of 0 and a standard deviation of 1; it is a
theoretical distribution with a specific mathematical formula.
• All other normal curves approximate the Standard Normal Curve to a greater or lesser
extent.
• A researcher can also determine the probability of occurrence of a score that is higher or
lower than any other score in the distribution
• The proportions under the Standard Normal Curve only hold for normal distributions, not
for skewed distributions.
• Even though z-scores may be calculated on skewed distributions, the proportions under
the Standard Normal Curve do not hold for skewed distributions.
• Figure represents the area under the Standard Normal Curve in terms of standard
deviations
• The figure shows that:
• approximately 68% of the observations in the distribution fall between -1.0 and 1.0 standard
deviations from the mean.
• 13.5% of the observations fall between 1.0 and 2.0 and another 13.5% between 1.0 and
2.0 and
• 2% between 2.0 and 3.0

• Only 0.13% of the scores are beyond a z-


score of 3.0.
• If we sum the percentages, we have 100%, all of the area under the curve, representing
everybody in the distribution.
• If we sum half of the curve, we have 50%, half of the distribution.
• We have seen that with a curve that is normal (or symmetrical), the mean, median, and
mode are all at the center point;
• So, in a normal distribution, 50% of the scores are above this center point and 50% are below it.
• This property helps us determine probabilities.
• A probability is defined as the expected relative frequency of a particular outcome,
which could be the result of an experiment or any situation in which the result is not
known in advance.
• For example, in a normal distribution the probability that a score will fall at or below the mean
is 50%.
• The probability of a score falling within a given area is equal to the proportion of scores in that area.
• Figure gives an estimate of the proportions under the normal curve.
• This information is provided in Table B.1 in Appendix B. A small portion of this table is
shown in Table 16.2.
• In the last example, we estimated the value of
z for English was 0.626, whereas z in
Psychology was 0.880
• The respective areas (between the mean and z) for these z values are
0.2357 for z ≈ 0.63 and 0.3106 for z = 0.88.
• In the previous example, Rich had a z equal to 1.0.
• This means he is one standard deviation above the mean,
which corresponds to an area of 0.34134.
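These tabled areas need not be looked up; a short Python sketch (an addition to these notes) reproduces the "area between mean and z" values from the standard normal CDF using math.erf:

```python
# Minimal sketch: reproduce the "area between mean and z" column of a standard
# normal table using only the math module (normal CDF via the error function).
from math import erf, sqrt

def area_mean_to_z(z):
    """Proportion of a standard normal distribution between the mean and z."""
    cdf = 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF at z
    return cdf - 0.5

for z in (0.63, 0.88, 1.00):
    print(f"z = {z:.2f}: area from mean = {area_mean_to_z(z):.4f}")
# Prints 0.2357, 0.3106, and 0.3413, matching the values quoted above.
```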
• Let us take the second example again. It was:
\bar{X} = 45
S = 4
• The z-scores were as follows:
Percentile Rank
• The standard normal curve can also be used to determine an individual's percentile rank:
a score that indicates the percentage of people who scored at or below a given raw
score.
• Again, to determine a percentile rank, we must first know the individual’s z-score
• Let us say we want to calculate an individual’s percentile rank based on the person’s score
on an intelligence test
• The scores on the intelligence test are normally distributed with
\mu = 100
\sigma = 15
• Let us assume the person scored 119 marks, i.e. X = 119, so z = (119 − 100)/15 ≈ 1.27.
• From the "area between mean and z" column, for z = 1.27 the proportion is 0.3980.
• To determine all of the area below the score, we must add 0.50 to 0.3980,
i.e. 0.50 + 0.3980 = 0.8980
• This places the intelligence test score of 119 at the 89.80th percentile,
which means that 89.8% of the students scored lower marks than he did.
• What do we do when the z-score is negative?
• What if we know an individual's percentile rank and want to determine the
person's raw score?
• Going back to the earlier example:

\bar{X} = 45
S = 4

z = \frac{X - \bar{X}}{S}

• Given the mean, the standard deviation, and the z-score, one can determine the data point
(raw score): X = \bar{X} + zS.
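A minimal Python sketch (added for illustration; the function names are hypothetical, not from the slides) covering both directions discussed above, the percentile rank for a raw score and the raw score for a given z-score:

```python
# Minimal sketch: percentile rank from a raw score, and a raw score from a z-score.
from math import erf, sqrt

def percentile_rank(x, mean, sd):
    """Percentage of a normal distribution scoring at or below raw score x.

    Uses the full normal CDF, so scores below the mean (negative z) work too.
    """
    z = (x - mean) / sd
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

def raw_score(z, mean, sd):
    """Raw score corresponding to a given z-score: X = mean + z * sd."""
    return mean + z * sd

print(percentile_rank(119, 100, 15))  # ~89.7 (the table, with z rounded to 1.27, gives 89.80)
print(raw_score(1.0, 45, 4))          # 49.0, i.e. one standard deviation above the mean of 45
```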
Correlation Coefficients
• We have covered correlation coefficients in previous lectures.
• The most commonly used correlation coefficient is the Pearson product moment
correlation coefficient, usually referred to as Pearson’s r (r is the statistical notation used
to report this correlation coefficient)
• Pearson's r is used for data measured on an interval or ratio scale of measurement.
• Refer to Figure 9.1, which presents a scatterplot of height and weight data for 20
individuals. The data are given in Table 16.3.
• The value of r is determined as follows:

r = \frac{\sum \left( \frac{X - \bar{X}}{s_x} \right) \left( \frac{Y - \bar{Y}}{s_y} \right)}{n - 1}
where s_x and s_y are the sample standard deviations of X and Y, and n is the number of pairs of scores.
Table 16.3 (Module 16)

Sr. No | Weight (lb) | Height (in) | Zxi     | Zyi     | Zxi·Zyi
1      | 100         | 60          | -1.5779 | -1.5800 |  2.4930
2      | 120         | 61          | -0.9371 | -1.3664 |  1.2805
3      | 105         | 63          | -1.4177 | -0.9394 |  1.3319
4      | 115         | 63          | -1.0973 | -0.9394 |  1.0309
5      | 119         | 65          | -0.9692 | -0.5124 |  0.4966
6      | 134         | 65          | -0.4886 | -0.5124 |  0.2504
7      | 129         | 66          | -0.6488 | -0.2989 |  0.1939
8      | 143         | 67          | -0.2002 | -0.0854 |  0.0171
9      | 151         | 65          |  0.0561 | -0.5124 | -0.0287
10     | 163         | 67          |  0.4405 | -0.0854 | -0.0376
11     | 160         | 68          |  0.3444 |  0.1281 |  0.0441
12     | 176         | 69          |  0.8570 |  0.3416 |  0.2928
13     | 165         | 70          |  0.5046 |  0.5551 |  0.2801
14     | 181         | 72          |  1.0172 |  0.9821 |  0.9991
15     | 192         | 76          |  1.3697 |  1.8362 |  2.5149
16     | 208         | 75          |  1.8823 |  1.6227 |  3.0543
17     | 200         | 77          |  1.6260 |  2.0497 |  3.3327
18     | 152         | 68          |  0.0881 |  0.1281 |  0.0113
19     | 134         | 66          | -0.4886 | -0.2989 |  0.1460
20     | 138         | 65          | -0.3604 | -0.5124 |  0.1847

Mean X = 149.25, Mean Y = 67.4
Std-Dev S_x = 31.21, Std-Dev S_y = 4.68
Sum of Zxi·Zyi = 17.89
r = 17.89 / (20 − 1) = 0.941

(Scatter Plot 9.1: weight in lb on the x-axis, 90 to 230; height in inches on the y-axis, 0 to 90.)
Interpretation of results
• The positive sign tells us that the variables increase and decrease together.
• The large magnitude (close to 1.00) tells us that there is a strong positive relationship
between height and weight:
• Those who are taller tend to weigh more, whereas those who are shorter tend to weigh
less.
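The calculation in Table 16.3 can be reproduced with a short Python sketch (an illustration added to these notes, not the course spreadsheet): standardize each variable with its sample standard deviation, multiply the paired z-scores, sum, and divide by n − 1.

```python
# Minimal sketch: Pearson's r computed exactly as in Table 16.3, plus r squared.
from statistics import mean, stdev

weight = [100, 120, 105, 115, 119, 134, 129, 143, 151, 163,
          160, 176, 165, 181, 192, 208, 200, 152, 134, 138]
height = [60, 61, 63, 63, 65, 65, 66, 67, 65, 67,
          68, 69, 70, 72, 76, 75, 77, 68, 66, 65]

mx, my = mean(weight), mean(height)    # 149.25 and 67.4
sx, sy = stdev(weight), stdev(height)  # ~31.21 and ~4.68 (sample SDs)

# Product of paired z-scores for each individual, averaged over n - 1.
zx_zy = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(weight, height)]
r = sum(zx_zy) / (len(weight) - 1)

print(f"r = {r:.3f}, r^2 = {r**2:.3f}")   # r = 0.941, r^2 = 0.886
```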
Coefficient of Determination (r2)
• Coefficient of determination (r2) is a measure of the proportion of the variance in one
variable that is accounted for by another variable
• In our group of 20 individuals there is variation in both the height and weight variables,
and some of the variation in one variable can be accounted for by the other
• We could say that some of the variation in the weights of the 20 individuals can be
explained by the variation in their heights
• Some of the variation in their weights, however, cannot be explained by the variation in
height.
• It might be explained by other factors such as genetic predisposition, age, fitness level, or
eating habits.
• The coefficient of determination tells us how much of the variation in weight is accounted
for by the variation in height.
• In our example, r² = (0.94)² ≈ 0.8836, so roughly 88% of the variance in weight can be accounted
for by the variance in height, a very high coefficient of determination.
• Depending on the research area, the coefficient of determination may be much lower and
still be important.
Advanced Correlational Techniques: Regression Analysis
• Regression analysis: a procedure that allows us to predict an individual's score on one
variable based on knowing one or more other variables.
• Regression analysis involves determining the equation for the best fitting line for a data
set.
• The equation of the line is: Y = bX + a
where b is the slope and a is the intercept of the regression line.

Determination of a and b:

b = \frac{\sum xy}{\sum x^2}, \qquad a = \bar{Y} - b\bar{X}

where x = X - \bar{X} and y = Y - \bar{Y}
CorrelationCoeffCalc.xlsx
Table 16.3 (Module 16) - Regression Analysis

Sr. No | Weight (lb) | Height (in) | x = X − X̄ | y = Y − Ȳ | xy      | x²
1      | 100         | 60          | -49.25    | -7.40     |  364.45 | 2425.56
2      | 120         | 61          | -29.25    | -6.40     |  187.20 |  855.56
3      | 105         | 63          | -44.25    | -4.40     |  194.70 | 1958.06
4      | 115         | 63          | -34.25    | -4.40     |  150.70 | 1173.06
5      | 119         | 65          | -30.25    | -2.40     |   72.60 |  915.06
6      | 134         | 65          | -15.25    | -2.40     |   36.60 |  232.56
7      | 129         | 66          | -20.25    | -1.40     |   28.35 |  410.06
8      | 143         | 67          |  -6.25    | -0.40     |    2.50 |   39.06
9      | 151         | 65          |   1.75    | -2.40     |   -4.20 |    3.06
10     | 163         | 67          |  13.75    | -0.40     |   -5.50 |  189.06
11     | 160         | 68          |  10.75    |  0.60     |    6.45 |  115.56
12     | 176         | 69          |  26.75    |  1.60     |   42.80 |  715.56
13     | 165         | 70          |  15.75    |  2.60     |   40.95 |  248.06
14     | 181         | 72          |  31.75    |  4.60     |  146.05 | 1008.06
15     | 192         | 76          |  42.75    |  8.60     |  367.65 | 1827.56
16     | 208         | 75          |  58.75    |  7.60     |  446.50 | 3451.56
17     | 200         | 77          |  50.75    |  9.60     |  487.20 | 2575.56
18     | 152         | 68          |   2.75    |  0.60     |    1.65 |    7.56
19     | 134         | 66          | -15.25    | -1.40     |   21.35 |  232.56
20     | 138         | 65          | -11.25    | -2.40     |   27.00 |  126.56

Mean X = 149.25, Mean Y = 67.4
Σxy = 2615.00
Σx² = 18509.75
b (slope) = 2615.00 / 18509.75 ≈ 0.1413
a (intercept) = 67.4 − (0.1413)(149.25) ≈ 46.314

(Regression chart: the fitted line y = 0.1413x + 46.314 with R² = 0.8864; weight in lb on the x-axis, 90 to 230; height in inches on the y-axis, 0 to 90.)
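The slope and intercept in the table can be checked with a brief Python sketch (added for illustration; it repeats the weight and height data from Table 16.3 so that it runs on its own), applying b = Σxy / Σx² and a = Ȳ − bX̄:

```python
# Minimal sketch: least-squares slope and intercept from deviation scores.
from statistics import mean

weight = [100, 120, 105, 115, 119, 134, 129, 143, 151, 163,
          160, 176, 165, 181, 192, 208, 200, 152, 134, 138]
height = [60, 61, 63, 63, 65, 65, 66, 67, 65, 67,
          68, 69, 70, 72, 76, 75, 77, 68, 66, 65]

mx, my = mean(weight), mean(height)
sum_xy = sum((x - mx) * (y - my) for x, y in zip(weight, height))  # 2615.00
sum_x2 = sum((x - mx) ** 2 for x in weight)                        # 18509.75

b = sum_xy / sum_x2   # ~0.1413 (slope)
a = my - b * mx       # ~46.314 (intercept)

print(f"height ~ {b:.4f} * weight + {a:.3f}")
print(f"predicted height at 150 lb: {b * 150 + a:.1f} in")   # ~67.5 in
```

The printed equation matches the y = 0.1413x + 46.314 line shown on the regression chart.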
Coefficient of Determination
We know Y_c = a + bX (the predicted value of Y), and

r^2 = \frac{b \sum xy}{\sum y^2}

where x = X - \bar{X} and y = Y - \bar{Y}.
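As a quick numerical check (a computation added here; Σy² is not shown in the table, but from the Table 16.3 heights Σy² = Σ(Y − Ȳ)² = 416.8):

r^2 = \frac{b \sum xy}{\sum y^2} = \frac{(0.1413)(2615)}{416.8} \approx 0.886

which agrees with the R² = 0.8864 on the regression chart and, allowing for rounding, with (0.941)² ≈ 0.885 from Pearson's r.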
End
