
A broad view of the topics we will cover is given in

http://onlinestatbook.com/Online_Statistics_Education.pdf
There is no exposition of the theory of probability here; it should be studied entirely from the Online
Stat Book.

Introduction:
Statistics includes numerical facts and figures and involves the calculation of numbers. But it also relies
heavily on how the numbers are chosen and how they are interpreted. For example, it is easy
to find flaws in the following statements:

• A new advertisement for Ben and Jerry's ice cream introduced in late May of last year resulted
in a 30% increase in ice cream sales for the following three months. Thus, the advertisement
was effective.
• The more churches in a city, the more crime there is. Thus, churches lead to crime.
• 75% more interracial marriages are occurring this year than 25 years ago. Thus, our society
accepts interracial marriages.
Thus, statistics are not only facts and figures; they are something more than that. In the broadest
sense, “statistics” refers to a range of techniques and procedures for analyzing, interpreting, displaying,
and making decisions based on data.
Statistics are often presented to add credibility to an argument or advice. You can see this by paying
attention to television advertisements. Many of the numbers thrown about in this way do not represent
careful statistical analysis. They can be misleading and push you into decisions that you might find
cause to regret.
The British Prime Minister Benjamin Disraeli is quoted by Mark Twain as having said, “There are
three kinds of lies -- lies, damned lies, and statistics.” This quote reminds us why it is so important to
understand statistics. So let us invite you to reform your statistical habits from now on. No longer will
you blindly accept numbers or findings. Instead, you will begin to think about the numbers, their
sources, and most importantly, the procedures used to generate them.

Types of Statistics:
Descriptive statistics are numbers that are used to summarize and describe data. The word “data”
refers to the information that has been collected from an experiment, a survey, an historical record,
etc. Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at
hand. Generalizing from our data to another set of cases is the business of inferential statistics.
Examples of descriptive statistics include averages and the range.
Populations and samples: In statistics, we often rely on a sample --- that is, a small subset of a larger
set of data --- to draw inferences about the larger set. The larger set is known as the population from
which the sample is drawn.
Example #1: We are interested in examining how many math classes have been taken on average by
current graduating seniors at Indian colleges and universities during their three years in school. Our
population involves the graduating seniors throughout the country. This is a large set since there are
thousands of colleges and universities, each enrolling many students. It would be prohibitively costly
to examine the transcript of every college senior. We therefore take a sample of college seniors and
then make inferences to the entire population based on what we find. To make the sample, we might
first choose some public and private colleges and universities across India. Then we might sample 50
students from each of these institutions. Suppose that the average number of math classes taken by
the people in our sample were 3.2. Then we might speculate that 3.2 approximates the number we
would find if we had the resources to examine every senior in the entire population. But we must be
careful about the possibility that our sample is non-representative of the population. Perhaps we chose
an overabundance of math majors, or chose too many technical institutions that have heavy math
requirements. Such bad sampling makes our sample unrepresentative of the population of all seniors.
Researchers adopt a variety of sampling strategies. The most straightforward is simple random
sampling. Such sampling requires every member of the population to have an equal chance of being
selected into the sample. In addition, the selection of one member must be independent of the
selection of every other member. That is, picking one member from the population must not increase
or decrease the probability of picking any other member.
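As a rough sketch in Python, assuming an invented roster of 500 students, a simple random sample can be drawn as follows:

    import random

    # Hypothetical population roster; the names are invented for illustration.
    population = [f"student_{i}" for i in range(1, 501)]

    # random.sample draws without replacement: every member has an equal
    # chance of being selected, independently of the other selections.
    sample = random.sample(population, k=50)
    print(sample[:5])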
Example # 2: A research scientist is interested in studying the experiences of twins raised together
versus those raised apart. She obtains a list of twins from the National Twin Registry, and selects two
subsets of individuals for her study. First, she chooses all those in the registry whose last name begins
with Z. Then she turns to all those whose last name begins with B. Because there are so many names
that start with B, however, our researcher decides to incorporate only every other name into her
sample. Finally, she mails out a survey and compares characteristics of twins raised apart versus
together.
What is the population? What is the sample? Is the sample picked by simple random sampling? Is this
sample biased? Explain.
Question: A random sample of 20 subjects is taken from a population having equal numbers of males
and females. Find the probability that 70% or more of the sample will be female. [Hint: use concepts
you have learnt in the theory of probability.] Repeat this exercise with 50 subjects.
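One way to check your hand calculation: treating the population as large enough that each draw is roughly an independent 50-50 event, the number of females in the sample can be modelled as a Binomial(n, 0.5) variable, and the upper tail summed directly (a minimal sketch in Python):

    from math import comb

    def prob_at_least(n, k, p=0.5):
        # P(X >= k) for X ~ Binomial(n, p): sum the upper tail directly.
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

    # 70% of 20 subjects is 14; 70% of 50 subjects is 35.
    print(prob_at_least(20, 14))   # roughly 0.058
    print(prob_at_least(50, 35))   # roughly 0.003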
Moral of the story: sample size matters. A small sample may fail to be representative even though it
is drawn randomly. Only a large sample size makes it likely that our sample is close to
representative of the population. For this reason, inferential statistics take into account the sample
size when generalizing results from samples to populations.
Some other sampling techniques:

1. In experimental research, populations are often hypothetical. For example, in an experiment
comparing the effectiveness of a new anti-depressant drug with a placebo, there is no actual
population of individuals taking the drug. In this case, a specified population of people with
some degree of depression is defined and a random sample is taken from this population. The
sample is then randomly divided into two groups; one group is assigned to the treatment
condition (drug) and the other group is assigned to the control condition (placebo). This
random division of the sample into two groups is called random assignment. Random
assignment is critical for the validity of an experiment. For example, consider the bias that
could be introduced if the first 20 subjects to show up at the experiment were assigned to the
experimental group and the second 20 subjects were assigned to the control group. It is
possible that subjects who show up late tend to be more depressed than those who show up
early, thus making the experimental group less depressed than the control group even before
the treatment was administered.
2. Since simple random sampling often does not ensure a representative sample, a sampling
method called stratified random sampling is sometimes used to make the sample more
representative of the population. This method can be used if the population has a number of
distinct “strata” or groups. In stratified sampling, you first identify the members of the
population who belong to each group. Then you randomly sample from each of those subgroups
in such a way that the sizes of the subgroups in the sample are proportional to their sizes in the
population (a minimal sketch in code follows this list).
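A minimal sketch of proportional stratified sampling in Python; the helper names are invented for illustration, and the rounding may make the realized sample size differ slightly from the target:

    import random
    from collections import defaultdict

    def stratified_sample(population, stratum_of, total_n):
        # Group the population by stratum.
        strata = defaultdict(list)
        for unit in population:
            strata[stratum_of(unit)].append(unit)

        sample = []
        for members in strata.values():
            # Allocate to each stratum in proportion to its population share.
            n_stratum = round(total_n * len(members) / len(population))
            sample.extend(random.sample(members, n_stratum))
        return sample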

Measurement levels:
Before we can conduct a statistical analysis, we need to collect data, i.e., we need to measure the
variables of interest. Although procedures for measurement differ in many ways, they can be classified
using a few fundamental categories:

• Measurable (or continuous) variables are those whose values are numerical measurements
(e.g., height, weight, or reaction time).
• Categorical variables are those that describe attributes of the object of interest. They can be
further classified into:
o Nominal scales, which only name or categorize responses. Gender, handedness,
favorite color, and religion are examples of variables measured on a nominal scale. The
essential point about nominal scales is that they do not imply any ordering among the
responses.
o Ordinal scales, which provide an ordering or comparison among the different values
of the variable. A researcher wishing to measure consumers' satisfaction with their
microwave ovens might ask them to specify their feelings as either “very dissatisfied,”
“somewhat dissatisfied,” “somewhat satisfied,” or “very satisfied.” The items in this
scale are ordered, ranging from least to most satisfied.

Distribution:
In a bag of Plain M&M's, the M&M's come in six different colors. A quick count showed that there
were 55 M&M's: 17 brown, 18 red, 7 yellow, 7 green, 2 blue, and 4 orange. A tabular form of
representing this is called a frequency distribution. A frequency distribution may be represented
graphically using a bar graph. If there is a small number of categories, as in the example above, the
frequency distribution may also be represented using a pie chart, which converts each category's
percentage into an angle so that the category can be drawn as a sector of a circle.
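A minimal sketch of tabulating this frequency distribution in Python, including the relative frequencies and the pie-chart angles:

    from collections import Counter

    # Counts from the M&M example above.
    counts = Counter(brown=17, red=18, yellow=7, green=7, blue=2, orange=4)
    total = sum(counts.values())

    for color, n in counts.items():
        angle = 360 * n / total          # sector angle in degrees
        print(f"{color:>6}: {n:2d}  {n / total:6.1%}  {angle:5.1f} deg")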
Often, we need to compare the results of different surveys, or of different conditions within the same
overall survey. In this case, we are comparing the “distributions” of responses between the surveys or
conditions. Bar charts are often excellent for illustrating differences between two distributions, e.g.,
the number of people playing different computer games on Saturdays versus Sundays.
But consider the variable “time taken by GRE test takers to respond to a question”. Since time is
continuous, if measured accurately enough, every observation in the sample would have a different
value, each with a frequency of 1. The solution to this problem is to create a grouped frequency distribution.
In a grouped frequency distribution, scores falling within various ranges are tabulated. Grouped
frequency distributions can be portrayed graphically using a histogram.
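As a sketch, with invented response times in seconds, a grouped frequency distribution can be tabulated by sorting scores into equal-width bins:

    # Invented response times (seconds); real GRE data would look different.
    times = [11.2, 12.7, 13.1, 14.8, 15.0, 15.3, 17.9, 18.4, 21.5, 23.0]

    width = 5.0
    bins = {}
    for t in times:
        lo = width * int(t // width)     # lower edge of the bin containing t
        bins[(lo, lo + width)] = bins.get((lo, lo + width), 0) + 1

    for (lo, hi), n in sorted(bins.items()):
        print(f"[{lo:4.1f}, {hi:4.1f}): {n}")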
Stem and leaf diagrams are useful ways of representing data sets that are not very large. They give
a good idea of the shape of the distribution. Two-sided (back-to-back) stem and leaf diagrams may be
used to compare two distributions, e.g., the final exam scores of two different sections in Statistics.

Central Tendency:
Imagine this situation: You are in a class with just four other students, and the five of you took a 5-
point pop quiz. Today your instructor is walking around the room, handing back the quizzes. She
stops at your desk and hands you your paper. Written in bold black ink on the front is “3/5.” How
do you react? Are you happy with your score of 3 or disappointed? How do you decide? You might
calculate your percentage correct, realize it is 60%, and be appalled. But it is more likely that when
deciding how to react to your performance, you will want additional information. What additional
information would you like? If you are like most students, you will immediately ask your neighbors,
“Whad'ja get?” and then ask the instructor, “How did the class do?” In other words, the additional
information you want is how your quiz score compares to other students' scores. You therefore
understand the importance of comparing your score to the class distribution of scores. Should your
score of 3 turn out to be among the higher scores, then you'll be pleased after all. On the other hand,
if 3 is among the lower scores in the class, you won't be quite so happy.
This comparison requires that you know the distribution of the scores of the class: in particular,
you would want to know the center and the spread of the distribution. The center of the distribution
can be defined in two ways:
(a) the value which minimizes the sum of absolute deviations (differences) (Eg.), and
(b) the value which minimizes the sum of squared deviations (Eg.).
The arithmetic mean is the sum of the numbers divided by the number of numbers. The median is
the midpoint of a distribution: the same number of scores is above the median as below it. In terms
of the center of the distribution, the median is the value that minimizes the sum of absolute deviations,
and the arithmetic mean (or simply, the mean) is the value that minimizes the sum of squared deviations.
The mode is the value with the highest frequency. Distributions may be bimodal, indicating that
two values share the highest frequency. Multimodal distributions may be defined likewise.
Yet another measure of central tendency is the geometric mean, which is computed by multiplying
all the numbers together and then taking the nth root of the product.
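A minimal sketch of these measures using Python's standard library; the quiz scores anticipate the class example below, and the growth factors for the geometric mean are invented:

    import math
    import statistics

    scores = [0, 0, 0, 3, 5]             # five quiz scores
    print(statistics.mean(scores))       # 1.6
    print(statistics.median(scores))     # 0
    print(statistics.mode(scores))       # 0

    # Geometric mean: the nth root of the product (positive numbers only).
    factors = [1.1, 1.2, 0.9]
    print(math.prod(factors) ** (1 / len(factors)))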
For symmetric distributions, the mean and the median are equal (Eg.). They differ in the case of
skewed distributions. Consider the following positively skewed distribution:

[Figure: a positively skewed distribution]
When distributions have a positive skew, the mean is typically higher than the median, although it may
not be in bimodal distributions.
A large skew results in very different values for these measures. Suppose that in a class of 5 pupils, 3 get
0 in a test and the other two get 3 and 5 respectively. If you were asked the very general question,
“So, how did the class do?” and answered with the mean of 1.6, you would not have told the whole
story, since 60% of the class scored below that. If you answered with the median or mode of
0, you would give the impression that no student in the class could answer any question.
Fortunately, there is no need to summarize a distribution with a single number. We may report a mix
of the statistics that represent the data in the most concise way.
Distributions with positive skew normally have larger means than medians, while distributions with
negative skew have smaller means than medians. In other words, low-frequency high observations pull
the mean up in the case of positive skew, while low-frequency small observations pull the mean down in
the case of negative skew. The median is always at the center of the distribution. (Explain using a diagram.)

Variability:
Variability refers to how “spread out” a group of scores is. (Eg., different distributions of scores with the same
mean.) Three frequently used measures of variability are the range, the variance, and the standard deviation.
The range is simply the highest score minus the lowest score.
Variability can also be defined in terms of how close the scores in the distribution are to the middle
of the distribution. Using the mean as the measure of the middle of the distribution, the variance
(denoted by σ²) is defined as the average squared difference of the scores from the mean. Standard
deviation is simply the square root of variance.
Variance = σ² = Σ(x − µ)² / N,
where µ is the mean, N is the number of scores, and σ is the standard deviation.
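Computing these directly from the definition, reusing the five quiz scores from above:

    scores = [0, 0, 0, 3, 5]
    N = len(scores)
    mu = sum(scores) / N                               # mean = 1.6

    variance = sum((x - mu) ** 2 for x in scores) / N  # population variance = 4.24
    std_dev = variance ** 0.5                          # roughly 2.06
    print(mu, variance, std_dev)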

A note on central tendency and dispersion of random variables: A random variable is a variable
whose numerical value is determined by chance: it can take on, or represent, any possible element of a
sample space. The elements of a sample space have probabilities associated with them (given by a
probability function). For example, the outcome of a toss of a coin and the outcome of a roll of a fair
die are random variables. What are the probabilities associated with them?
The mean of a random variable is called its expectation. The expectation of a random variable is
the sum of the products of each value of the variable and the probability associated with that value:
µₓ = Σᵢ xᵢ pᵢ. The variance of a random variable is the sum of the products of the squared difference
between each value and the mean, and the associated probability:
σ² = Σᵢ (xᵢ − µₓ)² pᵢ.

Example: There is a 30% chance that a company will yield a 50% return and a 70% chance that the
company makes a loss of 10%. Find the expected return. Find the standard deviation.
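One way to work this example numerically (returns are in percent):

    # (probability, return %) for each outcome.
    outcomes = [(0.30, 50.0), (0.70, -10.0)]

    mu = sum(p * x for p, x in outcomes)               # expected return = 8.0
    var = sum(p * (x - mu) ** 2 for p, x in outcomes)  # variance = 756.0
    print(mu, var ** 0.5)                              # std dev roughly 27.5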

Effects of Linear Transformations on Central Tendency and Variability:


To motivate this with an example, consider the relationship between the Centigrade and
Fahrenheit scales: C = 0.556F − 17.778. Note that you just need to multiply the mean temperature in
Fahrenheit by 0.556 and then subtract 17.778 to get the mean in Centigrade. The same is true for the
median.
The formula for the standard deviation is just as simple: the standard deviation in degrees Centigrade
is equal to the standard deviation in degrees Fahrenheit times 0.556. Since the variance is the standard
deviation squared, the variance in degrees Centigrade is equal to 0.556² times the variance in degrees
Fahrenheit. To sum up, if a variable X has a mean of μ, a standard deviation of σ, and a variance of
σ², then a new variable Y created using the linear transformation Y = bX + A will have a mean of
bμ + A, a standard deviation of bσ, and a variance of b²σ².
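A quick numerical check of these rules, using invented Fahrenheit readings:

    import statistics

    f_temps = [50.0, 59.0, 68.0, 77.0, 86.0]      # invented data
    b, a = 0.556, -17.778
    c_temps = [b * x + a for x in f_temps]        # Y = bX + A

    # The mean transforms as b*mu + A; the standard deviation scales by |b|.
    print(b * statistics.mean(f_temps) + a, statistics.mean(c_temps))
    print(b * statistics.pstdev(f_temps), statistics.pstdev(c_temps))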
In many situations, it may be important to find the variance of the sum of two variables. For example,
in a class of 30 students, you find the scores of the students in Statistics and Economics and then add
them up. You now want to know the variation in the sum. Provided the two scores are independent
(uncorrelated), the variance of this sum may be computed according to the following formula:
σ²sum = σ²S + σ²E. Therefore, if the variances of the Statistics and Economics scores are respectively 0.9
and 0.8, then the variance of the sum would be 1.7. The formula for the variance of the difference
between the two variables is the same as the formula for the sum. More generally,
σ²S±E = σ²S + σ²E.
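A quick simulation, with invented normally distributed scores whose variances match the 0.9 and 0.8 above, to check that the variances of independent variables add:

    import random
    import statistics

    random.seed(1)
    s = [random.gauss(60, 0.9 ** 0.5) for _ in range(100_000)]   # Statistics scores
    e = [random.gauss(55, 0.8 ** 0.5) for _ in range(100_000)]   # Economics scores

    total = [x + y for x, y in zip(s, e)]
    print(statistics.pvariance(total))                            # close to 1.7
    print(statistics.pvariance(s) + statistics.pvariance(e))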
