
CHAPTER 1

A population consists of all the individuals that are of interest to the study or case.

A sample is a part or subset of the population. For example, all the kids in a school are the
population, and the girls (or the boys) selected for a specific study are a sample from that population.

Parameters are the numbers that summarize or define the entire population.

Statistics are the numbers that summarize or define the sample or subset of the population.
For example: the average height of all students in the school is a parameter, while the average height
computed from a sample of students is a statistic.

Descriptive statistics involve calculating, summarizing, and organizing given data using numbers, bar graphs,
histograms, the mean, the median, the standard deviation, etc. This is done with a known population or sample.

Inferential statistics involve making an educated guess using numbers and probability from a sample for
an unknown population. The level of confidence increases and the margin of error decreases as the
sample size increases. For example, making a guess for 100000 people based on a sample of 100 people.

(For test 1)

A confidence level is the percentage of times you are right about an estimation. For example,
you are right about a person being 40-50 years old 95% of the time.

A significance level is the percentage of times you are willing to be wrong about a conclusion. This is
typically used for a yes-or-no question. For example, at a 5% significance level you are wrong about a
person being 43 years old 5% of the time.

CHAPTER 2

Types of data:

 Nominal data is data that can be put into categories. It is also known as categorical
data or qualitative data. No calculations are possible with this type of data; however, the
frequencies of the categories can be counted.
 Ordinal data is nominal data that can also be put into an order, for example,
the sizes of drinks or shirts. No calculations are allowed with this type of data either,
except counting the frequencies and/or ordering the data.
 Interval data, also called numerical or quantitative data, is data that can be measured as counts
or numeric values. All calculations are possible with this type of data because it is in numeric
values.

The hierarchy of data goes from interval data at the top to ordinal data in between and nominal data
at the bottom. The data at the top can be converted into the data below it but the opposite is not
possible. For example, the ages of people who buy movie tickets is interval data and could be
classified as ordinal data by dividing the ages into categories of senior, adult, and child. This data
could be further classified as nominal data by summarizing the categories as adult and other
(including child and senior).

A Frequency distribution table shows the count (frequency) of each category in the
data, whereas a Relative frequency distribution table shows the proportion of each
category's frequency with respect to the total number of observations (n) {the total
of rF is always equal to 1.0}. The modal class is the class with the highest frequency among all other
classes. Relative frequencies can also be represented as percentages when asked.

A cross-classification table shows the representation of two types of nominal data together.

RELATIVE FREQUENCY = f/n

A Bar chart is the graphical representation of nominal data as bars, with the categories on the
horizontal axis and the frequencies on the vertical axis.

A Pie chart is the visual representation of the relative frequencies of data in the form of a circle,
with each category's slice sized in degrees.

DEGREES = rF * 360
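The two formulas above can be sketched quickly in Python (the category data here is made up for illustration):

```python
from collections import Counter

# Hypothetical nominal data: drink sizes sold at a cafe
data = ["small", "medium", "large", "medium", "small", "medium", "large", "medium"]
n = len(data)

freq = Counter(data)                                    # frequency distribution table
rel_freq = {k: f / n for k, f in freq.items()}          # relative frequency = f/n
degrees = {k: rf * 360 for k, rf in rel_freq.items()}   # pie-chart slice angles

print(freq)        # counts per category
print(rel_freq)    # proportions; they sum to 1.0
print(degrees)     # slice sizes; they sum to 360
```

Note that the relative frequencies always total 1.0, so the slice angles always total 360 degrees.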

CHAPTER 3

A Histogram is the graphical representation of the number of frequencies of interval data. The
classes do not overlap each other in a histogram.

The size of a class in a histogram, or the class width, is given by:

Class width = (largest value - smallest value) / number of classes

Whereas the number of classes to be considered in a histogram is given by:

Sturges’s formula: Number of classes = 1 + 3.3 log (n)

However, this is not required, merely a suggestion.
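A small Python sketch of both formulas, using a made-up data set of 50 values ranging from 12 to 87:

```python
import math

# Hypothetical data set: 50 observations, smallest 12, largest 87
n = 50
smallest, largest = 12, 87

# Sturges's formula: number of classes = 1 + 3.3 log10(n)
num_classes = round(1 + 3.3 * math.log10(n))   # 1 + 3.3 * log10(50) ~ 6.6, rounds to 7

# Class width = (largest - smallest) / number of classes, rounded up
# so the classes cover the whole range
class_width = math.ceil((largest - smallest) / num_classes)   # 75 / 7 ~ 10.7, rounds up to 11

print(num_classes, class_width)
```

Rounding the width up is a common convention so that the classes together span all the data.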


An ogive is a graphical representation of cumulative frequency. A cumulative frequency
distribution shows the proportion of observations that fall at or below each class interval
(i.e., the sum of all the frequencies in the previous classes along with the current class).
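The running totals behind an ogive can be computed like this (the class frequencies below are made up):

```python
from itertools import accumulate

# Hypothetical frequencies for five class intervals
frequencies = [3, 7, 12, 6, 2]
n = sum(frequencies)

cumulative = list(accumulate(frequencies))       # running totals: previous classes + current class
cumulative_rel = [c / n for c in cumulative]     # proportions plotted on the ogive

print(cumulative)
print(cumulative_rel)   # the last value is always 1.0
```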

A stem and leaf display is an easier and quicker way to summarize data than drawing a histogram. The
stem and leaf display's advantage over the histogram is that we can see the actual observations rather
than observations classified into different classes.
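As a rough sketch, a stem and leaf display can be built by splitting each value into its tens digit (stem) and units digit (leaf); the scores below are invented:

```python
from collections import defaultdict

# Hypothetical exam scores
scores = [52, 57, 61, 64, 64, 68, 70, 73, 79, 81, 85]

stems = defaultdict(list)
for s in sorted(scores):
    stems[s // 10].append(s % 10)   # tens digit is the stem, units digit is the leaf

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
```

The printed rows look like a histogram turned on its side, but every individual observation is still visible.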
CHAPTER 4

Measures of central tendency summarize large groups of numbers into 1 number. For example, the
mean, median and mode are measures of central tendency. We might use different measures of central
tendency because they might be more appropriate depending upon the given data.

Mean basically means the average of all the given data. It is the most common and important measure
of central tendency. It is used when the given data is unimodal and symmetrical in nature.

Median is the central point, or the middle point, of a given set of data. It is used when the given data is
skewed or ordinal. To find the median, order the data and then choose the middle value; if there are
two middle values, take their mean, and that value is the median
(it does not have to be a part of the original data).

Mode is the most frequently occurring value in the data and is used instead of the mean when the data is
nominal, or when describing the shape of the data.
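All three measures are available in Python's standard `statistics` module; the data set below is made up, with an even count so the two-middle-values rule for the median applies:

```python
import statistics

# Hypothetical data set; n is even, so the median averages the two middle values
data = [2, 3, 3, 5, 7, 8]

mean = statistics.mean(data)      # (2 + 3 + 3 + 5 + 7 + 8) / 6
median = statistics.median(data)  # middle values 3 and 5, so (3 + 5) / 2 = 4.0
mode = statistics.mode(data)      # 3 occurs most often

print(mean, median, mode)
```

Note that the median (4.0) is not one of the original data values, as the notes above point out.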

Geometric Mean is the mean for proportional-growth or rate-of-interest situations. It is preferred
in these situations because the geometric mean takes into account the compounding effect on the principal
amount in each time period.
The geometric mean should be used for past RoR, while the arithmetic mean should be used for best
future estimates of RoR.
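A sketch of the geometric mean for past rates of return, using invented yearly returns of +10%, -5%, and +20% (the mean is taken over the growth factors, not the raw percentages):

```python
import statistics

# Hypothetical yearly growth factors: +10%, -5%, +20%
growth_factors = [1.10, 0.95, 1.20]

gm = statistics.geometric_mean(growth_factors)
avg_rate = gm - 1   # average compounded rate of return per year

# Check: compounding at avg_rate each year reproduces the total growth
total = 1.10 * 0.95 * 1.20
print(avg_rate, gm ** 3, total)
```

This is exactly why the geometric mean is preferred here: raising it to the number of periods recovers the true overall growth, which the arithmetic mean would overstate.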

The mean, median, and mode cannot fully describe a distribution. Variability measures such as range,
variance, and standard deviation provide additional insights into how spread-out data is around the
mean. The higher the variance the more spread out the data is from the mean.

The Range is the simplest measure of variability to calculate and understand. It’s simply the difference
between the maximum and minimum values in the data. However, this also highlights the limitation of
the range, as it only considers the extreme values and ignores the rest of the data.

Deviations from the mean are the differences between the actual values and the mean, or average, value.
They are important for figuring out how much the actual values differ from the average. The sum of the
deviations always adds up to zero. The larger the deviations, the larger the variance of the data set.

Variance is a more accurate measure of variation than the range or the average deviation, as it takes into
account every single data point in a set and, unlike the sum of deviations, is not always zero, making it
useful for comparing different data sets.
The standard deviation is the square root of the variance; the larger the standard deviation, the more the
numbers in the data are spread apart.
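The range, deviations, variance, and standard deviation can all be seen on one small made-up sample:

```python
import statistics

data = [4, 8, 6, 5, 3, 10]   # hypothetical sample
mean = statistics.mean(data)              # 36 / 6 = 6

data_range = max(data) - min(data)        # 10 - 3 = 7; uses only the extremes
deviations = [x - mean for x in data]     # these always sum to zero
variance = statistics.variance(data)      # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)          # square root of the variance

print(data_range, sum(deviations), variance, std_dev)
```

The zero sum of the deviations shows why variance squares them first: without squaring, the positive and negative differences would always cancel.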

The empirical rule talks about one kind of frequency distribution: a smooth, bell-shaped
curve. It tells how much of the data lies close to the population mean, and states preset values for the
number of standard deviations within the mean: 68% of data falls within one standard deviation of the
mean, 95% within two standard deviations, and 99.7% within three standard deviations. Nearly all data
therefore falls within three standard deviations of the mean.
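The 68/95/99.7 figures can be checked against a standard bell curve using Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

# Proportion of a bell-shaped distribution within k standard deviations of the mean
nd = NormalDist(mu=0, sigma=1)
for k in (1, 2, 3):
    within = nd.cdf(k) - nd.cdf(-k)
    print(f"within {k} sd: {within:.1%}")
```

The exact values are about 68.3%, 95.4%, and 99.7%; the rule rounds them for easy recall.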

Chebyshev’s rule covers all types of distributions and determines what percentage of observations lie
within a given distance of the average or mean.

K = number of standard deviations (k > 1).

This rule states that AT LEAST 1 - 1/k² of observations fall within k standard deviations of the mean.
(For percentages, multiply the formula by 100.)
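The Chebyshev bound for a few values of k, as a quick sketch:

```python
# Chebyshev's rule: at least 1 - 1/k**2 of observations lie within
# k standard deviations of the mean, for ANY distribution (k > 1)
for k in (2, 3, 4):
    bound = 1 - 1 / k**2
    print(f"k={k}: at least {bound * 100:.1f}% within {k} sd")
```

Note how much weaker these guarantees are than the empirical rule's 95% and 99.7%: Chebyshev must hold for every possible distribution, not just bell-shaped ones.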
