
University of the East

Mathematics in the Modern World

Data Management
Student’s Handouts

Introduction to Statistics

Definition 1: Statistics is a branch of science pertaining to the methods of collecting,
organizing, presenting, analyzing and interpreting data, and then drawing conclusions based on the
data.

Definition 2: Descriptive Statistics tries to summarize or describe a collection of data. It is a set
of methods to describe data that we have collected.

Some of the most commonly used statistical treatments are percentages, measures of central
tendency (such as the mean, median and mode), measures of variation (such as range, average
deviation, standard deviation, variance and coefficient of variation) and measures of skewness and
kurtosis.

Definition 3: Inferential Statistics is used to draw conclusions and make predictions based on the
analysis of numeric data. It is a set of methods used to make a generalization, estimate, prediction
or decision.

A pair consisting of one measure of central tendency and one measure of variation can be used to
draw a conclusion. A commonly used pair is the mean and the standard deviation.

Exercise 4: Classify whether each statement belongs to the area of Descriptive Statistics or
Inferential Statistics.

1. Ninety-two percent of the class are between 16 and 18 years old.
2. Ninety-five percent of the class may pass Basic Statistics.
3. According to the local survey, the top three popular courses are: Psychology (23%),
Hospitality (19%) and Computer-related courses (10%).
4. The normal blood sugar level of a human is 70 mg/dL to 120 mg/dL.
5. Drinking pineapple juice may boost our immune system.

In the study of statistics, two terms are commonly used: population and sample.

Definition 5: Population is defined as the complete or entire collection of elements (persons or
things) to be studied, while a sample refers to the representative part or finite number of elements
chosen from the population.

In relation to population and sample, the next step is to differentiate a parameter from a statistic.

Definition 6: A parameter is a numerical value calculated from a population. A statistic is a number
that describes a set of observations in a sample.
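
The distinction in Definition 6 can be illustrated with a short Python sketch. The population below (100 randomly generated ages) is hypothetical and is not taken from this handout; the point is only that the parameter uses every element while the statistic uses a sample.

```python
import random
import statistics

# Hypothetical population: the ages of all 100 members of a club.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(16, 30) for _ in range(100)]

# Parameter: a value computed from every element of the population.
population_mean = statistics.mean(population)

# Statistic: a value computed from a sample of 20 elements of that population.
sample = rng.sample(population, 20)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean):     {sample_mean:.2f}")
```

The two printed values are usually close but not identical, which is why a statistic is treated as an estimate of the parameter.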

Definition 7: Variables are characteristics or values that vary across individuals. They can be
qualitative or quantitative.

Definition 8: Qualitative Variables, also known as categorical variables, are used to represent
character, class or kind, but not amount. Some examples of qualitative variables are gender,
religion, nationality, favorite color and birthplace.

Definition 9: Quantitative Variables are variables that can be measured on a numeric or
quantitative scale. They can be classified as discrete or continuous.

Definition 10: A quantitative variable is discrete if it takes countable values, such as natural or
counting numbers. Some examples of discrete variables are the number of students enrolled in STA111,
the number of iPad units in a store and the number of buildings in Metro Manila.

Definition 11: A quantitative variable is continuous if it can take any value within a range,
including decimals or fractions. Some examples of continuous variables are height, weight, length,
width and speed of a bullet.

Definition 12: Levels of measurement are used to determine the statistical tool that can be used
to describe a set of data. There are four levels of measurement: Nominal, Ordinal, Interval and
Ratio.

Definition 13: The first level is called the Nominal level. In this level, names are assigned to
objects for the purpose of identifying them or placing them in a group or category. The data cannot
be arranged in an ordering system. Examples of data under this level are religion, nationality or
race, gender, birthplace and course.

Definition 14: The second level is the Ordinal level. In this level, words or numbers are
assigned to objects to represent the rank or order between them. It implies ranking, order or
inequalities. Examples are class rank, contest winners, degree of burn and cancer stages.

Definition 15: Interval level is the third level of measurement. It refers to quantitative
measurements used to identify and rank. In this scale, differences between two items can be
determined, but operations such as multiplication and division are not meaningful because interval
scales do not have a true zero point. An example of interval data is temperature in degrees Celsius
or Fahrenheit.

Definition 16: Lastly, the fourth level of measurement is the Ratio level. It is similar to the
interval scale, but the ratio scale has a true zero point, so operations such as multiplication and
division are meaningful. Examples of data under ratio are income, age, height, weight, area and
volume.
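
The difference between the interval and ratio levels can be checked numerically. The sketch below is only an illustration with made-up values: it compares 20 and 10 degrees Celsius (interval scale) and 20 and 10 centimeters (ratio scale) after a change of units.

```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

CM_PER_INCH = 2.54

# Interval scale: is 20 degrees Celsius "twice as hot" as 10 degrees Celsius?
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: is 20 cm twice as long as 10 cm?
ratio_cm = 20 / 10
ratio_inches = (20 / CM_PER_INCH) / (10 / CM_PER_INCH)

print(ratio_celsius, ratio_fahrenheit)  # the ratio changes with the unit
print(ratio_cm, ratio_inches)           # the ratio stays the same
```

Because the zero of the Celsius and Fahrenheit scales is arbitrary, the temperature ratio depends on the unit, so division is not meaningful there. Length has a true zero, so the ratio is the same in any unit.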

Exercise 17: State the level of measurement of each of the following.

1. Blood type
2. Doctor's salary
3. Latin honors
4. Temperature in Fahrenheit
5. Student number
6. Gender
7. Land area
8. Contest winners
9. Kid's height in cm
10. Athlete's age in years

Sampling and Sampling Techniques

Definition 18: Sampling is the process of choosing elements, such as persons, objects or groups,
from a known population of interest to be included in a study in order to generate a fair result.
Sampling is done to reduce cost, since it is less expensive to conduct a survey on a sample than on
the whole population. Another advantage of using a sample instead of a population is that data
can be obtained faster. Also, greater scope and accuracy are expected since the volume of work in
encoding and computing is reduced.

There are two types of sampling techniques: probability sampling and non-probability sampling.

Definition 19: Probability sampling or random sampling gives all members of the population a
known and equal chance of being part of the sample. In other words, the selection of one individual
does not affect the chance of anyone else in the population being selected.

Definition 20: Simple random sampling is also called the lottery or the fishbowl method. It uses a
scientific calculator, a computer program or a table of random numbers to generate the random
numbers that identify the elements to include in the sample.
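
As an illustration of the computer-program option in Definition 20, here is a minimal Python sketch. The list of student numbers is hypothetical, and `simple_random_sample` is a helper name introduced only for this example.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Draw sample_size distinct elements, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical sampling frame: students numbered 1 to 50.
student_numbers = list(range(1, 51))
chosen_students = simple_random_sample(student_numbers, 5, seed=1)
print(chosen_students)
```

`random.sample` draws without replacement, so no student can be selected twice, which mirrors drawing names from a fishbowl.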

Definition 21: In Systematic Skip Sampling, elements are listed numerically and then every kth
element on the list is selected, beginning from a randomly selected starting point.
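
The procedure in Definition 21 can be sketched in Python as follows. The handout does not say how k is chosen; this sketch assumes the common choice k = N / n (population size divided by sample size, rounded down), and the numbered list is hypothetical.

```python
import random

def systematic_sample(elements, sample_size, seed=None):
    """Select every kth element after a random start within the first k."""
    k = len(elements) // sample_size  # assumed skip interval: k = N // n
    if k < 1:
        raise ValueError("sample_size cannot exceed the number of elements")
    start = random.Random(seed).randrange(k)  # random start among the first k
    return [elements[start + i * k] for i in range(sample_size)]

# Hypothetical list of 50 elements numbered 1 to 50; a sample of 5 gives k = 10.
numbered_list = list(range(1, 51))
systematic = systematic_sample(numbered_list, 5, seed=3)
print(systematic)
```

Every selected element is exactly k positions after the previous one, so only the starting point is random.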

Definition 22: Stratified Random Sampling is a method where the population is divided into
sub-groups (called strata) based on some well-known characteristics of the population, such as age,
gender or socio-economic status. The selection of elements is then made separately within each
stratum, usually by random or systematic sampling methods.

Remark 23: In stratified random sampling, the number of samples per stratum may be equal or
proportional.

Example 24: A study is conducted on 1,000 college students of the University of the East. Two
hundred students will be selected to be part of the study. How many samples are needed per year
level using equal allocation?

N = 1,000 (total population)

n = 200 (total sample needed)

k = 4 (four groups: First Year, Second Year, Third Year and Fourth Year)

$n_i = \frac{n}{k} = \frac{200}{4} = 50$

Each year level must be represented by 50 students.

Example 25: A study is conducted on 1,000 college students of the University of the East. Two
hundred students will be selected to be part of the study. The number of students per year level is
presented in the table below. How many samples are needed per year level using proportional
allocation?

Year Level      Population ($N_i$)
First Year      300
Second Year     250
Third Year      250
Fourth Year     200

Use the formula: $n_i = \frac{n \times N_i}{N}$, where $n_i$ is the number of samples per year level,
$N_i$ is the population of students per year level, $N$ is the total population of the college
students and $n$ is the total sample needed.

Year Level      Population ($N_i$)    $n_i$
First Year      300                   $n_{I} = \frac{200 \times 300}{1{,}000} = 60$
Second Year     250                   $n_{II} = \frac{200 \times 250}{1{,}000} = 50$
Third Year      250                   $n_{III} = \frac{200 \times 250}{1{,}000} = 50$
Fourth Year     200                   $n_{IV} = \frac{200 \times 200}{1{,}000} = 40$
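
The computations in Examples 24 and 25 can be reproduced with a short Python sketch. The function names are introduced here for illustration only. The handout does not state a rounding rule for cases where the result is not a whole number, so the sketch leaves the values unrounded.

```python
def equal_allocation(total_sample, number_of_strata):
    """Equal allocation: every stratum gets the same share of the sample."""
    return total_sample / number_of_strata

def proportional_allocation(total_sample, strata_sizes):
    """Proportional allocation: n_i = n * N_i / N for each stratum."""
    total_population = sum(strata_sizes.values())
    return {
        name: total_sample * size / total_population
        for name, size in strata_sizes.items()
    }

# Populations per year level, as given in Example 25.
year_levels = {
    "First Year": 300,
    "Second Year": 250,
    "Third Year": 250,
    "Fourth Year": 200,
}

equal_share = equal_allocation(200, len(year_levels))
proportional_share = proportional_allocation(200, year_levels)
print(equal_share)
print(proportional_share)
```

With these inputs the sketch gives 50 students per year level under equal allocation and 60, 50, 50 and 40 under proportional allocation, matching the two examples.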

Definition 26: Cluster Sampling is a method where the researcher divides the population into
groups, or clusters. Elements within a cluster are heterogeneous (dissimilar). Clusters are selected
at random, and all units in the selected clusters are used as the sample.
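
A minimal Python sketch of Definition 26 follows. The class sections and their members are hypothetical, and `cluster_sample` is a helper name introduced only for this illustration.

```python
import random

def cluster_sample(clusters, number_of_clusters, seed=None):
    """Pick whole clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    chosen_names = rng.sample(sorted(clusters), number_of_clusters)
    units = [unit for name in chosen_names for unit in clusters[name]]
    return chosen_names, units

# Hypothetical population already divided into four clusters (class sections).
sections = {
    "Section A": ["A1", "A2", "A3"],
    "Section B": ["B1", "B2", "B3"],
    "Section C": ["C1", "C2", "C3"],
    "Section D": ["D1", "D2", "D3"],
}

chosen_sections, cluster_units = cluster_sample(sections, 2, seed=7)
print(chosen_sections)
print(cluster_units)
```

Note the contrast with stratified sampling: here the random choice is made among clusters, and no further selection happens inside a chosen cluster.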

Definition 27: Unlike probability sampling, non-probability sampling does not give everyone
an equal chance of being selected to be part of the sample. Non-probability sampling procedures
are much less desirable, as they will almost certainly contain sampling biases.

Some of the methods under non-probability sampling are quota, convenience and purposive
sampling.

Descriptive Measures

Definition 28: Measures of Central Tendency are descriptive measures used to describe
the center of a set of data arranged numerically. Three different types of “average” will be
discussed: the mean, the median and the mode.

Definition 29: The most commonly used measure of central tendency is the mean. It is also
called the computed average. It is defined as the sum of the values divided by the total number of
items.

Definition 30: The median is the middle value in a set of data arranged in order. It is the value
that divides the distribution into two equal parts, with one half of the values lower than the
median and the other half higher than the median.

Definition 31: The third measure of central tendency is the mode. It is easily found by inspection.
It is the value in the distribution whose frequency is higher than that of any other value.

Definition 32: A distribution with only one mode is called unimodal, while if it has two modes,
it is called bimodal. If it has more than two modes, the distribution is called multimodal. The
mode does not exist in a distribution if no value is repeated.
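
The classification in Definition 32 can be sketched in Python. The four small data sets are hypothetical and the function names are introduced only for this example. The sketch assumes a non-empty list of values.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s), or [] when no value is repeated."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so the mode does not exist
    return sorted(v for v, count in counts.items() if count == highest)

def describe_modes(values):
    """Name the distribution according to its number of modes."""
    modes = find_modes(values)
    if not modes:
        return "no mode"
    if len(modes) == 1:
        return "unimodal"
    if len(modes) == 2:
        return "bimodal"
    return "multimodal"

# Hypothetical data sets, one for each case.
print(find_modes([3, 5, 5, 7]), describe_modes([3, 5, 5, 7]))
print(find_modes([1, 1, 2, 4, 4]), describe_modes([1, 1, 2, 4, 4]))
print(find_modes([1, 1, 2, 2, 3, 3]), describe_modes([1, 1, 2, 2, 3, 3]))
print(find_modes([2, 4, 6]), describe_modes([2, 4, 6]))
```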

Exercise 33: Determine the mean, median and mode of the given set of data.

1. 8, 10, 13, 13, 16
2. 2, 5, 3, 8, 5, 7, 2
3. 12, 10, 15, 14, 11, 18
4. 1, 9, 10, 2, 9, 4, 2, 1
5. 3, 6, 4, 4, 6, 3, 6, 3, 4

Remark 34: Best use of the mean, median and mode.

The mean is computed if the values are in interval or ratio scale. The mean is influenced
by outliers that may be at the extremes of the data set. The median is used for ordinal scale. Unlike
the mean, the median is not influenced by outliers at the extremes of the data set. The mode is
practical for nominal data, although in some data sets the mode may not exist or may not be very
meaningful.

Now, consider the given set of data:

Set A: 9, 12, 13, 15, 15, 17, 24

Set B: 7, 11, 15, 15, 17, 19, 21

Set C: 11, 11, 15, 15, 15, 18, 20

Using the measures of central tendency, the three sets appear to be the same: the mean, median and
mode of each set are all 15. But obviously, the sets of data are different. For example, the values
of Set A span a wider range than those of Sets B and C. These measures alone are not enough to
describe a given set of data; we need to use other descriptive measures to further describe a
distribution.
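
The observation about Sets A, B and C can be verified with Python's standard `statistics` module. This is only a checking sketch; the lowest and highest values are printed to show that the sets differ even though their centers agree.

```python
import statistics

data_sets = {
    "Set A": [9, 12, 13, 15, 15, 17, 24],
    "Set B": [7, 11, 15, 15, 17, 19, 21],
    "Set C": [11, 11, 15, 15, 15, 18, 20],
}

summary = {}
for name, values in data_sets.items():
    summary[name] = {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "lowest": min(values),
        "highest": max(values),
    }
    print(name, summary[name])
```

All three sets report a mean, median and mode of 15, while the lowest and highest values differ from set to set.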

Definition 35: Measures of Dispersion or Variability describe the spread or scatter of
the values around the mean.

Definition 36: The range is the difference between the highest and lowest value/observation.

Example 37: Using the data above, the range of Set A is $24 - 9 = 15$.

Definition 38: The average deviation measures the average distance of the values from the mean.
The formula is given by $AD = \frac{\sum |x - \bar{x}|}{n}$, where $\bar{x}$ is the mean, $x$ are the
values and $n$ is the number of values.
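
The formula in Definition 38 translates directly into Python. The sketch below applies it to Set B from the data above; the function name is introduced only for this illustration.

```python
def average_deviation(values):
    """AD = sum of |x - mean| divided by the number of values."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

set_b = [7, 11, 15, 15, 17, 19, 21]
ad_b = average_deviation(set_b)
print(round(ad_b, 4))
```

For Set B the absolute deviations from the mean of 15 are 8, 4, 0, 0, 2, 4 and 6, which sum to 24, so the average deviation is 24 / 7, or about 3.43.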

Exercise 39: Compute the average deviation of set A in the data above.

Definition 40: Variance measures how much variability there is in the entire distribution. The
standard deviation is the most commonly used measure of dispersion. It is the positive square
root of the variance. The formulas are as follows:

Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Standard Deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
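
As a worked illustration of the two formulas in Definition 40, the Python sketch below computes the variance and standard deviation of Set B from the data above. The function name is introduced only for this example, and the result is cross-checked against the standard `statistics` module, which uses the same n - 1 divisor.

```python
import math
import statistics

def sample_variance(values):
    """s^2 = sum of (x - mean)^2 divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

set_b = [7, 11, 15, 15, 17, 19, 21]
variance_b = sample_variance(set_b)
std_dev_b = math.sqrt(variance_b)  # positive square root of the variance

print(f"Variance: {variance_b:.4f}")
print(f"Standard deviation: {std_dev_b:.4f}")

# Cross-check with the standard library.
print(statistics.variance(set_b), statistics.stdev(set_b))
```

For Set B the squared deviations sum to 136, so the variance is 136 / 6, or about 22.67, and the standard deviation is about 4.76.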

Exercise 41: Using the above data, compute the variance and the standard deviation of set A.

Definition 42: In a symmetrical distribution, such as the normal distribution, the mean, median and
mode are equal and fall at the same point.

Definition 43: In a positively skewed distribution, the extreme scores are larger, thus the mean
is larger than the median.

Definition 44: In a negatively skewed distribution, the order of the measures of central tendency
is the opposite of that in a positively skewed distribution: the mean is smaller than the
median, which is smaller than the mode.

Definition 45: Skewness measures the degree of asymmetry of a distribution. One of the formulas
for skewness is given by
$Sk = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(\bar{x} - md)}{s}$.

Remark 46: When Sk = 0, the distribution is symmetrical (as in a normal distribution); when Sk > 0,
the distribution is positively skewed; and when Sk < 0, the distribution is negatively skewed.
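
The skewness formula and the rule in Remark 46 can be sketched in Python. Set A comes from the data above; the other two data sets are hypothetical and were chosen so that the mean lies above or below the median. The sketch assumes the data are not all identical, since a standard deviation of zero would make the formula undefined.

```python
import statistics

def pearson_skewness(values):
    """Sk = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    std_dev = statistics.stdev(values)  # sample standard deviation (n - 1)
    return 3 * (mean - median) / std_dev

def classify_skewness(sk):
    """Apply the sign rule: positive, negative or zero."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"

set_a = [9, 12, 13, 15, 15, 17, 24]       # mean 15, median 15
right_tail = [2, 3, 3, 4, 5, 6, 12]       # hypothetical: mean 5, median 4
left_tail = [1, 7, 8, 9, 9, 10, 12]       # hypothetical: mean 8, median 9

for data in (set_a, right_tail, left_tail):
    sk = pearson_skewness(data)
    print(round(sk, 3), classify_skewness(sk))
```

Note that this formula gives Sk = 0 whenever the mean equals the median, as it does for Set A.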

Hypothesis Testing

Definition 47: A statistical hypothesis is a conjecture concerning one or more populations whose
veracity can be established using sample data.

Definition 48: Parametric tests are applied to data that are normally distributed. Moreover, it is
assumed that the variables are measured at either the interval or the ratio level.

Definition 49: Nonparametric tests do not require a normal distribution, and the variables of
interest are at the nominal or ordinal level.

Table 50: Below is the summary of some of the different statistical tests.

https://www.google.com/url?sa=i&source=imgres&cd=&cad=rja&uact=8&ved=2ahUKEwi9vdqMq7biAhUXZt4KHQSSD2IQjRx6BAgBEAU&url=http%3A%2F%2Fmethods.sagepub.com%2Fbook%2Funderstanding-social-science-research%2Fn10.xml&psig=AOvVaw3r5_gFGEBctk29Qmey76r_&ust=1558861859448779

Correlation

Definition 51: Correlation measures the strength of the linear association between two
quantitative variables: the independent variable and the dependent variable. The independent
variables are variables that can be manipulated or controlled while dependent variables are those
that cannot be controlled.

Definition 52: The most commonly used technique to calculate the coefficient of correlation is by
using the Pearson Product Moment Correlation Coefficient. The formula is given by

$r = \frac{N \sum XY - \sum X \sum Y}{\sqrt{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]}}$

where $X$ = the observed data from the independent variable, $Y$ = the observed data from the
dependent variable, $N$ = sample size and $r$ = degree of relationship of $X$ and $Y$.
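
The Pearson formula can be written out term by term in Python. The data below (hours studied and quiz scores) are hypothetical and are not taken from this handout; `pearson_r` is a helper name introduced only for this sketch, and it assumes that neither variable is constant.

```python
import math

def pearson_r(xs, ys):
    """r = (N*SumXY - SumX*SumY) / sqrt([N*SumX2 - SumX^2][N*SumY2 - SumY^2])."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same number of values")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical paired observations.
hours_studied = [1, 2, 3, 4, 5]
quiz_scores = [2, 4, 5, 4, 5]

r = pearson_r(hours_studied, quiz_scores)
print(round(r, 4))
```

For these values the sums are 15, 20, 66, 55 and 86, so the numerator is 30 and the denominator is the square root of 1,500, giving r of about 0.77.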

Remark 53: The correlation coefficient ranges from -1 to +1. A value of -1.00 represents a perfect
negative correlation, while a value of +1.00 represents a perfect positive correlation. A value of
0.00 means that there is no linear relation between the variables.

Exercise 54: Determine the degree of relationship between the number of absences incurred by a
student and the student's final grade in a Statistics 111 class. The data were obtained from seven
randomly selected students of a Statistics 111 class.

Number of Absences (x)    Final Grade (y)
6                         82
2                         86
15                        43
9                         74
12                        58
5                         90
8                         78

References:
• Almeda, Josefina V., et al. (2010). Elementary Statistics. Diliman, Quezon City: University
of the Philippines Press.
• Aufmann, R., et al. (2018). Mathematical Excursions, Fourth Edition. USA: Cengage
Learning.
• Bluman, A. G. (2009). Elementary statistics: A step by step approach. New York:
McGraw-Hill Higher Education.
• Walpole, R. E., Myers, R. H., Myers, S. L., & Ye, K. (2012). Probability & statistics for
engineers & scientists (9th ed.). Boston: Prentice Hall.
• First Generation Training the Trainors (2016). Philippines: Ateneo De Manila University.
• Photo credits: Google Images
