Unit 1

STATISTICS: AN OVERVIEW

“Facts are stubborn things, but statistics are pliable,”

Mark Twain

“Without big data, you are blind and deaf and in the middle of a freeway.”

– Geoffrey Moore.

Learning Outcome:

At the end of the lesson, you are expected to:


o Understand the basic facts about statistics
o Identify and classify the kinds of data

Pretest

Please answer the following items:

Directions: Each word belongs to a particular kind of data. You are to identify the following words
as to the particular kind of data where they belong. Write N for Nominal data, O for Ordinal Data,
I for Interval Data and R for Ratio Data. Write your answer on the space provided before each
number:

_____ 1. Gender
_____ 2. Current rank in the class
_____ 3. Color of the shirt
_____ 4. Birth Order
_____ 5. Percentage of profit over sales
_____ 6. Age bracket of teenagers
_____ 7. Grade range for “With honors”
_____ 8. Return on Investments
_____ 9. Interest Rate
_____10. World ranking for universities

Select the best answer from the options given in each of the statements below:
1. Which of the following is an example of a categorical variable?
a. Flavour of soft drink ordered by each customer in a fast-food restaurant
b. Height measured in inches for each student in a class
c. Points scored by each player in a team
2. Numerical and pictorial information about variables are called
a. Analytical statistics
b. Inferential statistics
c. Descriptive statistics
3. The entire group of interest for a statistical conclusion is called the
a. Data
b. Population
c. sample
4. A subgroup that is representative of a population is called
a. Category
b. Data
c. sample
5. Statistical inference is:
a. The process of estimates and conclusions carefully based on data from a sample
b. The process of estimates and conclusions carefully based on data from the entire
population
c. Pictorial displays that summarize a data
6. Two types of statistical variables are:
a. Categorical and descriptive
b. Categorical and numerical
c. Descriptive and numerical
7. It is a mathematical science that deals with data collection, organization, analysis and
interpretation.
a. Mathematics
b. Infographics
c. Statistics
8. _____________ is a set of raw numbers and/or words that are collected through
observations and/or just descriptions of things.
a. Information
b. Data
c. collections
9. An estimate of the characteristics of a population is called:
a. Data
b. Sample
c. population
10. Statistic is to a sample, as __________ is to population.
a. Representation
b. Parameter
c. Mentimeter

Learning Content

As a prelude, please watch the following video clip on YouTube:

https://www.youtube.com/watch?v=1hF0x7WsVOI

Let’s take a short review:

Statistics is a mathematical science that includes methods of collecting, organizing and
analysing data in such a way that meaningful conclusions can be drawn from them. Descriptive
statistics deals with the processing of data without attempting to draw any inferences from it. The
data are presented in the form of tables and charts. Descriptive data come from, but are not
limited to, educational qualifications, religion, prices of goods, business incomes, epidemics,
sports data, population data and the like.

Data is a set of raw numbers and/or words that are collected through observations and/or
just descriptions of things. In a more technical sense, data is a set of values of qualitative or
quantitative variables about one or more persons or objects, while a datum is a single value of a
single variable. Data, basically, are unprocessed information.

A data set is a collection of elements, each of which is characterized by one or more variables.

Data types are an important concept in statistics that needs to be understood in order to
correctly apply statistical measurements to your data and therefore to draw correct conclusions
about it. This unit introduces the different data types you need to know to do proper exploratory
data analysis (EDA), which is one of the most underestimated parts of a machine learning project.

Types of Data
Basically, there are four kinds of data:
o Nominal
o Ordinal
o Interval
o Ratio

Figure 1. Types of Data: Categorical (Nominal, Ordinal) and Numeric (Interval, Ratio)

Nominal data is a kind of data that is taken from nominative variables like gender, degree
completed, color of the skin, etc. Take gender, for example. When gender is used as a variable,
the data gathered will be male, female, lesbian, gay, bisexual, transgender, queer, questioning,
intersex and others, all of which are pure descriptions of gender. No numbers or order are
assigned to them. Another example is degree completed, wherein the data gathered could be
Bachelor of Secondary Education, Bachelor in Elementary Education, Bachelor of
Science in Engineering (Electrical, Electronics and Communications, Civil, Mechanical, etc.).
These are pure nomenclatures of degree programs completed by the respondent.
Nominal data can be analyzed using the grouping method. The variables can be grouped
together into categories, and for each category, the frequency or percentage can be calculated.
The data can also be presented visually, such as by using a pie chart.
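
The grouping method described above can be sketched in a few lines of Python. This is only an illustration: the list of responses is hypothetical, and the abbreviations (BSEd, BEEd) are stand-ins for the degree programs named in the text.

```python
from collections import Counter

# Hypothetical responses for the nominal variable "degree completed".
responses = ["BSEd", "BEEd", "BSEd", "BS Engineering",
             "BSEd", "BEEd", "BS Engineering", "BSEd"]

# Group identical values together and count how often each one occurs.
counts = Counter(responses)
total = sum(counts.values())

# Percentage share of each category, rounded to two decimal places.
percentages = {category: round(100 * n / total, 2)
               for category, n in counts.items()}

for category, n in counts.most_common():
    print(f"{category:<15} {n:>3} {percentages[category]:>7.2f}%")
```

Notice that the only operations used are counting and taking shares of the total; the categories themselves are never added or ranked.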

Although nominal data cannot be treated using mathematical operators, they still can be
analyzed using advanced statistical methods. For example, one way to analyze the data is
through hypothesis testing.

For nominal data, hypothesis testing can be carried out using nonparametric tests such
as the chi-square test. The chi-square test aims to determine whether there is a significant
difference between the expected frequency and the observed frequency of the given values.
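
As a rough sketch of how the chi-square statistic is obtained, the example below compares observed and expected frequencies by hand. The shirt-colour counts are hypothetical, and the example assumes a null hypothesis in which every colour is equally likely.

```python
# Hypothetical observed counts of shirt colours chosen by 60 respondents.
observed = {"red": 25, "blue": 20, "green": 15}

# Assumption: under the null hypothesis, every colour is equally likely.
n = sum(observed.values())
expected = {colour: n / len(observed) for colour in observed}

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((observed[c] - expected[c]) ** 2 / expected[c]
                 for c in observed)
degrees_of_freedom = len(observed) - 1

# 5.991 is the tabled chi-square critical value for 2 degrees of
# freedom at the 0.05 level of significance.
critical_value = 5.991
reject_null = chi_square > critical_value

print(chi_square, degrees_of_freedom, reject_null)
```

Here the statistic is smaller than the critical value, so the difference between the observed and expected frequencies would not be considered significant at the 0.05 level.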

Ordinal data is a kind of data that is expressed in order or rank. For example, ordinal data
is said to have been collected when a respondent rates his/her level of financial happiness on a
scale of 1-10. In ordinal data, there is no standard scale on which the difference between scores
is measured.

Considering the example highlighted above, let us assume that 50 people earning
between $1,000 and $10,000 monthly were asked to rate their level of financial happiness. An
undergraduate earning $2,000 monthly may give a rating of 8/10, while a father of three earning
$5,000 rates 3/10. This shows that the rating is usually influenced by personal factors rather
than by a set rule.

Another example is ranking in competitions. Ordinal data are built upon nominal scales
by assigning numbers to objects to reflect a rank or ordering on an attribute (formplusblog, 25th
June, 2020). An example is the extent of customers’ satisfaction with a restaurant’s service,
which may range from very satisfied, through satisfied, to dissatisfied. These descriptions
may be assigned numbers that denote order: 3 for customers who are “very satisfied”,
2 for “satisfied” customers and 1 for “dissatisfied” customers.
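
The coding of satisfaction levels can be tried out with a short sketch like the one below. The rank codes follow the text (3, 2, 1), while the list of responses is hypothetical.

```python
from statistics import median

# Rank codes from the text: 3 = very satisfied, 2 = satisfied, 1 = dissatisfied.
rank = {"dissatisfied": 1, "satisfied": 2, "very satisfied": 3}

# Hypothetical responses from five customers.
responses = ["satisfied", "very satisfied", "dissatisfied",
             "satisfied", "very satisfied"]

# Replace each label with its rank code.
codes = [rank[r] for r in responses]

# Because the codes carry order, the responses can be sorted and a
# middle (median) response can be identified.
ordered = sorted(responses, key=rank.get)
median_code = median(codes)
label_of = {code: label for label, code in rank.items()}
median_label = label_of[median_code]

print(ordered)
print(median_code, median_label)
```

The codes only tell you which response is higher or lower. The gap between 1 and 2 is not necessarily the same as the gap between 2 and 3, which is why the median is used here rather than the mean.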

Interval data is a kind of data that is expressed on a scale. Each point on the scale is
placed at an equal distance from the next. Interval data is one of the two types of numerical
data and is regarded as an extension of ordinal data.

Class Size. The term refers to the difference between the upper limit and the lower limit
of a class, whether the limits of adjacent classes are overlapping or non-overlapping. For
example, take the class size of the overlapping interval 10 – 20.

Lower limit = 10
Upper limit = 20

Class size = upper limit – lower limit
           = 20 – 10
           = 10

Class Interval. This refers to the numerical width of any class in a particular distribution.
It is defined as the difference between the upper-class limit and the lower-class limit. In
statistics, the data are arranged into different classes and the width of such class is called the
class interval.

Take the following sample data on weights of people on a diet plan. 52, 75, 92, 101, 83,
68, 133, 78, 104, 61, 39, 46, 135, 87, 131, 99, 104, 86, 67, 116, 89, 57, 87, 98, 131, 116, 135,
93.

1. With a class interval of 14, determine how many classes you get.
2. Present the weights in a frequency distribution table.

Step 1. Arrange the data either in ascending or descending order.

39, 46, 52, 57, 61, 67, 68, 75, 78, 83, 86, 87, 87, 89, 92, 93, 98, 99, 101, 104, 104, 116, 116,
131, 131, 133, 135, 135

Class Interval = 14

No.     Class Interval (14)   No. of Scores / Interval   Percent
1       39 – 53               3                          10.71
2       54 – 67               3                          10.71
3       68 – 81               3                          10.71
4       82 – 95               7                          25.00
5       96 – 110              5                          17.86
6       111 – 124             2                          7.14
7       125 – 139             5                          17.86
        Total                 28                         100.00
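
If you would like to check the tally yourself, here is a minimal sketch that counts the weights against the class limits listed in the table. The class limits are copied from the table as given (both limits inclusive); the variable names are only illustrative.

```python
# Weights of people on a diet plan, as listed in the text.
weights = [52, 75, 92, 101, 83, 68, 133, 78, 104, 61, 39, 46, 135, 87,
           131, 99, 104, 86, 67, 116, 89, 57, 87, 98, 131, 116, 135, 93]

# Class limits copied from the table (lower and upper limits inclusive).
classes = [(39, 53), (54, 67), (68, 81), (82, 95),
           (96, 110), (111, 124), (125, 139)]

# Count how many weights fall inside each class.
frequencies = [sum(low <= w <= high for w in weights)
               for low, high in classes]

# Percent of the total number of scores in each class.
total = len(weights)
percents = [round(100 * f / total, 2) for f in frequencies]

for (low, high), f, p in zip(classes, frequencies, percents):
    print(f"{low:>3} - {high:<3} {f:>3} {p:>7.2f}")
print(f"Total     {sum(frequencies):>3}")
```

Every weight falls into exactly one class, so the frequencies add up to the 28 scores in the data set.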

Range. The term is used to describe the difference between the lowest and the highest
values. Take the following data set as an example: (4, 6, 9, 3, 7). The lowest value is 3 and the
highest value is 9 so the range is computed as: 9 – 3 = 6. So simple!
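
The same computation can be written as a small helper. The function name `value_range` is made up for this illustration.

```python
def value_range(values):
    """Return the difference between the highest and the lowest value."""
    return max(values) - min(values)

data = [4, 6, 9, 3, 7]
print(value_range(data))  # 9 - 3 = 6
```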

Real Limit. The term refers to the boundaries that separate each interval. The real limit
separating two adjacent scores is located exactly halfway between the scores. Each score has
two real limits – one at the top of its interval called the upper real limit and the one at the bottom
of its interval called the lower real limit.
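
A minimal sketch of the real limits, assuming the scores are recorded to the nearest whole unit so that the halfway point is 0.5 below the lower limit and 0.5 above the upper limit. The helper name `real_limits` and its `unit` parameter are hypothetical.

```python
def real_limits(lower, upper, unit=1):
    """Return (lower real limit, upper real limit) of a class interval.

    Assumes scores are recorded to the nearest `unit`, so each real
    limit lies half a unit beyond the stated class limit.
    """
    half = unit / 2
    return lower - half, upper + half

first = real_limits(39, 53)
second = real_limits(54, 67)
print(first)   # (38.5, 53.5)
print(second)  # (53.5, 67.5)
```

The upper real limit of one class is the lower real limit of the next, which is exactly the halfway point between the two adjacent scores 53 and 54.
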
Cumulative Frequency. This is the sum of the frequency of a class and the frequencies of all
the classes before it in a frequency distribution. It is the running total of frequencies. For example:

Class Interval (14)   No. of Scores / Interval   Cumulative Frequency
39 – 53               3                          3
54 – 67               3                          6
68 – 81               3                          9
82 – 95               7                          16
96 – 110              5                          21
111 – 124             2                          23
125 – 139             5                          28
Total                 28
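
The running total can be sketched with `itertools.accumulate` from the Python standard library, using the frequencies from the table.

```python
from itertools import accumulate

# Number of scores per class interval, from the table.
frequencies = [3, 3, 3, 7, 5, 2, 5]

# accumulate() yields the running total after each class.
cumulative_frequency = list(accumulate(frequencies))
print(cumulative_frequency)  # [3, 6, 9, 16, 21, 23, 28]
```

The last cumulative frequency is always equal to the total number of scores.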

Cumulative Percent. This refers to the sum of the percentage of a class and the percentages of
all the classes before it in a percentage distribution. It is the running total of all percentages.
For example:

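Class Interval (14)   No. of Scores / Interval   Percent   Cumulative Percent
39 – 53               3                          10.71     10.71
54 – 67               3                          10.71     21.43
68 – 81               3                          10.71     32.14
82 – 95               7                          25.00     57.14
96 – 110              5                          17.86     75.00
111 – 124             2                          7.14      82.14
125 – 139             5                          17.86     100.00
Total                 28                         100.00

Each cumulative percent is computed from the cumulative frequency (for example, 6/28 = 21.43),
so it may differ by 0.01 from the sum of the rounded percents.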
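
A short sketch of the cumulative percent, under the assumption that each value is derived from the cumulative frequency and rounded only at the end. Adding up percentages that were already rounded can drift by 0.01, so the sketch avoids that.

```python
from itertools import accumulate

# Number of scores per class interval, from the table.
frequencies = [3, 3, 3, 7, 5, 2, 5]
total = sum(frequencies)

# Running total of frequencies, then each running total as a percent
# of all scores, rounded to two decimal places at the very end.
cumulative_frequency = list(accumulate(frequencies))
cumulative_percent = [round(100 * cf / total, 2)
                      for cf in cumulative_frequency]

print(cumulative_percent)
```

The last entry is 100.0 because the final class takes in every remaining score.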

Ratio data is the second of the two types of numerical data. It is an extension of interval data
and is also the peak of the measurement variable types. The only difference between ratio data and
interval data is that ratio data has a true zero point. For example, temperature measured in
Kelvin is a ratio variable, because 0 K is an absolute zero, while temperature in degrees Celsius
is only interval data, since 0 °C is an arbitrary point on the scale. Also, unlike interval data,
multiplication and division operations can be meaningfully performed on the values of ratio data.
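
To see why the zero point matters, consider the sketch below. The two temperatures are hypothetical readings, and the conversion uses the standard offset of 273.15 between degrees Celsius and Kelvin.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to Kelvin (standard offset of 273.15)."""
    return celsius + 273.15

# Two hypothetical temperature readings in degrees Celsius.
c1, c2 = 10.0, 20.0
k1, k2 = celsius_to_kelvin(c1), celsius_to_kelvin(c2)

# Differences agree on both scales (interval property).
difference_celsius = c2 - c1
difference_kelvin = k2 - k1

# Ratios do not agree: only the Kelvin ratio is meaningful, because
# only the Kelvin scale starts at a true zero.
ratio_celsius = c2 / c1
ratio_kelvin = k2 / k1

print(difference_celsius, round(difference_kelvin, 2))
print(ratio_celsius, round(ratio_kelvin, 3))
```

On the Celsius scale 20 looks like "twice" 10, but in Kelvin the second reading is only about 3.5% higher than the first. Saying that one temperature is twice another only makes sense on the ratio scale.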

Commonly Used Statistical Terms

Parameter. In statistics, a parameter is a value that tells something about a whole population,
in contrast to a statistic, which tells you something about a small part of the population. A
parameter never changes, because everyone (or everything) was surveyed to find the parameter.
For example, you might be interested in the average age of everyone in your class. Maybe you
asked everyone and found the average age was 25. That’s a parameter, because you asked
everyone in the class. Now let’s say you wanted to know the average age of everyone in your
grade or year. If you use the information from your class to take a guess at that average age, then
that information becomes a statistic. That’s because you can’t be sure your guess is correct
(although it will probably be close!).

Statistic. A statistic is a characteristic of a sample. Generally, a statistic is used to
estimate the value of a population parameter. It is a value that tells something about a small
part of the population, which is most often referred to as the sample. A parameter is exactly
the other way around: it describes the whole population.
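
The difference between a parameter and a statistic can be illustrated with a small sketch. The list of ages is hypothetical, and the random seed is fixed only so that the same sample is drawn on every run.

```python
import random
from statistics import mean

# Hypothetical ages of everyone in a class of ten (the whole population).
population_ages = [24, 25, 26, 25, 27, 23, 25, 24, 26, 25]

# Parameter: computed from every member of the population.
parameter = mean(population_ages)

# Statistic: computed from a sample of four members only.
random.seed(1)  # fixed seed so the sample is reproducible
sample_ages = random.sample(population_ages, k=4)
statistic = mean(sample_ages)

print("Parameter (population mean):", parameter)
print("Statistic (sample mean):", statistic)
```

The parameter stays at 25 no matter how many times you run the code, while the statistic depends on which four people happen to be in the sample.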

Discrete Variables. These are variables that can be counted in a finite amount of time. For
example, you can count the change in your pocket. You can count the money in your bank account.
You could also count the amount of money in everyone’s bank accounts. It might take you a long
time to count that last item, but the point is that it is still countable.

Continuous Variables. They are variables that would (literally) take forever to count. In
fact, you would get to “forever” and never finish counting them. For example, take age. You can’t
count “age”. Why not? Because it would literally take forever. For example, you could be: 25
years, 10 months, 2 days, 5 hours, 4 seconds, 4 milliseconds, 8 nanoseconds, 99 picoseconds…and
so on.

Scales of Measurement. Scales of measurement refer to the ways in which
variables/numbers are defined and categorized. Each scale of measurement has certain
properties, which in turn determine the appropriateness of certain statistical analyses.
The four scales of measurement are nominal, ordinal, interval, and ratio.

Cross-Sectional Data. In statistics and econometrics, cross-sectional data is a type
of data collected by observing many subjects (such as individuals, firms, countries, or regions) at
one point or period of time. The analysis might also have no regard to differences in
time. Analysis of cross-sectional data usually consists of comparing the differences among
selected subjects.

Time Series Data. These are data that are collected over several time periods.

Census. A census is a survey conducted on the full set of observation objects belonging
to a given population or universe. It is the complete enumeration of a population or group
at a point in time with respect to well-defined characteristics, for example, population,
production, or traffic on particular roads.

Survey. A survey is an investigation about the characteristics of a given population by
means of collecting data from a sample of that population and estimating their characteristics
through the systematic use of statistical methodology.

Now that you have read the content of the unit, you may perform the learning
activities. If your problem is internet connection, feel free to contact me on
the mobile phone number included in this learning package.

Learning Activity

Let’s have fun:

I. Encircle the letter of your best choice.


1. How does ordinal data differ from nominal data?
a. Nominal data is a name, while ordinal data is a number.
b. Nominal data only distinguishes, ordinal data also offers magnitude information.
c. Nominal data can be a name or number, while ordinal data can only be number.
d. Nominal data can only be a name, while ordinal data can be name or number.
2. “The sequential list according to which the batsmen in a cricket team would come out to bat”
– Which of the following data types does this data set belong to?
a. Nominal
b. Ordinal
c. Ratio
d. interval
3. A group of 10 people were shown 15 photographs. Each person was asked to choose
their favourite photo, and the choices were recorded. What is the data type of the recorded
data?
a. Nominal
b. Ordinal
c. Ratio
d. interval
4. What is the type of data scale marked on a measuring tape?
a. Integer
b. Ratio
c. Nominal
d. discrete
5. A researcher doing a blind experiment got the respondent data coded with numbers in a
column, “respondent_ID”. What data type is it?
a. Ordinal
b. Continuous
c. Interval
d. Nominal
6. What is the data type for the Singapore-average-rainfall-data in mm?
7. Discrete data is from qualities that can be
a. Measured
b. Counted
c. both
8. A collection of facts such as test scores, drawing, photographs, and inventory figures is
called:
a. Quantity
b. Product
c. Data
d. Collector’s item
e. Merchandise
9. Safety data sheets are changing to conform with the Globally Harmonized System:
a. True
b. false
10. Why is it useful to look at frequency data?
a. It can be quicker/easier to do certain post-processing functions in the frequency
domain
b. Time series data is complicated because it is unclear when certain events occur
c. Frequency data shows us the power of events so we can write music about it
11. A researcher should explore the characteristics of the data and the examined variables to
summarize the data once data is clean and ready for investigation:
a. True
b. false
12. One of the important considerations in preliminary analysis is to look for patterns in the
data and to check if any specific variable looks extremely erratic.
a. True
b. false
13. Blunders are errors made in transferring the manual data onto software for analysis during
data entry or coding
a. True
b. false
14. Different data types command the use of similar analysis techniques, whereby statistical
methods for analyzing categorical data can also be used for continuous data.
a. True
b. false
15. Parametric statistical techniques underscore stringent assumptions regarding the
distribution of data for the population under study:
a. True
b. false

II. Give example data out of the following nominal variables:


Mobile Phone Brand: _____________, ______________, ______________
Skin Colour: _____________, ______________, ______________
Food Taste: _____________, ______________, ______________
Descriptions of Quality: _____________, ______________, ______________
Level of Preference: _____________, ______________, ______________
Attitude Towards Accounting: _____________, ______________, ______________

Evaluation
Convert the following ordinal data into interval and ratio data:
Item No.   Responses
           5     4     3     2     1
1          16    29    32    6     17
2          18    27    30    8     17
3          14    31    30    20    5
4          16    20    30    17    17
5          10    25    35    15    15
6          15    25    27    27    6
7          12    32    20    18    18
8          15    27    23    28    7
9          8     35    32    20    5
10         10    13    23    26    28
Basis for Description:
5 Always
4 Often
3 Sometimes
2 Rarely
1 Never

Compute and display descriptive statistics of the following data sets:


Order Category      Weight   Frequency
Excellent           5        16
Very Satisfactory   4        29
Satisfactory        3        32
Fair                2        6
Poor                1        17
Total                        100
