Welcome to Statistics!!
Dr. Nguyen Thi Viet Ly
Email: ly.ntv@vgu.edu.vn
1 12.11.2025
OSTA | Winter2025 | Dr. Nguyen Thi Viet Ly
Chapter 1
Statistics, Data, and Statistical Thinking (Part 2)
Copyright © 2022 Pearson Education, Ltd. All Rights Reserved
Contents
1. The Science of Statistics
2. Types of Statistical Applications in Business & Engineering
3. Fundamental Elements of Statistics
4. Processes
5. Statistical Thinking in Business & Engineering
6. Types of Data
7. Collecting Data: Sampling and Related Issues
Where We’re Going
1. Identify the different types of data
2. Data-collection methods
3. Data understanding
4. Data cleaning
1.6 Types of Data
Types of Data: By Nature of Data
Quantitative data are measurements that are recorded on a naturally
occurring numerical scale.
Qualitative data are measurements that cannot be measured on a natural
numerical scale; they can only be classified into one of a group of
categories.
[Diagram: Types of Data by Nature, split into Quantitative Data and Qualitative Data]
Quantitative Data (Numerical)
Data that represents measurable quantities and can be analyzed
mathematically.
1. Discrete data: Countable data that take separate, distinct values, often whole numbers.
Examples: Number of students in a class (e.g., 25), number of cars in a parking lot.
2. Continuous data: Data that can take any value within a range, including decimals.
Examples: Height (e.g., 5.6 feet), Temperature (e.g., 37.5°C), Distance (e.g., 12.34 km).
Quantitative Data (Numerical)
Measured on a numerical scale.
1. The temperature (in degrees Celsius) at which each unit in a sample
of 20 pieces of heat-resistant plastic begins to melt
2. The current unemployment rate (measured as a percentage) for each
of the 50 states
3. The scores of a sample of 150 MBA applicants on the GMAT, a
standardized business graduate school entrance exam administered
nationwide
4. The number of female executives employed in each of a sample of 75
manufacturing companies
Qualitative Data (Categorical)
Data that describes characteristics or qualities and is non-numerical.
1. Nominal data: Data categorized without a specific order.
Examples: Gender (Male/Female), Colors (Red, Blue, Green), Product Categories
(Electronics, Furniture).
2. Ordinal data: Data categorized in a specific order, but the intervals between
categories are not meaningful.
Examples: Ratings (Good, Better, Best), Education Levels (High School,
Bachelor’s, Master’s).
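The nominal/ordinal distinction can be sketched in Python, assuming pandas is installed. The education levels come from the example above; the variable names and sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample values, for illustration only.
levels = ["Bachelor's", "High School", "Master's", "High School", "Bachelor's"]
colors = ["Red", "Blue", "Green", "Blue", "Red"]

# Ordinal: categories carry an order, so min/max and comparisons make sense.
education = pd.Categorical(
    levels,
    categories=["High School", "Bachelor's", "Master's"],
    ordered=True,
)

# Nominal: categories have no order; only counting is meaningful.
color = pd.Categorical(colors)

highest_level = education.max()
color_counts = pd.Series(color).value_counts()

print("Highest education level:", highest_level)
print(color_counts)
```

Asking for the maximum of the unordered color variable would raise an error, which matches the idea that nominal categories cannot be ranked.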
Qualitative Data (Categorical)
Classified into categories.
1. The political party affiliation (Democrat, Republican, or Independent)
in a sample of 50 CEOs
2. The defective status (defective or not) of each of 100 computer chips
manufactured by Intel
3. The size of a car (subcompact, compact, midsize, or full-size) rented by
each of a sample of 30 business travelers
4. A taste tester’s ranking (best, worst, etc.) of four brands of barbecue
sauce for a panel of 10 testers
Example
Chemical and manufacturing plants sometimes discharge toxic-waste
materials such as DDT into nearby rivers and streams. These toxins can
adversely affect the plants and animals inhabiting the river and the
riverbank. The U.S. Army Corps of Engineers conducted a study of fish in
the Tennessee River (in Alabama) and its three tributary creeks: Flint
Creek, Limestone Creek, and Spring Creek. A total of 144 fish were
captured, and the following variables were measured for each:
Example (cont)
1. River/creek where each fish was captured
2. Species (channel catfish, largemouth bass, or smallmouth buffalo fish)
3. Length (centimeters)
4. Weight (grams)
5. DDT concentration (parts per million)
These data are saved in the DDT file. Classify each of the five variables
measured as quantitative or qualitative.
Example (cont)
Solution
The variables length, weight, and DDT are quantitative because each is
measured on a numerical scale: length in centimeters, weight in grams,
and DDT in parts per million. In contrast, river/creek and species cannot
be measured quantitatively: They can only be classified into categories
(e.g., channel catfish, largemouth bass, and smallmouth buffalo fish for
species). Consequently, data on river/creek and species are qualitative.
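The same classification can be sketched in Python, assuming pandas is installed. The three rows and the column names below are hypothetical and are not taken from the actual DDT file.

```python
import pandas as pd

# Three hypothetical rows; these are NOT values from the actual DDT file.
fish = pd.DataFrame({
    "river": ["Flint Creek", "Limestone Creek", "Spring Creek"],
    "species": ["channel catfish", "largemouth bass", "smallmouth buffalo fish"],
    "length_cm": [42.5, 36.0, 48.0],
    "weight_g": [732, 795, 1433],
    "ddt_ppm": [10.0, 16.0, 2.6],
})

# Numeric columns are treated as quantitative; the rest as qualitative.
classification = {
    column: "quantitative" if pd.api.types.is_numeric_dtype(fish[column]) else "qualitative"
    for column in fish.columns
}

for column, kind in classification.items():
    print(f"{column}: {kind}")
```

The data type is only a first hint: a column of numeric codes (e.g., postal codes) is stored as numbers but is still qualitative, so the final decision depends on what the variable means.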
Types of Data: By Source of Data
[Diagram: Types of Data by Source, split into Primary Data and Secondary Data]
Types of Data: By Source of Data
1. Primary data: Data collected directly from the source for a specific purpose.
Examples: Surveys, interviews, experiments.
• Advantages: Relevant, up-to-date, and specific to the problem.
• Disadvantages: Time-consuming and expensive.
2. Secondary data: Data collected from existing sources, not directly gathered by the user.
Examples: Published reports, government statistics, research papers.
• Advantages: Easily accessible and cost-effective.
• Disadvantages: May not be entirely relevant or up-to-date.
Types of Data: By Structure of Data
[Diagram: Types of Data by Structure, split into Structured Data and Unstructured Data]
Types of Data: By Structure of Data
1. Structured data: Data that is organized in rows and columns, easily stored in databases.
Examples:
• Excel spreadsheets with sales data (e.g., Date, Product, Quantity, Revenue).
• Attendance records.
2. Unstructured data: Data that doesn’t follow a predefined format.
Examples:
• Images, videos, emails, social media posts.
• Text reviews on Amazon.
Types of Data: By Time or Usage Context
[Diagram: Types of Data by Time or Usage Context, split into Real-time Data and Historical Data]
Types of Data: By Time or Usage Context
1. Real-time data: Data generated and updated instantly.
Examples:
• Stock market prices (collected from stock exchanges).
• GPS tracking data for logistics companies.
2. Historical data: Data from past events or time periods.
Examples:
• Weather records (from meteorological departments).
• Sales data from previous years.
Types of Data
Category (By...) | Types of Data | Examples
1. By nature (what the data represent) | Quantitative (numerical) vs. Qualitative (categorical) | E.g., revenue, temperature vs. customer gender, satisfaction level
2. By source (where data come from) | Primary vs. Secondary | E.g., data from your own survey vs. government database
3. By structure (how data are stored) | Structured vs. Unstructured | E.g., Excel tables vs. social media text, images
4. By time / usage context (when and how data are used) | Real-Time vs. Historical | E.g., traffic sensors streaming live speed data vs. archived traffic reports from last year
The Importance of Correctly Categorizing Data
➢ Properly categorizing data is a critical step in the
data analysis process.
➢ It directly affects how data is interpreted,
analyzed, and utilized for decision-making.
The Importance of Correctly Categorizing Data
Practical example:
• Imagine you are working for a retailer collecting data on
customer purchases.
• This data includes: Customer age group, Sales revenue, Product
categories, Purchase date
Task 1: Think about the impact of incorrect categorization
What could happen if the following data is miscategorized?
• Customer age group (categorical) is treated as numerical.
• Sales revenue (numerical) is treated as categorical.
The Importance of Correctly Categorizing Data
Answer:
➢ Misleading insights: if age groups like "21-30", "31-40", etc., are treated as numbers,
calculations like averages or sums will have no real meaning and lead to flawed
demographic analysis.
Example: An average "age group" of 35.5 doesn’t provide any meaningful demographic
information.
➢ Loss of trend analysis: revenue trends over time cannot be analyzed if revenue values are
grouped into arbitrary categories.
Example: Grouping revenues into categories like "low," "medium," and "high" may
oversimplify data and prevent precise measurement of growth or decline.
The Importance of Correctly Categorizing Data
Practical example:
• Imagine you are working for a retailer collecting data on customer
purchases.
• This data includes: Customer age group, Sales revenue, Product
categories, Purchase date
Task 2: Reflect on the benefits of correct categorization
How can categorizing age group as categorical help analyze purchasing
behavior?
How can treating sales revenue as numerical help identify trends?
The Importance of Correctly Categorizing Data
Answer:
➢ Enables meaningful demographic analysis and helps identify target customers for
marketing strategies.
Example: The "21-30" age group spends more on electronics, while the "31-40" group focuses on home
appliances.
➢ Supports detailed revenue analysis and allows advanced calculations like total revenue,
revenue per product category, and forecasting.
Example: Average customer spending can be calculated for different time periods or product
categories.
Growth trends over time (e.g., monthly sales growth) can be analyzed effectively.
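A minimal sketch of these benefits, assuming pandas is installed. The purchase records and column names are hypothetical.

```python
import pandas as pd

# Hypothetical purchase records, for illustration only.
purchases = pd.DataFrame({
    "age_group": ["21-30", "31-40", "21-30", "31-40", "21-30"],
    "category": ["Electronics", "Home appliances", "Electronics", "Electronics", "Furniture"],
    "revenue": [120.0, 300.0, 80.0, 150.0, 200.0],
})

# Age group is categorical: use it to split the data, not for arithmetic.
purchases["age_group"] = purchases["age_group"].astype("category")

# Revenue is numerical: sums and means are meaningful.
revenue_by_group = purchases.groupby("age_group", observed=True)["revenue"].sum()
mean_revenue = purchases["revenue"].mean()

print(revenue_by_group)
print("Average revenue per purchase:", mean_revenue)
```

Here the categorical variable defines the groups and the numerical variable is what gets summed and averaged within them.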
1.7 Collecting Data
Obtaining Data
1. Data from a published source
2. Data from a designed experiment
3. Data from an observational study (or survey)
Obtaining Data
Published source
Book, journal, newspaper, Web site
Designed experiment
Researcher exerts strict control over the units
Survey
A group of people are surveyed and their responses are recorded
Observational study
Units are observed in natural setting and variables of interest are
recorded
Designed Experiment
A designed experiment is a data-collection method where the
researcher exerts full control over the characteristics of the
experimental units sampled. These experiments typically involve a
group of experimental units that are assigned the treatment and an
untreated (or control) group.
Designed experiments test what causes a change.
Observational Study
An observational study is a data-collection method where the
experimental units sampled are observed in their natural setting. No
attempt is made to control the characteristics of the experimental units
sampled. (Examples include opinion polls and surveys.)
We can find relationships but not causes.
We see what happens, not why it happens.
Designed Experiment vs. Observational Study
▪ In designed experiments, we control conditions.
▪ In observational studies, we only observe.
Samples
A representative sample exhibits characteristics typical of those
possessed by the population of interest.
A simple random sample of n experimental units is a sample
selected from the population in such a way that every different
sample of size n has an equal chance of selection.
Random Number Generators
Most researchers rely on random number
generators to automatically generate the random
sample.
Random number generators are available in table
form, and they are built into most statistical
software packages.
Example
Suppose you wish to assess the feasibility of building a new high
school. As part of your study, you would like to gauge the opinions of
people living close to the proposed building site. The neighborhood
adjacent to the site has 711 homes. Use a random number generator to
select a simple random sample of 20 households from the
neighborhood to participate in the study.
Example (cont)
Solution: In this study, your population of interest consists of the 711
households in the adjacent neighborhood. To ensure that every possible
sample of 20 households selected from the 711 has an equal chance of
selection (i.e., to ensure a simple random sample), first assign a number
from 1 to 711 to each of the households in the population. These numbers
are entered into an Excel worksheet. Now, apply the random number
generator of Excel/XLSTAT, requesting that 20 households be selected
without replacement. The figure in your text on page 17 shows one
possible set of random numbers generated from XLSTAT. You can see
that households numbered 40, 63, 108, . . . , 636 are the households to be
included in your sample.
Example (cont)
Solution: Take a sample of 20 households
Option 1: Use Excel or Google Sheets
▪ List all households in one column (numbers 1 to 711).
▪ In the next column, type the formula: =RAND()
This assigns a random number to each row.
▪ Sort your data by the RAND column (ascending).
▪ Pick the first 20 rows; these form your random sample.
▪ This gives each household an equal chance of being selected.
Example (cont)
Solution: Take a sample of 20 households
Option 2: Use Python
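The code shown on the original slide is not available here. One possible sketch, using only Python's standard library and numbering the households 1 to 711 as in the example:

```python
import random

# Households are numbered 1 to 711, as in the example.
households = range(1, 712)

# random.sample draws without replacement, so no household appears twice.
# No seed is set, so each run gives a different sample.
sample = sorted(random.sample(households, 20))

print(sample)
```

To get the same sample on every run (for example, to document the study), call random.seed with a fixed number before sampling.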
Importance of Selection
How a sample is selected from a population is of
vital importance in statistical inference because
the probability of an observed sample will be
used to infer the characteristics of the sampled
population.
Random Sampling
Stratified random sampling: used when the experimental units
associated with the population can be separated into two or more
groups of units (strata). A random sample is then drawn from each group.
Cluster sampling: sample natural groupings (clusters) of
experimental units, then collect data from all
experimental units within each selected cluster.
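The difference between the two methods can be sketched with the standard library. The population of 12 units, the two group labels, and the sample sizes are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 units, each labeled with a group (e.g., a region).
population = [(unit_id, "North" if unit_id <= 6 else "South") for unit_id in range(1, 13)]

# Group the units by label.
groups = {}
for unit_id, label in population:
    groups.setdefault(label, []).append(unit_id)

# Stratified random sampling: draw a random sample from EVERY group.
stratified = {label: rng.sample(units, 2) for label, units in groups.items()}

# Cluster sampling: randomly pick whole groups, then keep ALL units in them.
chosen_cluster = rng.choice(sorted(groups))
cluster_sample = groups[chosen_cluster]

print("Stratified:", stratified)
print("Cluster:", chosen_cluster, cluster_sample)
```

In the stratified sample every group is represented by a few units; in the cluster sample only the chosen group appears, but all of its units are included.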
Random Sampling
Systematic sampling: systematically selecting
every kth experimental unit from a list of all
experimental units.
Randomized response sampling: useful when the
questions of a pollster are likely to elicit false
answers.
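Systematic sampling can be sketched as follows. The function name, the list of 100 unit IDs, and the random starting position are assumptions made for illustration.

```python
import random

def systematic_sample(units, k, rng=random):
    """Select every k-th unit, starting from a random position in the first k."""
    start = rng.randrange(k)
    return units[start::k]

# Hypothetical list of 100 unit IDs; k = 10 gives a sample of 10 units.
units = list(range(1, 101))
sample = systematic_sample(units, k=10)

print(sample)
```

Choosing the starting position at random keeps the first unit from being fixed in advance; after that, the selection follows the list order.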
Nonrandom Sample Errors
Selection bias results when a subset of the
experimental units in the population is excluded so
that these units have no chance of being selected
for the sample.
Nonresponse bias results when the researchers
conducting a survey or study are unable to obtain
data on all experimental units selected for the
sample.
Measurement error refers to inaccuracies in the
values of the data recorded. In surveys, the error
may be due to ambiguous or leading questions and
the interviewer’s effect on the respondent.
Example
What is the most popular device used by online shoppers? To find
out, the mobile video ad network AdColony conducted a 2019
nationwide survey of 1,000 US online shoppers for Mobile
Marketer. The most popular device was a smartphone, used by
56% of the online shoppers. Other results: 28% used a desktop or
laptop computer, and 16% used a tablet.
a. Identify the data-collection method.
b. Identify the target population.
c. Are the sample data representative of the population?
Example (cont)
Solution
a. Identify the data-collection method.
The data-collection method is a survey: 1,000 online shoppers
participated in the study.
b. Identify the target population.
Presumably, Mobile Marketer (who commissioned the survey) is
interested in the devices used by all US online shoppers.
Consequently, the target population is all US consumers who use the
Internet for online shopping.
Example (cont)
c. Are the sample data representative of the population?
Because the 1,000 respondents clearly make up a subset of the target
population, they do form a sample. Whether or not the sample is
representative is unclear because Mobile Marketer provided no detailed
information on how the 1,000 shoppers were selected. If the respondents
were obtained using, say, random-digit telephone dialing, then the sample
is likely to be representative because it is a random sample.
Example (cont)
However, if the questionnaire was made available to anyone surfing
the Internet, then the respondents are self-selected (i.e., each
Internet user who saw the survey chose whether or not to respond to
it). Such a survey often suffers from nonresponse bias. It is possible
that many Internet users who chose not to respond (or who never
saw the questionnaire) would have answered the questions
differently, leading to a lower (or higher) sample percentage.
1.8 Understanding Data
Data Preparation
➢ Preparing a dataset involves several steps to ensure it is clean, consistent, and
ready for analysis.
▪ Defining key questions
▪ Collecting data
▪ Addressing the mistakes or inconsistencies in data collection.
▪ Changing the data to make it more amenable for data analysis.
➢ By following these steps, you ensure the dataset is structured, reliable, and ready
for analysis or modeling.
➢ In many projects, understanding the data and getting it ready for analysis is the
most time-consuming step in the process, since the data is usually integrated
from many sources, with different representations and formats.
Understanding Your Data Before Cleaning
Why is understanding data important?
• Avoid wasting effort: Cleaning irrelevant or poorly understood data can lead to unnecessary work or
incorrect results.
• Identify the right focus: Understanding helps clearly define which data matters most for your analysis
goals.
• Prevent mistakes: Cleaning data blindly can result in removing valuable information or introducing
errors.
• Tailor the cleaning process: The approach to cleaning depends on the characteristics of the data.
Example: If you’re analyzing customer demographics but don’t realize that some columns (e.g.,
“Region”) are critical, you might accidentally discard them during cleaning.
Understanding Your Data Before Cleaning
1. Check the dataset structure and missing values:
• What to do: Review columns, data types, size of the dataset, and missing values.
• Why:
- Helps you understand the data’s scope and limitations.
- Missing values can indicate problems (e.g., incomplete data entry) or natural
gaps (e.g., not all customers apply for loans).
• Tool in Python:
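The code shown on the original slide is not available here. A minimal sketch of these checks, assuming pandas is installed and using a small hypothetical DataFrame:

```python
import pandas as pd

# Hypothetical records; None marks a missing value.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region": ["North", "South", None, "North"],
    "loan_amount": [5000.0, None, None, 12000.0],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
df.info()          # columns, non-null counts, and memory usage in one report

# Count the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Whether a missing loan amount is an error or a natural gap cannot be read from the count alone; it has to be judged from how the data were collected.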
Understanding Your Data Before Cleaning
2. Understand the variables:
• What to do: Interpret what each column represents and its importance.
• Why: Avoid misunderstandings that could lead to incorrect cleaning or analysis.
• Questions to ask:
- What does each column mean?
- Are there categorical (e.g., "Region") or numerical (e.g., "Sales") variables?
- Are there calculated fields (e.g., "Profit" = "Revenue" - "Cost")?
3. Insights from statistical summaries:
• Does the mean differ significantly from the median, indicating skewness?
• Are there unexpected minimum/maximum values?
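Both questions can be checked with the standard library. The sales values below are hypothetical and include one unusually large entry on purpose.

```python
import statistics

# Hypothetical sales values with one unusually large entry.
sales = [120, 135, 150, 160, 175, 2400]

mean_sales = statistics.mean(sales)
median_sales = statistics.median(sales)

print("mean:", mean_sales)
print("median:", median_sales)
print("min:", min(sales), "max:", max(sales))

# A mean far above the median suggests right skew or an extreme value.
if mean_sales > median_sales:
    print("Mean exceeds median: check for right skew or outliers.")
```

A single extreme value pulls the mean upward while the median barely moves, which is why comparing the two is a quick first check.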
Understanding Your Data Before Cleaning
Practice!!!
Dataset: Bank_Customers_Loan_Application_file.csv
▪ Download resources: Download the synthetic data file
Bank_Customers_Loan_Application_file.csv from the eLearning platform.
▪ Data file placement: Place the data files in the same folder as your installed
Anaconda and Python setup (the location where the .yml environment file is
exported).
▪ Work with the synthetic data file using Python, and write down the insights you
gain while exploring to understand the raw data.
Understanding Your Data Before Cleaning
Practice!!!
Dataset: Bank_Customers_Loan_Application_file.csv
Python method:
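The code shown on the original slide is not available here. One way to load the file, assuming pandas is installed and the CSV file sits in the current working directory; the helper name load_dataset is hypothetical.

```python
from pathlib import Path

import pandas as pd

# File name from the exercise; the relative path assumes the file is in the
# current working directory.
DATA_FILE = Path("Bank_Customers_Loan_Application_file.csv")

def load_dataset(path):
    """Read a CSV file into a DataFrame."""
    return pd.read_csv(path)

if DATA_FILE.exists():
    df = load_dataset(DATA_FILE)
    print(df.shape)
else:
    print(f"{DATA_FILE} not found; check the working directory.")
```

Once the file is loaded, df is the DataFrame used in the practice steps that follow.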
Understanding Your Data Before Cleaning
Practice!!!
df.head(10)  # show the first 10 rows
Understanding Your Data Before Cleaning
Practice!!!
df.info()  # list columns, data types, and non-null counts
Understanding Your Data Before Cleaning
Practice!!!
df.isnull().sum()  # count missing values in each column
Understanding Your Data Before Cleaning
Practice!!!
df.describe()  # summary statistics for the numerical columns
1.9 Cleaning Data
Cleaning Data
Dealing with:
▪ Invalid data: values that do not make sense (e.g., a negative age).
▪ Outliers: values that are very different from most of the others.
▪ Missing values: cells in your dataset that are empty.
▪ Duplicated values: rows where all column values are identical,
or rows where only the values in the specified columns are
identical.
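The four problems can be sketched on a small hypothetical DataFrame, assuming pandas is installed. The validity rule (age must not be negative), the choice to drop rows with a missing age, and the 1.5 * IQR outlier rule are illustrative assumptions, not the only options.

```python
import pandas as pd

# Hypothetical records showing each problem once.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4, 5, 6],
    "age": [34, 45, 45, -5, 29, None, 41],
    "income": [52000, 61000, 61000, 58000, 990000, 47000, 55000],
})

# Duplicated values: drop rows where all column values are identical.
df = df.drop_duplicates()

# Invalid data: an age below zero does not make sense (assumed rule).
df = df[df["age"].isna() | (df["age"] >= 0)]

# Missing values: here rows with a missing age are dropped (one option of several).
df = df.dropna(subset=["age"])

# Outliers: flag incomes more than 1.5 * IQR beyond the quartiles (rule of thumb).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

print(df[~is_outlier])
print("Flagged as outliers:", df.loc[is_outlier, "customer_id"].tolist())
```

Outliers are only flagged here, not deleted: an extreme value may be a data-entry error or a genuine observation, and that decision needs an understanding of the data.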