Types of Data
Classification of Data on Various Grounds
1. Based on Nature (Qualitative vs. Quantitative)
2. Based on Measurement Scale (Stevens' Classification)
3. Based on Source of Collection
4. Based on Data Processing
5. Based on Storage and Structure
6. Based on Sensitivity & Privacy
7. Based on Time Dimension
8. Based on Variables
9. Based on Values
Based on Nature of Data: Qualitative vs. Quantitative Data
1. Quantitative data
Quantitative data is the easiest to explain. It answers key questions such as "how many", "how much", and "how often".
Quantitative data can be expressed as a number or can be quantified. Simply put, it can be measured by numerical
variables.
Quantitative data is easily amenable to statistical manipulation and can be represented by a wide variety of charts and graphs, such as line graphs, bar graphs, and scatter plots.
Examples of quantitative data:
Scores on tests and exams, e.g., 85, 67, 90.
The weight of a person or a subject.
Your shoe size.
The temperature in a room.
2. Qualitative data
Qualitative data can’t be expressed as a number and can’t be measured.
Qualitative data consist of words, pictures, and symbols, not numbers.
Qualitative data is also called categorical data because the information can be
sorted by category, not by number.
Qualitative data can answer questions such as "how did this happen?" or "why did this happen?"
Examples of qualitative data:
Colors, e.g., the color of the sea.
Your favorite holiday destination, such as Hawaii or New Zealand.
Names, such as John or Patricia.
Ethnicity such as American Indian, Asian, etc.
Nominal vs. Ordinal Data
Nominal data
Nominal data is used just for labelling variables, without any type of quantitative value.
The name ‘nominal’ comes from the Latin word “nomen” which means ‘name’.
Nominal data just names a thing without assigning any order to it. In fact, nominal data could simply be called "labels."
Examples of Nominal Data:
Gender (Women, Men)
Hair color (Blonde, Brown, Brunette, Red, etc.)
Marital status (Married, Single, Widowed)
Ethnicity (Hispanic, Asian)
Eye color is a nominal variable having a few categories (Blue, Green, Brown) and there
is no way to order these categories from highest to lowest.
Ordinal data
Ordinal data shows where a number is in order. This is the crucial difference from nominal
types of data.
Ordinal data is data which is placed into some kind of order by their position on a scale.
Ordinal data may indicate superiority.
However, you cannot do arithmetic with ordinal numbers because they only show
sequence.
Ordinal variables are considered "in between" qualitative and quantitative variables.
In other words, ordinal data is qualitative data for which the values are ordered.
By contrast, nominal data is qualitative data whose values cannot be placed in a meaningful order.
We can also assign numbers to ordinal data to show their relative position. But we cannot do
math with those numbers. For example: “first, second, third…etc.”
Examples of Ordinal Data:
The first, second, and third person in a competition.
Letter grades: A, B, C, etc.
When a company asks a customer to rate the sales experience on a scale of 1-10.
Economic status: low, medium, and high.
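The difference can be sketched in Python (toy data, with an assumed numeric coding for economic status): ordinal values can be ranked, while nominal values can only be counted.

```python
# Ordinal data: categories have a meaningful order. We can assign numeric
# codes to show relative position, but arithmetic on the codes is invalid.
ORDER = {"low": 1, "medium": 2, "high": 3}      # economic status, coded

responses = ["medium", "low", "high", "medium"]
highest = max(responses, key=ORDER.get)          # ordering is meaningful
print(highest)                                   # high

# Nominal data: no order exists, so the only sensible operation is counting.
eye_colors = ["Blue", "Brown", "Blue", "Green"]
print(eye_colors.count("Blue"))                  # 2
```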
Discrete vs. Continuous Data
In statistics, marketing research, and data science, many decisions depend on whether the underlying data is discrete or continuous.
Discrete data
Discrete data is a count that involves only integers. The discrete values cannot be subdivided into parts.
For example, the number of children in a class is discrete data. You can count whole individuals. You can’t count 1.5
kids.
To put in other words, discrete data can take only certain values. The data variables cannot be
divided into smaller parts.
It has a limited number of possible values e.g. days of the month.
Examples of discrete data:
The number of students in a class.
The number of workers in a company.
The number of home runs in a baseball game.
The number of test questions you answered correctly.
Continuous data
Continuous data is information that could be meaningfully divided into finer levels. It can be
measured on a scale or continuum and can have almost any numeric value.
For example, you can measure your height on ever more precise scales: meters, centimeters, millimeters, and so on.
You can record continuous data for many kinds of measurements: width, temperature, time, and so on.
This is where the key difference from discrete types of data lies.
Continuous variables can take any value between two numbers. For example, between 50 and 72 inches there are infinitely many possible heights: 52.04762 inches, 69.948376 inches, and so on.
A good rule of thumb for deciding whether data is continuous or discrete: if the point of measurement can be cut in half and still make sense, the data is continuous.
Examples of continuous data:
The amount of time required to complete a project.
The height of children.
The square footage of a two-bedroom house.
The speed of cars.
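The distinction matters when summarizing data; a short Python sketch with made-up numbers: discrete values are counted directly, while continuous values are grouped into bins first, because exact repeats are rare.

```python
from collections import Counter

# Discrete data: count exact values directly (test questions answered correctly).
correct = [8, 9, 8, 10, 7, 9, 9]
print(Counter(correct))          # each whole number is its own category

# Continuous data: exact values rarely repeat, so bin into intervals first.
heights_cm = [152.3, 160.8, 149.5, 171.2, 158.9, 163.4]
bins = Counter((h // 10) * 10 for h in heights_cm)   # 10 cm wide bins
print(sorted(bins.items()))      # e.g. two children fall in the 150-160 cm bin
```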
Based on Measurement Scale (Stevens' Classification)
Nominal Scale: Labels or names without numerical significance (e.g., eye color, marital
status).
Ordinal Scale: Ordered categories but without a consistent scale (e.g., movie ratings,
educational qualifications).
Interval Scale: Ordered values with meaningful differences, but no true zero (e.g.,
temperature in Celsius or Fahrenheit).
Ratio Scale: Ordered values with meaningful differences and a true zero (e.g., weight,
height, age, income).
For Further Reading
https://www.questionpro.com/blog/nominal-ordinal-interval-ratio/
Offers                                          Nominal  Ordinal  Interval  Ratio
The sequence of variables is established        –        Yes      Yes       Yes
Mode                                            Yes      Yes      Yes       Yes
Median                                          –        Yes      Yes       Yes
Mean                                            –        –        Yes       Yes
Difference between variables can be evaluated   –        –        Yes       Yes
Addition and subtraction of variables           –        –        Yes       Yes
Multiplication and division of variables        –        –        –         Yes
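The table above can be illustrated with Python's statistics module (illustrative values): the mode is valid at every level, the median requires order, and the mean requires at least interval data.

```python
import statistics as st

# Nominal (eye color): only the mode is meaningful.
eye_color = ["Brown", "Blue", "Brown", "Green", "Brown"]
print(st.mode(eye_color))              # Brown

# Ordinal (letter grades coded D=1 .. A=4): median is valid, mean is not.
grades = [1, 3, 2, 2, 4]
print(st.median(grades))               # 2

# Interval/Ratio (height in cm): mean and differences are meaningful.
heights = [160.0, 172.5, 168.1]
print(round(st.mean(heights), 1))      # 166.9
```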
Based on Source of Collection
Primary Data: Collected directly from the source for a specific purpose (e.g., surveys, interviews,
experiments).
Primary data are fresh (new) information collected for the first time by a researcher himself for a
particular purpose. It is a unique, first-hand and qualitative information not published before. It is
collected systematically from its place or source of origin by the researcher himself or his appointed
agents. It is obtained initially as a result of research efforts taken by a researcher (and his team) with
some objective in mind. It helps to solve problems in the researcher's domain of choice or
sphere of interest. Once it is used for any purpose, its original character is lost and it
becomes secondary data.
Note that even if data is originally collected by somebody else from its source for their own
study but never used, it is still called primary data; once used, it becomes secondary data.
For example, visiting an unexplored cave to investigate it and later recording its minute
details for publication is primary data collection.
Wessel's definition of primary data,
“Data originally collected in the process of investigation are known as primary data.”
Secondary Data: Previously collected by someone else and reused (e.g., government reports, academic
papers, business records).
Secondary data, on the other hand, is information already collected by somebody else and
later used by a researcher (or investigator) to answer the question at hand. Hence, it is also
called second-hand data. It is ready-made, often quantitative information obtained mostly from
published sources such as company reports and statistics published by governments. Here the
required information is extracted from the already-known work of others (e.g., published by a
subject scholar, an organization, or a government agency). It is readily available to a
researcher at their desk or place of work.
For example, preparing a brief report on your country's population by referring to the census
published by the government is secondary data collection.
Wessel defined secondary data in simple words as,
“Data collected by other persons are called secondary data.”
Another definition of secondary data in words of M. M. Blair,
“Secondary data are those which are already in existence and collected for some other purpose
than the answering of the question in hand.”
Based on Data Processing
Raw Data: Unprocessed and unstructured data (e.g., survey
responses before analysis).
Processed Data: Cleaned, transformed, and structured data ready for
analysis (e.g., averages, percentages in a report).
Basis for Difference   Raw Data                                          Processed Data
Introduction           Set of unorganized quantities, values, and facts  Organized form of quantities, values, and facts
Refers to              Raw facts, symbols, numbers, etc.                 Refined and processed facts, symbols, and numbers
Level of knowledge     First                                             Second
Significant?           No                                                Yes
Requirement            Research and observation                          Analysis
Dependency             Not dependent on information                      Dependent on data
Usefulness             May or may not be useful                          Yes
Input/Output           Input for information                             Output of data
Meaning                Does not provide meaning                          Provides meaning
Reproducible?          Not possible                                      Possible
Nature                 Vague                                             Specific
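As a small Python sketch (hypothetical survey responses), raw data becomes processed data after cleaning and summarizing:

```python
# Raw data: survey responses exactly as collected -- inconsistent and messy.
raw = ["  Yes", "no", "YES", "", "No ", "yes"]

# Processing: strip whitespace, normalize case, and drop blank responses.
cleaned = [r.strip().lower() for r in raw if r.strip()]

# Processed data: a meaningful summary figure, ready for a report.
yes_share = cleaned.count("yes") / len(cleaned)
print(f"{yes_share:.0%} answered yes")   # 60% answered yes
```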
Based on Storage and Structure
Structured Data: Organized in a predefined format, usually in
tables/databases (e.g., sales records in SQL databases).
Structured data refers to business data that’s organized into
specific formats based on the needs of the business. Usually,
structured data is organized into tables and
stored in data warehouses. Because the data is specifically
formatted, it's easier for the average business user to understand
and utilize it for specific purposes.
Data is written into a database using schema-on-write, which
means the data is preformatted before going in. This method
makes it easier to manage smaller amounts of prestructured data,
such as credit card information, medical insurance data and
financial transactions.
Pros of using structured data
Easy for the average user to utilize and understand
Easy for machine learning algorithms to utilize
A greater number of analytics tools can use the data
Requires less storage space
Cons of using structured data
Limited to specific uses
More limited storage options
Difficult and expensive to make changes
Unstructured Data: Does not follow a specific format, such as
text, images, videos (e.g., emails, social media posts).
Unstructured data is an amalgamation of data formats typically
stored in data lakes. It covers everything from social media posts
to videos and text files.
One of the key advantages of unstructured data is that it helps
provide qualitative information useful to businesses for
understanding trends and changes.
The primary drawback of working with unstructured data is the
added complexity requiring specialized skills, tools and
understanding to analyze and use the information. This complexity
typically means working with a data specialist who can query and
analyze the information.
In contrast to structured data, its unstructured counterpart utilizes a
schema-on-read data analysis strategy. This method means that the
data is organized as it gets pulled out of the storage location rather
than before going in.
Pros of using unstructured data
Easier to store due to being in native format
Collecting and storing are faster
Cheaper to store unstructured data using data lakes
Provides more granular information
Cons of using unstructured data
More complicated to work with
Requires highly specialized tools for organizing
Expertise needed
Semi-structured Data: A mix of structured and unstructured data,
often in formats like XML, JSON (e.g., log files, metadata).
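A short Python example of semi-structured data (a hypothetical log record): the fields are tagged, but the nested structure can vary from record to record, unlike a fixed table.

```python
import json

# A hypothetical JSON log entry: tagged fields, but the nested "meta"
# object may hold different keys in different records.
record = '{"user": "u123", "event": "login", "meta": {"device": "mobile"}}'

parsed = json.loads(record)
print(parsed["event"])                 # login
print(parsed["meta"].get("device"))    # mobile (may be absent in other records)
```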
Based on Sensitivity & Privacy
• Public Data: Accessible to everyone (e.g., census reports, weather data).
• Private Data: Restricted access, usually organizational data (e.g., company sales data, HR records).
• Sensitive Data: Requires protection due to legal or ethical concerns (e.g., personal health records, financial transactions).
• Anonymized Data: Processed to remove identifiable information (e.g., research datasets with masked identities).
Based on Time Dimension
Cross-Sectional Data: Collected at a single point in time (e.g.,
customer survey results in 2024).
Time-Series Data: Collected over time to observe trends (e.g., stock
prices, annual GDP).
Panel Data (Longitudinal Data): Combination of cross-sectional and
time-series data (e.g., tracking customer spending habits over years).
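The three time dimensions can be sketched with toy Python data structures (all numbers hypothetical):

```python
# Cross-sectional: many units observed at a single point in time.
satisfaction_2024 = {"cust_A": 7, "cust_B": 9, "cust_C": 6}

# Time series: one unit observed over many points in time.
gdp = [("2021", 3.1), ("2022", 3.3), ("2023", 3.6)]   # annual GDP, trillions

# Panel (longitudinal): many units, each tracked over time.
spending = {
    "cust_A": {"2022": 120, "2023": 135},
    "cust_B": {"2022": 90,  "2023": 110},
}
print(spending["cust_A"]["2023"])   # 135
```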
Based on Variables
Univariate Data (one variable, e.g., student heights)
Bivariate Data (two variables, e.g., height and weight)
Multivariate Data (three or more variables, e.g., height, weight, and age)
Based on Values
Discrete Data
Continuous Data
Collection of Data
Data collection is a crucial first step in Business
Analytics. The quality, accuracy, and relevance of
the data collected directly impact the validity and
success of the subsequent analysis, modeling, and
interpretation. This guide provides an overview of
various data collection methods in data science,
including primary and secondary data collection
techniques, digital data sources, and best practices
for effective data collection.
Definition: Data collection is the process of gathering and
measuring information on variables of interest, in a
systematic way, to answer research questions, test
hypotheses, and evaluate outcomes.
Importance:
Provides the foundation for analysis and decision-making.
Ensures that data is relevant, accurate, and sufficient to support the
goals of the project.
Affects the choice of analytical methods and tools.
Types of Data in Business
Analytics
Structured Data: Data that is organized in a predefined manner, usually in tabular
form with rows and columns. Examples include databases, spreadsheets, and data
from sensors.
Unstructured Data: Data that does not have a predefined structure. Examples
include text data (emails, social media posts), images, videos, and audio files.
Semi-Structured Data: Data that does not reside in a relational database but has
some organizational properties, such as JSON, XML, and NoSQL databases.
Primary Data Collection Methods
Primary data is data collected directly from the source for a specific research
purpose. Primary data collection methods are typically used when specific, tailor-
made data is needed.
Surveys and Questionnaires:
Used to collect data from a large number of respondents.
Can be conducted online, over the phone, or in person.
Example: Collecting customer feedback on a product or service.
Tools: Google Forms, SurveyMonkey, Qualtrics.
Interviews:
Involve direct, one-on-one or group discussions with respondents to gather detailed
information.
Can be structured (with a set of predefined questions) or unstructured (open-ended
conversations).
Example: Conducting expert interviews to gather insights on a specific industry trend.
Tools: Zoom, Microsoft Teams, Google Meet.
Observations:
Involves collecting data by observing subjects in a natural or controlled environment.
Useful for gathering real-time data and understanding natural behavior.
Example: Observing customer behavior in a retail store to study buying patterns.
Experiments:
Data is collected in controlled environments where specific variables are manipulated
to observe their effect.
Useful for testing hypotheses and establishing causal relationships.
Example: A/B testing in digital marketing to compare the performance of two different
ads.
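A minimal A/B-test computation in Python (made-up counts); a real analysis would follow this with a significance test, such as a two-proportion z-test, before concluding that one variant wins:

```python
# Hypothetical results: clicks and views for two ad variants.
clicks_a, views_a = 120, 2400
clicks_b, views_b = 156, 2400

rate_a = clicks_a / views_a            # conversion rate of variant A
rate_b = clicks_b / views_b            # conversion rate of variant B
lift = (rate_b - rate_a) / rate_a      # relative improvement of B over A
print(f"A: {rate_a:.1%}, B: {rate_b:.1%}, lift: {lift:.0%}")
```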
Sensor Data:
Data collected from various sensors (e.g., IoT devices, wearables) in real time.
Widely used in smart devices, healthcare, and environmental monitoring.
Example: Collecting data from smartwatches to monitor health metrics.
Secondary Data Collection Methods
Secondary data is data that has already been collected by someone else and is readily
available for use. Secondary data collection is cost-effective and time-saving
compared to primary data collection.
Public Databases and Repositories:
Open data sources like government databases, scientific repositories, and international
organizations.
Examples: UCI Machine Learning Repository, Kaggle Datasets, World Bank Data.
Web Scraping:
Automated process of extracting data from websites using scripts or software tools.
Useful for collecting large volumes of data from online sources such as social media, e-
commerce sites, and news websites.
Tools: Python libraries (BeautifulSoup, Scrapy), R packages (rvest), and online tools
(Octoparse).
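The tools above (BeautifulSoup, Scrapy) handle messy real-world pages; as a dependency-free sketch of the extract-by-tag idea, Python's built-in html.parser can pull data from a toy page (the HTML below is invented):

```python
from html.parser import HTMLParser

# Toy HTML standing in for a fetched e-commerce listing page.
page = '<ul><li class="price">19.99</li><li class="price">24.50</li></ul>'

class PriceScraper(HTMLParser):
    """Collects the text inside every <li class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []
    def handle_starttag(self, tag, attrs):
        self.in_price = tag == "li" and ("class", "price") in attrs
    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data))
    def handle_endtag(self, tag):
        self.in_price = False

scraper = PriceScraper()
scraper.feed(page)
print(scraper.prices)    # [19.99, 24.5]
```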
APIs (Application Programming Interfaces):
Allow access to data from web services or online platforms programmatically.
Commonly used to collect data from social media platforms (Twitter, Facebook), financial markets,
or weather services.
Tools: httr (R), requests (Python), Postman.
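A sketch of the API pattern using only the standard library (the endpoint, parameters, and response below are invented; in practice a library like requests sends the call and the service returns live JSON):

```python
import json
from urllib.parse import urlencode

# Build a typical REST query string (endpoint and API key are hypothetical).
base = "https://api.example.com/v1/weather"
params = {"city": "Mumbai", "units": "metric", "appid": "YOUR_KEY"}
url = f"{base}?{urlencode(params)}"
print(url)

# A canned response standing in for what the service would send back.
body = '{"city": "Mumbai", "temp_c": 31.4}'
data = json.loads(body)
print(data["temp_c"])   # 31.4
```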
Pre-existing Research:
Data from previous studies, research papers, and academic publications.
Useful for conducting meta-analyses or comparative studies.
Commercial Data Providers:
Companies that provide data for a fee, such as market research firms, data aggregators, and
specialized data providers.
Examples: Nielsen, Statista, LexisNexis.
Digital Data Sources
Digital data sources are increasingly popular in data science due to the vast
amount of data generated through digital interactions. Some common digital
data sources include:
Social Media Data:
Data from platforms like Twitter, Facebook, Instagram, and LinkedIn.
Useful for sentiment analysis, trend analysis, and marketing research.
Tools: Twitter API, Facebook Graph API, Instagram Graph API.
Website Analytics:
Data from tools like Google Analytics, which provide insights into website traffic, user
behavior, and conversions.
Useful for digital marketing, user experience optimization, and e-commerce analysis.
E-commerce Data:
Data from online retailers and marketplaces (Amazon, eBay, Alibaba).
Useful for sales analysis, inventory management, and customer behavior
studies.
Sensor and IoT Data:
Data from Internet of Things (IoT) devices, including smart home devices,
industrial sensors, and health trackers.
Useful for predictive maintenance, smart city planning, and health
monitoring.
Process of Data Collection
Define Objectives: Clearly define the objectives and scope of data
collection to ensure relevance and focus.
Choose the Right Method: Select the appropriate data collection
method based on the type of data needed, budget, time, and resources
available.
Ensure Data Quality: Implement measures to ensure the accuracy,
consistency, completeness, and reliability of the data collected.
Follow Ethical Guidelines: Ensure that data collection adheres to
ethical standards, including obtaining consent from participants,
ensuring confidentiality, and adhering to data privacy regulations.
Data Cleaning and Preprocessing: Prepare the collected data for
analysis by handling missing values, outliers, and duplicates, and
converting data to the required formats.
Automate Collection Where Possible: Use automation tools and
scripts to collect data efficiently, especially for large datasets or
continuous data streams.
Document Data Collection Process: Maintain detailed
documentation of the data collection process, including methods, tools,
and any assumptions or limitations.
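The cleaning and preprocessing step above can be sketched in plain Python (hypothetical records containing a duplicate, a missing value, and numbers stored as strings):

```python
# Raw records: one duplicate id, one missing amount, amounts as strings.
raw_rows = [
    {"id": 1, "amount": "250"},
    {"id": 2, "amount": None},     # missing value -- drop
    {"id": 1, "amount": "250"},    # duplicate of id 1 -- drop
    {"id": 3, "amount": "3100"},
]

seen, clean_rows = set(), []
for row in raw_rows:
    if row["id"] in seen or row["amount"] is None:
        continue                   # handle duplicates and missing values
    seen.add(row["id"])
    # Convert to the required format (string -> float).
    clean_rows.append({"id": row["id"], "amount": float(row["amount"])})

print(clean_rows)   # [{'id': 1, 'amount': 250.0}, {'id': 3, 'amount': 3100.0}]
```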
Tools and Technologies for Data Collection
Programming Languages:
Python: Libraries like pandas, requests, BeautifulSoup, and Selenium for web scraping, API
interaction, and data handling.
R: Packages like httr, rvest, readxl, and DBI for data import, web scraping, and database access.
Data Collection Platforms:
Qualtrics, SurveyMonkey: For surveys and questionnaires.
Google Forms, Typeform: For online form-based data collection.
Database Management Systems (DBMS):
MySQL, PostgreSQL, MongoDB: For collecting and storing structured and unstructured data.
Data Integration Tools:
Talend, Apache NiFi, Microsoft Power Automate: For integrating and transforming data from
various sources.