
Data Classification Types Explained

The document categorizes data into various types based on different criteria such as nature, measurement scale, source of collection, processing, storage, sensitivity, time dimension, variables, and values. It explains qualitative vs. quantitative data, nominal vs. ordinal data, discrete vs. continuous data, and primary vs. secondary data, providing examples for each category. Additionally, it emphasizes the importance of data collection methods in business analytics, highlighting structured, unstructured, and semi-structured data.

Uploaded by

Comedyy Memer
Copyright
© All Rights Reserved

Types of Data
Classification of Data on Various Grounds

 1. Based on Nature (Qualitative vs. Quantitative)
 2. Based on Measurement Scale (Stevens' Classification)
 3. Based on Source of Collection
 4. Based on Data Processing
 5. Based on Storage and Structure
 6. Based on Sensitivity & Privacy
 7. Based on Time Dimension
 8. Based on Variables
 9. Based on Values
Based on Nature of Data: Qualitative vs. Quantitative Data

 1. Quantitative data
 Quantitative data answers questions such as “how many”, “how much” and “how often”.
 Quantitative data can be expressed as a number or can be quantified. Simply put, it can be measured by numerical variables.
 Quantitative data lends itself to statistical analysis and can be represented by a wide variety of graphs and charts, such as line graphs, bar graphs and scatter plots.

 Examples of quantitative data:
 Scores on tests and exams, e.g. 85, 67, 90.
 The weight of a person or a subject.
 Your shoe size.
 The temperature in a room.

 2. Qualitative data
 Qualitative data can’t be expressed as a number and can’t be measured. It consists of words, pictures, and symbols, not numbers.
 Qualitative data is also called categorical data because the information can be sorted by category, not by number.
 Qualitative data can answer questions such as “how did this happen” or “why did this happen”.

 Examples of qualitative data:
 Colors, e.g. the color of the sea.
 Your favorite holiday destination, such as Hawaii or New Zealand.
 Names, such as John or Patricia.
 Ethnicity, such as American Indian or Asian.
Nominal vs. Ordinal Data

 Nominal data
Nominal data is used only for labelling variables, without any quantitative value.
The name ‘nominal’ comes from the Latin word “nomen”, which means ‘name’.
Nominal data names a thing without implying any order. In effect, nominal values are just “labels”.

 Examples of Nominal Data:
 Gender (Women, Men)
 Hair color (Blonde, Brown, Brunette, Red, etc.)
 Marital status (Married, Single, Widowed)
 Ethnicity (Hispanic, Asian)

 Eye color is a nominal variable with a few categories (Blue, Green, Brown), and there is no way to order these categories from highest to lowest.
Ordinal data

 Ordinal data shows where a value sits in an order. This is the crucial difference from nominal data.
 Ordinal data is placed into some kind of order by its position on a scale. It may indicate superiority.
 However, you cannot do arithmetic with ordinal values because they only show sequence.
 Ordinal variables are considered “in between” qualitative and quantitative variables. In other words, ordinal data is qualitative data for which the values are ordered.
 By comparison, nominal data is qualitative data for which the values cannot be placed in order.
 We can also assign numbers to ordinal data to show relative position, for example “first, second, third”, but we cannot do math with those numbers.

 Examples of Ordinal Data:
 The first, second and third person in a competition.
 Letter grades: A, B, C, etc.
 A customer's rating of the sales experience on a scale of 1-10.
 Economic status: low, medium and high.
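The nominal/ordinal distinction can be sketched in a few lines of Python. This is only an illustration: the sample values and the GRADE_ORDER ranking are assumptions made for the example, not part of the slides.

```python
from collections import Counter

# Nominal data: labels only, so the mode (most frequent label) is the
# natural summary; sorting the labels by "size" has no meaning.
eye_colors = ["Blue", "Brown", "Green", "Brown", "Blue", "Brown"]
most_common_color = Counter(eye_colors).most_common(1)[0][0]

# Ordinal data: labels with a defined order. The ranking below is an
# assumption made for this example (D is lowest, A is highest).
GRADE_ORDER = ["D", "C", "B", "A"]
rank = {grade: position for position, grade in enumerate(GRADE_ORDER)}

grades = ["B", "A", "D", "C", "B"]
sorted_grades = sorted(grades, key=rank.__getitem__)

# The median is meaningful for ordinal data: the middle item after sorting.
median_grade = sorted_grades[len(sorted_grades) // 2]

print(most_common_color)
print(sorted_grades)
print(median_grade)
```

The ranks are used only for sorting and for picking the middle item; averaging them would treat the gaps between grades as equal, which ordinal data does not guarantee.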
Discrete vs. Continuous Data

 In statistics, marketing research, and data science, many decisions depend on whether the underlying data is discrete or continuous.

 Discrete data
Discrete data is a count that involves only integers. Discrete values cannot be subdivided into parts. For example, the number of children in a class is discrete data: you can count whole individuals, but you can’t count 1.5 kids.
 In other words, discrete data can take only certain values, and those values cannot be divided into smaller parts.
 It often has a limited number of possible values, e.g. days of the month.

 Examples of discrete data:
 The number of students in a class.
 The number of workers in a company.
 The number of home runs in a baseball game.
 The number of test questions you answered correctly.

 Continuous data
Continuous data is information that can be meaningfully divided into finer levels. It can be measured on a scale or continuum and can take almost any numeric value.
 For example, you can measure your height at very precise scales: meters, centimeters, millimeters, and so on.
 You can record many kinds of continuous measurements: width, temperature, time, and so on. This is the key difference from discrete data.
 Continuous variables can take any value between two numbers. For example, between 50 and 72 inches there are infinitely many possible heights, such as 52.04762 inches or 69.948376 inches.
 A good rule of thumb for deciding whether data is continuous or discrete: if the unit of measurement can be cut in half and still make sense, the data is continuous.

 Examples of continuous data:
 The amount of time required to complete a project.
 The height of children.
 The square footage of a two-bedroom house.
 The speed of cars.
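As a rough illustration, a quick check in Python can hint at whether a column is discrete or continuous. The helper name is_discrete and the sample values are assumptions for this sketch, and the check is only a heuristic: continuous measurements that happen to be recorded as whole numbers would be misclassified.

```python
def is_discrete(values):
    """Heuristic: treat the data as discrete if every value is a whole number."""
    return all(float(v).is_integer() for v in values)

students_per_class = [28, 31, 25, 30]            # counts of whole individuals
heights_in_inches = [52.04762, 69.948376, 61.5]  # measurements on a continuum

print(is_discrete(students_per_class))
print(is_discrete(heights_in_inches))
```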
Based on Measurement Scale (Stevens' Classification)

 Nominal Scale: Labels or names without numerical significance (e.g., eye color, marital
status).
 Ordinal Scale: Ordered categories but without a consistent scale (e.g., movie ratings,
educational qualifications).
 Interval Scale: Ordered values with meaningful differences, but no true zero (e.g.,
temperature in Celsius or Fahrenheit).
 Ratio Scale: Ordered values with meaningful differences and a true zero (e.g., weight,
height, age, income).

For Further Reading


https://www.questionpro.com/blog/nominal-ordinal-interval-ratio/
Offers | Nominal | Ordinal | Interval | Ratio
The sequence of variables is established | – | Yes | Yes | Yes
Mode | Yes | Yes | Yes | Yes
Median | – | Yes | Yes | Yes
Mean | – | – | Yes | Yes
Difference between variables can be evaluated | – | – | Yes | Yes
Addition and Subtraction of variables | – | – | Yes | Yes
Multiplication and Division of variables | – | – | – | Yes
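The table above can be turned into a small lookup that returns only the statistics that make sense for a given scale. This is a minimal sketch using Python's statistics module; the ALLOWED mapping, the summarize helper, and the sample values are assumptions for illustration.

```python
import statistics

# Which summary statistics each measurement scale supports, following the table above.
ALLOWED = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean"},
}

def summarize(values, scale):
    """Return only the statistics that are meaningful for the given scale."""
    allowed = ALLOWED[scale]
    result = {}
    if "mode" in allowed:
        result["mode"] = statistics.mode(values)
    if "median" in allowed:
        # median_low always returns an actual data point, so it also works
        # for ordinal values that are encoded as ranks.
        result["median"] = statistics.median_low(values)
    if "mean" in allowed:
        result["mean"] = statistics.mean(values)
    return result

marital_status = ["Single", "Married", "Married", "Widowed"]  # nominal
satisfaction_ranks = [1, 3, 3, 4, 5]                          # ordinal, encoded as ranks
temperatures_c = [18.0, 21.0, 21.0, 24.0]                     # interval

print(summarize(marital_status, "nominal"))
print(summarize(satisfaction_ranks, "ordinal"))
print(summarize(temperatures_c, "interval"))
```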


Based on Source of Collection

 Primary Data: Collected directly from the source for a specific purpose (e.g., surveys, interviews, experiments).
 Primary data is fresh (new) information collected for the first time by a researcher for a particular purpose. It is unique, first-hand information not published before. It is collected systematically from its place or source of origin by the researcher or the researcher's appointed agents, with some objective in mind, and it helps to solve problems in the chosen domain or sphere of interest. Once it has been used for its intended purpose, its original character is lost, and it turns into secondary data.
 Note that data collected from its source by somebody else for their own study, but never used, is still called primary data. Once used, it turns into secondary data.
 For example, visiting an unexplored cave to investigate it and recording its minute details for publication is primary data collection.
 Wessel's definition of primary data:
“Data originally collected in the process of investigation are known as primary data.”
 Secondary Data: Previously collected by someone else and reused (e.g., government reports, academic papers, business records).
 Secondary data, on the other hand, is information already collected by others and later used by a researcher (or investigator) to answer the question in hand. Hence, it is also called second-hand data. It is ready-made information obtained mostly from published sources such as company reports and government statistics. The required information is extracted from the existing work of others (e.g. published by a subject scholar, an organization, or a government agency). It is readily available to a researcher at their desk or place of work.
 For example, preparing a brief report on your country's population using the census published by the government is secondary data collection.
 Sir Wessel defined secondary data in simple words as:
“Data collected by other persons are called secondary data.”
 Another definition of secondary data, in the words of M. M. Blair:
“Secondary data are those which are already in existence and collected for some other purpose than the answering of the question in hand.”
Based on Data Processing

 Raw Data: Unprocessed and unstructured data (e.g., survey responses before analysis).
 Processed Data: Cleaned, transformed, and structured data ready for analysis (e.g., averages, percentages in a report).
Basis For Difference | Raw Data | Processed Data
Introduction | Set of unorganized quantities, values and facts | Organized form of quantities, values and facts
Refers To | Raw facts, symbols, numbers etc. | Refined and processed facts, symbols and numbers
Level Of Knowledge | First | Second
Significant? | No | Yes
Requirement | Research and observation | Analysis
Dependency | Not dependent on information | Dependent on data
Usefulness | May or may not be | Yes
Input/Output | Input for information | Output of data
Meaning | Does not provide | Provides meaning
Reproduce | Not possible | Possible
Nature | Vague | Specific
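The step from raw to processed data can be sketched as follows. The survey answers and scores are hypothetical values invented for this example; the point is only to show cleaning followed by summarizing into percentages and averages.

```python
from collections import Counter
from statistics import mean

# Raw data: survey answers exactly as collected (hypothetical values),
# with inconsistent casing, stray spaces, and a blank entry.
raw_responses = [" Yes", "no", "YES", "", "No ", "yes"]
raw_scores = ["85", "67", "90", "n/a"]

# Processing step 1: clean and standardize the text answers.
cleaned = [r.strip().lower() for r in raw_responses if r.strip()]

# Processing step 2: summarize the answers as a percentage.
counts = Counter(cleaned)
percent_yes = 100 * counts["yes"] / len(cleaned)

# Processing step 3: convert valid scores to numbers and average them.
scores = [int(s) for s in raw_scores if s.isdigit()]
average_score = mean(scores)

print(cleaned)
print(percent_yes)
print(round(average_score, 2))
```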


Based on Storage and Structure

 Structured Data: Organized in a predefined format, usually in tables/databases (e.g., sales records in SQL databases).
 Structured data refers to business data that’s organized into
specific formats based on the needs of the business. Usually,
structured data is organized into tables and
stored in data warehouses. Because the data is specifically
formatted, it's easier for the average business user to understand
and utilize it for specific purposes.
 Data is written into a database using schema-on-write, which
means the data is preformatted before going in. This method
makes it easier to manage smaller amounts of prestructured data,
such as credit card information, medical insurance data and
financial transactions.
 Pros of using structured data
 Easy for the average user to utilize and understand
 Easy for machine learning algorithms to utilize
 A greater number of analytics tools can use the data
 Requires less storage space
 Cons of using structured data
 Limited to specific uses
 More limited storage options
 Difficult and expensive to make changes
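Schema-on-write can be illustrated with Python's built-in sqlite3 module. This is a minimal sketch, not a production setup: the sales table, its columns, and the sample rows are assumptions for the example. The schema is declared before any data goes in, and a row that does not fit is rejected at write time.

```python
import sqlite3

# In-memory database; the table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sales (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
    """
)

# Rows that fit the schema are accepted.
conn.execute("INSERT INTO sales (customer, amount) VALUES (?, ?)", ("Asha", 120.50))
conn.execute("INSERT INTO sales (customer, amount) VALUES (?, ?)", ("Ben", 80.00))

# A row that violates the schema is rejected when it is written.
rejected = False
try:
    conn.execute("INSERT INTO sales (customer, amount) VALUES (?, ?)", (None, -5))
except sqlite3.IntegrityError:
    rejected = True

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(rejected, total)
```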
 Unstructured Data: Does not follow a specific format, such as text, images, videos (e.g., emails, social media posts).
 Unstructured data is an amalgamation of data formats typically
stored in data lakes. It covers everything from social media posts
to videos and text files.
 One of the key advantages of unstructured data is that it helps
provide qualitative information useful to businesses for
understanding trends and changes.
 The primary drawback of working with unstructured data is the
added complexity requiring specialized skills, tools and
understanding to analyze and use the information. This complexity
typically means working with a data specialist who can query and
analyze the information.
 In contrast to structured data, its unstructured counterpart utilizes a
schema-on-read data analysis strategy. This method means that the
data is organized as it gets pulled out of the storage location rather
than before going in.
 Pros of using unstructured data
 Easier to store due to being in native format
 Collecting and storing are faster
 Cheaper to store unstructured data using data lakes
 Provides more granular information
 Cons of using unstructured data
 More complicated to work with
 Requires highly specialized tools for organizing
 Expertise needed
 Semi-structured Data: A mix of structured and unstructured data,
often in formats like XML, JSON (e.g., log files, metadata).
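Schema-on-read with semi-structured data can be sketched with Python's json module. The log lines and field names below are hypothetical. The records are stored as they arrive, and structure is applied only when they are read for a particular analysis.

```python
import json

# Semi-structured records stored as-is (hypothetical log lines).
# The fields vary from record to record.
stored_lines = [
    '{"user": "u1", "event": "login", "device": {"os": "android"}}',
    '{"user": "u2", "event": "purchase", "amount": 49.9}',
    '{"user": "u1", "event": "logout"}',
]

def read_events(lines):
    """Apply structure at read time: pick the fields this analysis needs."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": record["user"],
            "event": record["event"],
            # Optional fields get a default instead of failing.
            "amount": record.get("amount", 0.0),
            "os": record.get("device", {}).get("os"),
        }

events = list(read_events(stored_lines))
total_spend = sum(e["amount"] for e in events)
print(events[0])
print(total_spend)
```

A different analysis could read the same stored lines with a different set of fields, which is the flexibility schema-on-read trades for the extra work at query time.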
Based on Sensitivity & Privacy

• Public Data: Accessible to everyone (e.g., census reports, weather data).
• Private Data: Restricted access, usually organizational data (e.g., company sales data, HR records).
• Sensitive Data: Requires protection due to legal or ethical concerns (e.g., personal health records, financial transactions).
• Anonymized Data: Processed to remove identifiable information (e.g., research datasets with masked identities).
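One common way to mask identities is to replace names with a keyed hash. The sketch below uses Python's hmac and hashlib modules; the key, the mask_identity helper, and the records are assumptions for illustration. Strictly speaking this is pseudonymization: it hides the name but keeps records linkable, and other fields may still identify a person, so it is weaker than full anonymization.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"

def mask_identity(identifier: str) -> str:
    """Replace an identifier with a keyed hash so records stay linkable but not readable."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

records = [
    {"name": "John", "diagnosis": "flu"},
    {"name": "Patricia", "diagnosis": "asthma"},
    {"name": "John", "diagnosis": "checkup"},
]

masked = [
    {"subject_id": mask_identity(r["name"]), "diagnosis": r["diagnosis"]}
    for r in records
]
for row in masked:
    print(row)
```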
Based on Time Dimension

 Cross-Sectional Data: Collected at a single point in time (e.g., customer survey results in 2024).
 Time-Series Data: Collected over time to observe trends (e.g., stock prices, annual GDP).
 Panel Data (Longitudinal Data): Combination of cross-sectional and time-series data (e.g., tracking customer spending habits over years).
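The relationship between the three can be shown with a small panel held as a list of records. The customers and spending figures are hypothetical. Fixing the year gives a cross-section, and fixing the customer gives a time series.

```python
# Panel data: the same customers observed in several years (hypothetical figures).
panel = [
    {"customer": "C1", "year": 2022, "spend": 100},
    {"customer": "C1", "year": 2023, "spend": 120},
    {"customer": "C1", "year": 2024, "spend": 150},
    {"customer": "C2", "year": 2022, "spend": 200},
    {"customer": "C2", "year": 2023, "spend": 180},
    {"customer": "C2", "year": 2024, "spend": 210},
]

# Cross-sectional slice: many customers, one point in time.
cross_section_2024 = {r["customer"]: r["spend"] for r in panel if r["year"] == 2024}

# Time-series slice: one customer, many points in time, in year order.
time_series_c1 = [
    (r["year"], r["spend"])
    for r in sorted(panel, key=lambda r: r["year"])
    if r["customer"] == "C1"
]

print(cross_section_2024)
print(time_series_c1)
```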
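Based on Variables

 Univariate Data: Observations on a single variable.
 Bivariate Data: Observations on two variables, typically studied for their relationship.
 Multivariate Data: Observations on more than two variables.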
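A short sketch of the difference in practice: univariate analysis describes one variable on its own, while bivariate analysis looks at how two variables move together. The study-hours and score figures are hypothetical, and the pearson helper is written out by hand for illustration. Multivariate analysis extends the same idea to more than two variables.

```python
from math import sqrt
from statistics import mean

# Hypothetical observations: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 65, 70, 80]

# Univariate analysis: describe one variable on its own.
average_score = mean(scores)

# Bivariate analysis: describe the relationship between two variables,
# here with the Pearson correlation coefficient.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

r = pearson(hours, scores)
print(average_score)
print(round(r, 3))
```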
Based on Values

 Discrete Data
 Continuous Data
Collection of Data
 Data collection is a crucial first step in Business
Analytics. The quality, accuracy, and relevance of
the data collected directly impact the validity and
success of the subsequent analysis, modeling, and
interpretation. This guide provides an overview of
various data collection methods in data science,
including primary and secondary data collection
techniques, digital data sources, and best practices
for effective data collection.
 Definition: Data collection is the process of gathering and
measuring information on variables of interest, in a
systematic way, to answer research questions, test
hypotheses, and evaluate outcomes.
 Importance:
 Provides the foundation for analysis and decision-making.
 Ensures that data is relevant, accurate, and sufficient to support the
goals of the project.
 Affects the choice of analytical methods and tools.
Types of Data in Business Analytics
 Structured Data: Data that is organized in a predefined manner, usually in tabular
form with rows and columns. Examples include databases, spreadsheets, and data
from sensors.
 Unstructured Data: Data that does not have a predefined structure. Examples
include text data (emails, social media posts), images, videos, and audio files.
 Semi-Structured Data: Data that does not reside in a relational database but has
some organizational properties, such as JSON, XML, and NoSQL databases.
Primary Data Collection Methods

 Primary data is data collected directly from the source for a specific research purpose. Primary data collection methods are typically used when specific, tailor-made data is needed.
 Surveys and Questionnaires:
 Used to collect data from a large number of respondents.
 Can be conducted online, over the phone, or in person.
 Example: Collecting customer feedback on a product or service.
 Tools: Google Forms, SurveyMonkey, Qualtrics.
 Interviews:
 Involve direct, one-on-one or group discussions with respondents to gather detailed
information.
 Can be structured (with a set of predefined questions) or unstructured (open-ended
conversations).
 Example: Conducting expert interviews to gather insights on a specific industry trend.
 Tools: Zoom, Microsoft Teams, Google Meet
 Observations:
 Involves collecting data by observing subjects in a natural or controlled environment.
 Useful for gathering real-time data and understanding natural behavior.
 Example: Observing customer behavior in a retail store to study buying patterns.
 Experiments:
 Data is collected in controlled environments where specific variables are manipulated
to observe their effect.
 Useful for testing hypotheses and establishing causal relationships.
 Example: A/B testing in digital marketing to compare the performance of two different
ads.
 Sensor Data:
 Data collected from various sensors (e.g., IoT devices, wearables) in real time.
 Widely used in smart devices, healthcare, and environmental monitoring.
 Example: Collecting data from smartwatches to monitor health metrics.
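For the A/B testing example mentioned under Experiments, one common way to compare two ads is a two-proportion z-test. The sketch below is an assumption about how such a comparison might be done, not a method prescribed by the slides; the function name and the visitor and conversion counts are invented for illustration.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF computed from the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical results: ad A converts 100 of 2000 visitors, ad B 130 of 2000.
z, p_value = two_proportion_z_test(100, 2000, 130, 2000)
print(round(z, 2), round(p_value, 3))
```

A small p-value suggests the difference in conversion rates is unlikely to be due to chance alone, which is what lets a controlled experiment support a causal claim.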
Secondary Data Collection Methods

Secondary data is data that has already been collected by someone else and is readily available for use. Secondary data collection is cost-effective and time-saving compared to primary data collection.
 Public Databases and Repositories:
 Open data sources like government databases, scientific repositories, and international organizations.
 Examples: UCI Machine Learning Repository, Kaggle Datasets, World Bank Data.
 Web Scraping:
 Automated process of extracting data from websites using scripts or software tools.
 Useful for collecting large volumes of data from online sources such as social media, e-commerce sites, and news websites.
 Tools: Python libraries (BeautifulSoup, Scrapy), R packages (rvest), and online tools
(Octoparse).
 APIs (Application Programming Interfaces):
 Allow access to data from web services or online platforms programmatically.
 Commonly used to collect data from social media platforms (Twitter, Facebook), financial markets,
or weather services.
 Tools: httr (R), requests (Python), Postman.
 Pre-existing Research:
 Data from previous studies, research papers, and academic publications.
 Useful for conducting meta-analyses or comparative studies.
 Commercial Data Providers:
 Companies that provide data for a fee, such as market research firms, data aggregators, and
specialized data providers.
 Examples: Nielsen, Statista, LexisNexis.
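The core idea of web scraping, pulling specific values out of page markup, can be sketched without any network access. The example below uses Python's built-in html.parser as a stand-in for the libraries named above (BeautifulSoup, Scrapy); the sample markup, the "price" class name, and the PriceParser class are assumptions for illustration.

```python
from html.parser import HTMLParser

# Inline sample page so the example runs without network access.
# The markup and the "price" class name are assumptions for illustration.
SAMPLE_HTML = """
<ul>
  <li class="product">Kettle <span class="price">24.99</span></li>
  <li class="product">Toaster <span class="price">31.50</span></li>
</ul>
"""

class PriceParser(HTMLParser):
    """Collect the text inside every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(float(data))

parser = PriceParser()
parser.feed(SAMPLE_HTML)
print(parser.prices)
```

When scraping a real site, the page would be fetched first, and the site's terms of use and robots.txt should be respected.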
Digital Data Sources

 Digital data sources are increasingly popular in data science due to the vast
amount of data generated through digital interactions. Some common digital
data sources include:
 Social Media Data:
 Data from platforms like Twitter, Facebook, Instagram, and LinkedIn.
 Useful for sentiment analysis, trend analysis, and marketing research.
 Tools: Twitter API, Facebook Graph API, Instagram Graph API.
 Website Analytics:
 Data from tools like Google Analytics, which provide insights into website traffic, user
behavior, and conversions.
 Useful for digital marketing, user experience optimization, and e-commerce analysis.
 E-commerce Data:
 Data from online retailers and marketplaces (Amazon, eBay, Alibaba).
 Useful for sales analysis, inventory management, and customer behavior
studies.
 Sensor and IoT Data:
 Data from Internet of Things (IoT) devices, including smart home devices,
industrial sensors, and health trackers.
 Useful for predictive maintenance, smart city planning, and health
monitoring.
Process of Data Collection

 Define Objectives: Clearly define the objectives and scope of data collection to ensure relevance and focus.
 Choose the Right Method: Select the appropriate data collection
method based on the type of data needed, budget, time, and resources
available.
 Ensure Data Quality: Implement measures to ensure the accuracy,
consistency, completeness, and reliability of the data collected.
 Follow Ethical Guidelines: Ensure that data collection adheres to
ethical standards, including obtaining consent from participants,
ensuring confidentiality, and adhering to data privacy regulations.
 Data Cleaning and Preprocessing: Prepare the collected data for
analysis by handling missing values, outliers, and duplicates, and
converting data to the required formats.
 Automate Collection Where Possible: Use automation tools and
scripts to collect data efficiently, especially for large datasets or
continuous data streams.
 Document Data Collection Process: Maintain detailed
documentation of the data collection process, including methods, tools,
and any assumptions or limitations.
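The cleaning and preprocessing step can be sketched in plain Python. The records, the valid age range of 0 to 120, and the choice of the median as a fill value are all assumptions for this example; real projects would pick rules that suit their data.

```python
from statistics import median

# Hypothetical collected records with a duplicate, a missing value, and an outlier.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 29},
    {"id": 3, "age": 29},
    {"id": 4, "age": 310},
    {"id": 5, "age": 41},
]

# 1. Remove duplicates, keeping the first record seen for each id.
seen, unique = set(), []
for row in records:
    if row["id"] not in seen:
        seen.add(row["id"])
        unique.append(dict(row))

# 2. Treat outliers as missing, using a simple range rule (assumed valid ages: 0 to 120).
for row in unique:
    if row["age"] is not None and not 0 <= row["age"] <= 120:
        row["age"] = None

# 3. Fill missing values with the median of the remaining valid ages.
valid_ages = [row["age"] for row in unique if row["age"] is not None]
fill_value = median(valid_ages)
cleaned = [
    {**row, "age": row["age"] if row["age"] is not None else fill_value}
    for row in unique
]

print(cleaned)
```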
Tools and Technologies for Data Collection

 Programming Languages:
 Python: Libraries like pandas, requests, BeautifulSoup, and Selenium for web scraping, API
interaction, and data handling.
 R: Packages like httr, rvest, readxl, and DBI for data import, web scraping, and database access.
 Data Collection Platforms:
 Qualtrics, SurveyMonkey: For surveys and questionnaires.
 Google Forms, Typeform: For online form-based data collection.
 Database Management Systems (DBMS):
 MySQL, PostgreSQL, MongoDB: For collecting and storing structured and unstructured data.
 Data Integration Tools:
 Talend, Apache NiFi, Microsoft Power Automate: For integrating and transforming data from
various sources.
