Chapter2 - Data Collection Methods and Data Types

CDS501:
PRINCIPLES & PRACTICES OF

DATA SCIENCE & ANALYTICS
Data Collection and Data Types
Copyright © 2022 Halim Noor, USM

Outline
▪ Primary and Secondary
▪ Structured, Semi-structured and Unstructured Data
▪ Data Types

Primary Data
▪ Primary
▪ Data that is collected directly from the data source without going
through any existing sources
▪ Expensive and time consuming
▪ Data is reliable, authentic, up to date and objective (collected with
purpose)
▪ Ownership – belong to the organization
▪ Interview, Focus Group, Survey and Questionnaire, Observation,
Experiment

Interview
▪ Involve two groups of people: Interviewer and Interviewee
▪ Question and responses may be oral or verbal (spoken/written)
▪ In-person or telephonic interview
▪ Pros
▪ In-depth information can be collected
▪ Non-response and response bias can be detected
▪ The samples can be controlled
▪ Allows open ended question e.g. what did you like best about the product?
▪ Cons
▪ Time consuming and expensive
▪ Interviewer may be biased
▪ Data is complex to interpret

Focus Group
▪ Involves 2 or more people with similar characteristics (target audience)
▪ Similar to interviews, but this involves discussions and interactions led
by facilitators rather than questions and answers
▪ Less formal – participants do most of the talking, facilitators oversees
the process
▪ Pros
▪ Low cost compared to interviews (the interviewer does not have to discuss
with each participant individually)
▪ Takes lesser time
▪ Data is relatively less complex than interview (light analysis)
▪ Cons
▪ Response bias due to facilitators or participants might be subjective to what
people will think about sharing a sincere opinion
▪ Speaking time of some attendees may be higher than that of others, making
their contribution disproportionate
▪ Group thinking does not clearly mirror individual opinions

Survey and Questionnaire
▪ A group of questions typed or written down to obtain responses
▪ Can be carried out online or offline
▪ Pros
▪ Respondents have adequate time to give responses
▪ Responses can be in the form of categorical and numerical values
▪ It is free from the bias of the interviewer
▪ They are cheaper compared to interviews
▪ Cons
▪ A high rate of non-response bias
▪ It is inflexible and can't be changed once sent
▪ It is a slow process

Observation
▪ Involves collecting information by observation or using scientific tools
▪ Systematically planned and subjected to checks and controls
▪ Approaches: structured or unstructured, controlled or uncontrolled
▪ E.g. watches a person teach a class to determine how well the delivery,
counts number of visitors, measure temperature using sensors
▪ Pros
▪ The data is objective
▪ The data is not affected by past or future events
▪ Cons
▪ The information is limited
▪ Expensive

Experiment
▪ A structured study where the researchers attempt to understand the
causes, effects
▪ The approach is controlled e.g. which subject, how they are grouped
and the treatment they received
▪ E.g action or treatment is carried out on subject, then reactions are
recorded
▪ Pros
▪ It is usually objective since the data recorded are the results of a process
▪ Non-response bias is eliminated
▪ Cons
▪ Incorrect data may be recorded due to human error
▪ It is expensive

Secondary Data
▪ Data that has been collected in the past by someone else but made
available for others to use
▪ Affordable and requires very little to no cost to acquire them
▪ Easily accessible (shared publicly)
▪ Data may not be suited to the project needs
▪ Data may not authentic – need further verification
▪ Data may be outdated
▪ Kaggle, UCI Repository

Types of Data Structure
▪ Structured data
▪ Semi-structured data
▪ Unstructured data

Structured Data
▪ Conform to formal data models such as database and excel
▪ Table-structured data with rows and columns (headers)
▪ Can store numeric (numerical and categorical) or text
▪ E.g. patient information, student information, product, real estate
▪ Data storage: relational database, excel, csv

Structured Data
• Rows are instances about which the entity being observed

• Columns are facts or measurements (attributes or features)
• Cells are the values (data)
Structured Data
• Not well-structured: table-structured data without headers or with

ambiguous headers
• Not as easy to analyze
• Data is encoded value, needs to decode using the documentation
Structured Data
▪ There are many tools for data processing e.g. SQL, Pandas, Tidyverse
▪ Easy to manipulate, search and analyze
▪ Queries are efficient e.g. allows complex joining

Numerical Data
▪ Quantitative data
▪ Any data where data points are exact numbers
▪ Has no spatial and temporal structure

Numerical Data
▪ Quantitative data
▪ Any data where data points are exact numbers
▪ Has no spatial and temporal structure
▪ Continuous: assume any value (real numbers) e.g. 35.6, 10.0, 89.26
▪ Discrete: distinct values e.g. 3, 55, 10

Numerical Data

Continuous or Discrete?
salary height # of cars sold
# of students body temperature

Categorical Data
▪ Data that represents categories or groups
▪ Can take numerical values, but the values have no meaning

Categorical Data
▪ Data that represents groups
▪ Nominal: categorical data without ordering e.g. gender, town, weather
▪ Ordinal: categorical data with ordering e.g. size, difficulty

Categorical Data
▪ Numerical data can be split into groups
▪ House price
▪ 0 – RM 200,000: cheap
▪ RM 200,001 – RM 500,000: affordable
▪ RM 500,001 – RM 1,000,000: expensive
▪ RM 1,000,000 – ∞: super expensive

Categorical Data

Nominal or Ordinal?
happiness
player’s position
body temperature types of vehicles

Semi-structured Data
▪ Does not conform to formal structure of data models e.g. relational
database
▪ Has some organizational properties that makes it easier to analyze
▪ It contains some tags or other markers to group data or enforce
hierarchies or records and fields within the data (describe how the data
is stored)
▪ Data storage: JSON, XML or advanced database management e.g.
Cassandra, MongoDB



▪ The data is not constrained by a fixed schema
▪ Data is portable
▪ It can deal easily with the heterogeneity of sources (numerical,
categorical, character, text etc.)
▪ Its supports users who can not express their need in SQL
▪ Interpreting the relationship between data is difficult compared to
structure data as there is no separation of the schema and the data
▪ Queries are less efficient as compared to structured data.

Unstructured Data
▪ Data which is not organized in a predefined manner or does not have a
predefined data model
▪ Difficult to manipulate and analyze
▪ Data storage: advanced database technology e.g. Cassandra, MongoDB

Unstructured Data
▪ Text, Time series, Image
▪ Data may come from social media, survey, image sharing platform

Text
▪ Words – needs to convert to a form that computers can process
▪ Tokenization – sentence to words
▪ Removing unnecessary punctuation, tags
▪ Removing stop words (most common words) – words that have not
much semantic meaning
▪ Stemming & Lemmatization– reduce words to root words e.g. 'studies'
to 'study'

Text
▪ He is playing at football.
▪ He, is, playing, at, football, . (tokenization)
▪ He, is, playing, at, football (remove punctuation)
▪ playing, football (remove stop words)
▪ play, football (stemming & lemmatization)
▪ Represent the words using numerical representation technique such as

Bag of Words (BOW), Word2Vec etc.

Text
▪ BOW turns each word into numbers by counting the occurrence of
words

Time Series Data
▪ A sequence of values ordered by time
▪ The values (data points) take place in a given period of time in regular
interval
▪ millisecond, sec, min, hour, day, week, … month, year, …
▪ Has temporal structure: trends, seasonal, cyclic variation, irregularities

Image Data
▪ A set of values (pixels) which describes the intensities or the color of
the pixels
▪ The values are arranged in an array of rows and columns that
correspond to the vertical and horizontal positions of the pixels
▪ Has spatial structure – visual information
▪ Each pixel has a value representing the intensity of the pixel (grayscale)
▪ Each pixel has three values representing Red, Green and Blue (color)

End


Chapter2 - Data Collection Methods and Data Types

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter2 - Data Collection Methods and Data Types

Uploaded by

Copyright:

Available Formats

CDS501:

PRINCIPLES & PRACTICES OF

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

• Rows are instances about which the entity being observed

• Not well-structured: table-structured data without headers or with

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

salary height # of cars sold

# of students body temperature

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

body temperature types of vehicles

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

▪ Represent the words using numerical representation technique such as

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

Copyright © 2022 Halim Noor, USM

You might also like