You are on page 1of 35

CDS501:

PRINCIPLES & PRACTICES OF


DATA SCIENCE & ANALYTICS
Data Collection and Data Types

Copyright © 2022 Halim Noor, USM


Outline
▪ Primary and Secondary
▪ Structured, Semi-structured and Unstructured Data
▪ Data Types

Copyright © 2022 Halim Noor, USM


Primary Data
▪ Primary
▪ Data that is collected directly from the data source without going
through any existing sources
▪ Expensive and time consuming
▪ Data is reliable, authentic, up to date and objective (collected with
purpose)
▪ Ownership – belong to the organization
▪ Interview, Focus Group, Survey and Questionnaire, Observation,
Experiment

Copyright © 2022 Halim Noor, USM


Interview
▪ Involve two groups of people: Interviewer and Interviewee
▪ Question and responses may be oral or verbal (spoken/written)
▪ In-person or telephonic interview
▪ Pros
▪ In-depth information can be collected
▪ Non-response and response bias can be detected
▪ The samples can be controlled
▪ Allows open ended question e.g. what did you like best about the product?

▪ Cons
▪ Time consuming and expensive
▪ Interviewer may be biased
▪ Data is complex to interpret

Copyright © 2022 Halim Noor, USM


Focus Group
▪ Involves 2 or more people with similar characteristics (target audience)
▪ Similar to interviews, but this involves discussions and interactions led
by facilitators rather than questions and answers
▪ Less formal – participants do most of the talking, facilitators oversees
the process
▪ Pros
▪ Low cost compared to interviews (the interviewer does not have to discuss
with each participant individually)
▪ Takes lesser time
▪ Data is relatively less complex than interview (light analysis)

▪ Cons
▪ Response bias due to facilitators or participants might be subjective to what
people will think about sharing a sincere opinion
▪ Speaking time of some attendees may be higher than that of others, making
their contribution disproportionate
▪ Group thinking does not clearly mirror individual opinions

Copyright © 2022 Halim Noor, USM


Survey and Questionnaire
▪ A group of questions typed or written down to obtain responses
▪ Can be carried out online or offline
▪ Pros
▪ Respondents have adequate time to give responses
▪ Responses can be in the form of categorical and numerical values
▪ It is free from the bias of the interviewer
▪ They are cheaper compared to interviews

▪ Cons
▪ A high rate of non-response bias
▪ It is inflexible and can't be changed once sent
▪ It is a slow process

Copyright © 2022 Halim Noor, USM


Observation
▪ Involves collecting information by observation or using scientific tools
▪ Systematically planned and subjected to checks and controls
▪ Approaches: structured or unstructured, controlled or uncontrolled
▪ E.g. watches a person teach a class to determine how well the delivery,
counts number of visitors, measure temperature using sensors
▪ Pros
▪ The data is objective
▪ The data is not affected by past or future events

▪ Cons
▪ The information is limited
▪ Expensive

Copyright © 2022 Halim Noor, USM


Experiment
▪ A structured study where the researchers attempt to understand the
causes, effects
▪ The approach is controlled e.g. which subject, how they are grouped
and the treatment they received
▪ E.g action or treatment is carried out on subject, then reactions are
recorded
▪ Pros
▪ It is usually objective since the data recorded are the results of a process
▪ Non-response bias is eliminated

▪ Cons
▪ Incorrect data may be recorded due to human error
▪ It is expensive

Copyright © 2022 Halim Noor, USM


Secondary Data
▪ Data that has been collected in the past by someone else but made
available for others to use
▪ Affordable and requires very little to no cost to acquire them
▪ Easily accessible (shared publicly)
▪ Data may not be suited to the project needs
▪ Data may not authentic – need further verification
▪ Data may be outdated
▪ Kaggle, UCI Repository

Copyright © 2022 Halim Noor, USM


Types of Data Structure
▪ Structured data
▪ Semi-structured data
▪ Unstructured data

Copyright © 2022 Halim Noor, USM


Structured Data
▪ Conform to formal data models such as database and excel
▪ Table-structured data with rows and columns (headers)
▪ Can store numeric (numerical and categorical) or text
▪ E.g. patient information, student information, product, real estate
▪ Data storage: relational database, excel, csv

Copyright © 2022 Halim Noor, USM


Structured Data

• Rows are instances about which the entity being observed


• Columns are facts or measurements (attributes or features)
• Cells are the values (data)
Copyright © 2022 Halim Noor, USM
Structured Data

• Not well-structured: table-structured data without headers or with


ambiguous headers
• Not as easy to analyze
• Data is encoded value, needs to decode using the documentation
Copyright © 2022 Halim Noor, USM
Structured Data
▪ There are many tools for data processing e.g. SQL, Pandas, Tidyverse
▪ Easy to manipulate, search and analyze
▪ Queries are efficient e.g. allows complex joining

Copyright © 2022 Halim Noor, USM


Numerical Data
▪ Quantitative data
▪ Any data where data points are exact numbers
▪ Has no spatial and temporal structure

Copyright © 2022 Halim Noor, USM


Numerical Data
▪ Quantitative data
▪ Any data where data points are exact numbers
▪ Has no spatial and temporal structure
▪ Continuous: assume any value (real numbers) e.g. 35.6, 10.0, 89.26
▪ Discrete: distinct values e.g. 3, 55, 10

Copyright © 2022 Halim Noor, USM


Numerical Data

Copyright © 2022 Halim Noor, USM


Continuous or Discrete?

salary height # of cars sold

# of students body temperature


Copyright © 2022 Halim Noor, USM
Categorical Data
▪ Data that represents categories or groups
▪ Can take numerical values, but the values have no meaning

Copyright © 2022 Halim Noor, USM


Categorical Data
▪ Data that represents groups
▪ Can take numerical values, but the values have no meaning
▪ Nominal: categorical data without ordering e.g. gender, town, weather
▪ Ordinal: categorical data with ordering e.g. size, difficulty

Copyright © 2022 Halim Noor, USM


Categorical Data
▪ Can take numerical values, but the values have no meaning
▪ Numerical data can be split into groups
▪ House price
▪ 0 – RM 200,000: cheap
▪ RM 200,001 – RM 500,000: affordable
▪ RM 500,001 – RM 1,000,000: expensive
▪ RM 1,000,000 – ∞: super expensive

Copyright © 2022 Halim Noor, USM


Categorical Data

Copyright © 2022 Halim Noor, USM


Nominal or Ordinal?

happiness
player’s position

body temperature types of vehicles


Copyright © 2022 Halim Noor, USM
Semi-structured Data
▪ Does not conform to formal structure of data models e.g. relational
database
▪ Has some organizational properties that makes it easier to analyze
▪ It contains some tags or other markers to group data or enforce
hierarchies or records and fields within the data (describe how the data
is stored)
▪ Data storage: JSON, XML or advanced database management e.g.
Cassandra, MongoDB

Copyright © 2022 Halim Noor, USM


Semi-structured Data

Copyright © 2022 Halim Noor, USM


Semi-structured Data

Copyright © 2022 Halim Noor, USM


Semi-structured Data
▪ The data is not constrained by a fixed schema
▪ Data is portable
▪ It can deal easily with the heterogeneity of sources (numerical,
categorical, character, text etc.)
▪ Its supports users who can not express their need in SQL
▪ Interpreting the relationship between data is difficult compared to
structure data as there is no separation of the schema and the data
▪ Queries are less efficient as compared to structured data.

Copyright © 2022 Halim Noor, USM


Unstructured Data
▪ Data which is not organized in a predefined manner or does not have a
predefined data model
▪ Difficult to manipulate and analyze
▪ Data storage: advanced database technology e.g. Cassandra, MongoDB

Copyright © 2022 Halim Noor, USM


Unstructured Data
▪ Text, Time series, Image
▪ Data may come from social media, survey, image sharing platform

Copyright © 2022 Halim Noor, USM


Text
▪ Words – needs to convert to a form that computers can process
▪ Tokenization – sentence to words
▪ Removing unnecessary punctuation, tags
▪ Removing stop words (most common words) – words that have not
much semantic meaning
▪ Stemming & Lemmatization– reduce words to root words e.g. 'studies'
to 'study'

Copyright © 2022 Halim Noor, USM


Text
▪ He is playing at football.
▪ He, is, playing, at, football, . (tokenization)
▪ He, is, playing, at, football (remove punctuation)
▪ playing, football (remove stop words)
▪ play, football (stemming & lemmatization)

▪ Represent the words using numerical representation technique such as


Bag of Words (BOW), Word2Vec etc.

Copyright © 2022 Halim Noor, USM


Text
▪ BOW turns each word into numbers by counting the occurrence of
words

Copyright © 2022 Halim Noor, USM


Time Series Data
▪ A sequence of values ordered by time
▪ The values (data points) take place in a given period of time in regular
interval
▪ millisecond, sec, min, hour, day, week, … month, year, …
▪ Has temporal structure: trends, seasonal, cyclic variation, irregularities

Copyright © 2022 Halim Noor, USM


Image Data
▪ A set of values (pixels) which describes the intensities or the color of
the pixels
▪ The values are arranged in an array of rows and columns that
correspond to the vertical and horizontal positions of the pixels
▪ Has spatial structure – visual information
▪ Each pixel has a value representing the intensity of the pixel (grayscale)
▪ Each pixel has three values representing Red, Green and Blue (color)

Copyright © 2022 Halim Noor, USM


End

Copyright © 2022 Halim Noor, USM


Copyright © 2022 Halim Noor, USM

You might also like