Collection of data
What is Statistics?
⁺ Statistics is the study of how
to collect, organise, analyse,
and interpret information.
⁺ Statistics is a tool for
converting data into
information.
[Figure: Data → Statistics → Information]
What is Data Analysis?
⁺ Data analysis is defined as a
process of collecting, modeling,
and analyzing data to extract
insights that support decision-
making.
❑Semi-structured
➢Asks the same general set of questions, but the answers to the
questions are predominantly open-ended.
Structured vs. Semi-structured Surveys
Structured:
❑ Harder to develop.
❑ Easier to complete.
❑ Easier to analyze.
❑ More efficient when working with large numbers.
Semi-structured:
❑ Easier to develop: open-ended questions.
❑ More difficult to complete: burdensome for people
to complete as a self-administered questionnaire.
❑ Harder to analyze but provide a richer source of
data; interpretation of open-ended responses is
subject to bias.
Modes of Survey Administration
❑Telephone surveys.
❑Self-administered questionnaires distributed by
mail, e-mail, or websites.
❑Administered questionnaires, common in the
development context.
❑In development context, often issues of language
and translation.
Mail / Phone / Internet Surveys
❑Literacy issues.
❑Consider accessibility
➢Reliability of postal service
➢Turn-around time
❑Consider bias
➢What population segment has
telephone access? Internet
access?
Advantage/Challenge: Survey
Advantages: Best when you want to know what people
think, believe, or perceive; only they can tell
you that.
Challenges: People may not accurately recall their
behavior or may be reluctant to reveal their
behavior if it is illegal or stigmatized. What
people think they do or say they do is not
always the same as what they actually do.
Interviews
❑Often semi-structured.
❑Used to explore complex
issues in depth.
❑Forgiving of mistakes:
unclear questions can be
clarified during the interview
and changed for subsequent
interviews.
❑Can provide evaluators with
an intuitive sense of the
situation.
Challenges of Interviews
❑Can be expensive, labor
intensive, and time
consuming.
❑Selective hearing on the
part of the interviewer may
miss information that does
not conform to pre-
existing beliefs.
❑Cultural sensitivity: e.g.,
gender issues.
Tool 5: Focus Groups
❑Type of qualitative research
where small homogeneous
groups of people are brought
together to informally discuss
specific topics under the
guidance of a moderator.
❑Purpose: To identify issues
and themes, not just
interesting information, and
not “counts”.
Focus Groups Are Inappropriate when:
❑Language barriers are insurmountable.
❑Evaluator has little control over the situation.
❑Trust cannot be established.
❑Free expression cannot be ensured.
❑Confidentiality cannot be assured.
Focus Group Process
Phase: Action
1. Opening: Ice-breaker; explain purpose; ground rules;
introductions.
2. Warm-up: Relate experience; stimulate group interaction; start
with least threatening and simplest questions.
3. Main body: Move to more threatening or sensitive and complex
questions; elicit deep responses; connect emergent
data to complex, broad participation.
4. Closure: End with closure-type questions; summarize and
refine; present theories, etc.; invite final comments
or insights; thank participants.
Advantage/Challenge: Focus Groups
Advantages:
• Can be conducted relatively quickly and easily.
• May take less staff time than in-depth interviews.
• Allow flexibility to make changes in process and
questions.
• Can explore different perspectives.
Challenges:
• Analysis is time consuming.
• Participants may not be representative of the population,
possibly biasing the data.
• Group may be influenced by moderator or
dominant group members.
Tool 6: Diaries and Self-Reported Checklists
❑Use when you want to capture
information about events in
people’s daily lives.
❑Participants capture experiences in
real time, not later in a
questionnaire.
❑Used to supplement other data
collection.
Self-reported Checklists
❑Cross between a questionnaire and a diary.
❑The evaluator specifies a list of behaviors
or events and asks the respondents to
complete the checklist.
❑Done over a period of time to capture the
event or behavior.
❑More quantitative approach than diary.
Advantage/Challenge: Diaries and Self-
reported Checklists
Advantages:
• Can capture in-depth, detailed data that might
otherwise be forgotten.
• Can collect data on how people use their time.
• Can collect sensitive information.
• Supplement interviews to provide richer data.
Challenges:
• Require commitment and self-discipline.
• Data may be incomplete or inaccurate.
• Poor handwriting and difficult-to-understand
phrases.
Tool 7: Expert Judgment
❑Use of experts, one-on-one or as a panel.
❑Can be structured or unstructured.
❑Issues in selecting experts.
In-Person Interviews
• In-depth (+)
• Time consuming (-)
Online Surveys
• Self-manage (+)
• Data accuracy (+)
• Need internet (-)
Third method (label missing in the source)
• High confidence (+)
• Need to hire an agency (-)
Data Collection
Example 2: A data scientist working in an e-commerce company
wants to collect data about customers. The data is already stored
by the company.
❑Data Cleaning
❑Inferential statistics (study sample data).
❑Estimate uncertainty (using probability).
Data Describing
Example 1: Visualizing Data
Data Analysis
A subset of data
analytics that
includes specific
processes.
Data Analysis Types
❑Descriptive Analysis
❑Exploratory Analysis
❑Inferential Analysis
❑Predictive Analysis
❑Text Analysis
Descriptive Analysis
❑The very first analysis
performed.
❑Generates simple
summaries about samples
and measurements.
❑Common descriptive
statistics (measures of
central tendency, variability,
frequency, position, etc).
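As a minimal sketch of the descriptive statistics listed above, the snippet below summarises a small sample with Python's standard `statistics` module. The `orders` values are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import statistics
from collections import Counter

# Hypothetical sample: daily order counts (illustrative values only).
orders = [12, 15, 15, 18, 20, 22, 22, 22, 30, 41]

summary = {
    "mean": statistics.mean(orders),                 # central tendency
    "median": statistics.median(orders),             # central tendency
    "mode": statistics.mode(orders),                 # most frequent value
    "stdev": statistics.stdev(orders),               # variability (sample standard deviation)
    "range": max(orders) - min(orders),              # variability
    "quartiles": statistics.quantiles(orders, n=4),  # position (three cut points)
    "frequency": Counter(orders),                    # frequency of each value
}

for name, value in summary.items():
    print(name, value)
```

Each entry maps to one family of measures: central tendency, variability, position, and frequency.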
Exploratory Analysis
❑EDA helps in discovering
relationships between measures
in the data. These relationships
are not evidence of causation,
as denoted by the phrase
“Correlation doesn’t imply
causation”.
❑Useful for discovering new
connections, forming
hypotheses, and driving design
planning and data collection.
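One common way to look for a relationship between two measures during EDA is the Pearson correlation coefficient. The sketch below computes it by hand; the function name `pearson_r` and the `ad_spend` and `sales` values are assumptions made for illustration, not data from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements (illustrative values only).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 8.1, 9.8]

r = pearson_r(ad_spend, sales)
print(round(r, 3))
```

A value of `r` close to 1 or -1 indicates a strong linear association. It suggests a hypothesis to test; it does not show that one measure causes the other.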
Inferential Analysis
❑Using a sample to estimate a
value in the population and give a
measure of uncertainty
(such as the standard deviation
of the estimate).
❑Accuracy of inference
depends heavily on sampling
scheme; if the sample isn’t
representative of the
population, the generalization
will be inaccurate.
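A rough sketch of estimating a population mean from a sample together with a measure of uncertainty follows. It uses a normal approximation for the confidence interval, which is an assumption; for small samples a t-distribution would give a somewhat wider interval. The function name `mean_confidence_interval` and the sample values are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, low, high) using a normal approximation (an assumption)."""
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err

# Hypothetical sample of order values drawn from a larger customer population.
sample = [52.0, 47.5, 61.2, 49.9, 55.3, 58.1, 50.4, 53.7, 48.8, 56.6]

mean, low, high = mean_confidence_interval(sample)
print(f"estimate {mean:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

The interval only describes sampling uncertainty. If the sample is not representative of the population, the estimate can be wrong no matter how narrow the interval is.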
Predictive Analysis
❑Using historical or current data to find
patterns to make predictions about the
future.
❑Accuracy of the predictions depends on the
input variables.
❑Accuracy also depends on the type of
model; a linear model might work well in
some cases and poorly in others.
❑Using one variable to predict another doesn’t
denote a causal relationship.
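To illustrate predicting from historical data, here is a minimal least-squares line fit. A linear model is only one possible choice and is assumed here for simplicity; the names `fit_line`, `months`, and `sales` and all values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: month index and sales (illustrative values only).
months = [1, 2, 3, 4, 5]
sales = [10.2, 11.9, 14.1, 15.8, 18.0]

slope, intercept = fit_line(months, sales)
next_month = slope * 6 + intercept  # prediction for a future month
print(round(next_month, 2))
```

The prediction is only as good as the input variable and the assumption of linearity. A good fit here would not show that the passage of time causes sales to rise.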
What is Predictive Modeling?
❑Predictive modeling is a process used in predictive
analytics to create a model of future behavior.
❑A predictive model is made up of a number
of predictors, which are variable factors that are
likely to influence future behavior or results.
❑Historical data is considered for extracting
predictive rules, which are then applied on future
data.
[Figure: Historical data for training → Model development using machine learning technique → Prediction]
Steps in Predictive Modeling
❑Historical data collection.
❑Selection of appropriate machine learning technique for
developing model.
❑Data cleaning including outlier analysis and missing data
handling.
❑Attribute subset selection for use in the prediction model.
❑Model development using the training dataset.
❑Model validation for determination of its accuracy.
❑Using the model for future predictions.
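The steps above can be sketched end to end in a few lines. This is an illustrative outline under stated assumptions, not a definitive workflow: the records in `history`, the outlier threshold, the 75/25 split, and the choice of a least-squares line as the learning technique are all hypothetical.

```python
import random

# Step 1: historical data, as (ad_spend, visits, sales) records (hypothetical values).
history = [
    (1.0, 120, 12.1), (2.0, 95, 13.9), (3.0, 130, 16.2), (4.0, 110, 18.0),
    (5.0, 105, 20.1), (6.0, 125, 21.8), (7.0, 99, 24.2), (8.0, 140, 25.9),
    (9.0, 101, None),    # missing target value
    (10.0, 115, 300.0),  # obvious outlier
]

# Step 2: technique selected for this sketch is a least-squares line (an assumption).
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in points)
        / sum((x - mean_x) ** 2 for x, _ in points)
    )
    return slope, mean_y - slope * mean_x

# Step 3: data cleaning, dropping missing values and crude outliers.
clean = [row for row in history if row[2] is not None and row[2] < 100]

# Step 4: attribute subset selection, keeping only ad_spend as the predictor.
pairs = [(spend, sales) for spend, _visits, sales in clean]

# Step 5: partition the data and develop the model on the training part.
rng = random.Random(0)  # fixed seed so the split is repeatable
rng.shuffle(pairs)
split = int(len(pairs) * 0.75)
train, test = pairs[:split], pairs[split:]
slope, intercept = fit_line(train)

# Step 6: validation, using mean absolute error on the held-out part.
mae = sum(abs(y - (slope * x + intercept)) for x, y in test) / len(test)

# Step 7: use the model for a future prediction.
forecast = slope * 11.0 + intercept
print(f"MAE={mae:.2f}, forecast={forecast:.2f}")
```

Validating on held-out data rather than on the training data gives a more honest estimate of how the model will behave on future data.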
Key Elements in Predictive Modeling
❑Historical Data
❑Learning Technique
➢Statistical Techniques
➢Machine Learning Techniques
Historical Data
❑Historical Data is collected data about past events and
circumstances pertaining to a particular subject.
❑Historical data includes most data generated either manually
or automatically within an enterprise.
❑Sources of data collection include open source software,
repositories, press releases, log files, financial reports,
software source code, project and product documentation and
email and other communications.
Learning Technique
❑There are two types of learning algorithms/techniques:
statistical and machine learning techniques.
❑The learning techniques are used for model construction after
data partitioning.
❑The independent variables, such as software metrics, are used to
predict the dependent variable, such as software quality
attributes.
❑The algorithm may be statistical or machine learning, depending
on the type of dataset.
Text Analysis
❑Text Analysis is also referred to
as text mining, a form of Data Mining.
❑It is one of the methods of data
analysis used to discover patterns in
large data sets using databases or
data mining tools.
❑It is used to transform raw data
into business information.
❑Overall, it offers a way to extract
and examine data, derive
patterns, and finally interpret
the data.
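A very small example of deriving a pattern from raw text is counting term frequencies. The sketch below uses only the standard library; the `reviews` strings, the `STOP_WORDS` set, and the function name `term_frequencies` are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical raw text: customer reviews (illustrative only).
reviews = [
    "Fast delivery and great price",
    "Great quality, fast delivery",
    "Price was high but quality is great",
]

# Small illustrative stop-word list; real projects use larger ones.
STOP_WORDS = {"and", "was", "but", "is", "the", "a"}

def term_frequencies(documents):
    """Count how often each non-stop-word term appears across documents."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

freq = term_frequencies(reviews)
print(freq.most_common(3))
```

The most frequent terms give a first indication of recurring themes, which can then be interpreted in the business context.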
Data Analysis Methods
❑Statistical
➢Descriptive Statistics
➢Inferential Statistics
❑Machine Learning
➢Supervised Learning
➢Unsupervised Learning
➢Reinforcement Learning
❑Research Methods
➢Quantitative Data Analysis
➢Qualitative Data Analysis
Criteria for Selection of Data Analysis Technique
❑Type of Dependent Variable
➢Binary: Statistical (Logistic Regression, Discriminant Analysis); Decision Tree
❑Nature of Data Set
➢Redundancy in data
❑Important Aspects of Different Algorithms
Other considerations while selecting data analysis methods
❑Accuracy of class probability estimates
❑Efficacy in high dimensions
❑Ability to handle class imbalance
Why study Statistics?
❑Numerical information is everywhere.
❑Statistical techniques are used to inform
decisions that affect our everyday lives.
❑A knowledge of statistical methods will
help in understanding how decisions are
made and how they might affect us.
❑An understanding of data analysis is
helpful in most occupations.
Myths about Data Science
1. Machine does everything
❑Data Collection: human decides
➢What to collect?
➢Where to collect?
➢How to collect?
➢What type of data should be collected? (open
source/proprietary)
❑How to clean? Human must have domain knowledge
➢Study and integrate multiple formats
❑Hypothesis: human decides
➢Propose models
➢Oversee
❑Noisy data
❑Not enough data
Data Analysis Uses
❑Real-time Analytics
❑Changing Hiring Patterns
❑Improved Marketing Efficiency
❑Healthcare Ingression
❑Cyber Security
❑Science
Fields in Data Science
❑Data Analyst
❑Data Engineer
❑Machine Learning Engineer
❑Data Science Generalist