
Introduction To Data Analysis

Ms. Shweta Meena


Department of Software Engineering
Delhi Technological University
What is Data?
⁺ Data is raw information.
⁺ The meaningful content extracted from raw data is termed information.
⁺ Raw data is collected as part of research, observations, and surveys.
Why is data required?
What we want to know → drives → the collection of data.
What is Statistics?
⁺ Statistics is the study of how to collect, organise, analyse, and interpret information.
⁺ Statistics is a tool for converting data into information:
Data → Statistics → Information
What is Data Analysis?
⁺ Data analysis is defined as a
process of collecting, modeling,
and analyzing data to extract
insights that support decision-
making.

⁺ The purpose of data analysis is to extract useful information from data and to take decisions based upon that analysis.
Data Analysis Process
Data Collection → Data Storage → Data Processing → Data Describing → Data Modeling
Data Collection
⁺ Data Collection is the process of gathering information on
targeted variables identified as data requirements.
⁺ Data Collection ensures that data gathered is accurate such
that the related decisions are valid.
⁺ Data is collected from various sources ranging from
organizational databases to the information in web pages.
⁺ The data thus obtained may not be structured and may contain irrelevant information.
Data Collection Strategies
❑No one best way: decision depends on:
➢What you need to know: numbers or stories.
➢Where the data reside: environment, files, people.
➢Resources and time available.
➢Complexity of the data to be collected.
➢Frequency of data collection.
➢Intended forms of data analysis.
Rules for Collecting Data
❑Use multiple data collection methods.
❑Use available data, but you need to know:
➢How were the measures defined?
➢How were the data collected and cleaned?
➢The extent of missing data.
➢How was the accuracy of the data ensured?
Rules for Collecting Data
❑If you must collect original data:
➢Establish procedures and follow them (protocol).
➢Maintain accurate records of definitions and coding.
➢Verify accuracy of coding, data input.
Structured Approach
❑All data collected in the same way.
❑Especially important for multi-site and
cluster evaluations so you can compare.
❑Important when you need to make
comparisons with alternate interventions.
When to use Structured Approach ?
❑Need to address extent questions.
❑Have a large sample or population.
❑Know what needs to be measured.
❑Need to show results numerically.
❑Need to make comparisons across different sites or
interventions.
Semi-structured Approach
❑Systematic and follow general procedures but data are
not collected in exactly the same way every time.
❑More open and fluid.
❑Does not follow a rigid script.
➢May ask for more detail.
➢People can tell what they want in their own way.
When to use Semi-structured Approach ?
❑Conducting exploratory work.
❑Seeking understanding, themes, and/or issues.
❑Need narratives or stories.
❑Want in-depth, rich, “backstage” information.
❑Seek to understand results of data that are
unexpected.
Characteristics of Good Measures
❑Is the measure relevant?
❑Is the measure credible?
❑Is the measure valid?
❑Is the measure reliable?
Relevance
✓ Does the measure capture what matters?
✓ Do not measure what is easy instead of what is needed.
Credibility
✓ Is the measure believable? Will it
be viewed as a reasonable and
appropriate way to capture the
information sought?
Internal Validity
✓ How well does the measure capture what it is supposed to?
✓ Are waiting lists a valid measure of demand?
Reliability
✓ A measure’s precision and stability: the extent to which the same result would be obtained with repeated trials.
✓ How reliable are:
• Birth weights of newborn infants?
• Speeds measured by a stopwatch?
Quantitative Approach
❑Data in numerical form.
❑Data that can be precisely measured
✓age, cost, length, height, area, volume, weight,
speed, time, and temperature
❑Harder to develop.
❑Easier to analyze.
Qualitative Approach
❑Data that deal with description.
❑Data that can be observed or self-reported, but not always
precisely measured.
❑Less structured, easier to develop.
❑Can provide “rich data” — detailed and widely applicable.
❑Is challenging to analyze.
❑Is labor intensive to collect.
❑Usually generates longer reports.
Obtrusive vs. Unobtrusive Methods
Obtrusive: Data collection methods that directly obtain information from those being evaluated. Examples: interviews, surveys, focus groups.
Unobtrusive: Data collection methods that do not collect information directly from evaluees. Examples: document analysis, Google Earth, observation at a distance, trash of the stars.
How to Decide on Data Collection Approach?
❑Choice depends on the
situation.
❑Each technique is more
appropriate in some situations
than others.
❑Caution: All techniques are
subject to bias.
Triangulation to Increase Accuracy of Data
❑Triangulation of methods
➢collection of same information
using different methods.
❑Triangulation of sources
➢collection of same information
from a variety of sources.
❑Triangulation of evaluators
➢collection of same information
from more than one evaluator.
Data Collection Tools
❑Participatory Methods
❑Records and Secondary Data
❑Observation
❑Surveys and Interviews
❑Focus Groups
❑Diaries, Journals, Self-reported Checklists
❑Expert Judgment
❑Delphi Technique
❑Other Tools
Tool 1: Participatory Methods
❑Involve groups or communities
heavily in data collection.
❑Examples:
➢Community meetings
➢Mapping
➢Transect walks
Community Meetings
❑One of the most common
participatory methods.
❑Must be well organized.
❑Agree on purpose.
❑Establish ground rules.
➢Who will speak?
➢Time allotted for speakers.
➢Format for questions and
answers.
Mapping
❑Drawing or using existing maps.
❑Useful tool to involve stakeholders.
➢Increases understanding of the community.
➢Generates discussions, verifies secondary sources of
information, perceived changes.
❑Types of mapping:
➢Natural resources, social, health, individual or civic assets,
wealth, land use, demographics.
Transect Walks
❑Evaluator walks around community observing
people, surroundings, and resources.
❑Need good observation skills.
❑Walk a transect line through a map of a
community — line should go through all zones of
the community.
Tool 2: Records and Secondary Data
❑Examples of sources:
➢Files/records
➢Computer databases
➢Industry or government reports
➢Other reports or prior evaluations
➢Census data and household survey data
➢Electronic mailing lists and discussion groups
➢Documents (budgets, organizational charts, policies and
procedures, maps, monitoring reports)
➢Newspapers and television reports
Using Existing Data Sets
❑ Key issues: Validity,
reliability, accuracy,
response rates, data
dictionaries, and
missing data rates.
Advantage/Challenge: Available Data
Advantages: Often less expensive and faster than collecting the original data again.
Challenges: There may be coding errors or other problems. Data may not be exactly what is needed. You may have difficulty getting access. You have to verify the validity and reliability of the data.
Tool 3: Observation
❑See what is happening
➢Traffic patterns
➢Land use patterns
➢Layout of city and rural areas
➢Quality of housing
➢Condition of roads
➢Conditions of buildings
➢Who goes to a health clinic
Observation is Helpful when:
❑There is a need for direct information.
❑Trying to understand ongoing behavior.
❑There is physical evidence, products, or outputs that can be observed.
❑An alternative is needed when other data collection is infeasible or inappropriate.
Degree of Structure of Observations
❑Structured: Determine precisely what will be observed before the observation begins.
❑Unstructured: Select the method depending upon the
situation with no pre-conceived ideas or a plan on what to
observe.
❑Semi-structured: A general idea of what to observe but no
specific plan.
Ways to Record Information from Observations
❑Observation guide
➢Printed form with space to record
❑Recording sheet or checklist
➢Yes/no options; tallies, rating scales
❑Field notes
➢Least structured, recorded in narrative, descriptive style
Guidelines for Planning Observations
❑Have more than one observer, if feasible.
❑Train observers so they observe the same
things.
❑Pilot test the observation data collection
instrument.
❑For less structured approach, have a few key
questions in mind.
Advantage/Challenge: Observation
Advantages: Collects data on actual vs. self-reported behavior or perceptions. It is real-time vs. retrospective.
Challenges: Observer bias; potentially unreliable; interpretation and coding challenges; sampling can be a problem; can be labor intensive; low response rates.
Tool 4: Surveys and Interviews
❑Excellent for asking people about:
➢Perceptions, opinions, ideas
❑Less accurate for measuring behavior
❑Sample should be representative of the whole
❑Big problem with response rates
Structures for Surveys
❑Structured:
➢Precisely worded with a range of pre-determined responses
that the respondent can select.
➢Everyone asked exactly the same questions in exactly the
same way, given exactly the same choices.

❑Semi-structured
➢Asks same general set of questions but answers to the
questions are predominantly open-ended.
Structured vs. Semi-structured Surveys
Structured:
❑ Harder to develop.
❑ Easier to complete.
❑ Easier to analyze.
❑ More efficient when working with large numbers.
Semi-structured:
❑ Easier to develop: open-ended questions.
❑ More difficult to complete: burdensome for people to complete as a self-administered questionnaire.
❑ Harder to analyze but provides a richer source of data; interpretation of open-ended responses is subject to bias.
Modes of Survey Administration
❑Telephone surveys.
❑Self-administered questionnaires distributed by
mail, e-mail, or websites.
❑Administered questionnaires, common in the
development context.
❑In development context, often issues of language
and translation.
Mail / Phone / Internet Surveys
❑Literacy issues.
❑Consider accessibility
➢Reliability of postal service
➢Turn-around time
❑Consider bias
➢What population segment has
telephone access? Internet
access?
Advantage/Challenge: Survey
Advantages: Best when you want to know what people think, believe, or perceive; only they can tell you that.
Challenges: People may not accurately recall their behavior or may be reluctant to reveal it if it is illegal or stigmatized. What people think they do or say they do is not always the same as what they actually do.
Interviews
❑Often semi-structured.
❑Used to explore complex
issues in depth.
❑Forgiving of mistakes:
unclear questions can be
clarified during the interview
and changed for subsequent
interviews.
❑Can provide evaluators with
an intuitive sense of the
situation.
Challenges of Interviews
❑Can be expensive, labor
intensive, and time
consuming.
❑Selective hearing on the
part of the interviewer may
miss information that does
not conform to pre-
existing beliefs.
❑Cultural sensitivity: e.g.,
gender issues.
Tool 5: Focus Groups
❑Type of qualitative research
where small homogenous
groups of people are brought
together to informally discuss
specific topics under the
guidance of a moderator.
❑Purpose: To identify issues
and themes, not just
interesting information, and
not “counts”.
Focus Groups Are Inappropriate when:
❑Language barriers are insurmountable.
❑Evaluator has little control over the situation.
❑Trust cannot be established.
❑Free expression cannot be ensured.
❑Confidentiality cannot be assured.
Focus Group Process
Phase Action
1 Opening Ice-breaker; explain purpose; ground rules;
introductions.
2 Warm-up Relate experience; stimulate group interaction; start
with least threatening and simplest questions.
3 Main body Move to more threatening or sensitive and complex
questions; elicit deep responses; connect emergent
data to complex, broad participation.
4 Closure End with closure-type questions; summarize and
refine; present theories, etc; invite final comments
or insights; thank participants.
Advantage/Challenge: Focus Groups
Advantages:
• Can be conducted relatively quickly and easily.
• May take less staff time than in-depth interviews.
• Allow flexibility to make changes in process and questions.
• Can explore different perspectives.
Challenges:
• Analysis is time consuming.
• Participants may not be representative of the population.
• Possibly biasing the data.
• Group may be influenced by moderator or dominant group members.
Tool 6: Diaries and Self-Reported Checklists
❑Use when you want to capture
information about events in
people’s daily lives.
❑Participants capture experiences in real time, not later in a questionnaire.
❑Used to supplement other data
collection.
Self-reported Checklists
❑Cross between a questionnaire and a diary.
❑The evaluator specifies a list of behaviors
or events and asks the respondents to
complete the checklist.
❑Done over a period of time to capture the
event or behavior.
❑More quantitative approach than diary.
Advantage/Challenge: Diaries and Self-reported Checklists
Advantages:
• Can capture in-depth, detailed data that might otherwise be forgotten.
• Can collect data on how people use their time.
• Can collect sensitive information.
• Supplementing interviews provides richer data.
Challenges:
• Require commitment and self-discipline.
• Data may be incomplete or inaccurate.
• Poor handwriting and hard-to-understand phrases.
Tool 7: Expert Judgment
❑Use of experts, one-on-one or as a panel.
❑Can be structured or unstructured.
❑Issues in selecting experts.
Example: Government task forces, advisory groups.
Selecting Experts
❑Establish criteria for selecting experts not only
on recognition as expert but also based on:
➢Areas of expertise
➢Diverse perspectives
➢Diverse political views
➢Diverse technical expertise
Advantage/Challenge: Expert Judgment
Advantages:
• Fast
• Relatively inexpensive
Challenges:
• Weak for impact evaluation.
• May be based mostly on perceptions.
• Value of data depends on how credible the experts are perceived to be.
Tool 8: Delphi Technique
❑Enables experts to engage remotely in a
dialogue and reach consensus, often about
priorities.
❑Experts asked specific questions; often
rank choices.
❑Responses go to a central source, are summarized, and are fed back to the experts without attribution.
❑Experts can agree or argue with others’
comments.
❑Process may be iterative.
Advantage/Challenge: Delphi Technique
Advantages • Allows participants to remain anonymous.
• Is inexpensive.
• Is free of social pressure, personality influence, and
individual dominance.
• Is conducive to independent thinking.
• Allows sharing of information.
Challenges • May not be representative.
• Has tendency to eliminate extreme positions.
• Requires skill in written communication.
• Requires time and participant commitment.
Data Collection Summary
❑Choose more than one
data collection technique.
❑No “best” tool.
❑Do not let the tool drive
your work but rather
choose the right tool to
address the evaluation
question.
Data Collection
Example 1: A data scientist wants to collect feedback through a survey. There are three ways of conducting the survey:
• Phone Survey: high confidence (+); need to hire an agency (−).
• Online Surveys: self-managed (+); data accuracy (+); need internet access (−).
• In-Person Interviews: in-depth (+); time consuming (−).
Data Collection
Example 2: A data scientist working in an e-commerce company wants to collect data about customers. The data is already stored by the company.
❑How can the data scientist access this data?
✓The data scientist requires SQL queries, Java/Python code, or JSON parsing, depending on the structure of the data.
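A minimal sketch of such access with Python's built-in sqlite3 module; the database file name, table name, and columns are hypothetical placeholders.

```python
# Sketch: querying company-stored customer data with sqlite3.
# "company.db" and the customers table schema are hypothetical.
import sqlite3

conn = sqlite3.connect("company.db")
cursor = conn.cursor()

# Fetch customers who signed up after a given date (assumed columns).
cursor.execute(
    "SELECT id, name, email FROM customers WHERE signup_date >= ?",
    ("2024-01-01",),
)
for row in cursor.fetchall():
    print(row)

conn.close()
```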
Data Collection
Example 3: A data scientist is working for a political party.
❑What are people saying about the new policy?
✓The data scientist requires a crawler, an API, or Java/Python scripts in order to crawl and scrape the already available data.
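A minimal sketch of collecting such posts from a web API with the requests library; the endpoint URL and response fields are hypothetical, and a real crawler would also respect rate limits and robots.txt.

```python
# Sketch: pulling public posts mentioning a policy from a web API.
# The endpoint and JSON fields below are made-up placeholders.
import requests

response = requests.get(
    "https://api.example.com/posts",
    params={"query": "new policy", "limit": 100},
    timeout=10,
)
response.raise_for_status()

for post in response.json():          # assumed: a JSON list of post objects
    print(post.get("author"), ":", post.get("text"))
```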
Data Collection
Example 4: A data scientist is working with farmers.
❑How to analyze the effect of seed, fertilizer, and irrigation on yield?
✓Since the data is not available, the data scientist is required to perform experiments to collect data in this scenario.
Skills required for Data Collection
❑Intermediate level of programming
(Python, JavaScript, Scala, R, SQL, and
Julia).
❑Knowledge of Databases (MySQL,
Microsoft Access, Microsoft SQL Server,
FileMaker Pro, Oracle Database, and
dBASE).
❑Knowledge of Statistics (Statistical
techniques and methods).
Data Storage
⁺ Data storage describes what type of hardware or software holds, deletes, backs up, organizes, and secures information, and where and how it does so.
⁺ Data can be stored either temporarily or permanently.
⁺ Data storage can be done in three ways: relational databases, data warehouses, and data lakes.
Data Storage
Example 1: Transactional and Operational Data
❑Data such as patient records, insurance claims, inventory, customer records, telephone bills, invoices, employee records, reimbursements, and purchase orders is stored in a structured form.
❑Big organizations generate lots of structured data, which is managed in relational databases (select, insert, delete, and update operations), as sketched below.
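A minimal sketch of those four operations with Python's sqlite3 module; the employee_records table is a hypothetical example.

```python
# Sketch: the four basic relational operations on a toy table.
import sqlite3

conn = sqlite3.connect(":memory:")    # throwaway in-memory database
cur = conn.cursor()
cur.execute("CREATE TABLE employee_records (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

cur.execute("INSERT INTO employee_records (name, salary) VALUES (?, ?)", ("Asha", 50000.0))  # insert
cur.execute("UPDATE employee_records SET salary = ? WHERE name = ?", (55000.0, "Asha"))      # update
rows = cur.execute("SELECT * FROM employee_records").fetchall()                              # select
cur.execute("DELETE FROM employee_records WHERE id = ?", (rows[0][0],))                      # delete

conn.close()
```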
Data Storage
Example 2: Data Warehouse
❑The data from multiple databases is stored in a common repository.
(Diagram: bank accounts, credit cards, and investments data feed the warehouse.)
❑A data warehouse is a repository which stores structured data.
❑A data warehouse supports analytics, optimization, and analytical operations.
Data Storage
Example 3: Data Lake
❑A large amount of unstructured data is generated.
❑Unstructured data has:
➢High volume
➢High variety
➢High velocity
Data Storage (Summary)
Relational Databases: structured; optimized for SQL queries.
Data Warehouse: structured; curated; optimized for analytics.
Data Lake: unstructured; uncurated; Big Data.
Skills required for Data Storage
❑Programming & Engineering.
❑Knowledge of Relational Database.
❑Knowledge of NoSQL (semi-structured
data) Database.
❑Knowledge of Data Warehouse.
❑Knowledge of Data Lake (Hadoop).
Data Processing
⁺ The data that is collected must be
processed or organized for analysis.
⁺ This includes structuring the data as
required for the relevant analysis tools.

⁺ Methods of data processing:


❑Data Wrangling & Data Munging

❑Data Cleaning

❑Data Scaling, Normalization, and Standardization


Data Processing
Example 1: Data Wrangling & Data Munging
❑A large amount of unstructured data is generated.
❑Data wrangling is the process of cleaning and unifying messy and complex data sets for easy access and analysis.
❑Manual conversion and mapping of data from one raw form into another format, as in the sketch below.
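A minimal wrangling sketch with pandas, converting a messy raw format into a unified one; column names and values are made up for illustration.

```python
# Sketch: mapping raw, inconsistent columns into a tidy format.
import pandas as pd

raw = pd.DataFrame({
    "Customer Name": ["  Asha ", "ravi", "Meena "],
    "amount(USD)": ["1,200", "850", "2,400"],
})

tidy = raw.rename(columns={"Customer Name": "name", "amount(USD)": "amount_usd"})
tidy["name"] = tidy["name"].str.strip().str.title()                 # unify name format
tidy["amount_usd"] = tidy["amount_usd"].str.replace(",", "").astype(float)
print(tidy)
```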
Data Processing
Example 2: Data Cleaning
❑Data cleaning means the process of identifying the incorrect, incomplete, inaccurate, irrelevant, or missing parts of the data and then modifying, replacing, or deleting them as necessary:
⁺ Fill missing values
⁺ Standardize keyword tags
⁺ Correct spelling errors
⁺ Identify and remove outliers
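A cleaning sketch with pandas covering some of the steps above on a toy table: one missing age, inconsistent city labels, and an extreme value removed with the common 1.5 × IQR rule (one of several possible outlier criteria).

```python
# Sketch: fill missing values, standardize labels, drop outliers.
import pandas as pd

df = pd.DataFrame({"age": [25, 31, None, 29, 27, 200],   # None = missing, 200 = outlier
                   "city": ["delhi", "Delhi ", "DELHI", "delhi", "delhi", "delhi"]})

df["age"] = df["age"].fillna(df["age"].median())          # fill missing value
df["city"] = df["city"].str.strip().str.title()           # standardize spelling
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]  # remove outliers
print(df)
```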
Data Processing
Example 3: Data Scaling
❑Scaling shrinks or stretches
the data to fit within a specific
range.
❑Normalization transforms the
data on a similar scale.
❑ Standardisation involves
centering the variable at zero,
and standardising the variance
to 1.
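A sketch of these transformations on a toy array, using plain numpy so the formulas stay visible.

```python
import numpy as np

x = np.array([105.0, 107.0, 115.0, 129.0, 135.0])

# Normalization (min-max scaling): shrink/stretch the data into [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: center at zero, variance 1.
standardized = (x - x.mean()) / x.std()

print(normalized)
print(standardized)
```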
Skills required for Data Processing
❑Programming Skills
❑Basic Statistics
❑Map Reduce (Hadoop)
❑SQL and NoSQL Databases
Data Describing
⁺ Descriptive statistics is essentially
describing the data through methods
such as graphical representations,
measures of central tendency and
measures of variability.
⁺ Three principles of describing data: Center, Spread, and Shape.
❑Inferential statistics (study sample data).
❑Estimate uncertainty (using probability).
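A sketch computing the three principles on a toy sample: center (mean, median), spread (standard deviation, range), and shape (skewness via scipy); the numbers are illustrative.

```python
import numpy as np
from scipy.stats import skew

sample = np.array([12, 15, 15, 16, 18, 19, 22, 40])  # made-up values

print("center:", sample.mean(), np.median(sample))
print("spread:", sample.std(ddof=1), sample.max() - sample.min())
print("shape (skewness):", skew(sample))
```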
Data Describing
Example 1: Visualizing Data
❑Data visualization is the graphical representation of information and data.
❑Using visual elements such as charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
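A minimal matplotlib sketch on made-up values: a line chart for a trend and a histogram for a distribution.

```python
import matplotlib.pyplot as plt

values = [105, 107, 108, 115, 117, 129, 133, 135]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(values, marker="o")     # trend over observation order
ax1.set_title("Trend")
ax2.hist(values, bins=5)         # distribution shape
ax2.set_title("Distribution")
plt.tight_layout()
plt.show()
```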
Data Describing
Example 2: Summarizing Data
❑Data Summarization refers to presenting the
summary of generated data in an easily
comprehensible and informative manner.
❑Data can be summarized numerically as a table
(tabular summarization), or visually as a graph (data
visualization).
❑Selection of appropriate statistical test depends on
the general trends of the data revealed in the
summarization step.
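A tabular-summarization sketch with pandas on a made-up table: describe() gives a numeric summary, value_counts() a frequency table.

```python
import pandas as pd

df = pd.DataFrame({"yield_kg": [410, 455, 430, 470, 520, 390],
                   "seed": ["A", "A", "B", "B", "B", "A"]})

print(df["yield_kg"].describe())   # count, mean, std, min, quartiles, max
print(df["seed"].value_counts())   # frequency of each seed type
```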
Skills required for Describing Data
❑Statistics
❑Programming Skills (Python, R)
❑Tableau
❑Excel
❑Database management
❑Selection of right data
summarising measure
Data Modeling
⁺ Data modeling is the process of
producing a descriptive diagram of
relationships between various types of
information that are to be stored in a
database.
⁺ One of the goals of data modeling is to
create the most efficient method of
storing information while still
providing for complete access and
reporting.
Data Modeling Types
⁺ Statistical Modeling
❑Modeling the underlying data distribution
❑Modeling underlying relations in data
❑Formulate and test hypotheses
❑Give statistical guarantees (p-value, goodness-of-fit test)
Example: X1 = {105, 105, 107, 107, 108, 115, 116, 117, 117, 118, 129, 133, 135}; these values are close to the mean.
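A statistical-modeling sketch on the X1 values above: estimate a normal distribution's parameters and run a goodness-of-fit style check (Shapiro-Wilk here, as one example) whose p-value provides the statistical guarantee mentioned above.

```python
import numpy as np
from scipy import stats

x1 = np.array([105, 105, 107, 107, 108, 115, 116, 117, 117, 118, 129, 133, 135])

mu, sigma = stats.norm.fit(x1)       # model the underlying distribution
stat, p_value = stats.shapiro(x1)    # test the normality hypothesis
print(f"mean={mu:.1f}, sd={sigma:.1f}, p-value={p_value:.3f}")
```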
Data Modeling Types
⁺ Algorithmic Modeling Depends on weight,
height, age and BP
Blood sugar level y = ƒ(x)
after 30 days
⁺ Subject of machine learning is to estimate function ƒ
using data & optimization technique.
⁺ For a new patient plug-in the value of x is required to get
y.
⁺ Focus on prediction (not relevant to underlying
phenomena).
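An algorithmic-modeling sketch with scikit-learn: estimate ƒ from made-up patient data, then plug in a new patient's x to predict y.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x columns: weight (kg), height (cm), age (yr), BP; y: blood sugar after 30 days.
X = np.array([[70, 165, 45, 120], [82, 170, 52, 135],
              [65, 158, 39, 118], [90, 175, 60, 142]])
y = np.array([98, 130, 95, 150])

model = LinearRegression().fit(X, y)      # estimate f from data
new_patient = np.array([[78, 168, 50, 128]])
print(model.predict(new_patient))         # plug in x to get predicted y
```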
Data Modeling
Statistical Modeling | Algorithmic Modeling
Simple, intuitive models | Complex, flexible models
More suitable for low-dimensional data | Can work with high-dimensional data
Robust statistical analysis is possible | Not suitable for robust analysis
Focus on interpretability | Focus on prediction
Data-lean models | Data-hungry models
More of statistics | More of machine learning and deep learning
Linear Regression, Logistic Regression, Linear Discriminant Analysis | Linear Regression, Logistic Regression, Linear Discriminant Analysis, Decision Tree, K-NNs, Support Vector Machine, Naïve Bayes
Data Analysis and Data Analytics
Data Analytics: the broad field of using data and tools to make business decisions.
Data Analysis: a subset of data analytics that includes specific processes.
Data Analysis Types: Descriptive Analysis, Exploratory Analysis, Inferential Analysis, Predictive Analysis, and Text Analysis.
Descriptive Analysis
❑The very first analysis
performed.
❑Generates simple
summaries about samples
and measurements.
❑Common descriptive
statistics (measures of
central tendency, variability,
frequency, position, etc).
Exploratory Analysis
❑EDA helps in discovering relationships between measures in the data; such relationships are not evidence of causation, as the phrase "correlation doesn't imply causation" reminds us.
❑Useful for discovering new connections: forming hypotheses and driving design planning and data collection.
Inferential Analysis
❑Uses sample data to estimate a value in the population and gives a measure of uncertainty (e.g., standard deviation) in the estimation.
❑Accuracy of inference depends heavily on the sampling scheme; if the sample isn't representative of the population, the generalization will be inaccurate.
Predictive Analysis
❑Using historical or current data to find
patterns to make predictions about the
future.
❑Accuracy of the predictions depends on the
input variables.
❑Accuracy also depends on the type of model; a linear model might work well in some cases and poorly in others.
❑Using one variable to predict another doesn't denote a causal relationship.
What is Predictive Modeling?
❑Predictive modeling is a process used in predictive
analytics to create a model of future behavior.
❑A predictive model is made up of a number
of predictors, which are variable factors that are
likely to influence future behavior or results.
❑Historical data is considered for extracting
predictive rules, which are then applied on future
data.
What is Predictive Modeling?
Historical Data (for training) → Model Development using a Machine Learning Technique → Predictive Rules → applied to a New Data Set → Prediction
Steps in Predictive Modeling
❑Historical data collection.
❑Selection of appropriate machine learning technique for
developing model.
❑Data cleaning including outlier analysis and missing data
handling.
❑Attribute subset selection to be used in the prediction model.
❑Model development using the training dataset.
❑Model validation for determination of its accuracy.
❑Using the model for future predictions (see the sketch below).
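A sketch of this workflow with scikit-learn; the dataset and the decision-tree model are illustrative choices, not prescribed steps.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)   # stands in for historical data

# Partition: training set for model development, test set for validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
# model.predict(new_data) would then serve future predictions.
```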
Key Elements in Predictive Modeling
❑Historical Data
❑Learning Technique
❑Statistical Techniques
❑Machine Learning Techniques
Historical Data
❑Historical Data is collected data about past events and
circumstances pertaining to a particular subject.
❑Historical data includes most data generated either manually
or automatically within an enterprise.
❑Sources of data collection include open source software, repositories, press releases, log files, financial reports, software source code, project and product documentation, and email and other communications.
Learning Technique
❑There are two types of learning algorithms/technique such as
statistical and machine learning techniques.
❑The learning techniques are used for model construction after data partitioning.
❑Independent variables such as software metrics are used to predict the dependent variable, which is a software quality attribute.
❑The algorithms may be statistical or machine learning, depending on the type of dataset.
Text Analysis
❑Text Analysis is also referred to
as Data Mining.
❑It is one of the methods of data
analysis to discover a pattern in
large data sets using databases or
data mining tools.
❑It is used to transform raw data into business information.
❑Overall, it offers a way to extract and examine data, derive patterns, and finally interpret the data.
Data Analysis Methods
❑Statistical: Descriptive Statistics, Inferential Statistics.
❑Machine Learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning.
❑Research Methods: Quantitative Data Analysis, Qualitative Data Analysis.
Criteria for Selection of Data Analysis Technique
❑Type of Dependent Variable
❑Nature of Data Set
❑Important Aspects of Different Algorithms
On the basis of the type of dependent variable:
❑Binary:
➢Statistical: Logistic Regression, Discriminant Analysis
➢Machine Learning: Decision Tree, Support Vector Machine, Artificial Neural Network
❑Continuous:
➢Statistical: Linear Regression
➢Machine Learning: Ordinary Least Square
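A small sketch of this selection rule: a binary dependent variable suggests a classifier such as logistic regression, a continuous one a regressor such as linear regression. The choose_model helper is hypothetical.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

def choose_model(y):
    """Pick an estimator family from the dependent variable's type (hypothetical helper)."""
    if set(y) <= {0, 1}:          # binary dependent variable
        return LogisticRegression()
    return LinearRegression()     # continuous dependent variable

print(choose_model([0, 1, 1, 0]))       # -> LogisticRegression()
print(choose_model([3.2, 4.8, 5.1]))    # -> LinearRegression()
```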
Nature of Data Set:
➢Diversity in data
➢Redundancy in data
➢Type and existence of interactions among variables
➢Size of the training set
Aspects of Data Analysis Methods
❑To select an appropriate data analysis method, the following important aspects should be considered:
❑Accuracy: refers to the predictive power of the algorithm.
❑Speed: refers to the time required to train the model and the time required to test it.
❑Interpretability: the results produced by the algorithm are easily interpretable.
❑Simplicity: the algorithm must be simple in its operation and easy to learn.
Other considerations while selecting data analysis methods:
➢Ability to handle missing values
➢Ability to handle non-vector data
➢Sensitivity to outliers
➢Accuracy of class probability estimates
➢Efficacy in high dimensions
➢Ability to handle class imbalance
Why study Statistics?
❑Numerical information is everywhere.
❑Statistical techniques are used to inform
decisions that affect our everyday lives.
❑A knowledge of statistical methods will help in understanding how decisions are made and how they might affect us.
❑An understanding of data analysis is
helpful in most occupations.
Myths about Data Science
1. Machine does everything
❑Data Collection
➢Humans decide: What to collect? Where to collect? How to collect? What type of data should be collected (open source/proprietary)?
➢Machine does: executing the scripts.
Myths about Data Science
❑Data Storage
➢Humans decide: What schema to use? Which file system?
➢Machine provides: physical storage.
Myths about Data Science
❑Data Processing
➢Humans (who must have domain knowledge) decide: What to clean? How to clean? How to study and integrate multiple formats?
➢Machine does: executing the scripts.
Myths about Data Science
❑Describing Data
➢Humans decide: Which columns? Which plots? What trends are required for study?
➢Machine does: executing the scripts.
Myths about Data Science
❑Data Modeling
➢Humans decide: hypotheses, proposing models, and overseeing training.
➢Machine does: estimating parameters.
Myths about Data Science
2. Data science requires Big Data and Deep Learning.
The myth: Big Data + Deep Learning + Hardware (GPU & TPU) = Data Science.
In reality, statistics is also required in data science.
Myths about Data Science
3. Data science is always successful.
Why data science could fail:
➢No meaningful insights in the data
➢No actionable insights in the data
➢Noisy data
➢Not enough data
Data Analysis Uses
❑Real-time Analytics
❑Changing Hiring Patterns
❑Improved Marketing Efficiency
❑Healthcare Ingression
❑Cyber Security
❑Science
Fields in Data Science
❑Data Analyst
❑Data Engineer
❑Machine Learning Engineer
❑Data Science Generalist
