
Foundations of Data Science
Unit 2
Acknowledgement
▪ Most of the slides in this presentation are taken from material provided by
▪ Han and Kamber (Data Mining: Concepts and Techniques) and
▪ Tan, Steinbach and Kumar (Introduction to Data Mining)

Zarmeen
Spring 2021 2
Nasim
A simplified Data Science Taxonomy

Data Science
▪ Data Acquisition
▪ Statistics
▪ Data Analytics
▪ Data Mining
▪ Data Visualization

Data Analytics

▪ The discovery and communication of meaningful patterns in data.
▪ Two major types are
▪ Descriptive Analytics
▪ Predictive Analytics

Descriptive vs. Predictive Analytics
▪ Descriptive Analytics
▪ what happened and why did it happen
▪ Referred to as “unsupervised learning” in machine learning
▪ Predictive Analytics
▪ what will happen
▪ Referred to as “supervised learning” in machine learning

Predictive Analytics
Classification Techniques
▪ Classification Trees
▪ Naïve Bayes
▪ Random Forest
▪ Neural Networks
▪ Support Vector Machine
Prediction
▪ Regression Analysis
▪ Time Series Analysis
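
As a rough illustration of regression analysis from the prediction techniques listed above, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. The data points and the helper name `fit_line` are made up for illustration; they do not come from the slides.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend vs. units sold (exactly y = 2x + 1)
spend = [1.0, 2.0, 3.0, 4.0]
sold = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, sold)
predicted = slope * 5.0 + intercept  # prediction for an unseen spend of 5.0
```

Because the target here is a number rather than a class label, this falls under prediction; a classification technique would instead output a category.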

Descriptive Analytics
▪ Clustering
▪ Market Basket Analysis

Fig. Clustering

Fig. Market Basket Analysis
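
To make market basket analysis a little more concrete, the sketch below computes two commonly used measures, support and confidence, for a rule such as "bread implies milk". The baskets and the function names are hypothetical and chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical shopping baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})       # 2 of the 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 baskets with bread
```

No labels are involved: the rule is discovered from co-occurrence alone, which is why the slides place this technique under descriptive analytics.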

What Is Data Mining?
▪ Data mining (knowledge discovery from data)
▪ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
patterns or knowledge from huge amounts of data

▪ Alternative names
▪ Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern
analysis, data archeology, data dredging, information harvesting etc.

▪ Watch out: Is everything “data mining”?
▪ No: simple search and query processing, for example, are not data mining
Origins of Data Mining

▪ Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
▪ Traditional techniques may be unsuitable due to
▪ Enormity of data
▪ High dimensionality of data
▪ Heterogeneous, distributed nature of data

Fig. Data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems

Data Visualization
▪ Representation of Data using visual forms such as charts, graphs and maps
▪ Goal - Communicate information clearly and effectively to users
▪ Data visualization helps data scientists to get better insights
▪ Tools: Plotly, DataHero, Tableau, Dygraphs, QlikView, ZingChart, etc.

Data Science vs. Allied Fields
Data Science vs. Statistics
▪ Statistics is a mathematically-based field which seeks to collect and interpret
quantitative data. In contrast, data science is a multidisciplinary field which uses
scientific methods, processes, and systems to extract knowledge from data in a range
of forms.
▪ Data scientists use methods from many disciplines, including statistics. However, the
fields differ in their processes, the types of problems studied, and several other
factors.
▪ Statistics has its roots in mathematics, and therefore, there has been an emphasis on
mathematical rigor, a desire to establish that something is sensible on theoretical
grounds before testing it in practice.
▪ In contrast, the data science community has its origin very much in computer practice.
This has led to a practical orientation, a willingness to test something out to see how
well it performs, without waiting for a formal proof of effectiveness.

Data Science vs. Machine Learning
▪ Data science is a broad term for multiple disciplines; machine learning fits within data science.
▪ Both fields are concerned with the analysis of data to find useful or informative patterns.

Machine Learning (ML) | Data Science
Develop new (individual) models | Explore many models, build and tune hybrids
Prove mathematical properties of models | Understand empirical properties of models
Improve/validate on a few, relatively clean, small datasets | Develop/use tools that can handle massive datasets
More focused on research | More practical in nature

Data Science vs. Business Intelligence
Features | Business Intelligence (BI) | Data Science
Data Sources | Structured (usually SQL, often a data warehouse) | Both structured and unstructured (logs, cloud data, SQL, NoSQL, text)
Approach | Statistics and Visualization | Statistics, Machine Learning, Graph Analysis
Focus | Past and Present | Present and Future
Tools | Tableau, Microsoft BI, QlikView, R | RapidMiner, BigML, Weka, R, Python

BI Answers for Fraud Detection
▪ How many cases were investigated last month?
▪ What was the success rate in collecting debts?
▪ How much revenue was recovered through collections?
▪ What was the close rate of cases in the past month? Past quarter? Past year?
▪ For debts that were closed out, how many days did it take on average to close them
out?
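
BI questions like the ones above are answered by aggregating records of what has already happened. A minimal sketch, assuming a made-up list of case records (the tuple layout, dates and amounts are hypothetical):

```python
from datetime import date

# Hypothetical case records: (opened, closed or None, amount recovered)
cases = [
    (date(2021, 1, 4), date(2021, 1, 14), 500.0),
    (date(2021, 1, 10), date(2021, 2, 9), 1200.0),
    (date(2021, 1, 20), None, 0.0),            # still open
    (date(2021, 2, 1), date(2021, 2, 6), 300.0),
]

closed = [c for c in cases if c[1] is not None]

# Close rate: share of cases that have been closed
close_rate = len(closed) / len(cases)

# Average number of days between opening and closing a case
avg_days_to_close = sum((c[1] - c[0]).days for c in closed) / len(closed)

# Total revenue recovered across all cases
revenue_recovered = sum(c[2] for c in cases)
```

Every quantity here is a summary of past data; nothing is estimated about cases that have not yet occurred.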

Predictive Analytics for Fraud Detection
▪ What is the likelihood that the transaction is fraudulent?
▪ What is the likelihood the invoice is fraudulent or warrants further investigation?
▪ Which characteristics of the transaction are most related to or most predictive of
fraud?
▪ What is the expected amount of fraud?
▪ Historically, which demographic and historic purchase patterns were most related
to fraud?
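
Once a predictive model has assigned each transaction a probability of fraud, questions such as "what is the expected amount of fraud?" reduce to simple arithmetic. The sketch below assumes those probabilities already exist; the amounts, probabilities and the review threshold are all hypothetical.

```python
# Hypothetical transactions: (amount, model-estimated probability of fraud)
scored = [
    (100.0, 0.5),
    (400.0, 0.25),
    (1000.0, 0.0),
]

# Expected fraud amount = sum of amount * P(fraud) over all transactions
expected_fraud = sum(amount * p for amount, p in scored)

# Flag transactions whose fraud probability meets an assumed review threshold
REVIEW_THRESHOLD = 0.4
flagged = [amount for amount, p in scored if p >= REVIEW_THRESHOLD]
```

How the probabilities are produced is the job of a classification technique such as those listed under predictive analytics; this sketch only shows how they would be used.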

Predictive Analytics for Customer Analytics
▪ What is the likelihood an e-mail will be opened?
▪ What is the likelihood a customer will click through a link in an e-mail?
▪ Which product is a customer more likely to purchase if given the choice?
▪ How many e-mails should the customer receive to maximize the likelihood of a
purchase?
▪ What is the likelihood that a product will sell out if it is put on sale?

BI Answers for Customer Analytics
▪ Which regions/states/ZIPs had the highest response rates?
▪ Which products had the highest/lowest click-through rates?
▪ How many repeat purchasers were there last month?
▪ How many new subscriptions to the loyalty program were there?
▪ How many visits to the store/website did a person have?

Structured vs. Non-Structured Data
▪ Most business databases contain structured data consisting of well-defined fields
with numeric or alpha-numeric values.
▪ An example of unstructured data is a video recorded by a surveillance camera in
a departmental store. This form of data generally requires extensive processing to
extract and structure the information contained in it.
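
Video is hard to show in a few lines, so the sketch below uses free-text log lines as a simpler stand-in (arguably semi-structured rather than fully unstructured) to illustrate the "extract and structure" step. The log format, field names and pattern are assumptions made for this example.

```python
import re

# Hypothetical free-text log lines (the raw input)
lines = [
    "2021-03-01 09:15 user=alice action=login",
    "2021-03-01 09:20 user=bob action=purchase",
    "corrupted entry",
]

# Pattern describing the fields we expect to find in each line
PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}) "
    r"user=(?P<user>\w+) action=(?P<action>\w+)"
)

# Keep only the lines that match, as records with well-defined fields
records = [m.groupdict() for m in map(PATTERN.fullmatch, lines) if m]
```

After this step each record has the same named fields, which is the form most business databases and data mining tools expect.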

Structured vs. Non-Structured Data (Cont’d)
▪ Structured data is often referred to as traditional data, while the semi-structured
and unstructured data are lumped together as non-traditional data.
▪ Most of the current data mining methods and commercial tools are applied to
traditional data.

Data Science Process
Data Science - Process (CRISP-DM)

CRISP-DM (Business Understanding)
▪ Understand the project objectives and requirements
▪ Can it be converted into a data mining problem definition?
▪ Was any effort made in the past? If yes, what were the findings? Why are we
doing it again? What has changed?
▪ Assess availability of time, technology and human resources. Do we have enough
time and resources to execute the analytics project?
▪ Identify the success criteria, key risks and major stakeholders

CRISP-DM (Data Understanding)
▪ Get familiar with the data. Is it enough to solve the stated business problem? If
not, do we need to redesign the data collection process?
▪ What’s needed vs. what’s available
▪ Identify data quality problems
▪ Determine the structures and tools needed
▪ Discover first insights into the data

CRISP-DM (Data Preparation)
▪ Construct the final dataset
▪ Process likely to be repeated multiple times, and not in any prescribed order
▪ Tasks include attribute selection as well as transformation and cleaning of data
▪ Understand what to keep and what to discard
▪ Extensive use of exploratory data analysis and visualization
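
The preparation tasks named above (attribute selection, cleaning, transformation) can be sketched on a toy dataset as follows. The records, the choice of attributes to keep, and the use of min-max scaling are illustrative assumptions, not steps prescribed by CRISP-DM.

```python
# Hypothetical raw records with a missing value and an irrelevant attribute
raw = [
    {"id": 1, "age": "34", "income": "52000", "notes": "n/a"},
    {"id": 2, "age": "", "income": "61000", "notes": "call back"},
    {"id": 3, "age": "45", "income": "48000", "notes": ""},
]

KEEP = ("age", "income")  # attribute selection: what to keep

def prepare(rows):
    """Select attributes, drop incomplete rows, convert strings to numbers."""
    cleaned = []
    for row in rows:
        selected = {k: row[k] for k in KEEP}
        if any(v == "" for v in selected.values()):
            continue  # cleaning: discard rows with missing values
        cleaned.append({k: float(v) for k, v in selected.items()})
    return cleaned

dataset = prepare(raw)

# Transformation: rescale income to the range [0, 1] (min-max scaling)
lo = min(r["income"] for r in dataset)
hi = max(r["income"] for r in dataset)
for r in dataset:
    r["income_scaled"] = (r["income"] - lo) / (hi - lo)
```

In a real project these choices would be revisited several times, for example imputing the missing age instead of dropping the row once the modeling phase shows the data is too small.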

CRISP-DM (Modeling)
▪ Application of various modeling techniques and calibration of their parameters to
optimal values
▪ Documenting assumptions behind each modeling technique to get feedback from
stakeholders and domain experts
▪ Typically requires stepping back to the data preparation phase
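
Calibrating a model's parameters can be as simple as trying candidate values and keeping the one that performs best. The sketch below does this for a deliberately tiny model with a single threshold parameter; the data, the candidate values and the model itself are hypothetical, and in practice the comparison would usually be made on separate validation data.

```python
# Hypothetical labelled data: (feature value, class label)
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]

def accuracy(threshold, data):
    """Share of points classified correctly by the rule: predict 1 if x > threshold."""
    correct = sum(1 for x, label in data if (x > threshold) == bool(label))
    return correct / len(data)

# Calibrate the single parameter by trying each candidate value
candidates = [0.5, 2.5, 4.5, 6.5]
best_threshold = max(candidates, key=lambda t: accuracy(t, train))
```

The same search pattern extends to several parameters at once, at the cost of many more candidate combinations to evaluate.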

CRISP-DM (Evaluation)
▪ Test robustness of the models under consideration by gauging their performances
against hold-out data
▪ Analyze if the models achieve the business objectives.
▪ Finalize a data mining model
▪ Quantify business value and identify key findings
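
Gauging performance against hold-out data means setting part of the data aside before fitting and scoring the model only on that part. A minimal sketch, assuming a synthetic dataset, a 75/25 split and a very simple threshold model (all illustrative choices):

```python
import random

# Hypothetical labelled data: (feature value, class label)
data = [(float(x), int(x >= 10)) for x in range(20)]

# Set aside a hold-out portion that the model never sees during fitting
rng = random.Random(0)  # fixed seed so the split is repeatable
shuffled = data[:]
rng.shuffle(shuffled)
split = int(0.75 * len(shuffled))
train, holdout = shuffled[:split], shuffled[split:]

# "Fit" a very simple model: threshold halfway between the class means
mean0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
mean1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
threshold = (mean0 + mean1) / 2

def predict(x):
    """Predict class 1 when the feature exceeds the fitted threshold."""
    return int(x > threshold)

# Score the model only on the data it has not seen
holdout_accuracy = sum(predict(x) == y for x, y in holdout) / len(holdout)
```

A model that scores well on the training data but poorly on the hold-out set is not robust, which is exactly what this phase is meant to detect.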

CRISP-DM (Deployment)
▪ Typically a customer-driven stage rather than an analyst-driven one.
▪ Important for the customer to understand up front the actions needed to actually
make use of the created models.
▪ Define process to update and retrain the model, as needed.

