
DEPARTMENT OF ARTIFICIAL INTELLIGENCE & MACHINE LEARNING

INTRODUCTION TO DATA SCIENCE

LECTURE NOTES – UNIT 5

B. TECH
II YEAR – II SEM (Sec-A & B)
Academic Year 2022-23

Prepared & compiled by

DR. G. ARUN SAMPAUL THOMAS,
ASSOCIATE PROFESSOR & HOD, DEPARTMENT OF AI&ML
J.B.I.E.T
Bhaskar Nagar, Yenkapally (V), Moinabad (M),
Ranga Reddy (D), Hyderabad – 500 075, Telangana, India.


J. B. Institute of Engineering and Technology (UGC Autonomous)
AY 2020-21 onwards
B. Tech: AI & ML, II Year – II Sem
Course Code: J22D3
INTRODUCTION TO DATA SCIENCE
L T P D: 2 0 0 0
Credits: 2

Pre-requisite:
Database Management Systems, Data Structures

Course Objectives:
This course will enable students to:
• Know about the fundamental concepts and technologies of Data Science.
• Explore the various Data collection and storage methods.
• Understand the Data Analysis, statistics, and various machine learning algorithms.
• Investigate the visualization of data and apply coding techniques to data for
securing the data.
• Study the applications of Data Science, technologies for visualization, and the
handling of variables using Python.

UNIT-I - Introduction to Data Science


Introduction to core concepts and technologies: Introduction, Terminology, Data science
Process, data science toolkit, Types of data, Example applications

UNIT-II - Data collection and management:


Introduction, Sources of data, Data collection and APIs, Exploring and fixing data. Data storage
and management, using multiple data sources.

UNIT-III - Data analysis:


Introduction, Terminology and concepts, Introduction to statistics, Central tendencies and
distributions, Variance, Distribution properties and arithmetic, Samples/CLT. Basic machine
learning algorithms, Linear regression, SVM, Naive Bayes.

UNIT-IV - Data visualization:


Introduction, Types of data visualization, Data for visualization:
Data types, Data encodings, Retinal variables, mapping variables to encodings, Visual
encodings.

UNIT-V - Practices and Case Studies in Data Science:


Applications of Data Science, Technologies for visualization, Recent trends in various data
collection and analysis techniques, various visualization techniques, application development
methods used in data science. Demonstrate some case studies such as Marketing, Finance, HR,
Manufacturing, Healthcare, etc.

Textbooks:
1. Cathy O’Neil, Rachel Schutt, Doing Data Science: Straight Talk from the Frontline, O’Reilly,
2013.
2. Jure Leskovec, Anand Rajaraman, Jeffrey Ullman, Mining of Massive Datasets, v2.1,
Cambridge University Press, 2014.
Reference Books:
1. Joel Grus, “Data Science from Scratch”, O'Reilly, 2015.
2. Gupta, S.C. and Kapoor, V.K., “Fundamentals of Mathematical Statistics”, Sultan
Chand & Sons, New Delhi, 11th Ed, 2002.
3. Hastie, Trevor, et al., “The Elements of Statistical Learning”, Springer, 2009.
4. Wes McKinney, “Python for Data Analysis”, O'Reilly Media, 2012.

Course Outcomes:
The student will be able to
• Identify the basic concepts of data science and the types of data.
• Analyse how to collect, manage, explore, and store data.
• Implement the basic measures of central tendency and classify data using SVM and
Naive Bayes.
• Interpret the visualization of data and apply coding techniques to data for securing the
data.
• Analyse the various concepts of data science and handle simple
applications of data science using Python.

WEBSITE REFERENCES FOR SELF LEARNING


1. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
2. https://www.rstudio.com/online-learning/
INTRODUCTION TO DATA SCIENCE

UNIT– V
• Applications of Data Science

DR. G. ARUN SAMPAUL THOMAS


Associate Professor & HOD – Department of AI&ML
J.B. Institute of Engineering and Technology
Hyderabad, Telangana
arunsam.infotech@gmail.com arunthomas.ai_ml@jbiet.edu.in
UNIT– V
• Technologies & Tools for Visualisation
• Visualisation Techniques
UNIT– V
• Recent Data Collection & Analysis
• DS Case Study
Data All Around

• Lots of data is being collected and warehoused
– Scientific Experiments
– Internet of Things
– Web data, e-commerce
– Financial transactions, bank/credit transactions
– Online trading and purchasing
– Social Network
– ……many more!

What To Do With These Data?

• Aggregation and Statistics
– Data warehousing and OLAP
• Indexing, Searching, and Querying
– Keyword based search
– Pattern matching (XML/RDF)
• Knowledge discovery
– Data Mining
– Statistical Modeling
• Data Driven
– Predictive Analytics
– Deep Learning

Statistical and Critical Thinking
Analyzing Data: Potential Pitfalls
• Misleading Conclusions
When forming a conclusion based on a statistical analysis, we should make statements that are clear
even to those who have no understanding of statistics and its terminology.
• Sample Data Reported Instead of Measured
When collecting data from people, it is better to take measurements yourself instead of asking
subjects to report results.
• Loaded Questions
If survey questions are not worded carefully, the results of a study can be misleading.
• Order of Questions
Sometimes survey questions are unintentionally loaded by the order of the items being considered.
• Nonresponse
A nonresponse occurs when someone either refuses to respond or is unavailable.
• Percentages
Some studies cite misleading percentages. Note that 100% of some quantity is all of it, but if there
are references made to percentages that exceed 100%, such references are often not justified.
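
The percentage pitfall can be checked with simple arithmetic. The sketch below is illustrative only: the helper name `percent_change` and the numbers are assumptions, not taken from any study. It shows that a 100% reduction removes all of a quantity, that an increase may legitimately exceed 100%, and that a claimed reduction above 100% would imply a negative amount.

```python
def percent_change(old, new):
    """Percentage change from old to new (old must be non-zero)."""
    return (new - old) / old * 100

# A 100% reduction removes all of the quantity.
print(percent_change(200, 0))    # -100.0

# An increase can exceed 100%: the value more than doubles.
print(percent_change(200, 500))  # 150.0

# A claimed "120% reduction" of a non-negative quantity is not justified:
# it would leave less than nothing.
claimed_reduction = 120  # hypothetical claim, in percent
implied_new_value = 200 * (1 - claimed_reduction / 100)
print(implied_new_value < 0)     # True
```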
Types of Data, Key Concept

A major use of statistics is to collect and use sample data to make conclusions
about populations.

Parameter & Statistic

• Parameter
a numerical measurement describing some
characteristic of a population
• Statistic
a numerical measurement describing some
characteristic of a sample
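
A minimal Python sketch of the distinction, assuming a synthetic population of ages generated with a fixed seed (the data and variable names are hypothetical, chosen only for illustration): the mean of the whole population is a parameter, while the mean of a sample drawn from it is a statistic.

```python
import random
import statistics

# Hypothetical population: ages of 10,000 people (synthetic, for illustration only).
rng = random.Random(42)
population = [rng.randint(18, 80) for _ in range(10_000)]

# Parameter: a numerical measurement describing the whole population.
population_mean = statistics.mean(population)

# Statistic: the same measurement computed from a sample of the population.
sample = rng.sample(population, 100)
sample_mean = statistics.mean(sample)

print(f"Parameter (population mean age): {population_mean:.2f}")
print(f"Statistic (sample mean age):     {sample_mean:.2f}")
```

The two values are usually close but not identical, which is why a statistic is used to estimate a parameter rather than to replace it.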

Types of Data

Quantitative Data & Categorical Data


• Quantitative (or numerical) data
consists of numbers representing counts or measurements.

Example: The weights of supermodels


Example: The ages of respondents

• Categorical (or qualitative or attribute) data


consists of names or labels (not numbers that represent counts or measurements).

Example: The gender (male/female) of professional athletes


Example: Shirt numbers on professional athletes' uniforms (substitutes for names)
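
The shirt-number example can be sketched in Python. The values below are made up for illustration. Weights are quantitative, so a mean is meaningful; shirt numbers are labels that only look numeric, so the sketch counts them instead of averaging them.

```python
import statistics
from collections import Counter

# Hypothetical values, for illustration only.
weights_kg = [52.5, 55.0, 49.8, 58.2]  # quantitative: measurements
shirt_numbers = [10, 7, 10, 23]        # categorical: labels that look like numbers

# Arithmetic on quantitative data is meaningful.
mean_weight = statistics.mean(weights_kg)

# For categorical data, count how often each label occurs instead.
shirt_counts = Counter(shirt_numbers)
most_common_shirt = shirt_counts.most_common(1)[0][0]

print(f"Mean weight: {mean_weight:.2f} kg")
print(f"Most common shirt number: {most_common_shirt}")
```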
Types of Data, Quantitative Data

Discrete & Continuous types:


• Discrete data
result when the data values are quantitative and the number of values is
finite, or “countable.”

Example: The number of tosses of a coin before getting tails


• Continuous (numerical) data
result from infinitely many possible quantitative values, where the
collection of values is not countable.

Example: The lengths of distances from 0 cm to 12 cm
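
Both examples can be simulated with a short sketch, assuming a fair coin and a fixed seed (the function name `tosses_before_tails` is hypothetical). The number of tosses before tails is always a whole number, while a length between 0 cm and 12 cm can take any value in that range. Floating-point numbers only approximate a true continuum.

```python
import random

rng = random.Random(0)

def tosses_before_tails(rng):
    """Count how many heads come up before the first tails (a discrete value)."""
    count = 0
    while rng.random() < 0.5:  # treat values below 0.5 as heads
        count += 1
    return count

# Discrete: every outcome is a whole number 0, 1, 2, ...
discrete_values = [tosses_before_tails(rng) for _ in range(5)]

# Continuous: a length anywhere between 0 cm and 12 cm.
continuous_values = [rng.uniform(0, 12) for _ in range(5)]

print(discrete_values)
print([round(v, 3) for v in continuous_values])
```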

Types of Data, Quantitative Data

Data
• Qualitative: categorical
• Quantitative: numerical, can be ranked
– Discrete: countable (5, 29, 8000, etc.)
– Continuous: can be decimals (2.59, 312.1, etc.)
Types of Data, Levels of Measurement:
Another way of classifying data: 4 levels of measurement: nominal, ordinal, interval, and ratio.

• Nominal level of measurement
characterized by data that consist of names, labels, or categories only, and
the data cannot be arranged in some order (such as low to high).
Example: Survey responses of yes, no, and undecided
• Ordinal level of measurement
involves data that can be arranged in some order, but differences (obtained
by subtraction) between data values either cannot be determined or are
meaningless.
Example: Course grades A, B, C, D, or F
• Interval level of measurement
involves data that can be arranged in order, and the differences between
data values can be found and are meaningful. However, there is no
natural zero starting point at which none of the quantity is present.
Example: Years 1000, 2000, 1776, and 1492
• Ratio level of measurement
data can be arranged in order, differences can be found and are
meaningful, and there is a natural zero starting point (where zero indicates
that none of the quantity is present). Differences and ratios are both
meaningful.
Example: Class times of 50 minutes and 100 minutes

Summary:
• Nominal - categories only (names)
• Ordinal - categories with some order (nominal, plus can be ranked)
• Interval - differences but no natural zero point (ordinal, plus intervals are consistent)
• Ratio - differences and a natural zero point (interval, plus ratios are consistent, true zero)
Types of Data, Levels of Measurement:
Example 1:

Determine the measurement level.

Variable         Nominal  Ordinal  Interval  Ratio  Level
Hair Color       Yes      No                        Nominal
Zip Code         Yes      No                        Nominal
Letter Grade     Yes      Yes      No               Ordinal
ACT Score        Yes      Yes      Yes       No     Interval
Height           Yes      Yes      Yes       Yes    Ratio
Age              Yes      Yes      Yes       Yes    Ratio
Temperature (F)  Yes      Yes      Yes       No     Interval
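
The difference between the interval and ratio levels can be checked numerically. The sketch below uses illustrative values and a hypothetical helper `fahrenheit_to_celsius`. For Fahrenheit temperatures (interval), the ratio of two readings changes when the unit changes, so "twice as hot" is not meaningful. For class times (ratio), the ratio is the same in minutes and in seconds.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9

# Interval level: Fahrenheit has no natural zero, so the ratio of two
# readings depends on the unit chosen.
f_low, f_high = 40.0, 80.0
ratio_f = f_high / f_low
ratio_c = fahrenheit_to_celsius(f_high) / fahrenheit_to_celsius(f_low)

# Ratio level: class time has a natural zero (0 minutes means no time),
# so the ratio is the same in minutes and in seconds.
min_short, min_long = 50, 100
ratio_minutes = min_long / min_short
ratio_seconds = (min_long * 60) / (min_short * 60)

print(f"Temperature ratio in F: {ratio_f:.2f}, in C: {ratio_c:.2f}")
print(f"Class-time ratio in minutes: {ratio_minutes:.2f}, in seconds: {ratio_seconds:.2f}")
```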

Example 2:

Example 3:

Parameter or Statistic?

Statistic

Parameter

Example 4:

Discrete or Continuous?

Continuous

Discrete

Example 5:
Determine the measurement level.

Nominal

Ratio

Ordinal

Interval

Example 6:
Determine the measurement level & what’s wrong with the conclusion?

Structured vs Unstructured

https://www.youtube.com/watch?v=WBU7sW1jy2o
Big Data & Data Science

• “… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief Economist
• The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts
by 2018 (McKinsey Global Institute, June 2011).

• New Data Science institutes being created or repurposed – NYU, Columbia, Washington, UCB,...
• New degree programs, courses, boot-camps:
– e.g., at Berkeley: Stats, I-School, CS, Astronomy…
– One proposal (elsewhere) for an MS in “Big Data Science”
– Plans for Data Science Stream at AUST
– RDA-CODATA School of Research Data Science
Data Science Vs Analysis Vs Software Delivery

• Tools
– Traditional Analysis: SAS, R, Excel, SQL, in-house tools
– Traditional Software Delivery: Java, source control, Linux, continuous integration, unit
testing, bug reports and project management
– Data Science: R, Java, scientific Python libraries, Excel, SQL, Hadoop, Hive, Pig, Mahout and
other machine learning libraries, GitHub for source control and issue management
• Analytical Methods
– Traditional Analysis: Regressions, classifications, measuring prediction accuracy and
coverage/error, sampling
– Traditional Software Delivery: N/A
– Data Science: Classification, clustering, similarity detection, recommenders, unsupervised
and supervised learning, small- and large-scale computations, measuring prediction accuracy
and coverage/error
• Team Structure
– Traditional Analysis: Statisticians, Mathematicians, Scientists
– Traditional Software Delivery: Developers, Project Managers, Systems Engineers
– Data Science: Mathematicians, Statisticians, Scientists, Developers, Systems Engineers
• Time Frame
– Traditional Analysis: Either usually on-going research and discovery within a team in the
organization, or a specific project to determine answers
– Traditional Software Delivery: Regular software release cycle, continuous delivery, etc.
– Data Science: Either a discovery/learning phase leading to product development, or on-going
research and product invention/improvement
Contrast: Scientific Computing

[Figure: an image is passed to a general-purpose classifier that labels it "Supernova" or "Not". Source: Nugent group / C3 LBL]

Scientific Modeling
– Physics-based models
– Problem-structured
– Mostly deterministic, precise
– Run on supercomputer or high-end computing cluster

Data-Driven Approach
– General inference engine replaces model
– Structure not related to problem
– Statistical models handle true randomness and un-modeled complexity
– Run on cheaper computer clusters (EC2)
Contrast: Machine Learning

Machine Learning
– Develop new (individual) models
– Prove mathematical properties of models
– Improve/validate on a few, relatively clean, small datasets
– Publish a paper

Data Science
– Explore many models, build and tune hybrids
– Understand empirical properties of models
– Develop/use tools that can handle massive datasets
– Take action!
Contrast: Data Engineering

                     Data Science                       Data Engineering
Approach             Scientific (Exploration)           Engineering (Development)
Problems             Unbounded                          Bounded
Path to Solution     Iterative, exploratory, nonlinear  Mostly linear
Education            More is better (PhDs common)       BS and/or self-trained
Presentation Skills  Important                          Not as important
Research Experience  Important                          Not as important
Programming Skills   Not as important                   Important
Data Skills          Important                          Important
Data Science Applications

• Business
– Summary: From car design to insurance to pizza delivery, businesses are using data
science to optimize their operations and better meet their customers’ expectations.
– What is happening? Two-Way Street for the Ford Focus Electric Car; Better Fraud Detection
Boosts Customer Satisfaction; E-Commerce Insights: Domino’s Secret Sauce
– What is possible: Using Social Data to Select Successful Retail Locations
• Health Care
– Summary: Tomorrow’s healthcare may look more efficient thanks to things like electronic
health records. It also may look a lot more effective. Reduced readmissions, better care, and
earlier detection are on the horizon.
– What is happening? Reducing Hospital Readmissions; Better Point-of-Care Decisions
– What is possible: Medical Exams by Bathroom Mirrors
• Urban Living
– Summary: For the first time in human history, more people live in cities than in suburban or
rural areas. An emerging field called “urban informatics” combines data science with the
unique challenges facing the world’s growing cities.
– What is happening? Taking on Megacity Traffic; Fighting Crime with Data ("predictive policing")
– What is possible: Instrumenting cities
Data Science: Case Study
Cancer Research
• Cancer is an incredibly complex disease; a single tumor can have
more than 100 billion cells, and each cell can acquire mutations
individually. The disease is always changing, evolving, and adapting.
• Employ the power of big data analytics and high-performance
computing.
• Leverage sophisticated pattern recognition and machine learning algorithms to
identify patterns that are potentially linked to cancer
• Huge amount of data processing and recognition
Data Science: Case Study
Health Care

• Stanford Medicine, Google team up to harness power of data science for health care
• Stanford Medicine will use the power, security and scale of Google Cloud Platform to
support precision health and more efficient patient care.
• Analyzing genetic data
• Focusing on precision health
• Data as the engine that drives research

http://med.stanford.edu/news/all-news/2016/08/stanford-medicine-google-team-up-to-harness-power-of-data-science.html
Data Science: Case Study
Elections
• The Obama campaigns in 2008 and 2012 are credited with the
successful use of social media and data mining.
• Micro-targeting in 2012
– http://www.theatlantic.com/politics/archive/2012/04/the-creepiness-factor-how-obama-and-romney-are-getting-to-know-you/255499/
– http://www.mediabizbloggers.com/group-m/How-Data-and-Micro-Targeting-Won-the-2012-Election-for-Obama---Antony-Young-Mindshare-North-America.html
• Micro-profiles were built from multiple sources accessed by apps, with data
updated in real time from door-to-door visits; media buys were focused, and
e-mails and Facebook messages were highly targeted.
• 1 million people installed the Obama Facebook app that gave
access to info on “friends”.
Data Science: Case Study
Internet of Things (IoT)
• The Internet of Things is growing rapidly. It is predicted that more than 25 billion devices
will be connected by 2020.

• The Internet of Things (IoT) will soon produce a massive volume and variety of data at
unprecedented velocity. If "Big Data" is the product of the IoT, "Data Science" is its
soul.
Data Science: Case Study
Customer Analytics

Case Study - How Recommender Systems Work
(Netflix/Amazon)

https://www.youtube.com/watch?v=n3RKsY2H-NE
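
The slide only links to a video, so the sketch below is an assumption about one common approach: user-based collaborative filtering with cosine similarity. It is not a description of how Netflix or Amazon actually build their systems. The ratings, user names, and function names are all invented for illustration.

```python
import math

# Hypothetical ratings (user -> {item: rating on a 1-5 scale}); invented for illustration.
ratings = {
    "alice": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "bob":   {"Movie A": 4, "Movie B": 5, "Movie D": 4},
    "carol": {"Movie C": 5, "Movie D": 2, "Movie E": 4},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings of other users."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice", ratings))  # ['Movie D', 'Movie E']
```

Here "alice" rates movies much like "bob", so the item bob liked that alice has not seen ranks first. Production systems typically add normalization, implicit feedback, and far larger data.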
UNIT– V
• Apps, Business Development in DS
• Case Study
UNIT– V
Supplementary Notes
• DS Case Study - How People Are Leveraging ChatGPT
