Professional Documents
Culture Documents
Outline Unit - 1
Data Integration
Relationship between Data
Science and Information Data Collection
Science
Data Reduction
Business intelligence Need of Data
versus Data Science wrangling Data Transformation
2
Course Outcome Unit - 1
3
Basic Term Unit - 1
Data Analysis
Data analysis is a procedure of investigating, cleaning, transforming,
and training of the data with the aim of finding some useful
information, recommend conclusions and helps in decision-making
4
Basic Term Unit - 1
Data Analytics
Analytics is utilizing data, machine
learning, statistical analysis and
computer-based models to get better
insight and make better decisions from the
data.
6
Basics of Data Science Unit - 1
7
Basics of Big Data Unit - 1
8
Basics of Big Data Unit - 1
related to
Traditional System
by
Handled | Processed
huge amount of complex,
variously formatted data
generated at high speed
that cannot be
9
Basics of Big Data Unit - 1
1
0
Basics of Big Data Unit - 1
1
1
Basics of Big Data Unit - 1
Source: https://medium.com/analytics-vidhya/big-data-3-vs-and-5-v-s-c1cae2a6d311
12
Basics of Big Data Unit - 1
Volume
● Refers to the size of data (generally in Terabytes)
Velocity
● Velocity refers to the speed at which the data is being
generated.
1
3
Basics of Big Data Unit - 1
Variety
● Data in different formats
Veracity
● Refers to the quality and accuracy of data.
Value
● what organizations can do with the collected data.
1
4
Basics of Big Data Unit - 1
Source: https://medium.com/analytics-vidhya/big-data-3-vs-and-5-v-s-c1cae2a6d311
16
Basics of Big Data Unit - 1
Source: https://medium.com/analytics-vidhya/big-data-3-vs-and-5-v-s-c1cae2a6d311
17
Basics of Big Data Unit - 1
18
Basics of Big Data Unit - 1
19
Exercise time…. Unit - 1
2
0
Sources of Big Data Unit - 1
21
Sources of Big Data Unit - 1
Enterprise Data
Transactional Data
23
Sources of Big Data Unit - 1
Social Media
24
Sources of Big Data Unit - 1
25
Sources of Big Data Unit - 1
Public Data
26
Basics of Big Data Unit - 1
27
Basics of Big Data Unit - 1
Archives
28
Basics of Big Data Unit - 1
29
Basic of Big Data Unit - 1
30
Data Science Unit - 1
Study of Data
It involves
any type of data —
both structured and
unstructured.
developing methods
from
for
insights and
knowledge
Recording, storing ,
analysing
To gain
31
Basics of Data Science Unit - 1
32
Basics of Data Science Unit - 1
3
3
Basics of Data Science Unit - 1
Business Predictive
Marketing Assessing Automating
Intelligence Analysis to
Better Business Recruitment
for Smarter Predict
Product Decision Process
Decision Outcome
3
4
Basics of Data Science Unit - 1
35
Relationship between Data Science and Information Science Unit - 1
Data Vs
Information
Data Science Vs
Information Science
Users in iSchool
Information
Science reference 36
Relationship between Data Science and Information Science Unit - 1
Data Vs Information
37
Relationship between Data Science and Information Science Unit - 1
38
Relationship between Data Science and Information Science Unit - 1
people’s behaviors
accomplishing the task or
goal of the user.
studying
39
Relationship between Data Science and Information Science Unit - 1
● Data Science is the discovery of knowledge or ● Information Science is the design of practices for
actionable information in data sorting & retrieving information
● Data Science is heavy on computer science and ● Information Science is more concerned with the areas
mathematics such as library science, cognitive science &
communication
● Data Science is used in business functions such as ● Information Science is used in areas such as knowledge
strategy formation,decision making & operational management,data management & interaction design
process.
reference 41
Business intelligence versus Data Science Unit - 1
Science
Analytical Business Intelligence
Approach Standard & ad hoc reporting ,
Typical Techniques
Business dashboards,alerts,queries,detail on demand
● It is basically a set of technologies, applications & ● It is a field that uses mathematics, statistics and various
processes that are used by the enterprises for other tools to discover the hidden patterns in the data.
business data analysis.
● It mainly deals only with structured data. ● It deals with both structured as well as unstructured
data.
● It makes the use of analytic method. ● It makes the use of scientific method.
● It’s tools are InsightSquared Sales Analytics, ● It’s tools are SAS, BigML, MATLAB, Excel etc.
Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire etc.
Reference 43
Data Unit - 1
● Data is used to make decisions, solve problems, drive automation, execute transactions.
Reference 44
Data : Data Types Unit - 1
Data Types
45
Data : Data Types Unit - 1
Structured Data
● Data having Predefined data model/schema/structure and is often either relational in nature or is
closely resembling a relational model.
● It includes data in the relational databases, data from ERP/CRM/SAP Systems, XML files etc.
46
Data : Data Types Unit - 1
Unstructured Data
● It includes flat files, spreadsheets, Word documents, emails, images, audio files, video files,
feeds, PDF files, scanned documents, etc.
47
Data : Data Types Unit - 1
Derive insights
49
Data Science Life Cycle Unit - 1
Capture
Communicate Maintain
Data
Science Life
Cycle
Analyze Process
50
Data Science Life Cycle Unit - 1
Data Capture
● Data Acquisition,
● Data Entry,
● Signal Reception,
● Data Extraction
● Collection of raw structured and
unstructured data.
51
Data Science Life Cycle Unit - 1
Data Maintenance
● Data Warehousing
● Data Cleansing
● Data Staging,
● Data Processing
● Data Architecture.
● taking the raw data and putting it in a
form that can be used.
52
Data Science Life Cycle Unit - 1
Data Process
● Data Mining,
● Clustering/Classification,
● Data Modeling,
● Data Summarization.
● examine its patterns, ranges, and biases
● determine how useful it will be in
predictive analysis.
53
Data Science Life Cycle Unit - 1
Data Analyze
● Exploratory/Confirmatory
● Predictive Analysis
● Regression
● Text Mining
● Qualitative Analysis
● Chief real meat of the life cycle.
● performing the various analyses on the
data
54
Data Science Life Cycle Unit - 1
Data Capture
● Data Acquisition,
● Data Entry,
● Signal Reception,
● Data Extraction
● Collection of raw structured and
unstructured data.
55
Data Science Life Cycle Unit - 1
Data Maintenance
● Data Warehousing
● Data Cleansing
● Data Staging,
● Data Processing
● Data Architecture.
● taking the raw data and putting it in a
form that can be used.
56
Data Science Life Cycle Unit - 1
Data Process
● Data Mining,
● Clustering/Classification,
● Data Modeling,
● Data Summarization.
● examine its patterns, ranges, and biases
● determine how useful it will be in
predictive analysis.
57
Data Science Life Cycle Unit - 1
Data Analyze
● Exploratory/Confirmatory
● Predictive Analysis
● Regression
● Text Mining
● Qualitative Analysis
● Chief real meat of the life cycle.
● performing the various analyses on the
data
58
Data Science Life Cycle Unit - 1
Data Communication
● Data Reporting
● Data Visualization
● Business Intelligence
● Decision Making.
● prepare the analyses in easily readable
forms such as charts, graphs, and
reports.
59
Data : Data Collection Unit - 1
❏ There are many places online to look for sets or collections of data
Sources of Data
60
Data : Sources of Data Unit - 1
freely
available in a
public domain
61
Data : Data Collection Unit - 1
Sources of Data
In 2013
Freely
WhiteHouse Restriction
available
Developed in From
Public Copyrights ,
Open Data Domain Patents
Dr. M. R. Sanghavi 62
Data : Data Collection Unit - 1
Open Data
Public: Reusable:
● freely available in a public domain, ● made available under an open license
● without restrictions from copyright, patents that places no restrictions on their use.
Accessible: Complete:
● made available in convenient, modifiable ● published in primary forms
● open formats that can be retrieved,
downloaded, indexed, and searched.
Described: Timely:
● sufficient information to understand their ● made available as quickly as necessary to
strengths, weaknesses, analytical
preserve the value of the data
limitations, and security requirements
63
Data : Sources of Data Unit - 1
64
Data : Sources of Data Unit - 1
Multimodal Data
Every device is
getting connected emerging trend of
to the Internet the Internet of
Things(IoT)
65
Data : Data Collection Unit - 1
CSV XML
Json
TSV RSS
66
Data : Data Collection Unit - 1
CSV
Open Data
Public: Reusable:
● freely available in a public domain, ● made available under an open license
● without restrictions from copyright, patents that places no restrictions on their use.
Accessible: Complete:
● made available in convenient, modifiable ● published in primary forms
● open formats that can be retrieved,
downloaded, indexed, and searched.
Described: Timely:
● sufficient information to understand their ● made available as quickly as necessary to
strengths, weaknesses, analytical
preserve the value of the data
limitations, and security requirements
68
Data : Sources of Data Unit - 1
69
Data : Sources of Data Unit - 1
Multimodal Data
Every device is
getting connected emerging trend of
to the Internet the Internet of
Things(IoT)
70
Data : Data Collection Unit - 1
CSV XML
Json
TSV RSS
71
Data : Data Collection Unit - 1
CSV
Sources of Data
TSV
Sources of Data
XML
Sources of Data
RSS
76
Data : Data Collection Unit - 1
Sources of Data
It is not only easy for humans to read and write, but also easy for machines to
parse and generate.
Data Storage and Presentation
It is based on a subset of the JavaScript Programming Language, Standard
Json ECMA-262, 3rd Edition – December 1999.18
JSON is built on two structures:
○ A collection of name–value pairs. In various languages, this is realized as
an object, record, structure, dictionary, hash table, keyed list, or
associative array.
○ An ordered list of values. In most languages, this is realized as an array,
vector, list, or sequence.
77
Data : Data Collection Unit - 1
Sources of Data
Json
78
Need of Data Wrangling Unit - 1
80
Need of Data Wrangling Unit - 1
Dimensionality Reduction
81
Need of Data Wrangling Unit - 1
Duplicates
Missing Values
Bad Characters
Outliers
Dimensionality Reduction
82
Data Wrangling Unit - 1
Definition
the process of cleaning,
organizing, and data cleaning or data
transforming raw data munging, data wrangling
In to
Desired format
Also known as
for
decision-making
Analyst
For
83
Methods of Data Wrangling Unit - 1
Data Cleaning
Data Integration
Data Reduction
Data Transformation
Data Discretization
84
Benefits of Data Wrangling Unit - 1
● Data wrangling helps to improve data usability as it converts data into a compatible format for the
end system.
● It helps to quickly build data flows within an intuitive user interface and easily schedule and
automate the data-flow process.
● Integrates various types of information and their sources (like databases, web services, files,
etc.)
● Help users to process very large volumes of data easily and easily share data-flow techniques.
85
Methods of Data Wrangling Unit - 1
Data Cleaning
Data
Data Pre-processing
Cleaning
● Data in the real world is often dirty; that is, it is in need of being
cleaned up before it can be used for a desired purpose. This is often
called data pre-processing.
86
Methods of Data Wrangling Unit - 1
Data Cleaning
Data Incomplete
Data Pre-processing Data
Cleaning
o1
87
Methods of Data Wrangling Unit - 1
Data Cleaning
Data
Data Pre-processing
Cleaning
88
Methods of Data Wrangling Unit - 1
Data Cleaning
Data Cleaning
Data
...
89
Methods of Data Wrangling Unit - 1
Data Cleaning
Key
Data Cleaning
Methods
01 02
Data Munging
Data Cleaning
Key
Data Cleaning
Methods
Data Munging
91
Methods of Data Wrangling Unit - 1
Data Cleaning
Key
Data Cleaning
Methods Consider the following text recipe.
“Add two diced tomatoes, three cloves of garlic, and a pinch of salt in the mix.”
Data Munging
92
Methods of Data Wrangling Unit - 1
Data Cleaning
Key
Data Cleaning Reasons for Missing Data
Methods
Data Cleaning
Key
Data Cleaning Reasons for Missing Data
Methods
Data Cleaning
Sensor Failed
Power Failure
Data Cleaning
Key
Data Cleaning Types of Missing Data
Methods
Data Cleaning
Key
Data Cleaning Types of Missing Data
Methods
10
1
Methods of Data Wrangling Unit - 1
1) list-wise deletion
102
Methods of Data Wrangling Unit - 1
2) Pairwise Deletion
103
Methods of Data Wrangling Unit - 1
3) Dropping Variables
104
Methods of Data Wrangling Unit - 1
105
Methods of Data Wrangling Unit - 1
106
Methods of Data Wrangling Unit - 1
107
Methods of Data Wrangling Unit - 1
108
Methods of Data Wrangling Unit - 1
109
Methods of Data Wrangling Unit - 1
4) Linear Interpolation
110
Methods of Data Wrangling Unit - 1
111
Methods of Data Wrangling Unit - 1
112
Methods of Data Wrangling Unit - 1
113
Methods of Data Wrangling Unit - 1
Data Integration
114
Methods of Data Wrangling Unit - 1
115
Methods of Data Wrangling Unit - 1
Loose Coupling:
● Here, an interface is provided that takes the query from the user, transforms it in a way the source database can
understand, and then sends the query directly to the source databases to obtain the result.
● And the data only remains in the actual source databases.
116
Methods of Data Wrangling Unit - 1
01 02
Schema Integration
Handling Redundancy
117
Methods of Data Wrangling Unit - 1
MetaData:
● Name
● Meaning
● Data type
● Range of values permitted for the attribute
118
Methods of Data Wrangling Unit - 1
Issues:
● Feature Values from different sources are
different
● Different Units , representation , scaling
119
Methods of Data Wrangling Unit - 1
● Cust_Name Matching
● DOB Converted to Age (After Conversion 2022-1992= 30 Years) Matching
● Cust_Type silver in Data Source 1=Temp in Data Source 2 Correlation Analysis
● Discount 10% in Data Source 1= Temp in DataSource 2
● It may be identical
120
Methods of Data Wrangling Unit - 1
2. Redundancy:
● An attribute may be redundant if it can be derived or obtaining from another attribute or set of attributes.
● Inconsistencies in attributes can also cause redundancies in the resulting data set.
Data Transformation
122
Methods of Data Wrangling Unit - 1
Data Smoothing
Data Aggregation
Generalization
Normalization
● Difference is 32%
128
Methods of Data Wrangling Unit - 1
129
Methods of Data Wrangling Unit - 1
Min-Max Normalization:
● It is an operation which rescales a set of data.
● This can be useful when:
○ Comparing data from two different scales
○ Converting data to a new scale
● In most situations, data is normalized to a fit a target range of [0, 1]
● The smallest value in the original set would be mapped to 0
● The largest value in the original set would be mapped to 1
● Every other value would be mapped to a value somewhere between these two bounds
130
Methods of Data Wrangling Unit - 1
Min-Max Normalization
https://www.codecademy.com/article/normalization
https://www.upgrad.com/blog/normalization-in-data-mining/ 131
Methods of Data Wrangling Unit - 1
● X = [ 7, 21, 13,15]
● xmin=7
● xmax=21
● xnew= (7-7)/(21-7)
● Repeat for all the remaining values of x
[0/14,1, 6/14,8/14]
132
Methods of Data Wrangling : Data Transformation Unit - 1
Normalization
Formula New value = (x – μ) / σ
Z-Score Normalization
Refers to
where: Standard Deviation
Process
❏ x: Original value
of ❏ µ: Mean of data
❏ σ: Standard deviation of data
Normalizing every value
In a
Normalization Solution
Example 1 the mean of the dataset is 21.2
2 the standard deviation is 29.8.
Consider the
3 To perform a z-score normalization on the first value in the dataset
dataset
● New value = (x – µ) / σ
● New value = ( 3 – 21.2 ) / 29.8
Normalization
● The mean of the normalized values is 0
z-score normalization
on every value in the ● The standard deviation of the normalized values is 1.
dataset
● The normalized values represent the number of
standard deviations that the original value is from the
mean.
Example
● The first value in the dataset is 0.61 standard
deviations below the mean.
135
Methods of Data Wrangling : Data Transformation Unit - 1
Normalization
The benefit of performing this type of normalization
z-score normalization Is that
on every value in the
dataset clear outlier
Has
been
Transformed
In such
No longer Massive Outlier way that
Normalization
Decimal Scaling
Normalization
Decimal Scaling
Reference138
Methods of Data Wrangling Unit - 1
Data Transformation
❏ New attributes are constructed and added from the given set of attributes to help the
mining process.
❏ The goal of an attribute construction method is to construct new attributes out of the
original ones
❏ Transforming the original data representation into a new one where regularities in the
data are more easily detected by the classification algorithm
140
Methods of Data Wrangling Unit - 1
Data Reduction
141
Methods of Data Wrangling Unit - 1
Data Reduction
Common Techniques for data reduction
Dimensionality Reduction
142
Methods of Data Wrangling : Data Reduction Unit - 1
It is
Data Cube Aggregation
Technique
Process
used to
Aggregate Gather information | summarized
into smallest volume
The
Data
Simpler Form
143
Methods of Data Wrangling : Data Reduction Unit - 1
Example
quarterly sales for the year 2010 to 2012 Open Spreadsheet
● summarizes the total sales per year instead of per quarter. It summarizes the data.
144
Methods of Data Wrangling Unit - 1
❏ In dimensionality reduction redundant attributes are detected and removed which reduce the
data set size.
❏ we come across any data which is weakly important, then we use the attribute required for our
analysis.
145
Methods of Data Wrangling Unit - 1
Dimensionality
Reduction Methods
Wrapper Embedded
Filter Method PCA
Method Method
146
Methods of Data Wrangling Unit - 1
147
Methods of Data Wrangling Unit - 1
Feature Selection
Feature Extraction
❏ With this technique, we generate a new feature set by extracting and combining
information from the original feature set.
148
Methods of Data Wrangling Unit - 1
Wrapper
Method
o1
Feature
Selection
Embedded o3 Methods o2 Filter
Method Method
Reference 149
Methods of Data Wrangling Unit - 1
o1 Wrapper Method
Reference 150
Methods of Data Wrangling Unit - 1
o1 Wrapper Method
Reference 151
Methods of Data Wrangling Unit - 1
o1 Wrapper Method
Forward Selection
152
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Forward Selection
153
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Forward Selection
154
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Forward Selection
155
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Forward Selection
156
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Forward Selection
157
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Forward Selection
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Forward Selection
159
Methods of Data Wrangling Unit - 1
o1 Wrapper Method
Backward Elimination
Reference 160
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Backward Elimination
Accuracy of 92%
Reference 161
Methods of Data Wrangling Unit - 1
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Backward Elimination
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Backward Elimination
Example
Data Reduction Dimensionality Reduction
fitness level prediction
o1 Wrapper Method
Backward Elimination
Gender does not have a high impact on the Fitness_Level variable. And
hence it can be dropped
Reference 164
Methods of Data Wrangling Unit - 1
o1 Wrapper Method
❏ Begin with zero features in your model and iteratively add the next
Forward Selection most predictive feature until no additional performance is reached
with the addition of another feature.
Backward Selection ❏ Begin with all features in your model and iteratively remove the
least significant feature until performance starts to drop.
Note : These methods are typically less preferred as they take a long time to compute and tend to overfit.
Reference 165
Methods of Data Wrangling Unit - 1
o2 Filter Method
● Correlation
● Chi-Square Test
● ANOVA
● Information Gain
Reference 166
Methods of Data Wrangling Unit - 1
Reference 167
Methods of Data Wrangling Unit - 1
Variance thresholds
Reference 168
Methods of Data Wrangling Unit - 1
Reference 169
Methods of Data Wrangling Unit - 1
Reference 170
Methods of Data Wrangling Unit - 1
Reference 171
Methods of Data Wrangling Unit - 1
Reference 172
Methods of Data Wrangling Unit - 1
Data Discretization
Such as
Ambient Light
Methods of Data Wrangling Unit - 1
Data Discretization
● Data discretization converts a large number of data values into smaller once
● so that data evaluation and data management becomes very easy.
Data Discretization
Data Discretization
● This type of discretization considers only the attribute being discretized and does
not use class information.
● This type of discretization considers the class value and will divide the continuous
attributes in such a way so that it provides maximum information about the class.
Methods of Data Wrangling Unit - 1
Data Discretization
Example
Methods of Data Wrangling Unit - 1
Data Discretization
Example
Methods of Data Wrangling Unit - 1
Data Discretization
Data Discretization
Data Discretization
ii.) Supervised Discretization
● This methods targets to find out the best split for given attributes.
● The spilt will be recursively calculated for each bin and the split having maximum
information gain will be treated as best split.
● Here the bins are more promising , means that the majority of the data attributes in a
particular bins will be associated with the same class label.
Methods of Data Wrangling Unit - 1
Data Discretization
ii.) Supervised Discretization
Methods of Data Wrangling Unit - 1
Data Discretization
ii.) Supervised Discretization
● In table ??, the attributes are associated with the class label:P and N. the entropy and
information gain is calculated for two different possibilities.
Methods of Data Wrangling Unit - 1
Data Discretization
Data Discretization
Methods of Data Wrangling Unit - 1
Data Discretization