
Introduction To Data Science and

Big Data Analytics


Presented By: Parag Bute

1
Why Data Science?

2
Unit I Introduction to Data Science and Big Data
● Basics and need of Data Science and Big Data, Applications of Data
Science
● Data explosion, 5 V’s of Big Data, Relationship between Data Science
and Information Science,
● Business intelligence versus Data Science, Data Science Life Cycle,
Data: Data Types, Data Collection.
● Need of Data wrangling, Methods: Data Cleaning, Data Integration,
Data Reduction, Data Transformation, Data Discretization.

3
What is Big Data

Data has grown in the last 5-7 years like never before. Large amounts of data are being generated every day in every
business sector.

In recent years, organizations have started to understand the value of Big Data and have decided not to ignore
any data as uneconomical. Some of the platforms where this data is generated are:
1. Social media: e.g. Facebook, WhatsApp, Instagram, Twitter, etc.
2. E-commerce: e.g. Amazon, eBay, Flipkart, etc.
3. Tech giants: e.g. Google, Apple, Oracle, etc.
So a huge amount of data is generated every day; an average smartphone user generates somewhere around 40
exabytes of data in one month.

4
These large amounts of data, coming from multiple sources and of various types (structured, unstructured,
semi-structured, etc.), are known as Big Data.
Therefore, we can say that Big Data is the term for a collection of datasets so large and complex that it becomes
difficult to process them using traditional data processing applications.

5
5 V’s of Big data (Characteristics)
● Now the question is: how do we classify data as Big Data, and how do we know which kind of data
is hard to process?

To answer this question, we have the 5 V's, which help us classify data as Big Data:

1. Volume
2. Variety
3. Velocity
4. Validity
5. Value

6
● Volume refers to the huge amount of data that is produced each day by companies. The generation of data
is so large and complex that it can no longer be saved or analyzed using conventional data processing
methods.
● Variety refers to the diversity of data types and data sources. 80% of the data in the world today
is unstructured and at first glance does not show any indication of relationships. Thanks to Big
Data algorithms, such data can be sorted in a structured manner and examined for relationships.
Data does not always comprise only conventional datasets but also media files like images, videos
and audio.
● Velocity refers to the speed at which data is generated, analyzed and reprocessed. Today
this is often possible within a fraction of a second, known as real time.
● Validity is the guarantee of data quality; alternatively, Veracity is the authenticity and
credibility of the data. Big Data involves working with all degrees of quality, since the volume factor
usually results in a shortage of quality.
● Value denotes the added value for companies. Many companies have recently established their
own data platforms, filled their data pools and invested a lot of money in infrastructure. It is now
a question of generating business value from their investments.
7
For example, let's take an example from the healthcare industry.

Hospitals and clinics across the world generate about 2,314 exabytes of data annually;
this shows the Volume.

All this data, in the form of patient records and test results, is generated at
very high speed; this is the Velocity.

Variety refers to the various data types, such as structured (e.g. Excel files), semi-structured (e.g. log files) and
unstructured (e.g. X-ray images).

The accuracy and trustworthiness of the generated data is termed Veracity.

Analysing all this data will benefit the medical sector by enabling faster disease detection, better treatment and
reduced cost. Hence this shows the Value.

8
9
What is Big Data Analytics
● Big Data analytics examines large and complex data to uncover hidden patterns, correlations and other
insights.
● Basically, it helps large organizations facilitate their growth and development. It majorly
involves applying various data mining techniques on a given set of data, which then helps
these organizations take better business decisions.
● On a broader scale, data analytics technologies and techniques give organizations a way to analyze data
sets and gather new insights. Business Intelligence queries answer basic questions about business
operations and performance.
● Big data analytics is a form of advanced analytics, which involves complex applications with elements
such as predictive models, statistical algorithms, and what-if analysis powered by analytics systems.

10
What is Big Data Analytics (Contd.)
● Big data analytics is important for organizations because they can use big data analytics systems and software
to make data-driven decisions that can improve business-related outcomes.
● The benefits may include more effective marketing, new revenue opportunities, customer
personalization and improved operational efficiency. With an effective strategy, these benefits can
provide competitive advantages over rivals.

11
How Does Big Data Analytics Work?
● Data analysts, data scientists, predictive modelers, statisticians and other analytics professionals collect, process, clean and
analyze growing volumes of structured transaction data as well as other forms of data not used by conventional BI and
analytics programs.
● Here is an overview of the four steps of the data preparation process:
1. Data professionals collect data from a variety of different sources. Often, it is a mix of semi-structured and
unstructured data. While each organization will use different data streams, some common sources include:
○ internet clickstream data,
○ web server logs,
○ cloud applications,
○ mobile applications,
○ social media content,
○ text from customer emails and survey responses,
○ mobile phone records and
○ machine data captured by sensors connected to the internet of things (IoT).

12
How Does Big Data Analytics Work? (Contd.)
● Data is processed. After data is collected and stored in a data warehouse or data lake, data
professionals must organize, configure and partition the data properly for analytical queries. Thorough
data processing makes for higher performance from analytical queries.
● Data is cleansed for quality. Data professionals scrub the data using scripting tools or enterprise
software. They look for any errors or inconsistencies, such as duplications or formatting mistakes,
and organize and tidy up the data.
● The collected, processed and cleaned data is analyzed with analytics software. This includes tools
for:
○ data mining, which sifts through data sets in search of patterns and relationships
○ predictive analytics, which builds models to forecast customer behavior and other future
developments
○ machine learning, which taps algorithms to analyze large data sets
○ deep learning, which is a more advanced offshoot of machine learning
○ text mining and statistical analysis software
○ artificial intelligence (AI)
○ mainstream business intelligence software
○ data visualization tools
13
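The four steps above can be illustrated with a minimal Python sketch (not part of the original slides); the tiny in-memory dataset, its column names and the choice of KMeans clustering for the analysis step are assumptions made only for illustration.

import pandas as pd
from sklearn.cluster import KMeans

# 1. Collect: in practice this would come from logs, apps or IoT sensors;
#    here a tiny in-memory sample (invented values) stands in for the collected data.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4, None],
    "pages_viewed": [3, 3, 12, 5, 40, 2],
    "session_seconds": [60, 60, 300, 90, 1200, 30],
})

# 2. Process: organize the data - keep only the columns needed for the analysis.
df = df[["user_id", "pages_viewed", "session_seconds"]]

# 3. Cleanse: remove duplicate rows and rows with missing values.
df = df.drop_duplicates().dropna()

# 4. Analyze: a simple data-mining step - cluster users by behaviour.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["segment"] = model.fit_predict(df[["pages_viewed", "session_seconds"]])
print(df)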
What is the need for Big Data Analytics?
Big Data Analytics contributes to the following factors, and organizations are adopting it because it leads them towards:
1. Making Smarter and More Efficient Organizations
2. Optimizing Business Operations by analysing customer behaviour
3. Cost Reduction
4. Next Generation Products
There has been enormous growth in the field of Big Data analytics because of the benefits of the technology.
This has led to the use of big data in multiple industries, including:
1. Banking
2. Healthcare
3. Energy
4. Technology
5. Consumer
6. Manufacturing

14
● There are many other industries which use big data analytics. Banking is seen as the field making
the maximum use of Big Data Analytics.
● The education sector is also making use of data analytics in a big way. There are new options for research
and analysis using data analytics. Institutional data can be used for innovation with the technical tools
available today.
● Due to the immense opportunities, data analytics has become an attractive option for students to study as well.
● The insights provided by big data analytics tools help in knowing the needs of customers better. This
helps in developing new and better products. Improved products and services based on new insights can help
the firm enormously. This may help the customers too, as they get better offerings that satisfy their needs
effectively.
● All in all, data analytics has become an essential part of companies today.

15
Data Formats
● Big Data comes in different formats, i.e. it can be machine generated (such as
log files) or human generated (such as tabular data).
● Largely, we can divide the data formats into:

1. Structured data 2. Semi-structured data 3. Unstructured data

● Structured data has a particular order for storing and working with it. It is usually
generated by machines or compiled by humans.
● Ex: spreadsheets of customer records or sales reports; it can be in spreadsheet or CSV
format.

16
Data Formats (Contd)
● Semi-structured data has some definitive patterns for storage, but the data
attributes may not be interrelated.
● The data can be hierarchical or graph based. It is commonly stored as text files
in formats such as XML and JSON.
● It is usually machine generated, such as website feeds, sensors or other
application programs.
● Unstructured data does not exhibit a fixed pattern or particular schema. This is
the most common format in Big Data.
● Examples are video, audio, tweets, likes, shares, text documents, PDFs and scanned
images.
● Special tools are used to process unstructured data.
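As a rough illustration of the three formats (not from the slides), the following Python sketch reads a structured CSV table, a semi-structured JSON record and an unstructured text string; the inline data is made up.

import io, json
import pandas as pd

# Structured: tabular data with a fixed schema (rows and columns).
csv_text = "name,amount\nAsha,120\nRavi,80\n"
sales = pd.read_csv(io.StringIO(csv_text))

# Semi-structured: JSON has a recognisable pattern (keys and values) but a flexible schema.
json_text = '{"user": "Asha", "likes": ["python", "data"], "age": 30}'
record = json.loads(json_text)

# Unstructured: free text (or images, audio, video) with no fixed schema;
# special tools (e.g. NLP or computer vision) are needed to extract structure from it.
tweet = "Loving the new phone! The battery could be better though..."

print(sales.shape, record["likes"], len(tweet.split()))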

17
Data Pyramid
● When data is processed, it gives information.
● When information is processed, it gives knowledge.
● When knowledge is processed, you get wisdom.

Data       Information   Knowledge   Wisdom
100121     10-01-21      Birthday    Buy a gift

18
What is Data Science?
● Data science is a cross-disciplinary set of skills.
● It comprises three distinct and overlapping areas:
the skills of a statistician, the skills of a computer
scientist, and domain expertise.
● Statistician ⇒ knows how to model and
summarize datasets.
● Computer scientist ⇒ knows how to design and
use algorithms to effectively store, process and
visualize the data.
● Domain expert ⇒ has the domain knowledge to
formulate the right questions and to put their
answers in context.

19
Data Science (Contd.)
● Data science is a blend of statistics, maths, domain (subject) knowledge and
programming skills.
● If you can do all three, you are already highly knowledgeable in the field of data
science.
● Data science deals with huge amounts of data and includes data cleansing,
preparation, analysis and predictive modeling.
● Data scientists understand data from a business point of view and can provide
accurate predictions and insights that can be used to power critical business
decisions.

20
Data Science (Contd.)
● To summarize, data science is not a new domain of knowledge to learn but a new
set of skills that we can apply within our current area of expertise.
● It covers areas like forecasting certain parameters and predicting behaviours or patterns.
● Examples: flood forecasting, demand forecasting, financial forecasting.

21
Types of Big Data Analytics
● Descriptive Analytics
Descriptive analytics looks to the past for answers. However, while diagnostic analytics asks why something happened,
descriptive analytics asks what happened.
Summary statistics, clustering and segmentation are techniques used in descriptive analytics. The goal is to dig into the
details of what happened, but this can be time sensitive, as it is easier to do a descriptive analysis with more
recent data.
● Diagnostic Analytics
This type of data analytics is used to help determine why something happened; diagnostic analytics reviews data related
to a past event or situation. It typically uses techniques like data mining, drilling down and correlation to analyze a
situation.
It is often used to help identify customer trends.

22
Types of Big Data Analytics
● Predictive Analytics
Predictive analytics attempts to forecast the future using statistics, modeling, data mining and machine learning to home in on
suggested patterns. It is the most commonly used type of analytics and typically focuses on predicting the outcome of specific
scenarios in relation to different potential responses from a company to a situation. There are different types of predictive analytics
models, but usually they all use a scoring system to indicate how likely an outcome is to occur.

● Prescriptive Analytics
Prescriptive analytics, along with descriptive and predictive analytics, is one of the three main types of analytics companies use to
analyze data. This type of analytics is sometimes described as a form of predictive analytics, but it is a little different in its focus.
The goal of prescriptive analytics is to conceive the best possible recommendations for a situation as it unfolds, given what the
analyst determines from the available data. Think of prescriptive analytics as working in the present, while predictive looks to the
future, and descriptive explores the past.
23
Growth of Data Science - KPIs
● Huge amounts of data are available.
● Data Storage is efficient.
● Increased Computational Power.

24
Python - For Data Science
● Python has emerged as a major tool for scientific computing tasks, including the
analysis and visualization of large datasets.
● The usefulness of Python for data science stems primarily from its large and active
ecosystem of packages such as NumPy, Pandas, SciPy, Matplotlib and Scikit-Learn.
● NumPy: used to manipulate homogeneous array-based data.
● Pandas: used to manipulate heterogeneous and labelled data.
● SciPy: used for common scientific tasks (e.g. statistical distributions).
● Matplotlib: used for publication-quality data visualization.
● Scikit-Learn: used for machine learning.
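A minimal sketch of these packages working together (the data is synthetic and the example is illustrative, not from the slides):

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")            # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 50)                          # NumPy: homogeneous array-based data
y = 2.5 * x + np.random.default_rng(0).normal(0, 1, 50)

df = pd.DataFrame({"x": x, "y": y})                 # Pandas: labelled, heterogeneous data
print(stats.describe(df["y"]))                      # SciPy: common statistical tasks

model = LinearRegression().fit(df[["x"]], df["y"])  # Scikit-Learn: machine learning
print("estimated slope:", model.coef_[0])

df.plot.scatter(x="x", y="y")                       # Matplotlib (via pandas): visualization
plt.savefig("fit.png")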

25
Machine Learning
● Machine learning is the primary means by which data science manifests itself to the broader world.
● Machine learning is where the computational and algorithmic skills of data
science meet its statistical thinking.
● Machine learning involves building mathematical models to help understand
data.
● Learning enters the fray when we give these models tunable parameters that can be
adapted to observed data.
● In this way we can say that the program learns from the data.
● Once these models are fitted to historical data, they can be used to predict aspects
of newly observed data.
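As a small illustration of tunable parameters being adapted to observed data, the sketch below fits a linear regression to made-up historical values and predicts for new inputs (an assumption chosen purely as an example):

import numpy as np
from sklearn.linear_model import LinearRegression

history_x = np.array([[1], [2], [3], [4], [5]])    # historical observations
history_y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])    # historical outcomes

model = LinearRegression()        # the slope and intercept are the tunable parameters
model.fit(history_x, history_y)   # they are adapted (learned) from the observed data

new_x = np.array([[6], [7]])      # newly observed data
print(model.predict(new_x))       # predicted values for the new data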

26
What is Machine Learning?
● Use of algorithms to extract data, learn
from data and forecast future data.
● Here the machine learns from the data.
● Examples: social media and streaming services
(Facebook, Netflix, Amazon Prime, etc.)

Img Source: https://tlp.iasbaba.com/2019/01/day-49-q-1-what-is-machine-learning-what-are-its-applications/ 27


Categories of Machine learning
● Supervised Learning: involves modeling the relationship between measured features
of the data and some label associated with the data.
● It is further subdivided into Classification (labels are discrete) and Regression (labels
are continuous).
● Unsupervised Learning: involves modeling the features of a dataset without reference to
any label.
● Often described as "letting the dataset speak for itself".
● It includes tasks such as Clustering and Dimensionality Reduction.
● Clustering identifies distinct groups of data.
● Dimensionality reduction searches for a more succinct representation of the data.
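A short scikit-learn sketch of both categories, using the bundled Iris dataset purely as an illustrative example:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Supervised: the features X are modelled together with the labels y (classification).
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: only the features are used, without reference to any label.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # clustering
X_2d = PCA(n_components=2).fit_transform(X)                                # dimensionality reduction
print(clusters[:10], X_2d.shape)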

28
Supervised Learning

29
Unsupervised Learning
● Let's take an example of a baby and her dog.

30
Unsupervised Learning
● The baby has not seen this dog earlier but
recognizes it.
● This is Unsupervised Learning.

31
Data Analytics Life Cycle
The data analytics life cycle is broadly classified into six
phases:

1. Discovery
2. Data Preparation
3. Model Planning
4. Model Building
5. Communicate Results
6. Operationalise

32
1. Discovery
● The data science team learns about and researches the problem.
● The team creates context and gains understanding.
● It learns about the data sources that are needed and accessible for the project.
● The team comes up with an initial hypothesis, which can later be confirmed with
evidence.

33
2. Data Preparation
● The team investigates the possibilities of pre-processing, analysing and
preparing the data before analysis and modelling.
● It is required to have an analytic sandbox. The team extracts, loads and
transforms data to bring it into the sandbox.
● Data preparation tasks can be repeated and are not performed in a predetermined sequence.
● Some of the tools commonly used for this process include Hadoop, Alpine
Miner, OpenRefine, etc.

34
2. Data Preparation (Contd.)
● Perform ETL (Extract, Transform and Load): the data is extracted from the
datastore, transformed as deemed right (removing noise and outliers from the data) and
then loaded into the datastore again for analysis.
● Once the ETL process is over, the team spends time learning about the data and
its attributes, i.e. understanding the data, which is key to building a good
machine learning model.
● Data is also conditioned (cleaned and normalised) before performing further
transformations.
● Data visualization is done with the clean data to identify patterns and
explore the data characteristics.
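A minimal ETL-style sketch with pandas and SQLite; the table name, column names, values and the crude outlier rule are assumptions made only for illustration:

import sqlite3
import pandas as pd

# Extract: raw records, as if pulled from a source datastore (values invented).
raw = pd.DataFrame({"customer": ["A", "A", "B", "C"],
                    "amount":   [100.0, 100.0, None, 9999.0]})

# Transform: remove duplicates, missing values and an obvious outlier, then normalise.
clean = raw.drop_duplicates().dropna()
clean = clean[clean["amount"] < 5000].copy()               # crude illustrative outlier rule
clean["amount_scaled"] = clean["amount"] / clean["amount"].max()

# Load: write the conditioned data back into a datastore for analysis.
with sqlite3.connect("sandbox.db") as conn:
    clean.to_sql("sales_clean", conn, if_exists="replace", index=False)

print(clean)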

35
Tools used for Data Preparation
1. Apache Hadoop: a framework that allows the distributed processing of large
data sets across clusters of computers using simple programming
models.
2. Apache Kafka: a distributed streaming platform in which you can publish
and subscribe to streams of records, store streams of records and process streams
of records as they occur.
3. Alpine Miner: provides a graphical interface for creating analytics workflows
and is optimised for fast experimentation, collaboration and the ability to work
within the database itself.
4. OpenRefine: a powerful tool for working with messy data. It cleans the data
and transforms it from one format into another.

36
3. Model Planning
● In this phase the team explores and evaluates the possible data models that could be applied to
the given data.
● Data exploration: the data is explored to find patterns and relationships in it, by
consulting with subject matter experts, stakeholders and analysts.
● Model selection: we choose the analytical technique based on the dataset and the
desired outcome. Based on the type of data (structured, semi-structured or unstructured),
different techniques can be chosen and applied.
● R and Python: languages and environments for statistical computing and graphics.
● SQL Server Analysis Services: supports tabular models at all compatibility levels,
multidimensional models and data mining. It provides the analytical data engine used in
decision support and business intelligence (BI) solutions.
● SAS: provides integration between SAS and the analytics sandbox via multiple
data connectors.

37
4. Model Building
In this phase the team starts to build the data analytics model. The available dataset is
divided into:
1. Training dataset
It is used to train the model
2. Testing dataset
It is used to test the model
Some of the common tools used to build the model are R, Python, WEKA and MATLAB.

38
4. Model Building (contd.)
● The training dataset is used to train (design) the model; once the team
is confident about the model, it tests the model using the testing dataset.
● The ready model is then used in production, i.e. the go-live environment.
● New datasets can be applied to it to get the desired results.
● R is used for statistical computing and graphics.
● GNU Octave is a scientific programming language with a powerful
mathematics-oriented syntax and built-in plotting and visualization tools.
● WEKA is free data mining software with an analytic workbench.
● Python is a programming language that provides toolkits for machine learning and
analysis, such as scikit-learn, NumPy, pandas and Matplotlib.
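An illustrative train/test sketch with scikit-learn (the Iris dataset and logistic regression are chosen only as an example):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Divide the available dataset into a training dataset and a testing dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                            # train the model on the training dataset
print("test accuracy:", model.score(X_test, y_test))   # evaluate it on the testing dataset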

39
5. Communicate Results
● Once the model is built, tested and executed, the results generated by the model
are communicated to stakeholders for taking business decisions.
● If the model building process is unsuccessful, then we need to build a new
model.
● It is an iterative process.

40
6. Operationalise
● In this stage the model is deployed in a staging environment before it goes live
on a wider scale.
● The idea is to ensure that the model sustains the performance requirements and
other execution constraints, and that any issues are identified before the model is
deployed in the production environment.
● Any changes which are required can then be carried out and tested again (defect
raising and testing).

41
Data Wrangling
● Data wrangling is the process of cleaning messy data to make it more appropriate and useful
for a variety of downstream purposes such as analytics.
● It is also called pre-processing of the data.
● Data always comes in raw form, which often contains noise, missing values and
improper data formats.

42
Tasks
● Why do we collect data?
● What starts out as a promising approach may not work in reality. What was
originally just a hunch may end up leading to the best solution.
● Working with data is often a multistage and iterative process.
● Example: stock price data at an exchange is stored in a database, bought by a company,
converted into a Hive store on a Hadoop cluster, pulled out of the store by a script,
cleaned by another script, dumped to a file, and converted to a format that you
can try out in your modelling library.
● The predictions are then dumped back out to a CSV file and parsed by an
evaluator, and the model is iterated multiple times using scripting languages.
● The final results are then pumped back into the database system.

43
Models
● Trying to understand the world through data is like trying to piece together reality
using a noisy, incomplete jigsaw puzzle with a bunch of extra pieces.
● This is where mathematical modelling, in particular statistical modelling, comes in.
● The language of statistics contains concepts for many frequent characteristics of
data, such as wrong, redundant or missing values. Wrong data is the result of mistakes in
measurement.
● Redundant data contains multiple aspects that convey exactly the same
information. For instance, the day of the week may be present as a categorical variable
with values "Monday", "Tuesday", .... If this day-of-week information is not
present for some points, then you have missing data on your hands.

44
Models (contd)
● A mathematical model of data describes the relationship between different aspects
of the data. For instance, a model that predicts stock prices might be a formula that
maps a company's earnings history, past stock prices and industry to the predicted
stock price.
● A model that recommends music might measure the similarity between users
(based on their listening habits) and recommend the same artists to users who
have listened to a lot of the same songs.
● Mathematical formulas relate numeric quantities to each other. But raw data is
often not numeric.
● For example, "Sachin bought a Mercedes S9 series on Monday" is not numeric.
● Similarly, product reviews may not be numeric.
● This is where features come in.

45
Features
● A feature is anything that you can measure and build data for, such as length
variations in animals.
● Features can be numeric, sets of characters, boolean values or anything else that
describes the data.
● In machine learning, mathematical models require features to be numeric so
that they can be used in various computations.
● So we can say that a feature is a numeric representation of raw data.
● Features are also called dimensions. Data having n features is called an
n-dimensional data set.

46
Features (Contd)
● The dataset below has 2 dimensions or features: Gender and Marks.
● Here we can see that the Gender field is not numeric, so how can we say that it has
2 features?
● We can apply feature engineering to it to make it more meaningful and
computationally more appropriate.
● We can assign values:
0 ⇒ for Girls
1 ⇒ for Boys
● Now the data is numeric and we can say that it has 2 features.

Gender   Marks
Girl     65
Girl     25
Boy      85
Boy      47
Girl     46
Girl     95
Boy      37
Boy      85
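The encoding described above can be done in one line with pandas; this illustrative sketch reproduces the table from the slide and maps Gender to 0/1:

import pandas as pd

df = pd.DataFrame({"Gender": ["Girl", "Girl", "Boy", "Boy", "Girl", "Girl", "Boy", "Boy"],
                   "Marks":  [65, 25, 85, 47, 46, 95, 37, 85]})

# Map the categorical Gender column to numbers: 0 for Girls, 1 for Boys.
df["Gender"] = df["Gender"].map({"Girl": 0, "Boy": 1})
print(df)   # both columns are now numeric, so the dataset has 2 usable features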
47
Feature Engineering
● There are many ways in which we can convert raw data into numeric data,
which is why features can end up looking like a lot of things.
● Naturally, features must derive from the type of data that is available. Features are
also tied to the model: some models are more appropriate for some types of
features and vice versa.
● The right features are those which are relevant to the task at hand and easy
for the model to ingest.
● Feature engineering is the process of formulating the most appropriate features
given the data, the model and the task.

48
Feature Engineering (Contd)
● In machine learning, features and models are the most important aspects for
getting insights from raw data.
● Good features make a good model.
● The number of features is also important. If there are not enough informative
features, the model will be unable to perform the desired task.
● If there are too many features, or if most of them are irrelevant, the model will be
more expensive and difficult to train.
49
Feature Engineering (Contd)
● It includes:

Feature Creation

Feature Transformation

Feature Extraction

Feature Selection

50
Feature Engineering (Contd)
● Feature creation identifies the features in the dataset that are relevant to the
problem at hand.
● Feature transformation handles replacing missing values or feature values that are not
valid.
● Feature extraction is the process of creating new features from existing features,
typically with the goal of reducing the dimensionality of the features.
● Feature selection is the filtering of irrelevant or redundant features from your dataset.
This is usually done by observing variance or correlation thresholds to determine
which features to remove.

51
Feature Engineering (Contd)
● Raw data engineering (data pre-processing) is often confused with feature
engineering.
● Data engineering is the process of converting raw data into prepared data. Feature
engineering then tunes the prepared data to create the features expected by the
machine learning model.
● Raw data: data in its raw form, without any preparation for machine learning.
● Prepared data: data which is ready for machine learning tasks.
● Engineered features: a dataset with tuned features, i.e. we can perform certain
machine learning operations on the columns, such as scaling numerical columns to a
value between 0 and 1 or encoding categorical columns numerically.
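A small sketch of tuning prepared data into engineered features by scaling to the 0-1 range; the column names and values are hypothetical:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

prepared = pd.DataFrame({"age":    [22, 35, 47, 58],
                         "salary": [300000, 650000, 900000, 1200000]})

scaler = MinMaxScaler()   # rescales every column to the range [0, 1]
engineered = pd.DataFrame(scaler.fit_transform(prepared), columns=prepared.columns)
print(engineered)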

52
Data Wrangling Methods
● Data Cleaning
● Data integration
● Data reduction
● Data transformation
● Data discretization

53
Data Cleaning
● Data cleaning is applied to remove noise and correct inconsistencies in the data.
● It is the routine work of cleaning data by replacing missing values, smoothing noisy
data and identifying or removing outliers.
● If the data is dirty, it can lead to inaccurate results.
● Example: Monday could be represented as Mon, 1 or M.
● Inconsistent data: USA can be represented as US, United States, America or
United States of America.
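An illustrative pandas sketch of these routine cleaning steps (the values and the choice of filling a missing entry with the column mean are assumptions):

import pandas as pd

df = pd.DataFrame({"day":     ["Mon", "Monday", "M", "Tue"],
                   "country": ["USA", "US", "United States", "India"],
                   "sales":   [100, None, 120, 90]})

# Standardise inconsistent representations of the same value.
df["day"] = df["day"].replace({"Mon": "Monday", "M": "Monday", "Tue": "Tuesday"})
df["country"] = df["country"].replace({"US": "USA", "United States": "USA"})

# Replace a missing value (here with the column mean) instead of dropping the row.
df["sales"] = df["sales"].fillna(df["sales"].mean())
print(df)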

54
Data Integration
● Data integration merges data from multiple sources into a coherent data store such as a data
warehouse.
● It may lead to inconsistencies in the data due to redundancy,
e.g. customer identification in one dataset may be referred to as cust_id in another.
● Naming inconsistencies may also occur, such as the first name "Bill" in one database and
"B" in another database.
● A large amount of redundancy slows down processing or confuses knowledge
discovery.
● Data cleaning and data integration are performed as a pre-processing step when
preparing data for a data warehouse.

55
Data Reduction
● Data reduction can reduce the data size by aggregating, eliminating redundant
features or clustering.
● Data reduction obtains a reduced representation of the data set which is smaller
in volume yet produces the same analytical results.
● Data reduction strategies include dimensionality reduction and numerosity
reduction.
● Here we perform data encoding so as to obtain a reduced or compressed representation
of the original data.
● Dimensionality reduction techniques help you to reduce the number of
dimensions, keeping only the important dimensions of the data and discarding all other
dimensions.
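As one illustration of dimensionality reduction (one common technique, chosen here as an example), the sketch below uses PCA to reduce the Iris dataset from 4 dimensions to 2:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)     # 150 samples with 4 dimensions each
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # the same samples described by only 2 dimensions

print(X.shape, "->", X_reduced.shape)
print("variance kept:", pca.explained_variance_ratio_.sum())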

56
Data Reduction (Contd)
● In machine learning, the complexity depends upon the number of dimensions
(inputs) as well as the size of the sample, so it becomes increasingly harder to work
with data having a lot of dimensions. This is termed the curse of dimensionality.
● To make the problem computationally manageable, we need to reduce the
dimensions, which also reduces the complexity.

Income   Credit Score   Age   Location    Give Loan
50,000   High           34    Mumbai      Yes
75,000   Low            33    Bangalore   No
80,000   High           37    Kolkata     Yes
90,000   High           29    Pune        Yes


57
Data Reduction (Contd)
● Here the dataset contains 4 dimensions based on which loan approval seems to be
granted. The dimensions other than credit score are not that important for
taking the loan approval decision.
● So the same dataset can be reduced to 2 dimensions: credit score and give
loan.
● This makes the algorithm computationally less expensive and also makes the dataset
easier to understand and visualize.
● The aim of dimensionality reduction is not to compromise the quality of the
data when discarding unnecessary dimensions, but to keep only the dimensions
which matter the most without any significant loss of quality.

58
4. Data Transformation
● Data transformation may be applied where data are scaled to fall within a smaller range, like
0.0 to 1.0. This can improve the accuracy and efficiency of modelling algorithms
involving distance measurements.
● Suppose you have customer data which contains attributes like age and annual salary. The
annual salary attribute usually takes much larger values than age.
● So if the attributes are left un-normalised, distance measurements taken on
annual salary will generally outweigh distance measurements taken on age.
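A small numeric sketch of this effect: with made-up customer values, the raw distance is dominated by salary, while after min-max scaling age contributes comparably:

import numpy as np

customers = np.array([[25, 300000.0],    # each row: [age, annual salary]
                      [26, 900000.0],
                      [55, 310000.0]])

def dist(a, b):
    return np.linalg.norm(a - b)        # Euclidean distance

print("raw:    d(0,1) =", dist(customers[0], customers[1]),
      " d(0,2) =", dist(customers[0], customers[2]))

# Min-max scale each column to [0, 1] so that age and salary contribute comparably.
mins, maxs = customers.min(axis=0), customers.max(axis=0)
scaled = (customers - mins) / (maxs - mins)

print("scaled: d(0,1) =", dist(scaled[0], scaled[1]),
      " d(0,2) =", dist(scaled[0], scaled[2]))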

59
Log Transform
● The log transform is a powerful tool for dealing with large positive numbers with a heavy-tailed
distribution.
● A heavy-tailed distribution places more entries towards the tail end of the plot rather
than the center. The log transform compresses the long tail at the high end of the distribution into a shorter
tail and expands the low end into a longer head.
● The log transform is commonly used when you are dealing with large counts. For example,
some businesses have lots of reviews (over 2,000) and some have only a few (20).
● In such cases it becomes difficult to compare and correlate one business with another,
because the large count in one element of the data set would outweigh the similarity in all
other elements.
● This could throw off the entire similarity measurement for various machine
learning algorithms.
● The log transform helps in dealing with such skewed data.
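An illustrative sketch with made-up review counts, using NumPy's log1p:

import numpy as np

review_counts = np.array([5, 20, 35, 80, 2000])

print("raw:  ", review_counts)
print("log1p:", np.round(np.log1p(review_counts), 2))   # log(1 + x) also handles zero counts safely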

60
5. Data Discretisation
● Data discretisation is a concept in which we replace raw data values with ranges or higher
conceptual levels.
● Quantization or binning is a technique in which we create a new feature by
combining existing features, i.e. we can group the values into bins.

Site Length   Site Breadth   Site Price
30            40             40 lakhs
40            32             40 lakhs
30            30             30 lakhs
35            45             45 lakhs
40            60             60 lakhs
60            80             90 lakhs
61
6. Binning or quantization

Site Length   Site Breadth   Site Area   Site Price
30            40             1200        40 lakhs
40            32             1280        40 lakhs
30            30             900         30 lakhs
35            45             1575        45 lakhs
40            60             2400        60 lakhs
60            80             4800        90 lakhs
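A pandas sketch of the idea shown in the table: Site Area is created by combining two existing features and is then grouped into bins (the bin edges and labels are chosen only for illustration):

import pandas as pd

sites = pd.DataFrame({"length":  [30, 40, 30, 35, 40, 60],
                      "breadth": [40, 32, 30, 45, 60, 80]})

sites["area"] = sites["length"] * sites["breadth"]                  # feature created by combining features
sites["size_band"] = pd.cut(sites["area"],
                            bins=[0, 1000, 2000, 5000],
                            labels=["small", "medium", "large"])    # discretisation / binning
print(sites)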

62
6. Binning or quantization (Contd)
● Binning can be carried out on both numerical and categorical data.

BMI           Nutritional Status
Below 18.5    Underweight
18.5-24.9     Normal Weight
25.0-29.9     Pre-obesity
30.0-34.9     Obesity class 1
35.0-39.9     Obesity class 2
Above 40      Obesity class 3

63
