
WELCOME TO ALL

Dear Student of
I M.SC Applied Data Science
by

Dr. M. Pandiyan M.C.A.,M.Phil,NET.,SET.,Ph.D


Assistant Professor,
Department of Computer Application,
SRMIST, FSH Campus,
Chennai.
Machine Learning For Data Science
Unit-1: Chapters-ML For DS
• Introduction to Machine Learning
• Definition and types of Machine Learning
• Machine Learning process, Stages
• Machine Learning Development Lifecycle
• Machine Learning Workflow
• Machine Learning Training Process
• Machine Learning Platforms
• Collecting and Manipulating data
• Machine Learning in data
• Data Modeling
• Data Processing
Unit-1: Chapters-ML For DS
• Architecture for ML in Enterprises, Software
• Architecture to Model ML Apps in Production
• Model Machine Learning apps
• ML Reference Architecture
• Implementing Data Preprocessing
• Building Blocks
• Evolvable Architectures
• Migration, Pitfalls of Evolutionary Architecture
• Anti patterns, Setting Up ML Solutions
• Fitness Function and Categories
• Architecture for Refinement and Production Readiness
• Describing Similarity Neighborhoods
Today, researchers talk a lot about:

Why? What is it?
The relationship between AI, ML and DL
What is machine learning?
• Machine Learning is a subset of Artificial Intelligence.
Machine Learning is the study of making machines more
human-like in their behavior and decisions by giving them
the ability to learn and develop their own programs.
• This is done with minimum human intervention, i.e., no
explicit programming.
• The learning process is automated and improved based
on the experiences of the machines throughout the
process.
• Good quality data is fed to the machines, and different
algorithms are used to build ML models to train the
machines on this data.
Heroes of Machine Learning
• Geoffrey Hinton is an Emeritus Distinguished
Professor at the University of Toronto and a
Google Brain researcher.
• He is best known for his work on artificial
neural networks (ANNs). His contributions in
the field of deep learning are a key reason
behind the success of the field and he is often
called the “Godfather of Deep Learning”
(with good reason).
• Hinton's other notable research works include Boltzmann machines and capsule neural networks, both major breakthroughs in the field.
• Hinton won the 2018 Turing Award for his groundbreaking work on deep neural networks, along with Yann LeCun and Yoshua Bengio.
• He has also won the BBVA Foundation
Frontiers of Knowledge Award (2016) and
IEEE/RSE Wolfson James Clerk Maxwell
Award.
Michael Jordan is a professor at the
University of California, Berkeley.
His areas of research are machine learning,
statistics, and deep learning.
He has been a major advocate
of Bayesian networks and has made a
significant contribution towards probabilistic
graphical models, spectral methods, natural
language processing (NLP) and much more.
He has won many well-known awards,
including the IEEE Neural Networks Pioneer
Award, the best paper award (with R. Jacobs)
at the American Control Conference (ACC
1991) and the ACM – AAAI Allen Newell
Award. He has also been named a Neyman
Lecturer and a Medallion Lecturer by the
Institute of Mathematical Statistics.

https://www.youtube.com/watch?v=S_YabTj0
Andrew Ng is considered one of the most significant researchers in Machine Learning and Deep Learning today.

He is the co-founder of Coursera and deeplearning.ai and an Adjunct Professor of Computer Science at Stanford University. Professor Ng also co-founded the Google Brain project and was previously the Chief Scientist at Baidu.

Andrew has an exceptional track record as an academic researcher: he has over 300 published papers in machine learning and robotics. He is also a recipient of prestigious awards such as the IJCAI Computers and Thought Award, the ICML Best Paper Award, the ACL Best Paper Award and many more.

https://www.youtube.com/watch?v=e0WKJLovaZg
How does it differ from traditional programming?
• In traditional programming, we would feed the
input data and a well written and tested
program into a machine to generate output.
• When it comes to machine learning, input data
along with the output is fed into the machine
during the learning phase, and it works out a
program for itself.
• To understand this better, refer to the
illustration below:
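Since the original illustration is not reproduced here, the contrast can also be sketched in code. Below is a minimal, invented example (assuming scikit-learn is available); it is not from the slides, just an illustration of the two approaches.

```python
# Traditional programming: a human writes the rule explicitly.
def is_adult_traditional(age):
    return age >= 18

# Machine learning: we supply inputs AND outputs, and the algorithm
# works out the "program" (here, the threshold) for itself.
from sklearn.tree import DecisionTreeClassifier

X = [[5], [12], [16], [17], [18], [21], [30], [45]]   # input data (ages)
y = [0, 0, 0, 0, 1, 1, 1, 1]                          # known outputs (labels)

model = DecisionTreeClassifier().fit(X, y)            # the learned "program"
print(is_adult_traditional(25), model.predict([[10], [25]]))   # True [0 1]
```

In the traditional version the rule is hand-written; in the ML version the same rule is inferred from examples of inputs paired with outputs.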
Machine Learning Terminology

To get started with Machine Learning, let's take a look at some of the important terminologies used in Machine Learning:
• Model: Also known as “hypothesis”, a machine learning
model is the mathematical representation of a real-
world process. A machine learning algorithm along
with the training data builds a machine learning model.
• Feature: A feature is a measurable property or
parameter of the data-set.
• Feature Vector: It is a set of multiple numeric features.
We use it as an input to the machine learning model
for training and prediction purposes.
Machine Learning Terminology

• Training: An algorithm takes a set of data known as "training data" as input. The learning algorithm finds patterns in the input data and trains the model for the expected results (target). The output of the training process is the machine learning model.
• Prediction: Once the machine learning model is
ready, it can be fed with input data to provide a
predicted output.
• Target (Label): The value that the machine
learning model has to predict is called the target
or label.
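A minimal sketch tying these terms together, assuming scikit-learn is available; the tiny fruit-like dataset below is invented purely to label the terms.

```python
from sklearn.tree import DecisionTreeClassifier

# Each row is a feature vector: [colour_score, roundness]  <- two features
X_train = [[0.9, 0.60], [0.8, 0.70],      # apples
           [0.2, 0.90], [0.1, 0.95]]      # oranges
# Target (label): 0 = apple, 1 = orange
y_train = [0, 0, 1, 1]

# Training: learning algorithm + training data -> model (the "hypothesis")
model = DecisionTreeClassifier().fit(X_train, y_train)

# Prediction: the trained model maps a new feature vector to a target
print(model.predict([[0.85, 0.65]]))      # -> [0], i.e. predicted "apple"
```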
Machine Learning Terminology

• Overfitting: When a model fits the training data too closely, it learns the noise and inaccurate data entries along with the real signal. Such a model fails to characterize new data correctly.
• Underfitting: The scenario in which the model fails to decipher the underlying trend in the input data. It destroys the accuracy of the machine learning model; in simple terms, the model or the algorithm does not fit the data well enough.
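The contrast can be made concrete with a small sketch, assuming NumPy and scikit-learn; the sine-plus-noise data is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 60))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)   # signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):    # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))

# Degree 1 typically scores poorly on both sets (underfitting); degree 15
# typically scores best on the training set but worse on the unseen test
# set than degree 4 (overfitting the noise).
```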
When Do We Use Machine Learning?
ML is used when:
• Human expertise does not exist (navigating on Mars)
• Humans can’t explain their expertise (speech
recognition)
• Models must be customized (personalized medicine)
• Models are based on huge amounts of data (genomics)

Learning isn't always useful:
• There is no need to "learn" to calculate payroll.
Data everywhere!
1. Google: processes 24 petabytes of data per day.
2. Facebook: 10 million photos uploaded every hour.
3. YouTube: 1 hour of video uploaded every second.
4. Twitter: 400 million tweets per day.
5. Astronomy: satellite data runs to hundreds of PB.
6. ...
7. "By 2020 the digital universe will reach 44 zettabytes..."
(The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things, April 2014.)
That's 44 trillion gigabytes!
Why Machine Learning
• Now?
Just listen to a great person:
https://www.youtube.com/watch?v=5cFUZ03Sbhc
• Develop systems that can automatically adapt and
customize themselves to individual users.
• Discover new knowledge from large databases
• Ability to mimic humans and replace them in certain tasks which require some intelligence,
• like recognizing handwritten characters
• Develop systems that are too difficult/expensive to
construct manually because they require specific
detailed skills or knowledge tuned to a specific task

2
1
A classic example of a task that requires machine learning: It
is very hard to say what makes a 2

6
Slide credit: Geoffrey
Need for Machine Learning
• The need for machine learning is increasing day by day.
The reason behind the need for machine learning is that
it is capable of doing tasks that are too complex for a
person to implement directly.
• As humans, we have some limitations: we cannot manually process huge amounts of data. For this we need computer systems, and this is where machine learning comes in to make things easy for us.
• We can train machine learning algorithms by providing them with huge amounts of data and letting them explore the data, construct the models, and predict the required output automatically.
• The performance of a machine learning algorithm depends on the amount of data, and it is measured using a cost function. With the help of machine learning, we can save both time and money.
• The importance of machine learning can be easily understood from its use cases. Currently, machine learning is used in self-driving cars, cyber fraud detection, face recognition, friend suggestions by Facebook, and so on.
• Various top companies such as Netflix and Amazon have built machine learning models that use a vast amount of data to analyze user interests and recommend products accordingly.
Following are some key points which show the
importance of Machine Learning:
• Rapid increment in the production of data
• Solving complex problems, which are difficult
for a human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful
information from data.
Some Real time applications
Autonomous Cars

• Nevada made it legal for


autonomous cars to drive
on roads in June 2011
• As of 2013, four states
(Nevada, Florida, California,
and Michigan) have legalized
autonomous cars
Penn's Autonomous Car
Netflix
Machine Learning in
Automatic Speech Recognition
A Typical Speech Recognition
System

ML is used to predict phone states from the sound spectrogram.

Deep learning has state-of-the-art results:

# Hidden Layers       1      2      4      8      10     12
Word Error Rate (%)   16.0   12.8   11.4   10.9   11.0   11.1

Baseline GMM performance = 15.4%
[Zeiler et al., "On rectified linear units for speech recognition", ICASSP 2013]
Unit-1: Chapters-ML For DS
• Introduction to Machine Learning
• Definition and types of Machine Learning
• Machine Learning process, Stages
• Machine Learning Development Lifecycle
• Machine Learning Workflow
• Machine Learning Training Process
• Machine Learning Platforms
• Collecting and Manipulating data
• Machine Learning in data
• Data Modeling
• Data Processing
Ch.#1: Introduction to Machine Learning
What is machine Learning?
• Machine learning is an application of artificial intelligence
(AI) that provides systems the ability to automatically learn
and improve from experience without being explicitly
programmed.
• The aim of machine learning is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.
• The term machine learning was first introduced by Arthur
Samuel in 1959. We can define it in a summarized way as:
“Machine learning enables a machine to automatically learn
from data, improve performance from experiences, and
predict things without being explicitly programmed”.
Ch.#2: Definition of Machine Learning
Arthur Samuel, a pioneer in the field of
artificial intelligence and computer gaming,
coined the term “Machine Learning”. He
defined machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed."
Ch.#2: Definition of Machine Learning

A more formal and mathematical definition of Machine Learning, by Tom Mitchell, is:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
• Herbert Alexander Simon:
“Learning is any process by
which a system improves
performance from experience.”
• “Machine Learning is concerned
with computer programs that
automatically improve their
performance through
experience." (Herbert Simon: Turing Award 1975, Nobel Prize in Economics 1978)
Why is machine learning important?

• Machine learning is important because it gives enterprises a view of trends in customer behavior and business operational patterns, as well as supporting the development of new products.
Many of today's leading companies, such as
Facebook, Google and Uber, make machine
learning a central part of their operations.
Machine learning has become a significant
competitive differentiator for many companies.
How does Machine Learning work
• A Machine Learning system learns from historical data,
builds the prediction models, and whenever it receives
new data, predicts the output for it.
• The accuracy of the predicted output depends on the amount of data, as a larger amount of data helps to build a better model that predicts the output more accurately.
• Suppose we have a complex problem where we need to perform some predictions. Instead of writing code for it, we just need to feed the data to generic algorithms; with the help of these algorithms, the machine builds the logic from the data and predicts the output.

How does Machine Learning work

Machine learning has changed our way of thinking about the problem. The block diagram below explains the working of a Machine Learning algorithm.
Types of Machine Learning
• Machine Learning is the study of how to build
applications that exhibit this iterative
improvement.
• There are many ways to frame this idea, but
largely there are three major recognized
categories:
1. supervised learning,
2. unsupervised learning
3. reinforcement learning.
Supervised Learning

• Supervised learning is a type of machine learning in which machines are trained using well-"labeled" training data, and on the basis of that data, machines predict the output.
• Labeled data means that some input data is already tagged with the correct output.
• In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
• Supervised learning is a process of providing input data as well
as correct output data to the machine learning model. The aim
of a supervised learning algorithm is to find a mapping function
to map the input variable(x) with the output variable(y).
• In the real-world, supervised learning can be used for Risk
Assessment, Image classification, Fraud Detection, spam
filtering, etc.
How Supervised Learning Works

• In supervised learning, models are trained using a labeled dataset, where the model learns about each type of data.
• Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the dataset), and then it predicts the output.
• The working of Supervised learning can be
easily understood by the below example and
diagram:
• Suppose we have a dataset of different types of shapes
which includes square, rectangle, triangle, and Polygon.
Now the first step is that we need to train the model for
each shape.
• If the given shape has four sides, and all the sides are
equal, then it will be labeled as a Square.
• If the given shape has three sides, then it will be labeled
as a triangle.
• If the given shape has six equal sides, then it will be labeled as a hexagon.
• Now, after training, we test our model using the test set,
and the task of the model is to identify the shape.
• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
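A minimal sketch of this shapes example, assuming scikit-learn; the labeled rows are invented and, for brevity, only the number of sides is used as the feature (the "all sides equal" rule is left out).

```python
from sklearn.tree import DecisionTreeClassifier

X_train = [[4], [3], [6]]                          # feature: number of sides
y_train = ["square", "triangle", "hexagon"]        # labels (the supervision)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Test phase: classify "new" shapes by their number of sides.
print(model.predict([[3], [6]]))                   # ['triangle' 'hexagon']
```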
Steps Involved in Supervised Learning:
• First, determine the type of dataset required.
• Collect/gather the labeled training data.
• Split the dataset into training, test, and validation sets.
• Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
• Determine a suitable algorithm for the model, such as support vector machine, decision tree, etc.
• Execute the algorithm on the training dataset. Sometimes we need validation sets as control parameters; these are a subset of the training data.
• Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs, the model is accurate.
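The steps above can be sketched end to end with scikit-learn's built-in Iris dataset standing in for the labeled training data (an assumption for illustration, not part of the slides).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                     # gather labeled data
X_train, X_test, y_train, y_test = train_test_split(  # split the dataset
    X, y, test_size=0.2, random_state=42)

model = SVC()                                         # choose an algorithm
model.fit(X_train, y_train)                           # execute on training data

y_pred = model.predict(X_test)                        # provide the test set
print("Accuracy:", accuracy_score(y_test, y_pred))    # evaluate the model
```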
Types of problems solved by supervised
Machine learning Algorithms
• Supervised learning algorithms are used to solve the following two types of problems:
1. Regression
• Regression algorithms are used if there is a
relationship between the input variable and the
output variable.
• It is used for the prediction of continuous variables,
such as Weather forecasting, Market Trends, etc.
• Below are some popular Regression algorithms
which come under supervised learning:
1. Linear Regression
2. Regression Trees
3. Non-Linear Regression
4. Bayesian Linear Regression
5. Polynomial Regression
2. Classification

• Classification algorithms are used when the output variable is categorical, which means there are discrete classes such as Yes/No, Male/Female, True/False, etc. Spam filtering is a typical example.
• Popular classification algorithms which come under supervised learning include:
1. Random Forest
2. Decision Trees
3. Logistic Regression
4. Support Vector Machines
Classification Vs Regression
• Figure A: It is a dataset of a shopping store which is
useful in predicting whether a customer will purchase a
particular product under consideration or not based on
his/ her gender, age and salary.
Input : Gender, Age, Salary
Output : Purchased i.e. 0 or 1 ; 1 means yes the
customer will purchase and 0 means that customer
won’t purchase it.
• Figure B: It is a Meteorological dataset which serves
the purpose of predicting wind speed based on
different parameters.
Input : Dew Point, Temperature, Pressure, Relative
Humidity, Wind Direction
Output : Wind Speed
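A minimal sketch mirroring the two figures, assuming scikit-learn; the tiny datasets below are invented for illustration (salary is in thousands, and gender is omitted for brevity).

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

# Figure A style (classification): [age, salary_k] -> purchased (0 or 1)
X_cls = [[22, 20], [35, 60], [48, 90], [26, 25], [52, 110]]
y_cls = [0, 1, 1, 0, 1]
clf = LogisticRegression(max_iter=1000).fit(X_cls, y_cls)
print("Will purchase?", clf.predict([[40, 80]]))           # discrete label

# Figure B style (regression): [temperature, pressure] -> wind speed
X_reg = [[30, 1010], [25, 1005], [20, 1000], [15, 995]]
y_reg = [5.0, 8.5, 12.0, 15.5]
reg = LinearRegression().fit(X_reg, y_reg)
print("Predicted wind speed:", reg.predict([[22, 1002]]))  # continuous value
```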
Training the system:
• While training the model, data is usually split in the ratio
of 80:20 i.e. 80% as training data and rest as testing data.
• In training data, we feed input as well as output for 80%
data. The model learns from training data only. We use
different machine learning algorithms to build our model.
By learning, it means that the model will build some logic
of its own.
• Once the model is ready, it can be tested. At the time of testing, the input is fed from the remaining 20% of the data, which the model has never seen before; the model will predict some value, and we will compare it with the actual output to calculate the accuracy.
• Classification: It is a supervised learning task where the output has defined labels (discrete values). For example, in Figure A above, the output Purchased has defined labels, i.e. 0 or 1; 1 means the customer will purchase and 0 means the customer won't purchase.
• The goal here is to predict discrete values belonging to a particular class and evaluate on the basis of accuracy.
It can be either binary or multi-class classification. In binary classification, the model predicts either 0 or 1 (yes or no), but in multi-class classification the model predicts more than one class.
Example: Gmail classifies mails into more than one class, such as social, promotions, updates and forums.
• Regression: It is a supervised learning task where the output has a continuous value.
For example, in Figure B above, the output Wind Speed does not have any discrete value but is continuous within a particular range. The goal here is to predict a value as close to the actual output value as our model can, and evaluation is then done by calculating the error value. The smaller the error, the greater the accuracy of our regression model.
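A short sketch of how the two task types are evaluated, assuming scikit-learn; the true and predicted values are invented.

```python
from sklearn.metrics import accuracy_score, mean_absolute_error

# Classification: compare predicted labels with true labels -> accuracy
y_true_cls = [1, 0, 1, 1, 0]
y_pred_cls = [1, 0, 0, 1, 0]
print("Accuracy:", accuracy_score(y_true_cls, y_pred_cls))     # 0.8

# Regression: measure how far predictions are from actual values -> error
y_true_reg = [12.0, 8.5, 15.5]
y_pred_reg = [11.2, 9.0, 14.9]
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))     # about 0.63
```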
Advantages and Disadvantages of Supervised
learning:
Advantages of Supervised learning:
• With the help of supervised learning, the model can predict the
output on the basis of prior experiences.
• In supervised learning, we can have an exact idea about the
classes of objects.
• Supervised learning model helps us to solve various real-world
problems such as fraud detection, spam filtering, etc.
Disadvantages of supervised learning:
• Supervised learning models are not suitable for handling very complex tasks.
• Supervised learning cannot predict the correct output if the test data is different from the training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of objects.
Unsupervised Machine Learning
Introduction
• In supervised machine learning, models are trained using labeled data, under the supervision of the training data.
• But there may be many cases in which we do
not have labeled data and need to find the
hidden patterns from the given dataset.
• So, to solve such types of cases in machine
learning, we need unsupervised learning
techniques.
What is Unsupervised Learning?
• As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labeled training dataset. Instead, the models themselves find the hidden patterns and insights in the given data.
• It can be compared to learning which takes place in
the human brain while learning new things. It can be
defined as:
“Unsupervised learning is a type of machine learning in
which models are trained using unlabeled dataset and
are allowed to act on that data without any
supervision.”
• The goal of unsupervised learning is to find the underlying
structure of dataset, group that data according to
similarities, and represent that dataset in a compressed
format.
• Example: Suppose the unsupervised learning algorithm is
given an input dataset containing images of different types
of cats and dogs.
• The algorithm is never trained upon the given dataset,
which means it does not have any idea about the features
of the dataset.
• The task of the unsupervised learning algorithm is to
identify the image features on their own.
• Unsupervised learning algorithm will perform this task by
clustering the image dataset into the groups according to
similarities between images.
Unlabeled Data-Unsupervised Learning
Why use Unsupervised Learning?
• Below are some main reasons which describe the
importance of Unsupervised Learning:
• Unsupervised learning is helpful for finding useful
insights from the data.
• Unsupervised learning is much like how a human learns to think through their own experiences, which makes it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.
• In real-world, we do not always have input data with the
corresponding output so to solve such cases, we need
unsupervised learning.
Working of Unsupervised Learning

The working of unsupervised learning can be understood from the diagram below.

• Here, we have taken unlabeled input data, which means it is not categorized and the corresponding outputs are also not given.
• Now, this unlabeled input data is fed to the machine learning model in order to train it. First, it will interpret the raw data to find the hidden patterns in the data and then apply suitable algorithms such as k-means clustering, hierarchical clustering, etc.
• Once it applies the suitable algorithm, the algorithm divides the data objects into groups according to the similarities and differences between the objects.
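A minimal sketch of this workflow with k-means, assuming scikit-learn and NumPy; the unlabeled points are invented so that two obvious groups exist.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two features per object, no target column at all.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # e.g. [0 0 0 1 1 1]: groups found purely by similarity
```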
Types of Unsupervised Learning Algorithm

• Unsupervised learning algorithms can be further categorized into two types of problems:
Types of Unsupervised Learning Algorithm:
• Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in a group and have few or no similarities with the objects of another group.
• Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.
• Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategy more effective; for example, people who buy item X (say bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.
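The bread/butter idea can be sketched directly with pandas by computing support and confidence by hand (the transaction table is invented; a full Apriori implementation is not needed for such a small example).

```python
import pandas as pd

# One row per transaction; 1 means the item was bought.
transactions = pd.DataFrame({
    "bread":  [1, 1, 1, 0, 1],
    "butter": [1, 1, 0, 0, 1],
    "jam":    [0, 1, 0, 1, 0],
})

both = (transactions["bread"] == 1) & (transactions["butter"] == 1)
support = both.mean()                                        # P(bread and butter)
confidence = support / (transactions["bread"] == 1).mean()   # P(butter | bread)
print("support(bread, butter):", support)                    # 0.6
print("confidence(bread -> butter):", confidence)            # 0.75
```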
Unsupervised Learning algorithms:
• Below is the list of some popular unsupervised
learning algorithms:
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular value decomposition
Advantages of Unsupervised Learning
• Unsupervised learning is used for more complex tasks as
compared to supervised learning because, in
unsupervised learning, we don't have labeled input data.
• Unsupervised learning is preferable as it is easy to get
unlabeled data in comparison to labeled data.
Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than
supervised learning as it does not have corresponding
output.
• The result of the unsupervised learning algorithm might
be less accurate as input data is not labeled, and
algorithms do not know the exact output in advance.
Machine Learning Process, Stages and Life
Cycle
• The process of machine learning can be broken down into the 7 stages listed below.
• In order to illustrate the significance and function of each step, we will use an example of a simple model.
• This model will be responsible for differentiating between an apple and an orange. Machine learning is capable of much more complex tasks.
• However, in order to explain the process in simple terms, a basic example is taken to explain the relevant concepts.
Stage #1: Gathering Data
• For the purpose of developing our machine learning model,
our first step would be to gather relevant data that can be
used to differentiate between the 2 fruits.
• Different parameters can be used to classify a fruit as either
an orange or apple.
• For the sake of simplicity, we would only take 2 features
that our model would utilize in order to perform its
operation.
• The first feature would be the color of the fruit itself, and the second the shape of the fruit.
• Using these features, we would hope that our model can
accurately differentiate between the 2 fruits.
Color    Shape            Apple or Orange?
Red      Round-conical    Apple
Orange   Round            Orange

• A mechanism would be needed to gather the data for our 2 chosen features. For instance, for collecting data on color, we may use a spectrometer and, for the shape data, we may use pictures of the fruits so that they can be treated as 2D figures.
• For the purpose of collecting data, we would try to get as many different types of apples and oranges as possible in order to create diverse data sets for our features.
• For this purpose, we may try to search the markets for oranges and apples that come from different parts of the world.
NOTE
• The step of gathering data is the foundation of the
machine learning process.
• Mistakes such as choosing the incorrect features or
focusing on limited types of entries for the data set may
render the model completely ineffective.
• That is why it is imperative that the necessary considerations are made when gathering data, as errors made in this stage will only amplify as we progress to later stages.
Stage #2: Preparing that Data
• After collecting the data, we need to prepare it for further
steps. Data preparation is a step where we put our data into
a suitable place and prepare it to use in our machine learning
training.
• In this step, first, we put all data together, and then
randomize the ordering of data.
• Another major component of data preparation is breaking
down the data sets into 2 parts.
• The larger part (~80%) would be used for training the model
while the smaller part (~20%) is used for evaluation
purposes.
• This is important because using the same data sets for both
training and evaluation would not give a fair assessment of
the model’s performance in real world scenarios.
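A minimal sketch of this "randomize, then split 80/20" step, assuming pandas; the small fruit table is invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "orange"] * 5,
    "shape": ["round-conical", "round"] * 5,
    "label": ["apple", "orange"] * 5,
})

df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # randomize order
split = int(0.8 * len(df))                                      # 80/20 split
train_df, eval_df = df.iloc[:split], df.iloc[split:]
print(len(train_df), "training rows,", len(eval_df), "evaluation rows")
```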
Stage #3: Data Wrangling
• Data wrangling is the process of cleaning and converting raw data
into a useable format.
• It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step.
• It is one of the most important steps of the complete process.
Cleaning of data is required to address the quality issues.
• The data we have collected is not always useful, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:
 Missing Values
 Duplicate data
 Invalid data
 Noise
• So, we use various filtering techniques to clean the data.
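A minimal wrangling sketch with pandas, covering duplicates, invalid values and missing values; the raw table is invented.

```python
import pandas as pd

raw = pd.DataFrame({
    "color":    ["red", "red", "orange", None, "orange"],
    "weight_g": [150, 150, 120, 130, -5],      # -5 g is clearly invalid
})

clean = (raw.drop_duplicates()                 # remove duplicate rows
            .query("weight_g > 0")             # drop invalid measurements
            .fillna({"color": "unknown"}))     # handle missing values
print(clean)
```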
Stage #4: Data Analysis

• Now the cleaned and prepared data is passed on to the analysis step. This step involves:
• Selection of analytical techniques
• Building models
• Review the result
• The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, association, etc., then build the model using the prepared data, and evaluate the model.
Stage #5: Train Model
• Now the next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.
• We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can understand the various patterns, rules, and features.
• Here we use the part of data set allocated for training to teach our
model to differentiate between the 2 fruits. If we view our model in
mathematical terms, the inputs i.e. our 2 features would have
coefficients.
• The achieved output is compared with actual output and the
difference is minimized by trying different values of weights and
biases.
• The iterations are repeated using different entries from our training
data set until the model reaches the desired level of accuracy.
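The "compare achieved output with actual output, then adjust weights and biases" loop can be sketched in a few lines of NumPy for a single-feature linear model (a toy illustration, not the slides' fruit model).

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X + 1.0                      # the actual outputs to be learned

w, b, lr = 0.0, 0.0, 0.05              # initial weight, bias, learning rate
for _ in range(2000):                  # repeated iterations over training data
    y_pred = w * X + b                 # achieved output
    error = y_pred - y                 # difference from the actual output
    w -= lr * (error * X).mean()       # adjust the weight to shrink the error
    b -= lr * error.mean()             # adjust the bias to shrink the error

print(round(w, 2), round(b, 2))        # approaches w = 2.0, b = 1.0
```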
Stage #5: Train Model

• Training requires patience and experimentation. It is also useful to have knowledge of the field where the model will be implemented.
• Training can prove to be highly rewarding if the
model starts to succeed in its role.
• It is comparable to when a child learns to ride a
bicycle. Initially, they may have multiple falls
but, after a while, they develop a better grasp of
the process and are able to react better to
different situations while riding the bicycle.
Stage #6: Test Model
• Once our machine learning model has been trained on a
given dataset, then we test the model. In this step, we
check for the accuracy of our model by providing a test
dataset to it.
• Testing the model determines the percentage accuracy of
the model as per the requirement of project or problem.
• In our case, it could mean trying to identify a type of an
apple or an orange that is completely new to the model.
• However, through its training, the model should be
capable enough to extrapolate the information and deem
whether the fruit is an apple or an orange.
Stage #7: Deployment

• The last step of the machine learning life cycle is deployment, where we deploy the model in the real-world system.
• If the above-prepared model is producing an
accurate result as per our requirement with
acceptable speed, then we deploy the model in the
real system. But before deploying the project, we
will check whether it is improving its performance
using available data or not. The deployment phase
is similar to making the final report for a project.
Machine Learning Workflow
Important web references
• https://towardsdatascience.com/workflow-of-
a-machine-learning-project-ec1dba419b94
Machine Learning For Data Science
Data Modeling
What is data modeling?

• Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures.
• The goal is to illustrate the types of data used and
stored within the system, the relationships among these
data types, the ways the data can be grouped and
organized and its formats and attributes.
• Data models are built around business needs. Rules and
requirements are defined upfront through feedback
from business stakeholders
Data Model
• Data can be modeled at various levels of abstraction.
• The process begins by collecting information about
business requirements from stakeholders and end users.
These business rules are then translated into data
structures to formulate a concrete database design.
• A data model can be compared to a roadmap, an
architect’s blueprint or any formal diagram that facilitates
a deeper understanding of what is being designed.
• This provides a common, consistent, and predictable way
of defining and managing data resources across an
organization, or even beyond.
Types of data models

• Data models can generally be divided into three categories, which vary according to their degree of abstraction:
1. Conceptual model
2. Logical model
3. Physical model
Each type of data model is discussed in more detail below.
Conceptual data models.
• They are also referred to as domain models and offer a
big-picture view of what the system will contain, how it
will be organized, and which business rules are involved.
• Conceptual models are usually created as part of the
process of gathering initial project requirements.
• Typically, they include entity classes (defining the types
of things that are important for the business to
represent in the data model), their characteristics and
constraints, the relationships between them and
relevant security and data integrity requirements.
Logical data models.
• They are less abstract and provide greater detail about the
concepts and relationships in the domain under consideration.
• One of several formal data modeling notation systems is
followed. These indicate data attributes, such as data types
and their corresponding lengths, and show the relationships
among entities
• Logical data models don’t specify any technical system
requirements.
• Logical data models can be useful in highly procedural
implementation environments, or for projects that are data-
oriented by nature, such as data warehouse design or
reporting system development
Physical data models.
• They provide a schema for how the data will be
physically stored within a database.
• As such, they’re the least abstract of all. They offer a
finalized design that can be implemented as a relational
database, including associative tables that illustrate the
relationships among entities as well as the primary keys
and foreign keys that will be used to maintain those
relationships.
• Physical data models can include database management
system (DBMS)-specific properties, including
performance tuning.
Data modeling process
• Data modeling process contain the following five
steps
1. Identify the entities
2. Identify key properties of each entity
3. Identify relationships among entities
4. Map attributes to entities completely
5. Assign keys as needed, and decide on a degree of
normalization that balances the need to reduce
redundancy with performance requirements.
Note: Data modeling is an iterative process that should
be repeated and refined as business needs change.
Identify the entities. The process of data modeling
begins with the identification of the things, events or
concepts that are represented in the data set that is
to be modeled.
Identify key properties of each entity.
• Each entity type can be differentiated from all others
because it has one or more unique properties, called
attributes.
• For instance, an entity called “customer” might
possess such attributes as a first name, last name,
telephone number and salutation, while an entity
called “address” might include a street name and
number, a city, state, country and zip code.
Identify relationships among entities.
• The earliest draft of a data model will specify the nature
of the relationships each entity has with the others. In
the above example, each customer “lives at” an address.
• If that model were expanded to include an entity called
“orders,” each order would be shipped to and billed to an
address as well. These relationships are usually
documented via unified modeling language (UML).
Map attributes to entities completely.
• This will ensure the model reflects how the business will
use the data.
• Several formal data modeling patterns are in widespread
use. Object-oriented developers often apply analysis
patterns or design patterns
Assign keys as needed, and decide on a degree of
normalization that balances the need to reduce redundancy
with performance requirements.
• Normalization is a technique for organizing data models (and
the databases they represent) in which numerical identifiers,
called keys, are assigned to groups of data to represent
relationships between them without repeating the data.
• For instance, if customers are each assigned a key, that key
can be linked to both their address and their order history
without having to repeat this information in the table of
customer names. Normalization tends to reduce the amount
of storage space a database will require, but it can come at a cost to query performance.
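A minimal sketch of this keying idea using pandas DataFrames as stand-in tables (the customer, address and order data are invented).

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})
addresses = pd.DataFrame({"customer_id": [1, 2], "city": ["Chennai", "Madurai"]})
orders    = pd.DataFrame({"order_id": [101, 102, 103],
                          "customer_id": [1, 1, 2]})   # key, not repeated names

# The key lets us reassemble the full picture without storing names redundantly.
report = (orders.merge(customers, on="customer_id")
                .merge(addresses, on="customer_id"))
print(report)
```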
Benefits of data modeling

• Data modeling makes it easier for developers, data architects, business analysts, and other stakeholders to view and understand relationships among the data in a database or data warehouse. In addition, it can:
• Reduce errors in software and database development.
• Increase consistency in documentation and system design across the
enterprise.
• Improve application and database performance.
• Ease data mapping throughout the organization.
• Improve communication between developers and business
intelligence teams.
• Ease and speed the process of database design at the conceptual,
logical and physical levels.
Types of data modeling

We have many types of data modeling techniques; some of them are:
• Hierarchical data models
• Relational data models
• Entity-relationship (ER) data models
• Object-oriented data models
• Dimensional data models
Hierarchical data models

• They represent one-to-many relationships in a treelike format. In this type of model, each record has a single root or parent which maps to one or more child tables.
• This model was implemented in the IBM Information
Management System (IMS), which was introduced in 1966
and rapidly found widespread use, especially in banking.
• Though this approach is less efficient than more recently
developed database models, it’s still used in Extensible
Markup Language (XML) systems and geographic
information systems (GISs).
Relational data models
• Relational data models were initially proposed by IBM
researcher E.F. Codd in 1970. They are still
implemented today in the many different relational
databases commonly used in enterprise computing.
Relational data modeling doesn’t require a detailed
understanding of the physical properties of the data
storage being used. In it, data segments are explicitly
joined through the use of tables, reducing database
complexity.
• Relational databases frequently employ structured query language (SQL) for data management.
• Entity-relationship (ER) data models use formal diagrams to
represent the relationships between entities in a database.
Several ER modeling tools are used by data architects to
create visual maps that convey database design objectives.
• Object-oriented data models gained traction as object-oriented programming became popular in the mid-1990s. The "objects" involved are abstractions of real-world
entities. Objects are grouped in class hierarchies, and have
associated features. Object-oriented databases can
incorporate tables, but can also support more complex data
relationships. This approach is employed in multimedia and
hypertext databases as well as other use cases.
• Dimensional data models were developed by Ralph Kimball,
and they were designed to optimize data retrieval speeds for
analytic purposes in a data warehouse. While relational and
ER models emphasize efficient storage, dimensional models
increase redundancy in order to make it easier to locate
information for reporting and retrieval. This modeling is
typically used across OLAP systems.
• Two popular dimensional data models are the star schema, in
which data is organized into facts (measurable items) and
dimensions (reference information), where each fact is
surrounded by its associated dimensions in a star-like pattern.
The other is the snowflake schema, which resembles the star
schema but includes additional layers of associated
dimensions, making the branching pattern more complex.
Data modeling tools

• erwin Data Modeler
• Enterprise Architect
• ER/Studio
• Free data modeling tools
Data Preprocessing
• Data preprocessing is the process of transforming
raw data into an understandable format.
• It is also an important step in data mining as we
cannot work with raw data.
• The quality of the data should be checked before
applying machine learning or data mining algorithms.
Why is Data preprocessing important?

• Preprocessing of data is mainly done to check the data quality. The quality can be checked by the following:
• Accuracy: To check whether the data entered is correct or
not.
• Completeness: To check whether the data is available or has not been recorded.
• Consistency: To check whether the same data kept in different places matches.
• Timeliness: The data should be updated correctly.
• Believability: The data should be trustable.
• Interpretability: The understandability of the data.
Major Tasks in Data Preprocessing:

• Data cleaning
• Data integration
• Data reduction
• Data transformation
Data cleaning:

• Data cleaning is the process of removing incorrect, incomplete and inaccurate data from the datasets; it also replaces the missing values. There are some techniques in data cleaning:
Handling missing values:
• Standard values like “Not Available” or “NA” can be used to replace
the missing values.
• Missing values can also be filled manually but it is not recommended
when that dataset is big.
• The attribute's mean value can be used to replace the missing value when the data is normally distributed, whereas in the case of a non-normal distribution the median value of the attribute can be used.
• While using regression or decision tree algorithms the missing value
can be replaced by the most probable
value.
• Noisy:
Noisy data generally means data containing random errors or unnecessary data points. Methods such as binning, regression and clustering can be used to handle noisy data.
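A minimal imputation sketch with pandas, assuming NumPy is available; the values are invented, with one extreme entry to show why the median is preferred for skewed data.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, np.nan, 13, 250])   # 250 skews the distribution

filled_mean = s.fillna(s.mean())       # reasonable if the data is roughly normal
filled_median = s.fillna(s.median())   # safer when the distribution is skewed
print(filled_mean[3], filled_median[3])    # 59.2 vs 12.0
```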
Data integration:
• The process of combining multiple sources into a single dataset.
The Data integration process is one of the main components in
data management. There are some problems to be considered
during data integration.
• Schema integration: Integrates metadata(a set of data that
describes other data) from different sources.
• Entity identification problem: Identifying entities from multiple databases. For example, the system or the user should know that student_id from one database and student_name from another database belong to the same entity.
• Detecting and resolving data value concepts: The data taken
from different databases while merging may differ. Like the
attribute values from one database may differ from another
database. For example, the date format may differ like
“MM/DD/YYYY” or “DD/MM/YYYY”.
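A minimal sketch of resolving exactly this date-format mismatch with pandas; the two source tables are invented.

```python
import pandas as pd

source_a = pd.DataFrame({"joined": ["03/25/2024", "12/01/2023"]})  # MM/DD/YYYY
source_b = pd.DataFrame({"joined": ["25/03/2024", "01/12/2023"]})  # DD/MM/YYYY

source_a["joined"] = pd.to_datetime(source_a["joined"], format="%m/%d/%Y")
source_b["joined"] = pd.to_datetime(source_b["joined"], format="%d/%m/%Y")

merged = pd.concat([source_a, source_b], ignore_index=True)  # now one format
print(merged)
```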
• Data reduction:
• This process helps reduce the volume of the data, which makes the analysis easier while producing the same or almost the same result. This reduction also helps to reduce storage space. Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction and data compression.
• Dimensionality reduction: This process is necessary for real-world applications, where the data size is big. In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set comes down, by combining and merging attributes without losing their original characteristics. This also reduces storage space and computation time. When the data is highly dimensional, a problem called the "curse of dimensionality" occurs.
• Numerosity Reduction: In this method, the representation of the data is made
smaller by reducing the volume. There will not be any loss of data in this
reduction.
• Data compression: Storing data in a compressed form is called data compression. This compression can be lossless or lossy. When there is no loss of information during compression it is called lossless compression, whereas lossy compression reduces the size by discarding only unnecessary information.
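A minimal dimensionality-reduction sketch with PCA, assuming scikit-learn and using its built-in Iris data as a stand-in dataset.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)          # 4 attributes per sample
X_reduced = PCA(n_components=2).fit_transform(X)

print(X.shape, "->", X_reduced.shape)      # (150, 4) -> (150, 2)
```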
• Data Transformation:
• The change made in the format or the structure of the data is called
data transformation. This step can be simple or complex based on the
requirements. There are some methods in data transformation.
• Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying the important features of the dataset. With smoothing, even a small change that helps in prediction can be detected.
• Aggregation: In this method, the data is stored and presented in the form of a summary. Data coming from multiple sources is integrated and summarized for data analysis. This is an important step, since the accuracy of the data depends on the quantity and quality of the data; when both are good, the results are more relevant.
• Discretization: The continuous data here is split into intervals.
Discretization reduces the data size. For example, rather than specifying
the class time, we can set an interval like (3 pm-5 pm, 6 pm-8 pm).
• Normalization: It is the method of scaling the data so that it can be
represented in a smaller range. Example ranging from -1.0 to 1.0.
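A minimal normalization sketch, assuming scikit-learn and NumPy; the three values are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [55.0], [100.0]])
scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
print(scaled.ravel())      # [-1.  0.  1.]
```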
• https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/
• https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/
Reference
• https://www.ibm.com/cloud/learn/data-modeling
• https://www.youtube.com/watch?v=4qFZ-5i4GS8
• https://www.analyticsvidhya.com/blog/2021/08/data-preprocessing-in-data-mining-a-hands-on-guide/
