
What is Data Science

Data Science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.

Data science enables you to translate a business problem into a research project and then translate it back into a practical solution.

SRM Institute of Science and Technology 2


Data science is the process of deriving knowledge and insights from a huge and diverse set of data through organizing, processing, and analyzing the data.



What is Data Science

Statistics:
Statistics is the most critical component of data science: the science of collecting and analyzing numerical data in large quantities to obtain useful insights.

Visualization:
Visualization techniques present huge amounts of data as easy-to-understand, digestible visuals.

Machine Learning:
Machine learning explores the building and study of algorithms that learn to make predictions about unseen or future data.

Deep Learning:
Deep learning is a newer area of machine learning research in which algorithms are applied to handle huge amounts of data.


Data Science Process

• Defining data science project roles

• Understanding the stages of a data science project

• Setting expectations for a new data science project



The roles in a data science project

• Project sponsor: represents the business interests; champions the project.

• Client: represents the end users' interests.

• Data scientist: sets and executes the analytic strategy; communicates with the sponsor and client.

• Data architect: manages data and data storage; sometimes manages data collection.

• Operations: manages infrastructure; deploys final project results.


PROJECT SPONSOR

• The sponsor is the person who wants the data science result.

• The sponsor is responsible for deciding whether the project is a success or failure.


CLIENT

• The client is the role that represents the model's end users' interests.

• Generally the client belongs to a different group in the organization and has other responsibilities beyond your project.


DATA SCIENTIST

• The data scientist is responsible for taking all necessary steps to make the project succeed.

• Responsible for setting the project strategy and design, picking the data sources, and choosing the tools and techniques to be used.

• Also responsible for project planning and tracking, testing procedures, applying machine learning models, and evaluating results.


DATA ARCHITECT

• The data architect is responsible for all of the data and its storage.

• Data architects often manage data warehouses for many different projects.




Stages of a data science project



Defining the goal

The first task in a data science project is to define a measurable and quantifiable goal.

• Why do the sponsors want the project in the first place?
• What do they lack, and what do they need?
• What are they doing to solve the problem now, and why isn't that good enough?


• What resources will you need: what kind of data and how much staff?
• Will you have domain experts to collaborate with, and what are the computational resources?
• How do the project sponsors plan to deploy your results?
• What are the constraints that have to be met for successful deployment?

Example: the ultimate business goal is to reduce the bank's losses due to bad loans.


Data collection and management

This step encompasses identifying the data you need, exploring it, and conditioning it to be suitable for analysis. This stage is often the most time-consuming step in the process.

• What data is available to me?
• Will it help me solve the problem?
• Is there enough data?
• Is the data quality good enough?


Modeling

You apply statistics and machine learning during the modeling, or analysis, stage to extract useful insights from the data in order to achieve your goals.

The loan application problem, for example, is a classification problem.


The most common data science modeling tasks are these:

• Classification—deciding if something belongs to one category or another
• Scoring—predicting or estimating a numeric value, such as a price or probability
• Ranking—learning to order items by preferences
• Clustering—grouping items into most-similar groups
• Finding relations—finding correlations or potential causes of effects seen in the data
• Characterization—very general plotting and report generation from data
Model evaluation and critique

Once you have a model, you need to determine if it meets your goals.

• Is it accurate enough for your needs? Does it generalize well?
• Does it perform better than "the obvious guess"? Better than whatever estimate you currently use?
• Do the results of the model (coefficients, clusters, rules) make sense in the context of the problem domain?


CONFUSION MATRIX

                      Actual positive   Actual negative
Predicted positive          TP                FP
Predicted negative          FN                TN

Example:

                      Actual positive   Actual negative
Predicted positive           5                 0
Predicted negative           1                 9


Accuracy

Accuracy is defined as the ratio of the total number of correct predictions to the total number of samples.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)


Precision

Precision is the fraction of predicted positives that are actually positive.

Precision = True Positives / (True Positives + False Positives)

Recall

Recall, or true positive rate, is defined as the ratio of true positives to the total number of true positives and false negatives.

Recall = True Positives / (True Positives + False Negatives)
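These formulas can be checked against the example confusion matrix above (TP = 5, FP = 0, FN = 1, TN = 9); a small R sketch:

```r
TP <- 5; FP <- 0; FN <- 1; TN <- 9

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 14/15, about 0.933
precision <- TP / (TP + FP)                   # 5/5 = 1.0
recall    <- TP / (TP + FN)                   # 5/6, about 0.833
```

With no false positives, precision is perfect, while the one missed positive lowers recall.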
Presentation and documentation

Once you have a model that meets your success criteria, you'll present your results to your project sponsor and other stakeholders.

You must also document the model for those in the organization who are responsible for using, running, and maintaining the model once it has been deployed.


Model deployment and maintenance

Finally, the model is put into operation. In many organizations this means the data scientist no longer has primary responsibility for the day-to-day operation of the model.




WORKING WITH DATA FROM FILES

The most common ready-to-go data format is a family of tabular formats called structured values.

Working with well-structured data from files

Loading the file:

uciCar <- read.table(
  'http://www.win-vector.com/dfiles/car.data.csv',
  sep = ',', header = TRUE)

This loads the data and stores it in a new R data frame object called uciCar.


EXAMINING OUR DATA

• class()—tells us that the object uciCar is of class data.frame.
• help()—gives you the documentation for a class.
• summary()—gives you a summary of almost any R object.

summary(uciCar) shows us a lot about the distribution of the UCI car data.


Exploring the car data:

> summary(uciCar)
  buying       maint
 high :432   high :432
 low  :432   low  :432
 med  :432   med  :432
 vhigh:432   vhigh:432

The summary() command shows us the distribution of each variable in the dataset.


Structured data

• Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).


WORKING WITH OTHER DATA FORMATS

• XLS/XLSX—http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
• JSON—http://cran.r-project.org/web/packages/rjson/index.html
• XML—http://cran.r-project.org/web/packages/XML/index.html
• MongoDB—http://cran.r-project.org/web/packages/rmongodb/index.html
• SQL—http://cran.r-project.org/web/packages/DBI/index.html


TRANSFORMING DATA IN R

Building a map to interpret loan use codes:

mapping <- list(
  'A40' = 'car (new)',
  'A41' = 'car (used)',
  'A42' = 'furniture/equipment',
  'A43' = 'radio/television',
  'A44' = 'domestic appliances', ...)
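A sketch of how such a mapping can be applied to recode raw loan-use codes into readable labels (the vector of raw codes here is made up for illustration):

```r
# Subset of the code-to-description map from the slide
mapping <- list(
  'A40' = 'car (new)',
  'A41' = 'car (used)',
  'A42' = 'furniture/equipment')

# Indexing a list by name returns the mapped values,
# which we flatten to a character vector
raw_codes <- c('A40', 'A42', 'A41')
labels <- as.character(mapping[raw_codes])
print(labels)  # "car (new)" "furniture/equipment" "car (used)"
```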


EXAMINING OUR NEW DATA

> table(d$Purpose, d$Good.Loan)
                      BadLoan GoodLoan
 business                  34       63
 car (new)                 89      145
 car (used)                17       86
 domestic appliances        4        8
 education                 22       28
 furniture/equipment       58      123
 others                     5        7
 radio/television          62      218
 repairs                    8       14
WORKING WITH RELATIONAL DATABASES

RMySQL Package:

R has a package named "RMySQL" which provides native connectivity to a MySQL database. You can install this package in the R environment using the following command:

install.packages("RMySQL")




WORKING WITH RELATIONAL DATABASES - STAGING THE DATA

A staging area is an intermediate storage area used for data processing during the extract, transform, and load (ETL) process.

The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.


WORKING WITH RELATIONAL DATABASES - CURATING THE DATA

Data curation is the organization and integration of data collected from various sources.

It involves annotation, publication, and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.
WORKING WITH RELATIONAL DATABASES

Connecting R to MySQL

Create a connection object in R to connect to the database. It takes the username, password, database name, and host name as input.

mysqlconnection <- dbConnect(MySQL(), user = 'root',
    password = '', dbname = 'sakila', host = 'localhost')

# List the tables available in this database.
dbListTables(mysqlconnection)
We can query the database tables in MySQL using the function dbSendQuery(). The query gets executed in MySQL and the result set is returned using the R fetch() function. It is stored as a data frame in R.

# Create a table
actor <- "CREATE TABLE actor(actor_id INT, first_name TEXT,
    last_name TEXT, last_update TEXT)"
dbSendQuery(mysqlconnection, actor)


# Insert data into the table
dbSendQuery(mysqlconnection, "insert into
    actor(actor_id, first_name, last_name, last_update)
    values(188, 'Jeba', 'raj', '13/08/2019')")

# Query the "actor" table to get all the rows.
result = dbSendQuery(mysqlconnection, "select * from actor")


# Store the result in an R data frame object.
# n = 5 fetches the first 5 rows.
data.frame = fetch(result, n = 5)
print(data.frame)

Query with a filter clause:

result = dbSendQuery(mysqlconnection, "select * from
    actor where last_name = 'TORN'")


# Fetch all the records (n = -1) and store them as a data frame.
data.frame = fetch(result, n = -1)
print(data.frame)

Output:
  actor_id first_name last_name         last_update
1       18        DAN      TORN 2006-02-15 04:34:33
2       94    KENNETH      TORN 2006-02-15 04:34:33
3      102     WALTER      TORN 2006-02-15 04:34:33


Updating rows in a table:

dbSendQuery(mysqlconnection, "update mtcars set disp = 168.5 where hp = 110")

Dropping a table in MySQL:

dbSendQuery(mysqlconnection, "drop table if exists student")




EXPLORING THE DATA

Using summary statistics to spot problems:

• Missing values
• Invalid values and outliers
• Data range
• Units


> summary(custdata)
     custid        sex
 Min.   :   2068   F:440
 1st Qu.: 345667   M:560
 Median : 693403
 Mean   : 698500
 3rd Qu.:1044606
 Max.   :1414286


Missing values

 is.employed
 Mode :logical
 FALSE:73
 TRUE :599
 NA's :328   (missing values)

 housing.type
 Homeowner free and clear    :157
 Homeowner with mortgage/loan:412
 Occupied with no rent       : 11
 Rented                      :364
 NA's                        : 56   (missing values)


Invalid values and outliers

Examples of invalid values include negative values in what should be a non-negative numeric data field (like age or income), or text where you expect numbers.

Outliers are data points that fall well out of the range of where you expect the data to be.


> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
(Negative values of income represent bad data.)

> summary(custdata$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    38.0    50.0    51.7    64.0   146.7
(Customers of age zero, or of an age greater than about 110, are outliers.)
DATA RANGE

> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000

The same data reported in units of $1000 would summarize as:
   -8.7    14.6    35.0    53.5    67.0   615.0


UNITS

Does the income data represent hourly wages, or yearly wages in units of $1000?

> summary(Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -8.7    14.6    35.0    53.5    67.0   615.0
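If the column turns out to be yearly income in units of $1000, a one-line rescale restores dollars (a sketch; the Income values are the summary figures from the slide):

```r
# Income in units of $1000, as suggested by the summary above
Income <- c(-8.7, 14.6, 35.0, 53.5, 67.0, 615.0)

# Convert to dollars before comparing against other income columns
income_dollars <- Income * 1000
print(income_dollars)  # -8700 14600 35000 53500 67000 615000
```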




18CSE396T – DATA SCIENCE
Unit I : Session 5 : SLO-1 & SLO-2


18CSE396T – DATA SCIENCE
Unit I : Session 6 : SLO-1 & SLO-2


MANAGING DATA - CLEANING DATA

Treating missing values:

• To drop or not to drop
• Missing data in categorical variables
• Missing values in numeric data

> summary(custdata$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0   25000   45000   66200   82000  615000     328
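One common treatment for the 328 missing Income values is mean imputation plus an indicator column, so a model can still see which rows were originally missing (a sketch; the toy values below are made up):

```r
# Toy stand-in for custdata$Income with some NAs
custdata <- data.frame(Income = c(25000, NA, 45000, 82000, NA))

# Remember which rows were missing before filling them in
custdata$Income.isNA <- is.na(custdata$Income)

# Replace NAs with the mean of the observed values
mean.income <- mean(custdata$Income, na.rm = TRUE)
custdata$Income.fix <- ifelse(custdata$Income.isNA,
                              mean.income,
                              custdata$Income)
```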


TO DROP OR NOT TO DROP?

summary(custdata[is.na(custdata$housing.type),
                 c("recent.move", "num.vehicles")])

The c() function in R is used to create a vector with values you provide explicitly.


Missing data in categorical variables

custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
                                   "missing",
                            ifelse(custdata$is.employed == TRUE,
                                   "employed",
                                   "not employed"))

Here the .fix suffix simply marks the repaired column; NAs become the explicit category "missing".


summary(as.factor(custdata$is.employed.fix))

    employed      missing not employed
         599          328           73

Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.




SAMPLING FOR MODELING AND VALIDATION

Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.

It's important that the dataset that you do use is an accurate representation of your population as a whole.


For example, your customers might come from all over the United States. When you collect your custdata dataset, it might be tempting to use all the customers from one state to train the model. It's a good idea to pick customers randomly from all the states.


VALIDATION

Data validation refers to the process of ensuring the accuracy and quality of data.

It is implemented by building several checks into a system or report to ensure the logical consistency of input and stored data.


TYPES OF DATA VALIDATION

Data Type Check
A data type check confirms that the data entered has the correct data type. For example, a field might only accept numeric data.

Code Check
A code check ensures that a field is selected from a valid list of values or follows certain formatting rules.


Range Check
A range check verifies whether input data falls within a predefined range.

Format Check
Many data types follow a certain predefined format. A common use case is date columns that are stored in a fixed format like "YYYY-MM-DD" or "DD-MM-YYYY."


Consistency Check
A consistency check is a type of logical check that confirms the data has been entered in a logically consistent way.

Uniqueness Check
A uniqueness check ensures that an item is not entered multiple times into a database.
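These checks are easy to sketch in R (illustrative only; the vectors and the 0–110 age bound are made-up examples):

```r
# Range check: ages must fall in [0, 110]
ages <- c(25, 42, -3, 200, 37)
range_ok <- ages >= 0 & ages <= 110    # FALSE for -3 and 200

# Format check: dates must match "YYYY-MM-DD"
dates <- c("2019-08-13", "13/08/2019")
format_ok <- grepl("^\\d{4}-\\d{2}-\\d{2}$", dates)  # TRUE FALSE

# Uniqueness check: flag duplicated IDs
ids <- c(101, 102, 102, 103)
unique_ok <- !duplicated(ids)          # the second 102 is FALSE
```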


TEST AND TRAINING SPLITS

The training set is the data that you feed to the model-building algorithm—regression, decision trees, and so on.

The test set is the data that you feed into the resulting model, to verify that the model's predictions are accurate.


CREATING A SAMPLE GROUP COLUMN

A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif() function.


CREATING A SAMPLE GROUP COLUMN

> custdata$gp <- runif(dim(custdata)[1])


> testSet <- subset(custdata, custdata$gp <= 0.1)
> trainingSet <- subset(custdata, custdata$gp > 0.1)
> dim(testSet)[1]
[1] 93
> dim(trainingSet)[1]
[1] 907

SRM Institute of Science and Technology 4


RECORD GROUPING

hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh,
                         gp = runif(length(hh)))
hhdata <- merge(hhdata, households, by = "household_id")
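Because every row of a household now shares one gp value, the split can be done per household rather than per row, so no household straddles the two sets (a sketch continuing the snippet above; the 10% threshold mirrors the earlier slide):

```r
# Whole households land together on one side of the threshold
testSet <- subset(hhdata, hhdata$gp <= 0.1)
trainingSet <- subset(hhdata, hhdata$gp > 0.1)
```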


DATA PROVENANCE

Data provenance records the information on how the data sets were generated.


DATA STRUCTURES

• Structured
• Semi-structured
• Quasi-structured
• Unstructured


Big data can come in multiple forms, including structured and non-structured data such as financial data, text files, multimedia files, and genetic mappings.

Most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze.


• Structured data: data containing a defined data type, format, and structure.

• Semi-structured data: textual data files with a discernible pattern that enables parsing.

• Quasi-structured data: textual data with erratic data formats that can be formatted with effort, tools, and time.

• Unstructured data: data that has no inherent structure, which may include text documents, PDFs, images, and video.




DRIVERS OF BIG DATA

• Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures.

• Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing.


