
What is Data Science

Data Science is an interdisciplinary field that allows you to extract knowledge from structured or unstructured data.

Data science enables you to translate a business problem into a research project and then translate it back into a practical solution.

SRM Institute of Science and Technology 2


Data science is the process of deriving knowledge and insights from a huge and diverse set of data through organizing, processing, and analyzing the data.



What is Data Science

Statistics:
Statistics is the most critical component of data science: the science of collecting and analyzing numerical data in large quantities to obtain useful insights.

Visualization:
Visualization techniques present huge amounts of data as easy-to-understand, digestible visuals.

Machine Learning:
Machine learning explores the building and study of algorithms that learn to make predictions about unseen or future data.

Deep Learning:
Deep learning is a newer area of machine learning research in which algorithms are applied to handle huge amounts of data.


Data Science Process

• Defining data science project roles

• Understanding the stages of a data science project

• Setting expectations for a new data science project



The roles in a data science project

• Project sponsor: represents the business interests; champions the project.

• Client: represents the end users' interests.

• Data scientist: sets and executes the analytic strategy; communicates with the sponsor and client.

• Data architect: manages data and data storage; sometimes manages data collection.

• Operations: manages infrastructure; deploys final project results.


PROJECT SPONSOR

• The sponsor is the person who wants the data science result.

• The sponsor is responsible for deciding whether the project is a success or failure.


CLIENT

• The client is the role that represents the model's end users' interests.

• Generally the client belongs to a different group in the organization and has other responsibilities beyond your project.


DATA SCIENTIST

• The data scientist is responsible for taking all necessary steps to make the project succeed.

• Responsible for setting the project strategy and design, picking the data sources, and choosing the tools and techniques to be used.

• Also responsible for project planning and tracking, testing procedures, applying machine learning models, and evaluating results.


DATA ARCHITECT

• The data architect is responsible for all of the data and its storage.

• Data architects often manage data warehouses for many different projects.




Stages of a data science project



Defining the goal

The first task in a data science project is to define a measurable and quantifiable goal.

• Why do the sponsors want the project in the first place?
• What do they lack, and what do they need?
• What are they doing to solve the problem now, and why isn't that good enough?


• What resources will you need: what kind of data and how much staff?
• Will you have domain experts to collaborate with, and what are the computational resources?
• How do the project sponsors plan to deploy your results?
• What are the constraints that have to be met for successful deployment?

Example: the ultimate business goal is to reduce the bank's losses due to bad loans.


Data collection and management

This step encompasses identifying the data you need, exploring it, and conditioning it to be suitable for analysis. This stage is often the most time-consuming step in the process.

• What data is available to me?
• Will it help me solve the problem?
• Is there enough data?
• Is the data quality good enough?


Modeling

You apply statistics and machine learning during the modeling, or analysis, stage to extract useful insights from the data in order to achieve your goals.

The loan application problem, for example, is a classification problem.


The most common data science modeling tasks are these:

• Classification—deciding if something belongs to one category or another
• Scoring—predicting or estimating a numeric value, such as a price or probability
• Ranking—learning to order items by preferences
• Clustering—grouping items into most-similar groups
• Finding relations—finding correlations or potential causes of effects seen in the data
• Characterization—very general plotting and report generation from data
Model evaluation and critique

Once you have a model, you need to determine if it meets your goals.

• Is it accurate enough for your needs? Does it generalize well?
• Does it perform better than "the obvious guess"? Better than whatever estimate you currently use?
• Do the results of the model (coefficients, clusters, rules) make sense in the context of the problem domain?


CONFUSION MATRIX

                      Actual positive   Actual negative
Predicted positive          TP                FP
Predicted negative          FN                TN

Example:

                      Actual positive   Actual negative
Predicted positive           5                 0
Predicted negative           1                 9


Accuracy

Accuracy is defined as the ratio of the total number of correct predictions to the total number of samples.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)


Precision

Precision is the fraction of predicted positives that are actually positive.

Precision = True Positives / (True Positives + False Positives)

Recall

Recall, or true positive rate, is defined as the ratio of true positives to the total number of true positives and false negatives.

Recall = True Positives / (True Positives + False Negatives)
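These formulas can be checked against the example confusion matrix above (TP = 5, FP = 0, FN = 1, TN = 9); a small R sketch:

```r
TP <- 5; FP <- 0; FN <- 1; TN <- 9

accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 14/15, about 0.933
precision <- TP / (TP + FP)                   # 5/5 = 1.0
recall    <- TP / (TP + FN)                   # 5/6, about 0.833
```

With no false positives, precision is perfect, while the one missed positive lowers recall.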
Presentation and documentation

Once you have a model that meets your success criteria, you'll present your results to your project sponsor and other stakeholders.

You must also document the model for those in the organization who are responsible for using, running, and maintaining the model once it has been deployed.


Model deployment and maintenance

Finally, the model is put into operation. In many organizations this means the data scientist no longer has primary responsibility for the day-to-day operation of the model.




WORKING WITH DATA FROM FILES

The most common ready-to-go data format is a family of tabular formats called structured values.

Working with well-structured data from files

Loading the file:

uciCar <- read.table(
  'http://www.win-vector.com/dfiles/car.data.csv',
  sep = ',', header = TRUE)

This loads the data and stores it in a new R data frame object called uciCar.


EXAMINING OUR DATA

• class()—tells us that the object uciCar is of class data.frame.
• help()—gives you the documentation for a class.
• summary()—gives you a summary of almost any R object.

summary(uciCar) shows us a lot about the distribution of the UCI car data.


Exploring the car data:

> summary(uciCar)
  buying       maint
 high :432   high :432
 low  :432   low  :432
 med  :432   med  :432
 vhigh:432   vhigh:432

The summary() command shows us the distribution of each variable in the dataset.


Structured data

• Data containing a defined data type, format, and structure (that is, transaction data, online analytical processing [OLAP] data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).


WORKING WITH OTHER DATA FORMATS

• XLS/XLSX—http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
• JSON—http://cran.r-project.org/web/packages/rjson/index.html
• XML—http://cran.r-project.org/web/packages/XML/index.html
• MongoDB—http://cran.r-project.org/web/packages/rmongodb/index.html
• SQL—http://cran.r-project.org/web/packages/DBI/index.html


TRANSFORMING DATA IN R

Building a map to interpret loan use codes:

mapping <- list(
  'A40' = 'car (new)',
  'A41' = 'car (used)',
  'A42' = 'furniture/equipment',
  'A43' = 'radio/television',
  'A44' = 'domestic appliances', ...)
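A sketch of how such a mapping can be applied to recode raw loan-use codes into readable labels (the vector of raw codes here is made up for illustration):

```r
# Subset of the code-to-description map from the slide
mapping <- list(
  'A40' = 'car (new)',
  'A41' = 'car (used)',
  'A42' = 'furniture/equipment')

# Indexing a list by name returns the mapped values,
# which we flatten to a character vector
raw_codes <- c('A40', 'A42', 'A41')
labels <- as.character(mapping[raw_codes])
print(labels)  # "car (new)" "furniture/equipment" "car (used)"
```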


EXAMINING OUR NEW DATA

> table(d$Purpose, d$Good.Loan)
                      BadLoan GoodLoan
 business                  34       63
 car (new)                 89      145
 car (used)                17       86
 domestic appliances        4        8
 education                 22       28
 furniture/equipment       58      123
 others                     5        7
 radio/television          62      218
 repairs                    8       14
WORKING WITH RELATIONAL DATABASES

RMySQL Package:

R has a package named "RMySQL" which provides native connectivity to a MySQL database. You can install this package in the R environment using the following command:

install.packages("RMySQL")




WORKING WITH RELATIONAL DATABASES - STAGING THE DATA

A staging area is an intermediate storage area used for data processing during the extract, transform, and load (ETL) process.

The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.


WORKING WITH RELATIONAL DATABASES - CURATING THE DATA

Data curation is the organization and integration of data collected from various sources.

It involves annotation, publication, and presentation of the data such that the value of the data is maintained over time, and the data remains available for reuse and preservation.
WORKING WITH RELATIONAL DATABASES

Connecting R to MySQL

Create a connection object in R to connect to the database. It takes the username, password, database name, and host name as input.

mysqlconnection <- dbConnect(MySQL(), user = 'root',
    password = '', dbname = 'sakila', host = 'localhost')

# List the tables available in this database.
dbListTables(mysqlconnection)
We can query the database tables in MySQL using the function dbSendQuery(). The query gets executed in MySQL and the result set is returned using the R fetch() function. It is stored as a data frame in R.

# Create a table
actor <- "CREATE TABLE actor(actor_id INT, first_name TEXT,
    last_name TEXT, last_update TEXT)"
dbSendQuery(mysqlconnection, actor)


# Insert data into the table
dbSendQuery(mysqlconnection, "insert into
    actor(actor_id, first_name, last_name, last_update)
    values(188, 'Jeba', 'raj', '13/08/2019')")

# Query the "actor" table to get all the rows.
result = dbSendQuery(mysqlconnection, "select * from actor")


# Store the result in an R data frame object.
# n = 5 fetches the first 5 rows.
data.frame = fetch(result, n = 5)
print(data.frame)

Query with a filter clause:

result = dbSendQuery(mysqlconnection, "select * from
    actor where last_name = 'TORN'")


# Fetch all the records (n = -1) and store them as a data frame.
data.frame = fetch(result, n = -1)
print(data.frame)

Output:
  actor_id first_name last_name         last_update
1       18        DAN      TORN 2006-02-15 04:34:33
2       94    KENNETH      TORN 2006-02-15 04:34:33
3      102     WALTER      TORN 2006-02-15 04:34:33


Updating rows in a table:

dbSendQuery(mysqlconnection, "update mtcars set disp = 168.5 where hp = 110")

Dropping a table in MySQL:

dbSendQuery(mysqlconnection, "drop table if exists student")




EXPLORING THE DATA

Using summary statistics to spot problems:

• Missing values
• Invalid values and outliers
• Data range
• Units


> summary(custdata)
     custid        sex
 Min.   :   2068   F:440
 1st Qu.: 345667   M:560
 Median : 693403
 Mean   : 698500
 3rd Qu.:1044606
 Max.   :1414286


Missing values

 is.employed
 Mode :logical
 FALSE:73
 TRUE :599
 NA's :328   (missing values)

 housing.type
 Homeowner free and clear    :157
 Homeowner with mortgage/loan:412
 Occupied with no rent       : 11
 Rented                      :364
 NA's                        : 56   (missing values)


Invalid values and outliers

Examples of invalid values include negative values in what should be a non-negative numeric data field (like age or income), or text where you expect numbers.

Outliers are data points that fall well out of the range of where you expect the data to be.


> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
(Negative values of income represent bad data.)

> summary(custdata$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    0.0    38.0    50.0    51.7    64.0   146.7
(Customers of age zero, or of an age greater than about 110, are outliers.)
DATA RANGE

> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000

The same data reported in units of $1000 would summarize as:
   -8.7    14.6    35.0    53.5    67.0   615.0


UNITS

Does the income data represent hourly wages, or yearly wages in units of $1000?

> summary(Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -8.7    14.6    35.0    53.5    67.0   615.0
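If the column turns out to be yearly income in units of $1000, a one-line rescale restores dollars (a sketch; the Income values are the summary figures from the slide):

```r
# Income in units of $1000, as suggested by the summary above
Income <- c(-8.7, 14.6, 35.0, 53.5, 67.0, 615.0)

# Convert to dollars before comparing against other income columns
income_dollars <- Income * 1000
print(income_dollars)  # -8700 14600 35000 53500 67000 615000
```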




18CSE396T – DATA SCIENCE
Unit I : Session 5 : SLO-1 & SLO-2


18CSE396T – DATA SCIENCE
Unit I : Session 6 : SLO-1 & SLO-2


MANAGING DATA - CLEANING DATA

Treating missing values:

• To drop or not to drop
• Missing data in categorical variables
• Missing values in numeric data

> summary(custdata$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0   25000   45000   66200   82000  615000     328
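One common treatment for the 328 missing Income values is mean imputation plus an indicator column, so a model can still see which rows were originally missing (a sketch; the toy values below are made up):

```r
# Toy stand-in for custdata$Income with some NAs
custdata <- data.frame(Income = c(25000, NA, 45000, 82000, NA))

# Remember which rows were missing before filling them in
custdata$Income.isNA <- is.na(custdata$Income)

# Replace NAs with the mean of the observed values
mean.income <- mean(custdata$Income, na.rm = TRUE)
custdata$Income.fix <- ifelse(custdata$Income.isNA,
                              mean.income,
                              custdata$Income)
```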


TO DROP OR NOT TO DROP?

summary(custdata[is.na(custdata$housing.type),
                 c("recent.move", "num.vehicles")])

The c() function in R is used to create a vector with values you provide explicitly.


Missing data in categorical variables

custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
                                   "missing",
                            ifelse(custdata$is.employed == TRUE,
                                   "employed",
                                   "not employed"))

Here the .fix suffix simply marks the repaired column; NAs become the explicit category "missing".


summary(as.factor(custdata$is.employed.fix))

    employed      missing not employed
         599          328           73

Factors are variables in R which take on a limited number of different values; such variables are often referred to as categorical variables.




SAMPLING FOR MODELING AND VALIDATION

Sampling is the process of selecting a subset of a population to represent the whole during analysis and modeling.

It's important that the dataset that you do use is an accurate representation of your population as a whole.


For example, your customers might come from all over the United States. When you collect your custdata dataset, it might be tempting to use all the customers from one state to train the model. It's a good idea to pick customers randomly from all the states.


VALIDATION

Data validation refers to the process of ensuring the accuracy and quality of data.

It is implemented by building several checks into a system or report to ensure the logical consistency of input and stored data.


TYPES OF DATA VALIDATION

Data Type Check
A data type check confirms that the data entered has the correct data type. For example, a field might only accept numeric data.

Code Check
A code check ensures that a field is selected from a valid list of values or follows certain formatting rules.


Range Check
A range check verifies whether input data falls within a predefined range.

Format Check
Many data types follow a certain predefined format. A common use case is date columns that are stored in a fixed format like "YYYY-MM-DD" or "DD-MM-YYYY."


Consistency Check
A consistency check is a type of logical check that confirms the data has been entered in a logically consistent way.

Uniqueness Check
A uniqueness check ensures that an item is not entered multiple times into a database.
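These checks are easy to sketch in R (illustrative only; the vectors and the 0–110 age bound are made-up examples):

```r
# Range check: ages must fall in [0, 110]
ages <- c(25, 42, -3, 200, 37)
range_ok <- ages >= 0 & ages <= 110    # FALSE for -3 and 200

# Format check: dates must match "YYYY-MM-DD"
dates <- c("2019-08-13", "13/08/2019")
format_ok <- grepl("^\\d{4}-\\d{2}-\\d{2}$", dates)  # TRUE FALSE

# Uniqueness check: flag duplicated IDs
ids <- c(101, 102, 102, 103)
unique_ok <- !duplicated(ids)          # the second 102 is FALSE
```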


TEST AND TRAINING SPLITS

The training set is the data that you feed to the model-building algorithm—regression, decision trees, and so on.

The test set is the data that you feed into the resulting model, to verify that the model's predictions are accurate.


CREATING A SAMPLE GROUP COLUMN

A convenient way to manage random sampling is to add a sample group column to the data frame. The sample group column contains a number generated uniformly from zero to one, using the runif() function.


CREATING A SAMPLE GROUP COLUMN

> custdata$gp <- runif(dim(custdata)[1])


> testSet <- subset(custdata, custdata$gp <= 0.1)
> trainingSet <- subset(custdata, custdata$gp > 0.1)
> dim(testSet)[1]
[1] 93
> dim(trainingSet)[1]
[1] 907

SRM Institute of Science and Technology 4


RECORD GROUPING

hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh,
                         gp = runif(length(hh)))
hhdata <- merge(hhdata, households, by = "household_id")
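Because every row of a household now shares one gp value, the split can be done per household rather than per row, so no household straddles the two sets (a sketch continuing the snippet above; the 10% threshold mirrors the earlier slide):

```r
# Whole households land together on one side of the threshold
testSet <- subset(hhdata, hhdata$gp <= 0.1)
trainingSet <- subset(hhdata, hhdata$gp > 0.1)
```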


DATA PROVENANCE

Data provenance records the information on how the data sets were generated.


DATA STRUCTURES

• Structured
• Semi-structured
• Quasi-structured
• Unstructured


Big data can come in multiple forms, including structured and non-structured data such as financial data, text files, multimedia files, and genetic mappings.

Most big data is unstructured or semi-structured in nature, which requires different techniques and tools to process and analyze.


• Structured data: data containing a defined data type, format, and structure.

• Semi-structured data: textual data files with a discernible pattern that enables parsing.

• Quasi-structured data: textual data with erratic data formats that can be formatted with effort, tools, and time.

• Unstructured data: data that has no inherent structure, which may include text documents, PDFs, images, and video.




DRIVERS OF BIG DATA

• Smart devices, which provide sensor-based collection of information from smart electric grids, smart buildings, and many other public and industry infrastructures.

• Nontraditional IT devices, including the use of radio-frequency identification (RFID) readers, GPS navigation systems, and seismic processing.


