
TYBSC CS – SEM VI – Data Science

Practical No – 01
Aim: Data collection, Data curation and management for Unstructured data
(NoSQL) using Apache CouchDB

Introduction:
Apache CouchDB is an open-source document-oriented NoSQL database implemented in
Erlang. CouchDB uses multiple formats and protocols to store, transfer, and process its data: it
uses JSON to store data, JavaScript as its query language (via MapReduce), and HTTP for its API.
Unlike a relational database, a CouchDB database does not store data and relationships in
tables. Instead, each database is a collection of independent documents. Each document
maintains its own data and self-contained schema. An application may access multiple
databases, such as one stored on a user's mobile phone and another on a server. Document
metadata contains revision information, making it possible to merge any differences that may
have occurred while the databases were disconnected.
Architecture:

The architecture of CouchDB is described below:


1. CouchDB Engine: The core of the system, based on a B-tree; data is accessed by keys or
key ranges, which map directly to the underlying B-tree operations. It manages the storage
of internal data, documents, and views.
2. HTTP Request: The HTTP API is used to create indices and extract data from documents. Views
are written in JavaScript, which allows creating ad-hoc views made of MapReduce jobs.
3. Document: The unit of storage; each document holds its own self-contained data.


4. Replica Database: It is used for replicating data to a local or remote database and
synchronizing design documents.

Features of CouchDB

Features of CouchDB include the following:


1. Replication: It provides one of the simplest replication mechanisms available; few
databases are as simple to replicate.
2. Document Storage: It is a NoSQL database that follows document storage where each field
is uniquely named and contains values of various data types such as text, number, Boolean,
lists, etc.
3. ACID Properties: The CouchDB file layout follows the ACID properties.
4. Security: It also provides database-level security; permissions are divided into
readers (members) and admins, where readers can both read and write documents, while
admins can additionally manage design documents and security settings.
5. Map/Reduce: The main reason for the popularity of CouchDB is a map/reduce system.
6. Authentication: CouchDB lets you keep authentication open via a session
cookie, like a web application.
7. Built for Offline: CouchDB can replicate to devices like smartphones that have a feature to
go offline and handle data sync for you when the device is back online.
8. Eventual Consistency: CouchDB guarantees eventual consistency to provide both
availability and partition tolerance.
9. HTTP API: All items have a unique URI (Uniform Resource Identifier) that gets exposed via
HTTP. It uses the HTTP methods POST, GET, PUT, and DELETE for the four basic CRUD
(Create, Read, Update, Delete) operations on all resources, as the sketch below shows.
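
As an aside (not part of the original steps): the same CRUD calls can be made from R with the httr package — an assumption here, since this practical uses curl — against the same local server. The database name "demo" is hypothetical:

$ library(httr)
# Create a database named "demo" (assumes CouchDB at 127.0.0.1:5984 with admin/123456)
$ PUT("http://127.0.0.1:5984/demo", authenticate("admin", "123456"))
# Create a document with id 1 inside it
$ PUT("http://127.0.0.1:5984/demo/1", authenticate("admin", "123456"),
      body = '{"roll": 1, "name": "test"}')
# Read the document back as parsed JSON
$ content(GET("http://127.0.0.1:5984/demo/1", authenticate("admin", "123456")))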

Advantages of CouchDB

Advantages of CouchDB include the following:


1. The HTTP API makes communication easy.
2. It can store any type of data.
3. MapReduce allows optimizing the combining of data.
4. The structure of CouchDB is very simple.
5. Fast indexing and retrieval.

Installation of CouchDB in Windows

Step 1. Download CouchDB


The official website for CouchDB is https://couchdb.apache.org. Opening this link shows the
home page of the CouchDB official website, as shown below.


Step 2 : Install the app, provide the username and password, and hit next.
Username = admin and password = 123456

Step 3 : Open browser and login into http://127.0.0.1:5984/_utils/ with username and password.

Step 4: Create a database with default options


Step 5 : After login you can see the below screenshot

Step 6 : Open Command Prompt as administrator and access http://127.0.0.1:5984

Step 7: To create a database inside CouchDB, run the following command:


$ curl -X PUT http://user:pass@127.0.0.1:5984/cat
The segment after the final slash is the name of the database (here, cat).


Step 8: Getting all databases


$ curl -X GET http://admin:123456@127.0.0.1:5984/_all_dbs

Step 9: Insert Records


$ curl -X PUT "http://admin:123456@127.0.0.1:5984/student/36" -d "{\"roll\":36,\"name\":\"Rohan\"}"


Step 10 : Update Records

$ curl -X PUT "http://admin:123456@127.0.0.1:5984/student/3" -d "{\"_rev\":\"3-7f89edb031660ac615d735065428147e\",\"roll\":57,\"name\":\"aashish\"}"

Step 11 : Delete Records

$ curl -X DELETE "http://admin:123456@127.0.0.1:5984/student/2?rev=2-5d3f522eabd22f1b31679d071f6bcaac"


Practical No – 02
Aim: Data collection, Data curation and management for Large-scale Data system
using MongoDB

Introduction:

MongoDB is a NoSQL database which stores data in the form of documents made up of
field-value pairs. It is an open-source document database which provides high
performance and scalability, along with data modelling and data management of
huge sets of data in an enterprise application.
MongoDB also provides the feature of auto-scaling. MongoDB is a cross-platform
database and can be installed on different platforms such as Windows, Linux, etc.

Document Database:
A record in MongoDB is a document, which is a data structure composed of field
and value pairs. MongoDB documents are similar to JSON objects. The values of
fields may include other documents, arrays, and arrays of documents.

The advantages of using documents, illustrated in the sketch after this list, are:


● Documents correspond to native data types in many programming languages.
● Embedded documents and arrays reduce need for expensive joins.
● Dynamic schema supports fluent polymorphism.
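
As an illustration of such embedded documents (an aside, not one of the installation steps — this practical itself uses the mongo shell), a record with a nested sub-document can be inserted from R using the mongolite package, assuming a server is already running locally and a hypothetical "books" collection:

$ library(mongolite)
# Connect to a "books" collection in the "library36" database
$ m = mongo(collection = "books", db = "library36", url = "mongodb://localhost:27017")
# The "publisher" field is an embedded document; no join is needed to read it back
$ m$insert('{"title": "Core Java", "cost": 350, "publisher": {"pub_name": "TMH", "city": "Mumbai"}}')
$ m$find('{"title": "Core Java"}')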

Step 1 : Download and install the MongoDB server and the MongoDB Shell (mongosh)



The default data directory is C:\Program Files\MongoDB\Server\6.0\data\



Step 2 : Search for "Edit the system environment variables" and open it

Step 3 : Click on environment variable


Step 4 : In Environment Variables, click New; set the variable name to Path and the variable
value to the location where MongoDB is installed: "C:\Program Files\MongoDB\Server\6.0\bin"


Step 5 : Do the same for system variables and click on ok


Step 6 : Go to command prompt and type :


$ mongod
Do not close this window; it is the server process.

Note : If the server closes automatically, create a new folder named "data" in the C drive
and, inside data, create a "db" folder.


Step 7 : Go to the folder where you extracted the mongosh zip file, open its bin folder, and run mongosh

Step 8 : Now type $ mongosh and then type show dbs


Step 9 : To create a new database, type $ use library36 and then $ show dbs

Step 10 : Type these commands to insert new records

1. $ db.library36.insertOne({title: "Core Java", status_info: {accession_no: "BS001", status: "Issued"}, author: "James Bond", cost: 350, publisher: {pub_name: "TMH", city: "Mumbai"}})

2. $ db.library36.insertOne({title: "Python", status_info: {accession_no: "BS002", status: "Issued"}, author: "Rohan Bond", cost: 450, publisher: {pub_name: "Lame", city: "Pune"}})


3. $ db.library36.insertOne({title: "Javascript", status_info: {accession_no: "BS003", status: "Available"}, author: "Rishi Bond", cost: 550, publisher: {pub_name: "Booba", city: "Kerala"}})

4. $ db.library36.insertOne({title: "NodeJs", status_info: [{accession_no: "BS004", status: "Issued"}, {accession_no: "BS005", status: "Issued"}, {accession_no: "BS006", status: "Available"}], author: "Prachit Bond", cost: 5250, publisher: {pub_name: "MAN", city: "New York"}})


Step 11 : Type $ db.library36.find({})

Step 12 : Update Records


$ db.library36.updateOne({ title: 'NodeJs'}, {$set: {title : 'React'}})

Step 13 : Delete single record


$ db.library36.deleteOne({title:"React"})

Step 14 : Delete all Records


$ db.library36.deleteMany({})

Step 15 : Open MongoDB Compass, connect to the server, and select your database and collection

Step 16 : Check your records in the available view formats


Practical No – 03
Aim: Implementation of Principal Component Analysis

Introduction:

PCA (Principal Component Analysis) is used in exploratory data analysis and for
making decisions in predictive models.

PCA is commonly used for dimensionality reduction by projecting each data point onto
only the first few principal components (in most cases the first and second)
to obtain lower-dimensional data while keeping as much of the data’s variation as
possible.

The first principal component can equivalently be defined as a direction that


maximizes the variance of the projected data.

The principal components are typically computed via eigendecomposition of the data
covariance matrix or singular value decomposition (SVD) of the data matrix.

Steps involved in the calculation of PCA:

Step 1 - Data normalization

Continuing the example from the introduction, consider, for instance, the following
information for a given client.

● Monthly expenses: $300


● Age: 27
● Rating: 4.5

This information has different scales and performing PCA using such data will lead to a
biased result. This is where data normalization comes in. It ensures that each attribute
has the same level of contribution, preventing one variable from dominating others. For
each variable, normalization is done by subtracting its mean and dividing by its standard
deviation.
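
A minimal sketch of this step in R — the three clients below are made-up values, for illustration only:

# Three clients measured on very different scales
$ clients = data.frame(expenses = c(300, 120, 560),
                       age = c(27, 45, 33),
                       rating = c(4.5, 3.0, 4.9))
# scale() subtracts each column's mean and divides by its standard deviation
$ z = scale(clients)
$ z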


Step 2 - Covariance matrix

As the name suggests, this step is about computing the covariance matrix from the
normalized data. This is a symmetric matrix, and each element (i, j) corresponds to the
covariance between variables i and j.

Step 3 - Eigenvectors and eigenvalues

Geometrically, an eigenvector represents a direction such as “vertical” or “90 degrees”.


An eigenvalue, on the other hand, is a number representing the amount of variance
present in the data for a given direction. Each eigenvector has its corresponding
eigenvalue.

Step 4 - Selection of principal components

There are as many pairs of eigenvectors and eigenvalues as the number of variables in
the data. In the data with only monthly expenses, age, and rating, there will be three
pairs. Not all the pairs are relevant. So, the eigenvector with the highest eigenvalue
corresponds to the first principal component. The second principal component is the
eigenvector with the second highest eigenvalue, and so on.

Step 5 - Data transformation in new dimensional space

This step involves re-orienting the original data onto a new subspace defined by the
principal components. This reorientation is done by multiplying the original data by the
previously computed eigenvectors.

It is important to remember that this transformation does not modify the original data
itself but instead provides a new perspective to better represent the data.
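
Steps 2 to 5 can be sketched in a few lines of base R, assuming the normalized matrix z from the sketch under Step 1:

$ cov_mat = cov(z)              # step 2: covariance matrix
$ e = eigen(cov_mat)            # step 3: eigenvalues and eigenvectors
$ pcs = e$vectors[, 1:2]        # step 4: keep the two largest components
$ scores = z %*% pcs            # step 5: project the data onto the new subspace
$ scores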

Step 1: Read the subjects.csv file

$ x = read.csv("E:/TYCS Rohan/Ds/subjects.csv")
$x


Step 2: Display the covariance matrix


$ cov_mat = cov ( x )
$ cov_mat

Step 3 : Calculate eigenvalues and eigenvectors


$ ex = eigen( cov_mat )
$ ex


Step 4 : Install packages


$ install.packages ("factoextra")

$ install.packages("FactoMineR")

Step 5 : Load library


$ library("FactoMineR")
$ datapca = PCA (x,ncp = 3 , graph = TRUE )

Step 6 : Type $ datapca$eig


Step 7 : Type $ datapca$var


Step 8 : Type
$ datapca$var$coord
$ library("factoextra")

Step 9 : Type
$ library("ggplot2")
$ fviz_eig(datapca, addlabels = TRUE, ylim = c(0, 50))
$ fviz_eig(datapca, addlabels = TRUE, ylim = c(0, 80))


Step 10 : Now type


$ datapca$eig
$ datapca

Step 11 : Read Iris.csv


$ y = read.csv("E:/TYCS Rohan/Ds/Iris.csv")
$y


Step 12 : Drop the Species column (column 5) from the built-in iris dataset
$ y = iris[, -5]
$ y

Step 13 : Now next type


$ cov_iris = cov ( y )
$ cov_iris



Step 14 : Type
$ irispca = PCA ( y, ncp = 3 , graph = TRUE )


Step 15 : Now type


$ irispca$eig
$ irispca$var

$ irispca$var$coord

Step 16 : Type
$ fviz_eig ( irispca, addlabels = TRUE , ylim = c ( 0, 80))

Practical No – 04
Aim: Implementation of Clustering
Introduction:

Clustering allows us to identify homogeneous groups in the dataset and categorize
observations accordingly.

One of the simplest clustering methods is K-means, the most widely used method for
splitting a dataset into a set of k groups.

K Means is a clustering algorithm that repeatedly assigns each data point to one of
the k groups according to the features of the point. It is a
centroid-based clustering method.

The number of clusters is decided; cluster centers are initially selected at random,
farthest from one another; the distance between each data point and each center is
calculated using Euclidean distance; and each data point is assigned to the cluster whose
center is nearest to that point. This process is repeated until the cluster centers stop
changing and the data points remain in the same clusters.

R has a clustering package that carries out the above steps, as the sketch below shows.
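
A minimal sketch of these steps with base R's kmeans() function, using the built-in iris measurements purely for illustration:

$ x = iris[, 3:4]                      # steps 1-2: load data, select columns
$ km = kmeans(x, centers = 3)          # step 3: k = 3 clusters
$ plot(x, col = km$cluster)            # step 4: color points by assigned cluster
$ points(km$centers, col = 1:3, pch = 8, cex = 2)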

Step 1

Load the dataset.

Step 2

Select the columns on which to perform clustering.

Step 3

The next step is to use the K Means algorithm. kmeans() is the function we use,
which has parameters (data, no. of clusters or groups). Here our data is the x
object and we will have k = 3 clusters.

Step 4

Assign different colors to the clusters.

Case Study 1
Step 1 : $ df = read.csv("E:/TYCS Rohan/Ds/prac4/age.csv")
$ df


Step 2: $ plot(df)

Step 3: $ boxplot(df)


Step 4: $ set.seed(20)

Step 5: $ c = kmeans(df[,1:2],3)

Step 6: $ install.packages("factoextra")

Step 7: Load library


$ library("factoextra")

Step 8: $ fviz_cluster(c, data=df)


Case Study 2

Step 9: $ head(iris)

Step 10: $ View(iris)

Step 11: $ summary(iris)


Step 12: $ plot(iris[,3:4])

Step 13: $ library(ggplot2)


$ set.seed(50)

Step 14: $ c = kmeans(iris[,3:4],3)


$c


Step 15: $ df = read.csv("E:/TYCS Rohan/Ds/Iris.csv")


Step 16: $ ggplot(iris, aes(Petal.Length, Petal.Width, col = c$cluster)) + geom_point()
$ ggplot(iris, aes(Petal.Length, Petal.Width, col = Species)) + geom_point()


Practical No – 05
Aim: Analysis of Time Series Forecasting
Introduction:

Time series forecasting is the process of using historical data to make predictions
about future events. It is commonly used in fields such as finance, economics, and
weather forecasting. R is a powerful programming language and software
environment for statistical computing and graphics that is widely used for time
series forecasting.
There are several R packages available for time series forecasting, including:
“forecast”: This package provides a wide range of methods for time series
forecasting, including exponential smoothing, ARIMA, and neural networks.
“tseries”: This package provides functions for time series analysis and forecasting,
including functions for decomposing time series data, and for fitting and
forecasting models such as ARIMA.
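
As a quick illustrative sketch of the "forecast" package (the steps below instead fit an ARIMA model manually):

$ library(forecast)
# Let auto.arima() pick a seasonal ARIMA model for the built-in AirPassengers series
$ fit = auto.arima(AirPassengers)
# Forecast the next 24 months and plot the result
$ plot(forecast(fit, h = 24))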

Time series data are decomposed into three components (a sketch follows the list):


• Seasonal – Patterns that show how the data changes over a certain
period of time. Example – a clothing e-commerce website will have heavy
traffic during festive seasons and less traffic at normal times. This is
a seasonal pattern, as the value increases only at certain periods of
time.
• Trend – A pattern that shows how the values change in the long run. For
example, if a website is running successfully overall, the trend goes
up; if not, the trend comes down.
• Random – The data remaining in the time series after the seasonal and trend
components are removed is the random pattern. This is also known as noise.
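
These three components can be extracted directly with decompose(); a short sketch using the same AirPassengers data as the steps below:

$ d = decompose(AirPassengers)
$ head(d$seasonal)   # repeating seasonal pattern
$ head(d$trend)      # long-run trend (NA at the ends of the moving average)
$ head(d$random)     # leftover noise
$ plot(d)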


Step 1: install

$ install.packages("timeSeries")

$ install.packages("forecast")

Step 2: Load Libraries

$ library(timeSeries)

$ library(timeDate)

$ library(forecast)

Step 3:

$ data = table(AirPassengers)

$ data

Step 4: $ View(data)

Step 5: $ AirPassengers


Step 6: $ sum(is.na(AirPassengers))

Step 7: $ frequency(AirPassengers)

Step 8: $ summary(AirPassengers)

Step 9: $ install.packages("tseries")

$ library(tseries)

Step 10: $ adf.test(AirPassengers, alternative = "stationary", k = 12 )

Step 11: $ plot(AirPassengers , main="Air Passengers Count from 1949 to 1960")


Step 12: $ plot(AirPassengers)
$ abline(reg = lm(AirPassengers ~ time(AirPassengers)))

Step 13: $ plot(log(AirPassengers))


Step 14: $ plot(diff(AirPassengers))

Step 15: $ plot(aggregate(AirPassengers, FUN = mean))


Step 16: $ boxplot(AirPassengers~cycle(AirPassengers))

Step 17:$ plot(decompose(AirPassengers))


Step 18: $ tsdata = ts(log (AirPassengers), frequency = 12)

Step 19: $ acf(AirPassengers)

Step 20: $ pacf(diff(log(AirPassengers)))


Step 21: $ fit = arima(log(AirPassengers), c(0,1,1), seasonal = list(order = c(0,1,1), period = 12))

$ fit

Step 22: $ prediction = predict(fit,n.ahead= 10*12)

$ prediction

Step 23: $ conPred = round(exp(prediction$pred), 0)

$ conPred

Step 24: $ ts.plot(AirPassengers,conPred, log="y", lty=c(1,3))


Practical No – 06
Aim: Analysis of Simple/Multiple Linear Regression

Introduction:
Linear regression is a regression model that uses a straight line to describe the relationship between
variables. It finds the line of best fit through your data by searching for the value of the regression
coefficient(s) that minimizes the total error of the model.

There are two main types of linear regression:

• Simple linear regression uses only one independent variable


• Multiple linear regression uses two or more independent variables

A. Simple linear regression:


Simple linear regression is used to predict a quantitative outcome y on the basis of one
single predictor variable x. The goal is to build a mathematical model (or formula) that defines y
as a function of the x variable.

Once we have built a statistically significant model, it's possible to use it for predicting future
outcomes on the basis of new x values, as the sketch below shows.
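
For instance, once the lml model from Step 7 below has been fitted, a new income value can be scored like this (a sketch; the value 5, i.e. $50,000, is illustrative):

$ predict(lml, newdata = data.frame(income = 5))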

In this practical we have used Income data set. This dataset contains observations about
income (in a range of $15k to $75k) and happiness (rated on a scale of 1 to 10) in an imaginary
sample of 500 people. The income values are divided by 10,000 to make the income data
match the scale of the happiness scores (so a value of $2 represents $20,000, $3 is $30,000,
etc.)

B. Multiple linear regression:


Multiple Linear Regression basically describes how a single response variable Y depends linearly
on several predictor variables.
In this practical we have used heart disease data set. This dataset contains observations on the
percentage of people biking to work each day, the percentage of people smoking, and the
percentage of people with heart disease in an imaginary sample of 500 towns .


Case Study 1- Simple Linear Regression

Step 1: Install packages
1. $ install.packages("ggplot2")
2. $ install.packages("ggpubr")
3. $ install.packages("dplyr")
4. $ install.packages("broom")

Step 2: Load libraries
$ library(ggplot2)
$ library(broom)
$ library(dplyr)
$ library(ggpubr)

Step 3: $ incomedata = read.csv("E:/TYCS Rohan/Ds/practical/incomedata.csv")
$ head(incomedata)

Step 4: $ summary(incomedata)

Step 5: $ hist(incomedata$happiness)


Step 6: $ plot(happiness~ income, data = incomedata)

Step 7:
$ lml = lm(happiness ~ income, data = incomedata)
$ summary(lml)


Step 8:
$ income.graph = ggplot(incomedata, aes(x = income, y = happiness)) + geom_point()
$ income.graph

Step 9:


$ income.graph = income.graph + geom_smooth(method = "lm", col = "black")
$ income.graph

Step 10:
$ income.graph = income.graph + stat_regline_equation(label.x = 3, label.y = 7)
$ income.graph


Step 11: $ income.graph = income.graph + theme_bw() + labs(title = "Reported happiness as a function of income", x = "Income (x$10,000)", y = "Happiness Score (0 to 10)")
$ income.graph

Case Study 2 - Multiple linear regression


Step 1: $ heartdata = read.csv("E:/TYCS Rohan/Ds/practical/heartdata.csv")


$ heartdata

Step 2: $ head(heartdata)

Step 3: $ summary(heartdata)

Step 4: $ cor(heartdata$biking, heartdata$smoking)

Step 5: $ hist(heartdata$heart.disease)


Step 6: $ plot(heart.disease~biking, data = heartdata)

Step 7: $ plot(heart.disease~smoking, data = heartdata)


Step 8: $ heartlm = lm(heart.disease ~ biking + smoking, data = heartdata)


$ summary(heartlm)

Step 9: $ plotheartdata = expand.grid(biking = seq(min(heartdata$biking), max(heartdata$biking), length.out = 30), smoking = c(min(heartdata$smoking), mean(heartdata$smoking), max(heartdata$smoking)))
$ View(plotheartdata)


Step 10: $ plotheartdata$predictedvalue = predict.lm(heartlm, newdata = plotheartdata)
$ View(plotheartdata)

Step 11: $ plotheartdata$smoking = round(plotheartdata$smoking, digits = 2)
$ View(plotheartdata)


Step 12: $ plotheartdata$smoking = as.factor(plotheartdata$smoking)

Step 13: $ heart.plot = ggplot(heartdata, aes(x = biking, y = heart.disease)) + geom_point()
$ heart.plot


Step 14: $ heart.plot = heart.plot + geom_line(data = plotheartdata, aes(x = biking, y = predictedvalue, color = smoking), size = 1.25)
$ heart.plot

Step 15: $ heart.plot = heart.plot + theme_bw() + labs(title = "Rate of heart disease (% of population) as a function of biking to work and smoking", x = "Biking to work (% of population)", y = "Heart Disease (% of population)", color = "Smoking (% of population)")
$ heart.plot

Practical No – 07
Aim : Analysis of Logistic Regression

Introduction:
• Logistic regression is used to predict the class (or category) of individuals based on one
or multiple predictor variables (x). It is used to model a binary outcome, that is, a
variable which can have only two possible values: 0 or 1, yes or no, diseased or
non-diseased.

• Logistic regression belongs to a family, named Generalized Linear Model (GLM),


developed for extending the linear regression model to other situations. Other
synonyms are binary logistic regression, binomial logistic regression and logit model.

• Logistic regression does not directly return the class of observations. It allows us to
estimate the probability (p) of class membership. The probability will range between 0
and 1. You need to decide the threshold probability at which the category flips from
one to the other. By default, this is set to p = 0.5, as the sketch below shows.
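
A minimal sketch of that thresholding, assuming the lmode1 model and the x_testing set built in Steps 13-15 below:

# Probabilities of class membership, one per test row
$ p = predict(lmode1, x_testing, type = "response")
# Flip the category at the default threshold of p = 0.5
$ predicted_class = ifelse(p > 0.5, 1, 0)
$ predicted_class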

Step 1 : x = read.csv("E:/TYCS Rohan/Ds/grades.csv")


Step 2 : x


Step 3 : nrow(x)

Step 4 : s = sample(nrow(x),.7*nrow(x))

Step 5 : x_training = x[s,]

Step 6 : x_testing = x[-s,]

Step 7 : nrow(x_training)

Step 8 : nrow(x_testing)


Step 9 : x_training

Step 10 : x_testing

Step 11 : x2_training = x[s,]

Step 12 : x2_testing = x[-s,]

Step 13 : lmode1 = glm(Grade ~ Exam1, data = x2_training,family = binomial,control = list(maxit=100))

Step 14 : summary(lmode1)

Step 15 : prediction = predict(lmode1,x_testing,type = "response")

Step 16: prediction

Step 17 : actual_prediction <- data.frame(cbind(actuals = x2_testing$Grade,predicted = prediction))

Step 18 : actual_prediction

Step 19 : Repeat steps 13 to 18 for Exam 2, Exam 3, and Exam 4:

Exam 2


Exam 3

Exam 4

Practical No – 08
Aim: Implementation of Hypothesis Testing

Introduction:
An analyst performs hypothesis testing on a statistical sample to present evidence of the
plausibility of the null hypothesis. Measurements and analyses are conducted on a random
sample of the population to test a theory. Analysts use a random population sample to test two
hypotheses: the null and alternative hypotheses.
The null hypothesis is typically an equality hypothesis between population parameters; for
example, a null hypothesis may claim that the population mean return equals zero. The
alternate hypothesis is essentially the inverse of the null hypothesis (e.g., the population
mean return is not equal to zero). As a result, they are mutually exclusive, and only one can
be correct; one of the two, however, will always be correct.
Null Hypothesis and Alternate Hypothesis

• The Null Hypothesis is the assumption that the event will not occur. A null hypothesis
has no bearing on the study's outcome unless it is rejected. H0 is the symbol for it, and
it is pronounced H-naught.


• The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance
of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the
symbol for it.
Types of Hypothesis Testing

• The one-sample t-test is a statistical hypothesis test used to determine whether an


unknown population mean is different from a specific value.
• The two-sample t-test (also known as the independent samples t-test) is a method used
to test whether the unknown population means of two groups are equal or not.
Interpretation of P-Value

• A p-value is a statistical measurement used to validate a hypothesis against observed


data.
• A p-value measures the probability of obtaining the observed results, assuming that the
null hypothesis is true.
• A p-value greater than 0.05 means that the deviation from the null hypothesis is not
statistically significant, and the null hypothesis is not rejected.
• A p-value less than 0.05 is typically considered to be statistically significant, in which
case the null hypothesis should be rejected, as the sketch below shows.
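
In R, this decision rule can be read straight off the test object; a minimal sketch using the weights vector from Step 1 below:

$ res = t.test(x = weights, mu = 310, alternative = "two.sided")
$ res$p.value
# Reject the null hypothesis only when p < 0.05
$ if (res$p.value < 0.05) "Reject H0" else "Fail to reject H0"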

Step 1: $ weights = c(301, 305, 312, 315, 318, 319, 310, 318, 305, 313, 305, 305, 305)

Step 2: Install packages
$ install.packages("ggpubr")

Step 3: $ t.test(x = weights, mu = 310, alternative = "two.sided")


Step 4: $ women_weight = c(38.9, 61.2, 73.3, 21.8, 63.4, 63.6, 48.4, 48.8, 48.5)
$ men_weight = c(67.8, 60, 63.4, 76, 89.4, 73.3, 67.3, 61.3, 62.4)

Step 5: $ my_data = data.frame(group = rep(c("Woman", "Man"), each = 9), weight = c(women_weight, men_weight))
$ my_data

Step 6: $ install.packages("dplyr")
$ library(dplyr)



Step 7: $ group_by(my_data, group) %>% summarise(count = n(), mean = mean(weight, na.rm = TRUE), sd = sd(weight, na.rm = TRUE))

Step 8: Load libraries
$ library(ggplot2)
$ library(ggpubr)


Step 9: $ ggboxplot(my_data, x = "group", y = "weight", color = "group", palette = c("#00AFBB", "#E7B800"), ylab = "Weight", xlab = "Groups")
Step 10: $ with(my_data, shapiro.test(weight[group == "Man"]))

Step 11: $ with(my_data, shapiro.test(weight[group == "Woman"]))


Step 12: $ res.test = var.test(weight ~ group, data = my_data)
$ res.test



Practical No – 09
Aim: Implementation of Analysis of Variance

Introduction:
ANOVA, also known as Analysis of Variance, is used to investigate relations between
categorical variables and a continuous variable in R programming. It is a type of
hypothesis testing for population variance.
R – ANOVA Test

ANOVA test involves setting up:


• Null Hypothesis: All population means are equal.
• Alternate Hypothesis: At least one population mean is different from the others.

ANOVA tests are of two types:


• One way ANOVA: It takes one categorical factor into consideration.
• Two way ANOVA: It takes two categorical factors into consideration (see the sketch below).
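
A minimal sketch of both forms, assuming a data frame like the cropdata set loaded in Step 6 below (columns yield, fertilizer, and density):

# One way ANOVA: a single categorical factor
$ oneway = aov(yield ~ factor(fertilizer), data = cropdata)
$ summary(oneway)
# Two way ANOVA: two categorical factors
$ twoway = aov(yield ~ factor(fertilizer) + factor(density), data = cropdata)
$ summary(twoway)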


Step 1: $ mtcars = read.csv("E:/TYCS Rohan/Ds/practical/mtcars.csv") $ mtcars

Step 2: $ head(mtcars)

Step 3 :$ boxplot(mtcars$disp~factor(mtcars$gear), xlab="Gear", ylab="Displacement")


Step 4: $ install.packages("dplyr")
$ library(dplyr)

Step 5: $ mtcars_anova = aov(mtcars$disp ~ factor(mtcars$gear))
$ summary(mtcars_anova)

Step 6: $ cropdata = read.csv("E:/TYCS Rohan/Ds/practical/cropdata.csv") $ cropdata


Step 7: $ head(cropdata)

Step 8: $ summary(cropdata)

Step 9: $ oneway = aov(yield ~ fertilizer, data = cropdata)
$ summary(oneway)


Step 10: $ twoway = aov(yield ~ fertilizer + density, data = cropdata)
$ summary(twoway)

Step 11:
$ mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear) + factor(mtcars$am))
$ summary(mtcars_aov)

Practical No – 10
Aim : Implementation of Decision Tree

Introduction:
Decision trees are used for classification and regression problems, particularly where the
relationship between the dependent (response) variable and the independent
(explanatory/predictor) variables is nonlinear in nature.
A decision tree is a graph that represents choices and their results in the form of a tree.
The nodes in the graph represent an event or choice and the edges of the graph represent
the decision rules or conditions.
The following packages are used to build and display decision trees in R, hence we need to
install and load these libraries:

• library(rpart)
• library(rpart.plot)
• library(caret)
• library(tree)
In order to grow our decision tree, we have to first load the rpart package. Then we can use the
rpart() function, specifying the model formula, data, and method parameters.
rpart has some default parameters that can prevent our tree from growing, namely minsplit and
minbucket: minsplit is “the minimum number of observations that must exist in a node in
order for a split to be attempted”, and minbucket is “the minimum number of observations in any
terminal node”.
rpart.plot is used to display the decision tree on screen. A sketch of these control parameters follows.
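
A sketch of how these two parameters change tree growth, using the built-in iris data purely for illustration:

$ library(rpart)
# Defaults: minsplit = 20 and minbucket = round(minsplit/3), so small nodes are not split
$ fit_default = rpart(Species ~ ., data = iris, method = "class")
# Relaxed control allows splits even on very small nodes
$ fit_small = rpart(Species ~ ., data = iris, method = "class",
                    control = rpart.control(minsplit = 2, minbucket = 1))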

Step 1 : Install packages
$ install.packages("rattle")
$ install.packages("rpart")
$ install.packages("rpart.plot")
$ install.packages("tree")

Step 2 : Load Libraries
$ library(rattle)
$ library(rpart)
$ library(tree)

Step 3 : $ x = read.csv("E:/TYCS Rohan/Ds/weather.csv")
$ x


Step 4 :
$ sample_weather = sample(nrow(x), .7*nrow(x))
$ weather_training = x[sample_weather,]
$ weather_testing = x[-sample_weather,]
$ weather_training

$ weather_testing

Step 5 : $ dtree = rpart(play ~ ., data = weather_training, method = "class", control = rpart.control(minsplit = 1, minbucket = 1))

Step 6 : $ library(rpart.plot)
$ rpart.plot(dtree)


Step 7 : $ p = predict(dtree, weather_testing, type = "class")
$ p


Step 8 : $ weather_testing

Step 9 : $ confusion_mt = table(weather_testing$play, p)
$ confusion_mt

Step 10 : $ ac_test = sum(diag(confusion_mt))/sum(confusion_mt)
$ ac_test

Output -> 1

Step 11 : $ ac_test*100
Output -> [1] 100

Step 12 : $ print(paste("Accuracy for Testing data set: ", ac_test*100,"%")) Output -> "Accuracy for
Testing data set: 100 %"

