Practical No – 01
Aim: Data collection, Data curation and management for Unstructured data
(NoSQL) using Apache CouchDB
Introduction:
Apache CouchDB is an open-source document-oriented NoSQL database, implemented in
Erlang. CouchDB uses multiple formats and protocols to store, transfer, and process its data: it
uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
Unlike a relational database, a CouchDB database does not store data and relationships in
tables. Instead, each database is a collection of independent documents. Each document
maintains its own data and self-contained schema. An application may access multiple
databases, such as one stored on a user's mobile phone and another on a server. Document
metadata contains revision information, making it possible to merge any differences that may
have occurred while the databases were disconnected.
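Because the entire API is exposed over HTTP and documents are plain JSON, any HTTP client can manage a CouchDB database. As a minimal sketch, the R snippet below creates the student database used later in this practical and inserts and fetches one document; it assumes the httr package (not used in the steps below, which rely on curl and the Fauxton UI) and the admin/123456 credentials from the installation steps. The document fields are made up for illustration.

# Sketch: CouchDB's HTTP API from R with httr (assumed package).
library(httr)

base <- "http://127.0.0.1:5984"
auth <- authenticate("admin", "123456")

# Create a database named "student"
PUT(paste0(base, "/student"), auth)

# Insert a JSON document with id 1 (illustrative fields)
PUT(paste0(base, "/student/1"), auth,
    body = list(name = "Asha", marks = 88), encode = "json")

# Fetch it back; the response carries the _rev revision metadata
doc <- content(GET(paste0(base, "/student/1"), auth))
print(doc)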
Architecture:
4. Replica Database: It is used for replicating data to a local or remote database and
synchronizing design documents.
Features of CouchDB
Advantages of CouchDB
Step 2 : Install the app, provide the username and password, and hit Next.
Username = admin and password = 123456
Step 3 : Open a browser and log in to http://127.0.0.1:5984/_utils/ with the username and password.
$ curl -X DELETE "http://admin:123456@127.0.0.1:5984/student/2?rev=2-5d3f522eabd22f1b31679d071f6bcaac"
Practical No – 02
Aim: Data collection, Data curation and management for Large-scale Data system
using MongoDB
Introduction:
MongoDB is a NoSQL database that stores data in the form of key-value pairs. It
is an open-source document database that provides high performance and
scalability, along with data modelling and data management of huge data sets
in an enterprise application.
MongoDB also provides the feature of auto-scaling. MongoDB is a cross-platform
database and can be installed on different platforms such as Windows and Linux.
Document Database:
A record in MongoDB is a document, which is a data structure composed of field
and value pairs. MongoDB documents are similar to JSON objects. The values of
fields may include other documents, arrays, and arrays of documents.
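As a quick illustration of such documents, the sketch below inserts and queries one record from R using the mongolite package (an assumption; the steps in this practical use mongosh and Compass instead). The collection name matches the library36 database created later; the document fields are made up.

# Sketch: MongoDB documents from R via the mongolite package (assumed).
library(mongolite)

m <- mongo(collection = "library36", db = "library36",
           url = "mongodb://127.0.0.1:27017")

# Each record is a document of field-value pairs, similar to a JSON object
m$insert('{"title": "React", "author": "A. Author", "copies": 3}')

# Query by field; values may themselves be documents or arrays
m$find('{"title": "React"}')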
C:\Program Files\MongoDB\Server\6.0\data\
Step 4 : In Environment Variables, click New, set the variable name to Path and the variable value to the
location where MongoDB is installed: "C:\Program Files\MongoDB\Server\6.0\bin"
Note : If the server closes automatically, create a new folder named "data" in the C drive, and inside
"data" create a folder named "db".
Step 7 : Go to the folder where you extracted the zip file, open the bin folder, and click on mongosh.
Step 9 : To create a new database, type $ use library36 and then $ show dbs
$ db.library36.deleteOne({title:"React"})
Step 15 : Open MongoDB Compass, connect, and select your database and collection.
Practical No – 03
Aim: Implementation of Principal Component Analysis
Introduction:
PCA is commonly used for dimensionality reduction by projecting each data point onto
only the first few principal components (in most cases the first and second)
to obtain lower-dimensional data while keeping as much of the data’s variation as
possible.
The principal components are often obtained by eigen decomposition of the data
covariance matrix or singular value decomposition (SVD) of the data matrix.
As an example, consider the following information for a given client.
These attributes have different scales, and performing PCA on such data will lead to a
biased result. This is where data normalization comes in. It ensures that each attribute
contributes at the same level, preventing one variable from dominating the others. Each
variable is normalized by subtracting its mean and dividing by its standard deviation.
As the name suggests, this step is about computing the covariance matrix from the
normalized data. This is a symmetric matrix, and each element (i, j) corresponds to the
covariance between variables i and j.
There are as many pairs of eigenvectors and eigenvalues as there are variables in
the data; with only monthly expenses, age, and rate, there will be three pairs. Not all
pairs are equally relevant: the eigenvector with the highest eigenvalue corresponds to
the first principal component, the eigenvector with the second highest eigenvalue gives
the second principal component, and so on.
This step involves re-orienting the original data onto a new subspace defined by the
principal components. This reorientation is done by multiplying the original data by the
previously computed eigenvectors.
It is important to remember that this transformation does not modify the original data
itself but instead provides a new perspective to better represent the data.
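A minimal R sketch of these four steps (normalization, covariance matrix, eigen decomposition, projection) is given below; the client data frame is made up for illustration and is not this practical's dataset.

# Sketch: PCA from scratch on made-up client data.
x <- data.frame(expenses = c(500, 700, 250, 900),
                age      = c(25, 32, 41, 29),
                rate     = c(3.5, 4.1, 2.8, 4.6))

# Step 1: normalize each variable (subtract mean, divide by sd)
z <- scale(x)

# Step 2: covariance matrix of the normalized data (symmetric)
cv <- cov(z)

# Step 3: eigenvectors and eigenvalues; highest eigenvalue = first PC
e <- eigen(cv)
e$values                    # variation captured by each component

# Step 4: project the original data onto the principal components
scores <- z %*% e$vectors
head(scores)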
$ x = read.csv("E:/TYCS Rohan/Ds/subjects.csv")
$ x
$ install.packages("FactoMineR")
Step 8 : Type
$ datapca$var$coord
$ library("factoextra")
Step 9 : Type
$ library("ggplot2")
$ fviz_eig(datapca, addlabels = TRUE, ylim = c(0, 50))
$ fviz_eig(datapca, addlabels = TRUE, ylim = c(0, 80))
Step 12 : Now
$ y = iris[, -5]
$ y
Step 14 : Type
$ irispca = PCA(y, ncp = 3, graph = TRUE)
Step 15 : Type
$ irispca$var$coord
Step 16 : Type
$ fviz_eig(irispca, addlabels = TRUE, ylim = c(0, 80))
Practical No – 04
Aim: Implementation of Clustering
Introduction:
One of the simplest clustering methods is K-means, the most widely used method for
splitting a dataset into a set of k groups.
The number of clusters is decided first. Cluster centers are selected at random, as far
from one another as possible; the distance between each data point and each center is
calculated using Euclidean distance, and the data point is assigned to the cluster whose
center is nearest to it. This process is repeated until the cluster centers no longer
change and the data points remain in the same clusters.
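A minimal sketch of this loop using R's built-in kmeans() on made-up two-dimensional data (the data and k = 3 are illustrative):

# Sketch: k-means on synthetic 2-D data; kmeans() repeats the
# assign-to-nearest-center / recompute-centers loop until stable.
set.seed(20)
pts <- data.frame(x = c(rnorm(20, 0), rnorm(20, 5), rnorm(20, 10)),
                  y = c(rnorm(20, 0), rnorm(20, 5), rnorm(20, 0)))

km <- kmeans(pts, centers = 3)   # k = 3 clusters
km$centers                       # final cluster centers
km$cluster                       # cluster assigned to each point
plot(pts, col = km$cluster)      # visualize the grouping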
Step 1
to perform clustering
Step 3
The next step is to use the K-means algorithm. kmeans() is the function we use, which
takes as parameters the data and the number of clusters (groups). Here our data is the x
object and we will have k = 3 clusters.
Step 4
Case Study 1
Step 1 : $ df = read.csv("E:/TYCS Rohan/Ds/prac4/age.csv")
$ df
Step 2: $ plot(df)
Step 3: $ boxplot(df)
Step 4: $ set.seed(20)
Step 5: $ c = kmeans(df[,1:2],3)
Step 6: $ install.packages("factoextra")
$ library("factoextra")
Case Study 2
Step 9: $ head(iris)
Practical No – 05
Aim: Analysis of Time Series Forecasting
Introduction:
Time series forecasting is the process of using historical data to make predictions
about future events. It is commonly used in fields such as finance, economics, and
weather forecasting. R is a powerful programming language and software
environment for statistical computing and graphics that is widely used for time
series forecasting.
There are several R packages available for time series forecasting, including:
“forecast”: This package provides a wide range of methods for time series
forecasting, including exponential smoothing, ARIMA, and neural networks.
“tseries”: This package provides functions for time series analysis and forecasting,
including functions for decomposing time series data, and for fitting and
forecasting models such as ARIMA.
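As a minimal sketch of how these packages are used, the snippet below fits an ARIMA model to the built-in AirPassengers series (analyzed in the steps that follow) with forecast's auto.arima() and predicts twelve months ahead; the horizon is illustrative.

# Sketch: ARIMA forecasting with the forecast package.
library(forecast)

fit <- auto.arima(AirPassengers)   # automatically select an ARIMA model
fc  <- forecast(fit, h = 12)       # forecast the next 12 months
plot(fc)                           # history plus forecast intervals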
Step 1: Install and load the required packages
$ install.packages("timeSeries")
$ install.packages("forecast")
$ library(timeSeries)
$ library(timeDate)
$ library(forecast)
Step 3:
$ data = table(AirPassengers)
$ data
Step 4: $ View(data)
Step 5: $ AirPassengers
Step 6: $ sum(is.na(AirPassengers))
Step 7: $ frequency(AirPassengers)
Step 8: $ summary(AirPassengers)
Step 9: $ install.packages("tseries")
$ library(tseries)
$ fit
$ prediction
$ conPred
Practical No – 06
Aim: Analysis of Simple/Multiple Linear Regression
Introduction:
Linear regression is a regression model that uses a straight line to describe the relationship between
variables. It finds the line of best fit through your data by searching for the value of the regression
coefficient(s) that minimizes the total error of the model.
Once we have built a statistically significant model, it is possible to use it to predict future
outcomes on the basis of new x values.
In this practical we have used Income data set. This dataset contains observations about
income (in a range of $15k to $75k) and happiness (rated on a scale of 1 to 10) in an imaginary
sample of 500 people. The income values are divided by 10,000 to make the income data
match the scale of the happiness scores (so a value of $2 represents $20,000, $3 is $30,000,
etc.)
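For orientation, here is a minimal sketch on simulated data with the same shape as this dataset (income in $10k units, happiness on a 1-10 scale); the values and coefficients are made up, not the practical's results.

# Sketch: simple linear regression on simulated income/happiness data.
set.seed(1)
income    <- runif(500, 1.5, 7.5)               # $15k-$75k in $10k units
happiness <- 0.7 * income + rnorm(500, 0, 0.8)  # roughly linear + noise

model <- lm(happiness ~ income)   # line of best fit
summary(model)                    # coefficients, R-squared, p-values

# Predict happiness for new income values
predict(model, newdata = data.frame(income = c(2, 5)))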
$ head(incomedata)
Step 4: $ summary(incomedata)
Step 5: $ hist(incomedata$happiness)
Step 7:
$ lml = lm(happiness ~ income, data = incomedata)
$ summary(lml)
Step 8:
$ income.graph = ggplot(incomedata, aes(x = income, y = happiness)) + geom_point()
$ income.graph
Step 9:
Step 10:
$ income.graph = income.graph + stat_regline_equation(label.x = 3, label.y = 7)
$ income.graph
Step 2: $ head(heartdata)
Step 3: $ summary(heartdata)
Step 5: $ hist(heartdata$heart.disease)
$ View(plotheartdata)
Practical No – 07
Aim : Analysis of Logistic Regression
Introduction:
• Logistic regression is used to predict the class (or category) of individuals based on one
or multiple predictor variables (x). It is used to model a binary outcome, that is, a
variable which can have only two possible values: 0 or 1, yes or no, diseased or
non-diseased.
• Logistic regression does not directly return the class of observations. It allows us to
estimate the probability (p) of class membership. The probability ranges between 0
and 1. You need to decide the threshold probability at which the category flips from
one to the other. By default, this is set to p = 0.5
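A minimal sketch of these two points with glm() on made-up data follows: the model returns probabilities, which are then cut at the default threshold p = 0.5 to obtain classes. The variable names are illustrative, not this practical's exam dataset.

# Sketch: logistic regression returns probabilities, not classes.
set.seed(7)
marks <- runif(100, 0, 100)
grade <- rbinom(100, 1, plogis(0.1 * (marks - 50)))   # binary outcome

lmodel <- glm(grade ~ marks, family = binomial)   # fit the model
p <- predict(lmodel, type = "response")           # probabilities in (0, 1)

pred_class <- ifelse(p > 0.5, 1, 0)   # flip category at p = 0.5
table(actual = grade, predicted = pred_class)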
Step 3 : nrow(x)
Step 4 : s = sample(nrow(x),.7*nrow(x))
Step 7 : nrow(x_training)
Step 8 : nrow(x_testing)
Step 9 : x_training
Step 10 : x_testing
Step 14 : summary(lmode1)
Step 17 : actual_prediction <- data.frame(cbind(actuals = x2_testing$Grade,predicted = prediction))
Step 18 : actual_prediction
Exam 2
Exam 3
Exam 4
Practical No – 08
Aim: Implementation of Hypothesis Testing
Introduction:
An analyst performs hypothesis testing on a statistical sample to present evidence of the
plausibility of the null hypothesis. Measurements and analyses are conducted on a random
sample of the population to test a theory. Analysts use a random population sample to test two
hypotheses: the null and alternative hypotheses.
The null hypothesis is typically an equality hypothesis between population parameters; for
example, a null hypothesis may claim that the population mean return equals zero. The
alternative hypothesis is essentially the inverse of the null hypothesis (e.g., the population
mean return is not equal to zero). As a result, they are mutually exclusive, and only one can be
correct. One of the two possibilities, however, will always be correct.
Null Hypothesis and Alternate Hypothesis
• The Null Hypothesis is the assumption that the event will not occur. A null hypothesis
has no bearing on the study's outcome unless it is rejected. H0 is the symbol for it, and
it is pronounced H-naught.
• The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance
of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the
symbol for it.
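A minimal sketch of both hypotheses with a one-sample t-test in R, reusing the weights vector from Step 1 below; the null value mu = 310 is illustrative.

# Sketch: H0 (population mean equals 310) vs H1 (it does not).
weights <- c(301, 305, 312, 315, 318, 319, 310, 318, 305, 313, 305, 305, 305)

res <- t.test(weights, mu = 310)   # test H0: mean = 310
res$p.value                        # a small p-value rejects H0 in favor of H1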
Types of Hypothesis Testing
Step 1: $ weights = c(301, 305, 312, 315, 318, 319, 310, 318, 305, 313, 305, 305, 305)
Step 4: $ women_weight = c(38.9, 61.2, 73.3, 21.8, 63.4, 63.6, 48.4, 48.8, 48.5)
$ men_weight =
Practical No – 09
Aim: Implementation of Analysis of Variance
Introduction:
ANOVA, also known as Analysis of Variance, is used to investigate relations between
categorical variables and a continuous variable in R programming. It is a type of hypothesis
testing for population variance.
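As a minimal sketch, the snippet below runs a one-way ANOVA on the built-in mtcars data used later in this practical, relating the continuous disp to the categorical gear:

# Sketch: one-way ANOVA of a continuous variable across a factor.
model <- aov(disp ~ factor(gear), data = mtcars)
summary(model)   # F statistic and p-value for the gear effect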
R – ANOVA Test
Step 2: $ head(mtcars)
Step 7: $ head(cropdata)
Step 8: $ summary(cropdata)
Step 11:
$ mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear) + factor(mtcars$am))
$ summary(mtcars_aov)
Practical No – 10
Aim : Implementation of Decision Tree
Introduction:
Decision trees are generally used for regression problems where the relationship between the
dependent (response) variable and the independent (explanatory/predictor) variables is
nonlinear in nature.
A decision tree is a graph that represents choices and their results in the form of a tree. The
nodes of the graph represent an event or choice, and the edges represent the decision rules
or conditions.
The following packages are used to build and display a decision tree in R, hence we need to install and
load these libraries:
• library(rpart)
• library(rpart.plot)
• library(caret)
• library(tree)
In order to grow our decision tree, we first have to load the rpart package. Then we can use the
rpart() function, specifying the model formula, data, and method parameters.
rpart has some default parameters that prevented our tree from growing, namely minsplit and
minbucket. minsplit is “the minimum number of observations that must exist in a node in
order for a split to be attempted”; minbucket is “the minimum number of observations in any
terminal node”.
rpart.plot is used to display the decision tree on screen.
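A minimal sketch combining these pieces on the built-in iris data (illustrative; the practical itself uses a weather dataset): rpart() grows the tree, rpart.control() sets minsplit and minbucket, and rpart.plot() draws it.

# Sketch: growing and plotting a classification tree with rpart.
library(rpart)
library(rpart.plot)

# minsplit/minbucket control when a node may split and how small a leaf may be
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 10, minbucket = 5))

rpart.plot(fit)                             # display the tree on screen
predict(fit, iris[1:3, ], type = "class")   # predicted classes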
Step 1 : Install packages
$ install.packages("rattle")
$ install.packages("rpart")
$ install.packages("tree")
$ weather_testing
Step 12 : $ print(paste("Accuracy for Testing data set: ", ac_test*100, "%"))
Output -> "Accuracy for Testing data set: 100 %"