Practical No – 01
Aim: Data collection, Data curation and management for Unstructured data
(NoSQL) using Apache CouchDB
Introduction:
Apache CouchDB is an open-source document-oriented NoSQL database, implemented in
Erlang. CouchDB uses multiple formats and protocols to store, transfer, and process its data: it
uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
Unlike a relational database, a CouchDB database does not store data and relationships in
tables. Instead, each database is a collection of independent documents. Each document
maintains its own data and self-contained schema. An application may access multiple
databases, such as one stored on a user's mobile phone and another on a server. Document
metadata contains revision information, making it possible to merge any differences that may
have occurred while the databases were disconnected.
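Because the entire API is exposed over HTTP and documents are plain JSON, any HTTP client can manage a CouchDB database. As a minimal sketch, the R snippet below creates the student database used later in this practical and inserts and fetches one document; it assumes the httr package (not used in the steps below, which rely on curl and the Fauxton UI) and the admin/123456 credentials from the installation steps. The document fields are made up for illustration.

# Sketch: CouchDB's HTTP API from R with httr (assumed package).
library(httr)

base <- "http://127.0.0.1:5984"
auth <- authenticate("admin", "123456")

# Create a database named "student"
PUT(paste0(base, "/student"), auth)

# Insert a JSON document with id 1 (illustrative fields)
PUT(paste0(base, "/student/1"), auth,
    body = list(name = "Asha", marks = 88), encode = "json")

# Fetch it back; the response carries the _rev revision metadata
doc <- content(GET(paste0(base, "/student/1"), auth))
print(doc)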
Architecture:
4. Replica Database: It is used for replicating data to a local or remote database and
synchronizing design documents.
Features of CouchDB
Advantages of CouchDB
Step 2 : Install the app, provide the username and password, and hit Next.
Username = admin and password = 123456
Step 3 : Open a browser and log in to http://127.0.0.1:5984/_utils/ with the username and password.
$ curl -X DELETE "http://admin:123456@127.0.0.1:5984/student/2?rev=2-5d3f522eabd22f1b31679d071f6bcaac"
Practical No – 02
Aim: Data collection, Data curation and management for Large-scale Data system
using MongoDB
Introduction:
MongoDB is a NoSQL database that stores data in the form of key-value pairs. It
is an open-source document database that provides high performance and
scalability, along with data modelling and data management of huge data sets
in an enterprise application.
MongoDB also provides the feature of auto-scaling. MongoDB is a cross-platform
database and can be installed on different platforms such as Windows and Linux.
Document Database:
A record in MongoDB is a document, which is a data structure composed of field
and value pairs. MongoDB documents are similar to JSON objects. The values of
fields may include other documents, arrays, and arrays of documents.
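As a quick illustration of such documents, the sketch below inserts and queries one record from R using the mongolite package (an assumption; the steps in this practical use mongosh and Compass instead). The collection name matches the library36 database created later; the document fields are made up.

# Sketch: MongoDB documents from R via the mongolite package (assumed).
library(mongolite)

m <- mongo(collection = "library36", db = "library36",
           url = "mongodb://127.0.0.1:27017")

# Each record is a document of field-value pairs, similar to a JSON object
m$insert('{"title": "React", "author": "A. Author", "copies": 3}')

# Query by field; values may themselves be documents or arrays
m$find('{"title": "React"}')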
C:\Program Files\MongoDB\Server\6.0\data\
Step 4 : In Environment Variables, click New, set the variable name to Path and the variable value to the
location where MongoDB is installed: "C:\Program Files\MongoDB\Server\6.0\bin"
Note : If the server closes automatically, create a new folder named "data" in the C drive, and inside
"data" create a folder named "db".
Step 7 : Go to the folder where you extracted the zip file, open the bin folder, and click on mongosh.
Step 9 : To create a new database, type $ use library36 and then $ show dbs
$ db.library36.deleteOne({title:"React"})
Step 15 : Open MongoDB Compass, connect, and select your database and collection.
Practical No – 03
Aim: Implementation of Principal Component Analysis
Introduction:
PCA is commonly used for dimensionality reduction by projecting each data point onto
only the first few principal components (in most cases the first and second)
to obtain lower-dimensional data while keeping as much of the data’s variation as
possible.
The principal components are often obtained by eigen decomposition of the data
covariance matrix or singular value decomposition (SVD) of the data matrix.
As an example, consider the following information for a given client.
These attributes have different scales, and performing PCA on such data will lead to a
biased result. This is where data normalization comes in. It ensures that each attribute
contributes at the same level, preventing one variable from dominating the others. Each
variable is normalized by subtracting its mean and dividing by its standard deviation.
As the name suggests, this step is about computing the covariance matrix from the
normalized data. This is a symmetric matrix, and each element (i, j) corresponds to the
covariance between variables i and j.
There are as many pairs of eigenvectors and eigenvalues as there are variables in
the data; with only monthly expenses, age, and rate, there will be three pairs. Not all
pairs are equally relevant: the eigenvector with the highest eigenvalue corresponds to
the first principal component, the eigenvector with the second highest eigenvalue gives
the second principal component, and so on.
This step involves re-orienting the original data onto a new subspace defined by the
principal components. This reorientation is done by multiplying the original data by the
previously computed eigenvectors.
It is important to remember that this transformation does not modify the original data
itself but instead provides a new perspective to better represent the data.
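A minimal R sketch of these four steps (normalization, covariance matrix, eigen decomposition, projection) is given below; the client data frame is made up for illustration and is not this practical's dataset.

# Sketch: PCA from scratch on made-up client data.
x <- data.frame(expenses = c(500, 700, 250, 900),
                age      = c(25, 32, 41, 29),
                rate     = c(3.5, 4.1, 2.8, 4.6))

# Step 1: normalize each variable (subtract mean, divide by sd)
z <- scale(x)

# Step 2: covariance matrix of the normalized data (symmetric)
cv <- cov(z)

# Step 3: eigenvectors and eigenvalues; highest eigenvalue = first PC
e <- eigen(cv)
e$values                    # variation captured by each component

# Step 4: project the original data onto the principal components
scores <- z %*% e$vectors
head(scores)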
$ x = read.csv("E:/TYCS Rohan/Ds/subjects.csv")
$ x
$ install.packages("FactoMineR")
Step 8 : Type
$ datapca$var$coord
$ library("factoextra")
Step 9 : Type
$ library("ggplot2")
$ fviz_eig(datapca, addlabels = TRUE, ylim = c(0, 50))
$ fviz_eig(datapca, addlabels = TRUE, ylim = c(0, 80))
Step 12 : Now
$ y = iris[, -5]
$ y
Step 14 : Type
$ irispca = PCA(y, ncp = 3, graph = TRUE)
Step 15 : Type
$ irispca$var$coord
Step 16 : Type
$ fviz_eig(irispca, addlabels = TRUE, ylim = c(0, 80))
Practical No – 04
Aim: Implementation of Clustering
Introduction:
One of the simplest clustering methods is K-means, the most widely used method for
splitting a dataset into a set of k groups.
The number of clusters is decided first. Cluster centers are selected at random, as far
from one another as possible; the distance between each data point and each center is
calculated using Euclidean distance, and the data point is assigned to the cluster whose
center is nearest to it. This process is repeated until the cluster centers no longer
change and the data points remain in the same clusters.
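A minimal sketch of this loop using R's built-in kmeans() on made-up two-dimensional data (the data and k = 3 are illustrative):

# Sketch: k-means on synthetic 2-D data; kmeans() repeats the
# assign-to-nearest-center / recompute-centers loop until stable.
set.seed(20)
pts <- data.frame(x = c(rnorm(20, 0), rnorm(20, 5), rnorm(20, 10)),
                  y = c(rnorm(20, 0), rnorm(20, 5), rnorm(20, 0)))

km <- kmeans(pts, centers = 3)   # k = 3 clusters
km$centers                       # final cluster centers
km$cluster                       # cluster assigned to each point
plot(pts, col = km$cluster)      # visualize the grouping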
Step 1
to perform clustering
Step 3
The next step is to use the K-means algorithm. kmeans() is the function we use, which
takes as parameters the data and the number of clusters (groups). Here our data is the x
object and we will have k = 3 clusters.
Step 4
Case Study 1
Step 1 : $ df = read.csv("E:/TYCS Rohan/Ds/prac4/age.csv")
$ df
Step 2: $ plot(df)
Step 3: $ boxplot(df)
Step 4: $ set.seed(20)
Step 5: $ c = kmeans(df[,1:2],3)
Step 6: $ install.packages("factoextra")
$ library("factoextra")
Case Study 2
Step 9: $ head(iris)
Practical No – 05
Aim: Analysis of Time Series Forecasting
Introduction:
Time series forecasting is the process of using historical data to make predictions
about future events. It is commonly used in fields such as finance, economics, and
weather forecasting. R is a powerful programming language and software
environment for statistical computing and graphics that is widely used for time
series forecasting.
There are several R packages available for time series forecasting, including:
“forecast”: This package provides a wide range of methods for time series
forecasting, including exponential smoothing, ARIMA, and neural networks.
“tseries”: This package provides functions for time series analysis and forecasting,
including functions for decomposing time series data, and for fitting and
forecasting models such as ARIMA.
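As a minimal sketch of how these packages are used, the snippet below fits an ARIMA model to the built-in AirPassengers series (analyzed in the steps that follow) with forecast's auto.arima() and predicts twelve months ahead; the horizon is illustrative.

# Sketch: ARIMA forecasting with the forecast package.
library(forecast)

fit <- auto.arima(AirPassengers)   # automatically select an ARIMA model
fc  <- forecast(fit, h = 12)       # forecast the next 12 months
plot(fc)                           # history plus forecast intervals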
Step 1: Install and load the required packages
$ install.packages("timeSeries")
$ install.packages("forecast")
$ library(timeSeries)
$ library(timeDate)
$ library(forecast)
Step 3:
$ data = table(AirPassengers)
$ data
Step 4: $ View(data)
Step 5: $ AirPassengers
Step 6: $ sum(is.na(AirPassengers))
Step 7: $ frequency(AirPassengers)
Step 8: $ summary(AirPassengers)
Step 9: $ install.packages("tseries")
$ library(tseries)
$ fit
$ prediction
$ conPred
Practical No – 06
Aim: Analysis of Simple/Multiple Linear Regression
Introduction:
Linear regression is a regression model that uses a straight line to describe the relationship between
variables. It finds the line of best fit through your data by searching for the value of the regression
coefficient(s) that minimizes the total error of the model.
Once we have built a statistically significant model, it is possible to use it to predict future
outcomes on the basis of new x values.
In this practical we have used Income data set. This dataset contains observations about
income (in a range of $15k to $75k) and happiness (rated on a scale of 1 to 10) in an imaginary
sample of 500 people. The income values are divided by 10,000 to make the income data
match the scale of the happiness scores (so a value of $2 represents $20,000, $3 is $30,000,
etc.)
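For orientation, here is a minimal sketch on simulated data with the same shape as this dataset (income in $10k units, happiness on a 1-10 scale); the values and coefficients are made up, not the practical's results.

# Sketch: simple linear regression on simulated income/happiness data.
set.seed(1)
income    <- runif(500, 1.5, 7.5)               # $15k-$75k in $10k units
happiness <- 0.7 * income + rnorm(500, 0, 0.8)  # roughly linear + noise

model <- lm(happiness ~ income)   # line of best fit
summary(model)                    # coefficients, R-squared, p-values

# Predict happiness for new income values
predict(model, newdata = data.frame(income = c(2, 5)))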
$ head(incomedata)
Step 4: $ summary(incomedata)
Step 5: $ hist(incomedata$happiness)
Step 7:
$ lml = lm(happiness ~ income, data = incomedata)
$ summary(lml)
Step 8:
$ income.graph = ggplot(incomedata, aes(x = income, y = happiness)) + geom_point()
$ income.graph
Step 9:
Step 10:
$ income.graph = income.graph + stat_regline_equation(label.x = 3, label.y = 7)
$ income.graph
Step 2: $ head(heartdata)
Step 3: $ summary(heartdata)
Step 5: $ hist(heartdata$heart.disease)
$ View(plotheartdata)
Practical No – 07
Aim : Analysis of Logistic Regression
Introduction:
• Logistic regression is used to predict the class (or category) of individuals based on one
or multiple predictor variables (x). It is used to model a binary outcome, that is, a
variable which can have only two possible values: 0 or 1, yes or no, diseased or
non-diseased.
• Logistic regression does not directly return the class of observations. It allows us to
estimate the probability (p) of class membership. The probability ranges between 0
and 1. You need to decide the threshold probability at which the category flips from
one to the other. By default, this is set to p = 0.5
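A minimal sketch of these two points with glm() on made-up data follows: the model returns probabilities, which are then cut at the default threshold p = 0.5 to obtain classes. The variable names are illustrative, not this practical's exam dataset.

# Sketch: logistic regression returns probabilities, not classes.
set.seed(7)
marks <- runif(100, 0, 100)
grade <- rbinom(100, 1, plogis(0.1 * (marks - 50)))   # binary outcome

lmodel <- glm(grade ~ marks, family = binomial)   # fit the model
p <- predict(lmodel, type = "response")           # probabilities in (0, 1)

pred_class <- ifelse(p > 0.5, 1, 0)   # flip category at p = 0.5
table(actual = grade, predicted = pred_class)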
Step 3 : nrow(x)
Step 4 : s = sample(nrow(x),.7*nrow(x))
Step 7 : nrow(x_training)
Step 8 : nrow(x_testing)
Step 9 : x_training
Step 10 : x_testing
Step 14 : summary(lmode1)
Step 17 : actual_prediction <- data.frame(cbind(actuals = x2_testing$Grade,predicted = prediction))
Step 18 : actual_prediction
Exam 2
Exam 3
Exam 4
Practical No – 08
Aim: Implementation of Hypothesis Testing
Introduction:
An analyst performs hypothesis testing on a statistical sample to present evidence of the
plausibility of the null hypothesis. Measurements and analyses are conducted on a random
sample of the population to test a theory. Analysts use a random population sample to test two
hypotheses: the null and alternative hypotheses.
The null hypothesis is typically an equality hypothesis between population parameters; for
example, a null hypothesis may claim that the population mean return equals zero. The
alternative hypothesis is essentially the inverse of the null hypothesis (e.g., the population
mean return is not equal to zero). As a result, they are mutually exclusive, and only one can be
correct. One of the two possibilities, however, will always be correct.
Null Hypothesis and Alternate Hypothesis
• The Null Hypothesis is the assumption that the event will not occur. A null hypothesis
has no bearing on the study's outcome unless it is rejected. H0 is the symbol for it, and
it is pronounced H-naught.
• The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance
of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the
symbol for it.
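A minimal sketch of both hypotheses with a one-sample t-test in R, reusing the weights vector from Step 1 below; the null value mu = 310 is illustrative.

# Sketch: H0 (population mean equals 310) vs H1 (it does not).
weights <- c(301, 305, 312, 315, 318, 319, 310, 318, 305, 313, 305, 305, 305)

res <- t.test(weights, mu = 310)   # test H0: mean = 310
res$p.value                        # a small p-value rejects H0 in favor of H1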
Types of Hypothesis Testing
Step 1: $ weights = c(301, 305, 312, 315, 318, 319, 310, 318, 305, 313, 305, 305, 305)
Step 4: $ women_weight = c(38.9, 61.2, 73.3, 21.8, 63.4, 63.6, 48.4, 48.8, 48.5)
$ men_weight =
Practical No – 09
Aim: Implementation of Analysis of Variance
Introduction:
ANOVA, also known as Analysis of Variance, is used to investigate relations between
categorical variables and a continuous variable in R programming. It is a type of hypothesis
testing for population variance.
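As a minimal sketch, the snippet below runs a one-way ANOVA on the built-in mtcars data used later in this practical, relating the continuous disp to the categorical gear:

# Sketch: one-way ANOVA of a continuous variable across a factor.
model <- aov(disp ~ factor(gear), data = mtcars)
summary(model)   # F statistic and p-value for the gear effect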
R – ANOVA Test
Step 2: $ head(mtcars)
Step 7: $ head(cropdata)
Step 8: $ summary(cropdata)
Step 11:
$ mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear) + factor(mtcars$am))
$ summary(mtcars_aov)
Practical No – 10
Aim : Implementation of Decision Tree
Introduction:
Decision trees are generally used for regression problems where the relationship between the
dependent (response) variable and the independent (explanatory/predictor) variables is
nonlinear in nature.
A decision tree is a graph that represents choices and their results in the form of a tree. The
nodes of the graph represent an event or choice, and the edges represent the decision rules
or conditions.
The following packages are used to build and display a decision tree in R, hence we need to install and
load these libraries:
• library(rpart)
• library(rpart.plot)
• library(caret)
• library(tree)
In order to grow our decision tree, we first have to load the rpart package. Then we can use the
rpart() function, specifying the model formula, data, and method parameters.
rpart has some default parameters that prevented our tree from growing, namely minsplit and
minbucket. minsplit is “the minimum number of observations that must exist in a node in
order for a split to be attempted”; minbucket is “the minimum number of observations in any
terminal node”.
rpart.plot is used to display the decision tree on screen.
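A minimal sketch combining these pieces on the built-in iris data (illustrative; the practical itself uses a weather dataset): rpart() grows the tree, rpart.control() sets minsplit and minbucket, and rpart.plot() draws it.

# Sketch: growing and plotting a classification tree with rpart.
library(rpart)
library(rpart.plot)

# minsplit/minbucket control when a node may split and how small a leaf may be
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(minsplit = 10, minbucket = 5))

rpart.plot(fit)                             # display the tree on screen
predict(fit, iris[1:3, ], type = "class")   # predicted classes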
Step 1 : Install packages
$ install.packages("rattle")
$ install.packages("rpart")
$ install.packages("tree")
$ weather_testing
Step 12 : $ print(paste("Accuracy for Testing data set: ", ac_test*100, "%"))
Output -> "Accuracy for Testing data set: 100 %"