
RAMNARAIN RUIA AUTONOMOUS COLLEGE 

DEPARTMENT OF COMPUTER SCIENCE & IT (2022-23)

SEAT NO - 4837 STUDENT NAME – VIPUL GUPTA


CLASS: T.Y.B.Sc. SEMESTER: VI COURSE CODE: RUSCSP604
COURSE NAME - Practical of Data Science
FACULTY IN-CHARGE - Ms. Megha Sawant

INDEX

SR. NO.   DATE       PARTICULAR                         SIGNATURE
1         11/11/22   Data Preparation
2         17/11/22   Principal Component Analysis
3         18/11/22   Exploratory Data Analysis
4         25/11/22   Classification - Decision Tree
5         2/12/22    Clustering
6         9/12/22    Association
7         15/12/22   Prediction: Time Series
8         13/1/23    NoSQL Database - MongoDB
9         13/1/23    Topic Modelling
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 1: Data Preprocessing - Chi-Square & Correlation
Data Preparation - Correlation Coefficient

Description

A correlation test is used to evaluate the association between two or more variables.

The variables may be two columns of a given data set of observations, often called a sample, or two components of a multivariate random variable with a known distribution.

The Pearson product-moment correlation coefficient is a measure of the strength and direction of the linear relationship between two variables. It is defined as the covariance of the variables divided by the product of their standard deviations, and is the best-known and most commonly used type of correlation coefficient.

Strength: the greater the absolute value of the correlation coefficient, the stronger the relationship.

Direction: the sign of the correlation coefficient represents the direction of the relationship.

The values range between -1.0 and 1.0. A calculated value greater than 1.0 or less than -1.0 indicates an error in the correlation measurement. A correlation of -1.0 shows a perfect negative correlation, a correlation of 1.0 shows a perfect positive correlation, and a correlation of 0.0 shows no linear relationship between the movements of the two variables.
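As a quick check of the definition above, Pearson's r can be reproduced from the covariance and the standard deviations. A minimal sketch on made-up vectors (not the practical's dataset):

# Toy illustration: Pearson's r = cov(x, y) / (sd(x) * sd(y))
x <- c(21, 25, 32, 40, 47, 55)     # made-up ages
y <- c(85, 90, 99, 110, 118, 130)  # made-up glucose values
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
r_manual
r_builtin   # both print the same value, close to +1 for these vectors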

Note that cor() and cor.test() are available in base R (the stats package); no additional package needs to be installed.

Dataset –

Create a csv file


Code:
my_data <- read.csv("C:/Users/student/Desktop/JanhaviTYSem6/corelation.csv")
print(my_data)
res <- cor(my_data$Age , my_data$Glucose)
res
res1 <- cor.test(my_data$Age, my_data$Glucose, 
                method = "pearson")
res1
Output:
The test statistic is t = 1.2494 and the sample correlation coefficient is positive: a negative coefficient indicates a negative (inverse) association, while a positive coefficient indicates a positive association.
Null hypothesis H0: Age and Glucose are not correlated (the true correlation is zero).
p-value = 0.2796
Significance level = 0.05
Since the p-value is greater than 0.05, the null hypothesis is not rejected.
Conclusion:
The sample correlation between Age and Glucose is positive (glucose tends to rise with age in this sample), but the association is not statistically significant at the 5% level.
Chi square
Description

The chi-squared test is a statistical method, used in machine learning, to check the association between two categorical variables.

Chi-Square distribution

A random variable follows the chi-square distribution if it can be written as a sum of squared standard normal variables.

Degrees of freedom

Degrees of freedom refers to the maximum number of logically independent values, which have the freedom to vary.

A chi-square test is used in statistics to test the independence of two events. Given the data of two variables, we can obtain the observed count O and the expected count E. The chi-square statistic measures how much the observed count O deviates from the expected count E.
Steps to perform the Chi-Square Test
● Define Hypothesis.
● Build a Contingency table.
● Find the expected values.
● Calculate the Chi-Square statistic.
● Accept or Reject the Null Hypothesis.
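A minimal sketch of these steps in R, using a made-up 2x2 contingency table rather than the practical's treatment data:

# Made-up counts: rows = treatment, columns = improvement
observed <- matrix(c(30, 10,
                     20, 25),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(treatment = c("treated", "untreated"),
                                   improvement = c("improved", "not improved")))
test <- chisq.test(observed)
test$expected    # expected counts E under independence
test$statistic   # chi-square statistic
test$p.value     # reject the null hypothesis of independence if below 0.05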

Role/Importance
The chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also called a "goodness of fit" statistic, because it measures how well the observed distribution of the data fits the distribution that is expected if the variables are independent.

A chi-square test is designed to analyze categorical data, that is, data that has been counted and divided into categories.

The 5% level of significance is the probability of rejecting the null hypothesis when it is true.
Note that chisq.test() is available in base R (the stats package); no additional package needs to be installed.

Dataset 

Code:
data_frame <- read.csv("C:/Users/student/Desktop/JanhaviTYSem6/treatment.csv") # Reading the CSV file
table(data_frame$treatment, data_frame$improvement)
chisq.test(data_frame$treatment, data_frame$improvement)
Output:

chisq.test() performs chi-squared contingency table tests and goodness-of-fit tests.
Interpretation:
As the chi-square statistic increases, the p-value decreases.
p-value = 0.03083
Since the p-value is less than the 0.05 significance level, the null hypothesis of independence is rejected.
Conclusion: treatment and improvement are associated (not independent).
Dataset –
Program –

Output -

Interpretation
Null hypothesis H0: Service and Salary are independent.
Alternative hypothesis H1: Service and Salary are dependent.
p-value = 0.2796
Since the p-value is greater than 0.05, the null hypothesis is not rejected.

Conclusion
There is no significant relationship between the service provided and the salary.
Vipul Gupta (4837)
DATA SCIENCE 
PRACTICAL 2: PCA

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

The basics of PCA are as follows: you take a dataset with many variables, and you simplify that dataset by turning your original variables into a smaller number of "principal components".

Principal components are the underlying structure in the data. They are the directions of greatest variance, the directions in which the data is most spread out. This means we try to find the straight line that best spreads the data out when it is projected onto it. This is the first principal component: the straight line that captures the most substantial variance in the data.

PCA is a linear transformation of a given data set that has values for a certain number of variables (coordinates) for a certain number of observations. This linear transformation fits the dataset to a new coordinate system in such a way that the most significant variance is found on the first coordinate, and each subsequent coordinate is orthogonal to the previous one and has a smaller variance.

Where many variables correlate with one another, they will all contribute strongly to the same principal component. Each principal component sums up a certain percentage of the total variation in the dataset.
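To make the idea of this linear transformation concrete, for standardized data the principal components returned by prcomp() are the eigenvectors of the correlation matrix. A small sketch on a few mtcars columns (chosen only for illustration, not part of the practical):

# PCA of scaled data corresponds to an eigen-decomposition of the correlation matrix
X <- mtcars[, c("mpg", "disp", "hp", "wt")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- eigen(cor(X))
pca$sdev^2      # variances of the principal components ...
eig$values      # ... equal the eigenvalues of the correlation matrix
pca$rotation    # loadings match the eigenvectors (up to a sign flip)
eig$vectors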
Code:

mtcars.pca <- prcomp(mtcars[,c(1:7,10,11)], center = TRUE, scale. = TRUE)
mtcars.pca
summary(mtcars.pca)
head(mtcars)
str(mtcars.pca)

install.packages("devtools", type = "win.binary")

#Plot PCA
library(devtools)
install_github("vqv/ggbiplot")

library(ggbiplot)
ggbiplot(mtcars.pca)

ggbiplot(mtcars.pca, labels=rownames(mtcars))

#Interpreting the results (grouping by country of origin)
mtcars.country <- c(rep("Japan", 3), rep("US",4), rep("Europe", 7), rep("US",3), "Europe", rep("Japan", 3), rep("US",4), rep("Europe", 3), "US", rep("Europe", 3))
ggbiplot(mtcars.pca,ellipse=TRUE,  labels=rownames(mtcars),
groups=mtcars.country)

ggbiplot(mtcars.pca,ellipse=TRUE,choices=c(3,4),  
labels=rownames(mtcars), groups=mtcars.country)

#Graphical parameters with ggbiplot


ggbiplot(mtcars.pca,ellipse=TRUE,circle=TRUE,
labels=rownames(mtcars), groups=mtcars.country)
ggbiplot(mtcars.pca,ellipse=TRUE,obs.scale = 1, var.scale = 1, 
labels=rownames(mtcars), groups=mtcars.country)
ggbiplot(mtcars.pca,ellipse=TRUE,obs.scale = 1, var.scale =
1,var.axes=FALSE,   labels=rownames(mtcars), groups=mtcars.country)

Output:
Conclusion:
Cars of European and US origin show higher variance than cars of Japanese origin.
There is separation between American and Japanese cars along a principal component that is closely correlated with cyl, disp, wt and mpg; these variables can be considered the main contributors to that separation.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 3: EXPLORATORY DATA ANALYSIS
Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.
● At a high level, EDA is the practice of describing the data by means of statistical and visualization techniques to bring important aspects of that data into focus for further analysis.
● This involves looking at your data set from many angles, describing it, and summarizing it without making any assumptions about its contents.
● This is a significant step to take before diving into machine learning or statistical modeling, to make sure the data are really what they are claimed to be and that there are no obvious problems.
Diabetes Dataset

Data set
The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Data Description
● Pregnancies: Number of times pregnant
● Glucose: Plasma glucose concentration 2 hours into an oral glucose tolerance test
● BloodPressure: Diastolic blood pressure (mm Hg)
● SkinThickness: Triceps skin fold thickness (mm)
● Insulin: 2-hour serum insulin (mu U/ml)
● BMI: Body mass index (weight in kg / (height in m)^2)
● DiabetesPedigreeFunction: Diabetes pedigree function
● Age: Age (years)
● Outcome: Class variable (0 or 1); 1 indicates diabetes is present

Program 
diabet <- read.csv('C:/Users/student/Downloads/diabetes.csv')
head(diabet)

str(diabet)

summary(diabet)
# Display only 10 values from whole data
diabet[1:10,]

#Display only first 2 columns of the dataset


diabet[,1:2]
diabet[1:10,1:2]
#Consider all the data whose Outcome is 1
newdata1<-subset(diabet,diabet$Outcome=="1")
newdata1
# Display data for patients with exactly one pregnancy and Outcome equal to 0
newdata2<-subset(diabet,diabet$Pregnancies=="1"
&diabet$Outcome=="0")
newdata2

#Display columns 1 and 2 where Pregnancies is 1 or Outcome is 0
newdata3<-subset(diabet,diabet$Pregnancies=="1" |
diabet$Outcome=="0",select=c(1,2))
newdata3

# Display data in ascending order based on the BMI


newdata4<-diabet[order(diabet$BMI), ]
newdata4

#Display data in descending order based on the BMI


newdata5<-diabet[order(-diabet$BMI),]
newdata5
#Find the mean of BMI for each Outcome group
newdata6<-aggregate(BMI~Outcome,data=diabet,FUN=mean)
newdata6

#Find the columns of the dataset and to check missing values in dataset
names(diabet)
colSums(is.na(diabet))
#Draw histogram
hist(diabet$BMI,col='RED')

#Do plotting of data


plot(diabet$BMI)

#Do boxplot
boxplot(diabet$BMI)
#Find mean, median , max and min
mean(diabet$BMI)
median(diabet$BMI)
max(diabet$BMI)
min(diabet$BMI)

#Do boxplot of data


boxplot(newdata2$SkinThickness)
data<-read.csv('C:/Users/student/Downloads/diabetes.csv')
attach(data)
data
class(BMI)

Comment: attach() makes the variables in the data frame accessible directly, without prefixing them with the data frame name.
count<-table(Outcome)

#Do bar plot of data


barplot(count,col=2)

#Do pie chart of data


pie(count)
table(Pregnancies)
count<-table(Pregnancies)

barplot(count)
pie(count)

barplot(data$BMI, data$Glucose, beside = TRUE, col='YELLOW')

newdata7 <- subset(diabet, diabet$Pregnancies <= 7 & diabet$Outcome == "1")
newdata7

#Draw a bar chart and boxplot of BMI and Glucose for the above subset
barplot(newdata7$BMI, newdata7$Glucose, beside=TRUE,
col='YELLOW')
boxplot(newdata7$BMI, newdata7$Glucose, beside=TRUE,
col='YELLOW')
newdata8 <- subset(diabet, diabet$Pregnancies >= 8 | diabet$Outcome == "1")
newdata8

#Draw a bar chart and boxplot of BMI and Glucose for the above subset
barplot(newdata8$BMI, newdata8$Glucose, beside=TRUE,
col='YELLOW')
boxplot(newdata8$BMI, newdata8$Glucose, beside=TRUE,
col='YELLOW')
Conclusion –
1. BMI does not appear to have a strong effect on the diabetes outcome.
2. Blood pressure level does not appear to affect the diabetes outcome.
3. The number of pregnancies does appear to affect the diabetes outcome.

Valorant Dataset
Data set
This dataset contains various stats about the game's weapons like
damage, price, fire rate, etc.
Program
valorant <-read.csv('C:/Users/student/Downloads/valorant-stats.csv')
head(valorant)

str(valorant)
summary(valorant)

valorant[1:10, ]
valorant[, 1:2]
valorant[1:10, 1:2]
newdata1 <- subset(valorant, valorant$Magazine.Capacity >= "12")
newdata1
newdata2 <- subset(valorant, valorant$Weapon.Type == "Rifle" & valorant$Wall.Penetration == "Medium")
newdata2
newdata3 <- subset(valorant, valorant$Fire.Rate >= "5" | valorant$Price >= "500", select=c(1, 2))
newdata3

newdata4 <- valorant[order(valorant$Magazine.Capacity), ]
newdata4

newdata5 <- valorant[order(-valorant$Price), ]
newdata5

names(valorant)
colSums(is.na(valorant))

hist(valorant$BDMG_1 ,col='RED')
plot(valorant$BDMG_1)

boxplot(valorant$BDMG_1)
mean(valorant$BDMG_1)
median(valorant$BDMG_1)
max(valorant$BDMG_1)
min(valorant$BDMG_1)

valorant <- read.csv('C:/Users/student/Downloads/valorant-stats.csv')
data <- read.csv('C:/Users/student/Downloads/valorant-stats.csv')
attach(data)
data

class(Magazine.Capacity)
table(Wall.Penetration)
count <- table(Wall.Penetration)

barplot(count,col=2)

pie(count)
table(Magazine.Capacity)
count <- table(Magazine.Capacity)

barplot(count)
pie(count)

Conclusion –
1. There are more guns with medium wall penetration than with any other penetration level.
2. The most common magazine capacities are 12 and 30.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 4 : Decision Tree

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
A decision tree is a flowchart-like structure in which each internal node
represents a “test” on an attribute (e.g. whether a coin flip comes up
heads or tails), each branch represents the outcome of the test, and
each leaf node represents a class label (decision taken after computing
all attributes). The paths from root to leaf represent classification rules.
Tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods. Tree-based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well. They can be adapted to solve either kind of problem at hand (classification or regression). Decision tree algorithms are referred to as CART (Classification and Regression Trees).

Common terms used with decision trees:
● Root Node: represents the entire population or sample; it gets divided into two or more homogeneous sets.
● Splitting: the process of dividing a node into two or more sub-nodes.
● Decision Node: a sub-node that splits into further sub-nodes.
● Leaf / Terminal Node: a node that does not split any further.
● Pruning: removing sub-nodes of a decision node; it is the opposite of splitting.
● Branch / Sub-Tree: a subsection of the entire tree.
● Parent and Child Node: a node that is divided into sub-nodes is the parent node of those sub-nodes, and the sub-nodes are its children.

Diabetes dataset
Code:
install.packages("partykit")
install.packages("caret",type="win.binary") 
install.packages("pROC",type="win.binary")
install.packages('rattle',type="win.binary") 
install.packages('rpart.plot',type="win.binary") 
install.packages('data.table',type="win.binary")
titanic<-read.csv(file.choose(),header = T,sep=",") # load the diabetes CSV (the object is kept under the name 'titanic' throughout this script)
summary(titanic) 
names(titanic)
library(partykit) 
titanic$Outcome<-as.factor(titanic$Outcome)#convert to categorical 
summary(titanic$Outcome) 
names(titanic) 
set.seed(1234) 
pd<-sample(2,nrow(titanic),replace = TRUE, prob=c(0.8,0.2))#two
samples with distribution 0.8 and 0.2 
trainingset<-titanic[pd==1,]#first partition 
validationset<-titanic[pd==2,]#second partition 
tree<-ctree(formula = Outcome ~ Pregnancies  + Glucose +
BloodPressure + SkinThickness + Insulin + BMI +
DiabetesPedigreeFunction + Age ,data=trainingset) 
class(titanic$Outcome) 
plot(tree) 
#Prunning 
tree<-ctree(formula = Outcome ~ Pregnancies  + Glucose +
BloodPressure + SkinThickness + Insulin + BMI +
DiabetesPedigreeFunction +
Age ,data=trainingset,control=ctree_control(mincriterion =
0.99,minsplit = 500))
plot(tree) 
pred<-predict(tree,validationset,type="prob") 
pred 
pred<-predict(tree,validationset) 
pred 
library(caret) 
confusionMatrix(pred,validationset$Outcome) 
pred<-predict(tree,validationset,type="prob") 
pred
library(pROC) 
plot(roc(validationset$Outcome,pred[ ,2])) 
library(rpart) 
fit <- rpart(Outcome ~ Pregnancies  + Glucose + BloodPressure +
SkinThickness + Insulin + BMI + DiabetesPedigreeFunction +
Age ,data=titanic,method="class") 
plot(fit) 
text(fit) 
library(rattle) 
library(rpart.plot) 
library(RColorBrewer) 
fancyRpartPlot(fit) 
Prediction <- predict(fit, titanic, type = "class") 
Prediction
Output:
Conclusion:
● Accuracy: 69.75%
● The two colours in the plotted tree correspond to the two Outcome classes, and darker shades indicate purer nodes.
● The ROC curve shows that the model is not very accurate, as sensitivity and specificity are almost the same.
● The true-positive rate is not high enough, so the accuracy is only moderate.
● Specificity and sensitivity should be greater than 80% for good accuracy.
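For reference, the accuracy, sensitivity and specificity quoted above come directly from the confusion matrix. A minimal sketch with made-up counts (not the actual confusionMatrix() output):

# Toy 2x2 confusion matrix
TP <- 50; FN <- 10   # actual positives: correctly and incorrectly predicted
FP <- 15; TN <- 25   # actual negatives: incorrectly and correctly predicted
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)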
partykit: A Toolkit for Recursive Partitioning
A toolkit with infrastructure for representing, summarizing, and
visualizing tree-structured regression and classification models. This
unified infrastructure can be used for reading/coercing tree models
from different sources ('rpart', 'RWeka', 'PMML'), yielding objects that share functionality for print()/plot()/predict() methods.
Caret:
The caret package (short for Classification and Regression Training)
contains functions to streamline the model training process for complex
regression and classification problems. 
pROC
pROC is a set of tools to visualize, smooth and compare receiver operating characteristic (ROC) curves. (Partial) area under the curve
(AUC) can be compared with statistical tests based on U-statistics or
bootstrap. Confidence intervals can be computed for (p)AUC or ROC
curves.
Rattle
A package written in R providing a graphical user interface to very many
other R packages that provide functionality for data mining.
data.table
data.table is an extension of the data.frame package in R. It is widely used for fast aggregation of large datasets, low-latency add/update/remove of columns, quicker ordered joins, and a fast file reader.
rpart.plot
This function combines and extends plot.rpart and text.rpart in the rpart package. It automatically scales and adjusts the displayed tree for best fit.
Vipul Gupta (4837) 
DATA SCIENCE
PRACTICAL 5:CLUSTERING

1. Import dataset
data<-read.csv("C:/Users/student/Desktop/4823_JanhaviTYSem6/
ds/coordinate.csv")
data<-data[1:150,]
names (data)

2. Making subset containing x feature

3. Checking outliers using boxplot


For X

For Y
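A plausible sketch of steps 2 and 3, assuming the coordinate CSV contains the two numeric columns X and y that are used later in this practical (the column names are an assumption):

# 2. Subset containing the clustering features (assumed columns X and y)
new_data <- data[, c("X", "y")]
# 3. Check for outliers with boxplots
boxplot(new_data$X, main = "X")
boxplot(new_data$y, main = "y")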

4. Calculating K-means to make 2 clusters


cl<-kmeans(new_data, 2)
cl 

5. Calculate WSS
data<-new_data
wss<-sapply(1:15, function(k){ kmeans(data, k)$tot.withinss })
wss

6. Plot elbow graph
plot(1:15, wss, type="b", pch = 19, frame = FALSE, xlab="Number of clusters", ylab="Total within-clusters sum of squares")
7. Silhouette graph
library(factoextra)
library(cluster)   # pam() used below comes from the cluster package
fviz_nbclust(data, pam, method = "silhouette")
8. Plot clusters
library(cluster)      
clusplot(new_data, cl$cluster, color=TRUE, shade=TRUE,
labels=FALSE, lines=0)
9. Classification of points based on cluster
cl$cluster
cl$centers
10. Hierarchical clustering based on y feature
clusters <- hclust(dist(data[, 0:1]), method = 'average')
clustercut1 <- cutree(clusters, 2)
table(clustercut1, data$y)
plot(clusters)

library(ggplot2)
ggplot(data, aes(X, y)) +
  geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clustercut1) +
  scale_color_manual(values = c("red", "green", "black", "blue"))
11. Hierarchical clustering based on x feature
clusters <- hclust(dist(data[, 0:1]))
plot(clusters)
clustercut <- cutree(clusters, 2)
table(clustercut, data$X)

library(ggplot2)
ggplot(data, aes(X,y)) +
  geom_point(alpha = 0.4, size = 3.5) + geom_point(col = clustercut) +
  scale_color_manual (values = c("red", "green","black","blue"))
12. DBSCAN clustering
library(fpc)
data_1<-data[-5]
set.seed(220)
Dbscan_cl<-dbscan(data_1,eps=0.45,MinPts = 5)
Dbscan_cl$cluster
table(Dbscan_cl$cluster , data$X)
plot(Dbscan_cl , data_1 , main="DBScan")
plot(Dbscan_cl , data_1 , main = "X vs Y")
Conclusion:
As the box plots of both the X and Y features show, no outliers are present, so no extra feature treatment is needed.
Both the elbow method and the silhouette method give 2 clusters as the optimal number, so we form two clusters.
We constructed clusters using the k-means, hierarchical and DBSCAN methods; the k-means clusters give a better representation of the data than the other two.
The clusters are formed mainly on the basis of the X feature. As the figure below shows, the second cluster appears as the X value increases; both clusters contain high Y values, but the first cluster has lower X values than the second, so in this case the X feature is the primary aspect driving the clustering.

 
Vipul Gupta (4837)
DATA SCIENCE 
PRACTICAL 6: ASSOCIATION
Association:
Association is a data mining technique that discovers the probability of
the co-occurrence of items in a collection. The relationships between
co-occurring items are expressed as Association Rules. Association rule
mining finds interesting associations and relationships among large sets
of data items. Association rules are "if-then" statements, that help to
show the probability of relationships between data items, within large
data sets in various types of databases. Here the "if" element is called the antecedent, and the "then" element is called the consequent. These types of relationships, where we can find some association or relation between two items, are known as single cardinality. Association rule mining has a number of applications and is widely used to help discover sales correlations in transactional data or in medical data sets.

Apriori:
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. We apply an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space.
Apriori property – every non-empty subset of a frequent itemset must itself be frequent.
Limitations of the Apriori Algorithm
The Apriori algorithm can be slow. Its main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, low minimum support or large itemsets; it is not an efficient approach for very large datasets. It has to check many candidate itemsets and scan the database repeatedly to find them. Apriori becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
Algorithm
● Calculate the support of itemsets (of size k = 1) in the transactional database (note that support is the frequency of occurrence of an itemset). This is called generating the candidate set.
● Prune the candidate set by eliminating items with a support less than the given threshold.
● Join the frequent itemsets to form sets of size k + 1, and repeat the above steps until no more itemsets can be formed. This happens when the set(s) formed have a support less than the given threshold.
OR
1. Set a minimum support and confidence.
2. Take all the subsets present in the transactions which have higher support than the minimum support.
3. Take all the rules of these subsets which have higher confidence than the minimum confidence.
4. Sort the rules by decreasing lift.
Components of Apriori
Support and Confidence:
Support refers to the frequency with which an itemset occurs, e.g. how often items x and y are purchased together; confidence is the conditional probability that item y is purchased given that item x is purchased.
Support(I) = (Number of transactions containing item I) / (Total number of transactions)
Confidence(I1 -> I2) = (Number of transactions containing I1 and I2) / (Number of transactions containing I1)

Lift:
Lift gives the correlation between A and B in the rule A => B. Correlation shows how one itemset A affects the itemset B.
If the rule has a lift of 1, then A and B are independent and no rule can be derived from them.
If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
If the lift is < 1, then the presence of A has a negative effect on B.
Lift(I1 -> I2) = Confidence(I1 -> I2) / Support(I2)
Coverage:
Coverage (also called cover or LHS-support) is the support of the left-hand side of the rule X => Y, i.e., supp(X). It measures how often the rule can be applied, and can be quickly calculated from the rule's quality measures (support and confidence).
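A minimal sketch of these measures on a made-up set of five transactions (the item names are illustrative, not from the supermarket data):

# Toy transactions
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)
n <- length(transactions)
has <- function(items) sapply(transactions, function(t) all(items %in% t))
# Rule: {bread} => {milk}
support_rule <- sum(has(c("bread", "milk"))) / n    # supp(bread and milk)
coverage     <- sum(has("bread")) / n               # supp(bread), LHS support
confidence   <- support_rule / coverage             # P(milk | bread)
lift         <- confidence / (sum(has("milk")) / n) # confidence / supp(milk)
c(support = support_rule, coverage = coverage, confidence = confidence, lift = lift)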
FP tree:
The FP-Growth algorithm, proposed by Han, is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. For this it uses a divide-and-conquer strategy. The core of the method is a special data structure called the frequent-pattern tree (FP-tree), which retains the itemset association information. This tree-like structure is built from the initial itemsets of the database. The purpose of the FP-tree is to mine the most frequent patterns. Each node of the FP-tree represents an item of an itemset.
The root node represents null, while the lower nodes represent the itemsets. The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Algorithm:
● Build the FP-tree.
● Find the patterns having item p from p's conditional pattern base (conditional database).
● Calculate the conditional frequent-pattern tree.

Dataset: supermarket.csv

1. Import libraries
library(arules)
library(arulesViz)
library(RColorBrewer)

2. Import dataset
data<-read.transactions('D:/college/sem_6/data science/code/supermarket.csv', rm.duplicates = TRUE, format="single", sep=",", header = TRUE, cols=c("Branch","Product line"))
#data<-read.csv('C:/kunal_Ganjale_TY_4808/DS/code/Super Store.csv')
#data <- subset(data, select = c(0,1))
3. Display structure of data
 str(data)

4. Items and transaction ids


 inspect(head(data))

5. Labels of items
 data@itemInfo$labels
6. Generating rules
 data_rules <- apriori(data, parameter = list(supp = 0.01, conf = 0.2))
data_rules

7. Inspect rules
 inspect(data_rules[1:20])

8. Inspect top 10 rules


 inspect(head(sort(data_rules, by = "confidence"), 10))
9. Inspect bottom 10 rules
 inspect(tail(sort(data_rules, by = "confidence"), 10))

10. Determine rules whose right-hand side is "Fashion accessories"
fashion_rules <- apriori(data=data, parameter=list(supp=0.001, conf=0.08), appearance = list(rhs="Fashion accessories"))
inspect(head(sort(fashion_rules, by = "confidence"), 10))

11. Determine rules which reach fashion accessories, with increased support
fashion_rules_increased_support <- apriori(data, parameter = list(support = 0.02, confidence = 0.5))
inspect(head(sort(fashion_rules_increased_support, by = "confidence"), 10))

12. Plot absolute item frequency graph
itemFrequencyPlot(data, topN=20, type="absolute", col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")
Vipul Gupta (4837)
DATA SCIENCE 
PRACTICAL 7: PREDICTION – TIME SERIES

CODE:
library(igraph)
data<-read.csv("C:\\Users\\student\\Downloads\\income1.csv")
attach(data)
head(data)
x<-data$Year
y<-data$Value
d.y<-diff(y)
library(ggplot2)
ggplot(data, aes(x,y)) +
  geom_point() +
  theme(axis.text.x = element_text(angle=45, hjust=1, vjust = 1))
#plot(x,y)
acf(y)
pacf(y)
acf(d.y)
arima(y,order = c(1,0,0))
mydata.arima001<-arima(y,order=c(0,0,1))
mydata.pred01<-predict(mydata.arima001,n.ahead = 100)
head(mydata.pred01)
plot(y)
lines(mydata.pred01$pred,col='blue')
attach(mydata.pred01)

tail(mydata.pred01$pred)
head(mydata.pred01$pred)

Output:
1. Import CSV file
data<-read.csv('C:/kunal_Ganjale_TY_4808/DS/code/income1.csv')
attach(data)
head(data)

2. Assign x to the Year column and y to the Value column
x<-data$Year
y<-data$Value

3. Calculate the differences between consecutive elements of the y vector
d.y<-diff(y)

4. Calculate and plot acf


acf(y)
5. Calculate and plot pacf
pacf(y)
6. Calculate and plot acf of d.y
acf(d.y)

7. Fit an ARIMA(1,0,0) model to the y series
arima(y,order = c(1,0,0))

8. Fit an ARIMA(0,0,1) model and store it
mydata.arima001<-arima(y,order=c(0,0,1))

9. Predict 100 values ahead using the fitted model
mydata.pred01<-predict(mydata.arima001,n.ahead = 100)
head(mydata.pred01)

10. Plot the y values and overlay the predicted values
plot(y)
lines(mydata.pred01$pred,col='blue')
11. Display head and tail of predicted values from prediction
table
tail(mydata.pred01$pred)
head(mydata.pred01$pred)
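As a possible cross-check (not part of the original practical), the forecast package can choose the ARIMA order automatically instead of fixing it by hand. A sketch, assuming the forecast package is installed and y is the Value series read above:

library(forecast)
fit <- auto.arima(y)           # automatically selects the (p, d, q) order
fit                            # chosen order and coefficients
fc <- forecast(fit, h = 100)   # forecast 100 steps ahead, as in the practical
plot(fc)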

Conclusion:
The blue line follows the trend of the y series, so we can conclude that the model predicts values in line with the observed trend.
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 8: MONGODB

1. Extract the MongoDB zip file to the C:\ drive
2. Create the folder data in C:\
3. Create the folder db inside C:\data (giving C:\data\db)
4. Go to C:\mongodb\bin, run mongod.exe and keep the server running
Run mongo.exe
Create the database kunal

Create the collection student and insert records into it
> db.student.insert({name:"kunal",age:22,address:[{city:"mumbai"},
{pin:400614}]})
WriteResult({ "nInserted" : 1 })
> db.student.insert({name:"Sajjad",age:22,address:[{city:"Dombivli"},
{pin:401614}]})
WriteResult({ "nInserted" : 1 })
> db.student.insert({name:"Pankaj",age:21,address:[{city:"Pune"},
{pin:406721}]})
WriteResult({ "nInserted" : 1 })
db.student.insert({name:"Akshay",age:24,address:[{city:"Pune"},
{pin:456765}]})
db.student.insert({name:"Yash",age:21,address:[{city:"Satara"},
{pin:345234}]})

Display the inserted records

Create the collection student_mark and insert records into it

> db.student_mark.insert({name:"kunal",marks:[{physics:79},
{chem:89},{bio:87}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Sajjad",marks:[{physics:90},
{chem:79},{bio:84}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Pankaj",marks:[{physics:76},
{chem:89},{bio:67}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Akshay",marks:[{physics:63},
{chem:78},{bio:88}]})
WriteResult({ "nInserted" : 1 })
> db.student_mark.insert({name:"Yash",marks:[{physics:71},{chem:55},
{bio:65}]})
WriteResult({ "nInserted" : 1 })
Display the records in JSON format
db.student.find().forEach(printjson)

Display details of students whose age is greater than 22
db.student.find({age:{$gt:22}}).pretty()
Display details of students whose city is Pune
db.student.find({'address.city':'Pune'}).pretty()
Display students who got more than 84 marks in physics
db.student_mark.find({'marks.physics':{$gt:84}}).pretty()

Display students who got at most 84 marks in biology
db.student_mark.find({'marks.bio':{$lte:84}}).pretty()
 

Display students who live in Pune or Mumbai and whose age is at least 21
db.student.find({'address.city':{$in:["Pune","mumbai"]},age:{$gte:21}}).pretty()
Display students who got at least 70 marks in every subject
db.student_mark.find({'marks.bio':{$gte:70},'marks.physics':{$gte:70},'marks.chem':{$gte:70}}).pretty()

Display the student who got 84 marks in biology
db.student_mark.find({'marks.bio':84}).pretty()

Update the name of the student who got 84 marks in biology to anurag
db.student_mark.update({'marks.bio':84},{$set:{'name':'anurag'}})
Remove the student whose name is anurag
db.student_mark.remove({'name':'anurag'},1)

Drop the collection
db.student_mark.drop()

Drop the database
db.dropDatabase()
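The same collections can also be queried from R; a minimal sketch using the mongolite package (the connection URL and database name are assumptions based on the local setup above):

library(mongolite)
# assumes mongod.exe is still running locally and the database is named kunal
students <- mongo(collection = "student", db = "kunal",
                  url = "mongodb://localhost:27017")
students$find('{"age": {"$gt": 22}}')       # students older than 22
students$find('{"address.city": "Pune"}')   # students living in Pune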
Vipul Gupta (4837)
DATA SCIENCE
PRACTICAL 9: TOPIC MODELLING

Load all the text files from the corpus folder


library(tm)
library(topicmodels)
setwd("C:/british-fiction-corpus")
filenames<-list.files(path="C:/british-fiction-corpus",pattern="*.txt")
filenames

Build a corpus from the files and remove numbers and punctuation
filetext<-lapply(filenames,readLines)#lapply returns a list of the same
length as X, applying FUN to the corresponding element of X.
mycorpus<-Corpus(VectorSource(filetext))# VectorSource interprets
each element of the vector x as a document.
mycorpus<-tm_map(mycorpus,removeNumbers)
mycorpus<-tm_map(mycorpus,removePunctuation)
mycorpus

Provide a list of stopwords to find in each text file and remove them from the corpus
mystopwords=c("of","a","and","the","in","to","for","that","is","on","are","with","as","by","be","an","which","it","from","or","can","have","these","has","such","you")
mycorpus<-tm_map(mycorpus,tolower)
mycorpus<-tm_map(mycorpus,removeWords,mystopwords)
dtm<-DocumentTermMatrix(mycorpus)
k<-3

#lda_output_3<-LDA(dtm,k,method="VEM",control=control_VEM)
# control_VEM

#lda_output_3<-LDA(dtm,k,method="VEM",control=NULL)
lda_output_3<-LDA(dtm,k,method="VEM")

#lda_output_3<-LDA(dtm,k,method="VEM")
#lda_output_3@Dim
#lda_output_3<-LDA(dtm,k,method="VEM")
#show (dtm)

topics(lda_output_3)
terms(lda_output_3,10)
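Beyond the topic assignments above, the per-document topic probabilities of the fitted model can also be inspected; a short sketch using the same lda_output_3 object:

post <- posterior(lda_output_3)
round(post$topics, 3)   # rows = documents, columns = topic probabilities (each row sums to 1)
dim(post$terms)         # topic-by-term probability matrix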

Output:
Conclusion:
The keywords extracted for each topic summarise the main themes of the text files and can be used for further natural language processing (NLP).
