Assignment 1 Data - R Sivani Jayanth

ASSIGNMENT 2
Data Analytics & R

TOPIC: Dataset analysis using RStudio
SUBMITTED TO: SUBMITTED BY:

SUBHANKAR MISHRAH SIVANI JAYANTH
ASST.PROFESSOR ROLL NO. 21
BFT/17/1109
DATASET
Dataset : “Heart Disease Dataset” : Kaggle
HEART DISEASE DATASET
DESCRIPTION
The “Heart Disease” data consist of 300 observations of 14 variables. People from various age
groups were observed in order to confirm the presence of heart disease. Several clinical
measurements were analysed to understand their effect on the presence or absence of heart
disease.
INSIGHTS:-
• There are 300 observations of 14 variables.
• The dependent variable is target, which takes two values, 0 and 1: 0 means heart disease
is absent and 1 means heart disease is present.
• All other variables, such as age, sex and cp, are independent variables.
• The variables are of integer and numeric data types.
FORMAT
The data frame contains the following components:
• Age : Age in years
• Sex : 1 = male; 0 = female
• Cp : Chest pain type
• Trestbps : Resting blood pressure (in mm Hg on admission to the hospital)
• Chol : Serum cholesterol in mg/dl
• Fbs : Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
• Restecg : Resting electrocardiographic results
• Thalach : Maximum heart rate achieved
• Exang : Exercise induced angina (1 = yes; 0 = no)
• Oldpeak : ST depression induced by exercise relative to rest
• Slope : The slope of the peak exercise ST segment
• Ca : Number of major vessels (0-3) coloured by fluoroscopy
• Thal : 1 = normal; 2 = fixed defect; 3 = reversible defect
• Target : 1 or 0
SOURCE :
https://www.kaggle.com/johnsmith88/heart-disease-dataset
IMPORTING THE DATASET TO R

library(readxl)
heart_disease <- read_excel("C:/Users/ACER/Desktop/heart disease.xlsx")
REMOVING THE OUTLIERS
Removing outliers from “chol”:
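chol <- heart_disease$chol
boxplot(chol)
# values beyond 1.5 * IQR from the hinges are flagged as outliers
outliers <- boxplot(chol, plot = FALSE)$out
outliers
# logical indexing keeps all rows when no outliers are found
# (-which() would drop every row in that case)
heart_disease <- heart_disease[!(chol %in% outliers), ]
chol <- heart_disease$chol
max(chol)
min(chol)
boxplot(chol)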
Removing outliers from “trestbps”:
trestbps <- heart_disease$trestbps
outliers <- boxplot(trestbps, plot = FALSE)$out
outliers
heart_disease <- heart_disease[!(trestbps %in% outliers), ]
# refresh the working vectors after the rows were removed
age <- heart_disease$age
trestbps <- heart_disease$trestbps
chol <- heart_disease$chol
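The boxplot rule used above can be tried on a small made-up vector. This is only an illustrative sketch: the values in x are invented and are not taken from the heart disease data. It also shows why logical indexing is a safer way to drop rows than -which(), which returns an empty result when no outliers are found.

```r
# Made-up cholesterol-like values (not from the dataset); 564 is an obvious outlier
x <- c(210, 225, 240, 198, 260, 233, 247, 564)

# boxplot.stats() is the function boxplot() uses internally to find $out
out <- boxplot.stats(x)$out

# Keep everything that is not flagged as an outlier
x_clean <- x[!(x %in% out)]

# Pitfall: once no outliers remain, -which() selects nothing at all
no_out <- boxplot.stats(x_clean)$out
wrong  <- x_clean[-which(x_clean %in% no_out)]   # empty vector
right  <- x_clean[!(x_clean %in% no_out)]        # all values kept
```

The same pattern applies to data frame rows: heart_disease[!(chol %in% outliers), ] keeps every row when outliers is empty.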
STATISTICAL INDICATORS
 Mean
mean(chol)
 Variance
var(chol)
 Standard Deviation
sd(chol)
 Standard Error
sd(chol)/sqrt(length(chol))
 Median absolute deviation
mad(chol)
 Minimum
min(chol)
 Median
median(chol)
 Maximum
max(chol)
 Range
max(chol)-min(chol)
• Quantile
quantile(chol)
quantile(trestbps,1)
quantile(trestbps,0.1)
• Interquartile Range
IQR(chol)
summary(chol)
BASIC GRAPHS IN R
1) BARPLOT:
sex = heart_disease$sex
target = heart_disease$target
b=table(sex,target)
barplot(b,
col=c("red","blue"),
legend=rownames(b),
beside=TRUE,
xlab="Target",
ylab="Count",
main="Side by Side Barplot")
2) BOXPLOT:
boxplot(chol, col = "red") # a named base colour; redblue() would need the gplots package
3) HISTOGRAM:
hist(trestbps,col=rainbow(7),
main="Histogram for RBP",
xlab="Rest Blood Pressure Class",
ylab="Frequency",
labels=TRUE)
4) PLOT :
plot(chol,col = factor(sex))
5) DENSITY :
oldpeak = heart_disease$oldpeak
plot(density(oldpeak),
main="Density plot of Oldpeak",
xlab="Oldpeak",
ylab="Density")
polygon(density(oldpeak),col="orange",border="green")
6) PIE CHART :
a=table(heart_disease$sex)
pct=round(a/sum(a)*100)
# sex is coded 0 = female, 1 = male, so the table is in that order
lbs=paste0(c("Female","Male")," ",pct,"%")
pie(a,labels=lbs,main="Percentage of Male and Female")
PLOTTING GRAPHS USING DIFFERENT GEOMS

1) AGE Vs. GENDER
 Geom used: geom_histogram
geom_histogram :
Visualise the distribution of a single continuous variable by dividing the x axis into bins and
counting the number of observations in each bin. Histograms ( geom_histogram() ) display
the counts with bars.
 Command:
library(ggplot2) # provides ggplot() and the geoms used below
ggplot(heart_disease,aes(x=age,fill = factor(sex)))+
geom_histogram()+
xlab("Age") +
ylab("Number")+ guides(fill = guide_legend(title = "Gender"))
 Graph : Distribution of Male and Female population across Age parameter
 Inference :
This graph shows the distribution of males and females in each age category. The x axis
represents age and the y axis represents the frequency of each age group. Gender is
shown across the age parameter using colours: “1” represents male and is shown in blue,
whereas “0” represents female and is shown in red. From this graph we can infer that more
males than females were observed in most age groups, and that the most observed age
group is between 50 and 60 years.
2) REPRESENTATION OF CHOLESTEROL LEVEL

 Geom used : geom_point
The geom point is used to create scatterplots. The scatterplot is most useful for displaying
the relationship between two continuous variables.
 Command :
ggplot(heart_disease, aes(x=age,y=chol,color=as.factor(sex), size=chol))+
geom_point(alpha=0.7)+xlab("Age") +
ylab("Cholesterol")
 Graph : Cholesterol among different age groups with their gender
 Inference :
In this graph the x axis shows age, the y axis shows the cholesterol level, and gender is
represented by colour. The size of each point also encodes the cholesterol level, so larger
points mean higher cholesterol. Here “1” represents male and “0” represents female. From
this graph we can infer that more males than females were observed, and that most people
with high cholesterol are between 50 and 60 years old and most of them are men.
3) COMPARISON OF BLOOD PRESSURE ACROSS PAIN TYPE
• Geom used : geom_boxplot & facet_grid
geom_boxplot draws a box-and-whisker plot. A data frame, or other object, passed to the
geom overrides the plot data; all objects are fortified to produce a data frame. The lower
and upper "hinges" of the box correspond to the first and third quartiles (the 25th and 75th
percentiles).
facet_grid forms a matrix of panels defined by row and column faceting variables. It is
most useful when you have two discrete variables, and all combinations of the variables
exist in the data.
 Command
ggplot(heart_disease,aes(x=factor(sex),y=trestbps))+
geom_boxplot(fill="darkorange")+
xlab("Sex")+
ylab("BP")+
facet_grid(~cp)
 Graph : comparison of blood pressure across pain type
 Inference:
In this graph the x axis represents sex and the y axis represents blood pressure. The graph
is divided into panels according to chest pain type (cp), and within each panel the BP values
of each gender are shown separately. From this graph we can infer that most people under
observation have pain type 2 and that their BP is between 120 and 140. The graph also
shows clearly the chest pain type reported by males and females and their BP levels.
4) OLDPEAK ANALYSIS
 Geom used : geom_point , geom_smooth
geom_smooth is designed to estimate f(x) when the shape is unknown, but assumed to be
smooth.
• Command :
ggplot(heart_disease,aes(x=oldpeak,y=target))+geom_point()+geom_smooth(color="red")+
xlab("OldPeak")+ylab("Prob. of Heart Attack") +
ggtitle("Relation between oldpeak and heart attack")
• Graph : Relation between oldpeak and heart attack
• Inference: In this graph the x axis represents the ST depression induced by exercise
relative to rest (oldpeak), and the y axis the probability of heart attack. The smooth curve
shows a decreasing trend: in general, as oldpeak increases, the probability of heart attack
decreases.
5) SLOPE ANALYSIS
• Geom used : geom_smooth & geom_point
 Command
ggplot(heart_disease,aes(x=slope,y=target))+geom_point()+geom_smooth(color="cyan")+
scale_x_continuous(name="Slope")+ scale_y_continuous(name="Target")+
ggtitle("Relationship between Slope and Target")
 Graph : Slope analysis with probability of heart disease
 Inference:
The graph above shows how the probability of heart disease changes as slope increases.
From the smooth curve between slope and target we can see that beyond a slope of 1, the
probability of heart attack increases with increasing slope.
FORMULATION OF HYPOTHESIS
To test whether there is a significant relationship between age and heart rate, the
following hypothesis was formulated for this study:
“Heart rate decreases with increasing age”
• Ho: Heart rate decreases with increasing age.
• H1: Heart rate increases with increasing age.
TRANSFORMATION
Here I want to find the relation between heart rate and age and show that when age
increases heart rate decreases. For this I transform the dataset using:
arrange() : arranged the people by age from lowest to highest
filter() : kept ages above 25 and below 80
select() : selected the columns “age” & “thalach”
CODE :
library(dplyr) # provides %>%, arrange(), filter() and select()
Heartrate <- heart_disease %>% arrange(age, thalach) %>%
filter(age > 25 , age< 80 ) %>%
dplyr::select(age, thalach)
age = Heartrate$age
thalach = Heartrate$thalach
summary(Heartrate)
MODELING
DATASET : HEARTRATE
View(Heartrate)
SCATTER PLOT: VISUALIZE THE LINEAR RELATIONSHIP BETWEEN THE PREDICTOR AND RESPONSE
PLOT: HEART RATE Vs AGE ( Using ggplot )
CODE: ggplot(Heartrate) + geom_point (mapping = aes(age,thalach))
Graph:
INFERENCE: In this graph heart rate (thalach) is plotted on the y axis and age on the x axis.
From this we can infer that there is a linear relation between thalach and age. The graph
helps us to visualise the linear relationship between age and heart rate.
CORRELATION
Correlation is a statistical measure of the level of linear dependence between two
variables that occur in pairs. Here we have age and heart rate. Correlation takes values
between -1 and +1.
Code : cor(age, thalach)
The value of our correlation is : [1] -0.4178001
From this we can infer that our correlation is negative: on average, as age increases, heart
rate decreases. A value of about -0.42 indicates a moderate negative correlation, since it
lies roughly midway between 0 and -1.
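To make the definition concrete, the sketch below computes Pearson's correlation by hand and compares it with cor(). The two vectors are made-up illustrative values, not rows from the heart disease data, and the names age_demo and thalach_demo are hypothetical.

```r
# Made-up ages and maximum heart rates (illustrative values only)
age_demo     <- c(35, 42, 50, 58, 63, 70)
thalach_demo <- c(182, 171, 160, 158, 140, 133)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual <- cov(age_demo, thalach_demo) / (sd(age_demo) * sd(thalach_demo))

# cor() computes the same quantity
r_builtin <- cor(age_demo, thalach_demo)

# cor.test() additionally reports a p-value and a confidence interval
ct <- cor.test(age_demo, thalach_demo)
print(ct)
```

Running cor.test(age, thalach) on the Heartrate data would show whether the observed negative correlation is statistically distinguishable from zero.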
BUILD LINEAR MODEL
Now that we have seen the linear relationship pictorially in the scatter plot and by computing
the correlation, let's see the syntax for building the linear model. The function used for
building linear models is lm(). It takes two main arguments: 1. formula 2. data. The data is
typically a data.frame and the formula is an object of class formula, but the most common
convention is to write out the formula directly in place of the argument, as written below.
Code : linearMod <- lm(data = Heartrate, thalach ~ age)
print(linearMod)
Output :
Call:
lm(formula = thalach ~ age, data = Heartrate)
Coefficients:
(Intercept)          age
    206.759       -1.059
Now that we have built the linear model, we have also established the relationship between
the predictor and the response in the form of a mathematical formula for heart rate as a
function of age.
In the above output, the ‘Coefficients’ part has two components: Intercept: 206.759 and
age: -1.059. These are also called the beta coefficients. In other words,
thalach = Intercept + (β ∗ age)
=> Heart Rate = 206.759 - 1.059 ∗ age
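As a worked example, the fitted formula can be evaluated directly. The coefficients below are the ones printed in this report; the helper name predict_thalach is hypothetical and only used for illustration.

```r
# Coefficients as printed by print(linearMod) in this report
intercept <- 206.759
slope     <- -1.059

# Hypothetical helper: predicted maximum heart rate for a given age
predict_thalach <- function(age) intercept + slope * age

predict_thalach(50)          # 206.759 - 1.059 * 50 = 153.809
predict_thalach(c(40, 60))   # works on a vector of ages as well
```

With the fitted model object available, predict(linearMod, data.frame(age = 50)) gives the same kind of prediction using the unrounded coefficients.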
ASSESSING OUR MODEL VISUALLY
We use geom_abline and geom_point to assess our model visually. geom_abline lets us draw
a line from a slope and an intercept, which helps us to see the prediction error as the vertical
distance between each point and the line.
Code : ggplot(Heartrate, mapping = aes(age,thalach)) + geom_point() +
geom_abline(slope = -1.059 , intercept = 206.759 )
Graph :
PLOTTING LINEAR MODEL
We use geom_smooth to plot the linear model :
Here we use geom_smooth(method = "lm") followed by geom_smooth(). This allows us to
compare the linearity of our model (blue line with the 95% confidence interval in the shaded
region) with a non-linear (red) LOESS model. Since the LOESS smoother remains within the
confidence interval, we can assume the linear trend captures the essence of this
relationship.
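The command for this plot is not listed, so the following is only a sketch of how the two smoothers could be layered. It assumes ggplot2 is installed and uses a simulated stand-in data frame (demo), since the values here are generated and are not the Heartrate data.

```r
library(ggplot2)

# Simulated stand-in for the Heartrate data frame (values are not from the dataset)
set.seed(1)
demo <- data.frame(age = round(runif(200, 29, 77)))
demo$thalach <- 205 - 1.0 * demo$age + rnorm(200, sd = 20)

p <- ggplot(demo, aes(x = age, y = thalach)) +
  geom_point(alpha = 0.5) +
  # straight-line fit with its 95% confidence band (blue by default)
  geom_smooth(method = "lm", formula = y ~ x) +
  # LOESS smoother for comparison, drawn in red without its own band
  geom_smooth(method = "loess", formula = y ~ x, colour = "red", se = FALSE)

print(p)
```

Replacing demo with Heartrate would draw the comparison described above for the real data.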
LINEAR REGRESSION DIAGNOSTICS
Now the linear model is built and we have a formula that we can use to predict the heart
rate (thalach) if the corresponding age is known.
SUMMARY
Code : modelSummary <- summary(linearMod)
print(modelSummary)
RESIDUAL STANDARD ERROR
sigma(linearMod) : 21.06809
sigma(linearMod)/mean(thalach) : [1] 0.1409384
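The residual standard error can also be computed by hand, which makes clear what sigma() reports. The sketch below uses simulated stand-in data (demo), not the heart disease dataset, so its numbers differ from the values above.

```r
# Simulated stand-in data (not the heart disease dataset)
set.seed(42)
demo <- data.frame(age = round(runif(150, 29, 77)))
demo$thalach <- 205 - 1.0 * demo$age + rnorm(150, sd = 20)

fit <- lm(thalach ~ age, data = demo)

# Residual standard error: sqrt(residual sum of squares / residual degrees of freedom)
rse_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# sigma() extracts the same value from the fitted model
rse_builtin <- sigma(fit)

# Error relative to the mean response, as in sigma(linearMod)/mean(thalach)
rse_relative <- rse_builtin / mean(demo$thalach)
```

Read this way, the ratio 0.1409384 reported above means the typical residual is about 14% of the mean heart rate.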

PREDICTION
STEP 1: CREATE THE TRAINING (DEVELOPMENT) AND TEST (VALIDATION) DATA SAMPLES FROM ORIGINAL DATA.
set.seed(100)
# TRUE marks a training row (about 80%), FALSE a test row (about 20%)
in_train <- sample(c(TRUE, FALSE), nrow(Heartrate), replace = TRUE, prob = c(0.8,0.2))
in_train
train <- Heartrate[in_train, ]
test <- Heartrate[!in_train, ]
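The split above assigns each row to the training set independently with probability 0.8, so the training share is only approximately 80%. If an exact 80/20 split is wanted, one possible alternative is to sample row indices without replacement. The sketch below uses a simulated stand-in data frame (demo); the names train_demo and test_demo are hypothetical.

```r
# Simulated stand-in for Heartrate (illustrative values only)
set.seed(100)
demo <- data.frame(age = round(runif(100, 29, 77)))
demo$thalach <- 205 - 1.0 * demo$age + rnorm(100, sd = 20)

# Draw exactly 80% of the row indices without replacement
n         <- nrow(demo)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

train_demo <- demo[train_idx, ]
test_demo  <- demo[-train_idx, ]   # safe here: train_idx is never empty
```

Either approach works for this analysis; the index-based version simply fixes the sizes of the two samples.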
STEP 2: DEVELOP THE MODEL ON THE TRAINING DATA AND USE IT TO PREDICT
THE HEART RATE ON TEST DATA
pmodel <- lm(data = train, thalach ~ age)
print(pmodel)
summary(pmodel)
# for comparison, the model fitted on all data had slope = -1.059 , intercept = 206.759
STEP 3: COMPARE ORIGINAL AND TRAINING DATA, VISUALISING USING GEOM_ABLINE
ggplot(Heartrate, mapping = aes(age,thalach)) +
geom_point() +
# blue: line fitted on the training data; red: line fitted on all data
geom_abline(slope = -1.0123 , intercept = 203.4721 , colour="blue" )+
geom_abline(slope = -1.059 , intercept = 206.759 , colour="red")
STEP 4: CALCULATE PREDICTION ACCURACY AND ERROR RATE
• Training : fit the slope m1 and intercept c1
• Testing : use m1 and c1 to predict y' from x
• Pred-Error : difference between y' and y
• Prediction : y' = m1 ∗ x + c1 for each test pair (x , y)
prediction <- predict(pmodel, test)
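ACTUAL VS PREDICTED:
# actuals must be the observed heart rate of the test rows, so both columns have the same length
avsp <- data.frame(actuals = test$thalach, predicted = prediction)
ACCURACY OF CORRELATION: ACTUAL & PREDICTED :
Code : correlation_accuracy <- cor(avsp)
print(correlation_accuracy)
Output reported for the original run, in which the actuals column was built from the age
vector rather than from the test heart rates:
            actuals  predicted
actuals    1.0000000 -0.1712265
predicted -0.1712265  1.0000000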
PREDICTION ERROR BY MUTATING A NEW COLUMN “predError” INTO AVSP
Here we add a new column that holds the prediction error.
avsp <- avsp %>%
mutate(predError = actuals - predicted)
avspT <- as_tibble(avsp)
PLOTTING THE PREDICTED ERROR
Code : ggplot(data = avspT) +
geom_point(mapping = aes(x = actuals, y = predError))
Graph:
The prediction error is plotted against the actual values, which shows how far the
predictions fall from the observed data across its range.
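Step 4 mentions an error rate, and two common summaries of prediction error are the root mean squared error (RMSE) and the mean absolute error (MAE). The sketch below shows how they could be computed on a held-out sample. It uses simulated stand-in data (demo) with hypothetical names, so the resulting numbers say nothing about the heart disease model itself.

```r
# Simulated stand-in data (not the heart disease dataset)
set.seed(7)
demo <- data.frame(age = round(runif(120, 29, 77)))
demo$thalach <- 205 - 1.0 * demo$age + rnorm(120, sd = 20)

# Hold out the last 24 rows as a test set
train_demo <- demo[1:96, ]
test_demo  <- demo[97:120, ]

fit  <- lm(thalach ~ age, data = train_demo)
pred <- predict(fit, newdata = test_demo)

# Compare predictions with the observed response in the test set
err  <- test_demo$thalach - pred
rmse <- sqrt(mean(err^2))   # root mean squared error
mae  <- mean(abs(err))      # mean absolute error
```

Both values are in the units of the response (beats per minute here), which makes them easier to interpret than a correlation.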
