
ASSIGNMENT 2

Data Analytics & R


TOPIC: Dataset analysis using RStudio

SUBMITTED TO: SUBMITTED BY:


SUBHANKAR MISHRAH SIVANI JAYANTH
ASST.PROFESSOR ROLL NO. 21
BFT/17/1109

DATASET
Dataset : “Heart Disease Dataset” : Kaggle

HEART DISEASE DATASET

DESCRIPTION

The “Heart Disease” data consist of 300 observations of 14 variables. People from various age
groups were observed in an attempt to confirm the presence of heart disease. Several conditions
of these people were analysed to understand their effect on the presence or absence of heart
disease.

INSIGHTS:-

 There are 300 observations of 14 variables.
 The dependent variable is target, which takes two values: 0 means there is no heart
attack and 1 means there is a heart attack.
 All other variables, such as age, sex and cp, are independent variables.
 The variables are of integer and numeric data types.

FORMAT

The data frame contains the following components:

 Age : Age in years
 Sex : Sex (1 = male; 0 = female)
 Cp : Chest pain type
 Trestbps : Resting blood pressure (in mm Hg on admission to the hospital)
 Chol : Serum cholesterol in mg/dl
 Fbs : Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
 Restecg : Resting electrocardiographic results
 Thalach : Maximum heart rate achieved
 Exang : Exercise induced angina (1 = yes; 0 = no)
 Oldpeak : ST depression induced by exercise relative to rest
 Slope : The slope of the peak exercise ST segment
 Ca : Number of major vessels (0-3) coloured by fluoroscopy
 Thal : 1 = normal; 2 = fixed defect; 3 = reversible defect
 Target : 1 or 0 (1 = heart attack; 0 = no heart attack)

SOURCE :

https://www.kaggle.com/johnsmith88/heart-disease-dataset

IMPORTING THE DATASET TO R


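library(readxl)   # read_excel()
library(ggplot2)  # ggplot() and the geoms used later
library(dplyr)    # %>%, arrange(), filter(), select(), mutate()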

heart_disease <- read_excel("C:/Users/ACER/Desktop/heart disease.xlsx")
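A quick structural check after the import can confirm the points made in the description (number of rows and columns, data types, coding of target). The sketch below runs on a small stand-in data frame, heart_demo, whose values are invented purely for illustration; with the real file the same calls would be applied to heart_disease.

```r
# Stand-in for the imported data: values are invented, column names follow the FORMAT list
heart_demo <- data.frame(
  age    = c(63, 37, 41, 56, 57),
  sex    = c(1, 1, 0, 1, 0),
  chol   = c(233, 250, 204, 236, 354),
  target = c(1, 1, 1, 0, 0)
)

dims       <- dim(heart_demo)             # number of rows, number of columns
col_types  <- sapply(heart_demo, class)   # data type of every variable
target_tab <- table(heart_demo$target)    # how many 0s and 1s
n_missing  <- sum(is.na(heart_demo))      # total number of missing values

print(dims)
print(col_types)
print(target_tab)
print(n_missing)
```

If the dimensions or types differ from what the description states, it is worth resolving that before removing outliers or fitting models.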

REMOVING THE OUTLIERS

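Removing outliers from “chol”:

chol <- heart_disease$chol
boxplot(chol)  # before removal
outliers <- boxplot(chol, plot = FALSE)$out  # values beyond the whiskers
outliers
# logical indexing keeps all rows when there are no outliers; -which() would drop them all
heart_disease <- heart_disease[!(chol %in% outliers), ]
chol <- heart_disease$chol
max(chol)
min(chol)
boxplot(chol)  # after removal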

Removing outliers from “trestbps”:

trestbps <- heart_disease$trestbps
outliers <- boxplot(trestbps, plot = FALSE)$out
outliers
heart_disease <- heart_disease[!(trestbps %in% outliers), ]
# refresh the working vectors after rows were dropped
age <- heart_disease$age
trestbps <- heart_disease$trestbps
chol <- heart_disease$chol
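The rule that boxplot() applies when it reports $out can be reproduced by hand. The sketch below uses an invented vector x (not the heart data) and boxplot.stats(), which returns the same statistics without drawing anything: a value counts as an outlier when it lies more than 1.5 times the hinge spread beyond a hinge. It also illustrates a common pitfall of negative indexing with which() when no outliers are present.

```r
# Invented measurements with one obvious outlier
x <- c(10, 12, 13, 14, 15, 16, 18, 100)

stats  <- boxplot.stats(x)        # statistics used by boxplot(), without a plot
hinges <- stats$stats[c(2, 4)]    # lower and upper hinge
fence  <- 1.5 * diff(hinges)      # whiskers reach at most 1.5 x the hinge spread
limits <- c(hinges[1] - fence, hinges[2] + fence)

out     <- stats$out              # values outside the limits
cleaned <- x[!(x %in% out)]       # logical indexing keeps everything else

# Pitfall: with no outliers, -which(...) gives an empty index and drops every value
no_out  <- numeric(0)
emptied <- x[-which(x %in% no_out)]
kept    <- x[!(x %in% no_out)]

print(limits)
print(out)
print(length(emptied))
print(length(kept))
```

Logical indexing with ! behaves the same whether or not any outliers are found, which is why it is the safer form for this step.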

STATISTICAL INDICATORS
 Mean

mean(chol)

 Variance

var(chol)

 Standard Deviation

sd(chol)

 Standard Error

sd(chol)/sqrt(length(chol))

 Median absolute deviation

mad(chol)

 Minimum

min(chol)

 Median

median(chol)

 Maximum

max(chol)

 Range

max(chol)-min(chol)

 Quantile

quantile(chol)

quantile(trestbps, 1)    # 100th percentile, i.e. the maximum

quantile(trestbps, 0.1)  # 10th percentile

 Interquartile Range

IQR(chol)

 Summary

summary(chol)
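To make clear what the built-in functions compute, the indicators can also be worked out by hand. This is a minimal sketch on an invented sample x (not the cholesterol values); the names ending in _manual are illustrative only.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # invented sample
n <- length(x)

var_manual <- sum((x - mean(x))^2) / (n - 1)             # sample variance
sd_manual  <- sqrt(var_manual)                           # standard deviation
se_manual  <- sd_manual / sqrt(n)                        # standard error of the mean
mad_manual <- 1.4826 * median(abs(x - median(x)))        # mad() scales by 1.4826 by default
iqr_manual <- unname(diff(quantile(x, c(0.25, 0.75))))   # third quartile minus first quartile

print(c(var = var_manual, sd = sd_manual, se = se_manual,
        mad = mad_manual, iqr = iqr_manual))
```

Each manual value agrees with var(x), sd(x), mad(x) and IQR(x) respectively; the standard error has no dedicated base function, which is why it is written out as sd(x)/sqrt(length(x)).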

BASIC GRAPHS IN R
1) BARPLOT:

sex <- heart_disease$sex
target <- heart_disease$target
b <- table(sex, target)  # rows: sex, columns: target
barplot(b,
        col = c("red", "blue"),
        legend = rownames(b),
        beside = TRUE,
        xlab = "Target",
        ylab = "Count",
        main = "Side by Side Barplot")

2) BOXPLOT:

boxplot(chol, col = "red")  # base colour name; redblue() is not available without the gplots package

3) HISTOGRAM:

hist(trestbps, col = rainbow(7),
     main = "Histogram for RBP",
     xlab = "Rest Blood Pressure Class",
     ylab = "Frequency",
     labels = TRUE)
4) PLOT :

plot(chol,col = factor(sex))

5) DENSITY :

oldpeak <- heart_disease$oldpeak
plot(density(oldpeak),
     main = "Density plot of Oldpeak",
     xlab = "Oldpeak",
     ylab = "Density")
polygon(density(oldpeak), col = "orange", border = "green")

6) PIE CHART :

a <- table(heart_disease$sex)  # counts in the order 0 (female), 1 (male)
pct <- round(a / sum(a) * 100)
lbs <- paste0(c("Female", "Male"), " ", pct, "%")
pie(a, labels = lbs, main = "Percentage of Male and Female")

PLOTTING GRAPHS USING DIFFERENT GEOMS


1) AGE Vs. GENDER

 Geom used: geom_histogram

geom_histogram :

Visualise the distribution of a single continuous variable by dividing the x axis into bins and
counting the number of observations in each bin. Histograms ( geom_histogram() ) display
the counts with bars.
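The binning step described here can be reproduced without drawing anything. The following sketch uses invented ages and a hypothetical bin width of 10 years; cut() assigns each value to a bin and table() counts the observations per bin, which is what the bar heights of a histogram show.

```r
ages   <- c(29, 34, 41, 45, 52, 54, 57, 58, 61, 67)   # invented ages
breaks <- seq(20, 70, by = 10)                         # bin edges, width 10

# right = FALSE makes each bin closed on the left: [20,30), [30,40), ...
bins   <- cut(ages, breaks = breaks, right = FALSE)
counts <- table(bins)
print(counts)

# hist() performs the same counting; plot = FALSE returns the numbers only
h <- hist(ages, breaks = breaks, right = FALSE, plot = FALSE)
print(h$counts)
```

geom_histogram() uses 30 bins by default, so in practice the bins or binwidth argument is usually set explicitly to get bins that are easy to read.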

 Command:

ggplot(heart_disease,aes(x=age,fill = factor(sex)))+
geom_histogram()+
xlab("Age") +
ylab("Number")+ guides(fill = guide_legend(title = "Gender"))

 Graph : Distribution of Male and Female population across Age parameter

 Inference :

This graph shows the distribution of males and females in each age category. The x-axis
represents age and the y-axis represents the frequency of each age group. Gender is
shown across the age parameter using colours (blue and red): “1” represents male and is
blue, whereas “0” represents female and is red. From this graph we can infer
that males are observed more often than females in most age groups, and that the most
observed age group is between 50 and 60 years.

2) REPRESENTATION OF CHOLESTEROL LEVEL


 Geom used : geom_point

geom_point is used to create scatterplots. The scatterplot is most useful for displaying
the relationship between two continuous variables.

 Command :

ggplot(heart_disease, aes(x = age, y = chol, color = as.factor(sex), size = chol)) +
geom_point(alpha = 0.7) +
xlab("Age") +
ylab("Cholesterol")

 Graph : Cholesterol among different age groups with their gender

 Inference :

In this graph the x-axis shows age, the y-axis shows the cholesterol level, and gender is
represented by colour. The size of each point also encodes the cholesterol level, so higher
levels appear as larger points. Here “1” represents male and “0” represents female. From this
graph we can infer that males are observed more often than females, and that most people with
high cholesterol are between 50 and 60 years old and are men.
3) COMPARISON OF BLOOD PRESSURE ACROSS PAIN TYPE

• Geom used : geom_boxplot & facet_grid

geom_boxplot draws a box and whiskers plot. The lower and upper "hinges" of the box
correspond to the first and third quartiles (the 25th and 75th percentiles). A data.frame, or
other object, passed as the data argument will override the plot data; all objects will be
fortified to produce a data frame.

The Facet grid forms a matrix of panels defined by row and column faceting variables. It is
most useful when you have two discrete variables, and all combinations of the variables
exist in the data.
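The numbers behind each box can be checked directly. This is a small sketch with invented blood pressure readings for three chest pain types; tapply() computes the quartiles separately for each value of cp, which corresponds to one panel per chest pain type.

```r
# Invented resting blood pressure readings for three chest pain types
cp       <- c(0, 0, 0, 1, 1, 1, 2, 2, 2)
trestbps <- c(110, 120, 130, 120, 130, 140, 125, 135, 150)

# First quartile, median and third quartile for every chest pain type
q_by_cp <- tapply(trestbps, cp, quantile, probs = c(0.25, 0.5, 0.75))
print(q_by_cp)

# The three values that define the box for cp = 0
box_cp0 <- unname(q_by_cp[["0"]])
print(box_cp0)
```

Comparing these values between groups gives the same reading as comparing the boxes across panels, only in numeric form.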

 Command

ggplot(heart_disease,aes(x=factor(sex),y=trestbps))+
geom_boxplot(fill="darkorange")+
xlab("Sex")+
ylab("BP")+
facet_grid(~cp)

 Graph : comparison of blood pressure across pain type

 Inference:

In this graph the x-axis represents sex and the y-axis represents blood pressure. The graph
is divided into panels according to chest pain type, and within each panel the BP values
of each gender are shown separately. From this graph we can infer that most people
under observation have chest pain type 2 and that their BP is between 120 and 140. The graph
also shows clearly which chest pain types males and females have and their BP levels.
4) OLDPEAK ANALYSIS

 Geom used : geom_point , geom_smooth

Geom_smooth is designed to estimate f(x) when the shape is unknown, but assumed to be
smooth.

 Command :

ggplot(heart_disease, aes(x = oldpeak, y = target)) +
geom_point() +
geom_smooth(color = "red") +
xlab("OldPeak") +
ylab("Prob. of Heart Attack") +
ggtitle("Relation between oldpeak and heart attack")

 Graph : Relation between oldpeak and heart attack

 Inference: In this graph the x-axis represents the ST depression induced by exercise
relative to rest (oldpeak), and the y-axis the probability of heart attack. The smooth curve
between oldpeak and heart attack shows a decreasing trend: in general, as oldpeak
increases, the probability of heart attack decreases.
5) SLOPE ANALYSIS

 Geom used : geom_smooth & geom_point

 Command

ggplot(heart_disease,aes(x=slope,y=target))+geom_point()+geom_smooth(color="cyan")+
scale_x_continuous(name="Slope")+ scale_y_continuous(name="Target")+
ggtitle("Relationship between Slope and Target")

 Graph : Slope analysis with probability of heart disease

 Inference:

The graph above shows how the probability of heart disease changes as slope increases. From
the smooth curve between slope and target we can see that beyond slope = 1, the
probability of heart attack increases as slope increases.
FORMULATION OF HYPOTHESIS

To test whether there is any significant relationship between age and heart rate, the
following hypothesis was formulated for this study:

“Heart rate decreases with increasing age”

 H0: Heart rate decreases with increasing age.


 H1: Heart rate increases with increasing age.

TRANSFORMATION
Here, I want to find the relation between heart rate and age, and show that as age
increases heart rate decreases. For this I transform the dataset for my
convenience.

Using:

arrange() : arranged the people by age from lowest to highest

filter() : kept the ages between 25 and 80

select() : selected the columns “age” & “thalach”

CODE :

Heartrate <- heart_disease %>%
arrange(age, thalach) %>%
filter(age > 25, age < 80) %>%
dplyr::select(age, thalach)

age <- Heartrate$age
thalach <- Heartrate$thalach
summary(Heartrate)
MODELING

DATASET : HEARTRATE

View(Heartrate)

SCATTER PLOT: VISUALIZE THE LINEAR RELATIONSHIP BETWEEN THE PREDICTOR AND RESPONSE

PLOT: HEART RATE Vs AGE ( Using ggplot )

CODE: ggplot(Heartrate) + geom_point(mapping = aes(age, thalach))

Graph:

INFERENCE: In this graph heart rate (thalach) is plotted on the y-axis and age on the x-axis.
The plot helps us visualize the relationship between age and heart rate, which is roughly
linear and decreasing.
CORRELATION

Correlation is a statistical measure of the level of linear dependence between two
variables that occur in pairs. Here, we have age and heart rate. Correlation can take values
between -1 and +1.

Code : cor(age, thalach)

The value of our correlation is : [1] -0.4178001

From this we can infer that our correlation is negative: as age increases, heart rate
tends to decrease. The value lies roughly midway between 0 and -1, so the negative
correlation is moderate rather than strong.
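To show what cor() computes, Pearson's correlation can be written out by hand. The sketch below uses invented ages and heart rates (age_demo and thalach_demo are illustrative names, not columns of the dataset) and then squares the correlation value reported above to see how much of the variation a straight line accounts for.

```r
# Invented ages and maximum heart rates
age_demo     <- c(30, 40, 50, 60, 70)
thalach_demo <- c(190, 175, 170, 150, 145)

dx <- age_demo - mean(age_demo)           # deviations from the mean age
dy <- thalach_demo - mean(thalach_demo)   # deviations from the mean heart rate

# Pearson's r: joint variation divided by the product of the two spreads
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
print(r_manual)

# Share of variance explained by a straight line for the reported correlation
r_reported <- -0.4178001
print(r_reported^2)   # about 0.17
```

A squared correlation of about 0.17 means that age accounts for roughly 17% of the variation in heart rate, which is why the scatter around the fitted line stays large.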

BUILD LINEAR MODEL

Now that we have seen the linear relationship pictorially in the scatter plot and by computing
the correlation, let's see the syntax for building the linear model. The function used for
building linear models is lm(). The lm() function takes two main arguments, namely: 1.
Formula 2. Data. The data is typically a data.frame and the formula is an object of class
formula. The most common convention, however, is to write out the formula directly in place
of the argument, as written below.

Code : linearMod <- lm(data = Heartrate, thalach ~ age)
print(linearMod)

Output of print(linearMod) :

Call:
lm(formula = thalach ~ age, data = Heartrate)

Coefficients:
(Intercept)          age
    206.759       -1.059

Now that we have built the linear model, we have also established the relationship between
the predictor and response in the form of a mathematical formula for heart rate as a function
of age.

In the above output, the ‘Coefficients’ part has two components:
Intercept : 206.759 , age : -1.059. These are also called the beta coefficients. In other words,
thalach = Intercept + (β ∗ age)
=> Heart Rate = 206.759 - 1.059 ∗ age
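As a worked example, the fitted formula can be applied to a few ages. The helper predict_hr below is a hypothetical name introduced only for this illustration; the coefficients are the ones printed by print(linearMod). The second half uses invented data to show that, for a single predictor, the least squares estimates reduce to cov/var for the slope and a mean-based intercept.

```r
# Coefficients reported by print(linearMod)
intercept <- 206.759
slope     <- -1.059

# Hypothetical helper: predicted maximum heart rate for a given age
predict_hr <- function(age) intercept + slope * age
print(predict_hr(c(40, 50, 60)))

# Least squares by hand on invented data, compared with lm()
x  <- c(30, 40, 50, 60, 70)
y  <- c(190, 175, 170, 150, 145)
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept
fit <- lm(y ~ x)
print(c(b0, b1))
print(coef(fit))
```

With the real model, predict(linearMod, data.frame(age = 50)) would give the same kind of prediction without retyping the coefficients.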
ASSESSING OUR MODEL VISUALLY

We use geom_abline and geom_point to assess our model visually. geom_abline
lets us draw a line from a slope and an intercept, so the vertical distance of each point
from the line shows the prediction error.

Code : ggplot(Heartrate, mapping = aes(age, thalach)) +
geom_point() +
geom_abline(slope = -1.059, intercept = 206.759)

Graph :

PLOTTING LINEAR MODEL

We use geom_smooth to plot the linear model:

Here we use geom_smooth(method = "lm") followed by geom_smooth(). This allows us to
compare our linear model (blue line, with the 95% confidence interval as the shaded
region) with a non-linear (red) LOESS model. Since the LOESS smoother remains
within the confidence interval, we can assume the linear trend captures the essence of this
relationship.
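The command for this comparison is not listed in the report, so the following is only a sketch of how such a plot could be produced. It assumes the ggplot2 package and uses an invented data frame demo standing in for Heartrate; the LOESS method is named explicitly and its confidence band is switched off so that only the linear band is shaded.

```r
library(ggplot2)

# Invented data standing in for Heartrate (columns age and thalach)
set.seed(1)
demo <- data.frame(age = round(runif(100, 30, 75)))
demo$thalach <- 205 - 1.0 * demo$age + rnorm(100, sd = 20)

p <- ggplot(demo, aes(x = age, y = thalach)) +
  geom_point() +
  # straight-line fit with its 95% confidence band (blue by default)
  geom_smooth(method = "lm", formula = y ~ x) +
  # flexible LOESS curve in red for comparison, without its own band
  geom_smooth(method = "loess", formula = y ~ x, colour = "red", se = FALSE)

print(p)
```

If the red curve wanders outside the shaded band over a range of ages, that would be a hint that a straight line misses part of the pattern there.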
LINEAR REGRESSION DIAGNOSTICS

Now the linear model is built and we have a formula that we can use to predict the heart rate
(thalach) when the corresponding age is known.

SUMMARY
Code : modelSummary <- summary(linearMod)
print(modelSummary)

RESIDUAL STANDARD ERROR

sigma(linearMod) : [1] 21.06809

sigma(linearMod)/mean(thalach) : [1] 0.1409384

The residual standard error is therefore about 14% of the mean heart rate.
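The residual standard error returned by sigma() can be reproduced from the residuals. The sketch below fits a model to invented data (x and y are not the heart data) and then uses the two values reported above to recover the mean heart rate they imply.

```r
# Invented data for illustration
x <- c(30, 40, 50, 60, 70)
y <- c(190, 175, 170, 150, 145)
fit <- lm(y ~ x)

n          <- length(y)
rse_manual <- sqrt(sum(residuals(fit)^2) / (n - 2))   # n - 2: two coefficients were estimated
rel_error  <- rse_manual / mean(y)                    # error relative to the mean response
print(c(rse = rse_manual, relative = rel_error))

# Mean heart rate implied by the two reported values
mean_thalach <- 21.06809 / 0.1409384
print(mean_thalach)   # about 149.5
```

A typical prediction from the reported model is therefore off by around 21 beats on a mean of roughly 149.5.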


PREDICTION

STEP 1: CREATE THE TRAINING (DEVELOPMENT) AND TEST (VALIDATION) DATA SAMPLES FROM ORIGINAL DATA.

set.seed(100)
# each row goes to the training set with probability 0.8, otherwise to the test set
in_train <- sample(c(TRUE, FALSE), nrow(Heartrate), replace = TRUE, prob = c(0.8, 0.2))
in_train
train <- Heartrate[in_train, ]
test <- Heartrate[!in_train, ]
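Drawing TRUE or FALSE independently for every row gives an 80/20 split only on average. If an exact split is wanted, one alternative is to sample row numbers instead. This is a sketch on an invented data frame demo with 100 rows; train_id is an illustrative name.

```r
set.seed(100)

# Invented stand-in for Heartrate with 100 rows
demo <- data.frame(id = 1:100, age = rep(30:79, 2))

n        <- nrow(demo)
train_id <- sample(seq_len(n), size = floor(0.8 * n))   # exactly 80% of the row numbers
train    <- demo[train_id, ]
test     <- demo[-train_id, ]                           # the remaining 20%

print(c(train = nrow(train), test = nrow(test)))
```

Either approach works for this analysis; the exact version simply makes the sizes of the two sets predictable.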

STEP 2: DEVELOP THE MODEL ON THE TRAINING DATA AND USE IT TO PREDICT
THE HEART RATE ON TEST DATA

pmodel <- lm(data = train, thalach ~ age)
print(pmodel)
summary(pmodel)

For comparison, the model fitted on the full data had slope = -1.059 , intercept = 206.759

STEP 3: COMPARE ORIGINAL AND TRAINING DATA, VISUALISING USING GEOM_ABLINE

ggplot(Heartrate, mapping = aes(age, thalach)) +
geom_point() +
geom_abline(slope = -1.0123, intercept = 203.4721, colour = "blue") +  # training-data fit
geom_abline(slope = -1.059, intercept = 206.759, colour = "red")       # full-data fit (linearMod)

STEP 4: CALCULATE PREDICTION ACCURACY AND ERROR RATE

 Training : fit the slope m1 and intercept c1
 Testing : use m1 and c1 to predict y' from x
 Pred-Error : difference between y' and y
 Prediction : y' = m1 x + c1 for each test point (x , y)

prediction <- predict(pmodel, test)

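ACTUAL VS PREDICTED:

# actuals are the observed heart rates of the test rows, so both columns have the same length
avsp <- data.frame(actuals = test$thalach, predicted = prediction)

ACCURACY OF CORRELATION: ACTUAL & PREDICTED :

Code : correlation_accuracy <- cor(avsp)
print(correlation_accuracy)
avsp

The matrix below was obtained with actuals = age (the predictor, taken from the full
dataset) rather than the heart rate of the test rows, so it does not measure prediction accuracy:

            actuals  predicted
actuals    1.0000000 -0.1712265
predicted -0.1712265  1.0000000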

PREDICTION ERROR BY MUTATING A NEW COLUMN “predError” INTO AVSP

Here, we add a new column holding the difference between the actual and predicted values.


avsp <- avsp %>%
mutate(predError = actuals - predicted)

avspT <- as_tibble(avsp)

PLOTTING THE PREDICTED ERROR

Code : ggplot(data = avspT) +
geom_point(mapping = aes(x = actuals, y = predError))

Graph:

The prediction error is plotted against the actual values, showing how far the
predictions fall from the observations.
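Besides plotting the errors, prediction accuracy is often summarised in a few numbers. The sketch below is one possible way to do this, not part of the original analysis: it builds an invented data frame demo, fits a model on 80% of the rows and computes the root mean squared error, the mean absolute error and the correlation between actual and predicted values on the remaining rows.

```r
# Invented data standing in for Heartrate
set.seed(42)
demo <- data.frame(age = sample(30:75, 120, replace = TRUE))
demo$thalach <- 205 - 1.0 * demo$age + rnorm(120, sd = 15)

train_id <- sample(seq_len(nrow(demo)), size = 96)   # 80% of the rows for training
train <- demo[train_id, ]
test  <- demo[-train_id, ]

fit  <- lm(thalach ~ age, data = train)
pred <- predict(fit, newdata = test)

err  <- test$thalach - pred          # prediction error for every test row
rmse <- sqrt(mean(err^2))            # root mean squared error
mae  <- mean(abs(err))               # mean absolute error
r_ap <- cor(test$thalach, pred)      # correlation of actual and predicted values

print(c(rmse = rmse, mae = mae, cor = r_ap))
```

RMSE and MAE are in the same unit as the response, so for the heart data they would read directly as beats per minute.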
