DATASET
Dataset : “Heart Disease Dataset” : Kaggle
DESCRIPTION
The “Heart Disease” data consists of 300 observations of 14 variables. People of various age
groups were observed to determine the presence of heart disease. Various conditions of
these people were analysed to understand their effect on the presence or absence of heart
disease.
INSIGHTS:-
FORMAT
SOURCE :
https://www.kaggle.com/johnsmith88/heart-disease-dataset
# Cholesterol: draw a boxplot and list the outlying values
chol = heart_disease$chol
boxplot(chol)
boxplot(chol, plot=FALSE)$out
# Largest and smallest cholesterol values
max(chol)
min(chol)
# Resting blood pressure: list the outlying values
trestbps = heart_disease$trestbps
boxplot(trestbps, plot=FALSE)$out
# Age is used in the later sections
age = heart_disease$age
STATISTICAL INDICATORS
Mean
mean(chol)
Variance
var(chol)
Standard Deviation
sd(chol)
Standard Error
sd(chol)/sqrt(length(chol))
Median Absolute Deviation
mad(chol)
Minimum
min(chol)
Median
median(chol)
Maximum
max(chol)
Range
max(chol)-min(chol)
Quantile
quantile(chol)
quantile(trestbps,1)
quantile(trestbps,0.1)
Interquartile Range
IQR(chol)
summary(chol)
BASIC GRAPHS IN R
1) BARPLOT:
sex = heart_disease$sex
target = heart_disease$target
b=table(sex,target)
barplot(b,
col=c("red","blue"),
legend=rownames(b),
beside=TRUE,
xlab="Target",
ylab="Count")
2) BOXPLOT:
library(gplots) # provides redblue()
boxplot(chol,col = redblue(1))
3) HISTOGRAM:
hist(trestbps,col=rainbow(7),
ylab="Frequency",
labels=TRUE)
4) PLOT :
plot(chol,col = factor(sex))
5) DENSITY :
oldpeak = heart_disease$oldpeak
plot(density(oldpeak),
xlab="Oldpeak",
ylab="Density")
polygon(density(oldpeak),col="orange",border="green")
6) PIE CHART :
a=table(heart_disease$sex)
pct=round(a/sum(a)*100)
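The percentages computed above are typically passed to pie() as slice labels. The following is a minimal sketch under stated assumptions: sex_example is a hypothetical vector standing in for heart_disease$sex, and the label format and colours are illustrative choices, not taken from the original report.

```r
# Hypothetical sex codes standing in for heart_disease$sex (1 = male, 0 = female)
sex_example <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1)

# Count each category and convert the counts to rounded percentages
a <- table(sex_example)
pct <- round(a / sum(a) * 100)

# Build one label per slice, for example "1: 70%"
pie_labels <- paste0(names(a), ": ", pct, "%")

# Draw the pie chart with the percentage labels
pie(a, labels = pie_labels, col = c("red", "blue"),
    main = "Sex distribution (example data)")
```

With the real data, replacing sex_example with heart_disease$sex should give the same kind of chart.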
geom_histogram :
Visualise the distribution of a single continuous variable by dividing the x axis into bins and
counting the number of observations in each bin. Histograms ( geom_histogram() ) display
the counts with bars.
Command:
library(ggplot2)
ggplot(heart_disease,aes(x=age,fill = factor(sex)))+
geom_histogram()+
xlab("Age") +
ylab("Number")+ guides(fill = guide_legend(title = "Gender"))
Inference :
This graph shows the distribution of males and females across age. The x-axis
represents age and the y-axis represents the number of observations in each age bin.
Gender is shown using colour (blue and red): “1” represents male and is shown in blue,
whereas “0” represents female and is shown in red. From this graph we can infer that
more males than females were observed across the age groups, and that the most
frequently observed age group is between 50 and 60 years.
geom_point is used to create scatterplots. The scatterplot is most useful for displaying
the relationship between two continuous variables.
Command :
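# aes() mapping assumed from the inference below: colour by sex, point size by chol
ggplot(heart_disease,aes(x=age,y=chol,col=factor(sex),size=chol))+
geom_point(alpha=0.7)+xlab("Age") +
ylab("Cholesterol")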
Inference :
In this graph the x-axis shows age and the y-axis shows the cholesterol level. Gender is
represented by colour, where “1” represents male and “0” represents female, and the size
of each point also reflects the cholesterol level. From this graph we can infer that more
males than females were observed, and that most people with high cholesterol are between
50 and 60 years old and most of them are men.
3) COMPARISON OF BLOOD PRESSURE ACROSS PAIN TYPE
A data.frame, or other object, will override the plot data. All objects will be fortified to
produce a data frame. The upper and lower "hinges" correspond to the first and third
quartiles (the 25th and 75th percentiles).
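To make the role of the hinges concrete, here is a small sketch in base R that computes them and flags the points a boxplot would draw as outliers (more than 1.5 times the hinge spread beyond a hinge). The vector x is made-up example data, not values from the heart disease dataset.

```r
# Hypothetical blood pressure readings (example data only)
x <- c(94, 110, 120, 125, 130, 130, 140, 145, 150, 200)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Points beyond 1.5 * spread from a hinge are treated as outliers
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread
manual_out <- x[x < lower_fence | x > upper_fence]

# boxplot.stats() applies the same rule internally
builtin_out <- boxplot.stats(x)$out
print(manual_out)
print(builtin_out)
```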
The Facet grid forms a matrix of panels defined by row and column faceting variables. It is
most useful when you have two discrete variables, and all combinations of the variables
exist in the data.
Command
ggplot(heart_disease,aes(x=factor(sex),y=trestbps))+
geom_boxplot(fill="darkorange")+
xlab("Sex")+
ylab("BP")+
facet_grid(~cp)
Inference:
In this graph the x-axis represents sex and the y-axis represents blood pressure (BP). The
graph is divided into panels according to the chest pain type (cp), and each panel shows
the BP values of each gender separately. From this graph we can infer that most people
under observation experience pain type 2 and that their BP is between 120 and 140. The
graph also helps us see clearly the pain type experienced by males and females and their BP levels.
4) OLDPEAK ANALYSIS
geom_smooth is designed to estimate f(x) when the shape is unknown, but assumed to be
smooth.
Command :
ggplot(heart_disease,aes(x=oldpeak,y=target))+
geom_point()+
geom_smooth(color="red")+
xlab("OldPeak")+
ylab("Prob. of Heart Attack")+
ggtitle("Relation between oldpeak and heart attack")
Command
ggplot(heart_disease,aes(x=slope,y=target))+geom_point()+geom_smooth(color="cyan")+
scale_x_continuous(name="Slope")+ scale_y_continuous(name="Target")+
ggtitle("Relationship between Slope and Target")
Inference:
The first graph shows how the probability of heart disease changes as oldpeak increases.
From the smooth curve between slope and target we can see that beyond a slope of 1, the
probability of heart attack increases as slope increases.
FORMULATION OF HYPOTHESIS
To test whether there is any significant relationship between the Age and Heart rate, the
following hypothesis was formulated for this study:
TRANSFORMATION
Here I want to find the relation between heart rate and age and test whether heart rate
changes as age increases. For this I transform the dataset for convenience, keeping only
the age and thalach columns with dplyr::select().
CODE :
Heartrate = dplyr::select(heart_disease, age, thalach)
age = Heartrate$age
thalach = Heartrate$thalach
summary(Heartrate)
MODELING
DATASET : HEARTRATE
View(Heartrate)
Graph:
INFERENCE: In this graph we have plotted heart rate on the y-axis and age on the x-axis. From
this we can infer that there is a linear relation between thalach and age. The graph helps us
to visualize the linear relationship between age and heart rate.
CORRELATION
Correlation is a statistical measure of the level of linear dependence between two
variables that occur in pairs. Here, we have age and heart rate. Correlation can take values
between -1 and +1.
In our data, as age increases the heart rate tends to decrease.
Here we have a high negative correlation between them because the value is closer to -1.
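As an illustration of how such a correlation can be computed, the sketch below uses cor() on two short made-up vectors; age_example and thalach_example are hypothetical stand-ins for Heartrate$age and Heartrate$thalach, so the printed value is not the value for the real dataset.

```r
# Hypothetical paired observations (example data only)
age_example <- c(30, 40, 50, 60, 70)
thalach_example <- c(180, 172, 160, 155, 140)

# Pearson correlation computed by the built-in function
r <- cor(age_example, thalach_example)

# The same quantity from its definition: covariance over the product of
# the two standard deviations
manual_r <- cov(age_example, thalach_example) /
  (sd(age_example) * sd(thalach_example))

print(r)
print(manual_r)
```

A value near -1 means heart rate falls almost linearly as age rises; a value near 0 means little linear relationship.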
Now that we have seen the linear relationship pictorially in the scatter plot and by computing
the correlation, let's see the syntax for building the linear model. The function used for
building linear models is lm(). The lm() function takes two main arguments, namely: 1.
Formula 2. Data. The data is typically a data.frame and the formula is an object of class
formula. The most common convention is to write out the formula directly in place of the
argument as written below.
linearMod <- lm(thalach ~ age, data = Heartrate)
print(linearMod)
Call:
lm(formula = thalach ~ age, data = Heartrate)
Coefficients:
(Intercept)          age
    206.759       -1.059
Now that we have built the linear model, we have also established the relationship between
the predictor and the response in the form of a mathematical formula for heart rate as a
function of age.
In the above output, notice that the ‘Coefficients’ part has two components:
Intercept: 206.759 and age: -1.059. These are also called the beta coefficients. In other words,
thalach = Intercept + (β ∗ age)
=> Heart Rate = 206.759 - 1.059 ∗ age
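The fitted formula can be applied directly to new ages. The sketch below plugs the two reported coefficients into a small helper; the function name predict_heart_rate and the chosen ages are illustrative assumptions.

```r
# Coefficients reported in the model output above
intercept <- 206.759
slope <- -1.059

# Hypothetical helper: predicted heart rate for a given age
predict_heart_rate <- function(age) intercept + slope * age

# Example ages chosen for illustration
ages <- c(40, 50, 60)
predicted <- predict_heart_rate(ages)
print(data.frame(age = ages, predicted_thalach = predicted))
```

For an age of 50 this gives 206.759 - 1.059 ∗ 50 = 153.809.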
ASSESSING OUR MODEL VISUALLY
We use geom_abline and geom_point to assess our model visually. We use geom_abline
because it allows us to plot a line with a given slope and intercept, which helps us to see
the prediction error.
Graph :
Now the linear model is built and we have a formula that we can use to predict the heart
rate if the corresponding age is known.
SUMMARY
Code :
modelSummary <- summary(linearMod)
print(modelSummary)
sigma(linearMod) : 21.06809
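The value returned by sigma() is the residual standard error: the square root of the residual sum of squares divided by the residual degrees of freedom. The sketch below checks this on simulated data; hr_example and the numbers used to generate it are assumptions for illustration, not the real Heartrate data.

```r
# Simulated stand-in for the Heartrate data (example values only)
set.seed(1)
hr_example <- data.frame(age = seq(30, 75, by = 5))
hr_example$thalach <- 206 - 1.0 * hr_example$age +
  rnorm(nrow(hr_example), sd = 10)

# Fit the same kind of model as in the text
fit <- lm(thalach ~ age, data = hr_example)

# Residual standard error from its definition
rss <- sum(residuals(fit)^2)
manual_sigma <- sqrt(rss / df.residual(fit))

print(manual_sigma)
print(sigma(fit))
```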
set.seed(100)
# Randomly flag about 80% of the rows for training and the rest for testing
in_train <- sample(c(TRUE, FALSE), nrow(Heartrate), replace = TRUE, prob = c(0.8,0.2))
train <- Heartrate[in_train, ]
test <- Heartrate[!in_train, ]
STEP 2: DEVELOP THE MODEL ON THE TRAINING DATA AND USE IT TO PREDICT
THE HEART RATE ON TEST DATA
Training : estimate the slope m1 and the intercept c1
Testing : use m1 and c1 to predict y' from x
Pred-Error : difference between y' and y
Prediction : y' = m1x + c1 for each test pair (x , y)
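The steps above can be sketched end to end as follows. This is a minimal illustration on simulated data: hr_example, the sample size and the noise level are assumptions, so the printed correlation will differ from the value reported for the real dataset.

```r
# Simulated stand-in for the Heartrate data (example values only)
set.seed(100)
n <- 200
hr_example <- data.frame(age = sample(29:77, n, replace = TRUE))
hr_example$thalach <- 206.759 - 1.059 * hr_example$age + rnorm(n, sd = 21)

# Split roughly 80/20 into training and test rows
in_train <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.8, 0.2))
train <- hr_example[in_train, ]
test <- hr_example[!in_train, ]

# Training: estimate slope m1 and intercept c1 on the training rows only
fit <- lm(thalach ~ age, data = train)
m1 <- unname(coef(fit)["age"])
c1 <- unname(coef(fit)["(Intercept)"])

# Testing: predict y' = m1 * x + c1 for the held-out rows
predicted <- m1 * test$age + c1

# Prediction error: difference between y' and the actual y
pred_error <- predicted - test$thalach

# Correlation between actual and predicted values
actuals_preds <- data.frame(actuals = test$thalach, predicted = predicted)
print(cor(actuals_preds))
```

The manual formula gives the same predictions as predict(fit, newdata = test); it is written out here only to mirror the notation in the text.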
ACTUAL VS PREDICTED:
             actuals  predicted
actuals    1.0000000 -0.1712265
predicted -0.1712265  1.0000000
The actual and predicted values are compared with each other. Their correlation is
about -0.17, which indicates only weak agreement between the predictions and the
actual test values.