
Assignment on Data Mining: Assignment 1

Md. Makfidunnabi (20228025)

2023-06-25

Contents

Data

Boxplot

Chi-square test

Correlation

Regression

This assignment works through box plots, scatter diagrams, the chi-square test, correlation, and
regression using hypothetical or secondary data, in R or Python.

Data

The dataset consists of three columns: age, experience, and income.

data <- read.csv("E:/Github/R/Data Mining/multiple_linear_regression_dataset.csv",
                 header = TRUE, sep = ",", stringsAsFactors = TRUE)
str(data)

## 'data.frame': 20 obs. of 3 variables:
## $ age       : int 25 30 47 32 43 51 28 33 37 39 ...
## $ experience: int 1 3 2 5 10 7 5 4 5 8 ...
## $ income    : int 30450 35670 31580 40130 47830 41630 41340 37650 40250 45150 ...

Boxplot

library(ggplot2)
age_boxplot <- ggplot(data) + geom_boxplot(aes(y = age))
experience_boxplot <- ggplot(data) + geom_boxplot(aes(y = experience))
income_boxplot <- ggplot(data) + geom_boxplot(aes(y = income))
library(gridExtra)
grid.arrange(age_boxplot, experience_boxplot, income_boxplot, ncol = 3)

[Figure: side-by-side boxplots of age, experience, and income]

From the first boxplot, the middle 50% of participants are between 31 and 47 years old. The minimum
age is 23 and the maximum age is 57. There are no outliers in this variable. The median age is 40 years.

From the second boxplot, which displays experience, the middle 50% of people have 4 to 9 years of
experience. The median experience is 5 years. Experience has one outlier.

The third boxplot shows income. The middle half of the people have an income between 35000 and 45000.
The median income is 40000. Income also has one outlier.
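The quartiles and outliers read off a boxplot can also be computed directly. The following is a minimal sketch, assuming the 1.5 × IQR rule that geom_boxplot() applies by default. The helper name box_summary is hypothetical, and the vectors hold only the first ten values of each column as printed by str(data), so the numbers will differ from those of the full 20-row dataset.

```r
# First ten values of each column, copied from the str(data) output.
# The remaining ten rows are not shown in the report, so this is a partial sample.
age <- c(25, 30, 47, 32, 43, 51, 28, 33, 37, 39)
experience <- c(1, 3, 2, 5, 10, 7, 5, 4, 5, 8)
income <- c(30450, 35670, 31580, 40130, 47830, 41630, 41340, 37650, 40250, 45150)

# Hypothetical helper: quartiles plus any points lying more than
# 1.5 * IQR below the first quartile or above the third quartile.
box_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  iqr <- q[3] - q[1]
  is_outlier <- x < q[1] - 1.5 * iqr | x > q[3] + 1.5 * iqr
  list(min = min(x), q1 = q[1], median = q[2], q3 = q[3], max = max(x),
       outliers = x[is_outlier])
}

print(box_summary(age))
print(box_summary(experience))
print(box_summary(income))
```

Running box_summary() on each column of the full data frame should reproduce the box edges, medians, and outlier points drawn in the plots.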

Chi-square test

The chi-square test is a statistical test used to determine if there is a significant association between two
categorical variables.

chisq.test(c(data$age, data$experience))

##
## Chi-squared test for given probabilities
##
## data: c(data$age, data$experience)

## X-squared = 585.51, df = 39, p-value < 2.2e-16

chisq.test(c(data$age, data$income))

##
## Chi-squared test for given probabilities
##
## data: c(data$age, data$income)
## X-squared = 878716, df = 39, p-value < 2.2e-16

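In the first chi-square test, the variables are age and experience. In a test of association, the null
hypothesis states that there is no association between the variables, while the alternative hypothesis
states that there is one. Note, however, that age and experience are numeric rather than categorical,
and that passing a single combined vector makes chisq.test() run a "Chi-squared test for given
probabilities", as the output header shows. This is a goodness-of-fit test on the 40 combined values
(df = 39), whose null hypothesis is that all 40 values are equal shares of the total. The p-value is less
than 0.05, so that null hypothesis is rejected. This shows that the values are unequal; it does not by
itself establish an association between age and experience.

In the second chi-square test, the variables are age and income. The p-value is again less than 0.05,
so the null hypothesis is rejected, with the same caveat: the call tests the combined values against
equal probabilities, not the association between age and income.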
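A chi-square test of association works on a contingency table of two categorical variables. Since age and experience are stored as integers, one option is to bin them into groups first. The sketch below is illustrative only: the median split and the group labels are assumptions, not part of the original analysis, and it uses just the first ten rows printed by str(data). With so few observations the expected cell counts are small, so Fisher's exact test is shown as an alternative.

```r
# First ten values of each column, copied from the str(data) output
# (a partial sample; the full dataset has 20 rows).
age <- c(25, 30, 47, 32, 43, 51, 28, 33, 37, 39)
experience <- c(1, 3, 2, 5, 10, 7, 5, 4, 5, 8)

# Assumed binning: split each variable at its median into two groups.
age_group <- factor(ifelse(age > median(age), "older", "younger"),
                    levels = c("younger", "older"))
exp_group <- factor(ifelse(experience > median(experience), "high", "low"),
                    levels = c("low", "high"))

# Cross-tabulate the two categorical variables.
tab <- table(age_group, exp_group)
print(tab)

# Chi-square test of independence on the 2 x 2 table. R warns that the
# approximation may be inaccurate when expected counts are small, so the
# warning is suppressed here and Fisher's exact test is run as well.
chi_result <- suppressWarnings(chisq.test(tab))
fisher_result <- fisher.test(tab)
print(chi_result)
print(fisher_result)
```

On the full data frame the same steps would use data$age and data$experience in place of the two vectors.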

Correlation

library(corrplot)

## corrplot 0.92 loaded

cor_matrix <- cor(data)
cor_matrix

##                  age experience    income
## age        1.0000000  0.6151652 0.5322043
## experience 0.6151652  1.0000000 0.9842266
## income     0.5322043  0.9842266 1.0000000

corrplot(cor_matrix, method = "circle", type = "upper", tl.col = "black")

[Figure: upper-triangle correlation plot of age, experience, and income, with a color scale from −1 to 1]

The plot shows the correlations among the three variables. The color gradient shows the strength of
each relationship: dark blue indicates a strong positive correlation and lighter blue a weaker one. The
lighter blue circle for age and experience indicates that they are only moderately correlated. The same
is true for age and income. On the other hand, experience and income have a dark blue circle, which
implies that they are highly correlated.

From the correlation matrix, the correlation coefficient between age and experience is 0.615, and
between age and income it is 0.532. Experience and income have a very high correlation coefficient of
0.984.
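Each entry of the matrix is a Pearson correlation coefficient. As a minimal sketch of what cor() computes, the hypothetical helper pearson_r below applies the textbook formula and compares it with the built-in function. It uses only the first ten values of experience and income printed by str(data), so the result will not exactly match the coefficient reported for the full dataset.

```r
# First ten values of each column, copied from the str(data) output
# (a partial sample; the full dataset has 20 rows).
experience <- c(1, 3, 2, 5, 10, 7, 5, 4, 5, 8)
income <- c(30450, 35670, 31580, 40130, 47830, 41630, 41340, 37650, 40250, 45150)

# Hypothetical helper: Pearson's r as the sum of cross-products of the
# deviations divided by the square root of the product of the sums of squares.
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual <- pearson_r(experience, income)
r_builtin <- cor(experience, income)
print(c(manual = r_manual, builtin = r_builtin))
```

The two values agree up to floating-point error, which confirms that the matrix entries are plain Pearson coefficients on a scale from -1 to 1 rather than percentages.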

Regression

model <- lm(income ~ age + experience, data = data)
summary(model)

##
## Call:
## lm(formula = income ~ age + experience, data = data)

##
## Residuals:
## Min 1Q Median 3Q Max
## -2707.43 -584.21 25.85 925.75 2043.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31261.69 1306.44 23.929 1.57e-14 ***
## age -99.20 38.98 -2.545 0.0209 *
## experience 2162.40 94.77 22.817 3.44e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1343 on 17 degrees of freedom
## Multiple R-squared: 0.9773, Adjusted R-squared: 0.9747
## F-statistic: 366.5 on 2 and 17 DF, p-value: 1.048e-14

Here income is the dependent variable, and age and experience are the independent variables. So the
regression equation will be

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

Here,

$y$ = Income

$\beta_0$ = Intercept

$\beta_1$ = Coefficient of Age

$\beta_2$ = Coefficient of Experience

$x_1$ = Age

$x_2$ = Experience

From the model summary, the intercept is 31261.69. This is the predicted income for an employee whose
age and experience are both zero, so it serves as the baseline of the fitted equation rather than as a
realistic minimum wage.

The coefficient of age, $\beta_1$, is -99.20, which indicates that, holding experience constant, each
additional year of age is associated with a decrease of 99.20 in income. The p-value of $\beta_1$ is
0.0209. This is below 0.05, so age is still significant at the 5% level, but it is far weaker than
experience. The variable could therefore be dropped to obtain a simpler model.

The coefficient of experience, $\beta_2$, is 2162.40. Holding age constant, each additional year of
experience is associated with an increase of 2162.40 in income. Its p-value is much lower than 0.05, so
the variable is highly significant.
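As a worked example, the estimated coefficients can be plugged into the regression equation to obtain a predicted income. The sketch below is an assumption-laden illustration: the function name predict_income is hypothetical, and the coefficients are the rounded values from the summary output, so results may differ slightly from predict(model). The first row of the dataset (age 25, experience 1, income 30450, as printed by str(data)) is used as the test case.

```r
# Hypothetical helper: evaluate the fitted equation using the rounded
# coefficient estimates reported by summary(model).
predict_income <- function(age, experience) {
  31261.69 - 99.20 * age + 2162.40 * experience
}

# First observation in the dataset: age 25, experience 1, income 30450.
fitted_first <- predict_income(age = 25, experience = 1)
residual_first <- 30450 - fitted_first

print(fitted_first)
print(residual_first)
```

For this observation the fitted value is about 30944.09, giving a residual of about -494.09, which lies within the range of residuals reported in the summary.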

If we remove the age variable and plot a regression model in which income is the dependent variable
and experience is the independent variable, we get the following plot:

ggplot(data, aes(x = experience, y = income)) + geom_point() +
  geom_smooth(method = "lm", se = FALSE) + xlab("Experience") +
  ylab("Income") + labs(title = "Regression Model")

## `geom_smooth()` using formula = 'y ~ x'

[Figure: "Regression Model", scatter plot of Income against Experience with the fitted regression line]
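The line in the plot corresponds to a simple regression of income on experience alone. A minimal sketch of fitting that reduced model and comparing it with the two-predictor model is given below. The data frame sample_rows is an assumption made for self-containment: it holds only the first ten rows printed by str(data), so its coefficients and test results will differ from those of the full 20-row dataset.

```r
# First ten rows of the dataset, copied from the str(data) output
# (a partial sample used only to keep this example self-contained).
sample_rows <- data.frame(
  age = c(25, 30, 47, 32, 43, 51, 28, 33, 37, 39),
  experience = c(1, 3, 2, 5, 10, 7, 5, 4, 5, 8),
  income = c(30450, 35670, 31580, 40130, 47830, 41630, 41340, 37650, 40250, 45150)
)

# Model with both predictors, and the reduced model without age.
full_model <- lm(income ~ age + experience, data = sample_rows)
reduced_model <- lm(income ~ experience, data = sample_rows)

# Intercept and slope of the line that geom_smooth(method = "lm") would draw
# for these rows.
print(coef(reduced_model))

# F-test for nested models: does adding age improve the fit significantly?
comparison <- anova(reduced_model, full_model)
print(comparison)
```

On the full dataset, replacing sample_rows with data gives the comparison that bears on whether age should be dropped.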
