
ASSIGNMENT: CA1

COURSE CODE- MGNM801

COURSE TITLE- BUSINESS ANALYTICS - I

SUBMITTED BY

NAME- Anuvartika Bharti

SECTION- Q2143

MITTAL SCHOOL OF BUSINESS

REGISTRATION NO- 12107644


ROLL NO- RQ2143B45

SUBMITTED TO
Mr. Tanveer Kajla
Faculty, Mittal School of Business
INTRODUCTION
The first dataset covers current Covid-19 figures for India and is taken from Kaggle. For each state
and union territory it records the population, total registered Covid cases, active cases, deaths,
the number of people recovered and discharged, and the active, discharge, and death ratios.

OVERVIEW OF DATA
The second dataset covers the lung capacity of people in relation to their smoking habit (whether
they smoke or not), their type of birth (Caesarean or normal), and their age. The Lung Capacity
dataset is also taken from Kaggle. Its columns are Lung Capacity, Age, Height, Smoke, Gender (male
or female), and Birth type.

OVERVIEW OF DATA
ANALYSIS
All the analysis, code, and visualization were done in RStudio.

STEP I: The data used for the analysis is in csv format, so I first imported the dataset into the
RStudio environment using the read.csv function. The imported data is saved in the variable ‘d’ so
that I can access it by that name. The data is then attached to the R environment so that the
columns of the dataset can be referred to directly by name.
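
The import step can be sketched as follows. This is a minimal illustration, not the report's actual script: the file is a small temporary CSV written on the spot, and the column names (State, TotalCases, Deaths) and values are assumptions made for the example, not the Kaggle data.

```r
# Write a tiny stand-in CSV so the example runs without the Kaggle file.
# Column names and values are illustrative assumptions, not the real data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "State,TotalCases,Deaths",
  "A,1000,12",
  "B,2500,40",
  "C,400,3"
), csv_path)

# Import the CSV into a data frame named d, as in the report.
d <- read.csv(csv_path)

# attach() puts the columns on the search path so they can be used by name.
attach(d)
mean_cases <- mean(TotalCases)   # equivalent to mean(d$TotalCases)
detach(d)                        # detach again to avoid name clashes later

# Compact overview of the imported columns and their types.
str(d)
```

Detaching after use matters when a second dataset is later stored under the same variable name, since attached columns from the first dataset would otherwise stay on the search path.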

STEP II: View the data in RStudio to check that all of it was imported correctly into the R
environment. Executing the View function gives the overview of the Covid 19 data that was imported
into RStudio.
STEP III: Performing Correlation in the covid 19 dataset.

Finding the correlation between Total number of cases and Deaths happening due to covid 19.

First, I installed the required packages: “ggplot2”, which provides the ggplot function for creating
visualizations and gaining insight into the data through graphs, and “ggcorrplot” for plotting the
correlation matrix. I then loaded the installed packages with the library function so that they can
be used in the R environment.

Correlation can only be calculated on numerical data, so I extracted the numerical columns of the
dataset into the variable d1 from d, the variable in which I had saved the imported data. This gave
me all the columns required for performing the correlation.

Then I ran the correlation function cor() on the data saved in variable d1, which gives the
correlation between every pair of columns in the dataset, as shown in the output.
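
A minimal sketch of these two steps, selecting the numerical columns and passing them to cor(), is given below. The data frame is a made-up stand-in and its column names are assumptions, so the coefficients it prints are not the ones reported for the real dataset.

```r
# Illustrative stand-in for the imported data; names and values are made up.
d <- data.frame(
  State      = c("A", "B", "C", "D", "E"),
  TotalCases = c(1000, 2500, 400, 5200, 3100),
  Active     = c(50, 90, 10, 300, 120),
  Deaths     = c(12, 40, 3, 75, 39)
)

# Keep only the numeric columns, since cor() requires numeric input.
d1 <- d[sapply(d, is.numeric)]

# Pairwise Pearson correlations between every pair of columns.
corr_matrix <- cor(d1)
print(round(corr_matrix, 3))

# The single coefficient of interest can also be computed directly.
cases_deaths <- cor(d1$TotalCases, d1$Deaths)
print(cases_deaths)
```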

Here the correlation between Total cases and Deaths comes out as 0.9129154.

Using the ‘ggplot’ function I drew a scatter plot of Total cases against Deaths due to covid and
saved it in variable ‘a’; executing a displays the scatter plot showing the relation between the two.
Using geom_smooth I added a trendline over the scatter plot to give a clearer picture of the
correlation.
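
The scatter plot with a trendline might look like the sketch below. The data are invented for illustration, the column names are assumptions, and a straight-line fit (method = "lm") is assumed because the report does not say which smoothing method was passed to geom_smooth.

```r
library(ggplot2)

# Made-up numeric data standing in for d1; not the Kaggle values.
d1 <- data.frame(
  TotalCases = c(1000, 2500, 400, 5200, 3100),
  Deaths     = c(12, 40, 3, 75, 39)
)

# Scatter plot saved in 'a', with a fitted trendline layered on top.
a <- ggplot(d1, aes(x = TotalCases, y = Deaths)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # assumed linear fit
  labs(x = "Total cases", y = "Deaths")

# Printing the object draws the plot.
print(a)
```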

SCATTER PLOT

The scatter plot shows a strong positive correlation between Total cases and Deaths occurring due to
covid: as total cases increase, the death count increases as well. This agrees with the correlation
of 0.9129154 printed earlier, which is very close to +1.

Using the ‘ggcorrplot’ function I displayed the correlation matrix to see the correlation between
the variables.

Here the correlation between the different variables is shown through colour coding. A value of 1.0,
a perfect positive correlation, is shown as a dark red block, and a value of -1.0, a perfect negative
correlation, as a dark blue block. In between, a weak positive correlation is light red and a weak
negative correlation is light blue. This is how the matrix shows the correlation between all
variables.
To find the correlation between Total cases and Deaths due to covid, I read Total cases on the x-axis
and Deaths on the y-axis, which leads to the block marked in the graph. The block is dark red, so
the graph indicates a strong positive correlation between Total cases and Deaths: as total cases
increase, the death count also increases.
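
A correlation matrix plot of this kind can be sketched with the ggcorrplot function as shown below. The data frame is invented and its column names are assumptions. The lab = TRUE argument, which prints the coefficient inside each block, is an addition for readability and may not match the original call.

```r
library(ggcorrplot)

# Made-up numeric columns standing in for d1; not the Kaggle values.
d1 <- data.frame(
  TotalCases = c(1000, 2500, 400, 5200, 3100),
  Active     = c(50, 90, 10, 300, 120),
  Deaths     = c(12, 40, 3, 75, 39)
)

# Correlation matrix of all numeric columns.
corr_matrix <- cor(d1)

# Colour-coded matrix; the default palette runs from blue (-1) to red (+1).
p <- ggcorrplot(corr_matrix, lab = TRUE)
print(p)
```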

STEP IV: Performing Ranking (Horizontal Bar Chart) on covid 19 dataset.

Finding State/UT wise death ratio.

I used ggplot to create the bar chart and coord_flip to make it a horizontal bar chart. I then used
the reorder function inside the aesthetic mapping to arrange the bars in descending order, and
filled each bar with a colour according to its State/UT.
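
The ranking chart described above could be built roughly as follows. The states and ratios are placeholders, the column names are assumptions, and geom_col is assumed as the bar geometry since the report does not name it.

```r
library(ggplot2)

# Placeholder data; the names and ratios are not from the Kaggle dataset.
d <- data.frame(
  State      = c("A", "B", "C", "D"),
  DeathRatio = c(1.2, 0.4, 2.1, 0.9)
)

# reorder() sorts the states by death ratio; after coord_flip() the
# largest value ends up at the top, giving a descending ranking.
b <- ggplot(d, aes(x = reorder(State, DeathRatio), y = DeathRatio, fill = State)) +
  geom_col() +
  coord_flip() +
  labs(x = "State/UT", y = "Death ratio") +
  theme(legend.position = "none")   # the fill legend repeats the axis labels

print(b)
```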
The graph shows the Death Ratio for each State/UT. The highest death ratio during covid is in
Punjab, at 2.75, and the lowest is in Dadra and Nagar Haveli and Daman and Diu, at 0.04.

STEP V: For Distribution and Composition I worked on a different dataset, the Lung Capacity dataset.
The data is in csv format, so I first imported the dataset into the RStudio environment using the
read.csv function. The imported data is saved in the variable ‘d’ so that I can access it by that
name. The data is then attached to the R environment so that the columns of the dataset can be
referred to directly by name.

STEP VI: View the data in RStudio to check that all of it was imported correctly into the R
environment. Executing the View function gives the overview of the Lung Capacity data that was
imported into RStudio.

STEP VII: Performing Distribution (Histogram and Boxplot) on Lung capacity dataset.

Plotting Histogram for Lung Capacity.


I used ggplot to create the histogram and added the mean and median values as vertical lines using
geom_vline. I gave the mean and median lines different colours using scale_color_manual.
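
One way to draw such a histogram is sketched below. The lung capacity values are simulated, the column name LungCap is an assumption, and the bin count and line colours are arbitrary choices for the example.

```r
library(ggplot2)

# Simulated values standing in for the Lung Capacity column.
set.seed(1)
lung <- data.frame(LungCap = rnorm(200, mean = 8, sd = 2.5))

# One row per reference line, so the colour can be mapped to a legend.
stats <- data.frame(
  Statistic = c("Mean", "Median"),
  Value     = c(mean(lung$LungCap), median(lung$LungCap))
)

h <- ggplot(lung, aes(x = LungCap)) +
  geom_histogram(bins = 30, fill = "grey80", colour = "white") +
  geom_vline(data = stats, aes(xintercept = Value, colour = Statistic)) +
  scale_color_manual(values = c(Mean = "red", Median = "blue")) +
  labs(x = "Lung capacity", y = "Count")

print(h)
```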
From this graph, the mean Lung Capacity is nearly 7.9 and the median Lung Capacity is 8. The two
values are close together, which suggests a roughly symmetric distribution centred at a Lung
Capacity of about 8.

STEP VIII: Performing Distribution (Histogram and Boxplot) on Lung capacity dataset.

Finding Gender wise Lung Capacity using Box Plot.

To plot the box plot I used ggplot.


In this graph the first black arrow shows quartile 1, the middle arrow shows quartile 2 (the
median), and the last one shows quartile 3. The orange arrow at the top shows the maximum value and
the orange arrow at the bottom shows the minimum value.

From this graph, 25% of females have a Lung Capacity of 5.7 or less, 50% have a Lung Capacity of 7.7
or less, and 75% have a Lung Capacity of 9.2 or less. The maximum Lung Capacity among females is 13
and the minimum is 0.5. So the typical (median) Lung Capacity of females is 7.7.

Similarly, 25% of males have a Lung Capacity of 6.5 or less, 50% have a Lung Capacity of 8.3 or
less, and 75% have a Lung Capacity of 10.3 or less. The maximum Lung Capacity among males is 14.7
and the minimum is 1.1. So the typical (median) Lung Capacity of males is 8.3.

The box plot shows that males have a higher Lung Capacity than females at every quartile.
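
The quartiles read off the box plot can also be computed directly, as in the sketch below. The data are simulated and the column names Gender and LungCap are assumptions, so the printed quartiles will not match the figures quoted above.

```r
library(ggplot2)

# Simulated lung capacities for two groups; not the real dataset.
set.seed(42)
lung <- data.frame(
  Gender  = rep(c("female", "male"), each = 100),
  LungCap = c(rnorm(100, mean = 7.5, sd = 2.5),
              rnorm(100, mean = 8.5, sd = 2.5))
)

# Quartiles per group: rows are 25%, 50%, 75%; columns are the groups.
quartiles <- sapply(split(lung$LungCap, lung$Gender),
                    quantile, probs = c(0.25, 0.5, 0.75))
print(round(quartiles, 2))

# The box plot draws the same three quartiles as the box and its midline.
bp <- ggplot(lung, aes(x = Gender, y = LungCap, fill = Gender)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Lung capacity")

print(bp)
```

Each quartile is a cut-off: 25% of the group lies at or below the first quartile, 50% at or below the median, and 75% at or below the third quartile.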

STEP IX: Finding Smoking habit wise Lung Capacity using Box Plot.

To plot the box plot I used ggplot.


From this graph, 25% of people who don’t smoke have a Lung Capacity of 6 or less, 50% have a Lung
Capacity of 7.9 or less, and 75% have a Lung Capacity of 9.7 or less. So the typical (median) Lung
Capacity of people who don’t smoke is 7.9.

Similarly, 25% of people who smoke have a Lung Capacity of 7.3 or less, 50% have a Lung Capacity of
8.6 or less, and 75% have a Lung Capacity of 10.03 or less. So the typical (median) Lung Capacity of
people who smoke is 8.6.

STEP X: Performing Composition (Pie chart) on Lung Capacity dataset.


Finding percentage of people according to their birth type using Pie chart (donut chart).

First, I installed the lessR package, which I used to create the pie chart, and loaded it with the
library function. Then I executed the PieChart function from lessR.
From this graph, 77% of people were born by normal delivery, whereas 23% were born by Caesarean.
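
The shares shown in the chart can be cross-checked with base R, as in the sketch below. It does not reproduce the lessR chart itself. The column name Caesarean and the eight sample rows are assumptions made for the example, so the percentages differ from the 77% and 23% reported above.

```r
# Made-up birth-type column; "yes" marks a Caesarean birth.
lung <- data.frame(
  Caesarean = c("no", "no", "yes", "no", "yes", "no", "no", "no")
)

# Count each category, then convert the counts to percentages.
counts  <- table(lung$Caesarean)
percent <- round(100 * prop.table(counts), 1)

print(counts)
print(percent)
```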

REFERENCE

i. https://www.kaggle.com/anandhuh/latest-covid19-india-statewise-data
ii. https://www.kaggle.com/radhakrishna4/lung-capacity
