
Descriptive Analytics on “iris”, “mtcars” data sets

Useful Statistical Measures for Descriptive Analytics:


1. Count: Used to count the number of observations, either overall or within each level of a
grouping variable. It is mostly used for categorical columns. E.g., Species is a variable in the
“iris” data set that is built into R. It is a grouping variable because the column contains
three species.
The following methods can be used to describe the Species variable:
• Frequency table of Species
• Comparison of the different species
• Bar or pie chart of the variable

# Usage of count for descriptive analytics

library(dplyr)

sp = iris %>% count(Species) # count() is from the dplyr package


  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50
“sp” is a data frame; it can be written to a CSV file in the working directory with the following syntax:

write.csv(sp,"sp.csv")

Bar plot for the variable Species in iris data set:

counts=table(iris$Species)
barplot(counts, main="Species Distribution",
xlab="Species", ylab="Number of observations")
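
The pie chart mentioned in the list above can be produced from the same frequency table. A minimal sketch using base R graphics only; the variable names `shares` and `slice_labels` are introduced here for illustration:

```r
# Frequency table of Species and the share of each species in percent
counts <- table(iris$Species)
shares <- round(100 * prop.table(counts), 1)

# Label each slice with the species name and its percentage
slice_labels <- paste0(names(counts), " (", shares, "%)")
pie(counts, labels = slice_labels, main = "Species Distribution")
```

Because every species has the same number of observations, the three slices are equal; a pie chart becomes more informative when group sizes differ.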

2. Mean: Used to summarise a continuous variable and to compare two or more variables
measured on the same scale. E.g., the means of "Sepal.Length", "Sepal.Width", "Petal.Length"
and "Petal.Width" can be compared for further analysis.
3. Median: Used to summarise ordinal data and also to compare two or more variables
measured on the same scale.
4. Mode: The most frequently occurring value; used for categorical, specifically nominal,
variables.
5. Range: The simplest way to describe the spread of a distribution, using the minimum and
maximum values of the variable.
6. Quartiles and other positional measures: These are positional values of the ordered
observations. Generally, for n ordered observations, the k-th quartile is the value at position
Qk = k(n+1)/4; the k-th decile is at position Dk = k(n+1)/10 and the k-th percentile is at
position Pk = k(n+1)/100.
7. Standard deviation: One of the most widely used statistical measures of how far the values of
a variable deviate from their mean, which makes it a good measure of the variation among all
values. It is used for numerical variables, which are suitable for arithmetic operations.
Formula for standard deviation: $\text{Std} = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$
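8. Skewness: Skewness is a measure of symmetry or, more precisely, the lack of symmetry.
A distribution, or data set, is symmetric if it looks the same to the left and right of the centre
point.
Examples of skewness: a distribution with a longer right tail is positively (right) skewed, and
one with a longer left tail is negatively (left) skewed.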

9. Kurtosis: Kurtosis is a parameter that describes the shape of a random variable’s probability
distribution. For a normal distribution the kurtosis equals 3, so the excess kurtosis (kurtosis
minus 3) is zero. A positive excess kurtosis indicates a higher peak with heavier tails, and a
negative value indicates a flatter distribution.
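
Mode and range have no dedicated example in the code that follows, so here is a minimal sketch. Base R has no built-in function for the statistical mode (`mode()` returns the storage type of an object), so the helper `stat_mode` below is a hypothetical name introduced for illustration, and the `cyl` column of “mtcars” is assumed to be a suitable nominal variable.

```r
# Hypothetical helper: most frequent value of a vector.
# If several values tie, the first one in table order is returned.
stat_mode <- function(x) {
  freq <- table(x)
  names(freq)[which.max(freq)]
}

# Mode of the number of cylinders in mtcars (returned as a character string)
cyl_mode <- stat_mode(mtcars$cyl)

# Range of Sepal.Length: minimum and maximum, and the width of that interval
sl_range <- range(iris$Sepal.Length)
sl_width <- diff(sl_range)

cyl_mode
sl_range
sl_width
```

Applying `stat_mode` to `iris$Species` would be less informative, because all three species occur equally often and the helper simply returns the first one.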

# Mean of each numeric variable

mean(iris$Sepal.Length)

mean(iris$Sepal.Width)

mean(iris$Petal.Length)

mean(iris$Petal.Width)
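
The four means can also be computed in a single call, which makes them easier to compare. A sketch using base R only; `iris_means` and `means_by_species` are names introduced here:

```r
# Means of the four numeric columns as one named vector
iris_means <- colMeans(iris[, 1:4])
iris_means

# Mean of every numeric variable within each species
means_by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
means_by_species
```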

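# Minimum, quartiles, median, mean and maximum of single variables
summary(iris$Sepal.Length)

summary(iris$Sepal.Width)

summary(iris$Petal.Length)

summary(iris) # summary of every column in the data frame

irisn=iris[,-5] # drop the Species column, keeping the four numeric variables

irist=summary(irisn)

irisst=as.data.frame(irist) # long-format version of the summary table

irisst

write.csv(irist,"irist.csv")
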
# Standard deviation and variance of Sepal.Length
sd(irisn$Sepal.Length)

var(irisn$Sepal.Length)
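
Note that `sd()` and `var()` in R divide by n - 1 (the sample versions), whereas the formula given earlier divides by n. A short sketch of the difference, assuming the population form is the one wanted; `pop_sd` and `sample_sd` are names introduced here:

```r
x <- iris$Sepal.Length
n <- length(x)

# Population standard deviation: divide the sum of squared deviations by n
pop_sd <- sqrt(sum((x - mean(x))^2) / n)

# Sample standard deviation as returned by sd(): divides by n - 1
sample_sd <- sd(x)

# The two differ only by the factor sqrt((n - 1) / n)
c(population = pop_sd, sample = sample_sd)
```

With 150 observations the two values are very close; the distinction matters more for small samples.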

boxplot(irisn)

library(ggplot2)

ggplot(iris, aes(y = Sepal.Length, x = Species, color = Species)) +
  geom_boxplot() +
  theme_classic()

quantile(iris$Sepal.Length)

quantile(irisn$Sepal.Length,c(0.30,0.45))
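
The (n+1)/4 position rule for the first quartile can be checked by hand. By default `quantile()` uses a different interpolation rule (`type = 7`); according to the R documentation, `type = 6` is the variant that places the p-th quantile at position p(n+1). The sketch below computes the first quartile of Sepal.Length both ways; `pos`, `lower`, `q1_manual` and `q1_type6` are names introduced here:

```r
x <- sort(iris$Sepal.Length)
n <- length(x)

# Position of the first quartile among the ordered observations: 1 * (n + 1) / 4
pos <- 1 * (n + 1) / 4
lower <- floor(pos)

# Interpolate linearly between the two neighbouring observations
q1_manual <- x[lower] + (pos - lower) * (x[lower + 1] - x[lower])

# Same quartile from quantile() with the p * (n + 1) position rule
q1_type6 <- unname(quantile(x, 0.25, type = 6))

c(manual = q1_manual, type6 = q1_type6)
```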

library(e1071)

skewness(iris$Sepal.Length) # skewness() is from the e1071 package

kurtosis(iris$Sepal.Length) # e1071 reports excess kurtosis (about 0 for normal data)
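
For readers without e1071 installed, the moment-based definitions can be written in base R. This is a sketch of the textbook formulas, skewness m3 / m2^(3/2) and excess kurtosis m4 / m2^2 - 3, which should correspond to `type = 1` in e1071; the package default is `type = 3`, so its results differ slightly. The function names `central_moment`, `moment_skewness` and `excess_kurtosis` are hypothetical:

```r
# k-th central moment of x (divides by n)
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness: 0 for perfectly symmetric data
moment_skewness <- function(x) {
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

# Excess kurtosis: kurtosis minus 3, so a normal distribution gives about 0
excess_kurtosis <- function(x) {
  central_moment(x, 4) / central_moment(x, 2)^2 - 3
}

moment_skewness(iris$Sepal.Length)
excess_kurtosis(iris$Sepal.Length)
```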
