(CSF 21103)
Course : SMSKKI
Contents
i. Introduction
a. Raw Data
d. i) Histogram
Introduction
Our group studied data on paper citations versus H-index. The data were taken from the
Kaggle website. The H-index is a measure of a researcher's productivity and impact:
the higher the H-index, the more productive and influential the researcher is. Citations are
another way of measuring a researcher's impact, since each citation means that another
researcher has referred to the work. This dataset can be used to compare the
productivity and impact of computer science researchers.
a. Present your raw data in a table.
Total data = 100
Number of parameters (quantitative) = 1
Parameter = H-Index
Figure 2 : Datasets
The data are sorted in ascending order so that they can be read accurately.
Figure 3 : Frequency Table
The frequency table lists the class limits, class boundaries, frequency, cumulative
frequency, midpoint and fx (frequency × midpoint) for each class. These columns are reused
in the later calculations.
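The columns described above can be sketched in Python as follows. This is only an illustration: the sample values, the starting class limit, the class width and the number of classes are hypothetical, not the report's actual H-index data or classes.

```python
def frequency_table(data, lower, width, n_classes):
    """Build grouped-frequency rows for integer data.

    lower, width and n_classes are illustrative choices, not the report's.
    """
    rows = []
    cumulative = 0
    for k in range(n_classes):
        lo = lower + k * width      # lower class limit
        hi = lo + width - 1         # upper class limit (integer data)
        f = sum(1 for x in data if lo <= x <= hi)
        cumulative += f
        midpoint = (lo + hi) / 2
        rows.append({
            "limits": (lo, hi),
            "boundaries": (lo - 0.5, hi + 0.5),
            "f": f,
            "cf": cumulative,
            "midpoint": midpoint,
            "fx": f * midpoint,
        })
    return rows


# Hypothetical sample, not the report's data.
sample = [52, 55, 61, 64, 67, 70, 71, 73, 78, 84]
table = frequency_table(sample, lower=50, width=10, n_classes=4)
for row in table:
    print(row)
```

The last cumulative frequency should always equal the number of data values, which is a quick check that no value fell outside the classes.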
ii) Sketch a frequency polygon for the frequency table. Is the shape of the distribution of
times uniform, skewed or bell shaped?
Figure 4 : Datasets
Figure 5 : Frequency and Midpoint Table
The distribution shown by the frequency polygon is bell-shaped: the highest
frequency lies in the middle classes.
iii) Determine the mean for the frequency table and find the actual mean.
The mean for the frequency table is 73, obtained by dividing the sum of fx (frequency ×
midpoint) by the total frequency (100). The actual mean is 72.42, obtained by dividing the
sum of the raw data (7242) by the total frequency (100).
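The two calculations can be sketched in Python. Only the raw total (7242) and the count (100) come from the report; the midpoints, frequencies and raw values in the second half are hypothetical and serve only to show why the two means can differ.

```python
def grouped_mean(midpoints, frequencies):
    """Mean estimated from a frequency table: sum(f * x) / sum(f)."""
    total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    return total_fx / sum(frequencies)


def actual_mean(values):
    """Mean of the raw (ungrouped) values."""
    return sum(values) / len(values)


# Totals reported in the text: raw sum 7242 over 100 values.
print(7242 / 100)

# Hypothetical small example, not the report's data.
raw = [52, 55, 61, 64, 67, 70, 71, 73, 78, 84]
midpoints = [54.5, 64.5, 74.5, 84.5]
frequencies = [2, 3, 4, 1]
print(grouped_mean(midpoints, frequencies))  # uses class midpoints
print(actual_mean(raw))                      # uses every observation
```

In the hypothetical example the grouped mean is 68.5 and the actual mean is 67.5, because each value is replaced by its class midpoint before averaging.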
iv) How does the grouped mean compare to the actual mean?
The grouped mean is 73 and the actual mean is 72.42, so the two differ by only 0.58.
The mean calculated from the ungrouped data is more accurate because its
calculation takes every observation into account, whereas the grouped mean replaces each
value with its class midpoint.
v) Next, create a histogram and an ogive.
vi) Then create an ordered stem plot for 15 data.
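As a rough sketch of how an ordered stem plot can be built, the Python below splits each value into a tens digit (stem) and a units digit (leaf). The 15 values are hypothetical and are not the ones used in the report.

```python
from collections import defaultdict


def stem_plot(values):
    """Return the lines of an ordered stem-and-leaf plot (stem = tens digit)."""
    stems = defaultdict(list)
    for v in sorted(values):            # sorting gives ordered leaves
        stems[v // 10].append(v % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Hypothetical 15 values, not the report's data.
values = [84, 52, 71, 93, 61, 70, 55, 86, 73, 64, 58, 75, 81, 67, 78]
plot = stem_plot(values)
for line in plot:
    print(line)
```

Stems with no leaves are still printed so that gaps in the data remain visible.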
b.vii.
Figure 11 : Dataset
Mean
The formula finds the mean of the dataset by summing all the values and dividing by 100,
the total number of data values.
Standard Deviation
STDEV.P is the Excel function for the population standard deviation. It is used here because
the calculation covers the entire dataset.
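For readers working outside Excel, the same two quantities can be sketched with Python's standard library, where `statistics.pstdev` is the population counterpart of STDEV.P. The values below are hypothetical and chosen only so the arithmetic is easy to check by hand.

```python
import statistics

# Hypothetical values, not the report's data.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(values)        # sum of values / count
pop_sd = statistics.pstdev(values)     # divides by n, like STDEV.P
sample_sd = statistics.stdev(values)   # divides by n - 1, for comparison

print(mean, pop_sd, sample_sd)
```

The population form divides the squared deviations by n, so it is always slightly smaller than the sample form for the same data.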
i) Histogram
Figure 16 : Histogram
The data appear to be approximately normally distributed: the histogram has a bell-curve
shape that is roughly symmetric about the mean, with values near the mean occurring more
frequently than values far from the mean.
ii) PI of skewness
Figure 17 : Dataset
The PI of skewness is calculated with Excel's SKEW.P function, which returns the
skewness of a population.
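A minimal Python sketch of the moment-based population skewness, which is what SKEW.P computes, is given below. If "PI" is read as Pearson's index of skewness, which is an assumption since the report does not define the abbreviation, that index is a different quantity, 3(mean − median) / standard deviation, and is included for comparison. The sample values are hypothetical.

```python
import statistics


def population_skewness(values):
    """Moment-based skewness: mean of ((x - mu) / sigma) ** 3."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)


def pearson_index(values):
    """Pearson's index of skewness, using the sample standard deviation
    (assumed textbook form; the report does not state which it uses)."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)


# Hypothetical values, not the report's data.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(population_skewness(symmetric), pearson_index(symmetric))
print(population_skewness(right_skewed), pearson_index(right_skewed))
```

Both measures are zero for perfectly symmetric data and positive when the long tail is on the right, but their magnitudes are not interchangeable.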
Figure 19 : PI of Skewness Output
iii) Outliers
To find the outliers, we first need the first quartile (Q1), the third quartile (Q3) and
the interquartile range (IQR). From these we can calculate the lower bound and upper bound
of the range.
Figure 21 : Q1
Figure 22 : Q3
Figure 23 : IQR
To find the interquartile range, the formula is a simple subtraction, Q3 - Q1. Cells F2 and
E2 hold the Q3 and Q1 values respectively.
The formula used in Excel is based on “Q1 - 1.5 (IQR)”. The lower
bound is the lowest value of the range.
Figure 25 : Upper Bound
The formula used in Excel is based on “Q3 + 1.5 (IQR)”. The upper
bound is the highest value of the range.
Figure 26 : Output
The range runs from the lower bound to the upper bound. No values fall outside it, so there
are no outliers in the dataset.
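The quartile, IQR and bound steps can be sketched in Python as follows. The sketch assumes the inclusive quartile method, which interpolates in the same way as Excel's QUARTILE.INC; the report does not say which quartile function it used, and the sample values are hypothetical.

```python
import statistics


def iqr_bounds(values):
    """Return (lower bound, upper bound) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    # method="inclusive" is an assumption about the quartile convention.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Values lying outside the IQR bounds."""
    lower, upper = iqr_bounds(values)
    return [x for x in values if x < lower or x > upper]


# Hypothetical values, not the report's data.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = clean + [100]
print(iqr_bounds(clean), find_outliers(clean))
print(iqr_bounds(with_extreme), find_outliers(with_extreme))
```

An empty list from `find_outliers` corresponds to the report's finding that no value lies outside the bounds.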
Figure 28 : Outliers Table Output
e. Find the value that separates the bottom 10% of the data.
f. Find the value that lies in the top 80% of the data.
g. Find the values so that 50% of the middle area is bounded by them.
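One way to approach questions e to g, assuming the data are modelled as normally distributed, is to invert the normal cumulative distribution. In the sketch below the mean 72.42 comes from the report, but the standard deviation of 10.0 is a hypothetical placeholder because the report's value is not reproduced in the text. Reading "top 80%" as the value that 80% of the data lie above (the 20th percentile) is also an assumption.

```python
from statistics import NormalDist

MEAN = 72.42    # actual mean reported in the text
SIGMA = 10.0    # hypothetical placeholder, not the report's value

model = NormalDist(MEAN, SIGMA)

bottom_10 = model.inv_cdf(0.10)    # e. 10% of values lie below this
top_80 = model.inv_cdf(0.20)       # f. 80% of values lie above this (assumed reading)
middle_low = model.inv_cdf(0.25)   # g. lower edge of the middle 50%
middle_high = model.inv_cdf(0.75)  # g. upper edge of the middle 50%

print(f"bottom 10% cutoff: {bottom_10:.2f}")
print(f"top 80% cutoff:    {top_80:.2f}")
print(f"middle 50%:        {middle_low:.2f} to {middle_high:.2f}")
```

The two values bounding the middle 50% are symmetric about the mean, which is a simple check on the result.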
h. Compute a 95% confidence interval for the mean of selected data. Then, conclude
your data based on confidence interval and mean’s result.
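A 95% confidence interval for the mean can be sketched with the z-based formula, mean ± z × σ / √n. The mean (72.42) and sample size (100) come from the report; the standard deviation of 10.0 is a hypothetical placeholder, and using z rather than t is an assumption that is usually considered reasonable for n = 100.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(mean, sigma, n, level=0.95):
    """z-based confidence interval for a mean: mean +/- z * sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin


MEAN = 72.42    # actual mean reported in the text
N = 100         # number of data values reported in the text
SIGMA = 10.0    # hypothetical placeholder, not the report's value

low, high = mean_confidence_interval(MEAN, SIGMA, N)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

If the interval contains a hypothesised mean, that value is consistent with the data at the 95% level; a larger standard deviation or a smaller sample would widen the interval.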