PROBABILITY AND STATISTICAL DATA ANALYSIS

(CSF 21103)

ASSIGNMENT REPORT: PAPERS CITATION VS H-INDEX


Lecturer Name: Dr. Nurnadiah Zamri

Course: SMSKKI

Group Members

    Name                                        Matric No.
 1. MUHAMMAD ADNAN SAIDIE BIN ABDULLAH          066726
 2. MOHAMMAD HISYAM BIN SHARIPUDDIN             067896
 3. AWANG HILMI BIN AWANG MOHDAR                067953
 4. EZZAH NAJIHAH BINTI ZAIDON @ MOHAMAD        066987
 5. DHENESWARI A/P VASANTHAN                    068022


TABLE OF CONTENTS

i. Introduction

ii. Calculation of data using Technology

a. Raw Data

b. i) Frequency Table

ii) Frequency Polygon

iii) Mean and Actual Mean

iv) Comparison between grouped mean and actual mean

v) Histogram and Ogive

vi) Stem plot

c. Mean and Standard Deviation

d. i) Histogram

ii) PI of Skewness

iii) Outliers

e. Value that separates the bottom 10% of the data

f. Value that lies in the top 80% of the data

g. Values so that 50% of the middle area is bounded by them

h. Conclusion of Confidence Interval and Mean’s Result

iii. Source of Data - www.kaggle.com

Introduction

Our group studied a dataset on Paper Citations VS H-index, taken from the Kaggle website.
The H-index is a measure of a researcher's productivity and impact: the higher the H-index,
the more productive and influential the researcher is. Citations are another way of measuring
a researcher's impact: the more citations a researcher has, the more often other researchers
have referred to their work. This dataset can be used to compare the productivity and impact
of computer science researchers.

a. Present your raw data in a table. Total data = 100; number of parameters
(quantitative) = 1.

Figure 1 : 100 Raw Data

b. Select one parameter and state the name of that parameter.

i) Create a frequency table with 6 categories

Parameter = H-Index

Figure 2 : Datasets

The data is sorted in ascending order so that it can be read and grouped into classes accurately.

Figure 3 : Frequency Table

The frequency table consists of class limits, class boundaries, frequency, cumulative
frequency, midpoint and fx (frequency x midpoint). These columns are reused in the later
calculations (grouped mean, frequency polygon, histogram and ogive).
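The columns of such a table can also be produced outside a spreadsheet. The following is a minimal Python sketch, assuming whole-number data and six equal-width classes. The function name `frequency_table`, the class-width rule and the sample values are illustrative assumptions, not the report's actual data or method.

```python
def frequency_table(data, num_classes=6):
    """Build a grouped frequency table for whole-number data."""
    lowest, highest = min(data), max(data)
    # Integer division plus one guarantees that num_classes classes of this
    # width cover every value from lowest to highest.
    width = (highest - lowest) // num_classes + 1
    rows, cumulative = [], 0
    for k in range(num_classes):
        lower = lowest + k * width          # lower class limit
        upper = lower + width - 1           # upper class limit
        freq = sum(lower <= x <= upper for x in data)
        cumulative += freq
        midpoint = (lower + upper) / 2
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "frequency": freq,
            "cumulative": cumulative,
            "midpoint": midpoint,
            "fx": freq * midpoint,
        })
    return rows


# Hypothetical values for illustration only, not the report's data.
sample = [12, 25, 31, 47, 58, 60, 72, 75, 88, 93, 101, 119]
table = frequency_table(sample)
for row in table:
    print(row)
```

The cumulative frequency of the last class should always equal the number of observations, which is a quick check that no value was left out of the table.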

ii) Sketch a frequency polygon for the frequency table. Is the shape of the distribution of
times uniform, skewed or bell shaped?

Figure 4 : Datasets

Figure 5 : Frequency and Midpoint Table

Figure 6 : Frequency Polygon Graph

The distribution shown by the frequency polygon is bell-shaped: the highest frequency is in
the middle classes and the frequencies fall away on both sides.

iii) Determine the mean for the frequency table and find the actual mean.

Figure 7 : Frequency Table and Mean

The grouped mean from the frequency table is 73, obtained by dividing the sum of the
frequency x midpoint products by the total frequency (100). The actual mean is 72.42,
obtained by dividing the sum of the raw data (7242) by the total frequency (100).

iv) How does the grouped mean compare to the actual mean?

The grouped mean is 73 while the actual mean is 72.42, so the two are close, differing by
only 0.58. The mean calculated from the ungrouped data is more accurate because it uses every
observation directly, whereas the grouped mean represents each observation by its class midpoint.
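The two calculations can be sketched in Python as follows. The function names and the small three-class table are hypothetical and only illustrate the formulas; the last lines reproduce the arithmetic reported above (7242 / 100 for the actual mean, and a sum of fx of 7300, which is what a grouped mean of 73 over 100 observations implies).

```python
from math import isclose


def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum of (frequency x midpoint) over total frequency."""
    total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    return total_fx / sum(frequencies)


def actual_mean(values):
    """Mean of the raw (ungrouped) observations."""
    return sum(values) / len(values)


# Hypothetical three-class table, for illustration only.
example_midpoints = [20.5, 38.5, 56.5]
example_frequencies = [2, 5, 3]
example_grouped = grouped_mean(example_midpoints, example_frequencies)

# Arithmetic reported in the text: raw sum 7242 and total frequency 100.
reported_actual = 7242 / 100
# A grouped mean of 73 with 100 observations implies a sum of fx of 7300.
reported_grouped = 7300 / 100
difference = reported_grouped - reported_actual
print(reported_actual, reported_grouped, round(difference, 2))
```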

v) Next, create a histogram and an ogive.

vi) Then create an ordered stem plot for 15 data.
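An ordered stem plot can be generated with a few lines of Python. This is a minimal sketch assuming non-negative whole numbers, with the tens (and above) as the stem and the units digit as the leaf. The function name `stem_plot` and the six sample values are hypothetical; they are not the 15 values used in the report.

```python
from collections import defaultdict


def stem_plot(values):
    """Return the lines of an ordered stem plot (stem = value // 10, leaf = value % 10)."""
    stems = defaultdict(list)
    for v in sorted(values):                 # sorting makes the leaves ordered
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    lines = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


# Hypothetical values for illustration only.
for line in stem_plot([23, 21, 35, 47, 41, 29]):
    print(line)
```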

b.vii.

c. Compute mean and standard deviation for this data.

Figure 11 : Dataset

This dataset is used to compute the mean and standard deviation.

Mean

Figure 12 : Mean Formula

The formula finds the mean of the dataset by summing all the data values and dividing by 100,
the total number of data values.

Figure 13 : Population Mean

The output from the mean formula is 72.42.

Standard Deviation

Figure 14 : Population Formula

STDEV.P is the Excel function that computes the standard deviation of the entire dataset,
treating it as a population.

Figure 15 : Population Standard Deviation

The standard deviation of the dataset is 32.92376647.
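For readers without a spreadsheet, the same two statistics can be sketched with Python's standard `statistics` module. `pstdev` divides by n and so plays the role of STDEV.P, while `stdev` divides by n - 1 and corresponds to a sample standard deviation. The eight values below are hypothetical and chosen only because their results are easy to check by hand.

```python
from math import isclose
from statistics import fmean, pstdev, stdev

# Hypothetical values for illustration only, not the report's data.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = fmean(values)        # sum of the values divided by their count
population_sd = pstdev(values)    # divides by n, like STDEV.P
sample_sd = stdev(values)         # divides by n - 1, for comparison

print(mean_value, population_sd, sample_sd)
```

The population version is always slightly smaller than the sample version for the same data, so it matters which one is reported.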

d. Determine if the data are approximately normally distributed using

i) Histogram

Figure 16 : Histogram

The data is approximately normally distributed: the histogram shows a bell-curve shape that is
roughly symmetric about the mean, so values near the mean occur more frequently than values
far from the mean.

ii) PI of skewness

Figure 17 : Dataset

The dataset is used for finding the PI of skewness and as a reference.

Figure 18 : PI Skewness Formula

The PI of skewness is computed here with Excel's SKEW.P function, which returns the skewness
of the data treated as a population.
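Two different skewness measures are easy to confuse, so a small Python sketch of both may help. The first function is the moment-based population skewness, which is the quantity SKEW.P is documented to return. The second assumes that "PI" stands for Pearson's index of skewness, 3(mean - median) / s, which is a separate measure and will generally give a different number. Both function names and the sample values are hypothetical.

```python
from statistics import fmean, median, pstdev, stdev


def population_skewness(values):
    """Moment-based skewness: average cubed z-score, using the population SD."""
    mu = fmean(values)
    sigma = pstdev(values)
    return sum(((x - mu) / sigma) ** 3 for x in values) / len(values)


def pearson_index(values):
    """Pearson's index of skewness: 3 * (mean - median) / sample SD."""
    return 3 * (fmean(values) - median(values)) / stdev(values)


# Hypothetical values for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]

print(population_skewness(symmetric), pearson_index(symmetric))
print(population_skewness(right_skewed), pearson_index(right_skewed))
```

A common textbook rule of thumb treats a Pearson index at or beyond +1 or -1 as significantly skewed; values near zero from either measure are consistent with a roughly symmetric distribution.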

Figure 19 : PI of Skewness Output

iii) Outliers

Figure 20 : Quartile, Interquartile and Bound Table

To find the outliers, we first need the values shown above: quartile 1 (Q1), quartile 3 (Q3)
and the interquartile range (IQR). From these we obtain the lower bound and upper bound of
the acceptable range.

Figure 21 : Q1

Q1 is obtained with QUARTILE(data, 1), where data is the range of cells holding the dataset.

Figure 22 : Q3

Q3 is obtained with QUARTILE(data, 3), where data is the range of cells holding the dataset.

Figure 23 : IQR

The interquartile range is found by subtraction alone: IQR = Q3 - Q1. In the spreadsheet,
F2 and E2 are the cells holding Q3 and Q1.

Figure 24 : Lower Bound

The Excel formula is based on “Q1 - 1.5 (IQR)”. The lower bound is the smallest value that
is not treated as an outlier.

Figure 25 : Upper Bound

The Excel formula is based on “Q3 + 1.5 (IQR)”. The upper bound is the largest value that
is not treated as an outlier.

Figure 26 : Output

This figure shows the output for all the required values: Q1, Q3, IQR, lower bound and upper bound.

Figure 27 : Range & Outliers Statement

The acceptable range runs from the lower bound to the upper bound. No value in the dataset
falls outside this range, so there are no outliers.
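The whole outlier check described above can be sketched in Python. `statistics.quantiles` with `method="inclusive"` interpolates between data points in the same way as Excel's QUARTILE function, so it is used here as a stand-in. The function names and the sample values are hypothetical and do not come from the report's dataset.

```python
from statistics import quantiles


def outlier_fences(values):
    """Return (lower bound, upper bound) using the 1.5 x IQR rule."""
    # n=4 yields the three quartile cut points Q1, Q2 (median) and Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """List every value that falls outside the lower and upper bounds."""
    lower, upper = outlier_fences(values)
    return [x for x in values if x < lower or x > upper]


# Hypothetical values for illustration only.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_extreme = clean + [40]

print(outlier_fences(clean), find_outliers(clean))
print(outlier_fences(with_extreme), find_outliers(with_extreme))
```

An empty list from `find_outliers` corresponds to the "no outliers" statement in the figure.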

Figure 28 : Outliers Table Output

This figure checks each value against the bounds to confirm whether or not it is an outlier.

e. Find the value that separates the bottom 10% of the data.

f. Find the value that lies in the top 80% of the data.

g. Find the values so that 50% of the middle area is bounded by them.
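One possible way to answer parts e, f and g is to treat the data as normally distributed with the mean (72.42) and population standard deviation (32.92376647) reported in part c. The sketch below does this with Python's `statistics.NormalDist`. The normal model is an assumption, and so is the reading of "top 80%" as the value with 80% of the data above it (the 20th percentile); percentiles taken directly from the sorted data could differ.

```python
from statistics import NormalDist

# Assumption: the data follows a normal model with the reported mean and SD.
model = NormalDist(mu=72.42, sigma=32.92376647)

# e. Value separating the bottom 10% of the data (10th percentile).
bottom_10 = model.inv_cdf(0.10)

# f. Value with 80% of the data above it (20th percentile), under the
#    interpretation stated in the lead-in.
top_80 = model.inv_cdf(0.20)

# g. Values bounding the middle 50% (25th and 75th percentiles).
middle_low = model.inv_cdf(0.25)
middle_high = model.inv_cdf(0.75)

print(round(bottom_10, 2), round(top_80, 2))
print(round(middle_low, 2), round(middle_high, 2))
```

Because the normal curve is symmetric, the two values in part g lie the same distance below and above the mean.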

h. Compute a 95% confidence interval for the mean of selected data. Then, conclude
your data based on confidence interval and mean’s result.
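A 95% confidence interval for the mean can be sketched as follows, using the mean (72.42), population standard deviation (32.92376647) and count (100) reported in part c. Because the standard deviation was computed as a population value, the sketch assumes a z-based interval, mean ± z x sigma / sqrt(n); a t-based interval with a sample standard deviation would be slightly wider. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(mean, sigma, n, confidence=0.95):
    """Two-sided z interval for a mean, given a known standard deviation."""
    # Critical value that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return mean - margin, mean + margin


# Values reported earlier in this report; the z-based method is an assumption.
low, high = z_confidence_interval(72.42, 32.92376647, 100)
print(round(low, 2), round(high, 2))
```

The interval is centred on the reported mean, so both the grouped mean (73) and the actual mean (72.42) can be checked against it when writing the conclusion.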

Source of Data - www.kaggle.com
