Assignment - KI - Group 2

PROBABILITY AND STATISTICAL DATA ANALYSIS
(CSF 21103)
ASSIGNMENT REPORT: PAPERS CITATION VS H-INDEX

Lecturer Name: Dr. Nurnadiah Zamri
Group Members
Course : SMSKKI
Name                                          Matric No.
1. MUHAMMAD ADNAN SAIDIE BIN ABDULLAH         066726
2. MOHAMMAD HISYAM BIN SHARIPUDDIN            067896
3. AWANG HILMI BIN AWANG MOHDAR               067953
4. EZZAH NAJIHAH BINTI ZAIDON @ MOHAMAD       066987
5. DHENESWARI A/P VASANTHAN                   068022

TABLE OF CONTENTS
Contents
i. Introduction
ii. Calculation of data using Technology
a. Raw Data
b. i) Frequency Table
ii) Frequency Polygon
iii) Mean and Actual Mean
iv) Comparison between grouped mean and actual mean
v) Histogram and Ogive
vi) Stem plot
c. Mean and Standard Deviation
d. i) Histogram
ii) PI of Skewness
iii) Outliers
e. Value that separates the bottom 10% of the data
f. Value that lies in the top 80% of the data
g. Values so that 50% of the middle area is bounded by them
h. Conclusion of Confidence Interval and Mean’s Result
iii. Source of Data - www.kaggle.com
Introduction
Our group studied a dataset relating paper citations to the H-index, taken from the Kaggle
website. The H-index is a measure of a researcher's productivity and impact: the higher the
H-index, the more productive and influential the researcher is considered to be. Citations are
another measure of impact: the more citations a researcher has, the more often other
researchers have cited their work. This dataset can be used to compare the productivity and
impact of computer science researchers.
a. Present your raw data in a table.
Total data = 100
Number of parameters (quantitative) = 1
Figure 1 : 100 Raw Data
b. Select one parameter and state the name of that parameter.
i) Create a frequency table with 6 categories
Parameter = H-Index
Figure 2 : Datasets
The data are sorted in ascending order so that they can be read accurately.
Figure 3 : Frequency Table
The frequency table consists of the class limits, class boundaries, frequency, cumulative
frequency, midpoint and fx (frequency × midpoint), which are used in the later calculations.
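As an illustration of how such a table can be built, here is a minimal Python sketch. It is not the spreadsheet used in the report: the function name `frequency_table`, the class-width rule (range plus one, divided by the number of classes, rounded up) and the sample values are all assumptions made for the example, and the sample values are hypothetical rather than the report's 100 H-index values.

```python
import math

def frequency_table(values, num_classes=6):
    """Build a grouped frequency table for integer-valued data."""
    lowest, highest = min(values), max(values)
    # Assumed width rule: cover every integer from lowest to highest.
    width = math.ceil((highest - lowest + 1) / num_classes)
    rows, cumulative = [], 0
    for k in range(num_classes):
        lower = lowest + k * width
        upper = lower + width - 1
        freq = sum(lower <= v <= upper for v in values)
        cumulative += freq
        midpoint = (lower + upper) / 2
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "f": freq,
            "cf": cumulative,
            "midpoint": midpoint,
            "fx": freq * midpoint,
        })
    return rows

# Hypothetical H-index values, NOT the report's data.
sample = [12, 25, 31, 40, 47, 52, 58, 63, 66, 71, 75, 80, 88, 94, 103, 119]
table = frequency_table(sample)
for row in table:
    print(row)
```

The last cumulative frequency should always equal the number of data values, which is a quick check that no value fell outside the classes.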
ii) Sketch a frequency polygon for the frequency table. Is the shape of the distribution of
times uniform, skewed or bell shaped?
Figure 4 : Datasets
Figure 5 : Frequency and Midpoint Table
Figure 6 : Frequency Polygon Graph
The frequency polygon is bell-shaped: the highest frequency is in the middle of the
distribution.
iii) Determine the mean for the frequency table and find the actual mean.
Figure 7 : Frequency Table and Mean
The mean for the frequency table is 73, obtained by dividing the sum of the products of
frequency and midpoint by the total frequency (100). The actual mean is 72.42, obtained by
dividing the sum of the raw data (7242) by the total frequency (100).
iv) How does the grouped mean compare to the actual mean?
The grouped mean is 73 while the actual mean is 72.42, so the two are close, differing by
0.58. The mean calculated from the ungrouped data is more accurate because its calculation
uses every observation, whereas the grouped mean represents each observation by its class
midpoint.
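The two calculations can be sketched in Python as follows. The helper names `grouped_mean` and `actual_mean` are hypothetical; the only figures taken from the report are the raw total (7242), the count (100) and the grouped mean (73). The class midpoints and frequencies themselves appear only in the figures, so they are not reproduced here.

```python
def grouped_mean(midpoints, frequencies):
    """Mean estimated from a frequency table: sum(f * x) / sum(f)."""
    total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    return total_fx / sum(frequencies)

def actual_mean(values):
    """Mean of the raw (ungrouped) data."""
    return sum(values) / len(values)

# Figures stated in the report.
report_actual = 7242 / 100
report_grouped = 73
difference = report_grouped - report_actual
print(round(report_actual, 2), round(difference, 2))
```

Calling `grouped_mean` with the midpoint and frequency columns of the frequency table would give the grouped estimate directly.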
v) Next, create a histogram and an ogive.
vi) Then create an ordered stem plot for 15 data.
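One way an ordered stem plot can be produced is sketched below in Python. The function name `stem_plot` and the 15 values are hypothetical, and the sketch assumes non-negative integers with the units digit as the leaf; it is not the plot shown in the report.

```python
from collections import defaultdict

def stem_plot(values):
    """Ordered stem-and-leaf lines: tens and above as stem, units as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    lines = []
    # Include empty stems so gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

# 15 hypothetical values, NOT the report's data.
sample = [23, 27, 31, 35, 35, 42, 48, 49, 51, 56, 60, 64, 67, 72, 85]
lines = stem_plot(sample)
print("\n".join(lines))
```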
b.vii.
c. Compute mean and standard deviation for this data.
Figure 11 : Dataset
The dataset is used to find the mean and standard deviation.
Mean
Figure 12 : Mean Formula
The formula finds the mean of the dataset by summing all the data and dividing by 100, the
total number of data values.
Figure 13 : Population Mean
The output from the mean formula is 72.42.
Standard Deviation
Figure 14 : Population Formula
STDEV.P is the Excel function for the population standard deviation, which treats the dataset
as the entire population.
Figure 15 : Population Standard Deviation
The standard deviation of the dataset is 32.92376647.
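For readers without Excel, the same two quantities can be sketched with Python's standard `statistics` module. The sample below is hypothetical, chosen so the results are easy to check by hand. `statistics.pstdev` divides by n, as a population standard deviation does, while `statistics.stdev` divides by n - 1.

```python
import statistics

# Hypothetical values, NOT the report's data.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.fmean(sample)            # arithmetic mean
population_sd = statistics.pstdev(sample)  # divides by n
sample_sd = statistics.stdev(sample)       # divides by n - 1

print(mean, population_sd, round(sample_sd, 4))
```

The sample standard deviation is always slightly larger than the population one for the same data, so it matters which of the two a report states.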
d. Determine if the data are approximately normally distributed using
i) Histogram
Figure 16 : Histogram
The data are approximately normally distributed because the histogram shows a bell-curve
shape: it is roughly symmetric about the mean, and values near the mean occur more
frequently than values far from it.
ii) PI of skewness
Figure 17 : Dataset
The dataset is used as the reference for finding the PI of skewness.
Figure 18 : PI Skewness Formula
The PI of skewness is computed with the Excel function SKEW.P, which returns the skewness
of a population.
Figure 19 : PI of Skewness Output
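Two skewness measures are easy to confuse here, so the sketch below shows both in Python. The first, `population_skewness`, is the moment-based population skewness, which is what SKEW.P is understood to return. The second, `pearson_index`, is Pearson's index of skewness as usually given in textbooks, 3(mean - median)/s. Both function names and both samples are hypothetical, and using the sample standard deviation in `pearson_index` is an assumption.

```python
import statistics

def population_skewness(values):
    """Average cubed z-score, using the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

def pearson_index(values):
    """Pearson's index of skewness: 3 * (mean - median) / s."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    return 3 * (mean - median) / statistics.stdev(values)

# Hypothetical values, NOT the report's data.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(population_skewness(symmetric), pearson_index(symmetric))
print(population_skewness(right_skewed), pearson_index(right_skewed))
```

Both measures are zero for perfectly symmetric data and positive for a right-skewed sample. A common textbook rule of thumb treats a Pearson index of at least 1 in absolute value as significant skewness.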
iii) Outliers
Figure 20 : Quartile, Interquartile and Bound Table
To find the outliers, we first need the values shown in Figure 20: the first and third
quartiles (Q1 and Q3) and the interquartile range (IQR). From these we obtain the lower
bound and upper bound of the range.
Figure 21 : Q1
The formula for Q1 is QUARTILE(“data”, 1).
Figure 22 : Q3
The formula for Q3 is QUARTILE(“data”, 3).
Figure 23 : IQR
The interquartile range is found by subtraction alone: Q3 - Q1. F2 and E2 are the cells
holding Q3 and Q1 respectively.
Figure 24 : Lower Bound
The formula used in Excel is based on “Q1 - 1.5 (IQR)”. The lower bound marks the lowest
value of the range; any value below it is an outlier.
Figure 25 : Upper Bound
The formula used in Excel is based on “Q3 + 1.5 (IQR)”. The upper bound marks the highest
value of the range; any value above it is an outlier.
Figure 26 : Output
The output shows all the required values.
Figure 27 : Range & Outliers Statement
The range runs from the lower bound to the upper bound. No outliers were found in the dataset.
Figure 28 : Outliers Table Output
This figure confirms whether or not each value is an outlier.
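The outlier check described above can be sketched in Python as follows. The function names `iqr_fences` and `find_outliers` and the sample values are hypothetical. The sketch assumes that `statistics.quantiles` with `method="inclusive"` uses the same interpolation as Excel's QUARTILE function; a different quartile convention can give slightly different bounds.

```python
import statistics

def iqr_fences(values):
    """Return (lower bound, upper bound) as Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Values lying outside the two bounds."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical values, NOT the report's data.
sample = [10, 12, 14, 15, 18, 21, 22, 25, 90]
print(iqr_fences(sample))
print(find_outliers(sample))
```

An empty list from `find_outliers` corresponds to the report's finding that the dataset contains no outliers.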
e. Find the value that separates the bottom 10% of the data.
f. Find the value that lies in the top 80% of the data.
g. Find the values so that 50% of the middle area is bounded by them.
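The workings for parts e, f and g are not reproduced in the text, so the following Python sketch is only one possible approach. It assumes a normal model with the mean (72.42) and standard deviation (32.92376647) reported in part c, and it reads "top 80%" as the value above which 80% of the data lie. The variable names are hypothetical, and the report's own answers may have been obtained differently.

```python
from statistics import NormalDist

# Normal model built from the mean and standard deviation reported in part c.
model = NormalDist(mu=72.42, sigma=32.92376647)

# e. Value separating the bottom 10% from the rest.
bottom_10 = model.inv_cdf(0.10)

# f. Value with 80% of the distribution above it (the 20th percentile).
top_80_cutoff = model.inv_cdf(0.20)

# g. Values bounding the middle 50% (the 25th and 75th percentiles).
middle_low, middle_high = model.inv_cdf(0.25), model.inv_cdf(0.75)

print(round(bottom_10, 2), round(top_80_cutoff, 2))
print(round(middle_low, 2), round(middle_high, 2))
```

Because the model is symmetric, the two values in part g lie the same distance on either side of the mean.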
h. Compute a 95% confidence interval for the mean of selected data. Then, conclude
your data based on confidence interval and mean’s result.
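The confidence interval itself is not reproduced in the text. As a hedged sketch, the Python below computes a z-based 95% interval from the figures reported in part c (mean 72.42, standard deviation 32.92376647, n = 100). Treating the reported standard deviation as known and using the normal critical value are assumptions, and the function name `mean_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(mean, sd, n, confidence=0.95):
    """z-based interval: mean +/- z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sd / math.sqrt(n)
    return mean - margin, mean + margin

# Figures reported in part c.
lower, upper = mean_confidence_interval(72.42, 32.92376647, 100)
print(f"{lower:.2f} to {upper:.2f}")
```

Under these assumptions the interval runs from roughly 66.0 to 78.9. An interval based on the t distribution and the sample standard deviation would be slightly wider.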
Source of Data - www.kaggle.com
