Kelbie Davidson (44817015) COMP4702 - Assignment 1

Question 1
Data is imported into MATLAB from an .xlsx spreadsheet and filtered to a column vector
of mass values. Some values are NaN, which in this case correspond to empty cells. Hence
nanmean and nanstd are used to disregard these cells. Script 1 describes the MATLAB implementation.
Script 1 MATLAB implementation of sample mean and standard deviation
From this, it is determined that:

• Mean is 467.4876 (rounded to 4 decimal places); and
• Standard deviation is 288.3465 (rounded to 4 decimal places)
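As an illustration of the NaN handling described above, the following is a minimal Python sketch rather than the MATLAB script itself. It assumes the sample standard deviation (normalised by N - 1, the default for MATLAB's nanstd); the function name nan_mean_std and the mass values are hypothetical.

```python
import math
import statistics


def nan_mean_std(values):
    """Return (mean, sample standard deviation) of values, skipping NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    return statistics.mean(kept), statistics.stdev(kept)


# Hypothetical mass values; NaN stands in for an empty spreadsheet cell.
masses = [1.0, float("nan"), 3.0, 5.0]
mean, std = nan_mean_std(masses)
print(f"mean = {mean:.4f}, std = {std:.4f}")
```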
Question 2
a)
Classification is defined as assigning inputs to one of a set of discrete categories; and
Regression is defined as predicting a continuous numerical output from input values.
Hence, since the output will be part of a continuous numerical expression, the
problem is one of regression.
b)
Extrapolation is defined as estimating values outside the range of the given data; and
Interpolation is defined as estimating values between known values of the data.
Hence, since the expression is valid within the month of February, estimations for the month
of March would be extrapolation.
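The distinction can be sketched as a simple range check. This is an illustrative example only: the function name estimate_kind and the day-of-year encoding of February (non-leap year) are assumptions, not part of the original answer.

```python
def estimate_kind(x, known_xs):
    """Classify an estimate at x as interpolation or extrapolation."""
    lowest, highest = min(known_xs), max(known_xs)
    return "interpolation" if lowest <= x <= highest else "extrapolation"


# Assumed encoding: day of year, so February spans days 32 to 59.
february_days = range(32, 60)
print(estimate_kind(45, february_days))  # a day in mid-February
print(estimate_kind(70, february_days))  # a day in March
```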
Question 3
Script 2 describes the Sum_to_n program, written in Python, where arr is an array of unique,
unordered integers and n is an integer.
Script 2 Sum to n function implementation in Python
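The original script is not reproduced in this text. The sketch below is one possible reading only: it assumes the task is to find the pairs of elements in arr that sum to n, which the text above does not state. The name sum_to_n mirrors the program name; the example inputs are hypothetical.

```python
def sum_to_n(arr, n):
    """Return pairs (a, b) from arr with a + b == n, in order of discovery.

    Assumes arr holds unique integers, so each pair is reported once.
    """
    seen = set()
    pairs = []
    for value in arr:
        partner = n - value
        if partner in seen:
            pairs.append((partner, value))
        seen.add(value)
    return pairs


print(sum_to_n([8, 3, 5, 1, 7, 4], 9))
```

Using a set of previously seen values keeps the search to a single pass over arr, instead of comparing every pair.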

Question 4
a)
One can estimate, without prior knowledge of what the data set represents, if a feature is
categorical by considering a threshold number of unique values compared to the size of the
set, such that:
\[
\text{Feature type} =
\begin{cases}
\text{Categorical}, & (\text{num unique values}) < 0.05\,(\text{num values}) \\
\text{Non-categorical}, & \text{otherwise}
\end{cases}
\]
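The number of values is 801 for all features, so a feature is treated as categorical if it has
fewer than 0.05 × 801 ≈ 40 unique values. By this threshold, the categorical features, as
determined in Script 3, are listed below, where n is the number of unique values:
• Features 0-18, n = 4-6
• Feature 20, n = 10
• Feature 21, n = 6
• Feature 23, n = 34
• Feature 25, n = 6
• Feature 28, n = 8
• Feature 32, n = 18
• Feature 33, n = 19
• Feature 35, n = 7
• Feature 36, n = 2
• Feature 39, n = 39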
Script 3 Calculating the number of unique values Python script
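The original script is not reproduced in this text. A minimal standard-library sketch of the threshold rule above might look as follows; the function names and the two generated columns are hypothetical, and missing values are not treated specially here.

```python
def count_unique(values):
    """Number of distinct values in a column."""
    return len(set(values))


def is_categorical(values, ratio=0.05):
    """Apply the rule: categorical if unique count < ratio * number of values."""
    return count_unique(values) < ratio * len(values)


# Hypothetical columns of 801 values with 39 and 41 distinct levels.
columns = {
    "few_levels": [i % 39 for i in range(801)],
    "many_levels": [i % 41 for i in range(801)],
}
for name, values in columns.items():
    print(name, count_unique(values), is_categorical(values))
```

With 801 values the cut-off is 40.05, so the 39-level column falls under the threshold and the 41-level column does not.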
b)
As there is no prior knowledge of how the data is distributed, Spearman's non-parametric approach
was used, and Spearman coefficients were calculated using Python's pandas library, as
described in Script 4.
After removing features containing non-numeric values (Features 23, 32 and 33), Features 0 and 1
have the strongest absolute correlation (coefficient of 1) and are positively correlated.
Script 4 Calculating correlation matrix for numerical features
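The original script is not reproduced in this text. With pandas, a Spearman matrix is available through DataFrame.corr(method="spearman"); the sketch below instead uses only the standard library to show what the coefficient computes (the Pearson correlation of the ranks) and how the strongest pair could be selected. All function names and the toy columns are hypothetical, and constant columns are not handled.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def strongest_pair(columns):
    """Return the column pair with the largest absolute Spearman coefficient."""
    scores = {
        (a, b): spearman(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }
    pair = max(scores, key=lambda p: abs(scores[p]))
    return pair, scores[pair]


# Hypothetical numeric columns; f1 is a monotonic function of f0.
toy = {"f0": [1, 2, 3, 4, 5], "f1": [2, 4, 6, 8, 10], "f2": [5, 1, 4, 2, 3]}
pair, rho = strongest_pair(toy)
print(pair, round(rho, 4))
```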

c)
When Feature 36 is a logical high, there is a 91.26% likelihood that the corresponding value in
Feature 28 is missing.
The WEKA visualisation tool (Figure 1) was used to indicate a relationship, and its
strength was verified with a Python script using the pandas library (Script 5).
Figure 1 WEKA visualisation of Feature 28 with respect to Feature 36
Script 5 Python implementation to verify relationship between Feature 28 and 36
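The original script is not reproduced in this text. The check it describes amounts to a conditional proportion: among rows where the flag is high, the fraction whose other value is missing. A minimal standard-library sketch, with a hypothetical function name and made-up rows (None marks a missing value), could be:

```python
def missing_rate_when_flagged(flags, values):
    """Fraction of rows with flag == 1 whose value is missing (None)."""
    flagged = [v for f, v in zip(flags, values) if f == 1]
    if not flagged:
        raise ValueError("no rows with the flag set")
    return sum(v is None for v in flagged) / len(flagged)


# Hypothetical rows: a binary flag column and a column with gaps.
flags = [1, 1, 1, 1, 0, 0]
values = [None, None, None, 5.0, 3.0, None]
rate = missing_rate_when_flagged(flags, values)
print(f"{rate:.2%}")
```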
d)
Feature 35 stands out as the only feature with an immediately noticeable pattern, ascending
from a value of 1 to 7 over the 801 values. This would suggest:
1. Categorisation, however the values would have to have been sorted into ascending order; or
2. A label (e.g. for trials), however the number of entries per “trial” varies.
Hence, without prior information on what the data represents, the purpose of Feature 35 is
difficult to interpret.
e)
i)
Figure 2 shows a correlation heatmap for features in the set [19,24,27,29,30,31]. Spearman’s
method is used because the relationship between the features is unknown but does not appear
linear.
Figure 2 Correlation heatmap for features in set [19,24,27,29,30,31]
Script 6 describes the Python implementation utilised in creating Figure 2.
Script 6 Python implementation of heatmap plot, with MATPLOTLIB
ii)
Features 24 and 31 have the lowest absolute correlation, with a value of 0.072.
iii)
General statistical properties, described in Table 1, were calculated using Excel functions.
Table 1 General statistical properties of Features 24 and 31
             Mean   Median   Mode   Standard Deviation   Q1   Q3     Min   Max
Feature 24   73.0   70       50     30.8                 50   90     5     230
Feature 31   66.3   65       60     28.9                 45   85.5   5     180
Although the features are statistically similar, both having a minimum of 5, Feature 24 has a
significantly greater maximum and a greater standard deviation. This difference is reflected
in its greater mean, median and quartiles.
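The same properties can be computed programmatically. The sketch below uses Python's statistics module rather than the Excel functions used above; it assumes the sample standard deviation and inclusive quartiles (the method matching Excel's QUARTILE.INC), neither of which is stated for Table 1. The function name summarise and the sample values are hypothetical.

```python
import statistics


def summarise(values):
    """Return the summary statistics used in Table 1 for one feature."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumed)
        "q1": q1,
        "q3": q3,
        "min": min(values),
        "max": max(values),
    }


# Hypothetical feature values, not taken from the data set.
sample = [5, 50, 50, 60, 70, 90, 230]
summary = summarise(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```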
Question 5
A box and whisker plot displays a five-number summary of a set of data, including:
1. Minimum value for the series;
2. Maximum value for the series;
3. First quartile border;
4. Third quartile border; and
5. The median value
The interquartile range is the range between the first and third quartile borders. The quartile
borders were calculated using Excel's QUARTILE.INC function, where:
Q1 border = 247
Q3 border = 619.5
Hence, the closest values to each boundary, excluding values on the boundary itself, are:
Closest to Q1:
POPULATION  SEX  DATE      BODY MASS  LOGBM      SVL  LOGSVL
HERDSMAN    M    37165.00  246        2.3909351  80   1.90309
JOONDALUP   M    39428.00  248        2.3944517  74   1.869232
Closest to Q3:
POPULATION  SEX  DATE      BODY MASS  LOGBM      SVL  LOGSVL
WILLIAMS    M    38792.00  620        2.7923917  109  2.037426
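The selection above can be sketched in Python. This is an illustration under stated assumptions, not the method used for the answer: statistics.quantiles with method="inclusive" is used as the counterpart of QUARTILE.INC, ties in distance are all reported, and the function name and mass values are hypothetical.

```python
import statistics


def closest_off_boundary(values, boundary):
    """Values nearest to boundary, excluding any value equal to it (ties kept)."""
    candidates = [v for v in values if v != boundary]
    best = min(abs(v - boundary) for v in candidates)
    return sorted(v for v in candidates if abs(v - boundary) == best)


# Hypothetical body masses, not taken from the data set.
masses = [10, 20, 30, 40, 50, 60, 70, 75, 100]
q1, _, q3 = statistics.quantiles(masses, n=4, method="inclusive")
print("Q1 =", q1, "closest:", closest_off_boundary(masses, q1))
print("Q3 =", q3, "closest:", closest_off_boundary(masses, q3))
```

Here the first boundary has two equally close neighbours and the second has one, mirroring the two rows reported for Q1 and the single row reported for Q3.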
Question 6
a)
The only indication (or Easter egg) of the data set's origin was the bracketed labels on a value in
Feature 23:
30 (Meteorite)255 (Core)
And then it hit me… the nostalgia came rushing back…

b)
The data sheet is composed of statistics of Pokémon. “30 (Meteorite)255 (Core)” references
the capture rate of the Pokémon Minior in its Meteorite and Core states.
This is verified by confirming features for base stats, happiness, height etc. throughout the
data. With a simple Google search, the remaining columns can be related to specific Pokémon
entries.
