
Kelbie Davidson (44817015)

COMP4702 – Assignment 1
Question 1
Data is imported into MATLAB from an .xlsx spreadsheet and filtered to a column vector
of mass values. Some values are NaN, corresponding to empty cells, so nanmean and nanstd
are used to disregard these cells. Script 1 describes the MATLAB implementation.

Script 1 MATLAB implementation of sample mean and standard deviation

From this, it is determined that:


• Mean is 467.4876 (rounded to 4 decimal places); and
• Standard deviation is 288.3465 (rounded to 4 decimal places)
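The NaN-handling step can be sketched in Python for illustration. Script 1 itself was written in MATLAB and is not reproduced here, so the function name nan_mean_std and the mass values below are hypothetical. The sketch assumes the sample (N-1) standard deviation, which is the default normalisation of MATLAB's nanstd.

```python
import math
import statistics


def nan_mean_std(values):
    """Return (mean, sample standard deviation) of values, ignoring NaN entries."""
    valid = [v for v in values if not math.isnan(v)]
    # statistics.stdev divides by N-1, i.e. the sample standard deviation.
    return statistics.mean(valid), statistics.stdev(valid)


# Hypothetical mass values; NaN stands in for an empty spreadsheet cell.
masses = [246.0, float("nan"), 620.0, 248.0, float("nan")]
mean, std = nan_mean_std(masses)
print(f"mean = {mean:.4f}, std = {std:.4f}")
```

Filtering before calling the statistics functions matters because a single NaN would otherwise propagate into both results.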

Question 2

a)
Classification is defined as predicting a discrete category from input variables; and
Regression is defined as predicting a continuous value from input variables.

Hence, since the output is a continuous numerical value, the task is one of
regression.

b)
Extrapolation is defined as estimating values outside the range of the given data; and
Interpolation is defined as estimating values between known values of the data.
Hence, since the expression is valid within the month of February, estimations for the month
of March would be extrapolation.

Question 3
Script 2 describes the Sum_to_n program, written in Python. Here, arr is an array of unique,
unordered integers and n is an integer.

Script 2 Sum to n function implementation in Python
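Script 2 itself is not reproduced in this copy. As a rough illustration only, the sketch below assumes the task is to find two elements of arr that sum to n; that reading of the problem, the lowercase name sum_to_n and the return format are all assumptions rather than the original implementation.

```python
def sum_to_n(arr, n):
    """Return a pair (a, b) of elements of arr with a + b == n, or None.

    Assumes arr holds unique integers in any order (as stated for Script 2).
    """
    seen = set()
    for value in arr:
        complement = n - value
        # A previously seen element completes the sum with the current one.
        if complement in seen:
            return (complement, value)
        seen.add(value)
    return None


print(sum_to_n([8, 3, 11, 5], 14))  # 3 + 11 == 14
print(sum_to_n([1, 2, 4], 8))       # no pair sums to 8
```

A set lookup keeps this to a single pass over arr, and because the integers are unique an element is never paired with itself.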


Question 4
a)
One can estimate, without prior knowledge of what the data set represents, whether a feature
is categorical by comparing its number of unique values against a threshold proportion of the
size of the set, such that:

\[
\text{Feature type} =
\begin{cases}
\text{Categorical}, & (\text{num unique values}) < 0.05\,(\text{num values}) \\
\text{Non-categorical}, & \text{else}
\end{cases}
\]

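The number of values is 801 for all features, so the cut-off is 0.05 × 801 ≈ 40 unique values.
By this threshold, the categorical features, as determined in Script 3, include the following
(n is the number of unique values):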

Feature 0-18, n = 4-6    Feature 20, n = 10    Feature 21, n = 6
Feature 23, n = 34       Feature 25, n = 6     Feature 28, n = 8
Feature 32, n = 18       Feature 33, n = 19    Feature 35, n = 7
Feature 36, n = 2        Feature 39, n = 39

Script 3 Calculating the number of unique values Python script
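Script 3 itself is not reproduced in this copy. The threshold rule from part a) can be sketched as below; the function name categorical_features and the toy columns are hypothetical, and missing values are not treated specially in this sketch.

```python
def categorical_features(columns, threshold=0.05):
    """Map feature name -> unique-value count for features under the threshold.

    columns is a dict of feature name -> list of values. A feature counts as
    categorical when (num unique values) < threshold * (num values).
    """
    result = {}
    for name, values in columns.items():
        n_unique = len(set(values))
        if n_unique < threshold * len(values):
            result[name] = n_unique
    return result


# Hypothetical toy data: 100 rows per feature, so the cut-off is 5 unique values.
columns = {
    "flag": [i % 2 for i in range(100)],   # 2 unique values
    "measurement": list(range(100)),       # 100 unique values
}
print(categorical_features(columns))
```

Only the "flag" column falls under the cut-off here, mirroring how a feature with few unique values relative to the set size is flagged as categorical.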

b)
As there is no prior knowledge of the data's parametrisation, Spearman's non-parametric
approach was utilised and Spearman coefficients were calculated using Python's pandas library,
as described in Script 4.

After removing the features containing non-numeric values (Features 23, 32 and 33), Features 0
and 1 have the strongest absolute correlation (a coefficient of 1) and are positively correlated.

Script 4 Calculating correlation matrix for numerical features
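Script 4 itself is not reproduced in this copy. A minimal sketch of the approach, assuming pandas is available, is given below; the function name strongest_spearman_pair and the toy DataFrame are hypothetical.

```python
import pandas as pd


def strongest_spearman_pair(df):
    """Return (feature_a, feature_b, rho) with the largest |rho| among numeric columns."""
    numeric = df.select_dtypes(include="number")  # drop non-numeric features
    corr = numeric.corr(method="spearman")
    cols = list(corr.columns)
    best = None
    # Scan the upper triangle only, so the diagonal (always 1) is skipped.
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            rho = corr.loc[a, b]
            if pd.notna(rho) and (best is None or abs(rho) > abs(best[2])):
                best = (a, b, float(rho))
    return best


# Hypothetical toy data; "label" is non-numeric and is excluded.
df = pd.DataFrame({
    "f0": [1, 2, 3, 4, 5],
    "f1": [10, 20, 40, 80, 160],   # increases monotonically with f0
    "f2": [3, 1, 4, 1, 5],
    "label": ["a", "b", "c", "d", "e"],
})
print(strongest_spearman_pair(df))
```

Because Spearman's method works on ranks, f0 and f1 are perfectly correlated here even though their relationship is not linear.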


c)
A logical high in Feature 36 indicates a missing value in Feature 28 with a likelihood of
91.26%.

The WEKA visualisation tool (Figure 1) was used to indicate that a relationship exists, and its
strength was verified via a Python script using the pandas library (Script 5).

Figure 1 WEKA visualisation of Feature 28 with respect to Feature 36

Script 5 Python implementation to verify relationship between Feature 28 and 36
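Script 5 itself is not reproduced in this copy. The check can be sketched as below, assuming the 91.26% figure is the fraction of rows with Feature 36 high in which Feature 28 is missing; that reading, the function name missing_rate_given and the toy data are assumptions.

```python
import pandas as pd


def missing_rate_given(df, flag_col, target_col, flag_value=1):
    """Fraction of rows with flag_col == flag_value in which target_col is missing."""
    flagged = df[df[flag_col] == flag_value]
    return flagged[target_col].isna().mean()


# Hypothetical toy data: f28 is missing in 3 of the 4 rows where f36 is 1.
df = pd.DataFrame({
    "f36": [1, 1, 1, 1, 0, 0],
    "f28": [None, None, None, 2.0, 3.0, 4.0],
})
rate = missing_rate_given(df, "f36", "f28")
print(f"{rate:.2%}")
```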

d)
Feature 35 stands out as the only feature with an immediately noticeable pattern, ascending
from a value of 1 to 7 over the 801 values. This would suggest:

1. A category, although the values would have to have been sorted into ascending order; or
2. A label (i.e. for trials), although the number of values in each “trial” varies.

Hence, without prior information on what the data represents, the purpose of Feature 35 is
difficult to interpret.

e)
i)
Figure 2 shows a correlation heatmap for features in the set [19,24,27,29,31]. Spearman’s
method is used because the relationship between the features is unknown but does not appear
linear.

Figure 2 Correlation heatmap for features in set [19,24,27,29,30,31]

Script 6 describes the Python implementation utilised in creating Figure 2.

Script 6 Python implementation of heatmap plot, with Matplotlib

ii)

Features 24 and 31 have the lowest absolute correlation, with a value of 0.072.

iii)

General statistical properties, described in Table 1, were calculated using Excel functions.

            Mean  Median  Mode  Standard Deviation  Q1  Q3    Min  Max
Feature 24  73.0  70      50    30.8                50  90    5    230
Feature 31  66.3  65      60    28.9                45  85.5  5    180

Although the features are statistically similar, both having a minimum of 5, Feature 24 has a
notably greater maximum (230 versus 180) and a slightly greater standard deviation (30.8 versus
28.9). This difference is reflected in its greater mean, median and quartiles.

Question 5
A box and whisker plot displays a five-number summary of a set of data, including:

1. Minimum value for the series;
2. Maximum value for the series;
3. First quartile border;
4. Third quartile border; and
5. The median value.

The interquartile range is the range between the first and third quartile borders. The quartile
borders were calculated using Excel's QUARTILE.INC function, where:

Q1 border = 247
Q3 border = 619.5

Hence, the values closest to each border, excluding any values on the border itself, are:

Closest to Q1:

POPULATION  SEX  DATE      BODY MASS  LOGBM      SVL  LOGSVL
HERDSMAN    M    37165.00  246        2.3909351  80   1.90309
JOONDALUP   M    39428.00  248        2.3944517  74   1.869232

Closest to Q3:

POPULATION  SEX  DATE      BODY MASS  LOGBM      SVL  LOGSVL
WILLIAMS    M    38792.00  620        2.7923917  109  2.037426
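The quartile lookup above was done in Excel. As a sketch of the same idea in Python, the standard library's statistics.quantiles with method="inclusive" uses linear interpolation over the data including its end points, which to my understanding matches QUARTILE.INC. The function name neighbours_of_quartiles and the toy values are hypothetical, and NaN entries would need to be filtered out first.

```python
import statistics


def neighbours_of_quartiles(values):
    """Return {"Q1": (q1, (below, above)), "Q3": (q3, (below, above))}.

    below/above are the nearest data values strictly under/over each border;
    values lying exactly on a border are excluded.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")

    def nearest(border):
        below = max((v for v in values if v < border), default=None)
        above = min((v for v in values if v > border), default=None)
        return below, above

    return {"Q1": (q1, nearest(q1)), "Q3": (q3, nearest(q3))}


# Hypothetical toy data, not the assignment's mass values.
result = neighbours_of_quartiles([10, 20, 30, 40, 50, 60, 70, 80])
print(result)
```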

Question 6
a)
The only indication (or Easter egg) of the data set's origin was the bracketed labels attached
to a value in Feature 23:

30 (Meteorite)255 (Core)

And then it hit me… the nostalgia came rushing back…


b)
The data sheet is composed of statistics of Pokémon. “30 (Meteorite)255 (Core)” references
the capture rate of the Pokémon Minior in its Meteorite and Core states.

This is verified by confirming features for base stats, happiness, height etc. throughout the
data. With a simple Google search, the remaining columns can be related to specific Pokémon
entries.
