
Question 1:

Part 1: Data Investigation and Cleaning


Classification of data errors
After loading the dataset file as a data frame, we use the info() method to get an overview of the data frame.

We notice some format errors in the column “year”, such as “19-99” instead of the correct “1999”. Replacing the character ‘-’ with an empty string in the “year” column fixes this issue.
For erroneous values that contain the character ‘-’ or are negative, removing the character ‘-’ yields the correct data; this was double-checked against the population figures in the World Bank Data.
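A minimal pandas sketch of this fix; the column names and values below are hypothetical stand-ins for the real dataset:

```python
import pandas as pd

# Toy frame mimicking the errors described above ("19-99", "-1234", "91-011").
df = pd.DataFrame({"year": ["19-99", "2000", "20-01"],
                   "population": ["-1234", "5678", "91-011"]})

# Strip stray '-' characters so "19-99" becomes 1999, then restore integer types.
df["year"] = df["year"].str.replace("-", "", regex=False).astype(int)
df["population"] = df["population"].str.replace("-", "", regex=False).astype(int)
```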
From the column data types, some entries have an incorrect data type (not integer). Moreover, there are missing values and negative values in the dataset. The data errors are therefore classified into 3 categories:
- non-integer value: the population must be of integer data type
- negative value: the population of a country cannot be a negative number
- missing value: there are both Missing Completely At Random (MCAR) and Missing At Random (MAR) values, which critically affect the later classification steps on this dataset.
It is necessary to locate the erroneous data and investigate its patterns. For this, the CustomElementValidation class from the pandas_schema library was used [1]. Number of erroneous entries per category:
- non-integer value: 171
- missing value: 178
- negative value: 11
- equal to 0: 24
- total data errors: 386
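The counting step can be sketched with plain pandas masks (the report itself uses CustomElementValidation [1]; the toy series below is invented):

```python
import pandas as pd

# Toy population column containing each error category described above.
pop = pd.Series(["1000", None, "-50", "0", "12.5", "2000"])

as_num = pd.to_numeric(pop, errors="coerce")
n_missing = int(pop.isna().sum())                            # missing values
n_negative = int((as_num < 0).sum())                         # negative values
n_zero = int((as_num == 0).sum())                            # values equal to 0
n_not_int = int((as_num.notna() & (as_num % 1 != 0)).sum())  # non-integer values
```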
Logic/rules to clean up data
After counting and filtering the data errors into categories, it is necessary to group the errors by country. This helps in further investigating the error patterns and in choosing the correct rules and methods to handle them.
Most countries have only one data error, occurring completely at random, while some countries have more than 5 errors following a pattern over a period of consecutive years.

From the figure above, the countries with many errors were printed out with their exact number of errors. The errors were assigned to another data frame for easier investigation. Then, every error is examined country by country to identify the patterns.
- Country Name “Not classified”, with 110 data errors: all population data for this entry is missing. In fact, the name does not correspond to a real country and no information about it could be found in internet sources. Therefore, the Country Name “Not classified” is removed from the dataset.
- The data errors of the other countries in this list are missing or incorrectly formatted values spanning a period of years. For example, “Guatemala” has missing data from 1960 to 1980 and from 2008 to 2017.
- In particular, the Country Name “West Bank and Gaza” has many missing values from 1960 to 1989; official records and data collection only start from 1990. The region is also known as the Palestinian territories, and its history is complicated by immigration and occupation policy. This data is dropped, since its outliers could add noise to the world population data in the classification phase.
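The per-country error count and the removal of the two unusable entries can be sketched as follows (the miniature frame is hypothetical; NaN marks an erroneous cell):

```python
import pandas as pd

# Miniature stand-in for the dataset after error detection.
df = pd.DataFrame({
    "Country Name": ["Not classified", "West Bank and Gaza", "Guatemala", "France"],
    "1960": [None, None, None, 45_000_000],
    "1961": [None, None, 4_100_000, 46_000_000],
})

# Count erroneous cells per country to spot the heavy offenders.
errors_per_country = df.set_index("Country Name").isna().sum(axis=1)

# Drop the two entries judged unusable above.
df = df[~df["Country Name"].isin(["Not classified", "West Bank and Gaza"])]
```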
Rules and methods to clean up data
Since the error patterns have been identified for the critical cases (countries with many data errors), the cleanup proceeds country by country. There are two rules for cleaning this dataset:
- For a country that does not have many errors and whose errors occur completely at random, the MICE library is applied to impute the data [4]. Overall, the population of every country increases by 3.6% to 5% each year, so imputation is suitable and reasonable in this case.
- For a country that has many errors following a pattern, substitution is applied instead, because imputation in that case could generate data very different from the real population. The substitute data is collected from the World Bank Data and from Worldometers (whose data source is the United Nations, Department of Economic and Social Affairs, Population Division, World Population Prospects: The 2019 Revision).
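As a simplified stand-in for the MICE imputation applied to the random errors [4], the sketch below fills a single randomly missing year in one country's series by linear interpolation; the numbers are invented:

```python
import numpy as np
import pandas as pd

# One country's yearly population with one randomly missing year.
pop = pd.Series([100.0, np.nan, 110.0, 116.0], index=[1960, 1961, 1962, 1963])

# Fill the gap from its neighbours. The report itself runs MICE-style
# chained-equation imputation over the full multi-country table; linear
# interpolation is shown here only as a simpler illustration.
filled = pop.interpolate()
```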
Number of removed rows and columns
Two columns have been removed: 'Indicator Name' and 'Indicator Code'. These two attributes contain a single string value for all rows, which appears to be the code of the worldwide population indicator.
As for the removed rows, all rows that are not countries are removed, since the dataset contains not only countries but also geographical and development aggregates such as Central Europe and the Baltics, East Asia & Pacific, etc.
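Dropping the two constant columns is a one-liner in pandas (the sample row below is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Country Name": ["Aruba"],
    "Indicator Name": ["Population, total"],
    "Indicator Code": ["SP.POP.TOTL"],
    "1960": [54211],
})

# Both indicator columns hold one constant value, so they carry no signal.
df = df.drop(columns=["Indicator Name", "Indicator Code"])
```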
Part 2: Country Histogram for First and Last Digits

First digit histogram: the small digits occur most frequently, and the greater a digit is, the less frequently it occurs as the first digit. This matches the distribution of first digits observed in real-world datasets, known as Benford's Law [2]. The same pattern occurs in the data from 1980 to 2018.
Last digit histogram: the digit 0 occurs most frequently, while the other digits occur roughly equally and less often than 0. The likely reason is that population figures are often rounded to numbers divisible by 10. This highlights the role of significant figures [3].
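Both histograms can be computed directly from the digit strings; a small sketch over invented population figures:

```python
from collections import Counter

# Hypothetical population figures; note the rounded trailing zeros.
populations = [1999, 5000, 123456, 2000, 17, 130, 9100, 1200]

first_digits = Counter(str(p)[0] for p in populations)  # Benford-like skew
last_digits = Counter(str(p)[-1] for p in populations)  # excess of 0s from rounding
```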

Question 2:
Statistical Analysis:
The statistics of the iris dataset are calculated as follows:
- Statistics of Setosa Iris
- Statistics of Versicolor Iris

- Statistics of Virginica Iris
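The per-species statistics shown in the figures above can be reproduced with a grouped describe(); the miniature table below is invented:

```python
import pandas as pd

# Tiny stand-in for the iris table; the real one has 50 rows per species.
iris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
})

# Mean, std, quartiles, min, max per species in one call.
stats = iris.groupby("species")["sepal_length"].describe()
```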

Regarding the statistical values of the iris species, we can observe that:


- From the mean and the median of the sepal length, the Virginica iris has a longer sepal than the Setosa and Versicolor irises.
- From the sepal width values, the Setosa iris has a wider sepal than the other two species. The sepal width of the Virginica iris is slightly higher than that of the Versicolor iris; however, within the 25%–75% range the data of these two species overlap considerably, which should be considered when choosing the classification rules.
- For the petal length statistics, the means and medians of the three species are very different, and there is very little overlap between species in the 25%–75% range. The species with the longest petals is the Virginica iris, followed by the Versicolor and Setosa irises, respectively.
- As with the petal length, the petal width shows very large differences between the three species and very little overlap. The order from the widest to the narrowest petals is the same as for petal length: Virginica, Versicolor, and then Setosa.
From the statistical analysis above, we can classify the irises as follows:
- If the petal length is in the range of 5.2 to 6.1 cm, or the petal width is between 1 and 1.5 cm, the iris is usually a Versicolor.
- If the sepal length is from 6.2 to 7 cm, the iris should be a Virginica.
- If the sepal length is less than 5.2 cm and the petal width is less than 0.6 cm, the iris is a Setosa.
- If the petal length is between 1 and 1.9 cm, it should be a Setosa.
- Any other case that does not match the rules above is suggested to be a Versicolor.
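The rules above can be written as a small function, applied in the listed order (a sketch of the hand-written classifier, not a fitted model):

```python
def classify_iris(sepal_length, petal_length, petal_width):
    """Apply the hand-written classification rules, in the order listed."""
    if 5.2 <= petal_length <= 6.1 or 1.0 <= petal_width <= 1.5:
        return "Versicolor"
    if 6.2 <= sepal_length <= 7.0:
        return "Virginica"
    if sepal_length < 5.2 and petal_width < 0.6:
        return "Setosa"
    if 1.0 <= petal_length <= 1.9:
        return "Setosa"
    return "Versicolor"  # fallback rule for anything unmatched
```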
Correlation and Covariance Matrix for each Iris Species
- Correlation Matrix:
- Covariance Matrix:
From the covariance matrices of all three iris species, we can observe that the covariances between any two attributes are positive.
From the correlation matrices, some attributes are highly correlated with each other, and these should be key factors for the iris classification:
- The Setosa iris has a high correlation between two attributes: sepal width and sepal length.
- All attributes of the Versicolor iris are highly correlated with each other.
- The Virginica iris has its petal length highly correlated with its sepal length.
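Per-species correlation and covariance matrices can be computed with a group-by; the miniature table below is hypothetical:

```python
import pandas as pd

# Tiny stand-in for the iris table.
iris = pd.DataFrame({
    "species": ["setosa"] * 3 + ["virginica"] * 3,
    "sepal_length": [5.1, 4.9, 5.0, 6.3, 6.5, 7.0],
    "sepal_width": [3.5, 3.0, 3.4, 3.3, 3.0, 3.2],
})

# One correlation and one covariance matrix per species.
corr = {name: grp.drop(columns="species").corr() for name, grp in iris.groupby("species")}
cov = {name: grp.drop(columns="species").cov() for name, grp in iris.groupby("species")}
```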
Scatter Plot

The scatter plot figure presents more detail on the data distribution of each attribute for each iris species. The petal length vs. petal width plot gives the best separation of the three species, apart from some overlapping points between the Versicolor and Virginica irises. This finding reinforces the classification rules defined in the statistical analysis (sub-question of question 2). Moreover, it is clear that an iris with a large petal length and petal width should be classified as a Virginica. The plot also confirms the classification rule that no Setosa iris has a petal width greater than 0.6 cm.
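A petal length vs. petal width scatter plot along these lines might be produced as follows (invented points; a headless matplotlib backend is assumed for a script environment):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; assumed, not part of the report
import matplotlib.pyplot as plt

# Hypothetical petal measurements for two of the species.
setosa = {"petal_length": [1.4, 1.3, 1.5], "petal_width": [0.2, 0.2, 0.3]}
virginica = {"petal_length": [6.0, 5.8, 6.1], "petal_width": [2.5, 1.9, 2.1]}

fig, ax = plt.subplots()
ax.scatter(setosa["petal_length"], setosa["petal_width"], label="Setosa")
ax.scatter(virginica["petal_length"], virginica["petal_width"], label="Virginica")
ax.set_xlabel("petal length (cm)")
ax.set_ylabel("petal width (cm)")
ax.legend()
```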
Question 3:
How long would it take to calculate the first 10,000 and 25,000 Fibonacci numbers?
- Function to generate the first n Fibonacci numbers:

- Calculation and estimated execution time for the first 10,000 and 25,000 numbers:
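The report's own code appears as figures above; an assumed iterative version with simple timing might look like this:

```python
import time

def fibonacci_list(n):
    """Return a list of the first n Fibonacci numbers, starting from F(0) = 0."""
    nums = []
    a, b = 0, 1
    for _ in range(n):
        nums.append(a)
        a, b = b, a + b
    return nums

# Time the generation of the first 10,000 numbers.
start = time.perf_counter()
fib_10k = fibonacci_list(10_000)
elapsed = time.perf_counter() - start
```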

Challenges while calculating the 1,000,000th Fibonacci number


To calculate the 1,000,000th Fibonacci number, we could reuse the previous code to generate the first 1,000,000 numbers and then take the last element of the result array.
Challenges in calculating the 1,000,000th Fibonacci number:
- Memory consumption: to calculate the nth Fibonacci number for a large n, the preceding numbers must be stored in memory for the next calculation. With the previous code, the computer needs to store every number from index 0 to 1,000,000 in temporary memory, and running that code to generate the first 1,000,000 numbers raises a memory error.
- Optimization: it is not necessary to store all of the first 1,000,000 numbers just to calculate the 1,000,000th one. The challenge is therefore to compute that number without wasting memory and time. Moreover, the notebook's output I/O is limited: even printing the first 25,000 Fibonacci numbers triggers the error “IOPub data rate exceeded. Notebook server will temporarily stop sending output”.
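The memory challenge alone can be removed by keeping only the last two values instead of the whole list; a minimal sketch (this still needs O(n) big-integer additions, so it does not solve the time problem):

```python
def nth_fibonacci(n):
    """Return the n-th Fibonacci number (F(0) = 0), storing only two values."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```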
Solution for calculating the 1,000,000th Fibonacci number
To calculate exactly the nth Fibonacci number without generating the whole series, we can use the method called Fast Doubling [5], which has a complexity of O(log n):
Given a number F(k) and the next number F(k+1), we can determine F(2k) and F(2k+1) by applying these identities:
F(2k) = F(k) · (2·F(k+1) − F(k))

F(2k+1) = F(k)² + F(k+1)²

- Implementation:
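The report's implementation is shown as a figure; a comparable fast-doubling sketch is given below:

```python
def fib_pair(n):
    """Return (F(n), F(n+1)) using the fast-doubling identities, in O(log n) steps."""
    if n == 0:
        return (0, 1)
    fk, fk1 = fib_pair(n // 2)          # k = n // 2
    f2k = fk * (2 * fk1 - fk)           # F(2k) = F(k) * (2*F(k+1) - F(k))
    f2k1 = fk * fk + fk1 * fk1          # F(2k+1) = F(k)^2 + F(k+1)^2
    if n % 2 == 0:
        return (f2k, f2k1)
    return (f2k1, f2k + f2k1)

def fib(n):
    """Return the n-th Fibonacci number, F(0) = 0."""
    return fib_pair(n)[0]
```

The recursion depth for n = 1,000,000 is only about 20, so neither the list of one million numbers nor deep recursion is needed.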

As a result, the 1,000,000th Fibonacci number is calculated correctly, and the execution time is less than 1 millisecond.
Reference
[1] B. Cojocar, "How to do column validation with pandas", Medium, 2020. [Online]. Available: https://medium.com/@bogdan.cojocar/how-to-do-column-validation-with-pandas-bbeb38f88990. [Accessed: 10-Aug-2020].
[2] P. Corn, H. Vee and C. Williams, "Benford's Law | Brilliant Math & Science Wiki", Brilliant.org, 2020. [Online]. Available: https://brilliant.org/wiki/benfords-law/. [Accessed: 11-Aug-2020].
[3] "Significant Figures", Staff.vu.edu.au, 2020. [Online]. Available: http://www.staff.vu.edu.au/mcaonline/units/numbers/numsig.html. [Accessed: 12-Aug-2020].
[4] S. van Buuren and C. Groothuis-Oudshoorn, "MICE: Multivariate Imputation by Chained Equations in R", ResearchGate, 2020. [Online]. Available: https://www.researchgate.net/publication/44203418_MICE_Multivariate_Imputation_by_Chained_Equations_in_R. [Accessed: 09-Aug-2020].
[5] V. Kumar, "Fast Doubling method to find nth Fibonacci number - Vinay Kumar", HackerEarth, 2020. [Online]. Available: https://www.hackerearth.com/de/practice/notes/fast-doubling-method-to-find-nth-fibonacci-number/. [Accessed: 03-Aug-2020].
