
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

 
PALMER STATION (ANTARCTICA)
PENGUINS CLASSIFICATION USING
MACHINE LEARNING
WEB APP THAT ALLOWS RECOGNITION OF INDIVIDUAL
PENGUIN SPECIES COULD AID IN CONSERVATION

MENTOR :
DR.K.ANANTHAJOTHI, M.E., Ph.D.
(Associate Professor)

SHREERAM S (200701234)
SANTHOSH KUMAR D V (200701217)
ABSTRACT

The penguin classifier identifies a penguin's species from its physical features, sex, and habitat, which helps biologists in their studies. Penguin population counts track ecological change in the Antarctic Peninsula, an important problem in the context of global climate change. Penguins are also an important indicator of the health of the oceans and their associated ecosystems. The data cover three penguin species observed on three islands ('Biscoe', 'Dream', 'Torgersen') in the Palmer Archipelago, Antarctica, and were collected by Dr. Kristen Gorman. Our approach applies a supervised machine learning model to this dataset. Our results for predicting the species class label from the predictor variables are promising.

INTRODUCTION

The Adélie penguin (Pygoscelis adeliae) is a species of penguin common along the entire coast of the Antarctic continent, which is the only

place where it is found. It is the most widespread penguin species, and, along with the emperor penguin, is the most southerly distributed of all

penguins.

The gentoo penguin (Pygoscelis papua) is a penguin species whose earliest scientific description was made in 1781 by Johann Reinhold Forster, with a type locality in the Falkland Islands. The species calls in a variety of ways, but the most frequently heard call is a loud trumpeting, which the bird emits with its head thrown back.

The chinstrap penguin (Pygoscelis antarcticus) is a species of penguin that inhabits a variety of islands and shores in the Southern Pacific and

the Antarctic Oceans. Its name stems from the narrow black band under its head, which makes it appear as if it were wearing a black helmet,

making it easy to identify. Other common names include ringed penguin, bearded penguin, and stone cracker penguin, due to its loud, harsh

call.

Penguin population counts track ecological change, which is important in the context of global climate change. The data cover three penguin species observed on three islands ('Biscoe', 'Dream', 'Torgersen') in the Palmer Archipelago, Antarctica, and were collected by Dr. Kristen Gorman. These three species are the class labels we consider, and the output of the model is the species name. This is a multiclass classification problem, as it involves more than two class labels.
Fig 1.1 The Three Palmer Species of Antarctica

Fig 1.2 Dr. Gorman, in action collecting some penguin data

Fig 1.3 Palmer Station Antarctica (Biscoe, Dream & Torgersen Islands)


MOTIVATION

• Marine biologists often find it difficult to recognize and predict penguin species from their physical features.

• A penguin classification system can reduce this difficulty by taking the features as input and returning an accurate prediction and analysis using machine learning algorithms.

• Because penguins live in a cold, remote climate, they are hard to study in the field, so an interface like this should make the work easier.

CHALLENGES

• "They all look alike to me" is no longer an excuse when studying penguins.

• Studying individual penguins is important for monitoring population dynamics and understanding migratory patterns. Ornithologists and marine biologists are the researchers who specialize in the study of penguins.

• These researchers find it difficult to study Antarctic penguin species because of the climatic conditions and other difficulties.

• The problem statement at hand is to classify the species of a penguin, given the predictor variables in the Palmer penguins dataset.

AIM AND OBJECTIVES OF THE PROJECT

• To build a penguin classifier web app that predicts the class label of a penguin species (Adelie, Chinstrap, or Gentoo) as a function of 4 quantitative variables and 2 qualitative variables.

• To perform ordinal feature encoding on the 3 qualitative variables, comprising the target y variable (species) and the 2 x variables (sex and island), and to train the model on the encoded dataset.
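
The ordinal encoding named in the second objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the integer orderings, the helper name `encode_record`, and the column names (which follow the commonly used palmerpenguins naming) are all assumptions, and the sample values are invented for the example.

```python
# Hypothetical integer codes for the three qualitative variables.
# The ordering is an assumption; any fixed, consistent mapping works.
SPECIES_CODES = {"Adelie": 0, "Chinstrap": 1, "Gentoo": 2}
ISLAND_CODES = {"Biscoe": 0, "Dream": 1, "Torgersen": 2}
SEX_CODES = {"female": 0, "male": 1}


def encode_record(record):
    """Return (features, target) with qualitative values replaced by integers."""
    features = [
        ISLAND_CODES[record["island"]],   # qualitative x variable
        record["bill_length_mm"],         # quantitative
        record["bill_depth_mm"],          # quantitative
        record["flipper_length_mm"],      # quantitative
        record["body_mass_g"],            # quantitative
        SEX_CODES[record["sex"]],         # qualitative x variable
    ]
    target = SPECIES_CODES[record["species"]]  # qualitative y variable
    return features, target


# Illustrative record (values invented for the example).
sample = {
    "species": "Gentoo",
    "island": "Biscoe",
    "bill_length_mm": 46.1,
    "bill_depth_mm": 13.2,
    "flipper_length_mm": 211.0,
    "body_mass_g": 4500.0,
    "sex": "female",
}
features, target = encode_record(sample)
print(features, target)
```

Tree-based models such as Random Forest split on thresholds, so an arbitrary integer order for island or sex is usually tolerable. Distance-based models may be better served by one-hot encoding.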

SURVEY TABLE
| WORK DESCRIPTION | TECHNIQUE USED | DISADVANTAGES | ADVANTAGES |
|---|---|---|---|
| Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). Kristen B. Gorman, Tony D. Williams, William R. Fraser | Dataset creation based on readings | The research was limited to the three predominant species. Other species, such as emperor penguins, could also be added, which would aid marine biologists. | The knowledge derived from this analysis can be used for sex segregation and to find differences in sexually dimorphic species. |
| A memetic algorithm using emperor penguin and social engineering optimization for medical data classification. SK Baliarsingh, W Ding, S Vipsita, S Bakshi. Applied Soft Computing, 2019, Elsevier | SVM (Support Vector Machine algorithm) | It does not perform well on large datasets because the required training time is higher. | The experimental results confirm that the proposed method is significantly superior to other existing techniques in terms of accuracy and number of genes selected. |
| An Emperor Penguin Population Estimate: The First Global, Synoptic Survey of a Species from Space. Peter T. Fretwell, Michelle A. LaRue, Paul Morin | Very High Resolution (VHR) imagery | When the resolution of an image is too high, file sizes become extremely large. When the resolution is too low, the image is blurry. | Where colonies were identified, VHR imagery was obtained in the 2009 breeding season, giving a population estimate of ∼238,000 breeding pairs. |
| Random forests: from early developments to recent advancements. Khaled Fawagreh, Mohamed Medhat Gaber & Eyad Elyan | Random Forest | A trained forest may require significant memory for storage, due to the need to retain the information from several hundred individual trees. | The random forest algorithm reduces overfitting by using multiple trees. |
| Learning Tableau: A data visualization tool. Steven Batt, Tara Grealis, Oskar Harmon & Paul Tomolonis | Tableau (data visualization) | No change management or versioning. Higher cost than other tools. | Tableau is a charting system for presenting data in an easy-to-read format. It can handle millions of rows of data with ease. |

REQUIREMENT SPECIFICATION

HARDWARE SPECIFICATIONS

Processor : Pentium IV Or Higher


Memory Size : 256 MB (Minimum)
HDD : 40 GB (Minimum)

SOFTWARE SPECIFICATIONS

Operating System : WINDOWS 7 Or XP


Front – End : STREAMLIT
Language : PYTHON

SYSTEM DESIGN

SYSTEM ARCHITECTURE

A system architecture is the conceptual model that defines the structure, behavior, and other views of a system. The penguin classification system predicts the species of a penguin based on its features. The dataset is also visualized in Tableau.

Fig 1.6 System Architecture


RESULT AND DISCUSSION

DATASET COLLECTION

A dataset is a collection of related sets of information that is composed of separate elements but can be manipulated as a unit by a computer. A dataset records the values of each variable, such as the height, weight, temperature, or volume of an object. Each individual value is known as a datum, and each row holds the data for one member of the dataset.
a) The study was conducted on the islands of the Palmer Archipelago, Antarctica.
b) The data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network.
c) The dataset was obtained from Kaggle.

Fig 1.7 Dataset of Palmer Penguins
Fig 1.8 Input Features Dataset
DATASET DESCRIPTION

| ATTRIBUTE NUMBER | ATTRIBUTE NAME | DESCRIPTION |
|---|---|---|
| 1 | SPECIES | The penguin species (Adélie, Chinstrap, or Gentoo) |
| 2 | ISLAND | The island in the Palmer Archipelago where the penguin lives ('Biscoe', 'Dream', 'Torgersen') |
| 3 | BILL LENGTH | Length of the penguin's bill (beak) |
| 4 | BILL DEPTH | Depth of the penguin's bill (beak) |
| 5 | FLIPPER LENGTH | Length of the flippers, which are used for propulsion, control, and rotation |
| 6 | BODY MASS | Total body mass of the penguin |
| 7 | SEX | Sex of the penguin (male or female) |

Table 1.1 Description of Dataset Attributes
DATA PRE-PROCESSING

Data preprocessing is a technique used to convert raw data into a clean dataset. Data gathered from different sources arrives in a raw format that is not suitable for analysis until it has been cleaned.

Fig 1.9 Preprocessing of Data: the number of null values and the removal of unwanted columns
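
The two preprocessing steps named in the caption, counting null values and removing unwanted columns, might look like the sketch below. It uses only the Python standard library and a tiny inline sample. The sample rows, the `comments` column, and the set of markers treated as missing are assumptions made for illustration and are not taken from the project.

```python
import csv
import io

# Tiny inline sample standing in for the dataset file (values invented).
RAW_CSV = """species,island,bill_length_mm,sex,comments
Adelie,Torgersen,39.1,male,
Adelie,Torgersen,,,
Gentoo,Biscoe,46.1,female,nest checked
Chinstrap,Dream,46.5,NA,
"""

MISSING = {"", "NA"}      # markers treated as null (assumption)
UNWANTED = {"comments"}   # columns to drop (hypothetical)


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_nulls(rows):
    """Number of missing values per column."""
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            if value in MISSING:
                counts[col] += 1
    return counts


def clean(rows):
    """Drop unwanted columns, then drop rows that still have missing values."""
    kept = []
    for row in rows:
        slim = {c: v for c, v in row.items() if c not in UNWANTED}
        if not any(v in MISSING for v in slim.values()):
            kept.append(slim)
    return kept


rows = load_rows(RAW_CSV)
null_counts = count_nulls(rows)
cleaned = clean(rows)
print(null_counts)
print(len(cleaned), "rows kept")
```

Dropping incomplete rows is only one option. Imputing missing values is a common alternative when the dataset is small.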

ALGORITHMS AND OUTPUT

 
RANDOM FOREST CLASSIFIER

Random Forest is a popular supervised machine learning algorithm. It can be used for both classification and regression problems. It is based on ensemble learning, the process of combining multiple classifiers to solve a complex problem and improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on a single decision tree, the random forest collects the prediction from each tree and outputs the class that receives the majority of votes.

A larger number of trees generally gives more stable accuracy and reduces the risk of overfitting compared with a single tree, although the gains level off as more trees are added.
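
The two ideas in this description, training each tree on its own subset of the data and combining the trees by majority vote, can be sketched in a few lines. This is an illustration only: the helper names `bootstrap_sample` and `majority_vote` are hypothetical, the per-tree predictions are invented, and no actual trees are trained here.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement: the training subset for one tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


# Each tree would be trained on its own bootstrap sample of the data.
rng = random.Random(0)
training_rows = list(range(10))   # stand-in for real training records
subset = bootstrap_sample(training_rows, rng)

# Hypothetical per-tree predictions for one penguin.
tree_predictions = ["Adelie", "Chinstrap", "Adelie", "Adelie", "Gentoo"]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)  # Adelie
```

Because each tree sees a different sample, their individual errors tend to differ, and the vote averages many of them out.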
 

Fig 1.7 Output Page 1
Fig 1.8 Output Page 2
Fig 1.9 Output Page 3
Fig 1.10 Output Page 4
Fig 1.11 Output Page: displaying the predicted values for the uploaded CSV file
DATA VISUALIZATION VIA TABLEAU

Tableau is a data visualization tool that is widely used for business intelligence but is not limited to it. It helps create interactive graphs and charts in the form of dashboards and worksheets to gain insights. As newcomers to Tableau dashboards, we decided to explore the data as scatter plots using all 3 categorical columns (species, gender, location). One of the first observations this layout enables concerns the differing distribution of the three species across the islands. To look for sex-specific differences, we marked the gender of each penguin in the dataset, excluding nulls and records where the information was missing.

Adelie, the original but disappearing inhabitant of the region, has colonies on all three islands in the dataset, while Chinstrap is found only on Dream Island and Gentoo only on Biscoe Island. To understand the effects of climate change on these penguins, visit the educational resources at the Palmer Station Antarctica LTER.

Plotting bill and flipper lengths gives a measure of size differences clustered by species and gender. Gentoos have the largest flippers. The difference in bill lengths is not as dramatic, though Adelies do seem to have shorter beaks. Body mass, which can vary in penguins depending on seasonal cycles, is plotted separately, with Gentoo well in the lead, befitting one of the largest known penguin species. Notable is the sexual size dimorphism among the Pygoscelis penguins, examined in the original reference, Gorman et al., 2014.

Fig 1.12 Dashboard Created on Tableau
Fig 1.13 Sheets Created on Tableau
COMPARISON BASED ON PERFORMANCE

1. DECISION TREE:

Decision Tree is one of the most widely used tools for classification and prediction. A decision tree is a flowchart-like tree structure in which each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.
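
The structure just described, internal nodes that test an attribute, branches for each outcome, and leaves that hold a class label, can be written out by hand as a small nested dictionary. The thresholds below are hypothetical and were not learned from the Palmer dataset. They only show how a prediction walks from the root to a leaf.

```python
# A hand-written tree for illustration only. The thresholds are hypothetical
# and were not learned from the Palmer dataset.
TREE = {
    "feature": "flipper_length_mm", "threshold": 206.0,   # internal node
    "left": {                                              # branch: test passed
        "feature": "bill_length_mm", "threshold": 43.0,   # internal node
        "left": {"label": "Adelie"},                       # leaf
        "right": {"label": "Chinstrap"},                   # leaf
    },
    "right": {"label": "Gentoo"},                          # leaf
}


def predict(node, sample):
    """Walk from the root to a leaf, following the outcome of each test."""
    while "label" not in node:
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]


print(predict(TREE, {"flipper_length_mm": 215.0, "bill_length_mm": 47.0}))
```

A learning algorithm would choose the features and thresholds automatically, picking at each node the split that best separates the classes in the training data.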

2. SUPPORT VECTOR MACHINE:

Support Vector Machine, or SVM, is one of the most popular supervised learning algorithms. It can be used for classification as well as regression problems, although it is primarily used for classification. The goal of the SVM algorithm is to find the best line or decision boundary that separates n-dimensional space into classes, so that new data points can be placed in the correct category. This best decision boundary is called a hyperplane.
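
Once a hyperplane is known, classifying a new point reduces to checking which side of it the point falls on. The sketch below shows only that final step for two classes. The weights, bias, feature pair, and class names are hypothetical, and the training step that would find the maximum-margin hyperplane is not shown.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign says which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(weights, bias, x, positive="Gentoo", negative="Adelie"):
    """Assign the class for the side of the hyperplane on which x lies."""
    return positive if decision_value(weights, bias, x) >= 0 else negative


# Hypothetical hyperplane over (flipper_length_mm, bill_depth_mm).
WEIGHTS = [0.1, -0.5]
BIAS = -11.0

print(classify(WEIGHTS, BIAS, [215.0, 15.0]))   # long flipper, shallow bill
print(classify(WEIGHTS, BIAS, [190.0, 18.5]))   # short flipper, deep bill
```

A problem with three species is typically handled by combining several such two-class boundaries, for example one per pair of classes.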

3. RANDOM FOREST:

Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and
Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
COMPARISON OF ALGORITHMS BASED ON PERFORMANCE – PYTHON

| ALGORITHM | ACCURACY (%) | PRECISION (%) | RECALL (%) |
|---|---|---|---|
| DECISION TREE | 97.01 | 97.02 | 96.98 |
| SVM | 96.21 | 96.05 | 96.35 |
| RANDOM FOREST | 97.76 | 96.76 | 97.26 |

Table 1.2 Comparison of Algorithms Based on Performance – Python
Fig 1.14 Comparison of Algorithms Based on Performance – Graphs

The table above compares the classification algorithms on accuracy, precision, and recall.
Random forest achieves the highest accuracy (97.76) and recall (97.26), while the decision tree
has slightly higher precision (97.02 against 96.76). Overall, random forest performs best among
the three algorithms.
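
For reference, the three metrics in the table can be computed from true and predicted labels as sketched below. The deck does not say how precision and recall were averaged across the three species, so macro averaging (the unweighted mean over classes) is an assumption here, and the label lists are invented for the example.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, macro precision and macro recall as percentages."""
    labels = sorted(set(y_true) | set(y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precisions, recalls = [], []
    for label in labels:
        # True positives: predicted as this label and actually this label.
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    n = len(labels)
    return {
        "accuracy": 100 * correct / len(y_true),
        "precision": 100 * sum(precisions) / n,
        "recall": 100 * sum(recalls) / n,
    }


# Invented labels for illustration: one Adelie is mistaken for a Chinstrap.
y_true = ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Gentoo"]
y_pred = ["Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo", "Gentoo"]
scores = evaluate(y_true, y_pred)
print(scores)
```

Reporting precision and recall alongside accuracy matters when classes are unevenly represented, since accuracy alone can hide poor performance on a rare class.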

CONCLUSION

We applied data visualization methods to understand the dataset better and used preprocessing to handle missing data. We used the Plotline, Seaborn, and Matplotlib libraries to draw different interactive plots describing the relationships among variables. Several features on their own show a prominent separation that helps distinguish the penguin species. Using a Random Forest classifier, we were able to use the most important features of the dataset and reach the desired accuracy.

The penguin dataset from the palmerpenguins package works well for teaching data science concepts such as correlation, regression, and classification, and can also be used to teach data visualization. For anyone who writes many data science articles and needs a versatile dataset, penguins is an option worth exploring.


 

FUTURE WORK

 
 

As future work, this approach can be extended to species of other taxa, classifying them based on physical features provided by the user. The classification can also be extended to genes through gene analysis, for example the classification of Homo sapiens hemoglobin genes (HBA1, HBA2, and HBB).

REFERENCES
 

1. Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). Kristen B. Gorman, Tony D. Williams, William R. Fraser, 2014.

2. A memetic algorithm using emperor penguin and social engineering optimization for medical data classification. SK Baliarsingh, W Ding, S Vipsita, S Bakshi. Applied Soft Computing, 2019, Elsevier.

3. An Emperor Penguin Population Estimate: The First Global, Synoptic Survey of a Species from Space. Peter T. Fretwell, Michelle A. LaRue, Paul Morin, Gerald L. Kooyman, Barbara Wienecke, Norman Ratcliffe, Adrian J. Fox, Andrew H. Fleming, Claire Porter, Phil N. Trathan.

4. Random forests: from early developments to recent advancements. Khaled Fawagreh, Mohamed Medhat Gaber & Eyad Elyan.

5. Learning Tableau: A data visualization tool. Steven Batt, Tara Grealis, Oskar Harmon & Paul Tomolonis.
