
PREMIER UNIVERSITY, CHATTOGRAM
Department of Computer Science & Engineering

7th Semester Project Report


on
Exploring Multiclass Classification: A YEAST Dataset Analysis

SUBMITTED BY
Name: Oishe Dey
ID: 2003810202094

SUBMITTED TO
MD. Tamim Hossain
Lecturer
Department of Computer Science & Engineering
Premier University, Chattogram

13th February, 2023


Contents

1 Introduction

2 OBJECTIVE
2.1 Objective

3 ABOUT DATASET
3.1 Description About Dataset

4 Problem Statement
4.1 Problem Statement

5 Dataset Collection
5.1 Data Collection

6 DATA PREPROCESSING
6.1 Data Preprocessing

7 MODEL ARCHITECTURE
7.1 MODEL ARCHITECTURE
7.1.1 Neural Network

8 RESULTS
8.1 Count Plot & Pie Chart
8.2 Data Preprocessing
8.3 Evaluate Logistic Regression
8.4 Confusion Matrix
8.5 Train accuracy & validation accuracy
8.6 Train loss & validation loss

9 Conclusion
9.1 Conclusion

10 References
10.1 References

List of Figures

7.1 Neural Network

8.1 Count Plot
8.2 Pie Chart
8.3 Data Preprocessing
8.4 Evaluate Logistic Regression
8.5 Confusion Matrix
8.6 Train accuracy & validation accuracy
8.7 Train loss & validation loss
Chapter 1

Introduction

Multiclass Classification: Multiclass classification in neural networks involves training a model to categorize input data into more than two distinct classes. Unlike binary classification, where the task is to classify inputs into two categories, multiclass classification deals with scenarios where there are multiple possible classes. In neural networks designed for multiclass classification, the output layer typically consists of one neuron per class, and the model is trained to produce a probability distribution over these classes; the predicted class is usually the one with the highest probability. This approach is commonly employed in tasks like image recognition, where the goal is to identify objects belonging to various categories within a single model. During training, the network adjusts its parameters to assign input samples to the correct class labels.

Logistic Regression: Logistic regression is a linear model used for binary classification tasks. Despite its name, it is often employed as the output layer of a neural network when the goal is to predict the probability of an input belonging to one of two classes. The logistic function, also known as the sigmoid function, is applied to the weighted sum of the input features, transforming the output into a value between 0 and 1, which is interpreted as the probability of the input belonging to the positive class. During training, the model adjusts its weights through techniques like gradient descent to minimize the difference between predicted probabilities and actual class labels. Though logistic regression is a simple model, when used as part of a neural network it can contribute to more complex architectures for various machine learning tasks.

Confusion Matrix: A confusion matrix is a tabular representation that summarizes the performance of a classification model. It compares the predicted classifications against the true labels, breaking the results down into four categories: true positives (correctly predicted positive instances), true negatives (correctly predicted negative instances), false positives (instances incorrectly predicted as positive), and false negatives (instances incorrectly predicted as negative). The matrix provides valuable insight into the model's accuracy, precision, and recall, serving as a useful tool for evaluating the effectiveness of a neural network in classification tasks.

Decision Tree: A Decision Tree is a non-linear supervised machine learning algorithm used for both classification and regression tasks. It is a standalone model rather than a neural network architecture. The algorithm recursively splits the dataset into subsets based on the most significant features, creating a tree-like structure of decision nodes. Each node represents a decision based on a feature, leading to subsequent nodes until a final prediction is reached. Decision Trees are transparent and easy to interpret, making them valuable for understanding feature importance in a model. While not part of neural networks themselves, Decision Trees can complement neural network models within ensemble methods for improved performance.

Adam Optimizer: The Adam optimizer is a popular optimization algorithm used in training neural networks. Combining elements of both Momentum and RMSprop, Adam adjusts the learning rate for each parameter individually. It maintains two moving averages for each parameter: the first moment (mean) and the second moment (uncentered variance) of the gradients. These moving averages are used to adaptively scale the updates during training. Adam is known for its effectiveness in handling sparse gradients, making it suitable for a wide range of deep learning tasks, and its adaptive learning rate and momentum contribute to faster convergence and improved performance when optimizing neural network weights.
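For reference, the standard Adam update rules that implement the moving-average description above can be written, for each parameter θ with gradient g_t at step t, as:

\[
\begin{aligned}
m_t &= \beta_1\, m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
\]

Here α is the learning rate, and the original Adam paper suggests the defaults β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸.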

Chapter 2

OBJECTIVE

2.1 Objective
The provided code aims to read a CSV file named 'yeast.csv' from Google Drive into a Pandas DataFrame. It then displays basic information about the DataFrame, including its shape (the number of rows and columns). The dataset appears to have columns labeled 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', and 'name', though the data itself is not shown in the code snippet. The objective is to provide a brief overview and summary statistics of the yeast dataset.
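A minimal sketch of this loading step, assuming the code runs in Google Colab and that the file sits at the hypothetical path MyDrive/yeast.csv in the mounted Drive:

# Hypothetical loading step; the Drive path is an assumption.
from google.colab import drive
import pandas as pd

drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/yeast.csv')

# Basic overview: dimensions, column names, dtypes, and summary statistics.
print(data.shape)              # (number of rows, number of columns)
print(data.columns.tolist())
data.info()
print(data.describe())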

Chapter 3

ABOUT DATASET

This dataset contains the following features: 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', and 'name'.

3.1 Description About Dataset


The code snippet loads a dataset named 'yeast.csv' from Google Drive into a Pandas DataFrame and provides basic information about it; the actual content of the dataset is not displayed in the snippet. The dataset has the following columns: 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', and 'name'. Each column likely represents a different feature or attribute of the data, and 'name' appears to be a categorical label. Because the snippet does not show the specific content of the dataset, detailed insights into its distribution, statistics, or patterns are not available from the code alone. To gain a more comprehensive understanding of the dataset, further exploration, analysis, and visualization would be necessary. This might include examining summary statistics, visualizing distributions, exploring relationships between variables, and checking for missing or inconsistent data.

Chapter 4

Problem Statement

4.1 Problem Statement


The code reads a dataset named 'yeast.csv' from Google Drive into a Pandas DataFrame and provides basic information about it, such as the number of rows and columns. However, the purpose of analyzing this yeast dataset is not explicitly defined in the provided code snippet. A clear problem statement is crucial for understanding the context and goals of the analysis. It could include questions like: What specific insights or patterns are we seeking from the yeast dataset? Are there particular features of interest, and what information are we trying to extract from them? Is there a goal of prediction, classification, or some other data-driven task? Without a well-defined problem statement, it is difficult to determine the objectives of the analysis or the intended use of the dataset. Clarifying the problem statement would guide the subsequent data exploration, analysis, and visualization steps towards specific objectives and questions.

Chapter 5

Dataset Collection

5.1 Data Collection


The provided code snippet is designed to load a dataset named 'yeast.csv' from Google Drive into a Pandas DataFrame for further analysis. The dataset seems to contain information related to yeast, with columns labeled 'mcg', 'gvh', 'alm', 'mit', 'erl', 'pox', 'vac', 'nuc', and 'name'. The snippet then displays basic information about the DataFrame, including the number of rows and columns, to provide an initial overview of the dataset. However, the snippet does not explicitly mention how and why the dataset was collected. A comprehensive dataset collection description might include the source of the data, the context in which it was collected, any preprocessing steps applied, and the purpose of using the dataset for analysis. Understanding the dataset's origin and context is essential for interpreting the findings and drawing meaningful conclusions during the subsequent data analysis process.

Chapter 6

DATA PREPROCESSING

6.1 Data Preprocessing


Data preprocessing is a crucial step in the data analysis pipeline, involving the cleaning and transformation of raw data into a format suitable for machine learning algorithms. The goal is to enhance the quality of the dataset, address missing or inaccurate values, and prepare the data for effective model training and analysis. This process includes several key tasks, such as handling missing data through imputation or removal, scaling numerical features to a standard range, encoding categorical variables into numerical representations, and detecting and addressing outliers. Data preprocessing is essential to ensure that the machine learning model can effectively learn patterns within the data, leading to more accurate and reliable predictions.

Furthermore, data preprocessing involves exploring and understanding the dataset in order to make informed decisions about feature selection and engineering. It may also include techniques like normalization, where data is transformed to a standard scale, and the handling of skewed distributions to achieve a more balanced representation. Proper data preprocessing contributes significantly to the success of machine learning models by mitigating issues arising from incomplete or inconsistent data, ultimately enhancing the model's ability to generalize and make accurate predictions on new, unseen data.
First, the data is loaded: the Yeast dataset is read from a CSV file into a Pandas DataFrame. Next, the data is preprocessed. The class column is one-hot encoded where needed, and the original dataset is inspected with fulldata.info(), which reports column data types, non-null counts, and memory usage. Rows with missing values are dropped using fulldata.dropna(); a comment notes that this step is not usually needed for the Yeast dataset, suggesting it has no significant missing values. Since the target variable ('name') is categorical, it is encoded with LabelEncoder() to convert the categorical labels into numerical form, and the encoded values are stored back in the 'name' column. Features (X) and the target variable (y) are then separated: X is obtained by dropping the 'name' column, and y holds the target. The dataset is split into training and testing sets with train_test_split(), using a test size of 20% and a fixed random state for reproducibility; the resulting sets are X_train, X_test, y_train, and y_test. Standardization is performed on the features using StandardScaler(); this step is optional and depends on the algorithm's sensitivity to the scale of the input features. The scaled training and testing sets are stored in X_train_scaled and X_test_scaled, respectively. Information about the preprocessed dataset is displayed, including the shapes of the training and testing sets. Optionally, if the scaled features need to be inverse-transformed later, the code provides comments on how to perform the inverse transformation.

This preprocessing pipeline prepares the dataset for machine learning tasks, ensuring that it is free of missing values, properly encoded, and split into training and testing sets. Feature scaling is applied for standardization, and the optional inverse-scaling step covers scenarios where the original feature scale is needed.
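A condensed sketch of this pipeline, assuming the DataFrame fulldata has the columns listed earlier and the target column is 'name' (the variable names mirror the description above but are otherwise illustrative):

# Illustrative preprocessing pipeline following the steps described above.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

fulldata = pd.read_csv('yeast.csv')   # assumed local path
fulldata.info()                       # dtypes, non-null counts, memory usage

# Usually unnecessary for the Yeast dataset, which has no missing values.
fulldata = fulldata.dropna()

# Encode the categorical target into integer labels.
le = LabelEncoder()
fulldata['name'] = le.fit_transform(fulldata['name'])

# Separate features and target.
X = fulldata.drop('name', axis=1)
y = fulldata['name']

# 80/20 train/test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Optional standardization of the features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)

# If the original scale is needed later:
# X_train_original = scaler.inverse_transform(X_train_scaled)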



Chapter 7

MODEL ARCHITECTURE

7.1 MODEL ARCHITECTURE


The code sets up a neural network using the Keras library. Here is a breakdown in simpler terms (a consolidated sketch follows the list):

1. ‘model = Sequential()‘: Initiates a neural network model that will be built layer by layer.

2. ‘model.add(Dense(64, activation='relu', input_dim=X_train.shape[1]))‘: Adds the first layer to the model. This layer has 64 neurons, uses the 'relu' activation function (which helps the model learn complex patterns), and takes input whose dimensionality is given by ‘X_train.shape[1]‘.

3. ‘model.add(Dense(32, activation='relu'))‘: Adds a second layer with 32 neurons and 'relu' activation. This layer does not need the input dimensions specified, because it infers them from the previous layer.

4. ‘model.add(Dense(6, activation='softmax'))‘: Adds the output layer. This layer has 6 neurons, one per possible class of the target. The 'softmax' activation function, typical for multiclass classification problems, converts the output values into probabilities, making the model's predictions easier to interpret.
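A minimal sketch of the architecture described above; the compile step is an assumption based on the Adam optimizer and multiclass setup discussed in this report, not shown verbatim in the original code:

# Sketch of the described model; assumes X_train from Chapter 6.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=X_train.shape[1]))  # hidden layer 1
model.add(Dense(32, activation='relu'))                              # hidden layer 2
model.add(Dense(6, activation='softmax'))                            # one neuron per class

# Assumed training configuration: Adam optimizer (see Chapter 1) and a
# loss suited to the integer-encoded labels produced by LabelEncoder.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])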

In summary, this code defines a neural network with two hidden layers and an output layer for a multiclass classification task (assuming 6 classes). The 'relu' activation function is used in the hidden layers, and 'softmax' is used in the output layer to predict the class probabilities.


7.1.1 Neural Network

Figure 7.1: Neural Network



Chapter 8

RESULTS

8.1 Count Plot & Pie Chart


The provided code generates two visualizations that depict the distribution of classes within the target column ('name') of the dataset, accompanied by a discussion of data preprocessing steps.

The count plot, on the left, presents the absolute frequency of each class using vertical bars. Each bar represents the count of occurrences of a specific class, allowing a direct comparison of class frequencies. This visualization is a crucial tool for identifying potential class imbalances within the dataset, allowing a precise assessment of the relative sizes of the different classes.

On the right, the pie chart offers a complementary perspective by illustrating the distribution of classes as proportions of the total dataset. Each slice of the pie corresponds to the proportion of a class relative to the entire dataset, with percentage labels providing precise numerical information. This chart gives a holistic view of the relative sizes of the classes, enabling quick comparisons of class proportions and aiding in the identification of dominant and minority classes within the dataset.

The analysis is further enriched by data preprocessing steps. Preprocessing plays a vital role in preparing the dataset for machine learning tasks, ensuring data quality and enhancing model performance. Common preprocessing steps include handling missing values, feature scaling, and dealing with class imbalance. For class imbalance specifically, techniques such as oversampling, undersampling, or synthetic data generation can be employed to mitigate bias and improve model generalization.

By using these visualizations in conjunction with appropriate preprocessing steps, analysts can assess the balance of classes within the dataset and make informed decisions about model training and evaluation. This approach enhances the robustness of the resulting models and fosters fairness in the predictive outcomes.
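A plausible sketch of how these two plots could be produced; seaborn and matplotlib are assumptions, since the original plotting code is not shown:

# Illustrative count plot and pie chart for the 'name' target column,
# assuming the raw (pre-encoding) DataFrame from Chapter 6.
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left: absolute frequency of each class.
sns.countplot(x='name', data=fulldata, ax=axes[0])
axes[0].set_title('Count Plot')

# Right: class proportions with percentage labels.
counts = fulldata['name'].value_counts()
axes[1].pie(counts, labels=counts.index, autopct='%1.1f%%')
axes[1].set_title('Pie Chart')

plt.tight_layout()
plt.show()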


Figure 8.1: Count Plot

Figure 8.2: Pie Chart


8.2 Data Preprocessing

Figure 8.3: Data Preprocessing.


8.3 Evaluate Logistic Regression

Figure 8.4: Evaluate Logistic Regression.
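A hedged sketch of the evaluation step behind Figure 8.4, assuming scikit-learn and the scaled splits from Chapter 6:

# Illustrative logistic regression training and evaluation.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

log_reg = LogisticRegression(max_iter=1000)  # max_iter raised for convergence
log_reg.fit(X_train_scaled, y_train)

y_pred = log_reg.predict(X_test_scaled)
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))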

8.4 Confusion Matrix


A confusion matrix is a way of measuring how well a model can classify different categories of data. It compares the actual categories (true labels) with the ones the model predicted (predicted labels). The number in each cell shows how many times the model got it right or wrong for each category. Ideally, all the counts should fall on the diagonal, which means the true and predicted labels matched perfectly. Counts off the diagonal are errors, meaning the model confused one category with another. The fewer the errors, the better the model.
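A brief sketch of how Figure 8.5 could be generated, assuming the predictions from the evaluation step above:

# Illustrative confusion matrix for the fitted classifier
# (y_test and y_pred come from Section 8.3's sketch).
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()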

Figure 8.5: Confusion Matrix


8.5 Train accuracy & validation accuracy

Figure 8.6: Train accuracy & validation accuracy.

8.6 Train loss & validation loss

Figure 8.7: Train loss & validation loss.
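The curves in Figures 8.6 and 8.7 are typically drawn from the Keras History object returned by model.fit(); a hedged sketch follows, in which the validation split, epoch count, and batch size are assumptions:

# Illustrative training run and accuracy/loss curves (Figures 8.6 and 8.7).
# Assumes the compiled model from Chapter 7 with metrics=['accuracy'].
import matplotlib.pyplot as plt

history = model.fit(X_train_scaled, y_train,
                    validation_split=0.2,   # assumed
                    epochs=50, batch_size=32)

# Train accuracy & validation accuracy (Figure 8.6).
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.legend()
plt.show()

# Train loss & validation loss (Figure 8.7).
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.legend()
plt.show()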

Chapter 9

Conclusion

9.1 Conclusion
The logistic regression and random forest models are trained and evaluated, providing insight into classification performance on the yeast dataset. The confusion matrix and accuracy scores give a detailed view of model performance, and the visualizations aid in understanding the training process. Combining Logistic Regression and Random Forest allows for a balanced evaluation of the dataset, contributing to a more robust analysis. This conclusion is based on the provided code snippets; additional insight could be gained by exploring hyperparameter tuning, feature importance analysis, and other machine learning algorithms for comparison.

Chapter 10

References

10.1 References
[1] Dataset. https://www.kaggle.com/datasets/yasserh/bmidataset/data

[2] Kaggle Documentation: Logistic Regression. https://www.kaggle.com/code/shruthilayaalvale/logistic-regression-03

[3] Kaggle Documentation: Deep Dive into Multiclass Classification. https://www.kaggle.com/code/shrutimechlearn/deep-dive-into-multiclass-classification/notebook

[4] OpenAI. https://chat.openai.com/

[5] Google. http://google.com/
