CHITKARA UNIVERSITY
BADDI HIMACHAL PRADESH-174103 (INDIA)
A PROJECT REPORT
ON
US Census Income
Submitted by
Abhishek (2011981005)
Anuj (2011981014)
Rishabh Verma (2011981099)
Supervised By
Mr. Shivam Singh
DECLARATION
We, the undersigned collaborators involved in the development and compilation of this project
report titled 'US Census Income,' affirm the absolute authenticity, originality, and integrity of the
work presented herein.
We unequivocally declare that this report has not been previously submitted, either in part or in
whole, for any other academic or professional qualification. It stands as a distinctive and original
scholarly creation adhering to the highest ethical standards and academic integrity expected
within our educational institution.
Additionally, we acknowledge that any contributions from external sources, whether in the form
of direct quotations, paraphrased information, or borrowed concepts, have been appropriately
cited and referenced, demonstrating our commitment to upholding established academic
conventions.
By undersigning this declaration, we assume full responsibility for the authenticity, accuracy,
and originality of the content presented in this project report.
Date: 20/11/2023
ACKNOWLEDGEMENT
We extend our deepest gratitude to Mr. Shivam Singh for his dedicated mentorship, expert
guidance, and unwavering support throughout this project. His insightful feedback and
commitment to excellence played a pivotal role in navigating the complexities of our research
and refining our methodologies.
Our appreciation also goes to the esteemed faculty and staff of the Department of Computer
Science and Engineering at Chitkara University. Their scholarly wisdom, continuous support,
and encouragement created an intellectually stimulating environment, fuelling our academic
growth and providing essential resources for this endeavour.
Heartfelt thanks are due to our peers and colleagues for their collaborative spirit, thought-
provoking discussions, and constructive criticism. Their diverse perspectives and shared
enthusiasm significantly contributed to the evolution of our ideas and the refinement of our
project.
Our profound appreciation extends to our families and friends for their unwavering support,
understanding, and encouragement. Their patience and belief in our aspirations were
fundamental in sustaining us through the demanding phases of this academic pursuit.
This project's feasibility owes much to the collective support, guidance, and encouragement from
these individuals and institutions. Whether in mentoring, scholarly resources, or personal
support, their invaluable contributions have shaped this research endeavour and our broader
academic journey.
ABSTRACT
During this project, our undertaking involved delving into a real-world dataset, where our
objective was to explore the application of machine learning algorithms in discerning intricate
patterns within the data.
Beyond the technical facets, this project serves as a testament to the interdisciplinary nature of
modern data science. It underscores the importance of ethical considerations in handling
sensitive demographic information and the responsibility that comes with wielding predictive
models. As we harnessed the power of machine learning, we also recognized the ethical
imperative to ensure fairness, transparency, and accountability in our analyses.
The US Census Income dataset, akin to an expansive puzzle, serves as a critical tool for
governmental insights and research endeavours aimed at comprehending individuals' financial
landscapes. This colossal survey engages myriad individuals, extracting details about their
earnings from diverse occupational sources.
The significance of this multifaceted puzzle lies in its pivotal role in governmental decision-
making processes, influencing the allocation of resources to vital sectors such as education,
healthcare, and infrastructure development.
Moreover, it acts as a compass guiding authorities in identifying individuals who might require
additional support, such as financial assistance. Researchers leverage this rich dataset to delve
into nuanced aspects like income inequality and the dynamic shifts in people's financial
circumstances over temporal trajectories.
Table of Contents
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
Chapter 1: INTRODUCTION
1.1 Background
1.2 Problem Statement
1.3 Problem Aim
1.4 Chapter Overview
CHAPTER 2: SYSTEM REQUIREMENT
2.1 Introduction
2.2 Software and Hardware Requirements
Software Requirements
Hardware Requirements
2.3 Functional and Non-Functional Requirements
Functional Requirements
Non-Functional Requirements
2.4 Summary
CHAPTER 3: SYSTEM DESIGN
3.1 Introduction
3.2 Proposed System
3.3 Data Flow Diagram
3.3.1 Description
3.3.2 Uses of DFDs
3.3.3 Level of Abstraction
3.4 Summary
CHAPTER 4: METHODOLOGY
4.1 Dataset Description
4.2 Data Acquisition and Preparation
4.3 Feature Selection and Exploratory Data Analysis
4.4 Feature Engineering & Selection
4.5 Data Pre-processing (Splitting & Balancing the Dataset)
4.6 Proposed Machine Learning Algorithms for Classification
CHAPTER 5: EXPERIMENTAL RESULTS
5.1 Result
5.2 Deployment
CHAPTER 6: CONCLUSION AND FUTURE SCOPE
REFERENCE
APPENDICES
Chapter 1: INTRODUCTION
1.1 Background
The US Census Income dataset is akin to a large survey, capturing information about individuals'
income derived from various sources, providing a nuanced understanding of the economic
landscape. Its significance lies in its role as a crucial tool for governmental resource allocation,
influencing decisions regarding investments in areas such as education, healthcare, and
infrastructure. Moreover, the dataset acts as a lens through which researchers scrutinize income
inequality and societal financial dynamics.
This project's focus is to employ machine learning methodologies to gain deeper insights into the
dataset's intricate patterns and relationships. The challenges posed by the dataset, such as its
large volume, imbalanced class distribution, missing data, and the need for effective categorical
encoding, are acknowledged. The methodology involves rigorous data collection, cleaning,
exploration, and visualization, leveraging Python and associated libraries like Pandas, NumPy,
and Scikit-Learn. Machine learning algorithms, including Decision Trees, Random Forests, and
Logistic Regression, are implemented for predictive modeling.
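As an illustrative sketch of this workflow, the snippet below fits the three algorithms named above with Scikit-Learn. The handful of rows is invented purely for demonstration; only the column names mirror the real Census Income attributes.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the Adult/Census Income data (rows are invented).
df = pd.DataFrame({
    "age":            [25, 38, 52, 41, 29, 60, 33, 47],
    "hours-per-week": [40, 50, 60, 40, 35, 45, 40, 55],
    "education-num":  [10, 13, 14, 9, 12, 16, 10, 13],
    "income":         [0, 0, 1, 0, 0, 1, 0, 1],   # 1 means ">50K"
})
X, y = df.drop(columns="income"), df["income"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the three classifiers mentioned in the text and score them on held-out rows.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree":   DecisionTreeClassifier(random_state=42),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```

On the real 48,842-row dataset the same structure applies, preceded by the cleaning and encoding steps discussed later.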
In essence, the US Census Income dataset stands not merely as a collection of numbers but as a
dynamic and multifaceted resource that fuels crucial societal insights, shapes policy decisions,
and empowers researchers in their pursuit of understanding and addressing the complex
landscape of income dynamics within a nation.
A significant challenge is the imbalanced nature of the data, where one income class dominates
the other, potentially introducing bias into the model outcomes. Moreover, the vast size of the
dataset poses computational hurdles, necessitating meticulous optimization efforts to ensure
efficient and effective model training. In addition to these challenges, the project emphasizes the
crucial task of feature selection, recognizing that not all attributes contribute equally to the
model's predictive capacity.
1.4 Chapter Overview
Chapter 1: Introduction
This chapter presents the problem statement, the motivation for selecting it, the contribution this work makes toward solving it, and the execution plan.
"System requirements" refer to the specifications and capabilities that a computer or software
application needs to operate efficiently. These requirements typically include details such as
minimum and recommended hardware specifications, supported operating systems, required
software dependencies, and sometimes network or internet connectivity specifications.
Chapter 3: System Design
System design is a phase in the software development process that involves creating the architectural blueprint for a computer-based system. It specifies how the system components should be organized and interact to fulfil the stated requirements, including decisions about the overall structure, modules, interfaces, and data.
Chapter 4: Methodology
This chapter entails an in-depth exploration of the dataset, aiming to extract meaningful insights. We then unravel patterns through exploratory analysis and conclude by discussing the tools and techniques employed for the prediction task.
Chapter 5: Experimental Results
This chapter elucidates the evaluation metrics used to assess the performance of each machine learning model. These metrics provide insight into the effectiveness, accuracy, and stability of the models, aiding in the identification of the optimal solution for the predictive task.
Chapter 6: Conclusion and Future Scope
This chapter offers a condensed summary of the report, provides essential recommendations based on our analysis and insights, and explores avenues for improving the model using diverse AI techniques.
CHAPTER 2: SYSTEM REQUIREMENT
2.1 Introduction
Requirement analysis is a crucial phase in product development, essential for assessing the
viability of an application. It encompasses software, hardware, and functional requirements.
Software requirements pertain to the specifications addressing end-user issues through software
solutions. In simpler terms, they define what the software should accomplish for users. This
involves understanding and documenting user needs, functionalities, and features to ensure the
software effectively meets its intended purpose. It serves as a blueprint for the development
team, guiding them in creating a solution aligned with user expectations and business objectives.
1. Elicitation: This phase involves gathering information directly from end users and
customers. It aims to capture their needs, preferences, and expectations, providing a
foundation for further analysis.
2. Analysis: In the analysis stage, the gathered information is logically examined to gain a
comprehensive understanding of customer needs. The goal is to refine and clarify
requirements, ensuring they are precise and aligned with the project's objectives.
3. Specification: The refined requirements are documented in a structured form, such as a software requirements specification, so that they can be reviewed and referenced throughout development.
4. Validation: Validation is the process of verifying and confirming that the specified
requirements meet the intended objectives. This step ensures that the documented
requirements are accurate, complete, and consistent with the stakeholders' expectations.
5. Management: Requirements management is an ongoing process throughout development.
As projects evolve, requirements may change, necessitating continuous testing and
updates. This stage involves tracking, testing, and updating requirements as needed.
2.2 Software and Hardware Requirements
Hardware requirements specify the physical computing resources the application needs. These include:
♦ Processor Cores and Threads: The processing power of the central processing unit (CPU),
including the number of cores and threads, is a critical hardware requirement.
♦ GPU Processing Power: Graphics processing units (GPUs) play a crucial role in handling
graphical tasks and parallel processing. The required GPU power depends on the nature
of the application.
♦ Memory: The amount of random-access memory (RAM) available impacts the system's
ability to handle multiple tasks simultaneously. Sufficient memory is essential for optimal
performance.
♦ Secondary Storage: Adequate secondary storage, such as hard disk drives (HDDs) or
solid-state drives (SSDs), is necessary for storing data and applications.
♦ Network Connectivity: The ability to connect to a network is crucial for applications that
rely on data exchange or require online functionality.
In essence, hardware requirements ensure that the software being developed is compatible with
and can fully leverage the capabilities of the underlying physical infrastructure.
Software Requirements:
• Operating System: Windows 8 or above (64-bit) is required, serving as the interface between user programs and the kernel.
• Jupyter Notebook: This open-source web application facilitates the creation and sharing
of documents containing live code, equations, visualizations, and narrative text. Jupyter
Notebook finds applications in data cleaning, transformation, numerical simulation,
statistical modelling, data visualization, machine learning, and more.
• Data Set: The dataset encompasses 48,842 records, featuring a binary target variable with values "0" and "1".
Hardware Requirements:
• Processor: An Intel i5 processor with a base frequency of 2.5 GHz, up to 3.5 GHz (or an
equivalent AMD processor), is recommended.
• GPU (preferred): For enhanced performance, a dedicated GPU from NVIDIA or AMD
with a minimum of 4GB VRAM is preferred.
• Memory: A minimum of 8GB RAM is required to support effective data processing and
analysis.
• Secondary Storage: The software necessitates a minimum of 128GB SSD (Solid State
Drive) or HDD (Hard Disk Drive) for storing data and applications.
In summary, these software and hardware requirements outline the components necessary for analysing the US Census Income data. They ensure compatibility, optimal performance, and efficient data handling throughout the analysis process.
2.3 Functional and Non-Functional Requirements
Functional Requirements:
Data Pre-processing: The system must perform data pre-processing, which involves cleaning,
transforming, and reducing data to convert raw data into a useful format.
Training: Initially, the system needs to undergo a training phase based on the provided dataset.
During this period, the system learns how to perform the required task based on the inputs
provided through the dataset.
Forecasting: The system is required to perform forecasting, which is the process of making
predictions about the future based on past and present data. This may involve analyzing trends to
make informed predictions.
Evaluation: To assess the system's efficacy in predicting annual income (">50K" or "<=50K"), the predicted outcomes are subjected to an evaluation process. The model-generated predictions are validated against known data, or ground truth, to determine the model's accuracy.
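The evaluation step described above can be illustrated with Scikit-Learn's metrics. The two label vectors here are invented purely to show the computation, not real model output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented ground-truth labels and model predictions for six individuals.
y_true = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"]
y_pred = ["<=50K", ">50K",  ">50K", ">50K", "<=50K", "<=50K"]

acc = accuracy_score(y_true, y_pred)   # correct outputs / total outputs
print(acc)                             # 4 of 6 correct

# The confusion matrix breaks accuracy down by class.
print(confusion_matrix(y_true, y_pred, labels=["<=50K", ">50K"]))
```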
Non-Functional Requirements:
Accuracy: The system's performance will be measured by its accuracy, which is defined as the
number of correct outputs divided by the total number of outputs. The system should strive for
high accuracy in its predictions.
Availability: The system must demonstrate efficient operation over a specified period. It should be reliable and maintain consistent performance throughout its operational lifespan.
Reliability: The system is expected to produce fast and accurate results consistently. Reliability
is crucial to ensure that the system can be trusted to deliver dependable outcomes in various
situations.
In essence, these functional and non-functional requirements outline the specific functions and performance expectations of the system. They serve as a guideline for developing, testing, and assessing the system's capabilities in classifying census income data.
2.4 Summary
Software and hardware requirements, when meticulously identified, serve as the foundation for
system development. The interplay between these elements must be harmonious to ensure
smooth integration. These requirements, expressed quantitatively, act as a measurable
benchmark for the system.
Functional requirements outline the specific operations a system must perform, ranging from
pre-processing to data extraction and evaluation. On the other hand, non-functional requirements
serve as metrics for evaluating how effectively the system executes these operations, assessing
aspects such as reliability, accuracy, and user-friendliness.
By breaking down high-level tasks into detailed requirements, developers can create a clear plan
of action. This process not only addresses user demands but also guides system design by
establishing clear goals. Requirement analysis is a critical prelude to project initiation, offering
insights into feasibility, complexity, and providing the groundwork for an effective execution
plan. In summary, it is an indispensable task that sets the stage for successful project
development.
CHAPTER 3: SYSTEM DESIGN
3.1 Introduction
System Design is the process of delineating the components of a system, including interfaces,
algorithms, UML diagrams, and data sources or databases utilized to meet specified
requirements. It is crafted to fulfil the needs and demands of a business or organization, aiming
to construct a coherent and well-functioning system.
Income levels are shaped by many interacting demographic and occupational factors, with no single discernible pattern, which makes them challenging to predict with simple rules. Analysing such data through data mining techniques provides a means of uncovering the relationships that drive earnings. With an increasing emphasis on data-driven decision-making, many organisations are turning to machine learning as an investment.
Machine learning involves the amalgamation of diverse computer algorithms and statistical
modeling, enabling computers to execute tasks without explicit programming. The model
acquires knowledge from training data, learning patterns and relationships. Subsequently,
predictions can be generated, or actions performed based on the assimilated experiential
knowledge. In essence, machine learning empowers systems to adapt and evolve based on
acquired insights, offering a dynamic approach to addressing complex prediction problems such as income classification.
3.2 Proposed System
The proposed system envisions an application that utilizes the trained machine learning model to
predict income levels based on new data. This involves deploying the model in a real-world
scenario, integrating it into an application or system for continuous predictions and assessments.
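As a minimal sketch of this deployment idea, a trained model can be persisted and reloaded inside an application to score new records. The toy model, the file name, and the single new record below are all stand-ins, not the project's actual trained artifact; joblib is one common persistence mechanism, not necessarily the one used here.

```python
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Train a toy model on invented data with census-like columns.
train = pd.DataFrame({"age": [25, 52, 38, 60],
                      "hours-per-week": [40, 60, 50, 45],
                      "income": [0, 1, 0, 1]})
model = RandomForestClassifier(random_state=0).fit(
    train[["age", "hours-per-week"]], train["income"])

joblib.dump(model, "income_model.joblib")      # saved at training time
loaded = joblib.load("income_model.joblib")    # reloaded inside the application

# Score one new, previously unseen record.
new_person = pd.DataFrame({"age": [55], "hours-per-week": [58]})
print(loaded.predict(new_person))              # 0 = "<=50K", 1 = ">50K"
```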
In conclusion, this project not only explores the nuances of the US Census Income dataset but
also underscores the potential of machine learning in addressing socioeconomic challenges, such as income inequality, through predictive modeling and early intervention strategies.
♦ Sampling (Discretization): The initial step involves transforming the continuous attributes of the US Census Income dataset into discrete values. This conversion simplifies computational processing, allowing for more efficient analysis within the computational environment.
♦ Feature Extraction: The US Census Income dataset encompasses various features, such
as age, education level, occupation, and more. Feature extraction aims to emphasize
significant components while disregarding redundant or less informative data segments,
preparing the dataset for machine learning analysis.
♦ Normalization: Ensuring uniformity across diverse features, normalization standardizes
the extracted attributes within a consistent range. This step prevents any individual
feature, due to its scale, from disproportionately influencing the machine learning model's
learning process.
♦ Data Cleaning (Silencing): Like many real-world datasets, the US Census Income dataset may have sections with incomplete or non-informative data. Data cleaning involves identifying and removing such sections, ensuring the dataset is robust and suitable for analysis.
♦ Segmentation (Framing): The US Census Income dataset can be vast, requiring a more
granular examination. Segmentation involves dividing the dataset into smaller frames or
segments, allowing for a focused analysis of income trends within specific demographic
or categorical intervals.
3.3 Data Flow Diagram
3.3.1 Description
A Data Flow Diagram (DFD) is a visual representation of how data moves through an
information system, illustrating the flow and processing of information. DFDs are instrumental
in depicting the input and output of data, its sources and destinations, and storage locations
within a system. They provide a clear overview of the data's journey, detailing what information
enters the system, how it is processed, and where it ultimately goes. Notably, DFDs do not delve
into the timing of processes or whether they operate sequentially or concurrently.
These diagrams are valuable for visually mapping the data flow in a business information
system. They outline the processes involved in transferring data from input sources to file
storage and report generation. DFDs can be categorized into logical and physical representations.
The logical DFD focuses on the flow of data to achieve specific business functionalities, while
the physical DFD delves into the actual implementation of this logical data flow. In essence,
DFDs serve as a powerful tool for understanding and communicating the intricacies of data
movement within a system.
3.3.2 Uses of DFDs
1. Boundary Definition: DFDs establish the boundary of the business or system domain under investigation, delineating the scope of analysis activity.
2. Identification of External Entities: They identify external entities and their data
interfaces that interact with the processes of interest within the system.
3. Stakeholder Agreement: DFDs are a useful tool for securing stakeholder agreement,
often involving sign-off on the project scope.
4. Process Breakdown: They assist in breaking down complex processes into sub-
processes, facilitating a more detailed and focused analysis.
5. Logical Information Flow: DFDs illustrate the logical flow of information within the
system, depicting how data moves through different processes.
6. Physical System Construction: They contribute to determining the requirements for the
physical construction of the system based on the logical data flow.
7. Simplicity of Notation: DFDs utilize a simple notation that is easy to understand, aiding
in the clear representation of complex information.
8. Manual and Automated System Requirements: DFDs help establish both manual and
automated system requirements, providing insights into how processes should be
executed.
In essence, Data Flow Diagrams play a multifaceted role in system analysis, offering a visual
representation of data flow and interactions within a system while aiding in project scoping,
stakeholder communication, and detailed process analysis.
3.3.3 Level of Abstraction
In the realm of software engineering, Data Flow Diagrams (DFDs) serve as a tool to represent systems at different levels of abstraction. Higher-level DFDs are partitioned into lower levels, providing more detailed information and functional elements. The levels in a DFD are typically denoted as 0, 1, 2, or beyond. Here we focus on two main levels: the 0-level DFD and the 1-level DFD.
Level 0 (Context Diagram):
The primary purpose of the Context Diagram is to offer an at-a-glance view that can be easily
understood by a broad audience, including stakeholders, business analysts, data analysts, and
developers. Its simplicity and clarity make it a valuable communication tool, facilitating a shared
understanding of the system's context and interactions.
The figure describes the overall process of the project: the dataset file is read in and preprocessed, and different classifiers are then applied to predict the outcome.
Level 1:
The exploration of the context-level Data Flow Diagram (DFD) is followed by the creation of a
Level 1 DFD, delving into more detailed aspects of the modelled system. The Level 1 DFD
illustrates how the system is subdivided into sub-systems or processes, each handling specific
data flows to or from external agents. Together, these sub-systems collectively provide the
complete functionality of the overall system.
Furthermore, the Level 1 DFD identifies internal data stores that are crucial for the system to
effectively perform its functions. It illustrates the flow of data between various inputs of the
system, offering a more granular understanding of how information is processed and exchanged
within the different components of the system. Essentially, the Level 1 DFD provides a detailed
representation of the system's internal processes and data flows, offering insights into its
operational intricacies.
3.4 Summary
This chapter provides a concise introduction to the system design process, outlining the
methodologies essential for system development. It addresses various types of design processes
applied in real-world scenarios and explores different system architectures. The proposed model
elucidates the precise workings of the system.
The chapter employs Data Flow Diagrams (DFDs) at two distinct levels of abstraction to illustrate the proposed model. DFDs serve as graphical representations, summarizing the flow of
data in various process levels. The focus is on articulating the details of the proposed system and
comparing it with the existing system. Emphasis is placed on how the proposed system brings
about a reduction in complexity, cost, and enhances overall system performance. In essence, the
chapter delves into the design intricacies, providing insights into the envisioned system and its
potential improvements over the current system.
CHAPTER 4: METHODOLOGY
The goal of this methodology is to provide a clear and concise summary of the US Census
Income data, which contains information about the income levels of people in the United States.
This synopsis will help readers quickly understand key insights from the dataset. Start by
gathering the US Census Income data, which typically includes information about individuals
and their income. This data is usually collected through surveys and questionnaires.
4.1 Dataset Description
The US Census Income dataset, commonly known as "Census Income" or "Adult Income," is a
widely utilized resource in the realms of machine learning and data analysis. Its primary
objective is to predict whether an individual's income surpasses or falls below a specified
threshold, typically set at $50,000 annually. The dataset comprises a diverse set of features,
encompassing numerical and categorical attributes that offer insights into various aspects of an
individual's life, including age, education, employment, marital status, and more. One set of
features includes demographic details such as age, education level, and marital status. For
instance, education is represented both categorically and numerically, providing a comprehensive
view of an individual's highest level of education. Another set of features delves into
employment-related aspects, classifying work situations, occupations, and the number of hours
worked per week. Additionally, personal attributes like race, gender, capital gains, losses, and
native country contribute to the dataset's richness. The target variable, "Income," is binary,
indicating whether an individual earns above or below $50,000, encapsulating the essence of the
predictive task associated with this valuable dataset. We begin by preparing the dataset,
removing unwanted noise such as missing values and duplicate rows. We then explore the data
with different graphs and charts, perform feature engineering and pre-processing, and finally
build classification models with the help of different Machine Learning algorithms.
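The cleaning step described above can be sketched with pandas. The column names follow the Adult/Census Income dataset, but the tiny inline frame below is a made-up stand-in for the real CSV (the file name `adult.csv` is an assumption):

```python
import pandas as pd

# Toy frame standing in for pd.read_csv("adult.csv") (path assumed).
df = pd.DataFrame({
    "age":       [39, 50, 38, 38, None],
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad", "Masters"],
    "income":    ["<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

# In the raw data, missing values appear as "?"; normalise them to NaN first.
df = df.replace("?", pd.NA)

# Remove unwanted noise: rows with missing values and duplicate rows.
df = df.dropna().drop_duplicates()
print(len(df))  # the NaN-age row and the duplicate HS-grad row are gone
```

The same two calls, `dropna()` and `drop_duplicates()`, scale unchanged to the full dataset.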
Exploratory analysis begins by summarizing the data, enabling a more informed understanding
before delving into predictive modeling.
Our initial analysis focuses on the target variable, where we employ a pie chart to visually
represent the distribution of individuals based on their annual income. The chart allows us to
observe the proportion of individuals earning less than or equal to $50k and those earning more
than $50k annually, providing a clear snapshot of the income distribution within the dataset.
The pie chart above illustrates that a significant majority, approximately 76.1%, of individuals in
the total population have an annual income less than or equal to $50k. In contrast, around 23.9%
of individuals fall into the category of annual income exceeding $50k. This breakdown provides
a clear visual representation of the income distribution within the dataset.
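The proportions behind such a pie chart come from normalized value counts of the target column. A sketch on toy labels (not the real dataset, whose split is the 76.1%/23.9% reported above):

```python
import pandas as pd

# Toy target column mimicking the income label distribution.
income = pd.Series(["<=50K"] * 76 + [">50K"] * 24, name="income")

# Normalized counts give the slice proportions for the pie chart.
shares = income.value_counts(normalize=True)
print(shares)
# shares.plot.pie(autopct="%.1f%%") would render the chart with matplotlib
```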
In this dataset, characterized by an extensive list of categorical variables and a limited set of
numerical variables, our initial focus is on analyzing the numerical data. To achieve this, we
employ a Heatmap featuring a correlation matrix. This matrix provides insights into the
interrelationships among the numerical features, offering a visual representation of their
correlations. The Heatmap not only enhances our understanding of the dataset's numerical
dynamics but also presents this information in a visually intuitive manner.
From the plot above, we draw the following insights about the dataset:
1. hours-per-week and educational-num show a weak positive correlation (0.14).
2. educational-num and capital-gain also show a weak positive correlation (0.13).
3. Finally, hours-per-week and fnlwgt are negatively correlated.
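The correlation matrix behind the heatmap can be computed directly with pandas. The column names below follow the dataset, but the values are made up for illustration:

```python
import pandas as pd

# Toy numeric columns (names from the dataset, values invented).
num = pd.DataFrame({
    "educational-num": [9, 10, 13, 14, 16],
    "hours-per-week":  [40, 38, 45, 50, 55],
    "capital-gain":    [0, 0, 5178, 0, 99999],
})

# Pairwise Pearson correlations among the numerical features.
corr = num.corr()
print(corr.round(2))
# seaborn.heatmap(corr, annot=True) would render this as the plot above
```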
After this, we plotted distribution plots for these numerical variables to observe the skewness
of the data, and then box plots to check for the presence of outliers. Here we observed that the
continuous data contains outliers. Having successfully analysed the continuous data, we then
analysed the categorical (discrete) data with different plots.
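Both checks, skewness and outlier detection, can be sketched numerically on toy values; the 1.5×IQR rule below is the same criterion a box plot's whiskers use:

```python
import pandas as pd

# Toy numeric column with one obvious outlier.
s = pd.Series([25, 30, 32, 35, 38, 40, 42, 45, 90])

# Positive skew => a long right tail in the distribution plot.
print(round(s.skew(), 2))

# The 1.5*IQR rule flags the same points a box plot marks as outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(list(outliers))  # [90]
```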
In the categorical data, we can observe that most variables contain many distinct categories;
only a few variables have just two to five categories. We analysed all these variables with
count plots and obtained some interesting insights, including the following:
➢ The prevailing trend indicates that a significant portion of individuals in the dataset has
pursued education beyond high school.
➢ The count of males surpasses that of females in both income categories, <=50k and
>50k annually.
➢ The ratio of individuals earning <=50k and >50k appears to be roughly equivalent among
those with an occupation in Exec-managerial roles.
To select categorical features, we performed the Chi-square test to check the relationship
between each input variable and the target variable. The Chi-Square test is a hypothesis-testing
method that helps us understand the relationship between two categorical variables. The null
hypothesis states that there is no relationship between the two variables, whereas the alternative
hypothesis states that there is some relationship. Using this test, we can simply drop those
features that have no relationship with the target variable. After performing the test, we found
four variables with no relationship to the target, so we dropped them and moved ahead to
modeling with the remaining features.
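The test can be sketched with `scipy.stats.chi2_contingency` on a contingency table of a feature against the target (toy data below, not the real counts):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy feature/target pairs; real code would use the census columns.
df = pd.DataFrame({
    "gender": ["M", "M", "F", "F", "M", "F", "M", "F"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K",
               ">50K", "<=50K", ">50K", "<=50K"],
})

# Contingency table of feature vs. target, then the independence test.
table = pd.crosstab(df["gender"], df["income"])
chi2, p, dof, expected = chi2_contingency(table)

# A large p-value => fail to reject "no relation": a candidate for dropping.
print(f"chi2={chi2:.3f}, p={p:.3f}, dof={dof}")
```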
Finally, after balancing the dataset, we split the data into training and test sets in an
80:20 ratio. We now come to the final stage of this section, where we discuss the Machine
Learning algorithms used to build the models.
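The 80:20 split can be sketched with scikit-learn's `train_test_split` (toy arrays below; `stratify=y` keeps the class balance achieved in the previous step):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 toy samples with 2 features and alternating labels.
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 80:20 split, stratified so both sets keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(len(X_train), len(X_test))  # 40 10
```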
1. Logistic Regression Classifier: The first algorithm we used for classification is
Logistic Regression, a statistical method for binary classification that predicts the
probability of an observation belonging to a specific class. Despite its name, it is
employed for classification rather than regression. The algorithm
models the relationship between the independent variables and the log-odds of the
probability of a particular outcome. It uses the logistic function, also known as the
sigmoid function, to constrain predictions between 0 and 1. During training, the model
adjusts its parameters through optimization techniques like gradient descent to minimize
the difference between predicted probabilities and actual class labels.
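A minimal sketch of this classifier on a tiny separable toy set (the real features would be the encoded census columns):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One toy feature; small values are class 0, large values class 1.
X = np.array([[1], [2], [3], [10], [11], [12]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# The sigmoid constrains every probability to (0, 1).
proba = clf.predict_proba([[2.0], [11.0]])[:, 1]
print(clf.predict([[2.0], [11.0]]))  # [0 1]
```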
2. Decision Tree: This is a highly interpretable model that can be used for both
classification and regression tasks. The tree is made up of three node types: the root
node, internal (child) nodes, and leaf nodes. The best feature is selected as the root
node by calculating the information gain for each feature, and information gain can be
computed with one of two impurity measures: entropy or Gini impurity. A key
hyperparameter is the tree depth, and if the depth is too high, the model may overfit.
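The two impurity measures named above can be sketched directly; information gain is the parent's impurity minus the weighted impurity of its children:

```python
import math

def entropy(probs):
    """Shannon entropy of a class distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity of a class distribution."""
    return 1.0 - sum(p * p for p in probs)

# A perfectly mixed binary node is maximally impure:
print(entropy([0.5, 0.5]))  # 1.0
print(gini([0.5, 0.5]))     # 0.5
# A pure node has zero impurity under both measures:
print(entropy([1.0]), gini([1.0]))  # 0.0 0.0
```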
3. Random Forest: Random Forest is one of the most popular ensemble learning methods. It is a
bagging technique in which a number of decision trees are trained in parallel on different
subsets of the data. These subsets are created by sampling with replacement, a method
called bootstrapping. Bootstrapping is a statistical technique for estimating a statistic
from a sample of data and can be used to reduce error. On these subsets we train strong
classifiers (overfitted models, i.e. deep decision trees), and at aggregation time the
variance is reduced by a reasonable amount, yielding a generalized model. The best thing
about this model is that all the classifiers are trained in parallel, so we can train a
large number of classifiers without greatly increasing computational or time complexity.
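The bootstrapping step can be sketched in a few lines; each tree receives a sample drawn with replacement, the same size as the original data:

```python
import random

random.seed(0)
data = list(range(10))   # toy dataset of 10 points
n_trees = 3

# One bootstrap sample (with replacement) per tree.
subsets = [random.choices(data, k=len(data)) for _ in range(n_trees)]

for i, s in enumerate(subsets):
    # Some points repeat and some are left out ("out-of-bag").
    print(f"tree {i}: {sorted(s)}")
# sklearn.ensemble.RandomForestClassifier(n_jobs=-1) trains such trees in parallel
```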
4. XGBoost Classifier: XGBoost is another very popular and effective ensemble learning
technique, but unlike Random Forest, XGBoost is a boosting technique. Boosting builds a
model from individual weak learners and, unlike bagging, is a sequential learning
process. XGBoost is a well-optimized boosting implementation, which is why it performs
comparably to certain deep learning algorithms. It uses L1 and L2 regularization, which
helps prevent overfitting, and one of its best aspects is parallelization, which makes it
a very fast algorithm.
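A boosting sketch on a toy problem. Since the `xgboost` package may not be available everywhere, scikit-learn's `GradientBoostingClassifier` stands in here; the actual call would be `xgboost.XGBClassifier(reg_alpha=..., reg_lambda=...)` for the L1/L2 regularization mentioned above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy linearly separable problem: label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Sequential trees, each correcting the errors of its predecessors.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
print(round(clf.score(X, y), 2))  # training accuracy on this toy problem
```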
5. Adaptive Boosting: Adaptive Boosting is an ensemble learning method in machine
learning that combines the strengths of multiple weak learners to create a robust and
accurate predictive model. The algorithm operates iteratively, assigning varying weights
to instances in the dataset based on their classification accuracy. During each iteration, it
focuses on the misclassified instances from the previous round, adjusting their weights to
prioritize their correct classification in the next iteration. Weak learners, typically simple
models with accuracy slightly better than random, are sequentially added to the ensemble.
The final model aggregates the predictions of these weak learners, with each contributing
a weighted vote.
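One round of the reweighting described above can be sketched in plain Python: misclassified points gain weight, correctly classified points lose weight, and the weights are renormalized:

```python
import math

# Uniform start: 5 points, 2 of which the weak learner misclassifies.
weights = [0.2] * 5
correct = [True, True, False, True, False]

# Weighted error of this round's weak learner, and its vote weight alpha.
err = sum(w for w, c in zip(weights, correct) if not c)
alpha = 0.5 * math.log((1 - err) / err)

# Up-weight mistakes, down-weight correct points, then renormalize.
weights = [w * math.exp(alpha if not c else -alpha)
           for w, c in zip(weights, correct)]
total = sum(weights)
weights = [w / total for w in weights]

print(round(alpha, 3), [round(w, 3) for w in weights])
```

After normalization the misclassified points carry half of the total weight, which is what forces the next learner to focus on them.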
CHAPTER 5: EXPERIMENTAL RESULTS
5.1 Result
There are different evaluation metrics available which can help us to evaluate the performance of
the model. For this classification task, we used different classifiers: Logistic Regression,
Decision Tree, Random Forest, XGBoost, and AdaBoost. Since this is a binary classification
task, we computed the accuracy score, f1-score, precision score, and recall score, and finally
visualized the confusion matrix, which helps us understand how much data we classified
incorrectly. Let's discuss each metric one by one.
Figure 5: Confusion Matrix for the XGB Classifier.
The confusion matrix gives us a visual representation of the total number of correctly
classified points and the number of misclassified points. In this heatmap, we can observe that
12,190 points are classified correctly while 2,380 points are misclassified.
1. Accuracy Score: It is calculated as the total number of correctly classified points divided
by the total number of points in the test set. We obtained the maximum accuracy with the
XGBoost classifier (83%), followed by Logistic Regression (82%), and then Random Forest,
AdaBoost, and the Decision Tree classifier (81%).
2. Precision Score: Precision is calculated as the ratio of correctly classified positive
points (True Positives) to the total number of points predicted as positive (True Positives
+ False Positives).
3. Recall Score: Recall tells us what proportion of the positive class got correctly classified.
It is calculated as the number of true positives divided by the sum of true positives and
false negatives.
4. F1-score: Precision focuses on false positives, whereas recall focuses on false negatives.
If we want to minimize both false positives and false negatives, we can use the f1-score,
which is calculated as the harmonic mean of precision and recall. For this dataset, the
XGBoost classifier again achieved the maximum f1-score.
5. Confusion Matrix: The confusion matrix gives us a visual representation of the total number
of correctly classified points and the number of misclassified points. We have plotted the
confusion matrix of the best classifier (XGBoost) in Figure 5 above.
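All five metrics above are available in `sklearn.metrics`; a sketch on toy predictions (1 = income >50K, the positive class), not the report's actual results:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Toy ground truth and predictions for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

print(accuracy_score(y_true, y_pred))      # 0.8  (8 of 10 correct)
print(precision_score(y_true, y_pred))     # 0.75 = TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))        # 0.75 = TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))            # 0.75, harmonic mean of the two
print(confusion_matrix(y_true, y_pred))    # [[TN FP], [FN TP]] = [[5 1], [1 3]]
```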
5.2 Deployment
Our project has transitioned into the operational phase following a successful deployment on
Flask, a lightweight and versatile web framework written in Python. Leveraging Flask's
capabilities, we have established a robust hosting environment that ensures the reliability of
our application and allows for scalability as user interactions grow.
With this deployment, the project now offers a user-friendly and accessible web interface,
enabling a seamless and engaging experience for our users as they interact with the various
features and functionalities we've implemented.
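A minimal sketch of such a Flask deployment. The route name and payload fields are assumptions, and the placeholder rule stands in for the trained model, which a real app would load (e.g. with joblib) and call via `model.predict(...)`:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # Hypothetical JSON payload of encoded features.
    features = request.get_json()
    # Placeholder rule standing in for the trained classifier.
    label = ">50K" if features.get("educational-num", 0) >= 13 else "<=50K"
    return jsonify({"income": label})

# app.run(debug=True) would start the local development server
```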
Figure 6: Result (predicting less than or equal to 50k).
Figure 7: Result (predicting greater than 50k).
CHAPTER 6: CONCLUSION AND
FUTURE SCOPE
The project shows an in-depth implementation of the different algorithms, which we compared
based on accuracy score, precision score, recall score, and f1-score. The research shows how
tree-based and ensemble-based techniques are implemented to predict whether the income of
individuals in the US is <=50k or >50k. Of the 5 algorithms, we can observe that XGBoost
outperformed all the others and showed the best results. The Logistic Regression classifier
also performed well with a high accuracy score, and it may be used as the second-best model
for this problem.
The proposed model was tested on the available dataset. A total of 48,842 records were used to
analyze the data and build the model. The research demonstrates the different data
pre-processing techniques, along with visualization techniques, that help us understand the
data better and identify patterns in it. The main aim of this project was to build a model that
gives accurate and stable performance, and we achieved an accuracy of more than 82%. Next, we
could try to push this performance above 90% by performing hyperparameter optimization, since
these ensemble methods have many hyperparameters; alternatively, we could try a neural
network, a deep learning approach, to enhance performance.
Figure 8: Bar plot for accuracy score comparison.
In the comparative analysis of machine learning models, XG Boost emerged as the standout
performer, showcasing the highest accuracy score of 83.665%. This signifies its superior capability
in accurately predicting outcomes for the specific task at hand. The robust performance of XG
Boost positions it as the top-performing algorithm among the evaluated models, underscoring its
effectiveness in handling the complexities of the given dataset and task requirements. This result
suggests that XG Boost is a promising choice for applications requiring precise and reliable
predictions.
APPENDICES
Project Report Overview
This comprehensive project report encapsulates our efforts to predict individuals' income levels,
a pivotal pursuit in the realm of socio-economic analysis. The report provides a detailed account
of our methodologies, findings, and implications for future research and practical applications
within the domain of income prediction.