
Department of Computer Science and Engineering

CHITKARA UNIVERSITY
BADDI HIMACHAL PRADESH-174103 (INDIA)

A PROJECT REPORT
ON

US Census Income

Submitted by

Abhishek (2011981005)
Anuj (2011981014)
Rishabh Verma (2011981099)

Supervised By
Mr. Shivam Singh

DECLARATION
We, the undersigned collaborators involved in the development and compilation of this project
report titled 'US Census Income,' affirm the absolute authenticity, originality, and integrity of the
work presented herein.

This report reflects our independent research, comprehensive analysis, and scholarly dedication. All information, findings, and conclusions contained herein are the result of our diligent efforts, meticulously referenced in accordance with academic standards and acknowledging all external sources of knowledge or data used.

We unequivocally declare that this report has not been previously submitted, either in part or in
whole, for any other academic or professional qualification. It stands as a distinctive and original
scholarly creation adhering to the highest ethical standards and academic integrity expected
within our educational institution.

Additionally, we acknowledge that any contributions from external sources, whether in the form
of direct quotations, paraphrased information, or borrowed concepts, have been appropriately
cited and referenced, demonstrating our commitment to upholding established academic
conventions.

By undersigning this declaration, we assume full responsibility for the authenticity, accuracy,
and originality of the content presented in this project report.

Name of the Student:


Abhishek Koundal
Anuj
Rishabh Verma

Place: Baddi, Himachal Pradesh

Date: 20/11/2023

ACKNOWLEDGEMENT

We extend our deepest gratitude to Mr. Shivam Singh for his dedicated mentorship, expert
guidance, and unwavering support throughout this project. His insightful feedback and
commitment to excellence played a pivotal role in navigating the complexities of our research
and refining our methodologies.

Our appreciation also goes to the esteemed faculty and staff of the Department of Computer
Science and Engineering at Chitkara University. Their scholarly wisdom, continuous support,
and encouragement created an intellectually stimulating environment, fuelling our academic
growth and providing essential resources for this endeavour.

Heartfelt thanks are due to our peers and colleagues for their collaborative spirit, thought-
provoking discussions, and constructive criticism. Their diverse perspectives and shared
enthusiasm significantly contributed to the evolution of our ideas and the refinement of our
project.

Additionally, we acknowledge the contributions of researchers, scholars, and institutions whose seminal work and comprehensive literature served as guiding beacons, enriching our
understanding and laying the foundation for this project.

Our profound appreciation extends to our families and friends for their unwavering support,
understanding, and encouragement. Their patience and belief in our aspirations were
fundamental in sustaining us through the demanding phases of this academic pursuit.

This project's feasibility owes much to the collective support, guidance, and encouragement from
these individuals and institutions. Whether in mentoring, scholarly resources, or personal
support, their invaluable contributions have shaped this research endeavour and our broader
academic journey.

ABSTRACT

During this project, our undertaking involved delving into a real-world dataset, where our
objective was to explore the application of machine learning algorithms in discerning intricate
patterns within the data.

We embarked on an experiential journey utilizing a prevalent data visualization and machine learning library, employing diverse algorithms, and culminating in the submission of a
comprehensive report elucidating our findings and methodologies.

Beyond the technical facets, this project serves as a testament to the interdisciplinary nature of
modern data science. It underscores the importance of ethical considerations in handling
sensitive demographic information and the responsibility that comes with wielding predictive
models. As we harnessed the power of machine learning, we also recognized the ethical
imperative to ensure fairness, transparency, and accountability in our analyses.

The US Census Income dataset, akin to an expansive puzzle, serves as a critical tool for
governmental insights and research endeavours aimed at comprehending individuals' financial
landscapes. This colossal survey engages myriad individuals, extracting details about their
earnings from diverse occupational sources.

Subsequently, this amalgamated information paints a comprehensive portrait of the economic prosperity or constraints within the nation.

The significance of this multifaceted puzzle lies in its pivotal role in governmental decision-
making processes, influencing the allocation of resources to vital sectors such as education,
healthcare, and infrastructure development.

Moreover, it acts as a compass guiding authorities in identifying individuals who might require
additional support, such as financial assistance. Researchers leverage this rich dataset to delve
into nuanced aspects like income inequality and the dynamic shifts in people's financial
circumstances over temporal trajectories.

Table of Contents

DECLARATION..................................................................................................................................................................2
ACKNOWLEDGEMENT ....................................................................................................................................................3
ABSTRACT ........................................................................................................................................................................4
Chapter 1: INTRODUCTION .............................................................................................................................................6
1.1 Background ......................................................................................................................................................6
1.2 Problem Statement ..........................................................................................................................................6
1.3 Problem Aim .....................................................................................................................................................7
1.4 Chapter Overview .............................................................................................................................................7
CHAPTER 2: SYSTEM REQUIREMENT ..............................................................................................................................8
2.1 Introduction ..........................................................................................................................................................8
2.2 Software and Hardware Requirements ............................................................................................................. 10
Software Requirements: ..................................................................................................................................... 10
Hardware Requirements: .................................................................................................................................... 11
2.3 Functional and Non-Functional Requirements .................................................................................................. 11
Functional Requirements: ................................................................................................................................... 11
Non-Functional Requirements: ........................................................................................................................... 12
2.4 Summary ............................................................................................................................................................ 12
CHAPTER 3: SYSTEM DESIGN........................................................................................................................................ 13
3.1 Introduction ....................................................................................................................................................... 14
3.2 Proposed System ............................................................................................................................................... 14
3.3 Data flow Diagram ............................................................................................................................................. 16
3.3.1 Description ................................................................................................................................................. 16
3.3.2 Uses of DFD’s .............................................................................................................................................. 16
3.3.3 Level of Abstraction .................................................................................................................................... 17
3.4 Summary ............................................................................................................................ 19
CHAPTER 4: METHODOLOGY .......................................................................................................................... 19
4.1 Dataset Description ....................................................................................................................................... 20
4.2 Data Acquisition and Preparation ................................................................................................................. 20
4.3 Feature Selection and Exploratory Data Analysis ......................................................................................... 20
4.4 Feature Engineering & Selection ................................................................................................................... 23
4.5 Data Pre-processing (Splitting & Balancing the dataset) .............................................................................. 24
4.6 Proposed Machine Learning Algorithms for Classification ........................................................................... 25

CHAPTER 5: EXPERIMENTAL RESULTS .......................................................................................................................... 27
5.1 Result ................................................................................................................................................................. 27
5.2 Deployment ....................................................................................................................................................... 30
CHAPTER 6: CONCLUSION AND FUTURE SCOPE .......................................................................................................... 33
APPENDICES ................................................................................................................................................................. 35

Chapter 1: INTRODUCTION
1.1 Background
The US Census Income dataset is akin to a large survey, capturing information about individuals'
income derived from various sources, providing a nuanced understanding of the economic
landscape. Its significance lies in its role as a crucial tool for governmental resource allocation,
influencing decisions regarding investments in areas such as education, healthcare, and
infrastructure. Moreover, the dataset acts as a lens through which researchers scrutinize income
inequality and societal financial dynamics.
This project's focus is to employ machine learning methodologies to gain deeper insights into the
dataset's intricate patterns and relationships. The challenges posed by the dataset, such as its
large volume, imbalanced class distribution, missing data, and the need for effective categorical
encoding, are acknowledged. The methodology involves rigorous data collection, cleaning,
exploration, and visualization, leveraging Python and associated libraries like Pandas, NumPy,
and Scikit-Learn. Machine learning algorithms, including Decision Trees, Random Forests, and
Logistic Regression, are implemented for predictive modeling.
In essence, the US Census Income dataset stands not merely as a collection of numbers but as a
dynamic and multifaceted resource that fuels crucial societal insights, shapes policy decisions,
and empowers researchers in their pursuit of understanding and addressing the complex
landscape of income dynamics within a nation.

1.2 Problem Statement


The overarching objective of this project is to tackle the intricate task of predicting income levels
by harnessing the potential of the US Census Income dataset. Through the strategic utilization of
machine learning techniques, the primary aim is to construct a robust model proficient in
precisely categorizing individuals' incomes into the binary classes of above or below $50,000.
The project is not only poised to address this fundamental classification challenge but also
confronts additional complexities intrinsic to the dataset.

A significant challenge is the imbalanced nature of the data, where one income class dominates
the other, potentially introducing bias into the model outcomes. Moreover, the vast size of the dataset poses computational hurdles, necessitating meticulous optimization efforts to ensure
efficient and effective model training. In addition to these challenges, the project emphasizes the
crucial task of feature selection, recognizing that not all attributes contribute equally to the
model's predictive capacity.

1.3 Problem Aim


The project aims to predict income levels (>50K or <=50K) based on individual attributes in the
US Census Income dataset. By leveraging machine learning techniques, the goal is to create a
robust predictive model that can contribute to a deeper understanding of income dynamics and
assist policymakers in making informed decisions for societal well-being. The ultimate objective
is to deploy the developed model for real-world applications, providing insights into income
predictions based on new data.

1.4 Chapter Overview

Chapter I: Introduction

This chapter presents the problem statement, the reason for selecting it, and the contribution to be made towards solving the problem, along with the execution plan.

Chapter II: System Requirements

"System requirements" refer to the specifications and capabilities that a computer or software
application needs to operate efficiently. These requirements typically include details such as
minimum and recommended hardware specifications, supported operating systems, required
software dependencies, and sometimes network or internet connectivity specifications.

Chapter III: System Design

System design is a phase in the software development process that involves creating the
architectural blueprint for a computer-based system. It encompasses the specification of how the
system components should be organized and interact to fulfill the specified requirements. This
includes decisions about the overall structure, modules, interfaces, and data for the system.

Chapter IV: Methodology

This chapter entails an in-depth exploration of the dataset, aiming to extract meaningful insights.
Subsequently, we will unravel patterns through exploration and conclude by discussing the tools
and techniques employed for forecasting and prediction tasks.

Chapter V: Experimental Results

This chapter focuses on elucidating diverse evaluation metrics crucial for assessing the
performance of each Machine Learning model. These metrics provide insights into the effectiveness, accuracy, and stability of the models, aiding in the identification of optimal
solutions for predictive tasks.

Chapter VI: Conclusion and Future Scope

This chapter will offer a condensed summary of our report, provide essential recommendations
based on our analysis and insights, and explore avenues for improving our model using diverse
AI techniques.

CHAPTER 2: SYSTEM REQUIREMENT

2.1 Introduction

Requirement analysis is a crucial phase in product development, essential for assessing the
viability of an application. It encompasses software, hardware, and functional requirements.

Software requirements pertain to the specifications addressing end-user issues through software
solutions. In simpler terms, they define what the software should accomplish for users. This
involves understanding and documenting user needs, functionalities, and features to ensure the
software effectively meets its intended purpose. It serves as a blueprint for the development
team, guiding them in creating a solution aligned with user expectations and business objectives.

The requirement analysis process involves several key stages:

1. Elicitation: This phase involves gathering information directly from end users and
customers. It aims to capture their needs, preferences, and expectations, providing a
foundation for further analysis.

2. Analysis: In the analysis stage, the gathered information is logically examined to gain a
comprehensive understanding of customer needs. The goal is to refine and clarify
requirements, ensuring they are precise and aligned with the project's objectives.

3. Specification: The specification stage involves documenting requirements in a structured format, such as use cases, user stories, functional requirements, or visual representations.
This documentation serves as a reference for the development team.

4. Validation: Validation is the process of verifying and confirming that the specified
requirements meet the intended objectives. This step ensures that the documented
requirements are accurate, complete, and consistent with the stakeholders' expectations.
5. Management: Requirements management is an ongoing process throughout development.
As projects evolve, requirements may change, necessitating continuous testing and
updates. This stage involves tracking, testing, and updating requirements as needed.

Hardware requirements: In addition to these stages, hardware requirements encompass the physical components essential for application development.

These include:

♦ Processor Cores and Threads: The processing power of the central processing unit (CPU),
including the number of cores and threads, is a critical hardware requirement.

♦ GPU Processing Power: Graphics processing units (GPUs) play a crucial role in handling
graphical tasks and parallel processing. The required GPU power depends on the nature
of the application.

♦ Memory: The amount of random-access memory (RAM) available impacts the system's
ability to handle multiple tasks simultaneously. Sufficient memory is essential for optimal
performance.

♦ Secondary Storage: Adequate secondary storage, such as hard disk drives (HDDs) or
solid-state drives (SSDs), is necessary for storing data and applications.

♦ Network Connectivity: The ability to connect to a network is crucial for applications that
rely on data exchange or require online functionality.

In essence, hardware requirements ensure that the software being developed is compatible with
and can fully leverage the capabilities of the underlying physical infrastructure.

2.2 Software and Hardware Requirements

Software Requirements:
• Operating System: A Windows 8 and above (64-bit) operating system is necessary to
serve as the interface between user programs and the kernel.

• Anaconda: The software requires Anaconda, a free and open-source distribution of Python and R programming languages designed for scientific computing. Anaconda
simplifies package management and deployment, with package versions managed by the
Conda package management system.

• Jupyter Notebook: This open-source web application facilitates the creation and sharing
of documents containing live code, equations, visualizations, and narrative text. Jupyter
Notebook finds applications in data cleaning, transformation, numerical simulation,
statistical modelling, data visualization, machine learning, and more.

• Data Set: The dataset encompasses 48842 records, featuring a target variable with values
"0" and "1".

Hardware Requirements:

• Processor: An Intel i5 processor with a base frequency of 2.5 GHz, up to 3.5 GHz (or an
equivalent AMD processor), is recommended.

• GPU (preferred): For enhanced performance, a dedicated GPU from NVIDIA or AMD
with a minimum of 4GB VRAM is preferred.

• Memory: A minimum of 8GB RAM is required to support effective data processing and
analysis.

• Secondary Storage: The software necessitates a minimum of 128GB SSD (Solid State
Drive) or HDD (Hard Disk Drive) for storing data and applications.

• Network Connectivity: A network bandwidth of approximately 10 Mbps to 75 Mbps is recommended for seamless data access and processing.

In summary, these software and hardware requirements outline the necessary components for performing classification analysis on the US Census Income data. They ensure compatibility, optimal performance, and efficient data handling throughout the analysis process.

2.3 Functional and Non-Functional Requirements

Functional Requirements:
Data Pre-processing: The system must perform data pre-processing, which involves cleaning,
transforming, and reducing data to convert raw data into a useful format.

Training: Initially, the system needs to undergo a training phase based on the provided dataset.
During this period, the system learns how to perform the required task based on the inputs
provided through the dataset.

Forecasting: The system is required to perform forecasting, which is the process of making
predictions about the future based on past and present data. This may involve analyzing trends to
make informed predictions.

Evaluation: To assess the system's efficacy in predicting annual income (">50K" or "<=50K"), the predicted outcomes are subject to an evaluation process. The model-generated predictions are validated against known data, or ground truth, to determine the model's accuracy.

Non-Functional Requirements:
Accuracy: The system's performance will be measured by its accuracy, which is defined as the
number of correct outputs divided by the total number of outputs. The system should strive for
high accuracy in its predictions.

Openness: The system must demonstrate efficient operation over a specified period. It should be
reliable and maintain consistent performance throughout its operational lifespan.

Portability: The system must be designed to be platform-independent, allowing it to run on various systems without significant modifications. This ensures versatility and ease of
deployment across different platforms.

Reliability: The system is expected to produce fast and accurate results consistently. Reliability
is crucial to ensure that the system can be trusted to deliver dependable outcomes in various
situations.

In essence, these functional and non-functional requirements outline the specific functions and performance expectations of the system. They serve as a guideline for developing, testing, and assessing the system's capabilities in predicting income levels from the US Census data.

2.4 Summary

In essence, requirement analysis is a pivotal phase in product development, integral to determining the feasibility of an application. This process involves a detailed examination of
various system requirements, encompassing both software and hardware components. Careful
consideration and compatibility of these requirements are crucial for the seamless integration of
the system, ultimately leading to the successful delivery of the final product.

Software and hardware requirements, when meticulously identified, serve as the foundation for
system development. The interplay between these elements must be harmonious to ensure
smooth integration. These requirements, expressed quantitatively, act as a measurable
benchmark for the system.

Functional requirements outline the specific operations a system must perform, ranging from
pre-processing to data extraction and evaluation. On the other hand, non-functional requirements
serve as metrics for evaluating how effectively the system executes these operations, assessing
aspects such as reliability, accuracy, and user-friendliness.

By breaking down high-level tasks into detailed requirements, developers can create a clear plan
of action. This process not only addresses user demands but also guides system design by
establishing clear goals. Requirement analysis is a critical prelude to project initiation, offering
insights into feasibility, complexity, and providing the groundwork for an effective execution
plan. In summary, it is an indispensable task that sets the stage for successful project
development.

CHAPTER 3: SYSTEM DESIGN

3.1 Introduction
System Design is the process of delineating the components of a system, including interfaces,
algorithms, UML diagrams, and data sources or databases utilized to meet specified
requirements. It is crafted to fulfil the needs and demands of a business or organization, aiming
to construct a coherent and well-functioning system.

Frauds are inherently dynamic and lack discernible patterns, making them challenging to
identify. Fraudsters leverage recent technological advancements to bypass security measures,
resulting in substantial financial losses. Analysing and detecting anomalous activities through
data mining techniques provides a means of tracing fraudulent transactions. With an increasing
emphasis on enhancing services, many companies are turning to machine learning as an
investment.

Machine learning involves the amalgamation of diverse computer algorithms and statistical
modeling, enabling computers to execute tasks without explicit programming. The model
acquires knowledge from training data, learning patterns and relationships. Subsequently,
predictions can be generated, or actions performed based on the assimilated experiential
knowledge. In essence, machine learning empowers systems to adapt and evolve based on
acquired insights, offering a dynamic approach to addressing complex challenges like fraud
detection.

3.2 Proposed System

The proposed system envisions an application that utilizes the trained machine learning model to
predict income levels based on new data. This involves deploying the model in a real-world
scenario, integrating it into an application or system for continuous predictions and assessments.

In conclusion, this project not only explores the nuances of the US Census Income dataset but also underscores the potential of machine learning in addressing socioeconomic challenges, such as income inequality, through predictive modeling and informed policy intervention.

♦ Sampling: The initial step involves transforming the continuous attributes of the US
Census Income dataset into discrete signals. This conversion simplifies computational
processing, allowing for more efficient analysis within the computational environment.
♦ Feature Extraction: The US Census Income dataset encompasses various features, such
as age, education level, occupation, and more. Feature extraction aims to emphasize
significant components while disregarding redundant or less informative data segments,
preparing the dataset for machine learning analysis.

♦ Normalization: Ensuring uniformity across diverse features, normalization standardizes
the extracted attributes within a consistent range. This step prevents any individual
feature, due to its scale, from disproportionately influencing the machine learning model's
learning process.

♦ Data Cleaning (Silencing): Like many real-world datasets, the US Census Income dataset may
have sections with incomplete or non-informative data. Data cleaning involves
identifying and removing such sections, ensuring the dataset is robust and suitable for
analysis.

♦ Segmentation (Framing): The US Census Income dataset can be vast, requiring a more
granular examination. Segmentation involves dividing the dataset into smaller frames or
segments, allowing for a focused analysis of income trends within specific demographic
or categorical intervals.

♦ Windowing: Due to the extensive nature of the US Census Income dataset, comprehensive processing at once may be challenging. Windowing addresses this by
partitioning the data into manageable sections, facilitating statistical analysis and
enabling a focused examination of income patterns or trends within specific contexts.

3.3 Data flow Diagram
3.3.1 Description

A Data Flow Diagram (DFD) is a visual representation of how data moves through an
information system, illustrating the flow and processing of information. DFDs are instrumental
in depicting the input and output of data, its sources and destinations, and storage locations
within a system. They provide a clear overview of the data's journey, detailing what information
enters the system, how it is processed, and where it ultimately goes. Notably, DFDs do not delve
into the timing of processes or whether they operate sequentially or concurrently.

These diagrams are valuable for visually mapping the data flow in a business information
system. They outline the processes involved in transferring data from input sources to file
storage and report generation. DFDs can be categorized into logical and physical representations.
The logical DFD focuses on the flow of data to achieve specific business functionalities, while
the physical DFD delves into the actual implementation of this logical data flow. In essence,
DFDs serve as a powerful tool for understanding and communicating the intricacies of data
movement within a system.

3.3.2 Uses of DFD’s

1. Boundary Definition: DFDs establish the boundary of the business or system domain
under investigation, delineating the scope of analysis activity.

2. Identification of External Entities: They identify external entities and their data
interfaces that interact with the processes of interest within the system.

3. Stakeholder Agreement: DFDs are a useful tool for securing stakeholder agreement,
often involving sign-off on the project scope.

4. Process Breakdown: They assist in breaking down complex processes into sub-
processes, facilitating a more detailed and focused analysis.

5. Logical Information Flow: DFDs illustrate the logical flow of information within the
system, depicting how data moves through different processes.

6. Physical System Construction: They contribute to determining the requirements for the
physical construction of the system based on the logical data flow.

7. Simplicity of Notation: DFDs utilize a simple notation that is easy to understand, aiding
in the clear representation of complex information.

8. Manual and Automated System Requirements: DFDs help establish both manual and
automated system requirements, providing insights into how processes should be
executed.

In essence, Data Flow Diagrams play a multifaceted role in system analysis, offering a visual
representation of data flow and interactions within a system while aiding in project scoping,
stakeholder communication, and detailed process analysis.

3.3.3 Level of Abstraction

To construct an accurate understanding of a subject, it is imperative to utilize various levels of abstraction. This necessity arises from the inherent relationship between an object and an
observer or a subject and its learner. The transition from one level of abstraction to the next
involves qualitative changes and signifies a progression of form and formative principles.

In the realm of software engineering, Data Flow Diagrams (DFD) serve as a tool to represent
systems at different levels of abstraction. Higher-level DFDs are partitioned into lower levels,
providing more detailed information and functional elements. The levels in DFD are typically
denoted as 0, 1, 2, or beyond. Specifically, we will focus on two main levels in data flow
diagrams: 0-level DFD and 1-level DFD.

Level-0: Level 0, often referred to as a Context Diagram, serves as a fundamental overview of the entire system or process under analysis or modelling. This diagram provides a quick and
comprehensible snapshot, presenting the system as a single, high-level process and illustrating its
connections to external entities.

The primary purpose of the Context Diagram is to offer an at-a-glance view that can be easily
understood by a broad audience, including stakeholders, business analysts, data analysts, and
developers. Its simplicity and clarity make it a valuable communication tool, facilitating a shared
understanding of the system's context and interactions.

The Level-0 DFD describes the overall process of the project: we take the dataset file as input, pre-process the data, and then apply different classifiers to predict the result.

Level 1:

The exploration of the context-level Data Flow Diagram (DFD) is followed by the creation of a
Level 1 DFD, delving into more detailed aspects of the modelled system. The Level 1 DFD
illustrates how the system is subdivided into sub-systems or processes, each handling specific
data flows to or from external agents. Together, these sub-systems collectively provide the
complete functionality of the overall system.

Furthermore, the Level 1 DFD identifies internal data stores that are crucial for the system to
effectively perform its functions. It illustrates the flow of data between various inputs of the
system, offering a more granular understanding of how information is processed and exchanged
within the different components of the system. Essentially, the Level 1 DFD provides a detailed
representation of the system's internal processes and data flows, offering insights into its
operational intricacies.

3.4 Summary

This chapter provides a concise introduction to the system design process, outlining the
methodologies essential for system development. It addresses various types of design processes
applied in real-world scenarios and explores different system architectures. The proposed model
elucidates the precise workings of the system.

The chapter employs Data Flow Diagrams (DFDs) at two distinct levels of abstraction to
illustrate the proposed model. DFDs serve as graphical representations, summarizing the flow of
data in various process levels. The focus is on articulating the details of the proposed system and
comparing it with the existing system. Emphasis is placed on how the proposed system brings
about a reduction in complexity, cost, and enhances overall system performance. In essence, the
chapter delves into the design intricacies, providing insights into the envisioned system and its
potential improvements over the current system.

CHAPTER 4: METHODOLOGY

The goal of this methodology is to provide a clear and concise summary of the US Census
Income data, which contains information about the income levels of people in the United States.
This synopsis will help readers quickly understand key insights from the dataset. We start by gathering the US Census Income data, which typically includes information about individuals
and their income. This data is usually collected through surveys and questionnaires.

4.1 Dataset Description
The US Census Income dataset, commonly known as "Census Income" or "Adult Income," is a
widely utilized resource in the realms of machine learning and data analysis. Its primary
objective is to predict whether an individual's income surpasses or falls below a specified
threshold, typically set at $50,000 annually. The dataset comprises a diverse set of features,
encompassing numerical and categorical attributes that offer insights into various aspects of an
individual's life, including age, education, employment, marital status, and more. One set of
features includes demographic details such as age, education level, and marital status. For
instance, education is represented both categorically and numerically, providing a comprehensive
view of an individual's highest level of education. Another set of features delves into
employment-related aspects, classifying work situations, occupations, and the number of hours
worked per week. Additionally, personal attributes like race, gender, capital gains, losses, and
native country contribute to the dataset's richness. The target variable, "Income," is binary,
indicating whether an individual earns above or below $50,000, encapsulating the essence of the
predictive task associated with this valuable dataset.

We now prepare the dataset by removing unwanted noise such as missing values and duplicate rows, then perform analysis with different graphs and charts, follow this with feature engineering and pre-processing, and finally build the classification model with the help of different machine learning algorithms.

4.2 Data Acquisition and Preparation


Before proceeding with dataset preparation, a crucial step involves checking for missing values, as their presence can compromise the authenticity of analysis and predictions. Using the Pandas isnull() function, we identify missing values and handle them by either dropping them or imputing them with statistical measures. Notably, in this real-world survey dataset containing approximately 48.8k instances, no columns exhibit missing or null values, ensuring data integrity.
Another essential check involves identifying and addressing duplicate rows, which can impact accuracy. In this dataset, no duplicate rows are observed. However, 52 instances with missing values marked as "?" in categorical columns are detected. To enhance accuracy, these instances are subsequently dropped, and any remaining missing categorical values are imputed with the mode for a more robust dataset.
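
For illustration, a minimal Pandas sketch of these cleaning steps might look as follows (the file name adult.csv and the exact handling of "?" values are assumptions, not the project's actual code):

import pandas as pd

# Load the Census Income data (file name assumed for illustration).
df = pd.read_csv("adult.csv")

# Check for missing values and duplicate rows.
print(df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())
df = df.drop_duplicates()

# The dataset marks unknown categorical values with "?"; treat them as missing
# and impute the remaining gaps with the column mode (dropping is an alternative).
df = df.replace("?", pd.NA)
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].fillna(df[col].mode()[0])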

4.3 Feature Selection and Exploratory Data Analysis


The primary challenge in this dataset lies in the judicious selection of the most influential
features among the 14 available, and the subsequent analysis of how these features impact the
target variable. Data exploration, a fundamental step in data analysis, is imperative before
venturing into machine learning model creation. This initial exploration involves summarizing
and comprehending the dataset statistically, providing invaluable insights. Statistical analysis
facilitates the understanding of inter-feature relationships, variation within each feature,
distribution characteristics, and the contribution of input variables to the target variable.
In our project, given the substantial number of data examples and features, conducting
exploratory data analysis becomes a crucial precursor to accurate predictions. The inclusion of
various graphs and plots in our project serves the purpose of unraveling comprehensive insights
and information encapsulated within the dataset. This common yet pivotal approach aids in summarizing the data, enabling a more informed understanding before delving into predictive
modeling.

Our initial analysis focuses on the target variable, where we employ a pie chart to visually
represent the distribution of individuals based on their annual income. The chart allows us to
observe the proportion of individuals earning less than or equal to $50k and those earning more
than $50k annually, providing a clear snapshot of the income distribution within the dataset.

Figure 1: Pie chart to display the amount (percentage) distribution of the target variable.

The pie chart above illustrates that a significant majority, approximately 76.1%, of individuals in
the total population have an annual income less than or equal to $50k. In contrast, around 23.9%
of individuals fall into the category of annual income exceeding $50k. This breakdown provides
a clear visual representation of the income distribution within the dataset.
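
For illustration, the pie chart in Figure 1 could be generated along the following lines (a sketch that assumes the cleaned DataFrame df from the earlier snippet, with the target column named "income"):

import matplotlib.pyplot as plt

# Share of each income class in the target column (column name assumed).
counts = df["income"].value_counts()

plt.pie(counts, labels=counts.index, autopct="%1.1f%%", startangle=90)
plt.title("Distribution of annual income (<=50K vs >50K)")
plt.show()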

In this dataset, characterized by an extensive list of categorical variables and a limited set of
numerical variables, our initial focus is on analyzing the numerical data. To achieve this, we
employ a Heatmap featuring a correlation matrix. This matrix provides insights into the
interrelationships among the numerical features, offering a visual representation of their
correlations. The Heatmap not only enhances our understanding of the dataset's numerical
dynamics but also presents this information in a visually intuitive manner.

Figure 2: Heatmap of the correlation matrix
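
The correlation heatmap itself can be produced along these lines with Pandas and Seaborn (a sketch only; the selection of numerical columns is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numerical features only.
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of the numerical features")
plt.show()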

From the above plot, we obtain some interesting insights about the dataset:
1. hours-per-week and educational-num show the strongest positive correlation among the numerical features, with a correlation value of 0.14.
2. educational-num and capital-gain have a moderate positive correlation, with a value of 0.13.
3. Finally, hours-per-week and fnlwgt are negatively correlated.

After this, we plotted distribution plots for all these numerical variables to observe the skewness of the data, and then box plots to check for the presence of outliers; here we observe that the continuous variables do contain outliers. Having successfully analysed the continuous data, we then analysed our categorical (discrete) data with different plots.

In the categorical data, we observe that most variables contain many distinct categories; only a few variables have just two to five categories. We analysed all these variables with count plots and obtained some interesting insights, including the following:

➢ The prevailing trend indicates that a significant portion of individuals in the dataset have pursued education beyond high school.
➢ It is evident that the count of males surpasses that of females in both income categories, <=50k and >50k annually.
➢ The numbers of individuals earning <=50k and >50k appear to be roughly equivalent among those in Exec-managerial occupations.

4.4 Feature Engineering & Selection


In this section, our primary objective is to adequately prepare the data for modeling, addressing
aspects such as missing values, categorical data, skewness, and outliers. Given the absence of
null values in our dataset, we do not need to handle missing values. However, since our dataset
includes categorical variables, particularly the target variable, we employ a technique called
Label Encoding. This method is instrumental in converting categorical data into numerical
values, facilitating the modeling process.
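
As an illustration, the label encoding step might be expressed as follows with scikit-learn (a sketch, assuming the DataFrame df from the earlier steps):

from sklearn.preprocessing import LabelEncoder

# Encode every categorical (object) column, including the target, as integers.
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])
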
The next important issue observed during our analysis is that the continuous variables contain outliers, and since the distributions of these variables are skewed, we handle the outliers using the IQR (Inter-Quartile Range) method. Mathematically, the IQR is the difference between the 75th and the 25th percentile (Q3 - Q1). Using this method, we handled the outliers present in our dataset.
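
A sketch of the IQR-based treatment described above is given below; here outliers are clipped to the IQR fences, and the column names are assumptions:

# Clip each continuous column to its interquartile-range fences.
numeric_cols = ["age", "fnlwgt", "educational-num", "capital-gain",
                "capital-loss", "hours-per-week"]  # assumed column names

for col in numeric_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    df[col] = df[col].clip(lower, upper)
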
After handling outliers, we select the best features and drop the irrelevant ones. Since we have many discrete (categorical) variables, we use the Chi-square test to check the relationship between each categorical input variable and the target variable. The Chi-square test is a hypothesis test that helps us understand the relation between two categorical variables: the null hypothesis states that there is no relation between the two variables, whereas the alternate hypothesis states that there is some relationship between them. Using this test, we can simply drop those features that show no significant relationship with the target variable. After performing the test, we observe that four variables have no relation with the target variable, so we drop them and move ahead to modeling with the remaining features.
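
A hedged sketch of this chi-square check using SciPy is shown below (the feature names and the 0.05 significance threshold are assumptions):

import pandas as pd
from scipy.stats import chi2_contingency

# Categorical features to test against the target (names assumed).
categorical_features = ["workclass", "education", "marital-status", "occupation",
                        "relationship", "race", "gender", "native-country"]

to_drop = []
for col in categorical_features:
    contingency = pd.crosstab(df[col], df["income"])
    _, p_value, _, _ = chi2_contingency(contingency)
    if p_value > 0.05:  # fail to reject H0: no relation with the target
        to_drop.append(col)

df = df.drop(columns=to_drop)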

4.5 Data Pre-processing (Splitting & Balancing the dataset)


After pre-processing, the next step is to balance the data. As we observed during the analysis, the data is quite imbalanced in nature, so it is important to balance it, and to do so we use SMOTE (Synthetic Minority Oversampling Technique). SMOTE creates new synthetic observations without replicating existing ones, and to create those synthetic observations it uses a KNN (K-Nearest Neighbour) technique, which makes it more optimized and stable than many other sampling techniques. However, in this case SMOTE did not change our results, so it was not included in the final pipeline.

Figure 3: Code snippet for balancing the data.
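
Since the snippet in Figure 3 is reproduced only as an image, a minimal sketch of the SMOTE step with the imbalanced-learn library is given below (variable names are illustrative; as noted above, this step was not kept in the final pipeline):

from imblearn.over_sampling import SMOTE

X = df.drop(columns=["income"])
y = df["income"]

# Oversample the minority class with synthetic (KNN-based) observations.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print(y_resampled.value_counts())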

Finally, after these pre-processing steps, we split the data into training and test sets in an 80:20 ratio. We now come to the final stage of this section, where we discuss the machine learning algorithms used for building the model.
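
The 80:20 split itself can be expressed as follows (a sketch; stratifying on the target is an assumption):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)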

4.6 Proposed Machine Learning Algorithms for Classification


Now that the data is ready and split into training and test sets, in this final section of the methodology we discuss the machine learning algorithms used for this binary classification task (a compact training sketch is given after the descriptions below).

1. Logistic Regression Classifier: The first algorithm we used for classification is Logistic Regression, a statistical method used for binary
classification, predicting the probability of an observation belonging to a specific class.
Despite its name, it's employed for classification rather than regression. The algorithm
models the relationship between the independent variables and the log-odds of the
probability of a particular outcome. It uses the logistic function, also known as the
sigmoid function, to constrain predictions between 0 and 1. During training, the model
adjusts its parameters through optimization techniques like gradient descent to minimize
the difference between predicted probabilities and actual class labels.

2. Decision Tree: This is a highly interpretable model that can be used for both classification and regression tasks. The tree is made up of three kinds of nodes: a root node, internal (child) nodes, and leaf nodes. The best feature is selected as the root node; to select it, we calculate the information gain for each feature, which can be measured using either entropy or Gini impurity. A key hyperparameter of the algorithm is the tree depth, and if the depth is too high it may lead to overfitting.

3. Random Forest: Random Forest is one of the most popular ensemble learning methods. It is a bagging technique in which a number of decision trees are trained in parallel on different subsets of the data. These subsets are created by sampling with replacement using a method called bootstrapping, a statistical technique that estimates a statistic from samples of the data and can be used to reduce error. On these subsets we train strong classifiers (overfitted models, i.e., deep decision trees), and at aggregation time the variance is reduced by a reasonable amount, finally yielding a generalized model. The best aspect of this model is that all the classifiers are trained in parallel, so we can train a bunch of classifiers without a significant increase in computational or time cost.

4. XG Boost Classifier: XG Boost is another very popular and effective ensemble learning technique, but unlike Random Forest, XG Boost is a boosting technique. Boosting builds models from individual weak learners and, unlike bagging, is a sequential learning process. XG Boost is a well-optimized boosting technique, which is why it performs comparably to certain deep learning algorithms. It uses L1 and L2 regularization, which prevents overfitting, and one of its best aspects is parallelization, which makes it a very fast algorithm.

5. Adaptive Boosting: Adaptive Boosting is an ensemble learning method in machine
learning that combines the strengths of multiple weak learners to create a robust and
accurate predictive model. The algorithm operates iteratively, assigning varying weights
to instances in the dataset based on their classification accuracy. During each iteration, it
focuses on the misclassified instances from the previous round, adjusting their weights to
prioritize their correct classification in the next iteration. Weak learners, typically simple
models with slightly better than random accuracy, are sequentially added to the ensemble.
The final model aggregates the predictions of these weak learners, with each contributing
a weighted vote.
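
As referenced above, a compact sketch of training these five classifiers with scikit-learn and XGBoost is given below (the specific hyperparameter values shown are illustrative assumptions, not the tuned settings of the project):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Train each classifier and report its accuracy on the held-out test set.
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {accuracy:.3f}")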

CHAPTER 5: EXPERIMENTAL RESULTS

5.1 Result

Different evaluation metrics are available to help us evaluate the performance of a model. For this classification task, we used different classifiers: Logistic Regression, Decision Tree, Random Forest, XG Boost, and Ada Boost. Since it is a binary classification task, we report the accuracy score, f1-score, precision score, and recall score, and we also visualize the confusion matrix, which helps us understand how much data has been classified incorrectly. Let us discuss each metric one by one.

Figure 4: Code snippet for the XG Boost classifier.
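
Since the snippet in Figure 4 appears only as an image, a hedged reconstruction of fitting the XG Boost classifier and plotting its confusion matrix (shown in Figure 5) might look like this:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier

xgb_clf = XGBClassifier(eval_metric="logloss", random_state=42)
xgb_clf.fit(X_train, y_train)
y_pred = xgb_clf.predict(X_test)

# Heatmap of correct and misclassified counts on the test set.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["<=50K", ">50K"], yticklabels=["<=50K", ">50K"])
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.show()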

Figure 5: Confusion matrix for the XGB classifier.

The confusion matrix gives us a visual representation of the total number of correctly classified points and the number of misclassified points. In this heatmap, we can observe that 12,190 points are classified correctly, while 2,380 points are misclassified.

1. Accuracy Score: It is calculated as the total number of correctly classified points divided by the total number of points in the test set. We obtain the maximum accuracy with the XG Boost classifier (83%), followed by Logistic Regression (82%), and then Random Forest, Ada Boost, and the Decision Tree classifier (81%).

2. Precision Score: Precision is calculated as the ratio of correctly classified positive points (True Positives) to the total number of points predicted as positive (True Positives + False Positives).

3. Recall Score: Recall tells us what proportion of the positive class got correctly classified. It is calculated as the number of true positive points divided by the sum of true positives and false negatives.

4. F1-score: Precision focuses more on false positives, whereas recall focuses on false negatives. If we want to minimize both false positives and false negatives, we can use the f1-score, which is calculated as the harmonic mean of precision and recall. For this dataset, the XG Boost classifier again achieves the maximum f1-score among the evaluated models.

5. Confusion Matrix: The confusion matrix gives us a visual representation of the total number of correctly classified points and the number of misclassified points. We have plotted the confusion matrix of the best classifier (XG Boost) in Figure 5 above; a sketch of how these metrics can be computed is shown below.
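
For reference, the scores discussed above can be obtained with scikit-learn as in the following sketch (using the XG Boost predictions from the earlier snippet):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-score :", f1_score(y_test, y_pred))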

5.2 Deployment

Our project has seamlessly transitioned into the operational phase following a successful
deployment on Flask, a powerful and versatile web framework built on Python. Leveraging
Flask's inherent capabilities, we've established a robust hosting environment that not only
ensures the reliability of our application but also allows for scalability as user interactions grow.
With this deployment, the project now offers a user-friendly and accessible web interface,
enabling a seamless and engaging experience for our users as they interact with the various
features and functionalities we've implemented.
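
A minimal sketch of the kind of Flask endpoint used for such a deployment is given below (the route names, the saved-model file model.pkl, and the feature handling are assumptions for illustration, not the project's actual code):

import pickle

import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the trained classifier saved earlier (file name assumed).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/")
def home():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Collect the encoded feature values submitted through the web form.
    features = np.array([float(x) for x in request.form.values()]).reshape(1, -1)
    prediction = model.predict(features)[0]
    label = ">50K" if prediction == 1 else "<=50K"
    return render_template("index.html", prediction_text=f"Predicted income: {label}")

if __name__ == "__main__":
    app.run(debug=True)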

Figure 6: Result (predicting income less than or equal to 50K).

Figure 7: Result (predicting income greater than 50K).

CHAPTER 6: CONCLUSION AND FUTURE SCOPE

The project shows an in-depth implementation of different algorithms, which we compared on the basis of accuracy score, precision score, recall score, and f1-score. The work shows how tree-based and ensemble-based techniques are implemented to predict whether the income of individuals in the US is <=50K or >50K. Out of the five algorithms, we observe that XG Boost outperformed all the others and showed the best results. The Logistic Regression classifier also performed well, with a high accuracy score, and may be used as the second-best model for this problem.

The proposed model was tested on the available dataset. A total of 48,842 records were used to analyse the data and build the model. The work demonstrates the different data pre-processing techniques, along with visualization techniques, that help us understand the data better and identify patterns in it. The main aim of this project was to build a model that gives accurate and stable performance, and we achieved an accuracy of more than 82%.

Next, we can try to improve this performance, potentially beyond 90%, by performing hyperparameter optimization, since these ensemble methods have many hyperparameters and tuning them offers a chance of reaching an accuracy above 90%. Alternatively, we could try a neural network, a deep learning approach, to further enhance performance.

Figure 8: Bar plot comparing accuracy scores.

In the comparative analysis of machine learning models, XG Boost emerged as the standout
performer, showcasing the highest accuracy score of 83.665%. This signifies its superior capability
in accurately predicting outcomes for the specific task at hand. The robust performance of XG
Boost positions it as the top-performing algorithm among the evaluated models, underscoring its
effectiveness in handling the complexities of the given dataset and task requirements. This result
suggests that XG Boost is a promising choice for applications requiring precise and reliable
predictions.


APPENDICES

Project Report Overview
This comprehensive project report encapsulates our efforts to predict individuals' income levels,
a pivotal pursuit in the realm of socio-economic analysis. The report provides a detailed account
of our methodologies, findings, and implications for future research and practical applications
within the domain of income prediction.

Research Scope and Objectives


Our primary objective was to develop predictive models capable of estimating individuals'
income levels based on various socio-economic factors. The scope involved employing advanced
machine learning algorithms, data analysis, and predictive modeling techniques, including
Random Search, AdaBoost, XGBoost, Logistic Regression, Decision Trees, and Random
Forests, to create robust frameworks for this critical task.

Methodologies and Approaches


The project involved a thorough analysis of the US Census Income dataset, exploring diverse
features and patterns inherent in individuals' socio-economic backgrounds. Leveraging
sophisticated machine learning algorithms such as Logistic Regression, Decision Trees, Random
Forests, AdaBoost, and XGBoost, with the optimization provided by Random Search, we
constructed predictive models capable of discerning income indicators and forecasting economic
achievements.

Key Findings and Insights


Through rigorous evaluation and validation processes, our models demonstrated significant
predictive accuracy. Notably, the XG Boost model emerged as a standout performer, achieving
an accuracy of over 83% in predicting income levels, while Random Forest also exhibited robust
performance.

Implications and Future Directions


Our findings hold substantial implications for policymakers, economists, and social analysts. The
predictive models developed in this project, enhanced by techniques such as Random Search,
AdaBoost, and XGBoost, provide valuable insights into estimating individuals' income levels.
This offers the potential for informed decision-making and targeted policy interventions in the
socio-economic landscape.

