0% found this document useful (0 votes)
34 views22 pages

ml2 Paper

The document is a term paper report submitted by Sujeet Kumar Behera for the Bachelor of Technology degree in Computer Science Engineering at Lovely Professional University. It focuses on the development of a personal finance machine learning model, detailing the objectives, methodology, and expected outcomes of the project. The report includes sections on theoretical background, hardware and software requirements, and a structured approach to implementing the machine learning model for financial management.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views22 pages

ml2 Paper

The document is a term paper report submitted by Sujeet Kumar Behera for the Bachelor of Technology degree in Computer Science Engineering at Lovely Professional University. It focuses on the development of a personal finance machine learning model, detailing the objectives, methodology, and expected outcomes of the project. The report includes sections on theoretical background, hardware and software requirements, and a structured approach to implementing the machine learning model for financial management.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Annexure-I

A Term Paper Report


Submitted in partial fulfilment of the requirements for the award of degree of
Bachelor of Technology
(Computer Science Engineering)
Submitted to

LOVELY PROFESSIONAL UNIVERSITY


PHAGWARA ,PUNJAB
From 1st August, 2024 to October 25, 2024
SUBMITTED BY
Name of student :Sujeet Kumar Behera
Registration number:12105351
Faculty:Sajjad Manzoor Mir

1
Annexure-II: Student Declaration

To whom so ever it may concern

I, Sujeet Kumar Behera, 12105351, hereby declare that the work done by me on " Personal
finances Ml model" from August 1, 2024 to October 25, 2024, is a record of original work for
the partial fulfilment of the requirements for the award of the degree, Bachelor of Technology.
Name of the student: Sujeet Kumar Behera

Registration Number: 12105351

Dated: 22 October 2024

ACKNOWLEDGEMENT

Primarily I would like to thank God for being able to learn a new technology. Then I would
like to express my special thanks of gratitude to the teacher and instructor of the course
Machine Learning who provided me the golden opportunity to learn a new technology.
I would like to also thank my own college Lovely Professional University for offering such a
course which not only improve my programming skill but also taught me other new
technology.
Then I would like to thank my parents and friends who have helped me with their valuable
suggestions and guidance for choosing this course.
Finally, I would like to thank everyone who have helped me a lot.

2
SUPERVISOR’S CERTIFICATE
This is to certify that the work reported in the B. Tech Dissertation/dissertation proposal
entitled “Personal finances ML model”, submitted by Sujeet kumar Behera at Lovely
Professional University, Phagwara, India is a bonafide record of his original work carried out
under my supervision. This work has not been submitted elsewhere for any other degree.

Signature of Supervisor
Sajjad Manzoor Mir

3
[Link] Contents Page

1 TITLE 1
2 STUDENT DECLARATION 2

3 ACKNOWLEGDEMENT 2
4 TABLE OF CONTENT 3
5 ABSTRACT 4

6 OBJECTIVE 4
7 INTRODUCTION 5

8 THEORETICAL BACKGROUND 7

9 HARDWARE & SOFTWARE 9

10 METHODOLOGY 9

11 RESULTS 21

12 SUMMARY 21

13 COLCLUSION 22

14 BIBLIOGRAPHY 23

4
Abstract
Personal finances represent the individual or familial funds that one autonomously
oversees. Mastery in managing personal finances necessitates specialized training.
This project aims to devise a personal finance simulator and establish a machine
learning-based system to discern optimal financial strategies for individuals.
Personal finance is a multifaceted domain that encompasses the management of
individual or familial financial resources. In today's complex economic landscape,
effective management of personal finances is paramount for achieving financial
stability, security, and long-term prosperity. This abstract explores the fundamental
principles and challenges of personal finance, including budgeting, saving, investing,
debt management, and risk mitigation. It delves into the importance of financial
literacy and education in empowering individuals to make informed decisions about
their money. Furthermore, it discusses emerging technologies and tools, such as
mobile apps and online platforms, that facilitate financial management and planning.

Objectives of model
In the future, personal finance is expected to undergo significant transformations
driven by emerging trends and evolving consumer demands. One prominent trend is
the continued digitalization of financial services, with a growing integration of
technology and fintech solutions. This integration is anticipated to enhance
accessibility, automation, and personalization of financial products and services.
Additionally, there is a rising demand for holistic financial wellness programs,
emphasizing education, budgeting, saving, and retirement planning, offered by both
employers and financial institutions. Another notable trend is the increasing interest in
impact investing and consideration of environmental, social, and governance (ESG)
criteria in investment decisions. Personalized financial advice is also expected to
become more prevalent, facilitated by advanced data analytics, machine learning,
and artificial intelligence technologies.

5
Introduction (Personal finances )
As financial matters increasingly pervade our lives, it's crucial to stay abreast of
current economic trends and market dynamics. Equally important is the mastery of
personal finance management, an essential skill set for individuals and families alike.
Personal finances encompass one's own capital, managed autonomously,
necessitating dedicated training in effective financial management techniques.
incorporating the applicable criteria that follow.
Various scientific and applied sources worldwide extensively discuss a wide range
of approaches to personal finance management and planning. Typically, this issue is
analyzed through the lens of effectively managing and ensuring the safety of personal
fund liquidity.

Theoretical Background
What is Machine Learning?
Machine learning is a subfield of artificial intelligence (AI) that uses algorithms trained on
data sets to create self-learning models that are capable of predicting outcomes and
classifying information without human intervention. Machine learning is used today for a wide
range of commercial purposes, including suggesting products to consumers based on their
past purchases, predicting stock market fluctuations, and translating text from one language
to another. In common usage, the terms “machine learning” and “artificial intelligence” are
often used interchangeably with one another due to the prevalence of machine learning for
AI purposes in the world today. But, the two terms are meaningfully distinct. While AI refers
to the general attempt to create machines capable of human-like cognitive abilities, machine
learning specifically refers to the use of algorithms and data sets to do so.
1)Supervised Learning Models:

Supervised learning is a type of machine learning where the model learns a mapping between input
features and output labels based on labeled training data. In supervised learning, the algorithm learns
from a dataset that contains input-output pairs, where the inputs are the features or attributes, and the
outputs are the corresponding labels or target variables. The goal is to learn a mapping function from
the input variables to the output variable.

Regression Models: Predicts continuous values based on input features. Examples include Linear
Regression, Polynomial Regression, and Support Vector Regression.

Classification Models: Predicts class labels or discrete outcomes. Examples include Logistic
Regression, Decision Trees, Random Forests, and Support Vector Machines.

2) Unsupervised Learning Models:

Clustering Models: Groups similar data points together based on some similarity metric. Examples
include K-Means Clustering.

6
3)Semi-Supervised Learning Models:

Combines Supervised and Unsupervised Learning: Uses both labeled and unlabeled data for training.
Examples include Self-training and Co-training algorithms.

MODELS USED IN PROJECTS:


Linear Regression: A simple regression model that models the relationship between
a dependent variable and one or more independent variables by fitting a linear
equation to observed data.
Logistic Regression: A classification model used to model the probability of a
binary outcome based on one or more independent variables. Despite its name,
logistic regression is a linear model used for classification tasks.
Decision Trees: A non-linear model that makes decisions based on a set of rules
learned from the data. Decision trees partition the feature space into regions, with
each region representing a decision.
Random Forests: An ensemble learning method that builds multiple decision trees
during training and combines their predictions through averaging or voting to improve
predictive performance and reduce overfitting.
Support Vector Machines (SVM): A supervised learning algorithm used for
classification and regression tasks. SVMs find the hyperplane that best separates
the data points into different classes or predicts continuous values.

What is KNN?
The K-Nearest Neighbors (KNN) algorithm is a supervised machine learning method
employed to tackle classification and regression problems. Evelyn Fix and Joseph
Hodges developed this algorithm in 1951, which was subsequently expanded by
Thomas Cover. The article explores the fundamentals, workings, and implementation
of the KNN algorithm.
What is the K-Nearest Neighbors Algorithm?
KNN is one of the most basic yet essential classification algorithms in machine
learning. It belongs to the supervised learning domain and finds intense application
in
pattern recognition, data mining, and intrusion detection.
It is widely disposable in real-life scenarios since it is non-parametric, meaning it
does
not make any underlying assumptions about the distribution of data (as opposed to
other algorithms such as GMM, which assume a Gaussian distribution of the given
data). We are given some prior data (also called training data), which classifies
coordinates into groups identified by an attribute.

7
HARDWARE & SOFTWARE REQUIREMENTS
HARDWARE REQUIREMENTS

GPU (Graphics Processing Unit):


Deep learning models, especially large ones like convolutional neural networks (CNNs) or recurrent
neural networks (RNNs), benefit significantly from GPU acceleration.

NVIDIA GPUs are the most commonly used for deep learning due to their robust support for
frameworks
like TensorFlow and PyTorch.

The choice of GPU depends on your budget and requirements. High-end GPUs like NVIDIA GeForce
RTX
series or NVIDIA Tesla series are popular choices for deep learning workstations and servers.

CPU (Central Processing Unit):


Although not as crucial as GPUs for deep learning, CPUs are still important for tasks like data
preprocessing, model deployment, and handling non-GPU accelerated operations.
Multi-core CPUs with high clock speeds are preferable to speed up data processing tasks.

SOFTWARE REQUIREMENTS
Python:
Most deep learning frameworks are Python-based, so a working Python installation (preferably the
latest
version) is necessary.

Package management tools like pip or conda are useful for installing and managing Python packages
and
dependencies.

Development Environment:
Set up a development environment with integrated development environments (IDEs) like PyCharm,
Visual Studio Code, or Jupyter Notebooks for coding and experimentation.
Containerization tools like Docker can help manage project dependencies and ensure reproducibility
across different environments

METHODOLOGY
Importing libraries
1)import pandas as pd -pandas is a fast, powerful, flexible and easy to use open source
data analysis and manipulation tool, built on top of the Python programming language.
2)import numpy - Fast and versatile, the NumPy vectorization, indexing, and broadcasting
concepts are the de-facto standards of array computing today.

8
3)import ploty -Plotly's Python graphing library makes interactive, publication-quality
graphs. Examples of how to make line plots, scatter plots, area charts, bar charts, error bars,
box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
4)import matplotlib - Matplotlib is a comprehensive library for creating static, animated, and
interactive visualizations in Python. Matplotlib makes easy things easy and hard things
possible.
5)import warnings - Warning messages are typically issued in situations where it is useful
to alert the user of some condition in a program, where that condition (normally) doesn’t
warrant raising an exception and terminating the program.
6) from sklearn.model_selection import train_test_split - This function allows you to
easily split your dataset into training and testing sets, which is crucial for evaluating the
performance of your machine learning models.
7) from sklearn.linear_model import LinearRegression- Linear regression is a simple yet
powerful technique used for predicting a continuous target variable based on one or more
predictor variables.
8) from [Link] import mean_squared_error, r2_score-Imports two important
functions, mean_squared_error and r2_score, from the [Link] library in Python.
These functions are commonly used for evaluating the performance of regression models.
9) from sklearn.linear_model import LogisticRegression-Imports the LogisticRegression
class from the sklearn.linear_model library in Python. This class implements the Logistic
Regression algorithm, which is a popular machine learning method used for classification
problems.
10) from [Link] import accuracy_score, classification_report,
confusion_matrix- Imports three beneficial functions for evaluating the performance of
classification models in Python, all from the [Link] library. These functions are
instrumental in assessing how well your classification model is performing.
accuracy_score: This function calculates the overall accuracy of your classification model.
It determines the proportion of predictions that your model got correct. It's calculated by
dividing the number of correctly classified samples by the total number of samples. A higher
accuracy score indicates better model performance.

classification_report: This function provides a more comprehensive assessment of your


model's performance for each class. It presents metrics like precision, recall, F1-score, and
support for each class label.

11) from [Link] import DecisionTreeRegressor- Imports the DecisionTreeRegressor


class from the [Link] library in Python. This class is used to create decision tree
regression models, which are a type of machine learning model well-suited for predicting
continuous target variables.

9
12) from [Link] import RandomForestRegressor-
Imports the RandomForestRegressor class from the [Link] library in Python. This
class is used to create random forest regression models, which are a powerful ensemble
machine learning technique for regression tasks.
13) from [Link] import SVR- Imports the SVR class from the [Link] library in
Python. This class is used for implementing Support Vector Regression (SVR), which is a
powerful technique for regression problems from the world of Support Vector Machines
(SVMs).

APPROACH
Phase 1: Data Collection and Preprocessing (Month 1):

Gather financial data including bank transactions, investment portfolios, income sources,
expenses, and credit card statements from users through secure APIs or data integrations.
Preprocess the financial data to handle missing values, categorize transactions, identify
recurring expenses, and aggregate data into meaningful features for analysis.
Collect external data sources such as economic indicators, market trends, and financial
news to provide contextual information for financial decision-making.

Phase 2: Machine Learning Model Development (Month 2):


Develop ML models to analyze financial data and provide personalized recommendations for
budgeting, saving, investing, and debt management.
Implement NLP algorithms to understand user queries and provide relevant responses,
allowing users to interact with the assistant through natural language interfaces (e.g.,
chatbots, voice assistants).
Train the ML models using historical financial data and user interactions, optimizing model
parameters to maximize accuracy and relevance of recommendations

Phase 3: System Integration and Evaluation (Month 3):


Integrate the ML models into the intelligent personal finance assistant platform, allowing
seamless interaction with users across various devices and channels.
Develop a user-friendly interface or mobile application for users to access their financial data,
receive personalized recommendations, and track progress towards their financial goals.
Evaluate the performance of the personal finance assistant through user testing and
validation studies, assessing metrics such as accuracy of recommendations, user
satisfaction ratings, and adherence to financial goals.

10
Deploy the personal finance assistant in real-world settings, collaborating with financial
institutions, fintech companies, and consumer platforms to promote adoption and integration
into daily financial routines.

4. Expected Outcomes:
Development of an intelligent personal finance assistant leveraging ML techniques to
provide personalized financial guidance and automate routine tasks.
Empowerment of individuals to make informed financial decisions, achieve their financial
goals, and improve their financial well-being.
Potential applications in financial planning, wealth management, and consumer banking to
enhance customer engagement and satisfaction.

5. Resources Required:
Financial data sources including bank APIs, investment platforms, and third-party data
providers.
Computational resources for model training and evaluation (e.g., cloud-based servers).
Collaboration with financial experts, data scientists, and software engineers for system
development and validation.

6. Timeline:
Month 1: Data collection, preprocessing, and exploration.
Month 2: ML model development, and training.
Month 3: System integration, evaluation, and deployment.

11
Exploratory Data Analysis

import numpy as np
import pandas as pd
import os
for dirname, _, filenames in [Link]('/kaggle/input'):
for filename in filenames:
print([Link](dirname, filename))
import seaborn as sns
import [Link] as plt
df = pd.read_csv("/content/[Link]")

1) # Let's visualize the distribution of Total Household Income


[Link](figsize=(10, 6))
[Link](df['Total Household Income'], bins=30, kde=True)
[Link]('Distribution of Total Household Income')
[Link]('Total Household Income')
[Link]('Frequency')
[Link]()

12
2) # Visualizing the relationship between Total Food Expenditure and Total Household
Income
[Link](figsize=(10, 6))
[Link](x='Total Household Income', y='Total Food Expenditure', data=df)
[Link]('Total Food Expenditure vs Total Household Income')
[Link]('Total Household Income')
[Link]('Total Food Expenditure')
[Link]()

13
3) Distribution of Household Head Age
[Link](figsize=(10, 6))
[Link](df['Household Head Age'], bins=20, kde=True)
[Link]('Distribution of Household Head Age')
[Link]('Household Head Age')
[Link]('Frequency')
[Link]()

14
MACHINE LEARNING
1)Feature selections

import pandas as pd

from sklearn.model_selection import train_test_split

from [Link] import DecisionTreeClassifier

X = df[['Total Food Expenditure', 'Bread and Cereals Expenditure','Total Rice Expenditure', 'Meat Expenditure' 'Total Fish and
marine products Expenditure','Fruit Expenditure', 'Vegetables Expenditure''Restaurant and hotels Expenditure', 'Alcoholic
Beverages Expenditure', 'Tobacco Expenditure', 'Clothing, Footwear and Other Wear Expenditure', 'Housing and water
Expenditure', 'Education Expenditure','Household Head Age', 'Total Number of Family members', 'Members with age less than
5 year old','Members with age 5 - 17 years old', 'House Floor Area', 'Number of bedrooms', 'Number of Television', 'Number of
Cellular phone']]

y = df['Type of Household'] # Assuming 'Type of Household' is the target variable

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree Classifier

dt_classifier = DecisionTreeClassifier(random_state=42)

# Fit the model on the training data

dt_classifier.fit(X_train, y_train)

# Get feature importances

feature_importances = dt_classifier.feature_importances_

# Create a DataFrame to display feature importances

feature_importance_df = [Link]({'Feature': [Link], 'Importance': feature_importances})

feature_importance_df.sort_values(by='Importance', ascending=False, inplace=True)

# Display the top N features by importance

top_n = 10 # Number of top features to display

print("Top", top_n, "features by importance:")

print(feature_importance_df.head(top_n))

Top 10 features by importance:

Feature Importance

13 Household Head Age 0.145237

14 Total Number of Family members 0.118960

16 Members with age 5 - 17 years old 0.096319

15 Members with age less than 5 year old 0.078877

11 Housing and water Expenditure 0.044962

5 Fruit Expenditure 0.043983

10 Clothing, Footwear and Other Wear Expenditure 0.043794

12 Education Expenditure 0.042492

6 Vegetables Expenditure 0.041309

4 Total Fish and marine products Expenditure 0.03922

15
2)LINEAR REGRESSION
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from [Link] import mean_squared_error, r2_score

features = df[['Total Food Expenditure', 'Bread and Cereals Expenditure',


'Total Rice Expenditure', 'Meat Expenditure',
'Total Fish and marine products Expenditure',
'Fruit Expenditure', 'Vegetables Expenditure',
'Restaurant and hotels Expenditure',
'Alcoholic Beverages Expenditure', 'Tobacco Expenditure',
'Clothing, Footwear and Other Wear Expenditure',
'Housing and water Expenditure', 'Imputed House Rental Value',
'Medical Care Expenditure', 'Transportation Expenditure',
'Communication Expenditure', 'Education Expenditure',
'Miscellaneous Goods and Services Expenditure',
'Special Occasions Expenditure', 'Crop Farming and Gardening expenses',
'Total Income from Entrepreneurial Acitivites', 'Household Head Age',
'Total Number of Family members', 'Members with age less than 5 year old',
'Members with age 5 - 17 years old', 'Total number of family members employed',
'House Floor Area', 'House Age', 'Number of bedrooms',
'Number of Television', 'Number of CD/VCD/DVD',
'Number of Component/Stereo set', 'Number of Refrigerator/Freezer',
'Number of Washing Machine', 'Number of Airconditioner',
'Number of Car, Jeep, Van', 'Number of Landline/wireless telephones',
'Number of Cellular phone', 'Number of Personal Computer',
'Number of Stove with Oven/Gas Range', 'Number of Motorized Banca',
'Number of Motorcycle/Tricycle']]

target = df['Total Household Income']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)


model = LinearRegression()
[Link](X_train, y_train)
y_pred = [Link](X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Mean Squared Error: 9314005171.223476


R^2 Score: 0.871480868737331
Coefficients: [ 5.32581252e-01 -5.70678620e-02 -1.72753381e-01 1.22236995e-01
8.08661461e-01 3.74836547e-01 -1.03792001e+00 1.92292019e-01
5.52998249e-02 -3.39374536e-02 2.50634826e+00 6.63569309e-01
5.76781809e-01 7.38353636e-01 1.19903438e+00 4.06569099e+00
8.63597659e-01 2.78391521e+00 1.16252278e+00 4.63944657e-02
6.86203432e-01 5.53156546e+02 -5.42222318e+03 9.36880439e+03
1.28265028e+03 2.70172213e+04 3.78973689e+01 -4.78977218e+01
4.43899140e+03 1.48705584e+03 -9.83015970e+02 -3.33790241e+03
3.02957799e+03 -1.48191745e+03 2.08026636e+04 2.04822929e+04
-1.50885409e+04 7.67770979e+02 9.93547413e+03 1.32974185e+04
-5.00932214e+03 1.68024307e+03]
Intercept: -36612.75464024517

16
3) Logistic regressions
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from [Link] import accuracy_score, classification_report,confusion_matrix
features = df[['Total Food Expenditure', 'Bread and Cereals Expenditure',
'Total Rice Expenditure', 'Meat Expenditure',
'Total Fish and marine products Expenditure',
'Fruit Expenditure', 'Vegetables Expenditure',
'Restaurant and hotels Expenditure',
'Alcoholic Beverages Expenditure', 'Tobacco Expenditure',
'Clothing, Footwear and Other Wear Expenditure',
'Housing and water Expenditure', 'Imputed House Rental Value',
'Medical Care Expenditure', 'Transportation Expenditure',
'Communication Expenditure', 'Education Expenditure',
'Miscellaneous Goods and Services Expenditure',
'Special Occasions Expenditure', 'Crop Farming and Gardening expenses',
'Total Income from Entrepreneurial Acitivites', 'Household Head Age',
'Total Number of Family members', 'Members with age less than 5 year old',
'Members with age 5 - 17 years old', 'Total number of family members employed',
'House Floor Area', 'House Age', 'Number of bedrooms',
'Number of Television', 'Number of CD/VCD/DVD',
'Number of Component/Stereo set', 'Number of Refrigerator/Freezer',
'Number of Washing Machine', 'Number of Airconditioner',
'Number of Car, Jeep, Van', 'Number of Landline/wireless telephones',
'Number of Cellular phone', 'Number of Personal Computer',
'Number of Stove with Oven/Gas Range', 'Number of Motorized Banca',
'Number of Motorcycle/Tricycle'
]]
target = df['Total Household Income']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model


model = LogisticRegression()

# Fit the model on the training data


[Link](X_train, y_train)

# Predict on the testing data


y_pred = [Link](X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Additional evaluation metrics


print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

17
4)Decision trees
import pandas as pd

from sklearn.model_selection import train_test_split

from [Link] import DecisionTreeRegressor

from [Link] import mean_squared_error, r2_score

X = df[['Total Food Expenditure', 'Bread and Cereals Expenditure','Total Rice Expenditure', 'Meat Expenditure' 'Total Fish and
marine products Expenditure','Fruit Expenditure', 'Vegetables Expenditure''Restaurant and hotels Expenditure', 'Alcoholic
Beverages Expenditure', 'Tobacco Expenditure', 'Clothing, Footwear and Other Wear Expenditure', 'Housing and water
Expenditure', 'Education Expenditure','Household Head Age', 'Total Number of Family members', 'Members with age less than
5 year old','Members with age 5 - 17 years old', 'House Floor Area', 'Number of bedrooms', 'Number of Television', 'Number of
Cellular phone']]

y = df['Type of Household'] # Assuming 'Type of Household' is the target variable

X_train, X_test, y_train, y_test = train_test_split(features, target,␣

↪ test_size=0.2, random_state=42)

model = DecisionTreeRegressor(random_state=42)

[Link](X_train, y_train)

y_pred = [Link](X_test)

# Evaluate the model

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)

print("R^2 Score:", r2)

Mean Squared Error: 18294548455.50716

R^2 Score: 0.7475630052677185

18
5)Random forests
import pandas as pd
from sklearn.model_selection import train_test_split
from [Link] import RandomForestRegressor
from [Link] import mean_squared_error, r2_score
X = df[['Total Food Expenditure', 'Bread and Cereals Expenditure','Total Rice Expenditure',
'Meat Expenditure' 'Total Fish and marine products Expenditure','Fruit Expenditure',
'Vegetables Expenditure''Restaurant and hotels Expenditure', 'Alcoholic Beverages
Expenditure', 'Tobacco Expenditure', 'Clothing, Footwear and Other Wear Expenditure',
'Housing and water Expenditure', 'Education Expenditure','Household Head Age', 'Total
Number of Family members', 'Members with age less than 5 year old','Members with age 5 -
17 years old', 'House Floor Area', 'Number of bedrooms', 'Number of Television', 'Number of
Cellular phone']]
y = df['Type of Household'] # Assuming 'Type of Household' is the target variable
X_train, X_test, y_train, y_test = train_test_split(features, target,␣
↪ test_size=0.2, random_state=42)
# Initialize the Random Forest Regressor
model = RandomForestRegressor(random_state=42)
# Fit the model to the training data
[Link](X_train, y_train)
# Make predictions on the testing data
y_pred = [Link](X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R^2 Score:", r2)

Mean Squared Error: 9024413126.92843


R^2 Score: 0.8754767993030954

19
6)K -means clustering
import pandas as pd
from [Link] import KMeans
import [Link] as plt

# Assuming 'df' is your DataFrame containing features

X = df[['Total Food Expenditure', 'Bread and Cereals Expenditure','Total Rice Expenditure', 'Meat Expenditure' 'Total
Fish and marine products Expenditure','Fruit Expenditure', 'Vegetables Expenditure''Restaurant and hotels
Expenditure', 'Alcoholic Beverages Expenditure', 'Tobacco Expenditure', 'Clothing, Footwear and Other Wear
Expenditure', 'Housing and water Expenditure', 'Education Expenditure','Household Head Age', 'Total Number of
Family members', 'Members with age less than 5 year old','Members with age 5 - 17 years old', 'House Floor Area',
'Number of bedrooms', 'Number of Television', 'Number of Cellular phone']]

# Perform k-means clustering


kmeans = KMeans(n_clusters=3, random_state=42) # You can choose the number of clusters based on your data
cluster_labels = kmeans.fit_predict(features)

# Add cluster labels to the DataFrame


df['Cluster'] = cluster_labels

# Visualize the clusters (example for 2D data)


[Link](df['Total Food Expenditure'], df['Total Household Income'], c=df['Cluster'], cmap='viridis')
[Link]('Total Food Expenditure')
[Link]('Total Household Income')
[Link]('K-means Clustering')
[Link](label='Cluster')
[Link]()

from [Link] import silhouette_score

# Assuming 'df' is your DataFrame containing features and cluster labels


silhouette_avg = silhouette_score(features, cluster_labels)
print("Silhouette Score:", silhouette_avg)

Silhouette Score: 0.7608313629296152

20
RESULTS
Models ACCURACY
Linear regressions 87%
Logistic regressions 75%
Decision trees 74%
Random forests 87%
k-means clustering 76%

Summary
The benefits of ML over traditional methods as illustrated above
together with the existing but still limited number of ML applications
in finance suggest a still mostly untapped potential for future
research. However, it is unclear whether the usage of ML methods
will actually gain broad popularity in the finance community.
Furthermore, prospective users of ML need to know whether ML
applications can also reach the most prestigious journals of the
profession or if they tend to be published only in specialty journals.
Finally, the different application categories of ML described by our
taxonomy and the wide variety of research fields in finance make it
difficult to pinpoint exactly where the most promising applications
of ML in finance research lie. In this section, we give indicative
answers to these questions by systematically analysing the
existing finance literature that already uses ML methods. In
particular, we investigate the publication success of such papers
and how it differs by research field and application type. Our
results may not only indicate the future prospects of ML in finance
but also show where and how researchers can apply ML to
maximise its future potential.

21
Conclusion
ML applications in finance by analysing the ML papers published
in major finance journals. Over the last few years, there has been
a strong growth in the number of ML applications in finance, and
many of these applications reached the highest-ranked journals of
the profession. Our results suggest that ML may become even
more widespread in finance research in the coming years. They
also indicate a particularly large potential of applying ML to
unconventional data to construct superior and novel measures of
topics related to the field of corporate finance and governance.
The fields of behavioural and household finance may also offer a
mostly untapped potential for ML in future research.

BIBLIOGRAPHY

Kaggle
Geeksforgeeks
Upgrad

22

You might also like