
B.M.S COLLEGE OF ENGINEERING
(An Autonomous College under VTU, Belagavi)
Bull Temple Road, Bangalore - 560 019

A Project Report (2022-23)
on
"% Delivery Predictor using ML"


Submitted as a part of Alternate Assessment for the Elective course
MACHINE LEARNING
offered by
DEPARTMENT OF ELECTRONICS AND COMMUNICATIONS ENGINEERING

Submitted by:

1. Ayush Kumar Sinha (USN: 1BM20EC209)
2. Umang Singh (USN: 1BM20EC208)
3. Subhash S (USN: 1BM20EC212)
4. Anurag Soni (USN: 1BM20ET011)

FIC: Dr. Geetishree Mishra, Assistant Professor
TABLE OF CONTENTS

1. Introduction
2. Problem Definition
3. Proposed Solution
4. Literature Survey
5. Methodology
6. Implementation
7. Result Analysis
8. Conclusion
9. Code
10. References
INTRODUCTION

Percentage delivery in stocks refers to the proportion of traded shares that are actually transferred from the seller to the buyer during a stock market transaction. It is an important measure that provides insight into the liquidity and trading activity of a particular stock.

Percentage delivery helps in assessing the transparency and efficiency of the stock market. A higher percentage delivery indicates a healthier market, where a greater proportion of trades result in physical delivery of shares. This implies that trading is taking place based on actual ownership of shares rather than speculative or manipulative activity.

There is a need to efficiently predict the percentage delivery of shares in order to know the condition of the market and anticipate future trends. It also helps in predicting the performance of a share under specific conditions.

We propose using machine learning to make this prediction. We have gathered the sales reports of a few shares as our dataset and make our predictions accordingly.

PROBLEM DEFINITION

The objective of this project is to develop a machine learning model that can
accurately predict the percentage delivery of shares for a given stock based on
historical and real-time market data. The model aims to assist investors, traders,
and market analysts in understanding the liquidity, trading patterns, and potential
market manipulation of stocks.

PROPOSED SOLUTION

To tackle this problem, we propose a machine learning model that takes the previous years' data of different companies as its dataset and applies different regression methods to find the most accurate result. The best method is decided by comparing each model's R2 score. The dataset is split into a training set and a test set to validate the predictions.
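A minimal sketch of this comparison is given below. X_demo and y_demo are synthetic stand-ins for the share dataset prepared later in the Code section, so the snippet only illustrates the selection logic, not our actual results.

# Sketch: fit several regressors and keep the one with the highest R2 score
# on a held-out test set. X_demo and y_demo are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X_demo = rng.rand(300, 5)
y_demo = X_demo @ rng.rand(5) + 0.05 * rng.randn(300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
models = {
    "multiple linear regression": LinearRegression(),
    "random forest regression": RandomForestRegressor(n_estimators=10, random_state=0),
}
scores = {name: r2_score(y_te, m.fit(X_tr, y_tr).predict(X_te)) for name, m in models.items()}
print(scores, "-> best:", max(scores, key=scores.get))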

LITERATURE SURVEY
1. In the paper "A Novel Bayesian Additive Regression Trees Ensemble Model Based on Linear Regression and Nonlinear Regression for Torrential Rain Forecasting", Jiansheng Wu discusses how three different linear regression models, Partial Least Squares Regression, Quantile Regression and M-regression, are used to extract the linear characteristics of a rainfall system. (IEEE, 2010)

2. In "Prediction of Packet Delivery Ratio Using Lasso Regression in Comparison with Linear Regression Algorithm for Multi Input Multi Output Network", V. Venu Gopal Reddy studies how to accurately predict the Packet Delivery Ratio (PDR) from a given dataset using a linear regression model and compares it with the Lasso regression algorithm. (IEEE, 2022)

3. In "Improvement of Random Forest Cascade Regression Algorithm and Its Application in Fatigue Detection", Tao Qunzhu proposes a method based on an improved random forest cascade regression to detect facial feature points. By dividing the facial feature points into regions and performing shape regression on each region separately, the full face shape is finally obtained. (IEEE, 2019)

4. In the paper "Rank Prediction in Graphs with Locally Weighted Polynomial Regression and EM of Polynomial Mixture Models", Michalis Rallis describes a learning framework enabling ranking predictions for graph nodes based solely on individual local historical data. The two learning algorithms capitalize on the multi-feature vectors of nodes in graphs that evolve over time: the first uses locally weighted polynomial regression (LWPR), while the second uses the Expectation Maximization (EM) algorithm to fit a mixture of polynomial regression models. (IEEE, 2011)

METHODOLOGY

Methodology for predicting percentage delivery in shares using machine learning:

1. Data Collection: We gathered a comprehensive dataset containing historical stock market data, including trading volumes, stock prices, delivery percentages, and other relevant features. Data collection is the first and most fundamental step in a machine learning pipeline and is part of the larger data processing phase of the ML lifecycle.

2. Data Preprocessing: Clean the data to handle missing values, outliers, and inconsistencies. Normalize numerical features so that they are on a similar scale. Encode categorical variables if necessary. Split the dataset into training and testing sets. Data preprocessing prepares the raw data and makes it suitable for a machine learning model; it is a crucial step before any model is trained.

3. Feature Engineering: Conduct exploratory data analysis to identify relevant features that may impact the percentage delivery. Consider factors such as trading volumes, price volatility, market indices, news sentiment, and technical indicators. Create new features if appropriate, such as moving averages, volatility measures, or liquidity ratios, to enhance the model's predictive capability (see the sketch after this list).

4. Model Selection: Evaluate various machine learning algorithms suitable for regression tasks. Consider models such as linear regression, decision trees, random forests, etc. Choose the algorithm that best suits the problem, considering factors like interpretability, accuracy, and robustness.

5. Model Evaluation: Assess the trained model's performance using the testing dataset. Calculate evaluation metrics such as mean squared error (MSE), root mean squared error (RMSE), and R-squared (R2) to gauge prediction accuracy and goodness-of-fit. Compare the model's performance against baselines or benchmarks.

6. Interpretation and Communication: Analyze the model's predictions and interpret the impact of different features on percentage delivery. Interpretation of a machine learning model is the process of trying to understand the predictions the model makes.
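The sketch below illustrates steps 2 and 3 for a single stock's history held in a pandas DataFrame. The column names ('Date', 'Close Price', 'Total Traded Quantity', 'Deliverable Qty') are assumptions made for illustration; the actual dataset may use different headers.

# Sketch of preprocessing and feature engineering (steps 2 and 3).
# The column names used below are assumed and may differ in the real dataset.
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values('Date').copy()
    # Simple imputation: fill missing numeric values with the column mean
    num_cols = df.select_dtypes('number').columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    # Derived features: moving average, rolling volatility, liquidity ratio
    df['MA_5'] = df['Close Price'].rolling(window=5).mean()
    df['Volatility_5'] = df['Close Price'].pct_change().rolling(window=5).std()
    df['Liquidity'] = df['Deliverable Qty'] / df['Total Traded Quantity']
    # Drop the initial rows left incomplete by the rolling windows
    return df.dropna()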

IMPLEMENTATION
We have used Jupyter Notebook to write the code for our project in Python 3.
POLYNOMIAL REGRESSION
Polynomial Regression is a regression algorithm that models the relationship between the dependent variable (y) and the independent variable (x) as an nth-degree polynomial. The polynomial regression equation is given below:
y = b0 + b1*x + b2*x^2 + b3*x^3 + ... + bn*x^n
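As a small illustrative sketch on synthetic data (not the project dataset), the single-variable model above can be fitted with PolynomialFeatures and LinearRegression, recovering known coefficients:

# Illustration on synthetic data: fit y = 2 + 3x + 0.5x^2 with a degree-2 polynomial.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2 + 3 * x.ravel() + 0.5 * x.ravel() ** 2

poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(x), y)
# intercept_ is approximately 2 and coef_ approximately [0, 3, 0.5]
# (the first coefficient belongs to the constant bias column added by PolynomialFeatures)
print(model.intercept_, model.coef_)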

MULTIPLE REGRESSION
In Multiple Linear Regression, the target variable (y) is a linear combination of multiple predictor variables x1, x2, x3, ..., xn. Since it is an extension of Simple Linear Regression, the same form applies and the equation becomes:
y = b0 + b1*x1 + b2*x2 + b3*x3 + ... + bn*xn
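As a similar sketch on synthetic data (not the project dataset), LinearRegression recovers the coefficients b0, b1, ..., bn of such a linear combination:

# Illustration on synthetic data: recover b0 = 1 and (b1, b2, b3) = (2, -1, 0.5).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)                           # predictors x1, x2, x3
y_toy = 1.0 + X_toy @ np.array([2.0, -1.0, 0.5])   # exact linear combination

model = LinearRegression().fit(X_toy, y_toy)
print(model.intercept_, model.coef_)               # approximately 1.0 and [2.0, -1.0, 0.5]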

RANDOM FOREST REGRESSION
Random Forest is a popular machine learning algorithm that belongs to the supervised learning family. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which combines multiple base models to solve a complex problem and improve the overall performance of the model.
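As a small illustrative check on synthetic data (not the project dataset), a random forest regressor's prediction is simply the average of the predictions of its individual decision trees:

# Illustration: a RandomForestRegressor prediction equals the mean of its trees' predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_toy = rng.rand(100, 3)
y_toy = X_toy @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.randn(100)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)
tree_preds = [tree.predict(X_toy[:1])[0] for tree in forest.estimators_]
print(forest.predict(X_toy[:1])[0], np.mean(tree_preds))   # the two values agree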

R2 SCORE
The coefficient of determination, also called the R2 score, is used to evaluate the performance of a regression model. It is the proportion of the variation in the dependent (output) attribute that is predictable from the independent (input) variable(s).

R2 = 1 - SSres / SStot
where SSres is the residual sum of squares and SStot is the total sum of squares.

RMSE
Root Mean Square Error (RMSE) is a standard way to measure the error of a model in predicting quantitative data. Formally, it is defined as:
RMSE = sqrt( (1/n) * Σ (yi - ŷi)^2 )
where yi are the actual values and ŷi are the predicted values.
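As a small sketch, both metrics can be computed directly from these definitions and cross-checked against scikit-learn; y_true and y_pred below are placeholder arrays of actual and predicted values.

# Sketch: compute R2 and RMSE from their definitions and compare with scikit-learn.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

y_true = np.array([0.30, 0.45, 0.52, 0.61, 0.48])   # placeholder actual values
y_pred = np.array([0.32, 0.41, 0.55, 0.58, 0.50])   # placeholder predicted values

ss_res = np.sum((y_true - y_pred) ** 2)             # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)    # total sum of squares
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))                # R2, values match
print(np.sqrt(np.mean((y_true - y_pred) ** 2)),
      np.sqrt(mean_squared_error(y_true, y_pred)))                  # RMSE, values match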

RESULT ANALYSIS
We find that the three models perform differently on our dataset.

The model using polynomial regression gives an R2 score of 0.63412, which is not very accurate, and has an RMSE of 0.10201.

The model using random forest regression gives an R2 score of 0.9810681, which is much more accurate, and has an RMSE of 0.0232.

The model using multiple linear regression gives an R2 score of 0.5437, which is not very accurate, and has an RMSE of 0.1139.

For our dataset, random forest regression performs the best.

CONCLUSION

In conclusion, predicting percentage delivery in shares using machine learning (ML) offers valuable insights into market liquidity, trading patterns, and investment dynamics. By leveraging historical and real-time market data, ML models can provide accurate predictions that aid investors, traders, and market analysts in making informed decisions.
Through the methodology outlined, we can gather relevant data, preprocess it,
and engineer informative features. Selecting an appropriate ML algorithm,
training the model, and evaluating its performance allows us to build a
predictive tool capable of forecasting percentage delivery in shares. By
continuously monitoring the model and updating it with new data, we can
ensure its accuracy and adaptability to changing market conditions.

CODE
IMPORTING LIBRARIES
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
IMPORTING DATASET
dataset = pd.read_csv(r'C:\Users\Medha\OneDrive\Desktop\ML AAT\SHARES_DATASET.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
dataset.describe()

FILLING IN MISSING VALUES
from sklearn.impute import SimpleImputer
# Replace missing numeric values with the column mean (column 0 is categorical and is encoded below)
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:])
X[:, 1:] = imputer.transform(X[:, 1:])
ENCODING NON-NUMERIC DATA
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# One-hot encode the categorical first column; pass the remaining columns through unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
SPLITTING TRAINING SET AND TEST SET
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

TRAINING MULTIPLE REGRESSION MODEL
from sklearn.linear_model import LinearRegression
regressor_m=LinearRegression()
regressor_m.fit(X_train,y_train)
TRAINING RANDOM FOREST MODEL
from sklearn.ensemble import RandomForestRegressor
regressor_r = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor_r.fit(X_train, y_train)
TRAINING POLYNOMIAL REGRESSION MODEL
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X_train)
regressor_P = LinearRegression()
regressor_P.fit(X_poly, y_train)
POLYNOMIAL REGRESSION
y_pred_P = regressor_P.predict(poly_reg.transform(X_test))
np.set_printoptions(precision=3)
print(np.concatenate((y_pred_P.reshape(len(y_pred_P),1), y_test.reshape(len(y_test),1)),1))
from sklearn.metrics import r2_score
r2_P=r2_score(y_test,y_pred_P)
print("R2 Score is",r2_P)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred_P)
rmse = math.sqrt(mse)
print("The difference between actual and predicted values", rmse)
RANDOM FOREST REGRESSION
y_pred_r=regressor_r.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred_r.reshape(len(y_pred_r),1),y_test.reshape(len(y_pred_r),1)),1))
from sklearn.metrics import r2_score

r2_RF=r2_score(y_test,y_pred_r)
print("R2 Score is",r2_RF)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred_r)
rmse = math.sqrt(mse)
print("The difference between actual and predicted values", rmse)
MULTIPLE REGRESSION
y_pred_m=regressor_m.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred_m.reshape(len(y_pred_m),1), y_test.reshape(len(y_pred_m),1)), 1))
from sklearn.metrics import r2_score
r2_M=r2_score(y_test,y_pred_m)
print("R2 Score is",r2_M)
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred_m)
rmse = math.sqrt(mse)
print("The difference between actual and predicted values", rmse)
PREDICTION
# Use the model with the highest R2 score to predict % deliverable for a sample input row
sample = [[0, 1, 0, 0, 0, 1541, 1557, 1564.65, 1470, 1474.95, 1476.5, 1499.5, 982492, 577454]]
if r2_M > r2_RF and r2_M > r2_P:
    print("% Deliverable is", regressor_m.predict(sample))
elif r2_RF > r2_M and r2_RF > r2_P:
    print("% Deliverable is", regressor_r.predict(sample))
else:
    # The polynomial model needs the same polynomial feature transform used in training
    print("% Deliverable is", regressor_P.predict(poly_reg.transform(sample)))
PLOTTING
plt.scatter(X[:10, 8], y[:10], color = 'pink',s = 50,edgecolor ="green",marker ="s")
plt.scatter(X[:10, 9], y[:10], color = 'red',s = 50,edgecolor ="red",marker ="^")
#plt.plot(X, regressor_r.predict(X), color = 'blue')

plt.title('chart')
plt.xlabel('X axis')
plt.ylabel('%deliverable')
plt.legend(['volume','deliverable'])
plt.show()

REFERENCES

1. J. Wu, L. Huang and X. Pan, "A Novel Bayesian Additive Regression Trees Ensemble Model Based on Linear Regression and Nonlinear Regression for Torrential Rain Forecasting," 2010 Third International Joint Conference on Computational Science and Optimization, Huangshan, China, 2010, pp. 466-470, doi: 10.1109/CSO.2010.15.

2. V. V. Gopal Reddy and S. Narendran, "Prediction of Packet Delivery Ratio Using Lasso Regression in Comparison with Linear Regression Algorithm for Multi Input Multi Output Network," 2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, 2022, pp. 1837-1841, doi: 10.1109/ICAC3N56670.2022.10074219.

3. T. Qunzhu, Z. Rui, Y. Yufei, Z. Chengyao and L. Zhijun, "Improvement of Random Forest Cascade Regression Algorithm and Its Application in Fatigue Detection," 2019 IEEE 2nd International Conference on Electronics Technology (ICET), Chengdu, China, 2019, pp. 499-503, doi: 10.1109/ELTECH.2019.8839317.

4. M. Rallis and M. Vazirgiannis, "Rank Prediction in Graphs with Locally Weighted Polynomial Regression and EM of Polynomial Mixture Models," 2011 International Conference on Advances in Social Networks Analysis and Mining, Kaohsiung, Taiwan, 2011, pp. 515-519, doi: 10.1109/ASONAM.2011.44.

5. S. Bal and R. R, "Prediction Of Heat Transfer Performance Using Polynomial Regression," 2022 Second International Conference on Artificial Intelligence and Smart Energy (ICAIS), Coimbatore, India, 2022, pp. 1735-1740, doi: 10.1109/ICAIS53314.2022.9742910.

6. Y. Gong and P. Zhang, "Predictive Analysis and Research Of Python Usage Rate Based on Polynomial Regression Model," 2021 3rd International Conference on Artificial Intelligence and Advanced Manufacture (AIAM), Manchester, United Kingdom, 2021, pp. 266-270, doi: 10.1109/AIAM54119.2021.00061.
