
ENHANCING E-BIKE TRIPS THROUGH PRECISE
DURATION AND BATTERY CONSUMPTION
PREDICTION USING MACHINE LEARNING

Rajasekhar Chandana
Department of Electrical and Electronics Engineering
V.R.Siddhartha Engineering College
Vijayawada, India
rajj.studies@gmail.com

Venkateswara Rao Bathina
Department of Electrical and Electronics Engineering
V.R.Siddhartha Engineering College
Vijayawada, India
drbvrao@vrsiddhartha.ac.in

Rajini Gudipudi
Department of Electrical and Electronics Engineering
V.R.Siddhartha Engineering College
Vijayawada, India
rajinigudipudi1674@gmail.com

Akshith Roy Kopuri
Department of Electrical and Electronics Engineering
V.R.Siddhartha Engineering College
Vijayawada, India
akshithroy007@gmail.com

Sri Jayanth Javvaji
Department of Electrical and Electronics Engineering
V.R.Siddhartha Engineering College
Vijayawada, India
jsrijayanth@gmail.com

Abstract - In today's transportation systems, it is essential to effectively control trip time and battery usage, especially for electric modes such as e-bikes. The foundation of this research is the enhancement of the dataset with trip details, ambient conditions, and battery-specific information via strategic feature engineering. Using data mining techniques on a large dataset, the goal is to properly estimate trip time and battery usage in a rental e-bike sharing system. Journey time and energy consumption are estimated using machine learning models such as Random Forest, Gradient Boosting Machines, and Linear Regression; model efficacy is measured by performance metrics. Through predictive mobility insights, Power BI's visualizations further improve data interpretation for stakeholders, bringing in a new age of intelligent, responsive, and environmentally friendly urban mobility.

Keywords - Power BI, E-bikes, Data mining, Data visualization

I. INTRODUCTION

In the dynamic landscape of contemporary urban transportation, there is a growing emphasis on sustainable and eco-friendly mobility solutions. Electric bikes, or e-bikes, have emerged as a promising remedy for challenges such as urban congestion, air pollution, and limited parking space. E-bikes provide a convenient and environmentally friendly transportation option, appealing to both commuters and leisure riders. However, to achieve precise predictions, it is crucial to consider the various factors influencing trip duration and battery usage: variables such as traffic, terrain, weather conditions, and rider behavior play significant roles.

By enriching the dataset with real-time ambient conditions and battery specifics, we can improve the accuracy of predictions. This allows for better fleet management and ensures a more satisfying user experience.

Efficient management of e-bike fleets and optimization of the user experience hinge on two critical factors: trip duration prediction and battery consumption management. The ability to accurately predict how long a trip will take and how much battery power will be consumed during that journey can significantly enhance the usability and sustainability of e-bike sharing systems. This project, titled "Enhancing E-Bike Trips Through Precise Duration and Battery Consumption Prediction Through Machine Learning & Power BI," aims to tackle these essential challenges.

II. LITERATURE REVIEW

The research emphasizes the importance of trip duration and battery management in modern transportation, especially within the context of electric modes of transport. By integrating weather parameters and battery-specific data into the dataset, the study aims to enhance predictive accuracy. The machine learning models mentioned are employed to estimate both journey time and energy usage, indicating a quantitative approach to solving the problem. The abstract mentions the use of performance indicators to evaluate the models, but specific metrics are not detailed. Furthermore, the abstract highlights the use of Power BI for data visualization, emphasizing its role in conveying complex data to stakeholders effectively.

Portigliotti, F., Rizzo, A.: 'A network model for an urban bike-sharing system', IFAC-PapersOnLine, 2017, 50, (1), pp. 15633–15638 [1].

E-bikes, like other forms of active transportation, can improve individual and community health. This theme of the literature review considers the effects of e-bikes on both physical and mental health and well-being.

Bourne, J., Sauchelli, S., Perry, R., Page, A., Leary, S., England, C., Cooper, A. (2018). Health benefits of electrically-assisted cycling: a systematic review.

979-8-3503-1502-8/23/$31.00 ©2023 IEEE


International Journal of Behavioral Nutrition and Physical Activity [2].
III. FEATURE SELECTION

A. FEATURE DESCRIPTION:
a) Date: The date of the day, spanning 365 days from 01/12/2017 to 30/11/2018 in DD/MM/YYYY format; we need to convert it into date-time format.
b) Rented Bike Count: Number of rented bikes per hour, which is our dependent variable and what we need to predict.
c) Hour: The hour of the day, from 0 to 23, in digital time format.
d) Temperature (°C): Temperature of the weather in Celsius; it varies from -17°C to 39.4°C.
e) Humidity (%): Humidity in the air during the booking; ranges from 0 to 98%.
f) Wind speed (m/s): Speed of the wind during the booking; ranges from 0 to 7.4 m/s.
g) Visibility (10m): Visibility during driving, in metres; ranges from 27 m to 2000 m.
h) Dew point temperature (°C): Temperature at the beginning of the day; ranges from -30.6°C to 27.2°C.
i) Solar Radiation (MJ/m²): Solar radiation during the ride booking; varies from 0 to 3.5 MJ/m².
j) Rainfall (mm): Amount of rainfall during the booking; ranges from 0 to 35 mm.
k) Snowfall (cm): Amount of snowfall during the booking; ranges from 0 to 8.8 cm.
l) Seasons: Season of the year; there are 4 distinct seasons, i.e., summer, autumn, spring, and winter.
m) Holiday: Whether the day falls in a holiday period or not; there are 2 values, holiday and no holiday.
n) Functioning Day: Whether the day is a functioning day or not; it contains object data with values yes and no.

B. MISSING VALUES:
One way to handle missing values is simply to remove them from the dataset. We can use the isnull() and notnull() functions from the pandas library to determine null values. There are no missing values in this dataset.

C. DATA DUPLICATION:
It is very likely that a dataset contains duplicate rows, and removing them is essential to enhance the quality of the dataset. There are no duplicate rows in this dataset.

D. EXPLORATORY DATA ANALYSIS:
After loading the dataset, we performed this step by comparing our target variable, bike_count, with the other independent variables. This process helped us figure out various aspects of, and relationships between, the target and the independent variables, and gave us a better idea of how each feature behaves relative to the target variable.

E. FEATURE TRANSFORMATION:
Transformation of skewed variables may also help correct their distribution. These could be logarithmic, square root, or square transformations. In our dataset the dependent variable, bike count, is moderately right-skewed; to apply linear regression, the dependent variable has to follow a normal distribution, so we apply a square root transformation to it.

IV. CLEANING AND MANIPULATING THE DATASET

A. DATA PREPROCESSING:
It is the process of transforming raw data into a useful, understandable format. Real-world or raw data usually has inconsistent formatting and human errors, and can also be incomplete. Data pre-processing resolves such issues and makes datasets more complete and efficient for data analysis.

B. DATA CLEANING:
Cleansing is the process of cleaning datasets by accounting for missing values, removing outliers, correcting inconsistent data points, and smoothing noisy data. In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning models.

C. NOISY DATA:
A large amount of meaningless data is called noise. More precisely, it is random variance in a measured variable, or data having incorrect attribute values. Noise includes duplicates or semi-duplicates of data points, data segments of no value for a specific research process, or unwanted information fields.

D. DETECTING MULTICOLLINEARITY BY VARIANCE INFLATION FACTOR (VIF):
Variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a regression model variable equals the ratio of the overall model variance to the variance of a model that includes only that single independent variable; this ratio is calculated for each independent variable:

VIF_i = 1 / (1 - R_i^2)

where R_i^2 is the coefficient of determination obtained by regressing the i-th predictor on the remaining predictors. A higher R-squared value denotes stronger collinearity. Generally, a VIF above 5 indicates high multicollinearity. Here we have taken a VIF threshold of 8, so as to retain some important features that accord with the model we will be using on this dataset.

V. MODEL BUILDING & BLOCK DIAGRAM

A. FEATURE SCALING:
Scaling data is the process of increasing or decreasing its magnitude according to a fixed ratio; in simpler words, you change the size but not the shape of the data. There are three types of feature scaling:
● Centring: The intercept represents the estimate of the target when all predictors are at their mean value; that is, when x = 0 the prediction equals the intercept. In this method we centre the data and then divide by the standard deviation, enforcing that the standard deviation of the variable is one.
● Normalization: Normalization most often refers to the process of "normalizing" a variable to be between 0 and 1. Think of this as squishing the variable so that it is constrained to a specific range. This is also called min-max scaling.

VI. METHODOLOGY

Fig 1. Block diagram of input data analysis (input customer data → algorithm → trained model → predicted output)

VII. PREDICTION

Fig 2. Block diagram of training data & test data (input data → train/test split → algorithms LR, DT and RF → selecting the best result → prediction)

VIII. BLOCK DIAGRAM EXPLANATION

A. Input Data:
The dataset is sourced from Kaggle. Visualization of the workplace setting and predictive analytics development is presented. The machine learning phase includes data pre-processing, entropy-based feature engineering, and classification modeling, yielding accurate outcomes. Iterative feature selection and modeling occur for various attribute combinations, with continuous tracking of the ML approach and its performance.

B. Data Preprocessing:
Data pre-processing is done after acquiring a large number of records. The dataset includes 8759 customer records that will be utilized in pre-processing. For the attributes of the provided dataset, the sub-categorization parameters and classification methods are described. The multi-class variable is used to assess the presence or absence of data.

Fig 3. Virtual IP data

● This dataset contains 8760 rows and 14 columns.
● Three categorical features: Seasons, Holiday & Functioning Day.
● One date-time column, 'Date'.

C. Split Data:
Splitting a dataset involves dividing it into two parts: training and testing data. In this study, the split approach is employed for training and assessment.

D. Train Data:
Data drawn from user information and used to train an ML concept is known as training data. Analysing or processing a training dataset requires some human input; how much human participation is needed can vary depending on the machine learning algorithms used and the kind of issue they are supposed to solve.

E. Test Data:
Unseen data is needed to test the machine learning model after it has been constructed using the available training data. This data, referred to as testing data, can be used to assess the effectiveness and progress of the algorithms' training and to modify or improve them for better outcomes.
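The split step described above can be sketched with scikit-learn. This is a minimal illustration on synthetic stand-in data; the 80/20 ratio and random_state are assumptions, since the paper does not state the exact split used.

```python
# Sketch of the train/test split step; X and y are synthetic stand-ins for
# the prepared features and target (the real dataset has 8760 hourly records).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(8760, 10))        # 8760 hourly records, 10 features
y = rng.uniform(0, 3500, size=8760)    # stand-in for hourly rented bike count

# An 80/20 split is a common choice (assumed here, not stated in the paper).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (7008, 10) (1752, 10)
```

The held-out 20% is never shown to the model during fitting, which is what makes the test R² scores reported later a fair estimate of generalization.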
IX. EXPLORATORY DATA ANALYSIS

A. VISUALIZING DISTRIBUTIONS:
Data collecting and traffic sensing unit: this unit is responsible for storing and managing the collected data. It might include memory storage or USB services to hold the data, which can be valuable for various purposes, including analysis and archiving.

Fig 4. Graphical representation of visualizing distributions

B. CHECKING OUTLIERS:

Fig 5. Graphical representation of checking outliers

● We see outliers in some columns such as Sunlight, Wind, Rain and Snow, but we do not treat them, because they may not be true outliers: snowfall, rainfall, etc. are themselves rare events in some countries.
● We treated the outliers in the target variable by capping with IQR limits.

C. MANIPULATING THE DATASET:
● Added a new feature named weekend that shows whether the day is a weekend or not: Saturdays and Sundays are 1, else 0.
● Added a further feature named time shift based on time intervals; it has three values: Night, Day and Evening.
● Dropped the Date column because we had already extracted useful features from it.
● Created dummy features named summer, autumn, spring and winter from the season column with one-hot encoding.

D. CHECKING LINEARITY IN DATA:

Fig 6. Graphical representation of checking linearity

X. ALGORITHMS

A. LINEAR REGRESSION

Fig 7. Graphical representation of the Linear Regression algorithm

● We plotted the absolute values of the beta coefficients, which can be viewed as analogous to the feature importances of tree-based algorithms.
● Since the performance of the simple linear model is not so good, we experimented with more complex models.

B. DECISION TREE

Fig 8. Graphical representation of the Decision Tree algorithm
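The dataset-manipulation steps listed above (weekend flag, time-shift buckets, dropping Date, one-hot seasons) can be sketched in pandas. The frame below is a toy stand-in for the real dataset, and the Night/Day/Evening cut points are assumed, since the paper does not give the exact interval boundaries.

```python
# Sketch of the feature-engineering steps on a toy stand-in frame.
import pandas as pd

df = pd.DataFrame({
    "Date": ["01/12/2017", "02/12/2017", "03/12/2017"],  # DD/MM/YYYY
    "Hour": [9, 14, 22],
    "Seasons": ["Winter", "Winter", "Winter"],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Weekend: Saturday (dayofweek 5) and Sunday (6) -> 1, else 0.
df["weekend"] = (df["Date"].dt.dayofweek >= 5).astype(int)

# Time shift: bucket the hour into Night / Day / Evening (cut points assumed).
df["time_shift"] = pd.cut(
    df["Hour"], bins=[-1, 5, 17, 23], labels=["Night", "Day", "Evening"]
)

# Drop Date now that useful features are extracted; one-hot encode Seasons.
df = df.drop(columns=["Date"])
df = pd.get_dummies(df, columns=["Seasons"])
print(df.columns.tolist())
```

With the full dataset, get_dummies on the season column yields the four indicator columns (summer, autumn, spring, winter) described above; the toy frame contains only winter rows, so only one indicator appears.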

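A minimal sketch of the model comparison used in this study: fit Linear Regression, a Decision Tree and a Random Forest, then score them on held-out data. The tree hyperparameters mirror those reported in this paper, while the synthetic data and the split ratio are illustrative assumptions.

```python
# Sketch: fit and score the three regressors on held-out data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(7)
X = rng.normal(size=(2000, 8))  # synthetic stand-in for the prepared features
y = X @ rng.normal(size=8) + np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=2000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(
        max_depth=10, min_samples_leaf=40, min_samples_split=50, random_state=1
    ),
    "Random Forest": RandomForestRegressor(
        max_depth=10, min_samples_leaf=40, min_samples_split=50, random_state=2
    ),
}
scores = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    scores[name] = (rmse, r2_score(y_te, pred))
    print(f"{name}: RMSE={rmse:.3f}, R2={scores[name][1]:.3f}")
```

Reporting RMSE alongside R² matches the evaluation in the results table: RMSE is in the units of the target (bikes per hour), while R² gives a scale-free measure of explained variance.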
● DecisionTreeRegressor(max_depth=10, min_samples_leaf=40, min_samples_split=50, random_state=1)
● The decision tree performs considerably better than linear regression, with a test R² score above 70%.

C. RANDOM FOREST REGRESSOR

Fig 9. Graphical representation of the Random Forest regressor

● RandomForestRegressor(max_depth=10, min_samples_leaf=40, min_samples_split=50, random_state=2)
● The random forest also performs well on both train and test data, with an R² score of 77% on train data and around 75% on test data.

XI. RESULTS

ALGORITHM      LINEAR REGRESSION   DECISION TREE   RANDOM FOREST
MSE            137241.308          91524.533       84111.621
RMSE           370.460             302.530         290.020
MAE            254.740             188.507         178.308
TRAIN R2       0.58346             0.75989         0.77380
TEST R2        0.59240             0.72818         0.75019
ADJUSTED R2    0.58959             0.72551         0.74774

Fig 10. Performance metrics of the regression models

XII. CONCLUSION

● Functioning day is the most influencing feature, and temperature is in second place, for the Linear Regressor.
● Temperature is the most important feature for the Decision Tree.
● Functioning day is the most important feature, and Winter the second most, for the Linear Regressor.
● RMSE comparisons:
1) Linear Regressor RMSE: 370.46
2) Decision Tree Regressor RMSE: 302.53
3) Random Forest RMSE: 290.20
● The feature temperature is among the top features for all the regressors.
● With the lowest RMSE and the highest test R², the Random Forest can be considered the best model for the given problem.

REFERENCES

[1] Portigliotti, F., Rizzo, A.: 'A network model for an urban bike-sharing system', IFAC-PapersOnLine, 2017, 50, (1), pp. 15633–15638.
[2] Bourne, J., Sauchelli, S., Perry, R., Page, A., Leary, S., England, C., Cooper, A.: 'Health benefits of electrically-assisted cycling: a systematic review', International Journal of Behavioral Nutrition and Physical Activity, 2018.
[3] Calafiore, G.C., Portigliotti, F., Rizzo, A.: 'A network model for an urban bike-sharing system', IFAC-PapersOnLine, 2017, 50, (1), pp. 15633–15638; '…system for measuring traffic parameters', Proc. IEEE Conf. Computer Vision and Pattern Recognition, Puerto Rico, June 1997, pp. 496–501.
[4] Shaheen, S., Guzman, S., Zhang, H.: 'Bikesharing in Europe, the Americas, and Asia: past, present, and future', Transp. Res. Rec., J. Transp. Res. Board, 2010, 2143, pp. 159–167.
[5] Giraud-Carrier, C., Vilalta, R., Brazdil, P.: 'Introduction to the special issue on metalearning', Mach. Learn., 2004, 54, (3), pp. 187–193.
[6] Turner, S., Eisele, W., Benz, R., et al.: 'Travel Time Data Collection Handbook', Federal Highway Administration, Report FHWA-PL-98-035, 1998.
[7] Li, Y., Gunopulos, D., Lu, C., Guibas, L.: 'Urban travel time prediction using a small number of GPS floating cars'. Proc. of the 25th ACM SIGSPATIAL Int. Conf. on Advances in Geographic Information Systems, USA, 2017.
[8] Mridha, S., Ganguly, N., Bhattacharya, S.: 'Link travel time prediction from large scale endpoint data'. Proc. of the 25th ACM SIGSPATIAL Int. Conf. on Advances in Geographic Information Systems, USA, 2017.
[9] Miura, H.: 'A study of travel time prediction using universal kriging', Top, 2010, 18, (1), pp. 257–270.
[10] Kwon, J., Coifman, B., Bickel, P.: 'Day-to-day travel-time trends and travel-time prediction from loop-detector data', Transp. Res. Rec.: J. Transp. Res. Board, 2000.
[11] Zhang, X., Rice, J.A.: 'Short-term travel time prediction', Transp. Res. Part C: Emerging Technologies, 2003, 11, (3), pp. 187–210.
[12] Brazdil, P., Soares, C., Costa, J.D.: 'Ranking learning algorithms: using IBL and metalearning on accuracy and time results', Mach. Learn., 2003, 50, pp. 251–277.
[13] Zarmehri, M.N., Soares, C.: 'Using metalearning for prediction of taxi trip duration using different granularity levels'. Int. Symp. on Intelligent Data Analysis, Cham, 2015, pp. 205–216.
