
ENHANCING E-BIKE TRIPS THROUGH PRECISE
DURATION AND BATTERY CONSUMPTION
PREDICTION USING MACHINE LEARNING

Rajasekhar Chandana, Venkateswara Rao Bathina, Rajini Gudipudi,
Akshith Roy Kopuri, Sri Jayanth Javvaji
Department of Electrical and Electronics Engineering
V.R.Siddhartha Engineering College, Vijayawada, India
rajj.studies@gmail.com, drbvrao@vrsiddhartha.ac.in, rajinigudipudi1674@gmail.com,
akshithroy007@gmail.com, jsrijayanth@gmail.com

Abstract - In today's transportation systems, it is essential to effectively control trip time and battery usage, especially for electric modes such as e-bikes. The foundation of this research is the enhancement of the dataset with trip details, ambient conditions, and battery-specific information via strategic feature engineering. Using data mining techniques on a large dataset, the goal is to properly estimate trip time and battery usage in a rental e-bike sharing system. Journey time and energy consumption are estimated using machine learning models such as Random Forest, Gradient Boosting Machines, and Linear Regression; model efficacy is measured by performance metrics. Through predictive mobility insights, Power BI's visualizations further improve data interpretation for stakeholders, bringing in a new age of intelligent, responsive, and environmentally friendly urban mobility.

Keywords - Power-BI, E-bikes, Data mining, Data visualization

I. INTRODUCTION

In the dynamic landscape of contemporary urban transportation, there is a growing emphasis on sustainable and eco-friendly mobility solutions. Electric bikes, or e-bikes, have emerged as a promising remedy for challenges such as urban congestion, air pollution, and limited parking space. E-bikes provide a convenient and environmentally friendly transportation option, appealing to both commuters and leisure riders. However, to achieve precise predictions, it is crucial to consider the various factors influencing trip duration and battery usage: variables such as traffic, terrain, weather conditions, and rider behavior play significant roles.

By enriching the dataset with real-time ambient conditions and battery specifics, we can improve the accuracy of predictions. This allows for better fleet management and ensures a more satisfying user experience. Efficient management of e-bike fleets and optimization of the user experience hinge on two critical factors: trip duration prediction and battery consumption management. The ability to accurately predict how long a trip will take and how much battery power will be consumed during that journey can significantly enhance the usability and sustainability of e-bike sharing systems. This project, titled "Enhancing E-Bike Trips Through Precise Duration and Battery Consumption Prediction Using Machine Learning & Power BI," aims to tackle these essential challenges.

II. LITERATURE REVIEW

The research emphasizes the importance of trip duration and battery management in modern transportation, especially within the context of electric modes of transport. By integrating weather parameters and battery-specific data into the dataset, the study aims to enhance predictive accuracy. The machine learning models mentioned are employed to estimate both journey time and energy usage, indicating a quantitative approach to solving the problem. Performance indicators are used to evaluate the models, although the specific metrics are not detailed at this stage. Furthermore, Power BI is employed for data visualization, conveying complex data to stakeholders effectively. Portigliotti, F., Rizzo, A.: 'A network model for an urban bikesharing system', IFAC-PapersOnLine, 2017, 50, (1), pp. 15633–15638 [1].

E-bikes, like other forms of active transportation, can improve individual and community health. This theme of the literature review considers the effects of e-bikes on both physical and mental health and well-being. Bourne, J., Sauchelli, S., Perry, R., Page, A., Leary, S., England, C., Cooper, A. (2018). Health benefits of electrically-assisted cycling: a systematic review. International Journal of Behavioral Nutrition and Physical Activity [2].

979-8-3503-1502-8/23/$31.00 ©2023 IEEE
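Section III below catalogues the dataset's features, beginning with a Date column in DD/MM/YYYY format that must be converted to a date-time type. A minimal pandas sketch of that conversion follows; the tiny in-memory frame and its column names are assumptions for illustration, not the paper's actual data:

```python
import pandas as pd

# Tiny stand-in for the Kaggle bike-sharing CSV; column names are assumed.
raw = pd.DataFrame({
    "Date": ["01/12/2017", "02/12/2017", "30/11/2018"],
    "Rented Bike Count": [254, 204, 584],
    "Hour": [0, 1, 23],
})

# Parse the DD/MM/YYYY strings into proper datetime64 values.
raw["Date"] = pd.to_datetime(raw["Date"], format="%d/%m/%Y")

print(raw["Date"].dtype)  # datetime64[ns]
```

Once parsed, calendar attributes such as year, month, and day of week become available through the `.dt` accessor, which is what later feature engineering (e.g., a weekend flag) relies on.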


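Section IV below screens predictors with the variance inflation factor, VIF = 1 / (1 - R^2), flagging values above a threshold. A minimal NumPy sketch of that computation on a small synthetic matrix (an illustration under assumed data, not the paper's code):

```python
import numpy as np

def vif(X):
    """VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # design matrix with intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Column 2 is nearly a copy of column 0, so both should get large VIFs.
X = np.column_stack([a, b, a + 0.05 * rng.normal(size=200)])
print(vif(X))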
III. FEATURE SELECTION

A. FEATURE DESCRIPTION:
a) Date: The date of the trip, covering 365 days from 01/12/2017 to 30/11/2018 in DD/MM/YYYY format; we need to convert it into date-time format.
b) Rented Bike Count: Number of rented bikes per hour, which is our dependent variable and the quantity we need to predict.
c) Hour: The hour of the day, from 0 to 23, in digital time format.
d) Temperature (°C): Air temperature in Celsius, varying from -17°C to 39.4°C.
e) Humidity (%): Humidity in the air during the booking, ranging from 0 to 98%.
f) Wind speed (m/s): Wind speed during the booking, ranging from 0 to 7.4 m/s.
g) Visibility (10m): Visibility during the ride in metres, ranging from 27 m to 2000 m.
h) Dew point temperature (°C): Temperature at the beginning of the day, ranging from -30.6°C to 27.2°C.
i) Solar Radiation (MJ/m2): Solar radiation during the ride booking, varying from 0 to 3.5 MJ/m2.
j) Rainfall (mm): Amount of rainfall during the booking, ranging from 0 to 35 mm.
k) Snowfall (cm): Amount of snowfall during the booking, ranging from 0 to 8.8 cm.
l) Seasons: Season of the year; there are 4 distinct seasons, i.e., summer, autumn, spring and winter.
m) Holiday: Whether the day falls in a holiday period; there are 2 values, Holiday and No Holiday.
n) Functioning Day: Whether the day is a functioning day; an object data type with values Yes and No.

B. MISSING VALUES:
One way to handle missing values is simply to remove them from the dataset. The isnull() and notnull() functions from the pandas library can be used to detect null values. There are no missing values in this dataset.

C. DATA DUPLICATION:
It is very likely that a dataset contains duplicate rows, and removing them is essential to enhance its quality. There are no duplicate rows in this dataset.

D. EXPLORATORY DATA ANALYSIS:
After loading the dataset, we compared our target variable, bike_count, with the other independent variables. This process helped us figure out various aspects of, and relationships among, the target and the independent variables, and gave us a better idea of how each feature behaves with respect to the target variable.

E. FEATURE TRANSFORMATION:
Transformation of skewed variables may help correct their distribution; these can be logarithmic, square root, or square transformations. In our dataset the dependent variable, bike count, is moderately right-skewed, and to apply linear regression the dependent variable should follow a normal distribution. Therefore, we apply a square root transformation to it.

IV. CLEANING AND MANIPULATING THE DATASET

A. DATA PREPROCESSING:
It is the process of transforming raw data into a useful, understandable format. Real-world or raw data usually has inconsistent formatting and human errors, and can also be incomplete. Data pre-processing resolves such issues and makes datasets more complete and efficient for data analysis.

B. DATA CLEANING:
Cleansing is the process of cleaning datasets by accounting for missing values, removing outliers, correcting inconsistent data points, and smoothing noisy data. In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning models.

C. NOISY DATA:
A large amount of meaningless data is called noise. More precisely, it is random variance in a measured variable, or data having incorrect attribute values. Noise includes duplicates or semi-duplicates of data points, data segments of no value for a specific research process, or unwanted information fields.

DETECTING MULTICOLLINEARITY BY VARIANCE INFLATION FACTOR (VIF):
The variance inflation factor (VIF) is a measure of the amount of multicollinearity in a set of multiple regression variables. Mathematically, the VIF for a regression model variable is equal to the ratio of the overall model variance to the variance of a model that includes only that single independent variable; this ratio is calculated for each independent variable. Equivalently, VIF = 1 / (1 - R^2), where R^2 is the coefficient of determination obtained by regressing that variable on the remaining predictors in a linear regression; a higher R-squared value denotes stronger collinearity. Generally, a VIF above 5 indicates high multicollinearity. Here we have taken a VIF threshold of 8, in order to retain some important features that accord with the models we will be using on this dataset.

V. MODEL BUILDING & BLOCK DIAGRAM

A. FEATURE SCALING:
Scaling data is the process of increasing or decreasing its magnitude according to a fixed ratio; in simpler words, you change the size but not the shape of the data. There are three types of feature scaling:

● Centring: The intercept represents the estimate of the target when all predictors are at their mean value, i.e., when x=0 the prediction equals the intercept. In this method we centre the data and then divide by the standard deviation, so that the standard deviation of the variable becomes one.
● Normalization: Normalization most often refers to the process of rescaling a variable to lie between 0 and 1. Think of this as squishing the variable into a constrained, specific range; it is also called min-max scaling.

VI. METHODOLOGY

The overall flow takes customer data as input, feeds it to the algorithm, and the trained model produces the predicted output.

Fig 1. Block diagram of input data analysis

VII. PREDICTION

The prediction pipeline proceeds as: input data → data preprocessing → split data → train data / test data → algorithms (LR, DT, RF) → selecting the best result → prediction.

Fig 2. Block diagram of training data & test data

VIII. BLOCK DIAGRAM EXPLANATION

A. Input Data:
The dataset is sourced from Kaggle. Visualization of the workplace setting and predictive analytics development is presented. The machine learning phase includes data pre-processing, entropy-based feature engineering, and classification modeling, yielding accurate outcomes. Iterative feature selection and modeling occur for various attribute combinations, with continuous tracking of the ML approach and its performance.

B. Data Preprocessing:
Data pre-processing is done after acquiring a large number of records. The dataset includes 8759 customer records that will be utilized in pre-processing. For the attributes of the provided dataset, the sub-categorization parameter and classification methods are described. The multi-class variable is used to assess the presence or absence of data.

Fig 3. Virtual IP data

• This dataset contains 8760 rows and 14 columns.
• Three categorical features: Seasons, Holiday & Functioning Day.
• One date-time column, 'Date'.

C. Split Data:
Splitting a dataset involves dividing it into two parts: training and testing data. In this study, this split approach is employed for training and assessment.

D. Train Data:
Data used to train an ML model is known as training data. Analysing or processing the training dataset requires some human input; how much participation there is from people can vary depending on the machine learning algorithms used and the kind of problem they are supposed to solve.

E. Test Data:
Unseen data is needed to test the machine learning model after it has been constructed using the available training data. This data, referred to as testing data, can be used to assess the effectiveness and progress of the algorithms' training and to modify or improve them for better outcomes.
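The split-train-test workflow described above, applied to the three regressors compared later (Linear Regression, and Decision Tree and Random Forest with the hyper-parameters reported in Section X), can be sketched with scikit-learn. The synthetic data here is a stand-in for illustration, not the paper's dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the hourly rental data (features X, target y).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

models = {
    "LR": LinearRegression(),
    "DT": DecisionTreeRegressor(max_depth=10, min_samples_leaf=40,
                                min_samples_split=50, random_state=1),
    "RF": RandomForestRegressor(max_depth=10, min_samples_leaf=40,
                                min_samples_split=50, random_state=2),
}
# .score() returns the R^2 on the held-out test data for each model.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
print(scores)
```

The same held-out test set is scored for every model, which is what makes the R^2 and RMSE comparisons in the Results section meaningful.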
IX. EXPLORATORY DATA ANALYSIS

A. VISUALIZING DISTRIBUTIONS:
Data Collecting and Traffic Sensing Unit: this unit is responsible for storing and managing the collected data. It might include memory storage or USB services to hold the data, which can be valuable for various purposes, including analysis and archiving.

Fig 4. Graphical representation of visualizing distributions

B. CHECKING OUTLIERS:

Fig 5. Graphical representation of checking outliers

• We see outliers in some columns like Sunlight, Wind, Rain and Snow, but we do not treat them because they may not be true outliers: snowfall, rainfall, etc. are rare events in some countries.
• We treated the outliers in the target variable by capping with IQR limits.

C. MANIPULATING THE DATASET:
• Added a new feature named weekend that shows whether it is a weekend or not: Saturdays and Sundays are 1, else 0.
• Added another feature named time shift based on time intervals; it has three values: Night, Day and Evening.
• Dropped the date column, because we had already extracted the useful features from it.
• Created dummy features from the season column, named summer, autumn, spring and winter, with one-hot encoding.

D. CHECKING LINEARITY IN DATA:

Fig 6. Graphical representation of checking linearity

X. ALGORITHMS

A. LINEAR REGRESSION

Fig 7. Graphical representation of the Linear Regression algorithm

• We plotted the absolute values of the beta coefficients, which can be seen as parallel to the feature importances of tree-based algorithms.
• Since the performance of the simple linear model is not so good, we experimented with some more complex models.

B. DECISION TREE

Fig 8. Graphical representation of the Decision Tree algorithm
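The outlier capping and dataset-manipulation steps from Section IX above (IQR limits on the target, a weekend flag, one-hot season dummies, dropping the raw date) can be sketched in pandas. The tiny frame and its column names are assumptions for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["01/12/2017", "02/12/2017", "03/12/2017", "04/12/2017"],
        format="%d/%m/%Y"),
    "bike_count": [100, 120, 90, 5000],   # one extreme value to cap
    "season": ["Winter", "Winter", "Winter", "Winter"],
})

# Cap the target at the usual IQR limits (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
q1, q3 = df["bike_count"].quantile([0.25, 0.75])
iqr = q3 - q1
df["bike_count"] = df["bike_count"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Weekend flag: Saturday/Sunday -> 1, else 0 (dayofweek: Monday=0).
df["weekend"] = (df["date"].dt.dayofweek >= 5).astype(int)

# One-hot dummies for the season column, then drop the raw date column.
df = pd.get_dummies(df, columns=["season"])
df = df.drop(columns=["date"])
print(df.columns.tolist())
```

Capping rather than dropping keeps every row available for training while limiting the leverage of extreme target values.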
• DecisionTreeRegressor(max_depth=10, min_samples_leaf=40, min_samples_split=50, random_state=1)
• The decision tree performs well, better than linear regression, with a test R2 score of more than 70%.

C. RANDOM FOREST REGRESSOR

Fig 9. Graphical representation of the Random Forest regressor

• RandomForestRegressor(max_depth=10, min_samples_leaf=40, min_samples_split=50, random_state=2)
• The random forest also performs well on both test and train data, with an R2 score of 77% on the train data and around 75% on the test data.

XI. RESULTS

ALGORITHM    | LINEAR REGRESSION | DECISION TREE | RANDOM FOREST
MSE          | 137241.308        | 91524.533     | 84111.621
RMSE         | 370.460           | 302.530       | 290.020
MAE          | 254.740           | 188.507       | 178.308
TRAIN R2     | 0.58346           | 0.75989       | 0.77380
TEST R2      | 0.59240           | 0.72818       | 0.75019
ADJUSTED R2  | 0.58959           | 0.72551       | 0.74774

Fig 10. Results of exploratory data analysis

XII. CONCLUSION

• Functioning day is the most influencing feature, and temperature is in second place, for the Linear Regressor.
• Temperature is the most important feature for the Decision Tree.
• Functioning day is the most important feature and Winter is the second most important for the Linear Regressor.
• RMSE comparisons:
  1) Linear Regressor RMSE: 370.46
  2) Decision Tree Regressor RMSE: 302.53
  3) Random Forest RMSE: 290.20
• The feature temperature is at the top of the list for all the regressors.
• The Random Forest has the lowest RMSE, so it can be considered the best model for the given problem.

REFERENCES

[1] Portigliotti, F., Rizzo, A.: 'A network model for an urban bike sharing system', IFAC-PapersOnLine, 2017, 50, (1), pp. 15633–15638.
[2] Bourne, J., Sauchelli, S., Perry, R., Page, A., Leary, S., England, C., Cooper, A.: 'Health benefits of electrically-assisted cycling: a systematic review', International Journal of Behavioral Nutrition and Physical Activity, 2018.
[3] Calafiore, G.C., Portigliotti, F., Rizzo, A.: 'A network model for an urban bike-sharing system', IFAC-PapersOnLine, 2017, 50, (1), pp. 15633–15638.
[4] Shaheen, S., Guzman, S., Zhang, H.: 'Bikesharing in Europe, the Americas, and Asia: past, present, and future', Transp. Res. Rec., J. Transp. Res. Board, 2010, 2143, pp. 159–167.
[5] Giraud-Carrier, C., Vilalta, R., Brazdil, P.: 'Introduction to the special issue on metalearning', Mach. Learn., 2004, 54, (3), pp. 187–193.
[6] Turner, S., Eisele, W., Benz, R., et al.: 'Travel Time Data Collection Handbook', Federal Highway Administration, Report FHWA-PL-98-035, 1998.
[7] Li, Y., Gunopulos, D., Guibas, L.: 'Urban travel time prediction using a small number of GPS floating cars'. Proc. of the 25th ACM SIGSPATIAL Int. Conf. on Advances in Geographic Information Systems, USA, 2017.
[8] Mridha, S., Ganguly, N., et al.: 'Link travel time prediction from large scale endpoint data'. Proc. of the 25th ACM SIGSPATIAL Int. Conf. on Advances in Geographic Information Systems, USA, 2017.
[9] Miura, H.: 'A study of travel time prediction using universal kriging', Top, 2010, 18, (1), pp. 257–270.
[10] Kwon, J., Coifman, B., Bickel, P.: 'Day-to-day travel-time trends and travel time prediction from loop-detector data', Transp. Res. Rec.: J. Transp. Res. Board, 2000.
[11] Zhang, X., Rice, J.A.: 'Short-term travel time prediction', Transp. Res. C: Emerging Technologies, 2003, 11, (3), pp. 187–210.
[12] Brazdil, P., Soares, C., Costa, J.D.: 'Ranking learning algorithms: using IBL and metalearning on accuracy and time results', Mach. Learn., 2003, 50, pp. 251–277.
[13] Zarmehri, M.N., Soares, C.: 'Using metalearning for prediction of taxi trip duration using different granularity levels'. Int. Symp. on Intelligent Data Analysis, Cham, 2015, pp. 205–216.
