1. Introduction
This study analyzes a Modified Bike-Sharing data set. Unlike the original
data set, this “Modified” version includes nulls, zeros, and outliers, which
opens the door to a detailed Exploratory Data Analysis (EDA). Many of the
public studies on Bike-Sharing include basic EDA and then go straight into
Modeling. The general goal of this analysis is to perform an extensive EDA
that takes into account the Physics of the phenomena and provides
insights for modeling Bike-Sharing rentals.
After reading this post you will know about:
How exploiting the visual power of the human brain helps you learn about
patterns or trends in congested plots.
The tricks to make those rows that have cells with null and non-null
values visible in a plot.
The identification of suspicious patterns and proposed steps to address
them.
The sequential steps that unveil the story behind the data during the
EDA.
The importance of considering the Physics that explains the
phenomena before performing imputations.
The preliminary assessment of the variables that influence the number
of bicycle rentals.
The Python code to present data on plots with visual contrast, which
helps to unveil patterns and understand the “puzzle” behind the data.
Note that Data Exploration is performed using Python (version 3.7.4) with
Data Science libraries such as NumPy, Pandas, scikit-learn, Seaborn,
category_encoders, SciPy, and datetime, among others.
Table of contents
1. Introduction
2. Attribute Information
3. Nulls, Zeros, and Outliers
4. Correlation Structure
5. Univariate Behavior of the Response Variable
6. Behavior of Response Variable With Time Variables
7. Conclusions
8. References
2. Attribute Information
The data set contains six variables related to time, four continuous
variables, and the response variable, the count of bicycles ‘cnt’. The ASCII
file and the Python code reside in my GitHub account.
Once the rows containing null values have been tagged, there should be
exactly 23 rows showing ‘temp_nan’ or ‘hum_nan’ tags. Observe the
following Pandas data-frame output.
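The output itself is not reproduced here. As a rough illustration of the tagging step, a minimal sketch might look like the following. The ‘tag’ column name and the tiny table are assumptions made for this example; they are not taken from the original code, and the values are made up.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the bike-sharing table (values are made up).
df = pd.DataFrame({
    "temp": [0.24, np.nan, 0.30, 0.32, np.nan],
    "hum":  [0.81, 0.80, np.nan, 0.75, 0.70],
    "cnt":  [16, 40, 32, 13, 1],
})

# Hypothetical 'tag' column: 'none' by default, then one tag per suspicious pattern.
df["tag"] = "none"
df.loc[df["temp"].isna(), "tag"] = "temp_nan"
df.loc[df["hum"].isna(), "tag"] = "hum_nan"

# Count the tagged rows; on the full data set this is the number to compare with 23.
n_tagged = df["tag"].isin(["temp_nan", "hum_nan"]).sum()
print(df["tag"].value_counts())
```

In this sketch a row with nulls in both columns would end up with the ‘hum_nan’ tag only, because the second assignment overwrites the first. A separate tag column per variable would avoid that.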
3.2. Tagging Zeros
(1) Unveiling rows with null and non-null values. The Null-zero
replacement trick allows the presentation of rows that have null and
non-null values in the scatter matrix plot. Observe the orange dots that
correspond to the ‘temp_nan’ tag; each orange dot represents cell values
from a row in the data table. Keep in mind that these orange dots represent
either null or non-null values depending on the plot. On one hand, orange
dots represent null values in those plots where the temperature ‘temp’ is
one of the variables; these plots are surrounded by an orange rectangle in
the scatter matrix plot. Note that zeros represent null values in these plots;
this is why the orange dots follow a linear pattern at ‘temp=0’. On the other
hand, orange dots represent non-null values on those plots where
temperature ‘temp’ is not one of the variables.
Both orange and blue dots follow the same interpretation. Notice that
these dots have tags with the same suffix ‘_nan’. The next step of this
workflow will address these null values before going into modeling.
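The Null-zero replacement trick described above can be sketched as follows. This is only an illustration under stated assumptions: the ‘tag’ column and the `plot_df` copy are hypothetical names, and the data is made up. The plotting call is left as a comment so the sketch runs without a display.

```python
import numpy as np
import pandas as pd

# Made-up sample with two missing temperatures.
df = pd.DataFrame({
    "temp": [0.24, np.nan, 0.30, np.nan],
    "hum":  [0.81, 0.80, 0.75, 0.70],
    "cnt":  [16, 40, 32, 13],
})

# Tag first, while the nulls are still identifiable.
df["tag"] = np.where(df["temp"].isna(), "temp_nan", "none")

# Replace nulls with zeros in a plotting copy only; the original frame keeps its NaNs.
plot_df = df.copy()
plot_df["temp"] = plot_df["temp"].fillna(0)

# The copy has no missing cells, so every row can be drawn, and the tag drives the color.
# For example: seaborn.pairplot(plot_df, hue="tag")
```

Working on a copy keeps the zeros from leaking into later steps, where they could be mistaken for real measurements.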
(3) Outliers. The scatter plots that have temperature ‘temp’ and ‘feel-
like’ temperature ‘atemp’ as variables show a sequence of gray dots
forming a linear pattern. These dots should be labeled as outliers because:
1) they stand out from the majority of the dots forming a 45-degree trend,
2) they show a constant value for ‘atemp’ which is suspicious, and 3) they
all occur during one day 08–17–2012, which is also suspicious. Once
again, this is a data set that was modified, probably by hand, and these
outliers need to be addressed before modeling. The following is a snippet
of the Python code to tag these outliers as ‘atemp_outlr’:
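The original snippet is not reproduced here. A minimal sketch of the idea, assuming a hypothetical ‘date’ column, a hypothetical ‘tag’ column, and made-up values, might look like this. It applies two of the criteria listed above: the rows fall on the suspicious day, and ‘atemp’ is constant within that day.

```python
import pandas as pd

# Made-up sample: two rows on the suspicious day share the same 'atemp' value.
df = pd.DataFrame({
    "date": pd.to_datetime(["2012-08-16", "2012-08-17", "2012-08-17", "2012-08-18"]),
    "temp":  [0.70, 0.68, 0.74, 0.72],
    "atemp": [0.66, 0.24, 0.24, 0.67],
    "tag":   ["none"] * 4,
})

# Rows from the suspicious day mentioned in the text.
on_day = df["date"] == pd.Timestamp("2012-08-17")

# Within that day, check that 'atemp' takes a single repeated value.
constant_atemp = df.loc[on_day, "atemp"].nunique() == 1

# Tag the rows only when the constant-value criterion holds.
if constant_atemp:
    df.loc[on_day, "tag"] = "atemp_outlr"
```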
The scatter plot on the left presents all the original data. The colors, which
represent tags, reveal three suspicious patterns (observe the suffix of each
tag): Outliers (light blue dots), zeros (magenta dots), and nulls (orange
dots). Remember, orange dots represent null values replaced by zeros.
Because the Null-zero trick makes the nulls visible as zeros on the plot, we
can visually assess how many there are and whether corrections to
these null values will make sense.
This is where the tags come in handy; they allow filtering out the suspicious
trends before data is fed into the Linear Regression model from scikit-learn.
Once the model is trained, it is used to correct the suspicious
patterns in temperature.
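A minimal sketch of this filter-train-correct sequence is shown below. It assumes ‘atemp’ as the only predictor of ‘temp’, which is a choice made for this example; the text does not state which features the original model uses. The ‘tag’ column and the values are also made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up sample; the last 'temp' is a null that was replaced by zero and tagged.
df = pd.DataFrame({
    "atemp": [0.20, 0.30, 0.40, 0.50, 0.60, 0.35],
    "temp":  [0.22, 0.32, 0.42, 0.52, 0.62, 0.00],
    "tag":   ["none"] * 5 + ["temp_nan"],
})

# Filter: train only on rows without a suspicious tag.
clean = df["tag"] == "none"
model = LinearRegression().fit(df.loc[clean, ["atemp"]], df.loc[clean, "temp"])

# Correct: overwrite the tagged rows with the model's predictions.
df.loc[~clean, "temp"] = model.predict(df.loc[~clean, ["atemp"]])
```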
Note that data presented by colors and chunks (by days) allows the brain
to recognize the opposite patterns between temperature and humidity;
it would not make sense to replace the zeros in humidity with an average
value.
The next figure confirms this: black dots have a ‘hum_zero’ tag. The
suffix ‘_zero’ indicates that those are in fact zeros in the original data set.
2) Random Forest does a good job predicting values within the range of
values where it was trained.
The data set that is used to train the Random Forest Regressor has non-
null values for humidity. Once the Random Forest Regressor is trained, it
is used to predict null values for humidity. The following is a snippet of the
Python code.
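The original snippet is not reproduced here. A minimal sketch of the approach, assuming hour of day (‘hr’, a hypothetical column name) and ‘temp’ as predictors and using synthetic data, might look like this. The feature choice and the hyperparameters are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic ten-day sample: humidity moves opposite to temperature, plus small noise.
rng = np.random.default_rng(0)
hours = np.tile(np.arange(24), 10)
temp = 0.5 + 0.2 * np.sin((hours - 9) / 24 * 2 * np.pi)
hum = 1.0 - temp + rng.normal(0, 0.01, hours.size)
df = pd.DataFrame({"hr": hours, "temp": temp, "hum": hum})

# Simulate one full day of missing humidity (rows 24 to 47, inclusive).
df.loc[24:47, "hum"] = np.nan

# Train on rows with known humidity, then predict the missing ones.
features = ["hr", "temp"]
known = df["hum"].notna()
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(df.loc[known, features], df.loc[known, "hum"])
df.loc[~known, "hum"] = rf.predict(df.loc[~known, features])
```

Rows tagged ‘hum_zero’ could be handled the same way by setting them to NaN before the `known` mask is computed.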
The next graph shows the results of the Random Forest Regressor.
Observe the plot for Thursday 2011/3/10; before the application of
Random Forest, this plot had a horizontal trend at ‘hum=0’. After the
application of the Random Forest Regressor, this plot shows a humidity
trend that: a) matches daily humidity trends reported in weather
reports and b) starts right where Wednesday’s trend ends and finishes
where Friday’s trend begins.
Therefore, in this case, Random Forest Regressor not only does a good
job addressing nulls and zeros, but also honors the underlying Physics of
humidity, temperature, and dew point.
3.4.3. Addressing zeros in wind speed
The next figure shows the behavior of wind speed with different features.
Observe the green dots; they represent zero values in wind speed. These dots
form not only a detached pattern in the scatter plots but also a bimodal
behavior in the distribution plot.
The Random Forest Regressor is a good starting point to address this
suspicious pattern in wind speed.
The next figure presents the results of the Random Forest Regressor.
The distribution plot shows a unimodal behavior, and the green dots no
longer form a horizontal trend at ‘windspeed=0’.
4. Correlation Structure
One way to identify collinearity among the features is by using the
correlation matrix, which provides the Pearson correlation coefficient.
Intuitively, the bicycle counts depend on variables that are related not only
to weather conditions but also to time. Therefore, the correlation
matrix includes both weather and time variables.
The collinearity between temperature ‘temp’ and the feel-like
temperature ‘atemp’ is evident. As a result, the feel-like temperature,
which is derived from ‘temp’, will be excluded during the modeling
process. In general, the bicycle counts show a positive correlation with
temperature and a negative correlation with humidity. The bottom row of
the scatter plot matrix shows these tendencies. Also, notice that some of
the time variables have a high correlation with the bicycle counts; this is
especially true for hours. The next sections will explore, graphically, the
relationship of bicycle counts with time variables.
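As an illustration of the collinearity check, the sketch below computes a Pearson correlation matrix with Pandas and lists feature pairs above a threshold. The data is synthetic, and the 0.9 cutoff is an arbitrary assumption for this example, not a value from the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic sample: 'atemp' is built from 'temp', so the two are nearly collinear.
rng = np.random.default_rng(1)
n = 200
temp = rng.uniform(0.1, 0.9, n)
df = pd.DataFrame({
    "temp": temp,
    "atemp": 0.9 * temp + rng.normal(0, 0.01, n),
    "hum": rng.uniform(0.2, 1.0, n),
})
df["cnt"] = 150 + 300 * df["temp"] - 100 * df["hum"] + rng.normal(0, 10, n)

# Pearson correlation matrix for all columns, response included.
corr = df.corr(method="pearson")

# Feature pairs (response excluded) whose absolute correlation exceeds the cutoff.
features = corr.drop(index="cnt", columns="cnt")
pairs = [
    (a, b)
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > 0.9
]
print(pairs)
```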
5. Univariate Behavior of the Response Variable
The distribution of the number of bicycles ‘cnt’ is right-skewed, and a
logarithmic transformation is required to correct the behavior.
Even though the logarithmic transformation does not correct the
skewness completely, the results are close to a normal distribution.
This transformation will have a positive impact on some of the Machine
Learning algorithms.
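The effect of the transformation can be checked numerically by comparing skewness before and after. The sketch below uses synthetic right-skewed counts and `numpy.log1p`, which is a choice made here because it tolerates zero counts; the original analysis may use a different logarithm.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for 'cnt'.
rng = np.random.default_rng(2)
cnt = pd.Series(rng.lognormal(mean=4.5, sigma=1.0, size=2000)).round()

# Sample skewness before and after the transformation (0 means symmetric).
skew_raw = cnt.skew()
skew_log = np.log1p(cnt).skew()
print(skew_raw, skew_log)
```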
6. Behavior of Response Variable With Time Variables
Observe the medians (50% of the data lies above each one) of the box plots
in the next figure; they show four general characteristics of the number of
bicycle rentals:
1) rentals are increasing over the years,
2) trends initially increase and then decrease over time,
3) trends are cyclical through time, and
4) outliers are present.
In more detail, the following statements can be derived:
The first two box plots indicate that, starting in January, rentals
gradually increase; they continue the increasing pattern through
Spring and reach a peak in the Summer. Then rentals decrease through
Fall and Winter to meet the trends of January and Spring of the next
year, a sign of cyclical behavior. Peak bike rentals occur in Summer.
The middle-left box plot indicates the demand for bicycle rentals is
higher during working days. This phenomenon is even more
pronounced, almost 50% more, for the second year. This behavior
signals that the popularity of bike rentals is increasing, at least in the
city where this data set originates.
The middle-right box plot shows people ride less on holidays. However,
this plot also reveals the increase in the popularity of bicycle rentals
throughout the years.
The bottom box plot also reflects both the increase in popularity and
cyclical behavior, this time with two cycles over a short period of time,
a day. The demand for bicycles peaks between 7–9 AM and 5–6 PM.
These patterns match workers going to work and leaving work; workers
actually use bicycles as a means of transportation to work. Notice that
demand remains relatively constant between peak hours; this could be
explained by “regular people”, tourists, and students using bicycles.
Let’s inspect the next set of plots. Observe how the brain can digest
multiple trends easily, thanks to different sorting and colors on the plot.
The inflections of the continuous trend lines depict the average number of
rented bicycles for each hour throughout the day. These trend lines
resemble the same patterns and characteristics that we just explained.
These plots give us an additional piece of information. The dashed lines
represent the variation of average temperature through the hours of the
day. This dashed trend helps us determine, at least partially (remember this
is a multivariate analysis; humidity, wind speed, and dew point are not in
these plots), the influence of temperature on the number of rental bikes
through the hours of the day.
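The numbers behind trend lines of this kind can be computed with a group-by. The sketch below is an illustration on a tiny made-up table; the ‘season’ and ‘hr’ column names are assumptions for this example.

```python
import pandas as pd

# Made-up sample with two seasons and two hours of the day.
df = pd.DataFrame({
    "season": ["Spring", "Spring", "Spring", "Spring", "Fall", "Fall", "Fall", "Fall"],
    "hr":     [8, 8, 17, 17, 8, 8, 17, 17],
    "temp":   [0.20, 0.30, 0.30, 0.40, 0.60, 0.70, 0.70, 0.80],
    "cnt":    [100, 140, 150, 170, 300, 340, 400, 420],
})

# Average count and average temperature for each (season, hour) pair.
hourly = df.groupby(["season", "hr"])[["cnt", "temp"]].mean()

# One column per season, one row per hour: ready for line plots
# (solid lines for counts, dashed lines for temperature).
cnt_by_hour = hourly["cnt"].unstack("season")
temp_by_hour = hourly["temp"].unstack("season")
print(cnt_by_hour)
```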
The plot on the top reveals interesting trends. Fall shows the highest
average temperatures throughout the day; in contrast, Spring shows the
lowest average temperature throughout the day. These lowest average
temperatures in Spring seem to correlate with the lowest average bike
count, which also happens in Spring. We could say that people don’t bike
much in Spring; one possible but unverified explanation is that people want
to stay away from pollen and dust to avoid allergies.
There is an overlap of the trends that tell us about the average bicycle
counts for Summer, Fall, and Winter. One could think that this could be
related to temperature; however, the trends that tell us about the average
temperature for the same seasons do not overlap. In other words,
separation of the average temperature trends does not match separation
for average count trends for Summer, Fall, and Winter.
The last plot also reveals an interesting trend. The average temperature
remains constant throughout the days of the week. You may ask: how is
this possible when the previous plot shows that average temperature
changes for the same hour of the day? Well, this plot shows average
temperatures for all Mondays across all the seasons, say at hour 23
exactly. This average temperature happens to be roughly the same for
Tuesday, Wednesday, etc. You can run the calculations manually, say at
hour 23, for confirmation. Notice the sinuosity of the temperature trend;
even though it is an average trend, it follows the trends reported
in weather reports.
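The manual confirmation at hour 23 could be run along these lines. This is a sketch on made-up values; the ‘weekday’ and ‘hr’ column names are assumptions for this example.

```python
import pandas as pd

# Made-up sample: readings for three weekdays, mostly at hour 23.
df = pd.DataFrame({
    "weekday": [0, 0, 1, 1, 2, 2],
    "hr":      [23, 23, 23, 23, 23, 12],
    "temp":    [0.40, 0.50, 0.44, 0.46, 0.45, 0.70],
})

# Keep only hour 23, then average the temperature for each day of the week.
temp_at_23 = df[df["hr"] == 23].groupby("weekday")["temp"].mean()
print(temp_at_23)
```

If the averages come out close to each other across weekdays, as they do in this made-up sample, the flat temperature trend in the plot is expected.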
There is another interesting trend in this last plot. Imagine for a moment
that the peaks of average bicycle counts are not in the last plot. Then the
average demand for bicycles is higher during the weekends, especially late
morning and early afternoon.
7. Conclusions
In general:
Monochromatic plots showing all the data for all the variables are
deceiving to the human brain. Humans are visual learners, and color
on the plots makes a big difference. When plots show data with colors
referencing different categories, tags, or short time frames, they unveil
valuable patterns and insights during Exploratory Data Analysis.
The first line of action to address nulls, zeros, and outliers is the
identification of correlations or similarities among the available
variables. Strategies that use metrics such as mean or median to
impute values are suitable for univariate analysis and should be the last
line of action.
Random Forest Regressor is a good starting point to address nulls,
zeros, and outliers; it provides results that honor the physics of the
phenomena in this data set.
The trend that describes the average bicycle counts during the
weekends differs from that of Monday through Friday (working days).
During working days, demand for bicycles peaks during rush hours
6:30–8:30 AM, 5:00–7:00 PM.
The trend that describes the average bicycle counts during Spring
differs from that of Summer, Fall, and Winter. Average bicycle demand
for Summer, Fall, and Winter remains relatively constant (observe
these trends overlap).
It is not 100% clear whether temperature alone is the cause of these
different trends in the demand for bicycles. Remember temperature is
not the only factor in weather conditions.
The trends of bicycle counts are cyclical through the day and through
the year.
8. References
The Modified Bike-Sharing data set and the Python source code of this
publication reside in my GitHub account.
Humidity or amount of water vapor in the air is usually reported as
Relative Humidity on weather reports.
Weather reports describing interaction between relative humidity,
temperature, and dew point.
Calculator of relative humidity, temperature, dew point.
Happy Learning…..
Sunil Arava