
Applied Exploratory Data Analysis, Bike-Sharing. The Power of Visualization, Python.
https://towardsdatascience.com/applied-exploratory-data-analysis-the-power-of-visualization-bike-sharing-python-c5b2645c3595

1. Introduction
This study analyzes a Modified Bike-Sharing data set. Unlike the original
data set, this “Modified” version includes nulls, zeros, and outliers, which
opens the door to a detailed Exploratory Data Analysis (EDA). Many of the
public studies on Bike-Sharing include a basic EDA and then go straight into
modeling. The general goal of this analysis is to perform an extensive EDA
that takes into account the physics of the phenomena and provides
insights for modeling Bike-Sharing rentals.
After reading this post you will know about:
 How exploiting the visual power of the human brain helps reveal
patterns or trends in congested plots.
 The tricks to make those rows that have cells with null and non-null
values visible in a plot.
 The identification of suspicious patterns and proposed steps to address
them.
 The sequential steps that unveil the story behind the data during the
EDA.
 The importance of considering the Physics that explains the
phenomena before performing imputations.
 The preliminary assessment of the variables that influence the number
of bicycle rentals.
 The Python code to present data on plots with visual contrast, which
helps to unveil patterns and understand the “puzzle” behind the data.

Note that data exploration is performed using Python (version 3.7.4) with
data science libraries such as NumPy, pandas, scikit-learn, seaborn,
category_encoders, SciPy, and datetime, among others.

Table of contents
1. Introduction
2. Attribute Information
3. Nulls, Zeros, and Outliers
4. Correlation Structure
5. Univariate Behavior of the Response Variable
6. Behavior of Response Variable With Time Variables
7. Conclusions
8. References
2. Attribute Information
The data set contains six variables related to time, four continuous
variables, and the response variable, the count of bicycles ‘cnt’. The ASCII
file and the Python code reside in my GitHub account.

2.1. Feature Engineering and Encoding


To a) facilitate data filtering, b) extract hidden information from the
data, c) analyze trends on plots, and d) model the Bike-Sharing rentals,
it is handy to derive and encode some features.
Notice that the ‘season’ variable carries an ordering (i.e., Summer is
followed by Fall). Encoders such as LabelEncoder() from scikit-learn will
not honor the order of the seasons because they assign codes in
alphabetical order. Instead, the category_encoders library allows the use
of a dictionary mapping to establish a custom order.
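
To make the ordering issue concrete, here is a minimal sketch. It uses a plain pandas `map` as a stand-in for the dictionary mapping offered by category_encoders, and the season labels, the chosen order, and the column names `season_enc` and `season_alpha` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical sample rows; the real data set has many more.
df = pd.DataFrame({"season": ["Winter", "Spring", "Summer", "Fall", "Spring"]})

# Assumed custom order: Spring < Summer < Fall < Winter.
season_order = {"Spring": 1, "Summer": 2, "Fall": 3, "Winter": 4}
df["season_enc"] = df["season"].map(season_order)

# For comparison: codes assigned in alphabetical order of the labels,
# which is the behavior the text warns about.
df["season_alpha"] = df["season"].astype("category").cat.codes

print(df)
```

With the alphabetical codes, Fall ends up before Spring, so the numeric order no longer follows the calendar; the dictionary mapping keeps it.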

3. Nulls, Zeros, and Outliers


Data tables have columns identifying features or variables and rows with
cells holding data for each one of the features. If a row happens to have a
cell with a null value, the entire row will cause some algorithms to cough
even though the rest of the cells in that row are non-null values. Zeros and
outliers, on the other hand, will influence the model negatively if they
represent sources of error. Therefore, the first task is to identify, analyze,
and then address nulls, zeros, and outliers. The use of tags facilitates this
process, which is described in this section of the analysis.

3.1. Start by Tagging Nulls.


The execution of df.isna().sum() identifies the features with nulls in the
data set. ‘temp’ and ‘hum’ are the only two features with nulls. A column
named ‘outlr_miss’ will store the tags ‘temp_nan’ and ‘hum_nan’. (Note that
NaN in Python means Not-a-Number; that is the reason for the suffix ‘_nan’.)

Once the rows containing null values have been tagged, there should be
exactly 23 rows showing ‘temp_nan’ or ‘hum_nan’ tags. Observe the
following Pandas data-frame output.
3.2. Tagging Zeros

The figure on the side shows the output of df.isin([0]).sum(). Ordinal
features such as hours ‘hr’ or Boolean features such as ‘holiday’ or
‘workingday’ are expected to have zeros. However, zeros in continuous
features such as ‘atemp’ with 2 zero-values, ‘hum’ with 22 zero-values, and
‘windspeed’ with 2180 zero-values are suspicious. Notice the tag has the
suffix ‘_zero’ indicating the row has zeros. The following is a snippet of the
Python code to encode nulls and zeros.
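
A minimal sketch of this kind of tagging, under stated assumptions: the sample rows are made up, the helper `tag_rows` and the default tag ‘none’ are hypothetical names, and each row keeps only the first tag that applies (a simplification).

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the Bike-Sharing table.
df = pd.DataFrame({
    "temp":      [0.24, np.nan, 0.30, 0.32],
    "atemp":     [0.28, 0.27, 0.0, 0.33],
    "hum":       [0.81, 0.80, 0.75, 0.0],
    "windspeed": [0.0, 0.10, 0.20, 0.15],
})

def tag_rows(df, null_cols=("temp", "hum"),
             zero_cols=("atemp", "hum", "windspeed")):
    """Return one tag per row; a row keeps the first tag that applies."""
    tags = pd.Series("none", index=df.index)
    for col in null_cols:
        mask = df[col].isna() & (tags == "none")
        tags[mask] = f"{col}_nan"
    for col in zero_cols:
        mask = (df[col] == 0) & (tags == "none")
        tags[mask] = f"{col}_zero"
    return tags

df["outlr_miss"] = tag_rows(df)
print(df["outlr_miss"].value_counts())
```

Because the tags live in their own column, the original values stay untouched and the tags can later be used both for coloring plots and for filtering rows.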

3.3. Scatter Matrix Plot or Pair Plots: Visualize Nulls, Zeros, and Outliers
Scatter matrix plots are useful in the assessment of collinearity among
features and the identification of nulls, zeros, and outliers. At first view,
matrix plots look like a bunch of small monochromatic plots, sometimes
so tiny that they make us wonder about their purpose. But when these
plots show data with the correct color contrast, they trigger the human
brain. Remember that the human brain is visual by nature; it processes
information far better through images and colors. Here is where tags are
useful: they become a category that colors nulls, zeros, and outliers.
At this point in the analysis, the Bike-Sharing data table includes rows
with cells that hold null values. This is an implicit problem because the
majority of plotting and machine learning (ML) algorithms cough in the
presence of nulls. By default, these algorithms address this issue by
excluding rows that have null values. The consequence of that default
setting is that rows that have both null and non-null values become
completely invisible on the plot. The following section explains a Null-zero
replacement trick that uses tags to make rows that have null and non-null
values visible in a plot.

3.3.1. The code for plotting


The Python library seaborn facilitates the construction of a variety of
plots; however, algorithms such as pairplot(..) cough when nulls are present.
This analysis proposes two tricks to overcome this issue.

 Null-zero replacement trick. Plots receive data in a table-like
format; they have columns representing features or variables and rows
representing the actual data. When a table has rows with nulls, plotting
algorithms generate errors. To overcome this issue, users and
statistical applications automatically exclude rows that have null
values. As a result, non-null values located in those same rows become
‘invisible’ on the plot.
This analysis proposes a Null-zero replacement trick that uses
zeros to replace nulls. This transformation takes place at run time,
while data are fed to the Python plot algorithm; therefore, nulls
remain intact in the original data set.
 Custom color palette for tags. Python plot algorithms follow a
predefined color palette, and many times poor contrast between colors
hides data patterns from the human eye. The definition of a custom
color palette addresses this issue; observe the next Python code.
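
The two tricks can be sketched as follows. This is a minimal illustration, not the original code: the sample rows, the helper names `plot_frame` and `pair_plot`, and the colors in `tag_palette` are assumptions. The seaborn import sits inside the plotting function so that the data step can be checked without drawing anything.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that already carry tags in 'outlr_miss'.
df = pd.DataFrame({
    "temp": [0.24, np.nan, 0.30],
    "hum": [0.81, 0.80, np.nan],
    "cnt": [16, 40, 32],
    "outlr_miss": ["none", "temp_nan", "hum_nan"],
})

# Assumed tag-to-color mapping; any high-contrast colors would do.
tag_palette = {"none": "lightgray", "temp_nan": "orange", "hum_nan": "blue"}

def plot_frame(df):
    """Copy of df with numeric nulls shown as zeros; df itself is untouched."""
    out = df.copy()
    num_cols = out.select_dtypes("number").columns
    out[num_cols] = out[num_cols].fillna(0)
    return out

def pair_plot(df, palette=tag_palette):
    """Draw the scatter matrix, colored by tag, from the zero-filled copy."""
    import seaborn as sns  # imported here so the data step runs without seaborn
    return sns.pairplot(plot_frame(df), hue="outlr_miss",
                        palette=palette, diag_kind="hist")

plot_df = plot_frame(df)
print(plot_df)
```

Calling `pair_plot(df)` would feed the zero-filled copy to seaborn, while `df` still holds its nulls for the later imputation steps.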

3.3.2. The analysis of nulls, zeros, and outliers


The focus of this section is the description of tips that help identify and
analyze nulls, zeros and outliers in plots.

(1) Unveiling rows with null and non-null values. The Null-zero
replacement trick allows the presentation of rows that have null and
non-null values in the scatter matrix plot. Observe the orange dots that
correspond to the ‘temp_nan’ tag; each orange dot represents cell values
from a row in the data table. Keep in mind that these orange dots represent
either null or non-null values depending on the plot. On one hand, orange
dots represent null values in those plots where the temperature ‘temp’ is
one of the variables; these plots are surrounded by an orange rectangle in
the scatter matrix plot. Note that zeros represent null values in these plots;
this is why the orange dots follow a linear pattern at ‘temp=0’. On the other
hand, orange dots represent non-null values on those plots where
temperature ‘temp’ is not one of the variables.
Both orange and blue dots follow the same interpretation. Notice that
these dots have tags with the same suffix ‘_nan’. The next step of this
workflow will address these null values before going into modeling.

NOTE: one improvement to the Null-zero replacement trick is to use
a transparent color for those dots that represent null values. This
requires additional coding to manipulate data and assign colors at run
time.
(2) Outliers showing true zero-values. Observe the black dots
corresponding to the ‘hum_zero’ tag; they represent humidity ‘hum’ with
a value of zero. These zero values do exist in the data table. These black
dots form a linear pattern that stands out as an outlier. The interpretation
of dots with ‘atemp_zero’ and ‘windspeed_zero’ tags is similar to that of
the black dots. These zero-values, which stand out as outliers, need to be
addressed before performing further analysis.

(3) Outliers. The scatter plots that have temperature ‘temp’ and ‘feel-
like’ temperature ‘atemp’ as variables show a sequence of gray dots
forming a linear pattern. These dots should be labeled as outliers because:
1) they stand out from the majority of the dots forming a 45-degree trend,
2) they show a constant value for ‘atemp’, which is suspicious, and 3) they
all occur during one day, 08–17–2012, which is also suspicious. Once
again, this is a data set that was modified, probably by hand, and these
outliers need to be addressed before modeling. The following is a snippet
of the Python code to tag these outliers as ‘atemp_outlr’:
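
A minimal sketch of such a tagging step, assuming a date column named ‘dteday’ and a default tag ‘none’ (both assumptions), with made-up values. The rule here relies only on the date; a stricter rule could also require the constant ‘atemp’ value.

```python
import pandas as pd

# Hypothetical rows around the suspicious day.
df = pd.DataFrame({
    "dteday": pd.to_datetime(["2012-08-16", "2012-08-17",
                              "2012-08-17", "2012-08-18"]),
    "temp":   [0.70, 0.68, 0.72, 0.66],
    "atemp":  [0.65, 0.24, 0.24, 0.62],
    "outlr_miss": ["none", "none", "none", "none"],
})

# Tag every still-untagged row recorded on the suspicious day.
on_day = df["dteday"] == pd.Timestamp("2012-08-17")
untagged = df["outlr_miss"] == "none"
df.loc[on_day & untagged, "outlr_miss"] = "atemp_outlr"

print(df[["dteday", "atemp", "outlr_miss"]])
```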

3.4. Addressing Nulls, Zeros, And Outliers


Strategies that use metrics such as mean, median, or mode to perform
imputations seem to be suitable for univariate analysis. The Bike-Sharing
data set is a multivariate data set, and the first attempt to address nulls,
zeros, and outliers should involve the exploration of correlations or
similarities among the features within the data set.

3.4.1. Addressing nulls, zeros, and outliers for temperature
We all have experienced it: at high temperatures ‘temp’, the ‘feel-like’
temperature is higher than the actual temperature; conversely, at low
temperatures, the ‘feel-like’ temperature is lower than the actual
temperature. The linear relationship between these variables helps
address and correct nulls, zeros, and outliers for temperature.
The next figure shows a linear relationship between
temperature ‘temp’ and ‘feel-like’ temperature ‘atemp’. One variable tells
about the other. Therefore, one of these variables can be dropped from the
analysis during modeling. However, we will use this relationship to
emphasize the fact that when it comes to addressing nulls, zeros, or
outliers, looking for relationships or similarities between the variables
should be the first option in the list.

The scatter plot on the left presents all the original data. The colors, which
represent tags, reveal three suspicious patterns (observe the suffix of each
tag): Outliers (light blue dots), zeros (magenta dots), and nulls (orange
dots). Remember, orange dots represent null values replaced by zeros.
Because the Null-zero trick makes the nulls visible as zeros on the plot, we
can visually assess how many there are and whether corrections to
these null values will make sense.

This is where the tags come in handy; they allow filtering out the
suspicious trends before data is fed into the Linear Regression model from
scikit-learn. Once the model is trained, it is used to correct the suspicious
patterns in temperature.
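
A minimal sketch of this filter-train-correct workflow, under stated assumptions: the rows are synthetic, the tag ‘none’ marks clean rows, and two small models are fitted (one per direction) so that both a null ‘temp’ and a suspicious ‘atemp’ can be corrected. The original code may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic rows: clean ones follow atemp = temp + 0.02 exactly.
df = pd.DataFrame({
    "temp":  [0.20, 0.40, 0.60, 0.80, np.nan, 0.70],
    "atemp": [0.22, 0.42, 0.62, 0.82, 0.52, 0.24],
    "outlr_miss": ["none", "none", "none", "none",
                   "temp_nan", "atemp_outlr"],
})

# Train only on rows without a suspicious tag.
clean = df[df["outlr_miss"] == "none"]
temp_from_atemp = LinearRegression().fit(clean[["atemp"]], clean["temp"])
atemp_from_temp = LinearRegression().fit(clean[["temp"]], clean["atemp"])

fixed = df.copy()

# Null temperatures are predicted from the feel-like temperature.
nan_rows = fixed["outlr_miss"] == "temp_nan"
fixed.loc[nan_rows, "temp"] = temp_from_atemp.predict(
    fixed.loc[nan_rows, ["atemp"]])

# Suspicious feel-like temperatures are predicted from the temperature.
bad_atemp = fixed["outlr_miss"].isin(["atemp_zero", "atemp_outlr"])
fixed.loc[bad_atemp, "atemp"] = atemp_from_temp.predict(
    fixed.loc[bad_atemp, ["temp"]])

print(fixed)
```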

3.4.2. Addressing Nulls and zeros for humidity


Humidity or amount of water vapor in the air is usually reported as
Relative Humidity on weather reports. Relative Humidity is related to dew
point and temperature; therefore, the first attempt that tries to address
nulls and zeros for humidity should involve the identification of
correlations or similarities with temperature and dew point.
The Bike-Sharing data set provides values for humidity and temperature,
but values for dew point are missing in this data set. The next figure
investigates possible relationships or correlations for humidity with the
available variables in the data set. Keep in mind blue dots show null values
presented as zero in these plots (these points do not exist in the data table).
The plots show the absence of a defined pattern for humidity; however,
the patterns are there. The human brain is a visual learner, and the data
needs to be presented in chunks and with color to reveal possible patterns.
Exploring the data in more detail should help to identify possible
relationships for humidity. Think about this: the temperature during the
morning is low, then it rises progressively to a peak during the afternoon,
and finally, it decreases at the end of the day. Since temperature is related
to humidity, this suggests that a humidity relationship should exist
during the day. The next graph shows temperature and humidity day by
day.
The previous plots show an absence of a defined pattern for daily values
of humidity, temperature, and wind speed. However, the next graph
reveals a different picture; it shows a pattern when humidity and
temperature interact through the hours of the day. The third plot, from
left to right, shows humidity values of zero; these values exist in the data
set, they are not the result of the Null-zero replacement trick.

Note that data presented by colors and chunks (by days) allows the brain
to recognize the opposite patterns between temperature and humidity.
It would not make sense to replace the zeros in humidity with an average
value.

The next figure confirms this; black dots have a ‘hum_zero’ tag. The
suffix ‘_zero’ indicates that those are in fact zeros in the original data set.

Observe that humidity and temperature follow opposite trends, which is
in line with the behavior described by weather reports. Humidity is related
not only to the temperature but also to the dew point; as a result, to
calculate one value, the other two are needed. Dew point is absent in the
Bike-Sharing data set; therefore, there is no straightforward
relationship that can be derived from the Bike-Sharing data set to
calculate null and zero values for humidity.
Any relationship that addresses null and zero values for humidity needs to
honor the daily relationship of humidity, temperature, and dew point.
Random Forest is a great starting point to address nulls and zeros for
humidity because:

1) the relationship between humidity, temperature, and the dew point is
not linear,

2) Random Forest does a good job predicting values within the range of
values where it was trained, and

3) the absence of information related to the dew point makes it challenging
to derive humidity values directly.

The data set that is used to train the Random Forest Regressor has non-
null values for humidity. Once the Random Forest Regressor is trained, it
is used to predict null values for humidity. The following is a snippet of the
Python code.
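
A minimal sketch of this step, under stated assumptions: the data are synthetic, the predictors ‘hr’ and ‘temp’ are an assumed feature set (the original code may use more features), and rows with zero humidity are excluded from training along with the nulls.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: temperature follows a daily cycle and humidity
# falls as temperature rises.
rng = np.random.default_rng(0)
n = 200
hr = rng.integers(0, 24, n)
temp = 0.5 + 0.2 * np.sin((hr - 9) / 24 * 2 * np.pi) + rng.normal(0, 0.02, n)
hum = 0.9 - 0.6 * temp + rng.normal(0, 0.02, n)
df = pd.DataFrame({"hr": hr, "temp": temp, "hum": hum})
df.loc[:9, "hum"] = np.nan   # pretend ten readings are null
df.loc[10:19, "hum"] = 0.0   # and ten are suspicious zeros

features = ["hr", "temp"]    # assumed predictors
bad = df["hum"].isna() | (df["hum"] == 0)

# Train on the rows with trustworthy humidity only.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(df.loc[~bad, features], df.loc[~bad, "hum"])

# Fill the nulls and zeros in a copy; the original stays untouched.
df_fixed = df.copy()
df_fixed.loc[bad, "hum"] = model.predict(df.loc[bad, features])

print(df_fixed.loc[bad, "hum"].describe())
```

Since a Random Forest averages training targets, every predicted value falls inside the range of humidity seen during training, which is the property listed in point 2 above.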

The next graph shows the results of the Random Forest Regressor.
Observe the plot for Thursday 2011/3/10; before the application of
Random Forest, this plot had a horizontal trend at ‘hum=0’. After the
application of the Random Forest Regressor, this plot shows a humidity
trend that: a) matches the daily humidity trends reported in weather
reports and b) picks up where the Wednesday trend ends and finishes
where the Friday trend begins.

Therefore, in this case, the Random Forest Regressor not only does a good
job addressing nulls and zeros, but also honors the underlying physics of
humidity, temperature, and dew point.
3.4.3. Addressing zeros in wind speed
The next figure shows the behavior of wind speed with different features.
Observe the green dots; they represent zero values in wind speed. These dots
form not only a detached pattern in the scatter plots but also a bimodal
behavior in the distribution plot.
The Random Forest Regressor is a good starting point to address this
suspicious pattern in wind speed.

The next figure presents the results of the Random Forest Regressor.

The distribution plot shows a unimodal behavior, and the green dots no
longer form a horizontal trend at ‘windspeed=0’.
4. Correlation Structure
One way to identify collinearity among the features is by using the
correlation matrix, which provides the Pearson correlation coefficient.
Intuitively, the bicycle counts depend on variables that are related not only
to weather conditions but also to time. Therefore, the correlation
matrix includes both weather and time variables.
The collinearity between temperature ‘temp’ and the feel-like
temperature ‘atemp’ is evident. As a result, the feel-like temperature,
which is derived from ‘temp’, will be excluded during the modeling
process. In general, the bicycle counts show a positive correlation with
temperature and a negative correlation with humidity. The bottom row of
the scatter plot matrix shows these tendencies. Also, notice that some of
the time variables have a high correlation with the bicycle counts; this is
especially true for hours. The next sections will explore, graphically, the
relationship of bicycle counts with time variables.
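
As a small illustration of the correlation matrix, here is a sketch with made-up rows; only the column names come from the data set, and the values are chosen so that ‘atemp’ tracks ‘temp’ exactly.

```python
import pandas as pd

# Hypothetical rows; in this sample atemp = temp + 0.02.
df = pd.DataFrame({
    "hr":    [0, 6, 12, 18, 23],
    "temp":  [0.20, 0.30, 0.60, 0.50, 0.25],
    "atemp": [0.22, 0.32, 0.62, 0.52, 0.27],
    "hum":   [0.90, 0.80, 0.40, 0.50, 0.85],
    "cnt":   [10, 60, 200, 250, 30],
})

# Pearson correlation between every pair of columns.
corr = df.corr(method="pearson")

# Correlation of each feature with the response, strongest first.
print(corr["cnt"].sort_values(ascending=False))
```

A coefficient close to 1 between ‘temp’ and ‘atemp’ is the signal of collinearity that justifies dropping one of the two before modeling.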
5. Univariate Behavior of the Response Variable
The distribution of the number of bicycles ‘cnt’ is right-skewed, and a
logarithmic transformation is applied to reduce the skew.
Even though the logarithmic transformation does not correct the
skewness completely, the results are close to a normal distribution.
This transformation will have a positive impact on some of the Machine
Learning algorithms.
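
The effect of the transformation on skewness can be checked with a short sketch. The counts below are synthetic (drawn from a log-normal distribution), and `np.log1p` is an assumed choice that keeps zero counts finite; the original code may use a plain logarithm.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts standing in for 'cnt'.
rng = np.random.default_rng(0)
cnt = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=2000)).round().astype(int)

# log1p(x) = log(1 + x), so a count of zero maps to zero.
log_cnt = np.log1p(cnt)

print(f"skew before: {cnt.skew():.2f}, after: {log_cnt.skew():.2f}")
```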
6. Behavior of Response Variable With Time Variables
Observe the median (the line that splits the data in half) of the box plots
in the next figure; together, the box plots show four general characteristics
of the number of bicycle rentals:
1) rentals are increasing over the years,
2) trends initially increase and then decrease over time,
3) trends are cyclical through time, and
4) outliers are present.
In more detail, the following statements can be derived:

 The first two box plots indicate that starting in January, rentals
gradually increase; they continue the increasing pattern through
Spring and reach a peak in the Summer. Then rentals decrease through
Fall and Winter to meet the trends of January and Spring of the next
year, a sign of cyclical behavior. Peak bike rentals occur in Summer.
 The middle-left box plot indicates the demand for bicycle rentals is
higher during working days. This phenomenon is even more
pronounced, almost 50% more, for the second year. This behavior
signals that the popularity of bike rentals is increasing, at least in the
city that is the source of this data set.
 The middle-right box plot shows people ride less on holidays. However,
this plot also reveals the increase in the popularity of bicycle rentals
throughout the years.
 The bottom box plot also reflects both the increase in popularity and
the cyclical behavior, this time with two cycles over a short period of
time, a day. The demand for bicycles peaks between 7–9 AM and 5–6 PM.
These patterns match workers going to work and leaving work; workers
actually use bicycles as a means of transportation to work. Notice that
demand remains relatively constant between peak hours; this could be
explained by “regular people”, tourists, and students using bicycles.
Let’s inspect the next set of plots. Observe how the brain can digest
multiple trends easily, thanks to different sorting and colors on the plot.
The inflections of the continuous trend lines depict the average number of
rented bicycles for each hour throughout the day. These trend lines
show the same patterns and characteristics that we just explained.
These plots give us an additional piece of information. The dashed lines
represent the variation of average temperature through the hours of the
day. This dashed trend helps us determine, at least partially (remember this
is a multivariate analysis; humidity, wind speed, and dew point are not in
these plots), the influence of temperature on the number of rental bikes
through the hours of the day.

The plot on the top reveals interesting trends. Fall shows the highest
average temperatures throughout the day; in contrast, Spring shows the
lowest average temperature throughout the day. These lowest average
temperatures in Spring seem to correlate with the lowest average bike
count, which also happens in Spring. We could say that people don’t bike
much in Spring; maybe people want to stay away from pollen, dust, etc.
During these periods, people don’t want to catch allergies.

There is an overlap of the trends that tell us about the average bicycle
counts for Summer, Fall, and Winter. One could think that this could be
related to temperature; however, the trends that tell us about the average
temperature for the same seasons do not overlap. In other words,
separation of the average temperature trends does not match separation
for average count trends for Summer, Fall, and Winter.

(*) Therefore, the average temperature during Summer, Fall, and
Winter does not change the average demand trends of bicycles. However,
since the temperature is not the only factor controlling the weather
conditions, it seems that a mix of variables contributes to a
relatively constant bicycle demand during Summer, Fall, and Winter.
(*) Also, the average bicycle demand during Summer, Fall, and Winter
is higher compared to the demand during Spring.

The last plot also reveals an interesting trend. The average temperature
remains constant throughout the days of the week. You may ask: how is
this possible when the previous plot shows that average temperature
changes for the same hour of the day? Well, this plot shows average
temperatures for all Mondays across all the seasons, say at exactly
23:00. This average temperature happens to be roughly the same for
Tuesday, Wednesday, etc. You can run the calculations manually, say at
23:00, for confirmation. Notice the sinuosity of the temperature trend;
even though it is an average trend, it follows the trends reported
in weather reports.
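
The manual check suggested above can be sketched with a filter and a group-by. The rows are made up for illustration; the column names ‘weekday’, ‘hr’, and ‘temp’ are assumed to match the data set.

```python
import pandas as pd

# Hypothetical rows: weekday codes, hour of the day, temperature.
df = pd.DataFrame({
    "weekday": [0, 0, 1, 1, 2, 2, 0, 1],
    "hr":      [23, 23, 23, 23, 23, 23, 12, 12],
    "temp":    [0.40, 0.44, 0.41, 0.43, 0.42, 0.42, 0.60, 0.62],
})

# Average temperature at 23:00 for each day of the week.
at_23 = df[df["hr"] == 23].groupby("weekday")["temp"].mean()
print(at_23)
```

Individual readings differ from day to day, yet the per-weekday averages at the same hour can still coincide, which is what the flat trend across weekdays reflects.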

There is another interesting trend in this last plot. Imagine for a moment
that the peaks of average bicycle counts are not in the last plot. Then the
average demand for bicycles is higher during the weekends, especially in
the late morning and early afternoon.
7. Conclusions
In general:

 Monochromatic plots showing all the data for all the variables are
deceiving to the human brain. Humans are visual learners, and color
on the plots makes a big difference. When plots show data with colors
referencing different categories, tags, or short time frames, they unveil
valuable patterns and insights during Exploratory Data Analysis.
 The first line of action to address nulls, zeros, and outliers is the
identification of correlations or similarities among the available
variables. Strategies that use metrics such as mean or median to
impute values are suitable for univariate analysis and should be the last
line of action.
 Random Forest Regressor is a good starting point to address nulls,
zeros, and outliers; it provides results that honor the physics of the
phenomena in this data set.
 The trend that describes the average bicycle counts during the
weekends differs from that of Monday through Friday (working days).
During working days, demand for bicycles peaks during rush hours,
6:30–8:30 AM and 5:00–7:00 PM.
 The trend that describes the average bicycle counts during Spring
differs from that of Summer, Fall, and Winter. Average bicycle demand
for Summer, Fall, and Winter remains relatively constant (observe
these trends overlap).
 It is not 100% clear whether temperature alone is the cause of these
different trends in the demand of bicycles. Remember temperature is
not the only factor in weather conditions.
 The trends of bicycle counts are cyclical through the day and through
the year.
8. References
 The Modified Bike-Sharing data set and the Python code source of this
publication reside in my GitHub account.
 Humidity or amount of water vapor in the air is usually reported as
Relative Humidity on weather reports.
 Weather reports describing interaction between relative humidity,
temperature, and dew point.
 Calculator of relative humidity, temperature, dew point.

Happy Learning…..

Sunil Arava
