
IEEE TRANSACTIONS ON EDUCATION, VOL. 1, NO. 1, NOVEMBER 2020.

Business Intelligence: Extract-Transform-Load (ETL) and Data Analysis with Python and MongoDB, Bixi Montréal Example
Adrián E. Caldera, Cinthya E. Pedroza, and Carlos Sandoval

Abstract—Currently, the information that companies generate in their business processes is being used to create value that supports those same processes, a practice known as Business Intelligence (BI). However, obtaining information that generates value and gives the company a competitive advantage is not easy, due to the nature of the data: its volume, variety, velocity, veracity, and value. BI uses a set of techniques and tools to transform large amounts of raw data into useful and meaningful information, allowing its interpretation and the identification of where it can be applied in the company, a process known as Big Data Science. Organizations that collect large amounts of unstructured data are increasingly turning to non-relational databases, now called NoSQL databases. In this project, we describe the Big Data Science process used for the analysis of information from the company Bixi Montréal, based on the extraction, transformation, and loading (ETL) of the data and its subsequent analysis to obtain useful information, using the Python programming language for the ETL process and the analysis, and MongoDB for loading the data in JSON format. In addition, the results of the Big Data Science process are shown and discussed.

Index Terms—Business Intelligence, Data Analysis, Big Data, Raw Data, Extract-Transform-Load (ETL), Python, MongoDB.

I. INTRODUCTION

The exponential growth of the volume of data generated by users, systems, and sensors has been further accelerated by the concentration of a large part of this volume on big distributed systems, and by the increasing interdependency and complexity of data driven by the Internet, Web 2.0, social networks, and open, standardized access to data sources from many different systems. NoSQL systems are distributed, non-relational databases designed for large-scale data storage and for massively parallel data processing across a large number of commodity servers [1].

In the last two decades, the significant evolution of BI has generated new possibilities not only in the collection and analysis of data for its use in decision making support systems (DSS) and in financial, strategic, marketing, sales, and production areas, but also in providing clients with the services they require based on the Big Data Science process, improving the business processes of the company and introducing more robust products and services to the market [2].

The constant growth of companies, as well as the enormous amount of information generated both inside and outside of them, makes it more difficult to improve their internal processes and to understand what their customers need. This is where the BI process becomes highly relevant as a business tool for companies [3].

In this paper, we describe a Big Data Science process for the analysis of records of trips of the company Bixi Montréal from April 2014 to November 2014, using the Python programming language for preparing and cleaning the data as well as for its analysis, and generating a structured file (JSON) to store in a NoSQL database, in this case MongoDB.

This report is organized as follows: first, basic concepts are presented in Section II; then the methodology employed for this analysis is described in Section III, and statistical results are shown in Section IV. Finally, these results are discussed in Section V.

————————————————
• A. E. Caldera is with the Autonomous University of Aguascalientes, 20131, Aguascalientes, México. E-mail: al138329@edu.uaa.mx.
• C. E. Pedroza is with the Autonomous University of Aguascalientes, 20131, Aguascalientes, México. E-mail: al92194@edu.uaa.mx.
• C. Sandoval is with the Autonomous University of Aguascalientes, 20131, Aguascalientes, México. E-mail: al285667@edu.uaa.mx.


II. BASIC CONCEPTS

This section presents a review of the literature on the basic concepts of BI and the tools used for the example of the Big Data Science process of the company Bixi Montréal: Python and MongoDB.

A. Business Intelligence (BI)

Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop, and otherwise create new strategic business opportunities.

The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability.

Business Intelligence (BI) is an umbrella term that encompasses the processes, tools, and technologies to turn data into information, information into knowledge, and knowledge into plans to effectively conduct business activities. BI encompasses the technologies of data warehousing, the processes on the "back end" (queries, reports, analysis), and the tools to display information (the BI tools) and processes on the front end [4].

The purpose of business intelligence (BI) solutions is much the same as the purpose of "military intelligence": to give commanders at every stage of a campaign a good view of the battlefield and the pros and cons of their options.

In the business context, this means telling the business, on an ongoing basis, the state of the business (production, sales, profits, losses) and the factors that affect success (success of a sales campaign, customer satisfaction, manufacturing efficiency, planning and budgeting).

In the early 1990s, ETL (extract, transform, load) software attached to a data warehouse was introduced, with a specialized version called EAI (enterprise application integration) to handle communication between ERP packages and data transmission from the ERP data stores to the data warehouse.

The first BI tools (from companies such as Cognos and Business Objects, which today form a large part of the market under their new owners) initially aimed to run queries on business data in order to dig deeper into it, or to get results faster than end-of-week/month/quarter reports.

Today's BI systems, therefore, carry out two main functions: reporting and querying. In addition, two new challenges triggered further evolution of BI solutions:

• The increasing importance of both reporting and querying meant enormous and increasing business pressure on IT to deliver reporting and querying on "fresh" daily and even hourly data, rather than via batch runs on weekends.

• The advent of the Web meant new customer interfaces and Web-based data (e.g., social media data), and required that customer-facing solutions operate 24/7, 52 weeks per year.

B. Python

Since its appearance in 1991, Python has become one of the most popular programming languages today. Within the world of languages used for data analysis and computing, Python has developed a large scientific community active in this area. In the last 10 years, Python has gone from being a scientific computing language to one of the most important languages in data science, machine learning, and general software development in the industry [5].

One of the advantages that Python shows compared to other languages used for data analysis, such as R or SAS, is that Python is not only a suitable language for data science but also for building production systems based on its data science products. Languages like R or SAS create data science products that must be transferred to a language like Java, C#, or C++ to build the production system [5].

C. MongoDB

Until a few years ago, the most used databases were based on the relational model, which uses SQL as the query language. However, NoSQL database solutions are becoming increasingly used as the amount of information generated in almost all our daily
activities grows. These data are generally unstructured, complex, and do not fit the relational model [6].

MongoDB is one of the NoSQL solutions; there are no database schemas or tables. To store data and schema information, MongoDB uses "collections", which are similar to tables, and "documents", which are similar to rows [6].

III. EXAMPLE BIG DATA SCIENCE PROCESS: METHODOLOGY EMPLOYED

As described in the Introduction, we analyze records of trips of the company Bixi Montréal from April 2014 to November 2014, using the Python programming language for preparing, cleaning, and analyzing the data, and generating a structured file (JSON) to store in a NoSQL database, in this case MongoDB.

Illustration 1. Data stored in MongoDB.

In order to carry out the process of analyzing the records of the Bixi Montréal company from April to November 2014, we began with the extraction of the data in raw format from the dataset; in PyCharm, code was developed in the Python language to carry out a data pre-cleaning process. The code has instructions to include the geographical location of the stations in the final file, since initially the dataset only contained the longitude and latitude of each station. The code uses dataframes to load the station catalog and the general file with the dates and times of bicycle use at each station.

We sought to identify the month, day, and hour of each record, so some lines of code were used to extract these values, which are later used for the data analysis. Since the files were initially independent, they were integrated using the respective name of each station, whether at the beginning or the end of each trip.

Once the data extraction and pre-cleaning process was completed, the data was converted to JSON format and stored in a MongoDB database. With the information already stored, the analysis of the information was carried out.

IV. RESULTS

As part of the results obtained, there is a file in JSON format with the information of the stations, ready for analysis. The first step of the analysis was to generate a graph showing the volume of trips in each month; the day of the week with the most trips was extracted next.

Illustration 2. Trips by month.

Illustration 2 shows the number of trips per month, where it can be seen that the months with more than 250,000 trips are May and June, while from July to November the number of trips is less than 50,000.

The next step was to identify the trips per day:
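As a sketch of the kind of pre-cleaning code described in Section III (the column names here are hypothetical, not taken from the actual Bixi files), the station-name integration, the extraction of month/day/hour, and the per-day trip count might look like this:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the station catalog and the trip file;
# the real dataset uses its own column names and is read with pd.read_csv.
stations = pd.DataFrame({
    "station_code": [1, 2],
    "station_name": ["Métro Mont-Royal", "Metcalfe / Square Dorchester"],
})
trips = pd.DataFrame({
    "start_station_code": [1, 1, 2],
    "end_station_code": [2, 1, 1],
    "start_date": pd.to_datetime(
        ["2014-05-01 08:10", "2014-05-02 17:30", "2014-06-05 08:05"]
    ),
})

# Integrate the station name for both the start and the end of each trip.
trips = (
    trips.merge(stations.rename(columns={"station_code": "start_station_code",
                                         "station_name": "start_station_name"}),
                on="start_station_code")
         .merge(stations.rename(columns={"station_code": "end_station_code",
                                         "station_name": "end_station_name"}),
                on="end_station_code")
)

# Extract month, day of week, and hour for the later analysis.
trips["month"] = trips["start_date"].dt.month
trips["weekday"] = trips["start_date"].dt.day_name()
trips["hour"] = trips["start_date"].dt.hour

# Trips per day of the week (the query behind the per-day graph).
trips_per_day = trips["weekday"].value_counts()
print(trips_per_day)
```

The cleaned dataframe can then be exported with `trips.to_json(..., orient="records")` and loaded into MongoDB with `pymongo` (`collection.insert_many(records)`), which is how the JSON file described above ends up in the database.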
Illustration 3. Trips by day.

Illustration 3 shows that Thursday is the day when the most trips, around 175,000, are made. It can also be seen that on Tuesday there are almost 125,000 trips, this being the smallest number of trips in the week.

To have more precise information, the number of trips per hour was queried:

Illustration 4. Trips by hour.

In Illustration 4 it can be seen that, in the morning, 8 a.m. is the hour when more than 80,000 trips are made, and in the afternoon, at 5 p.m., there are over 100,000 trips; after this time and until 11 p.m., a gradual decrease is observed, down to approximately 21,000 trips.

Another data query that was made was to find the five stations where the fewest trips are started:

Illustration 5. Five stations with the fewest start trips.

In Illustration 5, Metcalfe / Square Dorchester is the station where fewer than 5 trips are started.

Another data query that was made was to find the stations where the most trips are started:

Illustration 6. Five stations with the most start trips.

As can be seen in Illustration 6, at the Métro Mont-Royal station (Rivard / du Mont-Royal) more than 10,000 trips are started.

Another query was about the stations where the fewest trips are completed:
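Station-level queries like these reduce to counting and ranking; a minimal pandas sketch (with hypothetical station names and counts) is:

```python
import pandas as pd

# Hypothetical start-station column; the real data holds one row per trip.
starts = pd.Series([
    "Métro Mont-Royal", "Métro Mont-Royal", "Métro Mont-Royal",
    "Berri / de Maisonneuve", "Berri / de Maisonneuve",
    "Metcalfe / Square Dorchester",
])

counts = starts.value_counts()   # trips started per station
top5 = counts.nlargest(5)        # stations where the most trips start
bottom5 = counts.nsmallest(5)    # stations where the fewest trips start
print(top5.index[0], bottom5.index[0])
```

The same pattern applied to the end-station column yields the stations where the most and the fewest trips are completed.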
Illustration 7. Five stations with the fewest end trips.

In Illustration 7 the five stations where the fewest trips are completed can be seen; at the Metcalfe / Square Dorchester station, fewer than 10 trips are completed. This information can make it possible to analyze the geographical area, or to identify why users do not arrive at this station.

Complementing the stations where the most trips are started, the five stations where the most trips are finished were searched:

Illustration 8. Five stations with the most end trips.

In Illustration 8 it can be seen that the station where the most trips begin is also the station where the most trips end, the Métro Mont-Royal station (Rivard / du Mont-Royal), which shows that the station is heavily used.

Within the resulting report, a query was made to find out how many users are members:

Illustration 9. Is member.

What can be seen in Illustration 9 is that there are more than 800,000 trips by members, while non-members account for fewer than 200,000; this figure makes it possible to identify users who may be candidates for membership and could enjoy the benefits that members receive.

To obtain further statistical results, we computed descriptive statistics of the trip duration in seconds (count, mean, standard deviation, minimum, quartiles, and maximum) over the 1,048,575 recorded trips:

    count  1,048,575
    mean         806
    std          635
    min           61
    25%          366
    50%          641
    75%        1,075
    max        7,192

From these results it can be observed that the trips have a minimum duration of 61 seconds and a maximum of 7,192 seconds, with an average of 806 seconds; the standard deviation of 635 indicates how dispersed the durations are around the mean.
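These statistics are what pandas' `describe()` produces; a minimal sketch on a hypothetical duration column (in seconds):

```python
import pandas as pd

# Hypothetical trip durations in seconds; the real column has 1,048,575 rows.
durations = pd.Series([61, 366, 641, 806, 1075, 7192])

# describe() returns count, mean, std, min, the quartiles, and max in one call.
stats = durations.describe()
print(stats["min"], stats["max"])
```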
Illustration 10. Average duration per month.

In Illustration 10, the average duration per month refers to the average duration, in seconds, of each bicycle trip for every month in the dataset. In the graph we can see that April's average duration per trip is around 725 seconds. It then grows, to a maximum of 850 seconds in August. As we saw in previous graphs, August was also the month with the most trips. From August to November it decays, down to an average duration per trip of 600 seconds. This information is useful to know because it affects the availability of bicycles. We could also say that the duration of the trips is related to the maintenance needs of the bicycles. This information could further be used to monitor the routes between stations, since the duration depends on the road taken, and to optimize the service by setting up stations or tracking the bikes' locations based on trip duration.

Time series analysis comprises methods for analyzing time series data to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. We decided to apply a model to forecast the trips for the most popular station.

We looked for the most used station for starting and ending the bike trips. This is the station "Métro Mont-Royal (Rivard / du Mont-Royal)". The station is located near a supermarket and is at the center of all stations in Montréal.

Illustration 11. Most used station per month.

As we can see, this station follows the pattern of trips per month reviewed earlier. Apparently, the bike service stations have a seasonality pattern, much like sales (always low at the beginning of the year and high at the end of the year). There is always an upward trend toward the middle of the year and a downward one at the beginning and the end of the year.

For a better visualization and understanding of this data for the most used station, we decomposed the data into the three distinct components of a time series: trend, seasonality, and noise.

Illustration 12. Bike service station decomposition.

The graphs obtained clearly show that the bike service station follows an obvious seasonality, and the service always spikes at the start of each month.

Taking these three components (seasonality, trend, and noise) into account, we applied a model for time-series forecasting known as ARIMA, which stands for Autoregressive Integrated Moving Average.
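The usual way to choose among seasonal ARIMA (SARIMAX) candidates is a grid search over the model orders, keeping the fit with the lowest AIC; a sketch of this (assuming `y` is a pandas Series of monthly trip counts, a hypothetical name) is:

```python
import itertools

# Candidate (p, d, q) orders and seasonal (P, D, Q, 12) orders
# for a monthly series, as in a standard SARIMAX grid search.
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(P, D, Q, 12) for P, D, Q in itertools.product(p, d, q)]

def best_sarimax(y):
    """Fit every (order, seasonal_order) pair and keep the lowest AIC.

    `y` is assumed to be a pandas Series of monthly trip counts.
    Requires statsmodels; imported here so the grid itself needs nothing.
    """
    from statsmodels.tsa.statespace.sarimax import SARIMAX
    best = (float("inf"), None, None)
    for order in pdq:
        for sorder in seasonal_pdq:
            try:
                res = SARIMAX(y, order=order, seasonal_order=sorder,
                              enforce_stationarity=False,
                              enforce_invertibility=False).fit(disp=False)
                if res.aic < best[0]:
                    best = (res.aic, order, sorder)
            except Exception:
                continue  # some combinations fail to converge
    return best

print(len(pdq), len(seasonal_pdq))  # 8 candidate orders on each side
```

Once the best orders are found, forecasting a future month is a matter of calling `get_forecast(steps=1)` on the fitted results object.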

To apply this model, we started a parameter selection to find the optimal set of parameters, i.e., those that yield the best forecasting performance.

Illustration 13. Best performance to forecast.

The results of our selection suggested that SARIMAX(1, 1, 1)x(0, 1, 1, 12) yields the lowest AIC value, 297.78. Therefore, we considered this to be the optimal option.

To acquire more insights, we plotted graphs related to the behavior of the data.

Illustration 14. Model residuals.

These graphs helped us to see that the model residuals are nearly normally distributed.

Then we compared the model against the real data to help us understand the accuracy of the forecast.

Illustration 15. Comparison of the model vs. real data.

Overall, our forecast aligns well with the real values, showing an upward trend starting from the beginning of the year, and it captured the seasonality toward the end of the year.

Then we tried to use the model to forecast December and see if it followed the seasonality at the end of the year.

Illustration 16. Forecast for December.

Looking at the graph, we conclude that the model is capable of following the trend and seasonality of the data, and that it would be better to analyze a longer dataset covering more "cycles" (years) of the bike service station trips.

V. CONCLUSIONS

In this study, the work was based on bicycle pick-up stations from Canada. We proceeded to follow the ETL methodology to analyze this dataset and find opportunities for improving service quality, among other areas.

First, in the Extract phase, we cleaned up the dataset using the Python programming language. The initial process was to associate each station with its corresponding name and the address obtained by the
geolocation given. Once the dataset was clean, we generated a JSON file to store in a NoSQL database, in this case MongoDB, where we were able to manage the data more easily. After the data was stored in the database, we proceeded to apply the analysis to extract useful information to keep up the good service.

We identified that the most frequent days at the stations are Fridays and Thursdays. This might be because these tend to be the days when people decide to take a ride after a long workday. The most frequent hour when people pick up a bicycle is around 17:00, which might be because that is about the hour a regular job finishes for the day. The most frequent months at the stations are August and July, which might be because the weather conditions make people prefer to use the bike. We also identified that the average duration per bicycle ride tends to be lower right after August. This could be useful in terms of bicycle availability.

In brief, there are several reasons why the bike pickups show this behavior, and all of them seem to follow the weather conditions and job schedules. In any case, we found the frequency of bikes picked up in relation to the given date. This is useful information to keep up the service, and it is vital because it serves as a baseline for planning: for example, when to do maintenance, when more controls are needed, and how to monitor the capacity of the service at any given time. This information could even be used for strategic activities such as expanding the service.

Finally, we built a model that could be used to forecast coming years. With the dataset provided, it was hard to build one that could give reliable information about the future, because the whole dataset covers only eight months. For current time-series analysis models, the data lacked "cycles", i.e., longer periods of time. The main thing we identified with the model was that the service works with the seasons and is seasonal. This is relevant because we could easily predict the upcoming years, since the trend repeats every year and the increase in the number of bikes needed does not change much from year to year.

ACKNOWLEDGMENT

The authors wish to thank Dr. Luis Eduardo Bautista for the support and knowledge transferred in the classes.

REFERENCES

[1] L. E. Bautista, class material, "NoSQL Databases (P2)."

[2] D. De Carvalho, R. Rocha, V. Fernandes, and S. Neves, "Business intelligence: Future perspectives," ACM Int. Conf. Proceeding Ser., vol. 20-22-July, pp. 89–92, 2016, doi: 10.1145/2948992.2949011.

[3] D. Brandon, "Business intelligence," pp. 59–60.

[4] R. F. Oltra Badenes, "Business Intelligence. Definición," http://hdl.handle.net/10251/84471.

[5] W. McKinney, Python for Data Analysis, 2018.

[6] Z. Parker, S. Poe, and S. V. Vrbsky, "Comparing NoSQL MongoDB to an SQL DB," Proc. Annu. Southeast Conf., 2013, doi: 10.1145/2498328.2500047.

GitHub project: https://gist.github.com/CarlosSM17/7159bb589b7fe90df2277d943317cd0f

Adrián E. Caldera, biography not available at the time of publication.

Cinthya E. Pedroza, biography not available at the time of publication.

Carlos Sandoval, biography not available at the time of publication.
