Abstract—Currently, the information that companies generate in their business processes is being used to create
value that supports those processes, which is known as Business Intelligence (BI). However, obtaining information that
generates value and gives the company a competitive advantage is not easy because of the nature of the data, which is
characterized by volume, variety, velocity, veracity, and value. BI uses a set of techniques and tools to transform large
amounts of raw data into useful and meaningful information, allowing its interpretation and identifying its usefulness
for the company, a process known as Big Data Science. Organizations that collect large amounts of
unstructured data are increasingly turning to non-relational databases, now called NoSQL databases. In this project, we
describe the Big Data Science process used to analyze the information of the company Bixi Montreal. The process was based
on the extraction, transformation, and loading (ETL) of the data and on its analysis to obtain useful
information, using the Python programming language for the ETL process and the analysis, and
MongoDB for loading the data in JSON format. In addition, the results of the Big Data Science process are shown and
discussed.
Index Terms—Business Intelligence, Data Analysis, Big Data, Raw Data, Extract-Transform-Load (ETL),
Python, MongoDB.
I. INTRODUCTION
II. BASIC CONCEPTS
This section reviews the literature on the basic concepts of BI and the tools used in the example Big Data Science process for the company Bixi Montreal: Python and MongoDB.

A. Business Intelligence (BI)
Business intelligence (BI) is the set of techniques and tools for the transformation of raw data into meaningful and useful information for business analysis purposes. BI technologies are capable of handling large amounts of unstructured data to help identify, develop, and otherwise create new strategic business opportunities.

The goal of BI is to allow for the easy interpretation of these large volumes of data. Identifying new opportunities and implementing an effective strategy based on insights can provide businesses with a competitive market advantage and long-term stability.

Business Intelligence (BI) is an umbrella term that encompasses the processes, tools, and technologies to turn data into information, and information into knowledge and plans, to effectively conduct business activities. BI encompasses data warehousing technologies and the processes in the “back end”, as well as queries, reports, analysis, and the tools to display information (these are the BI tools) and processes on the “front end” [4].

The purpose of business intelligence (BI) solutions is pretty much the same as the purpose of “military intelligence”: to give commanders at every stage of a campaign a good view of the battlefield and the pros and cons of their options.

In the business context, this means telling the business, on an ongoing basis, the state of the business (production, sales, profits, losses) and the factors that affect success (success of a sales campaign, customer satisfaction, manufacturing efficiency, planning and budgeting).

In the early 1990s, ETL (extract, transform, load) software attached to a data warehouse was introduced, with a specialized version called EAI (enterprise application integration) to handle communication between ERP packages and data transmission from the ERP data stores to the data warehouse.

The first BI tools (from companies such as Cognos and Business Objects, which today form a large part of the market under their new owners) initially aimed to query business data in order to dig deeper into it, or get results faster, than end-of-week, end-of-month, or end-of-quarter reports allowed.

Today’s BI systems, therefore, carry out two main functions: reporting and querying. In addition, two new challenges triggered further evolution of BI solutions:

• The increasing importance of both reporting and querying meant enormous and increasing business pressure on IT to deliver reporting and querying on "fresh" daily and even hourly data, rather than via batch runs on weekends.

• The advent of the Web meant new customer interfaces and Web-based data (e.g., social media data) and required that customer-facing solutions operate 24/7, 52 weeks per year.

B. Python
Since its appearance in 1991, Python has become one of the most popular programming languages. Within the world of languages used for data analysis and computing, Python has developed a large and active scientific community. In the last 10 years, Python has gone from being a scientific computing language to one of the most important languages in data science, machine learning, and general software development in the industry [5].

One advantage of Python over other languages used for data analysis, such as R or SAS, is that Python is suitable not only for data science but also for building production systems based on its data science products. Languages like R or SAS create data science products that must be transferred to a language like Java, C#, or C++ to build the production system [5].

C. MongoDB
Until a few years ago, the most widely used databases were based on the relational model, which uses SQL as the query language. However, NoSQL database solutions are increasingly used as the amount of information generated in almost all our daily activities grows. These data are generally unstructured and complex, and do not fit the relational model [6].

MongoDB is one of the NoSQL solutions; it has no database schemas or tables. MongoDB uses “collections”, which are similar to tables, and “documents”, which are similar to rows, to store the data and schema information [6].

AUTHOR ET AL.: BUSINESS INTELLIGENCE: EXTRACT-TRANSFORM-LOAD AND DATA ANALYSIS WITH PYTHON AND MONGODB. 3

III. EXAMPLE BIG DATA SCIENCE PROCESS: METHODOLOGY EMPLOYED
In this paper, we describe a Big Data Science process for the analysis of trip records from the company Bixi Montreal from April 2014 to November 2014. The Python programming language was used to prepare and clean the data as well as to analyze it, generating a structured file (JSON) to store in a NoSQL database; in this case, MongoDB was used.

Once the data extraction and pre-cleaning process was completed, the data was loaded in JSON format and stored in a MongoDB database. With the information stored, the analysis was carried out.

IV. RESULTS
As part of the results obtained, there is a file in JSON format with the information of the stations, ready to be analyzed. The first step was to extract the day of the week with the most trips, generating a graph showing the volume of trips in each month.
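
The extract-transform-load step described in the methodology can be sketched as follows, using only the Python standard library. This is a minimal illustration, not the authors' actual pipeline: the column names (`start_date`, `start_station_code`, `end_station_code`, `duration_sec`, `is_member`), the date format, the sample rows, and the cleaning rule are all assumptions. The load into MongoDB is left as a comment because it needs a running server and the third-party `pymongo` package.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical sample of raw trip records; the real column names may differ.
RAW_CSV = """start_date,start_station_code,end_station_code,duration_sec,is_member
2014-04-15 08:05,6209,6436,1061,1
2014-04-15 17:30,6214,6248,,1
2014-08-07 17:45,6164,6216,615,0
"""


def extract(text):
    """Extract: read the CSV rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a duration and convert field types."""
    documents = []
    for row in rows:
        if not row["duration_sec"]:
            continue  # assumed cleaning rule: skip incomplete records
        start = datetime.strptime(row["start_date"], "%Y-%m-%d %H:%M")
        documents.append({
            "start_date": start.isoformat(),
            "start_station_code": int(row["start_station_code"]),
            "end_station_code": int(row["end_station_code"]),
            "duration_sec": int(row["duration_sec"]),
            "is_member": row["is_member"] == "1",
        })
    return documents


def to_json(documents):
    """Serialize the cleaned records as a JSON array of documents."""
    return json.dumps(documents, indent=2)


documents = transform(extract(RAW_CSV))
json_text = to_json(documents)

# Load (not executed here): with pymongo installed and a server running,
# the documents could be stored with something like
#   MongoClient()["bixi"]["trips"].insert_many(documents)
# where the database and collection names are hypothetical.
```

Storing the start time as an ISO 8601 string keeps each document directly serializable to JSON; a real pipeline might prefer native date values in the database.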
Illustration 3 shows that about 175,000 trips are made on Thursdays. It also shows that Tuesdays have almost 125,000 trips, the lowest number for any day of the week.

To obtain more precise information, the number of trips per hour was queried.

In Illustration 5, the Metcalfe / Square Dorchester station is the station where fewer than 5 trips are started. Another query identified the stations where the most trips are started.
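
The weekday, hourly, and per-station queries discussed above amount to counting trips by a key. The sketch below shows one way to do this with `collections.Counter`; the trip records and field names are hypothetical, and the paper does not state how the authors implemented these queries.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical cleaned trip documents (field names are assumptions).
trips = [
    {"start_date": "2014-08-07T17:45:00", "start_station_code": 6164},
    {"start_date": "2014-08-07T08:10:00", "start_station_code": 6164},
    {"start_date": "2014-08-05T17:05:00", "start_station_code": 6209},
    {"start_date": "2014-08-08T17:20:00", "start_station_code": 6164},
]

starts = [datetime.fromisoformat(t["start_date"]) for t in trips]

# Trips per day of the week (datetime.weekday() returns 0 for Monday).
trips_per_weekday = Counter(WEEKDAYS[s.weekday()] for s in starts)

# Trips per hour of the day (0 to 23).
trips_per_hour = Counter(s.hour for s in starts)

# Trips started at each station.
trips_per_station = Counter(t["start_station_code"] for t in trips)

# Stations with the most and the fewest started trips (up to five each).
busiest_stations = trips_per_station.most_common(5)
quietest_stations = sorted(trips_per_station.items(), key=lambda kv: kv[1])[:5]
```

The same pattern applies to end stations by counting on an end-station field instead.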
Illustration 7. Five stations with the fewest ended trips.

Illustration 7 shows the 5 stations where the fewest trips are completed; at the Metcalfe / Square Dorchester station, fewer than 10 trips are completed. This information makes it possible to analyze the geographical area or to identify why users do not arrive at this station.

Complementing the stations where more trips are started, the 5 stations where more trips are finished were searched:

Illustration 9. Is member.

Illustration 9 shows that there are more than 800,000 members and fewer than 200,000 users who are not members. This figure helps identify users who may be candidates for membership and could enjoy the benefits to which members are entitled.

For further statistical results, the total duration of trips in minutes, the average trip duration, and the standard deviation of the durations were computed over 1,048,575 trips. The values below are in seconds.
count 1048575
mean 806
std 635
min 61
25% 366
50% 641
75% 1075
max 7192
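
A summary of this shape (count, mean, standard deviation, minimum, quartiles, maximum) can be computed with Python's `statistics` module, as sketched below. The durations are hypothetical values chosen for illustration, not the paper's data, and the paper does not state which tool produced the table above.

```python
import statistics

# Hypothetical trip durations in seconds (not the paper's dataset).
durations = [61, 366, 641, 806, 1075, 7192]

summary = {
    "count": len(durations),
    "mean": statistics.mean(durations),
    "std": statistics.stdev(durations),  # sample standard deviation
    "min": min(durations),
    # Cut points for the 25%, 50%, and 75% quantiles.
    "quartiles": statistics.quantiles(durations, n=4, method="inclusive"),
    "max": max(durations),
}

# Total duration of all trips, converted from seconds to minutes.
total_minutes = sum(durations) / 60
```

With `method="inclusive"`, the quartiles are interpolated between the minimum and maximum of the data, so the middle cut point equals the median.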
Illustration 10. Average duration per month

In Illustration 10, the average duration per month refers to the average duration, in seconds, of bicycle use per trip for each month in the dataset. The graph shows that April’s average duration per trip is around 725 seconds. It then grows to a maximum of 850 seconds in August. As seen in previous graphs, August was also the month with the most trips. From August to November, it decays to an average duration per trip of 600 seconds. This information is useful because it affects the availability of bicycles. The duration of the trips may also be related to the maintenance needs of the bicycles. It could be used to monitor the routes between stations, because the duration depends on the road taken, and to optimize the service by setting up stations or tracking the bikes’ locations by the duration of the trip.

As we can see, this station follows the pattern of the trips per month reviewed earlier. Apparently, the bike service stations have a seasonality pattern, as sales do (always low at the beginning of the year and high at the end of the year). There is always an upward trend toward the middle of the year and a downward trend toward the beginning and the end of the year.

For better visualization and understanding of the data for the most used station, we decomposed the data into the three distinct components of a time series: trend, seasonality, and noise.
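
The decomposition into trend, seasonality, and noise mentioned above can be illustrated with a classical additive decomposition. The paper does not state which method or seasonal period was used, so the sketch below is an assumption throughout: it uses a centered moving average for the trend, a hypothetical weekly period of 7 on synthetic daily trip counts, and plain Python rather than any particular library.

```python
def moving_average(series, window):
    """Centered moving average for an odd window; edges are None."""
    half = window // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        trend[i] = sum(series[i - half:i + half + 1]) / window
    return trend


def decompose_additive(series, period):
    """Split a series into trend + seasonal + residual (additive model)."""
    trend = moving_average(series, period)
    detrended = [x - t if t is not None else None
                 for x, t in zip(series, trend)]
    # Average the detrended values at each position within the period.
    seasonal_means = []
    for pos in range(period):
        values = [d for i, d in enumerate(detrended)
                  if d is not None and i % period == pos]
        seasonal_means.append(sum(values) / len(values))
    # Center the seasonal component so it averages to zero over a period.
    offset = sum(seasonal_means) / period
    seasonal = [seasonal_means[i % period] - offset
                for i in range(len(series))]
    # Whatever trend and seasonality do not explain is the noise.
    residual = [x - t - s if t is not None else None
                for x, t, s in zip(series, trend, seasonal)]
    return trend, seasonal, residual


# Synthetic daily trip counts: a linear trend plus a weekly pattern.
weekly_pattern = [0, -20, 5, 30, 25, -15, -25]
series = [100 + 2 * i + weekly_pattern[i % 7] for i in range(28)]

trend, seasonal, residual = decompose_additive(series, period=7)
```

On this synthetic series the residual is zero wherever the trend is defined, because the data contain no noise; real trip counts would leave a nonzero residual.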
data in an easier way. After the data was stored in the database, we proceeded to apply the analysis to extract useful information to keep up the good service.

We identified that the most frequent days in the stations are Fridays and Thursdays. This might be because these are the days when people decide to take a ride after a long day of work. The most frequent hour at which people pick up a bicycle is around 17:00, which might be because it is about the time a regular workday ends. The most frequent months in the stations are August and July, which might be because the weather conditions make people prefer to use the bike. We also identified that the average duration per bicycle ride tends to be lower right after August. This could be useful in terms of bicycle availability.

[2] D. De Carvalho, R. Rocha, V. Fernandes, and S. Neves, “Business intelligence: Future perspectives (April, 2016),” ACM Int. Conf. Proceeding Ser., vol. 20-22-July, pp. 89–92, 2016, doi: 10.1145/2948992.2949011.
[3] D. Brandon, “Business intelligence *,” pp. 59–60.
[4] R. F. Oltra Badenes, “Business Intelligence. Definición.” http://hdl.handle.net/10251/84471.
[5] W. McKinney, Python for Data Analysis, vol. 71, no. 10. 2018.
[6] Z. Parker, S. Poe, and S. V. Vrbsky, “Comparing NoSQL MongoDB to an SQL DB,” Proc. Annu. Southeast Conf., 2013, doi: 10.1145/2498328.2500047.