

Procedia Economics and Finance 20 (2015) 243 – 251
Procedia Economics and Finance 20 (2015) 243 – 251

7th International Conference on Globalization and Higher Education in Economics and Business
Administration, GEBA 2013

SQL and data analysis. Some implications for data analysts and
higher education
Marin Fotache*, Catalin Strimbei

Al. I. Cuza University, B-dul Carol 1 nr. 22, Iasi, 700505, Romania

Abstract

Hypes or not, Big Data, NoSQL, Analytics, Business Intelligence and Data Science require processing huge amounts of data in
various and complex ways, using a vast array of statistical methods and tools. The market increasingly needs graduates with both
database and data warehouse skills and the statistical competencies needed to decipher the business patterns and
trends hidden in the mountains of data. This paper presents the main coordinates of data processing today and some implications
for academic curricula. It argues that data analysis and business intelligence professionals could benefit if trained to acquire a
proper level of SQL and data warehouse knowledge.
© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Faculty of Economics and Business Administration, Alexandru Ioan Cuza University of Iasi.
Keywords: Big Data, Statistics, Data Analysis, SQL, OLAP

1. Introduction: Data Deluge and a Consequent Hype – Big Data

In the IT and business worlds, hypes and fads emerge and disappear at a very quick pace. The top of the hype list
changes every few months. The marketing departments of almost all tech companies compete in repackaging and
rebranding old stuff, suggesting the coolness and desirability of their products (Buhl et al., 2013). It seems the strategy
works, at least for some.

* Corresponding author. Tel.: +40232201430; fax: +40232217000


E-mail address: fotache@uaic.ro

doi:10.1016/S2212-5671(15)00071-4

As Stonebraker (2012) put it, Big Data is the "buzzword du jour". As with other buzzwords, there is
no rigorous definition of the term. How much data is big (Jacobs, 2009; Borkar et al., 2012)? What are the
differences between Big Data, databases and data warehouses, all of them dealing with huge volumes of data
(Borkar et al., 2012)?
E-commerce sites, sensors, cameras and mobile apps all produce huge amounts of data with different periodicity.
This mountain of data must be processed and analyzed in order to detect patterns, to explain business phenomena and to
make predictions. The basic assumption of Big Data is that we can learn from data (Cron et al., 2012).
According to Jacobs (2009), Big Data should be defined at any point in time as data whose size forces us to look
beyond the tried and true methods that are prevalent at that time, whereas for Cuzzocrea et al. (2011) Big Data refers
to enormous amounts of unstructured data produced by high-performance applications falling in a wide and
heterogeneous family of application scenarios: from scientific computing applications to social networks, from e-
government applications to medical information systems, and so forth.
Stonebraker (2012) identifies four flavors of Big Data:
• big volumes of data, but small analytics;
• big analytics on big volumes of data;
• big velocity;
• big variety.
Big volumes of data, but small analytics usually means running regular SQL queries (SELECT statements with
aggregate functions such as MIN, MAX, SUM, COUNT and AVG, plus the GROUP BY and HAVING clauses) on
large datasets. All types of SQL (relational) databases, whether commercial (Oracle, IBM DB2, Microsoft SQL Server) or
open-source (PostgreSQL, MySQL), could serve as platforms/tools for this type of processing.
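As a minimal sketch of such a query (the table and column names below are hypothetical), per-category order statistics could be computed on a very large table with nothing more than the standard aggregates:

  -- basic descriptive statistics per product category
  SELECT product_category,
         COUNT(*)    AS no_of_orders,
         SUM(amount) AS total_amount,
         AVG(amount) AS average_amount,
         MIN(amount) AS smallest_amount,
         MAX(amount) AS largest_amount
  FROM sales
  GROUP BY product_category
  HAVING COUNT(*) > 100   -- keep only the categories with enough rows
  ORDER BY total_amount DESC;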
Big analytics on big volumes of data requires combining ETL (Extract-Transform-Load) tools with statistical
packages. Big analytics denotes regression, data mining, machine learning and other types of more complex
processing. Data can be extracted from various data sources using SQL queries and/or ETL tools. Complex
analyses require packages such as SPSS, R, SAS, etc. and sometimes a good deal of coding.
Big velocity is the ability to absorb, on the fly, streams of data from stock exchanges, electronic trading, mobile
social networks, web sites, etc.
Big variety is related to the heterogeneity of data sources and data formats (XLS, relational databases, CSV, flat
files, etc.) which must be imported and transformed in order to be processed/analyzed.
Managing Big Data means dealing with three types of operations: gathering data, storing data and processing
data. Consequently, two key ingredients of Big Data are databases and statistical packages.

2. SQL and Statistical Packages

There is a large supply of statistical packages dedicated to data analysis and other types of complex processing.
Some of the most popular commercial products are SPSS, SAS, Stata, S-PLUS and Minitab. They generally provide a
vast array of statistical functions and options, with interfaces that are very friendly to regular users (non-programmers).
But at least some of them are also notorious for their costs. Small and medium businesses, as well as a good range of
universities, cannot afford to spend what is sometimes thousands of dollars on a not-so-large number of licenses. Of course,
prices and licensing systems differ, but in our experience price is still the most common barrier to their usage.
Nevertheless, many universities have acquired packages such as SPSS and SAS through donations, research grants,
projects with industry, etc.
Recent years have witnessed a general trend in the higher education and research worlds towards open-source statistical
software, mainly R (Tsoukalos, 2013). R is gradually becoming the dominant platform for universities,
companies and researchers that cannot spend much on software, especially amid the current financial troubles. R
has a huge community of enthusiast developers who continuously implement the most recent advancements in
statistics, data mining, machine learning, etc. without any cost to the final user.
There are two main limitations of R in relation to this paper's objectives. One is R-specific and concerns the
user interface. Even if some open-source extensions (such as RStudio) somewhat soften the dialog, R is based on the
command prompt and scripts, and is programming-oriented. In other words, R is still far from the elegance of
commercial products.

The second limitation is inherent to all statistical packages and concerns the data source. Survey and lab data can
be entered directly into the statistical package, but in the real world, companies' data to be analyzed resides on a wide
range of platforms: SQL databases, web logs, sensors, mobile applications, Excel files, etc. As a consequence, in most
cases some extraction-transformation-load (ETL) mechanism is needed for gathering data into R or another package.
Statistical packages usually load the data to be processed using one or more of the following
solutions:
• Direct import from external data files (Excel, CSV - Comma Separated Values, text files, etc.) using their menus
(where available).
• Saving intermediate results from the data sources (databases, Excel, etc.) into common-format files and then importing
these intermediate files into the package; the most popular interchange formats are XML, CSV and JSON (see the
sketch after this list).
• Creating data sources using ODBC (Open Database Connectivity) or JDBC (Java Database Connectivity)
drivers and then connecting the package directly to the ODBC/JDBC data sources. No intermediate files are needed,
data being imported directly into the package variables/tables.
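For the second solution, on a server such as PostgreSQL the intermediate file can be produced directly in SQL; a minimal sketch (the query, table and file path are hypothetical) might be:

  -- export a query result as a CSV file, to be imported later into R, SPSS, etc.
  COPY (SELECT customer_id, order_date, amount
        FROM orders
        WHERE order_date >= DATE '2013-01-01')
  TO '/tmp/orders_2013.csv'
  WITH (FORMAT csv, HEADER true);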
In recent years, some new options have become available for data imports (see also Tsoukalos, 2013), such as:
• using special ETL procedures which can be customized for both the data source and the destination package;
• connecting to special APIs (Application Programming Interfaces) or web/data services which provide data sets
in formats that are easy to import; Google Analytics is such a service, becoming more popular over the years;
• importing data from web server logs using user-defined or standard ETL procedures; this is an area where NoSQL
systems have a strong presence;
• in addition to plain import through ODBC/JDBC connections, it is sometimes possible to run a query
on a database server directly from the statistical package; for example, R users can query SQLite databases
directly and import the results from the tables into the R workspace.

3. SQL Features for Data Analysis

Basically, SQL extracts record sets from huge databases based on relational algebra. SELECT is the core SQL statement,
endowed with powerful clauses for filtering records and columns/attributes, computation, grouping, etc. The immense
popularity of SQL (Michael Stonebraker once called SQL the intergalactic dataspeak language) is due mainly to its
high-level syntax (no programming is necessary for most queries) and also to its implementation in all types of
DataBase Management Systems, from desktop (Access) to open-source (MySQL, PostgreSQL) and commercial
(Oracle, IBM DB2, Microsoft SQL Server) ones. The broad adoption was facilitated by the standardization of SQL
by ISO, together with ANSI and various national agencies. The first SQL standard was published in 1986 (ANSI) and 1989
(ISO), with revisions in 1992, 1999, 2003, 2008 and 2011.
As pointed out in the previous section, the result of an SQL query (a SELECT command) can be saved/stored inside
the database (mainly as a table or a view) but can also be exported from the DBMS to various targets and
formats, e.g. another database, an Excel/CSV file, a text file, HTML, an ODBC/JDBC data source, etc.
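As a minimal sketch (hypothetical table names), a query result can be persisted inside the database either physically or as a virtual table:

  -- store the query result as a new table
  CREATE TABLE sales_summary AS
    SELECT product_category, SUM(amount) AS total_amount
    FROM sales
    GROUP BY product_category;

  -- or keep it as a view, recomputed at each access
  CREATE VIEW v_sales_summary AS
    SELECT product_category, SUM(amount) AS total_amount
    FROM sales
    GROUP BY product_category;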
But SELECT commands do not merely extract and filter data from the database. Their various clauses can perform
various processing tasks on all the result rows or on groups of rows (the GROUP BY and HAVING clauses).
Starting with the first standard (1986/1989), all SQL dialects have implemented the basic statistical functions,
called (statistical) aggregate functions, with self-descriptive names: SUM, COUNT, AVG, MIN, MAX.
Since 1999, one of the most important targets of the SQL standards has been data analysis, mainly through OLAP (On-
Line Analytical Processing) features (sometimes also called window functions). There are some OLAP differences
among dialects. The richest DBMSs for data analysis are Oracle and DB2, whereas open-source systems are less
generous. But some basic OLAP operations, such as ranking, are common. For example, the current version of
PostgreSQL (9.3) implements, among other functions, RANK, DENSE_RANK, PERCENT_RANK, CUME_DIST,
LEAD, LAG, NTILE, NTH_VALUE, etc. Some advanced SQL OLAP features are presented in the next section.
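A minimal sketch of such window (ranking) functions in PostgreSQL, on a hypothetical employees table:

  -- rank employees by salary within each department
  SELECT department_id,
         emp_name,
         salary,
         RANK()      OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank,
         CUME_DIST() OVER (PARTITION BY department_id ORDER BY salary DESC) AS cumulative_share
  FROM employees;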
Less known and used in SQL are the statistical functions for common statistical procedures. Again, commercial
database servers (Oracle, DB2, SQL Server) are endowed with the best statistical features. But open-source
servers also provide useful functions such as STDDEV_POP (standard deviation of a population), STDDEV_SAMP
(standard deviation of a sample), CORR (correlation), COVAR_POP (population covariance), COVAR_SAMP
(sample covariance), REGR_INTERCEPT (y intercept of the least-squares-fit linear equation determined by the
(x, y) pairs) and REGR_SLOPE (slope of the least-squares-fit linear equation) (PostgreSQL, 2013).
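A minimal sketch combining some of these functions (hypothetical table and columns); note that in the two-argument regression functions the dependent variable comes first:

  -- strength and shape of the linear relation between experience and salary
  SELECT CORR(years_in_company, salary)           AS correlation,
         REGR_SLOPE(salary, years_in_company)     AS slope,
         REGR_INTERCEPT(salary, years_in_company) AS intercept,
         STDDEV_SAMP(salary)                      AS salary_sd
  FROM employees;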
As a main representative of commercial database servers, Oracle is endowed with a vast array of statistical
features. Most of them are included in the core Oracle SQL dialect, but other extensions are also available, such as
Oracle Data Mining and Oracle R Enterprise. According to the Oracle database documentation (Oracle, 2013), the main
statistical options available are:
• descriptive statistics;
• hypothesis testing;
• correlation analysis (parametric and nonparametric);
• ranking functions;
• cross tabulations with chi-square statistics;
• linear regression;
• ANOVA;
• distribution fit tests;
• window aggregate functions;
• statistical aggregates;
• LAG/LEAD functions;
• reporting aggregate functions.
The following simple query illustrates a one-way ANOVA test performed in Oracle.

  SELECT emp_gender,
         STATS_ONE_WAY_ANOVA(years_in_company, salary, 'F_RATIO') f_ratio,
         STATS_ONE_WAY_ANOVA(years_in_company, salary, 'SIG')     p_value
  FROM employees emp
       INNER JOIN salaries sal ON emp.emp_id = sal.emp_id
  GROUP BY emp_gender
  ORDER BY emp_gender;
The computed f_ratio and p_value show, for each gender, the significance of the differences in mean salary across the
values of the number of years spent in the company. A p-value less than 0.05
would indicate that, for both men and women, the differences in mean salary across years in the company are
significant.
It is expected that future SQL standards and implementations will have more statistical features. Since the SQL:1999
standard, when OLAP functions were introduced, SQL has entered corporate data analysis, a market dominated by
data warehouse systems (see the next section). It is worth pointing out that for some database servers (e.g. PostgreSQL)
there is an open-source library for scalable in-database analytics called MADlib (Hellerstein et al., 2012).
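As an illustration only - a sketch with hypothetical table, column and output names, following the MADlib linear-regression interface as documented for its 1.x releases - training a model inside PostgreSQL might look like:

  -- in-database linear regression of salary on experience (MADlib)
  SELECT madlib.linregr_train(
      'employees',                    -- source table
      'salary_model',                 -- output table holding the coefficients
      'salary',                       -- dependent variable
      'ARRAY[1, years_in_company]'    -- independent variables (1 = intercept term)
  );
  SELECT coef, r2 FROM salary_model;  -- inspect the fitted model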

4. Data Analysis in the Corporate World: Data Warehouse, Data Mining and All That Jazz

4.1. Data Mining, Knowledge Discovery and OLAP

Data Mining and Knowledge Discovery are broadly considered to form the academic and technological domain
of turning raw data into valuable information useful for the business intelligence decision process. In fact, Knowledge
Discovery (KD) is considered the methodological process, and Data Mining (DM) a range of tools and
techniques for achieving the goals of such informational processing (Peng et al., 2008). The KD process is heavily
dependent on database support and is consequently often referred to as KDD - Knowledge Discovery in Databases
(Fayyad et al., 1996). It covers activities like planning and setting analysis goals, selecting the data collections to be
processed, data pre-processing through cleaning and preparation techniques, data transformation to simplify and
adapt data structures to analytical data models, data mining with techniques such as searching for interpretative patterns to
represent analytical data and, finally, interpretation/evaluation plus visualization of the interpretative data patterns
extracted. The DM domain therefore covers the most significant data analysis techniques for extracting new
predictive information, using techniques such as classification (learning functions), regression, clustering or
summarization (Peng et al., 2008).
Online Analytical Processing (OLAP) tools heavily use data analysis techniques, inspired by statistical operations,
to build their query trees. They are usually related to the data integration and data refinement activities applied mostly
within the third stage of the KDD process. Although there is no clear delimitation between OLAP and DM-specific
tools, the latter use OLAP queries to set up their own processing inputs. That is why SQL-ROLAP is
considered a very important component of the DM-KDD architecture, especially in the context of multidimensional
databases built on relational extensions.

4.2. Data Warehouses and OLAP

Prior to feeding the DM processing activities, the analytical data need to be modeled (multidimensionally), stored
(data warehousing) and properly queried (using OLAP operators). Therefore, data warehouses (DW) are very
often related to multidimensional databases, regarded as subject-oriented, integrated, time-variant and
nonvolatile collections of data intended for decision support systems (Reddy et al., 2010). The original sources of
analytical data are most frequently OLTP systems, mainly relational databases. Structured or semi-structured
external data sources can also feed a DW. The integration of these virtually heterogeneous data sources can be
one of the most critical factors in building a data warehouse. That is why an entire range of extract-transform-
load (ETL) tools is part of the DW architecture. These tools are responsible for extracting, cleaning, integrating,
transforming and loading data into the DW, and they also address the problem of data refreshing within the DW context
(Chaudhuri and Dayal, 1997). ETL tools are therefore the entry gate of data into the multidimensional database of
the data warehouse. By contrast, OLAP tools are positioned as the exit gate of DW systems, ensuring
data access for external reporting tools or data mining tools. The queries performed on the
multidimensional databases of the DW are processed by the OLAP engines.
Four main OLAP data models and query engines have been developed:
• MOLAP (Multidimensional OLAP) tools support analytical operators applied on multidimensional data
structures grafted on concepts like dimensions and metrics (Thomsen, 2003);
• ROLAP (Relational OLAP) tools support the analytical functions using dedicated operators implemented as
relational extensions on multidimensional databases designed with star-schema or snowflake-schema
methodologies (Celko, 2006); a minimal star-schema sketch follows this list;
• HOLAP (Hybrid OLAP) tools solve some performance problems of ROLAP systems by using specialized
storage and indexing techniques (Celko, 2006);
• XML-OLAP tools (e.g. XCube) are dedicated to XML data warehouses, where dimensional facts are stored as XML
documents (Hümmer et al., 2003; Park et al., 2005).
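Such a ROLAP star schema consists of a central fact table referencing a set of dimension tables; a minimal sketch (all names hypothetical):

  -- dimension tables
  CREATE TABLE dim_date (
      date_id     INTEGER PRIMARY KEY,
      calendar_dt DATE,
      month_no    INTEGER,
      year_no     INTEGER );

  CREATE TABLE dim_product (
      product_id  INTEGER PRIMARY KEY,
      name        VARCHAR(100),
      category    VARCHAR(50) );

  -- fact table, one row per sale event
  CREATE TABLE fact_sales (
      date_id     INTEGER REFERENCES dim_date,
      product_id  INTEGER REFERENCES dim_product,
      quantity    INTEGER,
      amount      NUMERIC(12,2) );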

4.3. OLAP and SQL

The SQL analytical capabilities are closely related to OLAP tools or, more specifically, to the ROLAP
extensions. All the other tools try to formalize and implement various flavors of multidimensional query
languages that merely resemble SQL syntax, such as Microsoft's MDX (Multidimensional Expressions),
but these (sub)languages are not standardized like conventional SQL.
From the SQL perspective, the main approach to querying multidimensional databases consists of operating
directly on the specific tables produced by the relational-based multidimensional design (star or snowflake
methodologies). Therefore, the SQL syntax was extended with an entire new range of operators, statistical functions
and expressions, such as (Chaudhuri and Dayal, 1997):
• statistically inspired aggregate functions, such as medians;
• multiple "group by" operators, such as ROLLUP and CUBE (illustrated after this list);
• sophisticated comparisons on aggregates, etc.
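A minimal sketch of the multiple "group by" operators, reusing the hypothetical star schema above (the syntax is supported by servers such as Oracle, DB2 or SQL Server):

  -- one query returns (year, category) subtotals, per-year totals and a grand total
  SELECT d.year_no,
         p.category,
         SUM(f.amount) AS total_amount
  FROM fact_sales f
       INNER JOIN dim_date d    ON f.date_id = d.date_id
       INNER JOIN dim_product p ON f.product_id = p.product_id
  GROUP BY ROLLUP (d.year_no, p.category);
  -- GROUP BY CUBE (d.year_no, p.category) would also add per-category totals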
OLAP analytical processing is a core feature of multidimensional databases, which propose a new data
organizational model meant to surpass the traditional relational model. In order to capitalize on the SQL language in the
multidimensional domain, an extended relational model for multidimensional purposes is required. Several such
models have been proposed, such as Li and Wang (1996), Vassiliadis (1998) and Lujan-Mora et al. (2006).
These models deal mainly with the mathematical foundation of the relational model, but Pedersen et al. (2002) proposed
a fairly formalized analytical system aimed straight at the SQL language, named SQL-M (the M obviously coming
from multidimensional). While the SQL approach tries to add to the regular SELECT phrase some specialized multidimensional
GROUP BY sub-clauses like CUBE and ROLLUP, Wang and Zaniolo (2000) proposed object-relational SQL
extensions defining and implementing statistical processing functions which apply to aggregates. These extensions
are based on a sub-language named AXL (Aggregate Extension Language). Consequently, multidimensional
databases could be designed and implemented in Object-Relational DBMSs so that SQL extensions like DCUBE
could be fully leveraged.

4.4. Some Actual Challenges for SQL ROLAP

Given the overwhelming diversity of heterogeneous web-based data sources (Big Data), it seems that the OLAP
engines of DBMSs are limited in handling the inherent complexity of analytical data processing. This pressure has
stimulated significant efforts to enhance SQL systems so that they can analyze larger and larger
relational tables, process the corresponding statistical models, and keep both within the relational model context.
One such solution is based on user-defined functions developed and added to the core of object-relational systems
(Ordonez and Garcia-Alvarado, 2011).
A significant competitor for relational systems seems to come from new kinds of data systems such as in-
memory column databases (Plattner, 2009). These systems are considered to store OLAP-specific
multidimensional cubes more effectively than row-based systems such as traditional RDBMSs. OLAP tools hosted by
relational systems and exposed through SQL extensions could be vulnerable when competing with these new
kinds of Big Data tools because of the closed nature of many traditional RDBMSs. Therefore, the data integration
issues remain critical, challenging distributed relational systems, especially within heterogeneous data source
environments. In this context, the declarative and very flexible syntax of the SQL language is struggling to exceed the
boundaries of the closed SQL-relational systems, aiming to reach the magnitude of web-related data storage and
processing areas (Konopnicki and Shmueli, 1998). Even NoSQL systems, like column-based databases or map-
reduce systems, manifest a very strong disposition to adopt… SQL. Therefore, even if Big Data conquers
the data analytical spot of OLAP, SQL, as the reference data query language, may well survive through these kinds
of systems.
Another approach to keeping SQL-OLAP relevant in the (O)RDBMS context would be to "open" relational
database systems to accommodate various data sources (structured and semi-structured) integrated by Big Data
tools. We have discussed such an architecture (Strimbei, 2012) and have tried to argue for the possibility of extending
SQL-DBMS systems in order to access external data sources and expose them as web data services.

5. Databases and Yet Another Hype: NoSQL


In recent decades, the database world was usually seen as a peaceful place. The core theories of the field
emerged in the 1970s; the 1980s and 1990s witnessed the consolidation of those theories, while various strands of research
targeted database implementation issues - indexing, optimization, concurrency, transactions, etc.
The announced revolutions failed to materialize, one by one. The Object-Oriented Data Model (Atkinson et al., 1989),
Intelligent Databases (Zaniolo, 1992) and XML Databases (Bonifati and Cuzzocrea, 2007) - to name a few - promised
to shake the complacent database community but did not reach critical mass in the application development
market.
Perhaps the main problem of SQL (relational) databases in Big Data environments has been their inability to scale
properly (Jacobs, 2009). We would rather qualify this statement: many SQL/relational DBMSs have had
excellent solutions for distribution and scaling for many years, but these usually require high-end hardware and large
amounts of money for the licenses. So "the inability of SQL databases to scale properly" actually means "the cost of
implementing fully scalable SQL databases is prohibitive".
In 2010 and 2011, the NoSQL tsunami came almost from nowhere. Actually, "from nowhere" is a misnomer. Some
internet giants (e.g. Yahoo, Google, Amazon, Facebook, LinkedIn) found that, for a wide range of applications, SQL
databases based on transactions (ACID - Atomicity, Consistency, Isolation, Durability) were too fastidious and too
expensive to scale, and also that some of the data gathered by their applications (messages, comments,
reviews, etc.) were not critical enough to require ACID. So new DBMSs have been proposed, based on data models radically
different from the relational model.

NoSQL is a direct consequence of Big Data, trying to answer the question: what are the most appropriate
systems for storing and querying/analyzing huge amounts of data gathered in various formats? For Cattell (2010)
there are six key technical features of NoSQL systems:
• horizontal scalability of simple operations over many servers;
• replication and distribution of data over many nodes;
• simple call-level interfaces or protocols;
• a weaker (than ACID) concurrency model;
• efficient use of distributed indexes and RAM for data storage;
• dynamic insertion of new attributes into the data records.
But the explosion of NoSQL systems is due mainly to the economic consequences of the above features (Cogean
et al., 2013):
• NoSQL datastores can handle immense volumes of data.
• Data can be gathered "on the fly" (from web server logs, sensors, mobile devices), very close to the moment it
is produced.
• Openness: almost all NoSQL datastores are open-source and open to add-ons written in any language
(Python, Perl, Java, etc.).
• Cost: most NoSQL systems are free of charge.
• NoSQL can be installed on modest or average equipment.
• Easier and quicker development and deployment.
• Large and enthusiastic user and developer communities.
• Data openness: compared with proprietary database products, NoSQL datastores handle the import and
export of data in a large variety of formats more easily.
If the traditional infrastructure for Big Data used to be the parallel relational DBMS, the basic NoSQL ingredients for
large-scale data analysis are Hadoop and Map-Reduce. Although marketed as radical departures from their relational
counterparts, Hadoop and Map-Reduce present features that were available decades ago in high-end relational systems
(Pavlo et al., 2009). Moreover, performance comparisons between Hadoop/Map-Reduce and parallel SQL systems
still show, in many regards, better results for the latter (Pavlo et al., 2009).
NoSQL systems also have a number of drawbacks. Compared to the relational model, there is a plethora of
NoSQL data models: key-value stores, document, column and graph databases (Cattell, 2010). They share some
commonalities, but generally they differ significantly. NoSQL technologies are still immature, even though they are growing
fast. They lack the established research and practice communities of the relational technologies. Their query mechanisms
are very heterogeneous and low-level, requiring programming skills. Sometimes map-reduce solutions are traumatic for
beginners and completely opaque when debugging (see the case of MongoDB). NoSQL systems also suffer from too
much noise: as with other buzzwords, many NoSQL adepts display a degree of blindness and hysteria that could
damage NoSQL's reputation in the long term.
As a recent trend, one can notice that more and more NoSQL datastores provide query languages (the MongoDB
Aggregation Framework, Cassandra CQL, Hive, Pig) with troubling similarities to… SQL. We say troubling
because these systems are proudly called NoSQL! So even with the NoSQL advent, SQL is not (yet) dead.
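As an illustration of this trend - a sketch with hypothetical table and column names - a HiveQL query over log data stored in Hadoop reads almost exactly like standard SQL:

  -- HiveQL: aggregate web-server log records stored on HDFS
  SELECT status_code,
         COUNT(*) AS no_of_requests
  FROM web_logs
  WHERE log_date >= '2013-01-01'
  GROUP BY status_code
  ORDER BY no_of_requests DESC;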

6. Conclusions: SQL and Data Analysis Teaching and Practice

With all the buzzwords and fads, there is a tremendous interest in storing, exchanging and processing huge
volumes of heterogeneous data. No matter the denomination - Analytics, Data Mining, Business Analysis, Business
Intelligence, Data Science, etc. - more and more companies hire professionals with a mixture of skills and
competencies that combines databases and statistics.
This is good news for Information Systems and Statistics students and graduates, but it also requires adapting the
curricula of these types of undergraduate and graduate programmes. Namely, Information Systems programmes
need more courses related to data analysis, while Statistics programmes need courses aimed at providing students with
core competencies for dealing with databases and data warehouses, especially for extracting and processing data.
Real-world data and case studies are essential for a proper understanding of business phenomena.

This paper has argued that, for business graduates aspiring to careers in Data Analysis and Data Science, SQL could
prove an invaluable core competency for extracting and pre-processing vast amounts of data. Moreover, despite the
intricate relation of SQL with relational databases (we actually prefer the term SQL databases), more and
more NoSQL datastores implement SQL-like query languages. This makes many Data Analysis/Data Science tasks
accessible to non-programmers.
At a general level, the basic assumption of Big Data, statistics and Data Analysis is that we can learn from the past.
This is also true of other business domains (such as Accounting). But we must also be careful. Even if the only exact
data we may know is about the past, we should never forget that, in all domains of human activity, the future has
never been just a linear (or logarithmic) consequence of the past and the present. In other words, hanging on to the past is
not the best way to cope with the future.

References

Buhl, H.U., Röglinger, M., Moser, F. (2013), Big Data: A Fashionable Topic with(out) Sustainable Relevance for Research and Practice?,
Business & Information Systems Engineering, 2, 2013, pp.65-69
Stonebraker, M. (2012), What Does 'Big Data' Mean?, Communications of the ACM (BLOG@CACM), September 21, 2012,
http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-data-mean/fulltext
Jacobs, A. (2009), The Pathologies of Big Data, Communications of the ACM, 52(8), pp. 36-44
Borkar, V.R., Carey, M.J., Li, C. (2012), Big Data Platforms: What’s Next?, XRDS, 19 (1), pp.44-49
Borkar, V.R., Carey, M.J., Li, C. (2012), Inside "Big Data Management": Ogres, Onions, or Parfaits?, Proc. of the 15th International Conference
on Extending Database Technology (EDBT '12), ACM, New York, USA, pp.3-14
Cron, A., Nguyen, H.L., Parameswaran, A. (2012), Big Data, XRDS, 19(1), pp.7-8
Cuzzocrea, A., Song, I.Y., Davis, K.C. (2011), Analytics over Large-Scale Multidimensional Data: The Big Data Revolution!, Proc. of the ACM
14th international workshop on Data Warehousing and OLAP (DOLAP '11), ACM, New York, USA, pp.101-104
Tsoukalos, M. (2013), Using the R Advanced Statistical Package, Linux Journal, Iss.232 (August 2013), pp.60-73
PostgreSQL (2013), PostgreSQL 9.3 Documentation, Chapter 9: Functions and Operators, 9.20: Aggregate Functions,
http://www.postgresql.org/docs/current/static/functions-aggregate.html
Oracle (2013), Oracle Statistical Functions, available (on September 1st, 2013) at
http://www.oracle.com/technetwork/middleware/bi-foundation/index-092760.html
Hellerstein, J.M., Ré, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A. (2012),
The MADlib Analytics Library: or MAD Skills, the SQL, Proc. of the VLDB Endowment, 5(12), pp.1700-1711
Peng, Y., Kou, G., Shi, Y., Chen, Z. (2008), A Descriptive Framework for the Field of Data Mining and Knowledge Discovery,
International Journal of Information Technology & Decision Making, 7(4), 2008
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P. (1996), Knowledge Discovery and Data Mining: Towards a Unifying Framework,
Proc. of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996
Reddy, G.S., Srinivasu, R., Rao, M. (2010), Data Warehousing, Data Mining, OLAP and OLTP Technologies Are Essential
Elements to Support Decision-Making Process in Industries, International Journal on Computer Science and Engineering, 2(9), 2010
Chaudhuri, S., Dayal, U. (1997), An Overview of Data Warehousing and OLAP Technology, ACM SIGMOD Record, 26(1),
March 1997, pp.65-74
Thomsen, E. (2003), OLAP Solutions: Building Multidimensional Information Systems, Second Edition, Wiley, New York, USA
Celko, J. (2006), Analytics and OLAP in SQL, Morgan Kaufmann
Hümmer, W., Bauer, A., Harde, G. (2003), XCube - XML for Data Warehouses, Proc. of the 6th ACM International Workshop
on Data Warehousing and OLAP (DOLAP '03), ACM, New York, USA, pp.33-40
Park, B.K., Han, H., Song, I.-Y. (2005), XML-OLAP: A Multidimensional Analysis Framework for XML Warehouses, DaWaK 2005, LNCS
3589, Springer-Verlag Berlin, pp. 32–42
Li, C., Wang, S. (1996), A Data Model for Supporting On-Line Analytical Processing, Proc. of the 5th International Conference on Information
and Knowledge Management, pp.81-88
Vassiliadis, P. (1998), Modeling Multidimensional Databases, Cubes and Cube Operations, in M. Rafanelli and M. Jarke (Eds.), Proc. of the 10th
International Conference on Scientific and Statistical Database Management, pp.53-62
Lujan-Mora, S., Trujillo, J., Song, I.-Y. (2006), A UML Profile for Multidimensional Modeling in Data Warehouses, Data Knowl. Eng., 59(3),
pp.725-769
Pedersen, D., Riis, K., Pedersen, T.B. (2002), A Powerful and SQL-Compatible Data Model and Query Language For OLAP, Proc. of the 13th
Australasian Database Conference (ADC2002), Melbourne, Australia
Wang, H., Zaniolo, C. (2000), Using SQL to Build New Aggregates and Extenders for Object Relational Systems, Proc. of the 26th VLDB
Conference, Cairo, Egypt
Ordonez, C., Garcia-Alvarado, C. (2011), A Data Mining System Based on SQL Queries and UDFs for Relational Databases, CIKM'11,
Glasgow, Scotland, UK

Plattner, H. (2009) A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database, SIGMOD’09, Providence,
Rhode Island, USA
Konopnicki, D., Shmueli, O. (1998), Information Gathering in the World Wide Web: The W3QL Query Language and the W3QS System, ACM
Transactions on Database Systems, 23(4), pp.369–410.
Strîmbei, C. (2012), OLAP Services on Cloud Architecture, IBIMA Journal of Software & Systems Development, 1
Atkinson, M., Bancilhon, F., DeWitt, D., Dittrich, K., Maier, D., Zdonik, S. (1989), The Object-Oriented Database System Manifesto, Proc. of
the First International Conference on Deductive and Object-Oriented Databases, Kyoto, Japan, pp.223-240
Zaniolo, C. (1992), Intelligent Databases: Old Challenges and New Opportunities, Journal of Intelligent Information Systems, 1, pp.271-292
Bonifati, A., Cuzzocrea, A. (2007), Synopsis Data Structures for XML Databases: Models, Issues, and Research Perspectives, Proc. of the 18th
International Conference on Database and Expert Systems Applications (DEXA '07). Washington, DC, USA, pp.20-24
Cattell, R. (2010), Scalable SQL and NoSQL Data Stores, ACM SIGMOD Record, 39(4), pp. 12-27
Cogean, D.I., Fotache, M., Greavu-Serban, V. (2013), NoSQL in Higher Education. A Case Study, Proc. of the IE 2013 International Conference,
ASE Bucuresti
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M. (2009), A Comparison of Approaches to Large-Scale
Data Analysis, Proc. of ACM SIGMOD Conference (SIGMOD’09), Providence, Rhode Island, USA, pp.165-178
