
Visual Informatics 2 (2018) 235–253

Contents lists available at ScienceDirect

Visual Informatics
journal homepage: www.elsevier.com/locate/visinf

A comprehensive review of tools for exploratory analysis of tabular industrial datasets

Aindrila Ghosh a,∗, Mona Nashaat a, James Miller a, Shaikh Quader b, Chad Marston c
a Department of Electrical and Computer Engineering, University of Alberta, 116 Street NW, Edmonton, T6G 1H9, Canada
b Machine Learning Research, IBM Canada, Toronto, Canada
c Information Technology and Analytics, IBM U.S., Boston, United States

Article info

Article history:
Received 27 October 2018
Received in revised form 15 December 2018
Accepted 22 December 2018
Available online 26 December 2018

Keywords:
Exploratory data analysis
Industrial tabular data
Interactive visualization
Systematic literature review
Research opportunities

Abstract

Exploratory data analysis plays a major role in obtaining insights from data. Over the last two decades, researchers have proposed several visual data exploration tools that can assist with each step of the analysis process. Nevertheless, in recent years, data analysis requirements have changed significantly. With constantly increasing size and types of data to be analyzed, scalability and analysis duration are now among the primary concerns of researchers. Moreover, in order to minimize the analysis cost, businesses are in need of data analysis tools that can be used with limited analytical knowledge. To address these challenges, traditional data exploration tools have evolved within the last few years. In this paper, with an in-depth analysis of an industrial tabular dataset, we identify a set of additional exploratory requirements for large datasets. Later, we present a comprehensive survey of the recent advancements in the emerging field of exploratory data analysis. We investigate 50 academic and non-academic visual data exploration tools with respect to their utility in the six fundamental steps of the exploratory data analysis process. We also examine the extent to which these modern data exploration tools fulfill the additional requirements for analyzing large datasets. Finally, we identify and present a set of research opportunities in the field of visual exploratory data analysis.
© 2018 Zhejiang University and Zhejiang University Press. Published by Elsevier B.V. This is an open access
article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

∗ Corresponding author.
E-mail addresses: aindrila@ualberta.ca (A. Ghosh), nashaata@ualberta.ca (M. Nashaat), jimm@ualberta.ca (J. Miller), shaikhq@ca.ibm.com (S. Quader), cmarston@us.ibm.com (C. Marston).
Peer review under responsibility of Zhejiang University and Zhejiang University Press.

1. Introduction

In today’s digital world, insights obtained from Exploratory Data Analysis (EDA) are used in strategic business decision making. EDA (Tufféry, 2011) is a fundamental procedure that makes use of statistical techniques and graphical representations in order to obtain insights from data (Cui et al., 2018). EDA not only assists with the identification of hidden patterns and correlations among attributes in data, but also helps with the formulation and validation of hypotheses from the data. Over the last few decades, interactive visualization strategies have become an integral part of data exploration and analysis techniques (Godfrey et al., 2016). With a picture being worth a thousand words, academics have proposed several tools and techniques (Yalçin et al., 2018; El-Hindi et al., 2016; Kraska, 2018; Zhao et al., 2013; Yu and Silva, 2017; Gratzl et al., 2014) to visualize complex relationships among data attributes using simple diagrams and charts. Whilst some of these visual data analysis tools (Pabinger et al., 2014; Rautenhaus et al., 2017; Endert et al., 2017; Slater et al., 2017; Liu et al., 2017b) assist with domain specific data analysis (for example, analysis of genome-sequence data (Pabinger et al., 2014), meteorological data (Rautenhaus et al., 2017), results of predictive analysis (Liu et al., 2017b) etc.), some other tools (Godfrey et al., 2016; Idreos et al., 2015; Khan and Khan, 2011) focus on general purpose exploratory browsing of tabular data. In either case, since the beginning of visual interactive data analysis (Godfrey et al., 2016) almost all visual EDA tools perform a few common analytics tasks. In their work, Heer and Shneiderman (2012) as well as Amar et al. (2005) have identified these basic data exploration tasks as sort, filter, aggregate, correlate, group, and derive attributes.

Nevertheless, in recent years, the requirements for exploratory data analysis have changed significantly. With ever growing size and types of data to be analyzed, scalability and analysis duration (Godfrey et al., 2016; El-Hindi et al., 2016) of the EDA tools are now among the primary concerns of researchers. Moreover, with data being used to train predictive models (Liu et al., 2017b) for making strategic business decisions, analysts are in need of data exploration tools that can help to accurately analyze complex multivariate relationships (Tufféry, 2011; Chan, 2006) in datasets, with limited available analytical expertise. To address the above mentioned challenges, EDA tools are constantly evolving (Idreos et al., 2015; High, 2012). In the last few years, many advancements

https://doi.org/10.1016/j.visinf.2018.12.004
2468-502X/© 2018 Zhejiang University and Zhejiang University Press. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
236 A. Ghosh, M. Nashaat, J. Miller et al. / Visual Informatics 2 (2018) 235–253

have taken place with the design of data visualization tools (El-Hindi et al., 2016; Liu et al., 2013; Battle et al., 2016; Wang et al., 2017) in order to address different challenges (Srinivasan et al., 2018; Zgraggen et al., 2014; Demiralp et al., 2017; Lee et al., 2013) of analyzing large datasets (Cui et al., 2018; Javed and Elmqvist, 2013; Satyanarayan and Heer, 2014a; Lee et al., 2013; Wongsuphasawat et al., 2017). However, the trade-off between the depth and the breadth of analysis supported by the modern exploratory data visualization tools still remains a challenge (Godfrey et al., 2016). On the one hand, despite covering the breadth of basic exploration tasks (Amar et al., 2005), general purpose data exploration tools (Gratzl et al., 2014; Mei et al., 2018) often do not fulfill the in-depth analysis requirements of their users. On the other hand, tools (Liu et al., 2013) that focus on highly scalable and in-depth multivariate analysis often lack interpretability and require significant knowledge of the problem domain.

To identify the current state of research in the emerging field of EDA, at first we examine a real-world dataset with 3.4 million records (cf. Section 3.1) obtained from our industrial partner IBM. From this investigation, we identify a set of additional exploratory requirements for addressing different challenges of analyzing such enormous business data. Later, we investigate 50 visual interactive EDA tools (cf. Section 3.2) for their ability to assist with the traditional EDA process steps, along with their fulfillment of the identified additional exploratory requirements for large scale EDA. Among the 50 analyzed tools, 43 are proposed by academic researchers and the remaining 7 are commercial tools used in industry. Since a complete survey of each and every existing EDA tool would be too large to cover in a single paper, we carefully define precise selection criteria (cf. Section 2.1) for the selected tools. For example, whilst for academic tools we only look at the ones that were presented within the last five years and help with general purpose exploration of tabular data, for commercial tools we follow the guidelines of Gartner Inc. and select the business intelligence platforms that received the Gartner Customer Choice Awards (Gartner, 0000) in the year 2017. During our evaluation of the selected tools, we identify some gaps and research opportunities in the emerging field of visual EDA.

Although there has been much research (Pabinger et al., 2014; Rautenhaus et al., 2017; Liu et al., 2017b; Bikakis and Sellis, 2016; Dunn et al., 2016; Wang et al., 2015; Behrisch et al., 2018) that aims at surveying the state-of-the-art in visual data analytics, in most cases the research only focuses on domain specific data exploration tools (Pabinger et al., 2014; Rautenhaus et al., 2017; Endert et al., 2017). Moreover, to the best of our knowledge, in contrast to its closest competitors (Godfrey et al., 2016; Idreos et al., 2015; Khan and Khan, 2011; Diamond and Mattia, 2018; Biju and Mathew, 2017; Behrisch et al., 2018), this novel work also considers tools that were proposed in the last year. In addition, our study presents a list of 50 EDA tools that were analyzed for the first time from the perspective of the steps followed in the EDA process (Tufféry, 2011; Demiralp et al., 2017). The primary contributions of this research are as follows:

• This novel work presents the current state of research on visual EDA tools for exploring tabular data by investigating 50 tools for their utility in the EDA process steps (cf. Section 3.2.1).
• The research also evaluates the selected tools for their abilities to fulfill different additional exploratory requirements of large industrial datasets (cf. Section 3.2.2).
• The work identifies open research opportunities for the domain of visual EDA tools (cf. Section 4).

The rest of the paper is organized as follows: in Section 2, we present the scope and methodology of this research. In Section 3, we describe our survey results for the selected EDA tools. Section 4 identifies research opportunities and gaps in the field of visual EDA, whilst Section 5 presents the related work. In Section 6, we discuss the limitations of this work, while in Section 7 we conclude the paper.

2. Research scope and methodology

To precisely define the scope of our research, in this section, at first we present our primary research questions for this work. Next, we outline a specific set of inclusion and exclusion criteria of the visual data analytics tools included in our study. Finally, we discuss the detailed steps that were followed to analyze the industrial dataset and to perform the state-of-the-art survey of EDA tools.

2.1. Research scope

In this section, we outline the boundaries of our study in terms of investigation time-frame, purpose, and popularity of the analyzed EDA tools. We therefore list our research questions and discuss them as follows.

RQ1: What are the additional exploratory requirements for EDA tools to investigate large industrial datasets?
RQ2: What research activities have taken place in the last five years in the domain of visual EDA tools for general purpose exploration of tabular data?
RQ3: What are the most popular commercial EDA tools in industry?
RQ4: To what extent do modern EDA tools assist with the steps of the EDA process and fulfill the additional exploratory requirements of analyzing large datasets (i.e., answers of RQ1)?
RQ5: What are the gaps and future directions for the current state of research on visual EDA tools?

Based on the known challenges (Wang et al., 2015; Najafabadi et al., 2015; Kaisler et al., 2013) of analyzing large datasets, researchers (Wang et al., 2015; Najafabadi et al., 2015) have proposed a set of additional requirements for analyzing such data. However, work that addresses all possible challenges of large datasets is sparse. Hence, with RQ1, we investigate a real-world tabular dataset and identify different challenging aspects of this dataset. Later, based on existing literature (High, 2012; Liu et al., 2013; Wang et al., 2015; Biju and Mathew, 2017; Johnstone and Titterington, 2009) that relates the identified aspects to specific data analysis requirements, we identify four additional exploratory requirements for analyzing large industrial datasets.

Fig. 1 illustrates our decisions for RQ2 and RQ3 in detail. As shown in the figure, for RQ2, limiting the analysis time-frame for academic EDA tools to five years was one of our very first decisions in this research. The reason behind this decision was: technology trend analysis for five years is a common industrial practice.1 Moreover, we fixed our focus on investigation of tabular data stored in relational databases, because as discussed by researchers (Godfrey et al., 2016), most business data is stored in relational databases despite being initially recorded as plain text, XML, or graphs. We also narrowed our focus to tools used for general purpose exploration of tabular data. The reason being, due to the existence of a large number of EDA tools in every research field (such as time-series, geospatial, genomic data etc.), it would not be feasible to cover all these fields in one paper. Additionally, with a focus on investigating tools that can be used by both novice and expert users, we chose to exclude data analytics libraries,

1 https://www.gartner.com/en/newsroom/press-releases/2017-02-08-gartner-says-within-five-years-organizations-will-be-valued-on-their-information-portfolios.

Fig. 1. Flow-diagram of selection criteria of the state-of-the-art EDA tools for this study.
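
The selection flow summarized in Fig. 1 can be sketched as a simple filter over candidate tools. The following Python fragment is only an illustration and not the authors' actual procedure: the `CandidateTool` fields, the example entries, and the reading of the five-year window as publication years 2013 to 2018 are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class CandidateTool:
    """One candidate from the article pool (all fields are hypothetical)."""
    name: str
    kind: str                 # "academic" or "commercial"
    year: int                 # publication year (used for academic tools)
    tabular_relational: bool  # analyzes tabular data from relational databases
    general_purpose: bool     # not tied to a single domain or data source
    needs_programming: bool   # framework, package, or library
    gartner_awarded: bool = False  # winner or honorable mention in 2017


def is_included(tool: CandidateTool, survey_year: int = 2018, window: int = 5) -> bool:
    """Apply the inclusion and exclusion criteria listed in Section 2.1."""
    # Exclusion: domain specific tools and programming libraries.
    if not tool.general_purpose or tool.needs_programming:
        return False
    # Inclusion: tabular data stored in relational databases.
    if not tool.tabular_relational:
        return False
    if tool.kind == "academic":
        # Assumption: "last five years" means 2013 through 2018 inclusive.
        return 0 <= survey_year - tool.year <= window
    # Commercial tools must be among the Gartner awardees.
    return tool.gartner_awarded


candidates = [
    CandidateTool("tool_a", "academic", 2016, True, True, False),
    CandidateTool("tool_b", "academic", 2011, True, True, False),   # too old
    CandidateTool("tool_c", "academic", 2017, True, False, False),  # domain specific
    CandidateTool("tool_d", "academic", 2018, True, True, True),    # library
    CandidateTool("tool_e", "commercial", 2017, True, True, False, gartner_awarded=True),
]
selected = [t.name for t in candidates if is_included(t)]
print(selected)
```

In this toy pool only `tool_a` and `tool_e` pass every criterion; the remaining entries each fail exactly one check, which mirrors how each exclusion criterion removes a different class of tools.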

frameworks, and packages that require programming skills from end-users.

In today’s data-centric world, almost all businesses make use of general purpose commercial Business Intelligence (BI) and analytics tools2 for performing EDA tasks to gather insights from data. As shown in Fig. 1, with respect to RQ3, we selected the seven most popular tools (Gartner, 0000) that were awarded by Gartner Inc. in 2017. Our primary purpose of investigating commercial tools was to identify the similarities and differences between the current state of academic research and industrial practice.

2 https://www.gartner.com/reviews/market/analytics-business-intelligence-platforms.

To summarize, tools fulfilling the following criteria were included in our study (cf. Fig. 1):

• Presented within the last five years (criterion applicable only for academic tools).
• Focused on analyzing tabular data stored in relational databases.
• Focused on general purpose exploratory analysis of data.
• Most popular and widely used (criterion applicable only for commercial tools).

On the other hand, tools were excluded from the study based on the following criteria:

• Domain specific visual exploratory analysis tools (i.e., tools that only work with data from a specific source).
• Frameworks, packages, or libraries for performing visual EDA tasks.

Like every other process, EDA consists of a set of steps (cf. Section 3.2). With RQ4 we aimed at investigating the utility of the selected tools at these different steps. Also, we intended to investigate the extent to which the tools fulfill the additional exploratory requirements (i.e., answers of RQ1) for analyzing large datasets. At the end of our study, we aimed at seeking answers for RQ5 and identifying gaps and research opportunities in the current state of research on visual EDA.

2.2. Research methodology

In this section, we discuss different steps of our research methodology in detail. This section is primarily divided into three subsections. The first subsection presents our analysis methodology for the real-world dataset, whilst in the next two subsections, we discuss the detailed processes of collection and analysis of the selected EDA tools for this research.

2.2.1. Background analysis of an industrial dataset

The industrial dataset analyzed in this work is comprised of 3.4 million records with 27 attributes and contains product license

renewal information from IBM. It is important to note that we only had access to a completely anonymized version of the dataset. The dataset was provided to us in Comma Separated Values (CSV) format, and was created by joining five different DB2 tables from an IBM data server. These tables contained information such as sales figures, product details, customer interaction details, and types of product licenses. The tabular dataset was investigated by a group of two researchers (the first and second authors) using Microsoft Excel. At this time, we performed different data manipulation tasks such as plotting the value distributions of attributes and finding correlations among attributes. We also generated a pivot table from the data that enabled us to sort and filter the attribute values, so that we could compare the maximum, minimum, mean, and standard deviations (Tufféry, 2011) of each attribute. During these tasks, we identified a set of challenging aspects (cf. Section 3.1) of the analyzed dataset. Once these challenges were identified, the two researchers looked into the literature (Liu et al., 2013; Wang et al., 2017, 2015; Biju and Mathew, 2017; Kaisler et al., 2013; Johnstone and Titterington, 2009) that addresses one or more of these identified challenges of big-data exploration (Wang et al., 2015; Najafabadi et al., 2015; Kaisler et al., 2013). Based on this evidence from the literature, we formed a set (cf. Section 3.1) of additional exploratory requirements for large scale EDA tools.

2.2.2. Data collection for systematic literature review

In order to address RQ2, we carried out a manual search of conference proceedings and journals that are known to publish novel ideas on data visualization techniques. The article sources were chosen not only based on their impact factors in the EDA community, but also because they have been popularly chosen by researchers (Godfrey et al., 2016; Idreos et al., 2015; Wang et al., 2015) for performing similar studies. As the next step, the last five years’ archive for each of the identified journals and conferences was scrutinized by the two researchers. As shown in Fig. 1, during this task, the researchers collected each and every article from the identified journals into a pool of 233 articles that were relevant to EDA. Later the collected articles were filtered by the researchers based on the inclusion and exclusion criteria (cf. Fig. 1) defined for the tools. During this step, 190 articles were excluded from our study. In cases of conflicts between the two researchers regarding an article’s eligibility to be included in the study, a third researcher was brought in to resolve the disagreements. In parallel, for addressing RQ3, the two researchers started investigating the most popular commercial exploratory data analysis tools in industry (cf. Fig. 1). Later, following the evaluation of Gartner Inc. (Gartner, 0000), the researchers selected 7 commercial EDA tools. For this study, we considered both the winners and the honorable mentions of the customer choice awards. Once the selected EDA tools were finalized, a quality assessment was performed by a team of three researchers (the first three authors) involved in this work, where the fulfillment of the inclusion and exclusion criteria for each of the selected tools was validated. During the quality assessment session, the team also checked whether the systematic review had covered all relevant EDA tools from the selected journals and conference proceedings. Once the tools to be analyzed were finally chosen, the following information was extracted regarding the tools:

• The source journal or conference proceedings of the tool and its year of publication.
• The research questions addressed by the tool and its primary focus.
• The EDA steps supported by the tool along with its additional features.

2.2.3. Data analysis for systematic literature review

While reviewing the chosen EDA tools, following the guidelines of Kitchenham et al. (2009), both researchers first thoroughly read the articles for each tool. Later, for the tools (Yalçin et al., 2018; Kraska, 2018; Yu and Silva, 2017; Gratzl et al., 2014; Wongsuphasawat et al., 2017; Furmanova et al., 2017; Niederer et al., 2018) that provide open source access to their implementations, the two researchers independently executed the source code of these tools. Among the tools that were executed, whilst some (Yu and Silva, 2017; Liu et al., 2013; Wongsuphasawat et al., 2017; Furmanova et al., 2017; Niederer et al., 2018; Vartak et al., 2015) allowed their source code to be downloaded to our local systems, some other tools (Yalçin et al., 2018; Kraska, 2018) only presented a live executable version that requires users to upload their datasets to the tool’s server. Due to the strict Data Access Policy requirements from IBM, we applied our analyzed industrial dataset only to those academic tools (Yu and Silva, 2017; Liu et al., 2013; Wongsuphasawat et al., 2017; Furmanova et al., 2017; Niederer et al., 2018; Vartak et al., 2015) that allowed us to download their source code. For the tools (Yalçin et al., 2018; Kraska, 2018; Gratzl et al., 2014) that did not enable us to download any code, we executed the tools using the sample datasets on the tools’ websites. In case of the tools (El-Hindi et al., 2016; Zgraggen et al., 2014; Mei et al., 2018) that did not share any source code information, the two researchers thoroughly reviewed the main articles of the tools. For commercial tools however, we could download all seven of the tools (Gartner, 0000) and applied them to our industrial dataset. At the end of the analysis, both researchers discussed their findings to derive a final evaluation for each tool. In case of disagreements, the third researcher helped to resolve the conflicts. Finally, the group of three researchers collaboratively derived a summary table (cf. Table 1) with the evaluation of the identified EDA tools. From this analysis, the researchers identified some gaps and open research directions (cf. Section 4) in the emerging field of visual EDA.

3. Results

In this section, we discuss the primary results of our research in detail. We begin with a brief description of our findings from analyzing the industrial dataset, followed by a detailed discussion on the results of our systematic literature review of the chosen EDA tools.

3.1. Elicitation of additional exploratory requirements for large industrial datasets

This section presents our analysis results of the industrial dataset obtained from IBM. In this section, at first we highlight the challenging aspects of the dataset, then we present a list of additional exploratory requirements for large scale EDA tools.

i. High dimensionality: The dimensionality of a tabular dataset usually refers to the number of independent variables or attributes in the data. Our dataset from IBM had 27 attributes. High-dimensionality of large business datasets is a known challenge among researchers (Johnstone and Titterington, 2009; Liu et al., 2017a). Firstly, the computational workload for analyzing a dataset increases as the number of dimensions grows (Fan and Li, 2006). Secondly, as a result of dimensional redundancy (Fan and Li, 2006), some attributes in a high-dimensional dataset might not be as useful as others. For example, in our industrial dataset, there were three attributes representing the country code of customers from three different view-points. In these situations, strong correlations can be noticed (Dunn et al., 2016) among the redundant dimensions that can be difficult

Table 1
Summary of investigated exploratory data analysis tools. Note: ‘✓’ represents the tool ‘supports’ the operation. For commercial tools: GA, SA, BA, and HM represent Gold Award, Silver Award, Bronze Award, and Honorary Mentions respectively in customer choice awards by Gartner (Gartner, 0000).
Columns: Type of tool | Serial No. | Name of tool | Ordered by | Traditional steps of EDA process (Distinguish attributes, Univariate analysis, Bivariate analysis, Multivariate analysis, Detect missing value, Detect outliers, Feature engineering) | Additional requirements (Scalability, Interpretability, Reduced analytical expertise, User engagement)

Academic
1. DataScope (Iyer et al., 2017) 2018 ✓ ✓ ✓ ✓ ✓
2. DataSite (Cui et al., 2018) 2018 ✓ ✓ ✓ ✓
3. Duet (Law et al., 2018) 2018 ✓ ✓ ✓ ✓ ✓ ✓
4. FastMatch (Macke et al., 2018) 2018 ✓ ✓
5. InfoNice (Wang et al., 2018) 2018 ✓ ✓ ✓ ✓
6. Keshif (Yalçin et al., 2018) 2018 ✓ ✓ ✓ ✓ ✓ ✓
7. NorthStar (Kraska, 2018) 2018 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
8. Podium (Wall et al., 2018) 2018 ✓ ✓ ✓ ✓ ✓ ✓ ✓
9. RCLens (Lin et al., 2018) 2018 ✓ ✓ ✓ ✓ ✓ ✓ ✓
10. Taco (Niederer et al., 2018) 2018 ✓ ✓
11. Taggle (Furmanova et al., 2017) 2018 ✓ ✓ ✓ ✓ ✓
12. VisComposer (Mei et al., 2018) 2018 ✓ ✓ ✓ ✓ ✓
13. Voder (Srinivasan et al., 2018) 2018 ✓ ✓ ✓ ✓ ✓ ✓ ✓
14. Zenvisage (Siddiqui et al., 2016) 2018 ✓ ✓ ✓ ✓ ✓ ✓
15. Analyza (Dhamdhere et al., 2017) 2017 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
16. ChartAccent (Ren et al., 2017) 2017 ✓ ✓ ✓ ✓
17. ForeSight (Demiralp et al., 2017) 2017 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
18. GaussianCubes (Wang et al., 2017) 2017 ✓ ✓ ✓ ✓ ✓ ✓ ✓
19. HindSight (Feng et al., 2017) 2017 ✓ ✓ ✓ ✓ ✓
20. MyBrush (Koytek et al., 2018) 2017 ✓ ✓ ✓ ✓ ✓
21. VisFlow (Yu and Silva, 2017) 2017 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
22. Voyager 2 (Wongsuphasawat et al., 2017) 2017 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
23. AggreSet (Yalçin et al., 2016) 2016 ✓ ✓ ✓ ✓
24. DimScanner (Xia et al., 2016) 2016 ✓ ✓ ✓ ✓ ✓ ✓
25. ForeCache (Battle et al., 2016) 2016 ✓ ✓ ✓ ✓ ✓ ✓ ✓
26. VisTrees (El-Hindi et al., 2016) 2016 ✓ ✓ ✓ ✓
27. SeeDB (Vartak et al., 2015) 2015 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
28. Sketch (Budiu et al., 2015) 2015 ✓ ✓ ✓ ✓
29. Bertifier (Perin et al., 2014) 2014 ✓ ✓ ✓ ✓ ✓
30. Domino (Gratzl et al., 2014) 2014 ✓ ✓ ✓ ✓ ✓ ✓ ✓
31. Ellipsis (Satyanarayan and Heer, 2014a) 2014 ✓ ✓ ✓ ✓
32. iVisDesigner (Ren et al., 2014) 2014 ✓ ✓ ✓ ✓
33. Lyra (Satyanarayan and Heer, 2014b) 2014 ✓ ✓ ✓ ✓
34. PanoramicData (Zgraggen et al., 2014) 2014 ✓ ✓ ✓ ✓ ✓
35. Prog-Insights (Stolper et al., 2014) 2014 ✓ ✓ ✓ ✓ ✓
36. UpSet (Lex et al., 2014) 2014 ✓ ✓ ✓
37. ExPlates (Javed and Elmqvist, 2013) 2013 ✓ ✓ ✓ ✓ ✓ ✓
38. imMens (Liu et al., 2013) 2013 ✓ ✓ ✓ ✓ ✓
39. LineUp (Gratzl et al., 2013) 2013 ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
40. PivotSlice (Zhao et al., 2013) 2013 ✓ ✓ ✓ ✓ ✓
41. SketchStory (Lee et al., 2013) 2013 ✓ ✓ ✓ ✓ ✓
42. VisDeck (Perry et al., 2013) 2013 ✓ ✓ ✓ ✓
43. VisReduce (Im et al., 2013) 2013 ✓ ✓ ✓ ✓ ✓

Commercial
44. Alteryx (Sallam et al., 2014) GA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
45. Tableau (Software, 0000) SA ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
46. Domo (D. Inc, 0000) BA ✓ ✓ ✓ ✓ ✓ ✓
47. Watson Analytics (Kelly, 2015) HM ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
48. MS Power BI (M. Corporation, 0000) HM ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
49. QlikView (Qlik, 0000) HM ✓ ✓ ✓ ✓ ✓
50. Sisense (Sisense, 0000) HM ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
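
The dimensional redundancy described under item i of Section 3.1 (several attributes encoding the same country information) can be checked mechanically. The sketch below is a hypothetical illustration rather than the analysis the authors performed in Excel: it uses a one-to-one mapping test as a simple stand-in for a correlation measure, and the column names and rows are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def determines_each_other(rows, col_a, col_b):
    """Return True when every value of col_a maps to exactly one value of
    col_b and vice versa, i.e. the two columns carry the same information."""
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for row in rows:
        a_to_b[row[col_a]].add(row[col_b])
        b_to_a[row[col_b]].add(row[col_a])
    forward = all(len(values) == 1 for values in a_to_b.values())
    backward = all(len(values) == 1 for values in b_to_a.values())
    return forward and backward


def redundant_pairs(rows, columns):
    """List every pair of columns that are one-to-one re-encodings."""
    return [(a, b) for a, b in combinations(columns, 2)
            if determines_each_other(rows, a, b)]


# Hypothetical records: three encodings of the customer country plus a product.
rows = [
    {"country_iso2": "CA", "country_iso3": "CAN", "country_num": 124, "product": "A"},
    {"country_iso2": "US", "country_iso3": "USA", "country_num": 840, "product": "A"},
    {"country_iso2": "CA", "country_iso3": "CAN", "country_num": 124, "product": "B"},
    {"country_iso2": "US", "country_iso3": "USA", "country_num": 840, "product": "B"},
]
columns = ["country_iso2", "country_iso3", "country_num", "product"]
pairs = redundant_pairs(rows, columns)
print(pairs)
```

The three country columns are reported as mutually redundant while `product` is not, so an analyst could keep one of them and drop the other two before visualizing the data. The pairwise scan grows quadratically with the number of attributes, which is manageable for a few dozen columns such as the 27 in the dataset described here.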

to visualize. Finally, high-dimensional datasets cause ‘‘geometrical insanity’’ (Johnstone and Titterington, 2009) when visually exploring the data. For example, as the dimension changes only from 2D to 3D, the data that could initially be represented by a 1-dimensional line now becomes a 2-dimensional surface. Hence, when the dimension increases from 3D to 4D and further, it gets extremely challenging to visualize such dimensionality in the data.

ii. Categorical attributes: The second primary aspect of an industrial tabular dataset is the large number of categorical attributes in the data (precisely, in our dataset 19 among the 27 attributes were categorical). Research (Tufféry, 2011) shows that analysis of categorical features in a dataset can be a primary challenge due to reasons such as:

a. Performing statistical analysis on categorical attributes is more challenging than on numeric attributes, as some of the measures of centrality (such as mean and median) and dispersion (such as variance) apply only to numerical data. Also in case the categories are not relative, sorting them according to an ascending or descending order can be a challenging task. Hence, it becomes difficult for data analysts to perform any normality tests (Tufféry, 2011) on the categorical features.

b. Analysis of categorical features with too many categories can result in performance challenges (Johnstone and Titterington, 2009) for any data analysis tool. Also often for these features, there are some categories that are more dominant; such that, whilst the dominant categories account for the majority of the data points, the remaining categories represent an extremely small portion of the data in comparison to the dominant categories. In such situations, it gets immensely challenging to perform univariate analysis (Tufféry, 2011) of the categorical features.

iii. Missing or aberrant values and outliers: The data points with missing values show the incompleteness of the data. As we discovered in the dataset, many data points had missing values for attributes that did not have a NOT NULL constraint in the original database tables. On the other hand, the records with outliers or aberrant values show inconsistency in the data. In our dataset, some aberrant values (such as 9999 in place of date values) represented some undocumented codes for missing data. The outliers in the dataset were either results of human errors in data input or indicated calculation errors when deriving attribute values. In either case, given the enormous size of the dataset, the outliers were among our main challenges for exploring this dataset.

iv. Data sanity: In the industrial dataset, we noticed that the dataset, being created by merging different tables, not only had some columns with ambiguous names but also had columns with inconsistent values. For example, whilst some ambiguous column names represented abbreviations of long sentences (e.g., FYCA standing for: First Year of Contract Agreement), some other column names represented organization specific terminology with internal meaning. On the other hand, the inconsistency in values for some columns resulted from different tables storing the values for the same attribute in different formats. We noticed these inconsistencies in attributes containing date information and financial details. The data sanity problems made us realize that a significant amount of expertise is required to understand the values of each attribute in the data.

v. Multivariate relationships: Attributes in business datasets contain complex multivariate relationships that are not easily visible in tabular data. Whilst in some cases the values of an attribute depend on two or more other attributes, in some other cases combined exploration of several attributes can provide more meaningful insights than exploration of

an individual attribute. For example, in our dataset, the attribute containing the information on the next purchase date depended on the attributes: previous purchase date, product type, and business value of customers. On the other hand, combined exploration of customer industry, type of purchased products, and product pricing information gave us insights on the pricing requirements of customers in different industries. So, it can often get challenging to identify the attributes that are related to each other without appropriate domain knowledge and training.

vi. Anonymity: Another aspect of a real-world industrial dataset is anonymity that can cause challenges during the data analysis process. In large multinational organizations, much data is classified business information that is only shared with specific teams and individuals. In such cases, even the data analysts do not get access to the entire information about the dataset. For example, in case of our dataset, attributes such as product pricing or customer firmographic information were anonymized, which led us to some misinterpretation of the data.

3.2. Survey of exploratory data analysis tools

In this section, we present the results of our systematic literature review that answers our research question RQ4 (cf. Section 2.1). We begin this section with the evaluation results of the chosen EDA tools for their ability to assist with the EDA process (cf. Fig. 3). Later, we discuss our findings on the tools’ fulfillment of the additional exploratory requirements for large scale EDA (cf. Section 3.1).

3.2.1. Support for traditional EDA process steps

According to Tufféry (2011) and Demiralp et al. (2017), as shown in Fig. 3, the EDA process usually follows six distinct steps namely: (i) Distinguish Attributes, (ii) Univariate Data Analysis, (iii) Detect Interactions Among Attributes, (iv) Detect Missing & Aberrant Values, (v) Detect Outliers, and (vi) Feature Engineering. As depicted in Fig. 3, the analysis begins with identification of attributes in a dataset that gives a clear understanding of the data to be analyzed. Next, in order to understand individual attributes and their relationships with each other, univariate, bivariate, and multivariate analyses are performed. Later, cleaning and
vii. Large scale of data points: One of the primary aspect of data preparation tasks are carried out, where missing, aberrant val-
ues and outliers (Srinivasan et al., 2018) are detected and imputed.
real-business data is the large scale of data points in the
The process ends with feature engineering, where features are
datasets. In our case, the dataset with 3.4 million rows and
transformed or combined to generate new features. We summarize
27 columns resulting into 91.8 million data values took
our analysis results in Table 1.
hours to be extracted from the database into CSV. Hence, we
think it will take longer time for any EDA tool to visualize i. Distinguish attributes: Exploratory data analysis begins
such amount of data. with identification of the attributes in a dataset. This is an
essential step at the beginning of the EDA process that not
From our analysis, we believe that in order to efficiently an- only helps with the ‘‘Cold-start’’ (Cui et al., 2018; High, 2012)
alyze any industrial dataset, in addition to supporting the EDA problem of data analysis, but it also assists users to formu-
process steps (Tufféry, 2011), EDA tools need to address the above- late clear analysis goals. According to researchers (Godfrey
mentioned challenges. Research (Wang et al., 2015; Najafabadi et al., 2016), datasets commonly have numerical (or quan-
et al., 2015; Kaisler et al., 2013) shows that, each of these identified titative) or categorical (or qualitative) attributes (Tufféry,
challenges of big-data analytics can be associated to specific ex- 2011). However, not all statistical analysis techniques can
ploratory requirements of modern EDA tools. Following the exist- be applied to all the attributes in a dataset (Xia et al., 2016).
ing research results (Chan, 2006; Liu et al., 2013; Wang et al., 2017, Hence, it is important for data analysts to clearly distinguish
2015; Biju and Mathew, 2017; Johnstone and Titterington, 2009; and understand the meaning of each attribute in a dataset
Im et al., 2013), we identify four additional exploratory require- prior to analyzing the data.
ments of large scale EDA tools namely: (i) scalability, (ii) reduced Most existing commercial data visualization tools such as
analytical expertise, (iii) user engagement, and (iv) interpretability. Microsoft Power BI (M. Corporation, 0000) and IBM Watson
Analytics (High, 2012), show the entire dataset in a tabular
Fig. 2 summarizes the relations between the identified challenges
format and allow users to see and modify the data in terms
and the additional exploratory requirements.
of attribute names, attribute values, and datatypes. Among
As shown in Fig. 2, researchers such as Najafabadi et al. (2015), academic EDA tools, while some tools such as Keshif (Yalçin
Wang et al. (2015), and Chan (2006) have associated the aspects of et al., 2018), Explates (Javed and Elmqvist, 2013), North-
high-dimensionality, and large scale of data points to the scalability Star (Kraska, 2018), DimScanner (Xia et al., 2016), and An-
requirements of EDA tools. The reason being, both the aspects alyza (Dhamdhere et al., 2017) present a list of attribute
refer to the size and complexity of a dataset (Wang et al., 2017; names to the user, tools such as Podium (Wall et al., 2018),
Biju and Mathew, 2017; Im et al., 2013), and hence signify the ForeSight (Demiralp et al., 2017) and Bertifier (Perin et al.,
necessity for scalability in EDA tools. Hence, we consider scala- 2014) present a portion of data in tabular format at the
bility as our first additional exploratory requirement. Moreover, beginning of the analysis process. On the other hand, tools
according to Tufféry (2011), the analysis of categorical attributes such as Voyager 2 (Wongsuphasawat et al., 2017) (cf. Fig. 2),
and multivariate relationships among attributes can require signif- Taggle (Furmanova et al., 2017), Zenvisage (Siddiqui et al.,
icant analytical expertise. Hence, based on existing research (Wang 2016), and LineUp (Gratzl et al., 2013) provide visual
et al., 2015; Najafabadi et al., 2015) reduced analytical expertise is overviews of all attributes immediately at the beginning of
the analysis. In most cases (Wongsuphasawat et al., 2017;
chosen as our next requirement for EDA tools. On the other hand,
Wall et al., 2018; Furmanova et al., 2017; Dhamdhere et al.,
researchers (Demiralp et al., 2017; Wang et al., 2015) have also
2017), an initial summary uses a variety of interactive his-
shown that, whilst the results of multivariate analysis can be chal-
tograms to present an overview of each attribute. For exam-
lenging to interpret, the presence of poor data sanity, anonymity, ple, Fig. 4 shows a snapshot of the tool Voyager 2 (Wong-
missing values, and outliers require additional support for inter- suphasawat et al., 2017), where the parts (a) and (c) are rel-
pretability from EDA tools. Finally, according to researchers (Wang evant to distinguishing attributes. In the figure, the section
et al., 2015; Najafabadi et al., 2015; Kaisler et al., 2013), in order to marked by (a) gives an example of distinguishing attributes
rectify the sanity issues of large datasets, EDA tools need to enable at the beginning of the analysis process. Whereas, the sec-
user engagement in the form of user feedback. tion (c) depicts the visual summaries of each attribute.
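The distinction between quantitative, categorical, and temporal attributes described for this step can be illustrated with a minimal sketch. It is not taken from any of the surveyed tools: the function names, the sample table, and the assumption that dates follow the `YYYY-MM-DD` format are all hypothetical and chosen only for illustration.

```python
from datetime import datetime


def infer_attribute_type(values):
    """Classify a column of raw strings as quantitative, temporal, or categorical."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "categorical"  # assumption: empty columns default to categorical

    def all_parse(parser):
        # True only if every non-missing value can be parsed by `parser`.
        for v in present:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(float):
        return "quantitative"
    # Assumption: temporal attributes are ISO-style dates (YYYY-MM-DD).
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "temporal"
    return "categorical"


def group_attributes(table):
    """Group attribute names by inferred type, similar to an attribute list panel."""
    groups = {"quantitative": [], "categorical": [], "temporal": []}
    for name, values in table.items():
        groups[infer_attribute_type(values)].append(name)
    return groups


# Hypothetical sample table; empty strings stand for missing values.
table = {
    "price": ["10.5", "20", "", "7.25"],
    "industry": ["retail", "banking", "retail", "energy"],
    "purchase_date": ["2018-01-03", "2018-02-14", "2018-03-01", ""],
}
groups = group_attributes(table)
```

Real tools typically combine such type inference with user overrides, since identifiers or codes stored as numbers are often better treated as categorical.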

Fig. 2. Elicitation of additional exploratory requirements for large scale EDA tools.

Fig. 3. The fundamental steps of the exploratory data analysis process.

Fig. 4. Dashboard of the tool Voyager 2 (Wongsuphasawat et al., 2017). The figure shows: (a) the names of attributes grouped into categories such as quantitative, categorical, and temporal; (b) a panel that can assist with bivariate and multivariate analysis by allowing users to choose filters and embellishments; (c) a panel that shows univariate summaries of all attributes.
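Univariate summaries such as those in panel (c) rest on the centrality and dispersion measures listed for the univariate analysis step in Section 3.2.1. The following sketch computes them with Python's standard `statistics` module. The function name and sample values are hypothetical, and population (not sample) formulas are assumed for variance, skewness, and kurtosis.

```python
import statistics


def univariate_summary(values):
    """Centrality and dispersion measures for one numeric attribute."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Third and fourth central moments, used for skewness and excess kurtosis.
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),
        "std_dev": std,
        # Guard against constant attributes, where std is zero.
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / std ** 4 - 3 if std else 0.0,
    }


summary = univariate_summary([2, 4, 4, 4, 5, 5, 7, 9])
```

A positive skewness, as in this sample, indicates a longer right tail, which is the kind of property an analyst would otherwise read off a histogram.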

Moreover, some existing EDA tools (Perry et al., 2013) provide more detailed summaries of attributes. For example, while the tool Domino (Gratzl et al., 2014) summarizes attribute information such as datatypes, number of records, and dimensions, Taggle (Furmanova et al., 2017) provides a short description of the dataset with an HTML link to the data source. Among the commercial EDA tools, Domo (D. Inc, 0000) provides a brief summary of datatypes and groups attributes based on their types. On the other hand, some data exploration tools, such as Taco (Niederer et al., 2018; Hourieh et al., 2016) and Domino (Gratzl et al., 2014), do not describe any attributes in the dataset at all, and begin with
complex data exploration tasks (e.g., join, merge, aggregate, etc.) right after the data is loaded.
ii. Univariate data analysis: Once the attributes in a dataset are identified, it is necessary to perform univariate analysis (Tufféry, 2011) in order to get a deeper understanding of each attribute. Univariate analysis also allows the determination of attribute combinations for subsequent analysis. It helps with detection of details such as centrality (i.e., mean, median, and mode) and dispersion (i.e., range, variance, standard deviation, skewness, and kurtosis) of attributes in the data. While the centrality measures help us determine an approximate average for the attribute values, the dispersion measures help us identify the spread of the values between their lowest and highest bounds. Univariate analysis is also used to identify missing values or outliers in a dataset and to discretize continuous variables (Tufféry, 2011; Kamat and Nandi, 2014).
Most recent advancements in data exploration tools facilitate univariate data analysis. Typically, in both academic and commercial tools (Yalçin et al., 2018; Demiralp et al., 2017), interactive histograms and box-plots are used to depict value distributions of the variables. For example, as shown in part (c) of Fig. 4, the tool Voyager 2 uses interactive histograms to depict the value distributions of the attributes. Additionally, commercial EDA tools such as IBM Watson Analytics (Kelly, 2015; Anderson, 2012), Microsoft Power BI (M. Corporation, 0000), QlikView (Qlik, 0000), and Alteryx (Sallam et al., 2014), and academic tools such as Voyager 2 (Wongsuphasawat et al., 2017), DataSite (Cui et al., 2018), NorthStar (Kraska, 2018), and ForeSight (Demiralp et al., 2017), let users choose from a set of optional visual representations (such as heat-maps, pie-charts, and line-graphs) and in some cases visual embellishments (Koytek et al., 2018) (such as color and texture) to better analyze each attribute. For example, as depicted in part (b) of Fig. 4, Voyager 2 (Wongsuphasawat et al., 2017) allows users to select details such as shape, size, and color for the visualizations. In most cases, modern visualization tools such as Keshif (Yalçin et al., 2018), NorthStar (Kraska, 2018), Voder (Srinivasan et al., 2018), and Tableau (Software, 0000) allow end-users to interactively brush (Yalçin et al., 2018; Zgraggen et al., 2014), hover (Zgraggen et al., 2014; Ren et al., 2014), and zoom (Gratzl et al., 2014; Liu et al., 2013; Satyanarayan and Heer, 2014b) on the visualizations. While aggregation (Furmanova et al., 2017) of feature values is one of the most common ways (Tufféry, 2011) of performing univariate analysis, most EDA tools also support sorting and filtering (Law et al., 2018; Niederer et al., 2018; Budiu et al., 2015) of attribute values.
iii. Detect interactions among attributes: After the univariate analysis of each attribute, the next step is to understand the relationships among different attributes in the dataset. This not only helps to identify incompatibilities among attribute values, but also enables analysts to generate optimal feature combinations (Kraska, 2018; D. Inc, 0000) for subsequent analysis. Analysis of attribute relationships can be performed in two different ways: bivariate and multivariate statistics (Tufféry, 2011). Whereas bivariate statistics only analyzes the association of a chosen pair of attributes, the interaction of more than two variables is analyzed using multivariate statistics. As per Tufféry (2011), bivariate analysis needs to be performed prior to multivariate analysis. This way, once the users have a clear idea of the compatibility of an attribute pair, they can combine more attributes with them for further analysis.
a. Bivariate statistics: In modern EDA tools, interactive filtering and aggregation of attributes are the most common ways (Yalçin et al., 2018; El-Hindi et al., 2016; Yu and Silva, 2017; Srinivasan et al., 2018; Software, 0000; Kelly, 2015; Mokalis and Davis, 2018) of performing a combined analysis of two attributes. The vast majority of the exploratory data visualization tools perform bivariate data analysis. In some tools, e.g., Voder (Srinivasan et al., 2018), Taggle (Furmanova et al., 2017), Domino (Gratzl et al., 2014), MyBrush (Koytek et al., 2018), DataScope (Iyer et al., 2017), and ForeSight (Demiralp et al., 2017), filtered and aggregated attribute values are usually obtained by interactive brush-and-link (Yu and Silva, 2017; Mei et al., 2018) operations and are presented using highlighted and interactive histograms (Yu and Silva, 2017; Macke et al., 2018; Niederer et al., 2018; Furmanova et al., 2017). These histograms use different colors and/or textures to represent correlations among attributes. Moreover, tools like Keshif (Yalçin et al., 2018) allow users to lock histograms of specific variables and compare them to other variables. Fig. 5 shows a snapshot of both univariate and bivariate analysis using Keshif. Whereas part (a) in Fig. 5 presents individual attributes in groups based on their datatype, the upper half of part (b) depicts bivariate relationships among attributes using overlapped and locked histograms, and part (c) shows univariate analysis with a filter operation.
Some tools perform different variations of the brush-and-link operations in order to correlate attributes. For example, VisTrees (El-Hindi et al., 2016) requires users to explicitly link two attributes prior to performing the brush and filter operations. VisFlow (Yu and Silva, 2017) makes users select two attributes and pass them through a binder component before the brush and filter operations can be performed. NorthStar (Kraska, 2018) links the two attributes and creates a scatter plot that shows the correlations between the two attributes. The tool MyBrush (Koytek et al., 2018), on the other hand, focuses entirely on brushing and linking attributes. It provides a unified interface for interactively configuring different components of the brush-and-link operation, namely: source, link, and target. Some tools, such as PanoramicData (Zgraggen et al., 2014), Tableau (Software, 0000), iVisDesigner (Ren et al., 2014), Voder (Srinivasan et al., 2018), DataSite (Cui et al., 2018), and IBM Watson Analytics (High, 2012; Kelly, 2015), allow users to compose different visuals (such as scatter-plots and pie-charts) other than just histograms to perform bivariate data analysis. Tools such as Tableau (Software, 0000), IBM Watson Analytics (Anderson, 2012), and Alteryx (Sallam et al., 2014) also enable users to perform join operations on multiple related tables in the same database.
Some EDA tools (Yu and Silva, 2017; Liu et al., 2013; Furmanova et al., 2017; Isaacs et al., 2014) analyze horizontal subsets of data. These subsets are often created either based on user-driven selections (Furmanova et al., 2017) or algorithmic analysis (Liu et al., 2013). Horizontal data subsets are used in many different visualization tools to achieve different goals. For example, FastMatch (Macke et al., 2018) uses subset sampling to analyze the histograms of all attributes in a dataset and finds the top-k similar histograms among them. Taggle (Furmanova et al., 2017)
allows end-users to create hierarchical aggregation of data subsets in order to create nested attributes. Domino (Gratzl et al., 2014) (cf. Fig. 6), on the other hand, describes data subsets as blocks and depicts relationships (e.g., strong or weak) among the blocks. Duet (Law et al., 2018) makes use of data subsets to perform pairwise comparison among tabular data. Fig. 6 depicts an example of bivariate analysis using the tool Domino (Gratzl et al., 2014), where the relationships between data subsets are presented by parallel coordinates and scatter plots.
b. Multivariate statistics: Once a pair of relevant attributes in a dataset is analyzed, the next step is to perform a deeper investigation, where more attributes are added to the analyzed pair for a combined exploration. Research (Tufféry, 2011) shows that analysis of the correlation among more than two attributes is a complex and time-consuming task, which can only be achieved by factor analysis techniques such as clustering (Perin et al., 2014) and dimensionality reduction (Cui et al., 2018). As a result, in order to avoid the complexity of visualizing the results of factor analysis, most of the modern EDA tools depict relationships among multiple attributes using group and filter operations. For example, in tools such as PanoramicData (Zgraggen et al., 2014), Keshif (Yalçin et al., 2018), and iVisDesigner (Ren et al., 2014), bivariate histograms and scatter plots are filtered using one or more attributes to show the relationships among all these features. An example of multivariate analysis using 2-dimensional histograms is presented in Fig. 5 (i.e., the lower half of part (b)), where three histograms of different colors are used to compare the values of three different attributes.
Nevertheless, despite the complexity of multivariate statistics, some of the analyzed EDA tools implement factor analysis tasks. For example, PivotSlice (Zhao et al., 2013) uses multi-dimensional query mechanisms to generate faceted exploratory visualizations, and VisTrees (El-Hindi et al., 2016) allows users to create multi-dimensional indexes in order to combine feature subsets with each other. Moreover, tools such as GaussianCubes (Wang et al., 2017), imMens (Liu et al., 2013), Podium (Wall et al., 2018), and LineUp (Gratzl et al., 2013) enhance the scalability of the EDA process with the use of dimensionality reduction (Cui et al., 2018) techniques. For example, imMens generates data cubes (Liu et al., 2013) from the binned aggregation of data, which are further transformed into multivariate data tiles, whereas GaussianCubes (Wang et al., 2017) improves on imMens by precomputing the best multivariate Gaussian distribution among attributes. On the other hand, LineUp (Gratzl et al., 2013) and Podium (Wall et al., 2018) make use of multi-attribute rankings based on attribute combinations. Whereas LineUp (Gratzl et al., 2013) uses multi-attribute ranking to let end-users alter attribute combinations or column rankings and compare the differences in the relationships, Podium (Wall et al., 2018) helps users determine the importance of each attribute in the dataset in terms of decision making.
iv. Detect aberrant & missing values: Aberrant and missing values may result in biased analysis of data (Tufféry, 2011). Aberrant values are erroneous values that occur as a result of incorrect user inputs or calculation errors, whilst missing values occur in a dataset during data extraction and/or data collection. Detection of such values in a dataset usually happens right after multivariate analysis, when the user has a clear idea about the value ranges of the attributes and their compatibilities. In the case of a large dataset, the search for missing and aberrant values begins when any abnormality is noticed in the univariate, bivariate, or multivariate visualizations. For example, Fig. 7 shows the dashboard of the tool ForeSight (Demiralp et al., 2017), where the missing values in data can be located in part (d) of the dashboard, and part (c) can show aberrant values in data. Once data points with aberrant or missing values are detected, usually the first action of the data analysts is to remove these data points (Tufféry, 2011). However, removing data points can have its own consequences. Firstly, there can be a large number of data points for which at least one attribute value is missing. Secondly, the data points with missing values might have special significance in the dataset. Hence, removal of observations with missing and/or aberrant values can add further bias to the analysis. According to Tufféry (2011), alternatives to deletion of records with missing values are to perform value imputations, or to include the data with missing values in the analysis with a known margin of error. Imputations of missing values can either be user-driven (Tufféry, 2011) or automatically performed with the help of predictive models (Liu et al., 2017b).
Although some tools such as Keshif (Yalçin et al., 2018) and AggreSet (Yalçin et al., 2016) allow the user to temporarily remove some attributes from analysis, except for a few tools such as IBM Watson Analytics (High, 2012), GaussianCubes (Wang et al., 2017), and MyBrush (Koytek et al., 2018), most of the analyzed tools do not allow users to detect or modify aberrant values in the dataset. Tools such as Podium (Wall et al., 2018), ForeSight (Demiralp et al., 2017) (cf. Fig. 7), and Bertifier (Perin et al., 2014) allow users to visualize missing values in the data in tabular format; however, these tools require users to manually scroll through the entire table in order to identify the missing values. Despite scalability challenges, these tools allow users to perform user-driven imputations on the missing values; however, none of the tools we analyzed performs any automatic imputation of missing or faulty data.
v. Detect outliers: The detection of outliers usually happens during or after the univariate, bivariate, or multivariate analysis. An outlier is an observation that deviates markedly (Tufféry, 2011) from other observations in the dataset. Like aberrant values, outliers can also add bias to the analysis, leading to misinterpretation of attribute properties. According to researchers (Zuur et al., 2010), outliers in a dataset can be primarily of three types, namely: univariate, bivariate, and multivariate outliers. Therefore, usually after multivariate analysis and detection of aberrant values, users focus on the detection of outliers. While univariate outliers can be detected by calculating the Inter-Quartile Range (IQR) (Smith, 2018) of individual variables, to detect bivariate and multivariate outliers, analysts need to inspect correlations among different attributes. For example, bivariate outliers can be detected by combining two attributes and calculating their correlation coefficient (Demiralp et al., 2017), whereas multivariate outliers can be detected using factor analysis (Tufféry, 2011). The complexity of visualizing outliers in a dataset also depends on the type of outlier. Whilst univariate and bivariate outliers can be easily depicted using box-plots, interactive histograms and scatter
plots (Kraska, 2018), it is often challenging for visual EDA tools to depict multivariate outliers.
As per our analysis, some of the modern EDA tools, such as VisFlow (Yu and Silva, 2017), ForeSight (Demiralp et al., 2017) (cf. Fig. 7), Podium (Wall et al., 2018), RCLens (Lin et al., 2018), DimScanner (Xia et al., 2016), HindSight (Feng et al., 2017), and IBM Watson Analytics (Kelly, 2015), allow their users to detect univariate and bivariate outliers in a dataset. Fig. 7 shows an example of the tool ForeSight (Demiralp et al., 2017), where the detections of univariate and bivariate outliers are depicted in parts (b) and (c) of the figure, respectively. Just like the missing and aberrant values, once the outliers in a dataset are detected, they can be rectified by either removing the observations, performing automatic or user-driven imputations, or transforming variables (Gratzl et al., 2014; Furmanova et al., 2017).
vi. Feature engineering: Finally, after obtaining detailed insights about the dataset, feature engineering is carried out as the last step of the EDA process. Feature engineering is a core step of exploratory data visualization (Tufféry, 2011) that is performed by almost all EDA tools (Cui et al., 2018; Yalçin et al., 2018; El-Hindi et al., 2016; Kraska, 2018). It is primarily divided into two parts: variable creation and transformation. Derived variables are often created to ease the data analysis process. Derived variables not only summarize linear relationships among many attributes, but also help to simplify the understanding of complex attributes in the dataset. Variable transformations convert complex non-linear relationships into linear relationships and standardize values to obtain a better understanding. Normalization (Tufféry, 2011) is a type of variable transformation that helps to convert skewed distributions into more symmetric distributions. Among the tools we analyzed, FastMatch (Macke et al., 2018) identifies similarities between different distributions by comparing the relative values. Most visualization tools, such as Keshif (Yalçin et al., 2018), NorthStar (Kraska, 2018), and Voyager 2 (Wongsuphasawat et al., 2017), use binning or categorization strategies to split up continuous variables into categories to gather more insight from them.

3.2.2. Support for additional exploratory requirements
In this section, we present our evaluation of each of the analyzed tools with respect to their fulfillment of the four additional exploratory requirements (cf. Section 3.1). For each requirement, we also discuss the different ways the analyzed tools have addressed this requirement in the steps of the EDA process. We summarize the results of our analysis in Table 2 and discuss them as follows:

i. Scalability: Scalability of exploratory visualization tools primarily has two aspects: firstly, loading the entire dataset into the main memory; secondly, processing the data and producing visual representations of the attribute relationships (i.e., the response time of the tool). In the case of academic tools, researchers have attempted to address both of these aspects. For example, in order to address the challenge of a large set of raw data that does not fit into main memory, tools like ForeCache (Battle et al., 2016) use a client–server architecture, where a middleware layer fetches portions of data ahead of time based on the analysis history of the user. On the other hand, EDA tools make use of several different techniques to improve the response time for processing very large datasets. For example, tools such as ProgressiveInsights (Stolper et al., 2014), NorthStar (Kraska, 2018), and VisReduce (Im et al., 2013) progressively create incremental visual representations of the data and provide incremental updates to notify the user of the wait time. Alternatively, tools such as ForeSight (Demiralp et al., 2017) and ProgressiveInsights (Stolper et al., 2014) provide approximate visualizations with a known boundary of error. Other tools create subsets from the data in order to achieve scalability. For example, tools such as Taggle (Furmanova et al., 2017), Domino (Gratzl et al., 2014), GaussianCubes (Wang et al., 2017), FastMatch (Macke et al., 2018), and imMens (Liu et al., 2013) make use of horizontal data subsets for this purpose. In the case of commercial tools, almost all the EDA tools (Sallam et al., 2014; Software, 0000; Sisense, 0000) analyzed during this work support highly scalable analytics.
With respect to the scalability of the tools in each step of the EDA process, as shown in Table 2, only a few tools (Kraska, 2018; Wang et al., 2017; Demiralp et al., 2017; Dhamdhere et al., 2017) focus on distinguishing attributes. For example, tools such as ForeSight (Demiralp et al., 2017) and Microsoft Power BI (M. Corporation, 0000) present attribute names in a tabular form, and tools such as NorthStar (Kraska, 2018), Tableau (Software, 0000), and Domino (Gratzl et al., 2014) group attributes into categories. Scalability in univariate and bivariate analysis is supported by most EDA tools that allow large scale analysis. For this purpose, tools such as ProgressiveInsights (Stolper et al., 2014), NorthStar (Kraska, 2018), and VisReduce (Im et al., 2013) constantly refine partially loaded univariate and bivariate analysis charts of attributes. Moreover, to provide support for scalable multivariate analysis, tools such as imMens (Liu et al., 2013) and GaussianCubes (Wang et al., 2017) precompute multivariate data tiles. On the other hand, scalable identification of missing values, aberrant values, and outliers is supported by some EDA tools (Kraska, 2018; Gratzl et al., 2014; Wang et al., 2017; Gratzl et al., 2013). Whilst in most cases (Kraska, 2018; Srinivasan et al., 2018) the outliers are presented using graphical representations such as box-plots or scatter plots, the missing values are presented either in tabular form (Kraska, 2018) or using visual encodings (Furmanova et al., 2017). Finally, the scalability of feature engineering (Tufféry, 2011) depends on the scalability of univariate and bivariate analysis in the analyzed EDA tools.
ii. Reduced analytical expertise: In order to help non-expert users to explore data, researchers (Cui et al., 2018; Demiralp et al., 2017; Wongsuphasawat et al., 2017) have proposed proactive visual recommender systems that can ease the learning curve for novice users. During this study, we noticed three different types of recommendations: (i) recommendation of charts (Vartak et al., 2015), (ii) recommendation of actions (Cui et al., 2018), and (iii) recommendation of questions (Anderson, 2012). Among these, recommendation of charts is the most common and is offered by many tools such as SeeDB (Vartak et al., 2015), Voyager 2 (Wongsuphasawat et al., 2017), VizDeck (Perry et al., 2013), Tableau (Software, 0000), Analyza (Dhamdhere et al., 2017), Alteryx (Sallam et al., 2014), and Microsoft Power BI (M. Corporation, 0000), among others. Recommendation of actions is less common; however, it is offered by tools such as DataSite (Cui et al., 2018) and ForeSight (Demiralp et al., 2017), which suggest subsequent steps of analysis to users. Recommendation of possible questions that can be asked of the data is offered by Voder (Srinivasan et al., 2018) (cf. Fig. 8) and IBM Watson Analytics (Anderson, 2012), which performs natural language processing for the task. Apart from proactive recommendations, tools such as Zenvisage (Siddiqui et al., 2016) automatically search for user-specified patterns in data, and the tool

Fig. 5. Dashboard of the tool Keshif (Yalçin et al., 2018). In the figure: (a) Keshif lists the attributes in the dataset in groups such as categorical, quantitative, and time-series data; (b) for bivariate and multivariate analysis, Keshif allows users to lock histograms of up to three attributes; (c) attribute relationships are also shown on visual representations that allow users to switch to different visuals and/or filter the data.
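Histogram views such as those in Fig. 5 rest on binning continuous attributes, the categorization strategy that the feature engineering step in Section 3.2.1 attributes to tools such as Keshif. The sketch below shows plain equal-width binning. It is an illustrative assumption and not the binning scheme of any particular tool; the function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Split a continuous attribute into n_bins equal-width bins and count members."""
    lo, hi = min(values), max(values)
    # Guard: if all values are equal, fall back to a unit width.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # The maximum value falls on the last edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


edges, counts = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5)
```

Equal-width bins are easy to read but can leave most bins nearly empty for skewed attributes, in which case quantile-based bins may be a better choice.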

Fig. 6. The tool Domino (cf. Fig. 1 in Gratzl et al., 2014) showing the relationships between data subsets using parallel coordinates and scatter plots.
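The IQR-based detection of univariate outliers mentioned for the outlier detection step in Section 3.2.1 can be sketched as follows. The 1.5 × IQR fence is a common convention assumed here, not a threshold stated in the survey; the function name and sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical attribute values with one clearly aberrant observation.
outliers = iqr_outliers([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])
```

The same fences are what a box-plot draws as whiskers, which is why box-plots are a natural visual encoding for univariate outliers.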

SketchStory (Lee et al., 2013) (cf. Fig. 9(b)) identifies partial sketches drawn on the user interface using a digital pen, and automatically completes the graphical representation. Moreover, tools such as Lyra (Satyanarayan and Heer, 2014b) and iVisDesigner (Ren et al., 2014) let users explore data without any programming knowledge.
To reduce the required analytical expertise in each step of the EDA process, as shown in Table 2, tools such as Voyager2 (Wongsuphasawat et al., 2017) and ForeSight (Demiralp et al., 2017) proactively provide visual summaries to distinguish attributes, whereas the tool Analyza (Dhamdhere et al., 2017) guides users through the data discovery (i.e., distinguish attributes and univariate analysis) and the detection of relations between attributes (i.e., bivariate and multivariate analysis). Moreover, the proactive chart recommendations by some academic (Wongsuphasawat et al., 2017; Vartak et al., 2015; Dhamdhere et al., 2017; Perry et al., 2013) and commercial tools (Software, 0000; M. Corporation, 0000; Sallam et al., 2014) also help with univariate and bivariate analysis. Nevertheless, we noticed a lack of proactive guidance for multivariate analysis among the EDA tools. For the identification of outliers, tools such as ForeSight (Demiralp et al., 2017), RCLens (Lin et al., 2018), Voyager2 (Wongsuphasawat et al., 2017), and SeeDB (Vartak et al., 2015) proactively highlight apparently abnormal specific values in the dataset. With the help of a live keyword search, some tools (Zhao et al., 2013; Srinivasan et al., 2018) allow the user to impute missing and aberrant data. For assistance with feature engineering, some tools (High, 2012; Demiralp et al., 2017) proactively recommend feature combinations and derivations of new features.
iii. User engagement: In recent years, visual EDA tools have been used in different domains to make informed decisions from data. Hence, in order to enhance the users' trust in the visual representations provided by these EDA tools, researchers have proposed several mechanisms to engage end-users. For example, tools such as NorthStar (Kraska, 2018), PanoramicData (Zgraggen et al., 2014), SketchStory (Lee et al., 2013), and ExPlates (Javed and Elmqvist, 2013) use interactive pen and touch features of the graphical user interface to engage users. Other tools such as LineUp (Gratzl et al., 2013), Voder (Srinivasan et al., 2018), Duet (Law et al., 2018), RCLens (Lin et al., 2018), ForeSight (Demiralp et al., 2017), and InfoNice (Wang et al., 2018) (cf. Fig. 9(a)) allow users to provide feedback on the visual representations, embellishments, and proactive recommendations. Additionally, tools such as ExPlates (Javed and Elmqvist, 2013), Voyager (Wongsuphasawat et al., 2017), ForeCache (Battle et al., 2016), and HindSight (Feng et al., 2017) allow users to see a history of the performed analysis tasks, so that not only

Fig. 7. Dashboard of ForeSight (cf. Fig. 1 in Demiralp et al., 2017). In the figure: (a) shows univariate attribute distributions, (b) shows outliers in the data, (c) shows
linear correlations among attributes, (d) shows tabular access to underlying data, (e) shows bookmarks of data exploration, (f) shows related insights.

Fig. 8. Explore view of the interface of the tool Voder (cf. Fig. 4 in Srinivasan et al., 2018). In the figure: (A) shows specification of visualization, (B) shows
active visualization, (C) automatically generated data facts, (D) starred data facts about the current visualization, (E) system-generated visuals for other data facts that can
be explored, (F) query panel for data facts, (G) possible visualizations for the chosen attributes.

undo operations can be permitted but also the new EDA results can be compared with previously obtained results. Finally, tools like Voder (Srinivasan et al., 2018) (cf. Fig. 8) and PivotSlice (Zhao et al., 2013) allow users to execute live search operations on the data that produce transformed or derived results.
In these EDA tools, as shown in Table 2, user engagement with the EDA process usually starts from the very beginning of the analysis. For example, the drag and drop feature in NorthStar (Kraska, 2018), PanoramicData (Zgraggen et al., 2014), and SketchStory (Lee et al., 2013) engages users in distinguishing attributes and performing univariate, bivariate, and multivariate analysis. This feature lets users combine two or more attributes together simply by drawing a line between them. On the other hand, the interactive feedback allowed by Duet (Law et al., 2018), RCLens (Lin et al., 2018), and ForeSight (Demiralp et al., 2017) engages users in the detection and imputation of outliers and missing data. For engagement with feature engineering (High, 2012), showing historical interactions from users assists with more informed decision making.
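
The analysis-history mechanism described for tools such as ExPlates, Voyager, ForeCache, and HindSight (undo, plus comparison of new results with earlier ones) can be illustrated with a minimal sketch. The class and method names below (AnalysisStep, AnalysisHistory, record, undo, compare_latest) are hypothetical and are not taken from any of the surveyed tools, and the recorded values are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class AnalysisStep:
    """One recorded EDA action and the result it produced."""
    description: str
    result: Any


@dataclass
class AnalysisHistory:
    """Hypothetical history log; not the data model of any surveyed tool."""
    steps: List[AnalysisStep] = field(default_factory=list)

    def record(self, description: str, result: Any) -> None:
        """Append a step to the end of the log."""
        self.steps.append(AnalysisStep(description, result))

    def undo(self) -> Optional[AnalysisStep]:
        """Drop and return the most recent step, or None if the log is empty."""
        return self.steps.pop() if self.steps else None

    def compare_latest(self) -> Optional[Tuple[Any, Any]]:
        """Return (previous result, latest result), or None with fewer than two steps."""
        if len(self.steps) < 2:
            return None
        return self.steps[-2].result, self.steps[-1].result


# Made-up example values, for illustration only.
history = AnalysisHistory()
history.record("mean of an attribute, all rows", 41.5)
history.record("mean of the same attribute, outliers removed", 38.0)
comparison = history.compare_latest()  # pair of (earlier result, newer result)
undone = history.undo()                # removes the second step from the log
```

Keeping the results alongside the step descriptions is what makes the comparison possible; a log of descriptions alone would support undo only by re-running the earlier steps.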

Table 2
Summary of EDA tools addressing additional exploratory requirements in the EDA process steps. Note: Each block below corresponds to one step of the EDA process (a table row). Within a block, each labeled entry corresponds to one additional exploratory requirement for industrial datasets (a table column) and presents references to the analyzed tools that fulfill that requirement in the associated process step.

EDA step: Distinguish attributes
  Scalability: Yalçin et al. (2018), Kraska (2018), Wang et al. (2017), Demiralp et al. (2017), Vartak et al. (2015), Xia et al. (2016), Dhamdhere et al. (2017), Iyer et al. (2017), Perin et al. (2014), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  Reduced analytical expertise: Kraska (2018), Demiralp et al. (2017), Wongsuphasawat et al. (2017), Vartak et al. (2015), Dhamdhere et al. (2017), Siddiqui et al. (2016), Lin et al. (2018), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  User engagement: Zgraggen et al. (2014), Demiralp et al. (2017), Javed and Elmqvist (2013), Wongsuphasawat et al. (2017), Mei et al. (2018), Dhamdhere et al. (2017), Wall et al. (2018), Siddiqui et al. (2016), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Qlik (0000), Sisense (0000)
  Interpretability: Demiralp et al. (2017), Vartak et al. (2015), Law et al. (2018), Kelly (2015)

EDA step: Univariate analysis
  Scalability: Yalçin et al. (2018), El-Hindi et al. (2016), Kraska (2018), Yu and Silva (2017), Liu et al. (2013), Battle et al. (2016), Wang et al. (2017), Srinivasan et al. (2018), Demiralp et al. (2017), Niederer et al. (2018), Vartak et al. (2015), Xia et al. (2016), Dhamdhere et al. (2017), Iyer et al. (2017), Macke et al. (2018), Feng et al. (2017), Yalçin et al. (2016), Budiu et al. (2015), Perin et al. (2014), Stolper et al. (2014), Lex et al. (2014), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  Reduced analytical expertise: Cui et al. (2018), Battle et al. (2016), Srinivasan et al. (2018), Demiralp et al. (2017), Lee et al. (2013), Wongsuphasawat et al. (2017), Lin et al. (2018), Siddiqui et al. (2016), Dhamdhere et al. (2017), Vartak et al. (2015), Perry et al. (2013), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  User engagement: Yu and Silva (2017), Battle et al. (2016), Srinivasan et al. (2018), Zgraggen et al. (2014), Demiralp et al. (2017), Lee et al. (2013), Javed and Elmqvist (2013), Satyanarayan and Heer (2014a), Wongsuphasawat et al. (2017), Mei et al. (2018), Wang et al. (2018), Wall et al. (2018), Furmanova et al. (2017), Siddiqui et al. (2016), Dhamdhere et al. (2017), Feng et al. (2017), Yalçin et al. (2016), Vartak et al. (2015), Gratzl et al. (2013), Perry et al. (2013), Im et al. (2013), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Qlik (0000), Sisense (0000)
  Interpretability: Cui et al. (2018), Yu and Silva (2017), Demiralp et al. (2017), Law et al. (2018), Ren et al. (2017), Vartak et al. (2015), Kelly (2015)

EDA step: Bivariate analysis
  Scalability: Yalçin et al. (2018), El-Hindi et al. (2016), Kraska (2018), Yu and Silva (2017), Liu et al. (2013), Battle et al. (2016), Wang et al. (2017), Srinivasan et al. (2018), Demiralp et al. (2017), Iyer et al. (2017), Dhamdhere et al. (2017), Feng et al. (2017), Yalçin et al. (2016), Xia et al. (2016), Vartak et al. (2015), Budiu et al. (2015), Perin et al. (2014), Stolper et al. (2014), Lex et al. (2014), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  Reduced analytical expertise: Cui et al. (2018), Battle et al. (2016), Srinivasan et al. (2018), Demiralp et al. (2017), Lee et al. (2013), Wongsuphasawat et al. (2017), Lin et al. (2018), Siddiqui et al. (2016), Dhamdhere et al. (2017), Vartak et al. (2015), Perry et al. (2013), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  User engagement: Yu and Silva (2017), Battle et al. (2016), Srinivasan et al. (2018), Zgraggen et al. (2014), Demiralp et al. (2017), Lee et al. (2013), Javed and Elmqvist (2013), Satyanarayan and Heer (2014a), Wongsuphasawat et al. (2017), Mei et al. (2018), Wang et al. (2018), Wall et al. (2018), Furmanova et al. (2017), Siddiqui et al. (2016), Dhamdhere et al. (2017), Feng et al. (2017), Yalçin et al. (2016), Vartak et al. (2015), Lex et al. (2014), Perry et al. (2013), Im et al. (2013), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Qlik (0000), Sisense (0000)
  Interpretability: Cui et al. (2018), Yu and Silva (2017), Demiralp et al. (2017), Law et al. (2018), Ren et al. (2017), Vartak et al. (2015), Kelly (2015)

EDA step: Multivariate analysis
  Scalability: Yalçin et al. (2018), Kraska (2018), Yu and Silva (2017), Liu et al. (2013), Battle et al. (2016), Wang et al. (2017), Demiralp et al. (2017), Dhamdhere et al. (2017), Xia et al. (2016), Vartak et al. (2015), Stolper et al. (2014), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  Reduced analytical expertise: Kraska (2018), Battle et al. (2016), Wongsuphasawat et al. (2017), Lin et al. (2018), Dhamdhere et al. (2017), Vartak et al. (2015), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  User engagement: Yu and Silva (2017), Battle et al. (2016), Javed and Elmqvist (2013), Wongsuphasawat et al. (2017), Wall et al. (2018), Dhamdhere et al. (2017), Vartak et al. (2015), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), D. Inc (0000), Kelly (2015), M. Corporation (0000), Sisense (0000)
  Interpretability: Yu and Silva (2017), Vartak et al. (2015), Kelly (2015)

EDA step: Detect missing values
  Scalability: Kraska (2018), Wang et al. (2017), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  Reduced analytical expertise: Kraska (2018), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  User engagement: Furmanova et al. (2017), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  Interpretability: Law et al. (2018), Kelly (2015)

EDA step: Detect outlier values
  Scalability: Kraska (2018), Yu and Silva (2017), Liu et al. (2013), Srinivasan et al. (2018), Demiralp et al. (2017), Vartak et al. (2015), Perin et al. (2014), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  Reduced analytical expertise: Kraska (2018), Srinivasan et al. (2018), Demiralp et al. (2017), Wongsuphasawat et al. (2017), Lin et al. (2018), Vartak et al. (2015), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  User engagement: Yu and Silva (2017), Srinivasan et al. (2018), Demiralp et al. (2017), Wongsuphasawat et al. (2017), Wall et al. (2018), Vartak et al. (2015), Gratzl et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  Interpretability: Yu and Silva (2017), Demiralp et al. (2017), Vartak et al. (2015), Kelly (2015)

EDA step: Feature engineering
  Scalability: Yalçin et al. (2018), El-Hindi et al. (2016), Kraska (2018), Yu and Silva (2017), Battle et al. (2016), Wang et al. (2017), Srinivasan et al. (2018), Demiralp et al. (2017), Iyer et al. (2017), Dhamdhere et al. (2017), Feng et al. (2017), Xia et al. (2016), Vartak et al. (2015), Budiu et al. (2015), Stolper et al. (2014), Gratzl et al. (2013), Im et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  Reduced analytical expertise: Kraska (2018), Battle et al. (2016), Srinivasan et al. (2018), Demiralp et al. (2017), Lee et al. (2013), Wongsuphasawat et al. (2017), Lin et al. (2018), Siddiqui et al. (2016), Dhamdhere et al. (2017), Vartak et al. (2015), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000)
  User engagement: Yu and Silva (2017), Battle et al. (2016), Srinivasan et al. (2018), Zgraggen et al. (2014), Demiralp et al. (2017), Lee et al. (2013), Javed and Elmqvist (2013), Satyanarayan and Heer (2014a), Wongsuphasawat et al. (2017), Mei et al. (2018), Wang et al. (2018), Wall et al. (2018), Furmanova et al. (2017), Siddiqui et al. (2016), Dhamdhere et al. (2017), Feng et al. (2017), Vartak et al. (2015), Gratzl et al. (2013), Im et al. (2013), Sallam et al. (2014), Software (0000), Kelly (2015), M. Corporation (0000), Qlik (0000), Sisense (0000)
  Interpretability: Yu and Silva (2017), Demiralp et al. (2017), Ren et al. (2017), Vartak et al. (2015), Kelly (2015)
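
Because each cell of Table 2 is a list of references, the coverage of a requirement across EDA steps can be tallied directly. The sketch below is an illustration only: it assumes the Interpretability column has been transcribed by hand into a Python dictionary (the variable names are hypothetical), and it counts listed references per step, which says nothing about how well each tool supports the step.

```python
# Interpretability column of Table 2, transcribed by hand.
# Keys are the EDA steps; values are the citations as printed in the table.
interpretability = {
    "Distinguish attributes": [
        "Demiralp et al. (2017)", "Vartak et al. (2015)",
        "Law et al. (2018)", "Kelly (2015)",
    ],
    "Univariate analysis": [
        "Cui et al. (2018)", "Yu and Silva (2017)", "Demiralp et al. (2017)",
        "Law et al. (2018)", "Ren et al. (2017)", "Vartak et al. (2015)",
        "Kelly (2015)",
    ],
    "Bivariate analysis": [
        "Cui et al. (2018)", "Yu and Silva (2017)", "Demiralp et al. (2017)",
        "Law et al. (2018)", "Ren et al. (2017)", "Vartak et al. (2015)",
        "Kelly (2015)",
    ],
    "Multivariate analysis": [
        "Yu and Silva (2017)", "Vartak et al. (2015)", "Kelly (2015)",
    ],
    "Detect missing values": ["Law et al. (2018)", "Kelly (2015)"],
    "Detect outlier values": [
        "Yu and Silva (2017)", "Demiralp et al. (2017)",
        "Vartak et al. (2015)", "Kelly (2015)",
    ],
    "Feature engineering": [
        "Yu and Silva (2017)", "Demiralp et al. (2017)", "Ren et al. (2017)",
        "Vartak et al. (2015)", "Kelly (2015)",
    ],
}

# Number of references listed for each EDA step.
coverage = {step: len(refs) for step, refs in interpretability.items()}

# Step with the fewest listed references.
least_covered = min(coverage, key=coverage.get)

# Print the steps from least to most covered.
for step, count in sorted(coverage.items(), key=lambda item: item[1]):
    print(f"{step}: {count}")
```

The same tally can be repeated for the other three columns by transcribing them in the same way.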

iv. Interpretability: Due to the large volume of data being analyzed, visualizations showing inter-relations among attributes can be difficult to interpret. In order to assist with this challenge, recent visual EDA tools attempt to help users with interpretations of the generated visualizations. For example, tools such as Voder (Srinivasan et al., 2018), DataSite (Cui et al., 2018), ExPlates (Javed and Elmqvist, 2013), Ellipsis (Satyanarayan and Heer, 2014a), and ChartAccent (Ren et al., 2017) present users with natural language annotations alongside the visualizations. These annotations discuss details such as the distribution, value range, and most common values of an attribute. However, as shown in Table 2, for the tools that we analyzed, comprehensive annotations are only offered for univariate (Law et al., 2018), bivariate (Demiralp et al., 2017), and multivariate (Vartak et al., 2015) analysis. In contrast, the other steps of the EDA process, such as distinguishing attributes and identifying missing values and outliers, are rarely addressed by the interpretable EDA tools.

4. Research opportunities

The results of our analysis show that based on changes in data analysis requirements (Godfrey et al., 2016), modern EDA tools have included support for some additional features (e.g., scalability, interpretability, etc.). However, we have identified some potential research opportunities that can enhance the abilities of visual EDA tools. We believe that, in order to make informed decisions from data, deeper statistical analysis is required to understand the complex relationships among its attributes. Our analysis shows that the trade-off between the breadth and depth of supported operations in the visual EDA tools remains open. Whereas most EDA tools designed for a generic target audience do not perform complex statistical analysis of data, tools that support such operations are either domain specific or challenging to interpret. Hence, we identify and list a set of potential research opportunities in the domain of exploratory data analysis as follows.

i. Detailed analysis and visualization of bivariate & multivariate statistics: In statistical analysis, the strength of a bivariate relationship between two attributes is usually obtained using correlation coefficients (Smith, 2018). On the other hand, for accurate multivariate analytics, factor analysis (e.g., PCA (Bro and Smilde, 2014)) techniques are used. However, the visualization of the results for these statistical tests can be complicated (Dunn et al., 2016) for non-expert users. Currently, most of our analyzed visual EDA tools only perform brush-link and filter operations to show correlations among attributes. Although some tools (Wang et al., 2017) do perform dimensionality reduction of attributes, the reduced dimensions are not depicted in a comprehensive way (Isaacs et al., 2014). Hence, there is a need for visual EDA tools to perform more complex statistical analysis (e.g., performing factor analysis for multivariate attributes instead of brush-link and filter) and to provide more comprehensive visualizations of the results. Additionally, during our analysis we also noticed that although some of the investigated EDA tools (Srinivasan et al., 2018) allow users to visualize univariate and bivariate outliers, identification and visualization of multivariate outliers are still not performed by any tool. Moreover, the tools that detect outliers in data do not support any automated imputation of these values. It is important for researchers to consider automated strategies for outlier imputation in visual EDA tools.
ii. Advanced discretization of continuous variables: Almost all the tools that were investigated during this work perform discretization (Tufféry, 2011) of continuous variables. Discretization is a process where continuous variables are split into bins or categories based on ranges in their values. Research (Kamat and Nandi, 2014) shows that the task of discretization can add error in data analysis, as the selection of optimal bin-value ranges for continuous variables is often challenging. Our analysis shows that although most of the recent visual EDA tools discretize continuous variables into histograms, none of our analyzed tools considers any error or confidence (Kamat and Nandi, 2014) of the discretization process. Hence, more research is required to consider minimizing the discretization error in order to perform a more accurate analysis. Moreover, some values of a continuous variable might have higher importance than some other values of the same variable. There is a need for EDA tools to accommodate this fact. Although some tools (Demiralp

Fig. 9. User-engagement initiatives by modern EDA tools.

et al., 2017; Wall et al., 2018) support weighting attribute values based on their importance, there is a need for further research in this direction.
iii. Proactive guidance for multivariate relationships: As discussed in Section 3.1, in high-dimensional industrial datasets there can be complex multivariate relationships among attributes that would require much domain expertise to understand. Moreover, in the case of datasets with a large number of attributes, it can be immensely challenging to just identify the features that are related together and influence each other. Although EDA tools address the majority of modern-day exploratory data analysis requirements, a significant gap has been noticed among them with respect to proactive grouping and depiction of related attributes in the data. Although some tools such as Microsoft Power BI (M. Corporation, 0000) visualize the relationships among different data-sources using entity-relationship diagrams, none of the analyzed EDA tools perform any proactive grouping among the related attributes (apart from grouping them with respect to their datatypes (Srinivasan et al., 2018)).
iv. Scalability vs. data visualization: Scalability of visual EDA systems is a known challenge (Godfrey et al., 2016). In order to deal with this challenge, many of our analyzed EDA tools (Kraska, 2018; Gratzl et al., 2014; Battle et al., 2016; Furmanova et al., 2017) have suggested several scalability measures that can visualize billions of records within an acceptable time limit. Nevertheless, the concern of scalable visual analysis is twofold: firstly, despite several existing visualization approaches, the reduced dimensions of a dataset are difficult to interpret. Secondly, the number of data points to display is often much larger than the number of pixels available in one screen (Godfrey et al., 2016). Researchers have proposed the use of data reduction techniques such as filtering, aggregation, sampling, and clustering in order to address the challenges. However, whilst data reduction techniques can solve visual scalability challenges, they can induce additional error in the analysis process. Moreover, outputs of data reduction tasks such as binned aggregation of data (Liu et al., 2013), or data split into data cubes (Wang et al., 2017) are difficult to visually interpret. Hence, there is a need for researchers to investigate more comprehensive visual techniques for data reduction.

5. Limitations and future work

In this section, we list a set of limitations of this research, which provides opportunities for future work. First of all, in this paper, we perform a comprehensive review of visual EDA tools based on a selection of 43 academic and 7 commercial tools used for general purpose data analysis. Although we precisely define and justify our selection criteria (cf. Section 2.1), many existing visual EDA tools were excluded. In order to avoid any biases in the selection criteria, we performed data source triangulation (Shull et al., 2007), where the selected tools were chosen from both academia and industry. Moreover, the academic tools were selected from multiple reputed journals and conferences. Nevertheless, as the analysis of each and every existing EDA tool is beyond the capacity of any individual research article, we had to limit the scope of this research. Future work needs to focus on extending our study and including more tools in the analysis.
Moreover, apart from the utility of the analyzed tools in each step of the EDA process, we also evaluated the tools for the extent to which they meet the list of additional exploratory requirements (cf. Section 3.1) for analyzing large industrial datasets. To elicit these additional requirements, we mapped the identified challenging aspects (cf. Section 3.1) of our analyzed industrial dataset to the known big-data analysis requirements (Chan, 2006; Liu et al., 2013; Wang et al., 2017, 2015; Biju and Mathew, 2017; Johnstone and Titterington, 2009; Im et al., 2013). In order to add more exploratory requirements in the evaluation of EDA tools, future work could perform a cross-sectional study (Shull et al., 2007) across industry and academia to identify more requirements for large scale EDA. Finally, researcher bias (Shull et al., 2007) is a known challenge (Behrisch et al., 2018) in systematic literature reviews. To avoid any kind of researcher bias, in this study a group of two researchers independently performed all the data analysis tasks. In case of conflicts among these two researchers, a third researcher stepped in to resolve the disagreements. Nevertheless, in the future, investigator triangulation (Shull et al., 2007) could be performed, where researchers from both industry and academia could collaboratively explore the utilities of different EDA tools, to generalize the decisions made during this research even further.

6. Related work

Identification of the state-of-the-art in exploratory data visualization is a well-researched area (Khan and Khan, 2011; Diamond and Mattia, 2018; Roberts, 2007; Behrisch et al., 2018). However,

a common challenge with such research is that with every new advancement in the research community, the work gets outdated quickly. Visual analysis of data is a large umbrella that spreads over several different perspectives and applications of data analysis. Numerous surveys exist that focus on identification of visualization libraries (Bikakis and Sellis, 2016), packages (Wang et al., 2015), and tools (Dunn et al., 2016) for different purposes. For example, whereas surveys (Keim, 2002) on visual data mining tools are commonly published within the research community, surveys (Chan, 2006) also exist that focus on presenting multivariate data visualization techniques. Moreover, many surveys (Wang et al., 2015; Biju and Mathew, 2017) have been performed on tools and techniques used to analyze big data. However, most of these surveys focus on specific aspects of big data analysis, such as indexing techniques for big data, or visualization of high-dimensional (Dunn et al., 2016; Liu et al., 2017a) data. Among these, whilst some of the surveys (Behrisch et al., 2018) focus on the recent advancements only in commercial data analysis tools, other surveys (Keim, 2002) look into visualization recommender systems. However, in most cases, the state-of-the-art surveys for visualization tools focus on applications of the visualization. For example, surveys exist that present visualization of biological data (Pabinger et al., 2014), visual sentiment analysis tools (Kucher et al., 2018), or visualization of meteorological data (Rautenhaus et al., 2017). In recent years, researchers have been focusing on combinations of visualization techniques and machine learning models (Endert et al., 2017) to enhance interpretability of the machine learning process (Endert et al., 2017). Surveys presented by Liu et al. (2017b) and Endert et al. (2017) focus on techniques that are used to integrate machine learning and visual analytics together. Several surveys (Slater et al., 2017) have been performed by researchers that classify visualization tools based on their utilities with respect to domain specific data analysis steps (Roberts, 2007). However, unlike existing surveys on visualization tools, this work focuses on 50 visual EDA tools that are used for exploration of tabular data, and were developed within the last 5 years. Our novel analysis examines the existing tools for their abilities to assist with each step of exploratory data visualization of large industrial datasets.

7. Conclusions

In this research, we identify the primary focus areas of visually exploring industrial tabular datasets by analyzing a real-world dataset of 3.4 million records. Later, we present a systematic literature review of 50 state-of-the-art visual data analytics tools and their utility in six distinct steps of the Exploratory Data Analysis (EDA) process. We also investigate the extent to which these modern visual EDA tools address scalability, interpretability, and analytical expertise challenges of analyzing large datasets. Our analysis shows that most modern EDA tools assist with the fundamental steps of the EDA process, whilst only some tools consider addressing the challenges of big-data analytics. Among the analyzed tools, however, the trade-off between breadth of supported features and in-depth analysis of data still remains. Even the most advanced tools in both academia and industry do not depict complex multivariate relationships among attributes. The reason behind this is that most tabular data analysis tools are primarily designed for a generic audience who might need more training to perform complex statistical analysis with the data. Moreover, some academic EDA tools that perform factor analysis or use complex diagrams to show relationships between multiple attributes often suffer from interpretability and scalability issues. Incorporation of domain expertise is another challenge in most modern EDA tools, as in most cases, for both commercial and academic tools, the user gets to take only the viewer's role in the data analysis process. Especially for the EDA tools that proactively generate visual recommendations, the absence of any feedback process can cause users to lose their confidence in the suggestions provided by the tools. Overall, we think there are many research opportunities in this emerging field that can be looked into for enhancing the performance and user experience of visual EDA tools.

References

Amar, R., Eagan, J., Stasko, J., 2005. Low-level components of analytic activity in information visualization. In: IEEE Symposium on Information Visualization (INFOVIS). IEEE, pp. 111–117. http://dx.doi.org/10.1109/INFVIS.2005.1532136.
Anderson, F., 2012. Getting Started Tutorial for IBM Watson Analytics. IBM Corporation.
Battle, L., Chang, R., Stonebraker, M., 2016. Dynamic prefetching of data tiles for interactive visualization. In: Proceedings of the 2016 International Conference on Management of Data. ACM, pp. 1363–1375. http://dx.doi.org/10.1145/2882903.2882919.
Behrisch, M., Streeb, D., Stoffel, F., Seebacher, D., Matejek, B., Weber, S.H., Mittelstaedt, S., Pfister, H., Keim, D., 2018. Commercial visual analytics systems-advances in the big data analytics field. IEEE Trans. Vis. Comput. Graphics http://dx.doi.org/10.1109/TVCG.2018.2859973.
Biju, S.M., Mathew, A., 2017. Comparative Analysis of Selected Big Data Analytics Tools. University of Wollongong.
Bikakis, N., Sellis, T., 2016. Exploration and visualization in the web of big linked data: A survey of the state of the art, arXiv preprint arXiv:1601.08059.
Bro, R., Smilde, A.K., 2014. Principal component analysis. Anal. Methods 6 (9), 2812–2831. http://dx.doi.org/10.1016/0169-7439(87)80084-9.
Budiu, M., Isaacs, R., Murray, D., Plotkin, G., Barham, P., Al-Kiswany, S., Boshmaf, Y., Luo, Q., Andoni, A., 2015. Interacting with large distributed datasets using Sketch. In: Eurographics Symposium on Parallel Graphics and Visualization. University of Wisconsin-Madison. http://dx.doi.org/10/f3tvr9.
Chan, W.W.-Y., 2006. A survey on multivariate data visualization. Dep. Comput. Sci. Eng. Hong Kong Univ. Sci. Technol. 8 (6), 1–29.
Cui, Z., Badam, S.K., Yalçin, A., Elmqvist, N., 2018. DataSite: Proactive Visual Data Exploration with Computation of Insight-based Recommendations, arXiv preprint arXiv:1802.08621.
D. Inc. Domo, URL https://www.domo.com/, Accessed: 2018-10-20.
Demiralp, C., Haas, P.J., Parthasarathy, S., Pedapati, T., 2017. Foresight: Rapid Data Exploration Through Guideposts, arXiv preprint arXiv:1709.10513.
Dhamdhere, K., McCurley, K.S., Nahmias, R., Sundararajan, M., Yan, Q., 2017. Analyza: Exploring data with conversation. In: Proceedings of the 22nd International Conference on Intelligent User Interfaces. ACM, pp. 493–504. http://dx.doi.org/10.1145/3025171.3025227.
Diamond, M., Mattia, A., 2018. Data Visualization: An Exploratory Study into the Software Tools Used by Businesses. J. Instr. Pedagogies 18.
Dunn, Jr., W., Burgun, A., Krebs, M.-O., Rance, B., 2016. Exploring and visualizing multidimensional data in translational research platforms. Brief. Bioinform. 18 (6), 1044–1056. http://dx.doi.org/10.1093/bib/bbw080.
El-Hindi, M., Zhao, Z., Binnig, C., Kraska, T., 2016. VisTrees: fast indexes for interactive data exploration. In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics. San Francisco, CA, USA, pp. 5–11. http://dx.doi.org/10.1145/2939502.2939507.
Endert, A., Ribarsky, W., Turkay, C., Wong, B.W., Nabney, I., Blanco, I.D., Rossi, F., 2017. The state of the art in integrating machine learning into visual analytics. In: Computer Graphics Forum (36). Wiley Online Library, pp. 458–486. http://dx.doi.org/10.1111/cgf.13092.
Fan, J., Li, R., 2006. Statistical challenges with high dimensionality: Feature selection in knowledge discovery, arXiv preprint math/0602133.
Feng, M., Deng, C., Peck, E.M., Harrison, L., 2017. HindSight: Encouraging exploration through direct encoding of personal interaction history. IEEE Trans. Vis. Comput. Graphics 23 (1), 351–360. http://dx.doi.org/10.1109/TVCG.2016.2599058.
Furmanova, K., Gratzl, S., Stitz, H., Zichner, T., Jaresova, M., Ennemoser, M., Lex, A., Streit, M., 2017. Taggle: Scalable visualization of tabular data through aggregation, arXiv preprint arXiv:1712.05944.
Gartner. Inc. Gartner Customer Choice Awards - Analytics and Business Intelligence Platform, URL: https://www.gartner.com/reviews/customer-choice-awards/analytics-business-intelligence-platforms//, Accessed: 2018-10-25.
Godfrey, P., Gryz, J., Lasek, P., 2016. Interactive visualization of large data sets. IEEE Trans. Knowl. Data Eng. 28 (8), 2142–2157. http://dx.doi.org/10.1109/TKDE.2016.2557324.
Gratzl, S., Gehlenborg, N., Lex, A., Pfister, H., Streit, M., 2014. Domino: Extracting, comparing, and manipulating subsets across multiple tabular datasets. IEEE Trans. Vis. Comput. Graphics (1), http://dx.doi.org/10.1109/TVCG.2014.2346260, 1-1.
Gratzl, S., Lex, A., Gehlenborg, N., Pfister, H., Streit, M., 2013. Lineup: Visual analysis of multi-attribute rankings. IEEE Trans. Vis. Comput. Graphics 19 (12), 2277–2286. http://dx.doi.org/10.1109/TVCG.2013.173.
Heer, J., Shneiderman, B., 2012. Interactive dynamics for visual analysis. Queue 10 (2), 30. http://dx.doi.org/10.1145/2133416.2146416.
252 A. Ghosh, M. Nashaat, J. Miller et al. / Visual Informatics 2 (2018) 235–253

High, R., 2012. The era of Cognitive Systems: An Inside look at IBM Watson and how Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N., Wald, R.,
it Works. IBM Corporation, Redbooks. Muharemagic, E., 2015. Deep learning applications and challenges in big data
Hourieh, R., Stitz, H., Gehlenborg, N., Streit, M., 2016. TaCo: comparative visualiza- analytics. J. Big Data 2 (1), 1. http://dx.doi.org/10.1186/s40537-014-0007-7.
tion of large tabular data. In: Proceedings of the Eurographics IEEE EuroVis, (1). Niederer, C., Stitz, H., Hourieh, R., Grassinger, F., Aigner, W., Streit, M., 2018. Taco:
p. 1. Visualizing changes in tables over time. IEEE Trans. Vis. Comput. Graphics (1),
Idreos, S., Papaemmanouil, O., Chaudhuri, S., 2015. Overview of data exploration http://dx.doi.org/10.1109/TVCG.2017.2745298, 1-1.
techniques. In: Proceedings of the 2015 ACM SIGMOD International Confer- Pabinger, S., Dander, A., Fischer, M., Snajder, R., Sperk, M., Efremova, M., Krabich-
ence on Management of Data. ACM, pp. 277–281. http://dx.doi.org/10.1145/ ler, B., Speicher, M.R., Zschocke, J., Trajanoski, Z., 2014. A survey of tools for
2723372.2731084. variant analysis of next-generation genome sequencing data. Brief. Bioinform.
Im, J.-F., Villegas, F.G., McGuffin, M.J., 2013. VisReduce: Fast and responsive in- 15 (2), 256–278. http://dx.doi.org/10.1093/bib/bbs086.
cremental information visualization of large datasets. In: IEEE International Perin, C., Dragicevic, P., Fekete, J.-D., 2014. Revisiting bertin matrices: New inter-
Conference on Big Data. IEEE, 101109/BigData.20136691710, pp. 25–32. actions for crafting tabular visualizations. IEEE Trans. Vis. Comput. Graphics 20
Isaacs, E., Damico, K., Ahern, S., Bart, E., Singhal, M., 2014. Footprints: A visual search (12), 2082–2091. http://dx.doi.org/10.1109/TVCG.2014.2346279.
tool that supports discovery and coverage tracking. IEEE Trans. Vis. Comput. Perry, D.B., Howe, B., Key, A.M., Aragon, C., 2013. VizDeck: Streamlining exploratory
Graphics 20 (12), 1793–1802. http://dx.doi.org/10.1109/TVCG.2014.2346743. visual analytics of scientific data, http://hdl.h{and}le.net/2142/36044.
Iyer, G., DuttaDuwarah, S., Sharma, A., 2017. DataScope: Interactive visual ex- Qlik, Qlik View, URL https://www.qlik.com/us/, Accessed: 2018-10-20.
ploratory dashboards for large multidimensional data. In: IEEE Workshop on Rautenhaus, M., Böttinger, M., Siemen, S., Hoffman, R., Kirby, R.M., Mirzargar, M.,
Visual Analytics in Healthcare (VAHC). IEEE, 101109/VAHC20178387496, pp. Röber, N., Westermann, R., 2017. Visualization in meteorology-a survey of
17–23. techniques and tools for data analysis tasks. IEEE Trans. Vis. Comput. Graphics
Javed, W., Elmqvist, N., 2013. ExPlates: spatializing interactive analysis to scaffold early-view, http://dx.doi.org/10.1109/TVCG.2017.2779501.
visual exploration. In: Computer Graphics Forum, 32. Wiley Online Library, pp. Ren, D., Brehmer, M., Lee, B., Hollerer, T., Choe, E.K., et al., 2017. ChartAccent:
441–450. http://dx.doi.org/10.1111/cgf.12131. Annotation for data-driven storytelling. In: 2017 IEEE Pacific Visualization
Johnstone, I.M., Titterington, D.M., 2009. Statistical challenges of high-dimensional Symposium (PacificVis). IEEE, pp. 230–239, http://doi.ieeecomputersociety.org/
data. Philos. Trans. R. Soc. 367, 4237–4253. http://dx.doi.org/10.1098/rsta.2009. 10.1109/PACIFICVIS.2017.8031599.
0159. Ren, D., Hollerer, T., Yuan, X., 2014. iVisDesigner: Expressive interactive design of
Kaisler, S., Armour, F., Espinosa, J.A., Money, W., 2013. Big data: Issues and chal- information visualizations. IEEE Trans. Vis. Comput. Graphics (1), http://dx.doi.
lenges moving forward. In: 46th Hawaii International Conference on System org/10.1109/TVCG.2014.2346291, 1-1.
Sciences (HICSS). pp. 995–1004. http://dx.doi.org/10.1109/HICSS.2013.645. Roberts, J.C., 2007. State of the art: Coordinated & multiple views in exploratory
Kamat, N., Nandi, A., 2014. InfiniViz: Interactive Visual Exploration using Progres- visualization. In: Fifth International Conference on Coordinated and Multiple
sive Bin Refinement, arXiv preprint arXiv:1710.01854. Views in Exploratory Visualization CMV’07. IEEE, 101109/CMV200720, pp. 61–
Keim, D.A., 2002. Information visualization and visual data mining. IEEE Trans. Vis. 71.
Comput. Graphics (1), 1–8, http://doi.ieeecomputersociety.org/10.1109/2945. Sallam, R.L., Tapadinhas, J., Parenteau, J., Yuen, D., Hostmann, B., 2014. Magic
981847. quadrant for business intelligence and analytics platforms, Gartner RAS core
Kelly, J.E., 2015. Computing, cognition and the future of knowing, Whitepaper. IBM research notes. Gartner, Stamford, CT.
Res. 2. Satyanarayan, A., Heer, J., 2014a. Authoring narrative visualizations with ellipsis.
Khan, M., Khan, S.S., 2011. Data and information visualization methods, and inter- In: Computer Graphics Forum, 33. Wiley Online Library, pp. 361–370. http:
active mechanisms: A survey. J. Comput. Appl. 34 (1), 1–14. //dx.doi.org/10.1111/cgf.12392.
Kitchenham, B., Brereton, O.P., Budgen, D., Turner, M., Bailey, J., Linkman, S., 2009. Satyanarayan, A., Heer, J., 2014b. Lyra: An interactive visualization design environ-
Systematic literature reviews in software engineering a systematic literature ment. In: Computer Graphics Forum, 33. Wiley Online Library, pp. 351–360.
review. Inf. Softw. Technol. 51 (1), 7–15. http://dx.doi.org/10.1016/j.infsof.2008. http://dx.doi.org/10.1111/cgf.12391.
09.009. Shull, F., Singer, J., Sjøberg, D.I., 2007. Guide to Advanced Empirical Software Engi-
Koytek, P., Perin, C., Vermeulen, J., André, E., Carpendale, S., 2018. Mybrush: Brush- neering. Springer.
ing and linking with personal agency. IEEE Trans. Vis. Comput. Graphics 24 (1), Siddiqui, T., Kim, A., Lee, J., Karahalios, K., Parameswaran, A., 2016. Effortless data ex-
605–615. http://dx.doi.org/10.1109/TVCG.2017.2743859. ploration with zenvisage: an expressive and interactive visual analytics system.
Kraska, T., 2018. Northstar: An interactive data science system. Proc. VLDB Endow- Proc. VLDB Endowment 10 (4), 457–468. http://dx.doi.org/10.14778/3025111.
ment 11 (12), 2150–2164. http://dx.doi.org/10.14778/3229863.3240493. 3025126.
Kucher, K., Paradis, C., Kerren, A., 2018. The state of the art in sentiment visual- Sisense, Sisense, URL https://www.sisense.com/product/, Accessed: 2018-10-20.
ization. In: Computer Graphics Forum, 37. Wiley Online Library, pp. 71–96. Slater, S., Joksimović, S., Kovanovic, V., Baker, R.S., Gasevic, D., 2017. Tools for
http://dx.doi.org/10.1111/cgf.13217. educational data mining: A review. J. Educ. Behav. Statist. 42 (1), 85–106. http:
Law, P., Basole, R.C., Wu, Y., 2018. Duet: Helping Data Analysis Novices Conduct Pair- //dx.doi.org/10.3102/1076998616666808.
wise Comparisons by Minimal Specification. IEEE Trans. Vis. Comput. Graphics Smith, M.J.D., 2018. Statistical Analysis Handbook, Edinburgh. The Winchelsea
http://dx.doi.org/10.1109/TVCG.2018.2864526, 1-1, early access. Press, Drumlin Security Ltd.
Lee, B., Kazi, R.H., Smith, G., 2013. SketchStory: Telling more engaging stories with Software, T., Tableau, URL https://www.tableau.com/products, Accessed: 2018-10-
data through freeform sketching. IEEE Trans. Vis. Comput. Graphics 19 (12), 20.
2416–2425. http://dx.doi.org/10.1109/TVCG.2013.191. Srinivasan, A., Drucker, S.M., Endert, A., Stasko, J., 2018. Augmenting visualizations
Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R., ster, H.P., 2014. UpSet: visualiza- with interactive data facts to facilitate interpretation and communication. IEEE
tion of intersecting sets. IEEE Trans. Vis. Comput. Graphics 20 (12), 1983–1992. Trans. Vis. Comput. Graphics http://dx.doi.org/10.1109/TVCG.2018.2865145, 1-
http://dx.doi.org/10.1109/TVCG.2014.2346248. 1, early access.
Lin, H., Gao, S., Gotz, D., Du, F., He, J., Cao, N., 2018. Rclens: Interactive rare category Stolper, C.D., Perer, A., Gotz, D., 2014. Progressive visual analytics: User-driven
exploration and identification. IEEE Trans. Vis. Comput. Graphics 24 (7), 2223– visual exploration of in-progress analytics. IEEE Trans. Vis. Comput. Graphics
2237. http://dx.doi.org/10.1109/TVCG.2017.2711030. 20 (12), 1653–1662. http://dx.doi.org/10.1109/TVCG.2014.2346574.
Liu, Z., Jiang, B., Heer, J., 2013. imMens: real-time visual querying of big data. In: Tufféry, S., 2011. Data Mining and Statistics for Decision Making, Vol. 2. Wiley
Computer Graphics Forum, 32. Wiley Online Library, 101111/cgf.12129, pp. Chichester.
421–430. Vartak, M., Rahman, S., Madden, S., Parameswaran, A., Polyzotis, N., 2015. SeeDB:
Liu, S., Maljovec, D., Wang, B., Bremer, P.-T., Pascucci, V., 2017a. Visual- efficient data-driven visualization recommendations to support visual ana-
izing high-dimensional data: Advances in the past decade. IEEE Trans. lytics. Proc. VLDB Endowment 8 (13), 2182–2193. http://dx.doi.org/10.14778/
Vis. Comput. Graphics (3), 1249–1268, http://doi.ieeecomputersociety.org/ 2831360.2831371.
10.1109/TVCG.2016.2640960. Wall, E., Das, S., Chawla, R., Kalidindi, B., Brown, E.T., Endert, A., 2018. Podium:
Liu, S., Wang, X., Liu, M., Zhu, J., 2017b. Towards better analysis of machine learning Ranking data using mixed-initiative visual analytics. IEEE Trans. Vis. Comput.
models: A visual analytics perspective. Visual Inform. 1 (1), 48–56. http://dx. Graphics 24 (1), 288–297. http://dx.doi.org/10.1109/TVCG.2017.2745078.
doi.org/10.1016/j.visinf.2017.01.006Get. Wang, Z., Ferreira, N., Wei, Y., Bhaskar, A.S., Scheidegger, C., 2017. Gaussian cubes:
M. Corporation, Microsoft Power BI, URL https://powerbi.microsoft.com//, Ac- Real-time modeling for visual exploration of large multidimensional datasets.
cessed: 2018-10-20. IEEE Trans. Vis. Comput. Graphics 23 (1), 681–690. http://dx.doi.org/10.1109/
Macke, S., Zhang, Y., Huang, S., Parameswaran, A., 2018. Adaptive sampling for TVCG.2016.2598694.
rapidly matching histograms. Proc. VLDB Endowment 11 (10), 1262–1275. http: Wang, L., Wang, G., Alexander, C.A., 2015. Big data and visualization: methods,
//dx.doi.org/10.14778/3231751.3231753. challenges and technology progress. Digital Technol. 1 (1), 33–38. http://dx.doi.
Mei, H., Chen, W., Ma, Y., Guan, H., Hu, W., 2018. VisComposer: A Visual Pro- org/10.12691/dt-1-1-7.
grammable Composition Environment for Information Visualization. Visual Wang, Y., Zhang, H., Huang, H., Chen, X., Yin, Q., Hou, Z., Zhang, D., Luo, Q., Qu, H.,
Inform. 2 (1), 71–81. http://dx.doi.org/10.1016/j.visinf.2018.04.008. 2018. InfoNice: easy creation of information graphics. In: Proceedings of the
Mokalis, A.L., Davis, J.J., 2018. Google Analytics Demystified. CreateSpace Indepen- CHI Conference on Human Factors in Computing Systems. ACM, p. 335. http:
dent Publishing Platform. //dx.doi.org/10.1145/3173574.2018.3173909.
A. Ghosh, M. Nashaat, J. Miller et al. / Visual Informatics 2 (2018) 235–253 253

Wongsuphasawat, K., Qu, Z., Moritz, D., Chang, R., Ouk, F., Anand, A., Mackinlay, J., Yu, B., Silva, C.T., 2017. VisFlow-Web-based visualization framework for tabular
Howe, B., Heer, J., 2017. Voyager 2: Augmenting visual analysis with partial view data with a subset flow model. IEEE Trans. Vis. Comput. Graphics 23 (1), 251–
specifications. In: Proceedings of the 2017 CHI Conference on Human Factors in 260. http://dx.doi.org/10.1109/TVCG.2016.2598497.
Computing Systems. ACM, pp. 2648–2659. http://dx.doi.org/10.1145/3025453. Zgraggen, E., Zeleznik, R., Drucker, S.M., 2014. PanoramicData: Data analysis
3025768. through pen & touch. IEEE Trans. Vis. Comput. Graphics 20 (12), 2112–2121.
Xia, J., Chen, W., Hou, Y., Hu, W., Huang, X., Ebertk, D.S., 2016. DimScanner: A http://dx.doi.org/10.1109/TVCG.2014.2346293.
relation-based visual exploration approach towards data dimension inspection. Zhao, J., Collins, C., Chevalier, F., Balakrishnan, R., 2013. Interactive exploration
In: IEEE Conference on Visual Analytics Science and Technology (VAST). IEEE, pp. of implicit and explicit relations in faceted datasets. IEEE Trans. Vis. Comput.
81–90. http://dx.doi.org/10.1109/VAST.2016.7883514. Graphics 19 (12), 2080–2089. http://dx.doi.org/10.1109/TVCG.2013.167.
Yalçin, M.A., Elmqvist, N., Bederson, B.B., 2016. AggreSet: Rich and scalable set ex- Zuur, A.F., Ieno, E.N., Elphick, C.S., 2010. A protocol for data exploration to avoid
ploration using visualizations of element aggregations. IEEE Trans. Vis. Comput. common statistical problems. Methods Ecology Evol. 1 (1), 3–14. http://dx.doi.
Graphics 22 (1), 688–697. http://dx.doi.org/10.1109/TVCG.2015.2467051. org/10.1111/j.2041-210X.2009.00001.x.
Yalçin, M.A., Elmqvist, N., Bederson, B.B., 2018. Keshif: Rapid and expressive tabular
data exploration for novices. IEEE Trans. Vis. Comput. Graphics 24 (8), 2339–
2352. http://dx.doi.org/10.1109/TVCG.2017.2723393.
