Weihs & Ickstadt 2018
https://doi.org/10.1007/s41060-018-0102-5
REGULAR PAPER
Received: 20 March 2017 / Accepted: 25 January 2018 / Published online: 16 February 2018
© The Author(s) 2018. This article is an open access publication
Abstract
In this paper, we substantiate our premise that statistics is one of the most important disciplines providing tools and methods to find structure in and to give deeper insight into data, and the most important discipline for analyzing and quantifying uncertainty. We give an overview of different proposed structures of Data Science and address the impact of statistics on steps such as data acquisition and enrichment, data exploration, data analysis and modeling, validation, and representation and reporting. We also point out fallacies that arise when statistical reasoning is neglected.
Keywords Structures of data science · Impact of statistics on data science · Fallacies in data science
190 International Journal of Data Science and Analytics (2018) 6:189–194
insight into data, and the most important discipline to analyze and quantify uncertainty.

This paper aims at addressing the major impact of statistics on the most important steps in Data Science.

2 Steps in data science

One of the forerunners of Data Science from a structural perspective is the famous CRISP-DM (Cross Industry Standard Process for Data Mining), which is organized in six main steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment [10], see Table 1, left column. Ideas like CRISP-DM are now fundamental for applied statistics.

In our view, the main steps in Data Science have been inspired by CRISP-DM and have evolved, leading, e.g., to our definition of Data Science as a sequence of the following steps: Data Acquisition and Enrichment, Data Storage and Access, Data Exploration, Data Analysis and Modeling, Optimization of Algorithms, Model Validation and Selection, Representation and Reporting of Results, and Business Deployment of Results. Note that topics in small capitals indicate steps where statistics is less involved, cp. Table 1, right column.

Usually, these steps are not just conducted once but are iterated in a cyclic loop. In addition, it is common to alternate between two or more steps. This holds especially for the steps Data Acquisition and Enrichment, Data Exploration, and Statistical Data Analysis, as well as for Statistical Data Analysis and Modeling and Model Validation and Selection.

Table 1 compares different definitions of steps in Data Science. The relationship of terms is indicated by horizontal blocks. The missing step Data Acquisition and Enrichment in CRISP-DM indicates that this scheme deals with observational data only. Moreover, in our proposal, the steps Data Storage and Access and Optimization of Algorithms, where statistics is less involved, are added to CRISP-DM.

The list of steps for Data Science may be enlarged even further; see, e.g., Cao [12], Figure 6, cp. also Table 1, middle column, for the following recent list: Domain-specific Data Applications and Problems, Data Storage and Management, Data Quality Enhancement, Data Modeling and Representation, Deep Analytics, Learning and Discovery, Simulation and Experiment Design, High-performance Processing and Analytics, Networking, Communication, and Data-to-Decision and Actions.

In principle, Cao's proposal and ours cover the same main steps. In parts, however, Cao's formulation is more detailed; e.g., our step Data Analysis and Modeling corresponds to Data Modeling and Representation, Deep Analytics, Learning and Discovery. Also, the vocabularies differ slightly, depending on whether the respective background is computer science or statistics. In that respect, note that Experiment Design in Cao's definition means the design of the simulation experiments.

In what follows, we highlight the role of statistics by discussing all the steps in which it is heavily involved, in Sects. 2.1–2.6. These coincide with all steps in our proposal in Table 1 except the steps in small capitals.
Table 1 Steps in Data Science: comparison of CRISP-DM (Cross Industry Standard Process for Data Mining), Cao's definition, and our proposal

CRISP-DM | Cao's definition | Our proposal
Business Understanding | Domain-specific Data Applications and Problems | Data Acquisition and Enrichment (cp. Sect. 2.1)
Data Understanding, Data Preparation | Data Quality Enhancement | Data Exploration (cp. Sect. 2.2)
Modeling | Data Modeling and Representation, Deep Analytics, Learning and Discovery | Data Analysis and Modeling (cp. Sects. 2.3, 2.4)
Evaluation | Simulation and Experiment Design | Model Validation and Selection (cp. Sect. 2.5)
Deployment | Networking, Communication | Representation and Reporting of Results (cp. Sect. 2.6)
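Viewed as a process, the cyclic iteration of these steps can be sketched in a few lines of code. The following minimal Python sketch is purely illustrative (ours, not part of any cited scheme); in particular, the stopping criterion is a hypothetical placeholder, since in practice the loop ends when, e.g., Model Validation is satisfactory:

```python
# Sketch: the Data Science steps of our proposal, iterated in a cyclic loop.
# The stopping criterion below is a placeholder; in practice it would be a
# problem-specific check, e.g., on model quality during validation.

STEPS = [
    "Data Acquisition and Enrichment",
    "Data Storage and Access",                  # mainly informatics
    "Data Exploration",
    "Data Analysis and Modeling",
    "Optimization of Algorithms",               # mainly informatics
    "Model Validation and Selection",
    "Representation and Reporting of Results",
    "Business Deployment of Results",           # business management
]

def iterate_steps(is_satisfactory, max_cycles=10):
    """Run the steps cyclically until the (placeholder) criterion holds."""
    visited = []
    for cycle in range(max_cycles):
        for step in STEPS:
            visited.append(step)
        if is_satisfactory(cycle):
            break
    return visited

# E.g., stop after the second full cycle:
trace = iterate_steps(lambda cycle: cycle >= 1)
```

In real projects the loop also alternates between adjacent steps rather than always completing a full cycle, as described above.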
The corresponding entries Data Storage and Access and Optimization of Algorithms are mainly covered by informatics and computer science, whereas Business Deployment of Results is covered by business management.

2.1 Data acquisition and enrichment

Design of experiments (DOE) is essential for the systematic generation of data when the effect of noisy factors has to be identified. Controlled experiments are fundamental for robust process engineering to produce reliable products despite variation in the process variables. On the one hand, even controllable factors contain a certain amount of uncontrollable variation that affects the response. On the other hand, some factors, like environmental factors, cannot be controlled at all. Nevertheless, at least the effect of such noisy influencing factors should be controlled by, e.g., DOE.

DOE can be utilized, e.g.,

– to systematically generate new data (data acquisition) [33],
– to systematically reduce data bases [41], and
– to tune (i.e., optimize) parameters of algorithms [1], i.e., to improve the data analysis methods themselves (see Sect. 2.3).

Simulations [7] may also be used to generate new data. A tool for the enrichment of data bases to fill data gaps is the imputation of missing data [31].

Such statistical methods for data generation and enrichment need to be part of the backbone of Data Science. The exclusive use of observational data without any noise control distinctly diminishes the quality of data analysis results and may even lead to wrong interpretations of the results. The hope for "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" [4] appears to be misplaced because of the noise in the data.

Thus, experimental design is crucial for the reliability, validity, and replicability of our results.

2.2 Data exploration

Exploratory statistics is essential for data preprocessing to learn about the contents of a data base. Exploration and visualization of observed data were, in a way, initiated by John Tukey [43]. Since that time, the most laborious part of data analysis, namely data understanding and transformation, has become an important part of statistical science.

Data exploration or data mining is fundamental for the proper usage of analytical methods in Data Science. The most important contribution of statistics is the notion of distribution. It allows us to represent variability in the data as well as (a-priori) knowledge of parameters, the concept underlying Bayesian statistics. Distributions also enable us to choose adequate subsequent analytic models and methods.

2.3 Statistical data analysis

Finding structure in data and making predictions are the most important steps in Data Science. Here, in particular, statistical methods are essential, since they are able to handle many different analytical tasks. Important examples of statistical data analysis methods are the following.

a) Hypothesis testing is one of the pillars of statistical analysis. Questions arising in data-driven problems can often be translated into hypotheses. Also, hypotheses are the natural links between underlying theory and statistics. Since statistical hypotheses are related to statistical tests, questions and theory can be tested on the available data. Multiple usage of the same data in different tests often makes it necessary to correct the significance levels. In applied statistics, correct multiple testing is one of the most important problems, e.g., in pharmaceutical studies [15]. Ignoring such techniques leads to many more significant results than are justified.

b) Classification methods are basic for finding and predicting subpopulations in data. In the so-called unsupervised case, such subpopulations are to be found from a data set without a-priori knowledge of any cases of such subpopulations. This is often called clustering.
In the so-called supervised case, classification rules are to be learned from a labeled data set for the prediction of unknown labels when only the influential factors are available.
Nowadays, there is a plethora of methods for the unsupervised [22] as well as for the supervised case [2]. In the age of Big Data, a new look at the classical methods appears necessary, though, since most of the time the computational effort of complex analysis methods grows faster than linearly with the number of observations n or the number of features p. In the case of Big Data, i.e., if n or p is large, this leads to excessive computation times and to numerical problems. This results both in the comeback of simpler optimization algorithms with low time complexity [9] and in the re-examination of traditional methods in statistics and machine learning for Big Data [46].

c) Regression methods are the main tool for finding global and local relationships between features when the target variable is measured. Depending on the distributional assumption for the underlying data, different approaches may be applied. Under the normality assumption, linear regression is the most common method, while generalized linear regression is usually employed for other distributions from the exponential family [18].
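As a minimal numerical illustration of the linear case (our own sketch on simulated data, not an analysis from any study cited here), ordinary least squares recovers the coefficients under the normality assumption:

```python
# Sketch: linear regression under the normality assumption, fitted by
# ordinary least squares on simulated data (illustration only).
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)   # normal errors

# Add an intercept column and solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# beta_hat[0] estimates the intercept (here 0), beta_hat[1:] the slopes.
```

For a response from another exponential-family distribution, the analogous step would fit a generalized linear model instead of solving a least-squares problem.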
More advanced methods comprise functional regression for functional data [38], quantile regression [25], and regression based on loss functions other than squared error loss, e.g., Lasso regression [11,21].
In the context of Big Data, the challenges are similar to those for classification methods, given large numbers of observations n (e.g., in data streams) and/or large numbers of features p. For reducing n, data reduction techniques like compressed sensing, random projection methods [20], or sampling-based procedures [28] enable faster computations. For decreasing the number p to the most influential features, variable selection or shrinkage approaches like the Lasso [21] can be employed, keeping the features interpretable. (Sparse) principal component analysis [21] may also be used.

d) Time series analysis aims at understanding and predicting temporal structure [42]. Time series are very common in studies of observational data, and prediction is the most important challenge for such data. Typical application areas are the behavioral sciences and economics, as well as the natural sciences and engineering. As an example, consider signal analysis, e.g., the analysis of speech or music data. Here, statistical methods comprise the analysis of models in the time and frequency domains. The main aim is the prediction of future values of the time series itself or of its properties. For example, the vibrato of an audio time series might be modeled in order to realistically predict the tone in the future [24], and the fundamental frequency of a musical tone might be predicted by rules learned from elapsed time periods [29].
In econometrics, multiple time series and their co-integration are often analyzed [27]. In technical applications, process control is a common aim of time series analysis [34].

2.4 Statistical modeling

(a) Complex interactions between factors can be modeled by graphs or networks. Here, an interaction between two factors is modeled by a connection in the graph or network [26,35]. The graphs can be undirected, as, e.g., in Gaussian graphical models, or directed, as, e.g., in Bayesian networks.
The main goal in network analysis is deriving the network structure. Sometimes, it is necessary to separate (unmix) subpopulation-specific network topologies [49].

(b) Stochastic differential and difference equations can represent models from the natural and engineering sciences [3,39]. Finding approximate statistical models that solve such equations can lead to valuable insights, e.g., for the statistical control of such processes in mechanical engineering [48]. Such methods can build a bridge between the applied sciences and Data Science.

(c) Local models and globalization. Typically, statistical models are only valid in sub-regions of the domain of the involved variables. Then, local models can be used [8]. The analysis of structural breaks can be basic for identifying the regions for local modeling in time series [5]. Also, the analysis of concept drifts can be used to investigate model changes over time [30].
In time series, there are often hierarchies of more and more global structures. For example, in music, a basic local structure is given by the notes, and more and more global ones by bars, motifs, phrases, parts, etc. In order to find global properties of a time series, properties of the local models can be combined into more global characteristics [47].
Mixture models can also be used for the generalization of local to global models [19,23]. Model combination is essential for the characterization of real relationships, since standard mathematical models are often much too simple to be valid for heterogeneous data or bigger regions of interest.

2.5 Model validation and model selection

In cases where more than one model is proposed for, e.g., prediction, statistical tests for comparing the models are helpful to structure them, e.g., concerning their predictive power [45].
Predictive power is typically assessed by means of so-called resampling methods, in which the distribution of power characteristics is studied by artificially varying the subpopulation used to learn the model. Characteristics of such distributions can be used for model selection [7].
Perturbation experiments offer another possibility to evaluate the performance of models. In this way, the stability of the different models against noise is assessed [32,44].
Meta-analysis as well as model averaging are methods to evaluate combined models [13,14].
Model selection has become more and more important in recent years, since the number of classification and regression models proposed in the literature has been increasing at ever higher speed.

2.6 Representation and reporting

Visualization, in order to interpret the structures found, and the storage of models in an easy-to-update form are very important tasks in statistical analyses, needed to communicate the results and to safeguard data analysis deployment. Deployment is decisive for obtaining interpretable results in Data Science. It is the last
step in CRISP-DM [10] and underlies the data-to-decision and action step in Cao [12].
Besides visualization and adequate model storage, the main task for statistics is the reporting of uncertainties and review [6].

… to a more suitable distribution as well as resampling methods for predictive power are important.
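One simple resampling scheme for reporting such uncertainties is the percentile bootstrap. The following minimal sketch (ours, on simulated data, using only the Python standard library) computes a bootstrap confidence interval for a mean:

```python
# Sketch: percentile bootstrap confidence interval for a mean,
# a simple example of reporting uncertainty via resampling
# (illustration only; the data are simulated).
import random
import statistics

random.seed(7)
data = [random.gauss(10.0, 2.0) for _ in range(100)]

def bootstrap_ci(sample, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean of `sample`."""
    means = []
    for _ in range(n_boot):
        resample = random.choices(sample, k=len(sample))  # draw with replacement
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

lo, hi = bootstrap_ci(data)
# The interval (lo, hi) quantifies the uncertainty of the estimated mean.
```

The same resampling idea underlies the assessment of predictive power discussed in Sect. 2.5: the model is refitted on artificially varied subpopulations, and the distribution of its performance is reported.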
4 Conclusion
References

9. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 (2016)
10. Brown, M.S.: Data Mining for Dummies. Wiley, London (2014)
11. Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)
12. Cao, L.: Data science: a comprehensive overview. ACM Comput. Surv. (2017). https://doi.org/10.1145/3076253
13. Claeskens, G., Hjort, N.L.: Model Selection and Model Averaging. Cambridge University Press, Cambridge (2008)
14. Cooper, H., Hedges, L.V., Valentine, J.C.: The Handbook of Research Synthesis and Meta-analysis. Russell Sage Foundation, New York City (2009)
15. Dmitrienko, A., Tamhane, A.C., Bretz, F.: Multiple Testing Problems in Pharmaceutical Statistics. Chapman and Hall/CRC, London (2009)
16. Donoho, D.: 50 Years of Data Science. http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf (2015)
17. Dyk, D.V., Fuentes, M., Jordan, M.I., Newton, M., Ray, B.K., Lang, D.T., Wickham, H.: ASA Statement on the Role of Statistics in Data Science. http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/ (2015)
18. Fahrmeir, L., Kneib, T., Lang, S., Marx, B.: Regression: Models, Methods and Applications. Springer, Berlin (2013)
19. Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, Berlin (2006)
20. Geppert, L., Ickstadt, K., Munteanu, A., Quedenfeld, J., Sohler, C.: Random projections for Bayesian regression. Stat. Comput. 27(1), 79–101 (2017). https://doi.org/10.1007/s11222-015-9608-z
21. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)
22. Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of Cluster Analysis. Chapman & Hall, London (2015)
23. Klein, H.U., Schäfer, M., Porse, B.T., Hasemann, M.S., Ickstadt, K., Dugas, M.: Integrative analysis of histone ChIP-seq and transcription data using Bayesian mixture models. Bioinformatics 30(8), 1154–1162 (2014)
24. Knoche, S., Ebeling, M.: The musical signal: physically and psychologically, chap 2. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 15–68. CRC Press, Boca Raton (2017)
25. Koenker, R.: Quantile Regression. Econometric Society Monographs, vol. 38 (2010)
26. Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge (2009)
27. Lütkepohl, H.: New Introduction to Multiple Time Series Analysis. Springer, Berlin (2010)
28. Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. In: Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pp 91–99. http://jmlr.org/proceedings/papers/v32/ma14.html (2014)
29. Martin, R., Nagathil, A.: Digital filters and spectral analysis, chap 4. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 111–143. CRC Press, Boca Raton (2017)
30. Mejri, D., Limam, M., Weihs, C.: A new dynamic weighted majority control chart for data streams. Soft Comput. 22(2), 511–522. https://doi.org/10.1007/s00500-016-2351-3
31. Molenberghs, G., Fitzmaurice, G., Kenward, M.G., Tsiatis, A., Verbeke, G.: Handbook of Missing Data Methodology. CRC Press, Boca Raton (2014)
32. Molinelli, E.J., Korkut, A., Wang, W.Q., Miller, M.L., Gauthier, N.P., Jing, X., Kaushik, P., He, Q., Mills, G., Solit, D.B., Pratilas, C.A., Weigt, M., Braunstein, A., Pagnani, A., Zecchina, R., Sander, C.: Perturbation biology: inferring signaling networks in cellular systems. arXiv preprint arXiv:1308.5193 (2013)
33. Montgomery, D.C.: Design and Analysis of Experiments, 8th edn. Wiley, London (2013)
34. Oakland, J.: Statistical Process Control. Routledge, London (2007)
35. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Los Altos (1988)
36. Piateski, G., Frawley, W.: Knowledge Discovery in Databases. MIT Press, Cambridge (1991)
37. Press, G.: A Very Short History of Data Science. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#5c515ed055cf (2013). [last visit: March 19, 2017]
38. Ramsay, J., Silverman, B.W.: Functional Data Analysis. Springer, Berlin (2005)
39. Särkkä, S.: Applied Stochastic Differential Equations. https://users.aalto.fi/~ssarkka/course_s2012/pdf/sde_course_booklet_2012.pdf (2012). [last visit: March 6, 2017]
40. Schäfer, M., Radon, Y., Klein, T., Herrmann, S., Schwender, H., Verveer, P.J., Ickstadt, K.: A Bayesian mixture model to quantify parameters of spatial clustering. Comput. Stat. Data Anal. 92, 163–176 (2015). https://doi.org/10.1016/j.csda.2015.07.004
41. Schiffner, J., Weihs, C.: D-optimal plans for variable selection in data bases. Technical Report 14/09, SFB 475 (2009)
42. Shumway, R.H., Stoffer, D.S.: Time Series Analysis and Its Applications: With R Examples. Springer, Berlin (2010)
43. Tukey, J.W.: Exploratory Data Analysis. Pearson, London (1977)
44. Vatcheva, I., de Jong, H., Mars, N.: Selection of perturbation experiments for model discrimination. In: Horn, W. (ed.) Proceedings of the 14th European Conference on Artificial Intelligence, ECAI-2000, IOS Press, pp 191–195 (2000)
45. Vatolkin, I., Weihs, C.: Evaluation, chap 13. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 329–363. CRC Press, Boca Raton (2017)
46. Weihs, C.: Big data classification: aspects on many features. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds.) Solving Large Scale Learning Tasks: Challenges and Algorithms, Springer Lecture Notes in Artificial Intelligence, vol. 9580, pp. 139–147 (2016)
47. Weihs, C., Ligges, U.: From local to global analysis of music time series. In: Morik, K., Siebes, A., Boulicault, J.F. (eds.) Detecting Local Patterns, Springer Lecture Notes in Artificial Intelligence, vol. 3539, pp. 233–245 (2005)
48. Weihs, C., Messaoud, A., Raabe, N.: Control charts based on models derived from differential equations. Qual. Reliab. Eng. Int. 26(8), 807–816 (2010)
49. Wieczorek, J., Malik-Sheriff, R.S., Fermin, Y., Grecco, H.E., Zamir, E., Ickstadt, K.: Uncovering distinct protein-network topologies in heterogeneous cell populations. BMC Syst. Biol. 9(1), 24 (2015)
50. Wu, J.: Statistics = data science? http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf (1997)