Knowledge Discovery in Big Data from Astronomy and Earth Observation
AstroGeoInformatics
Edited by
Petr Škoda
Stellar Department
Astronomical Institute of the Czech Academy of Sciences
Ondřejov, Czech Republic

Fathalrahman Adam
Earth Observation Center
German Remote Sensing Data Center
DLR German Aerospace Center
Wessling, Germany
Elsevier
3251 Riverport Lane
St. Louis, Missouri 63043

Knowledge Discovery in Big Data from Astronomy and Earth Observation
ISBN: 978-0-12-819154-5
Copyright © 2020 Elsevier Inc. All rights reserved.

MATLAB® is a trademark of The MathWorks, Inc. and is used with permission.


The MathWorks does not warrant the accuracy of the text or exercises in this book.
This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information,
methods, compounds or experiments described herein. Because of rapid advances in the medical sciences, in particular, independent
verification of diagnoses and drug dosages should be made. To the fullest extent of the law, no responsibility is assumed by Elsevier,
authors, editors or contributors for any injury and/or damage to persons or property as a matter of products liability, negligence or
otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Publisher: Candice Janco


Acquisitions Editor: Marisa LaFleur
Editorial Project Manager: Andrea Dulberger
Production Project Manager: Sreejith Viswanathan
Designer: Alan Studholme
Contents

LIST OF CONTRIBUTORS, vii
A WORD FROM THE BIG-SKY-EARTH CHAIR, xi
PREFACE, xiii
ACKNOWLEDGMENTS, xvii

PART I: DATA

1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics, 1
Peter Butka, Peter Bednár, Juliana Ivančáková

2 Historical Background of Big Data in Astro and Geo Context, 21
Christian Muller

PART II: INFORMATION

3 AstroGeoInformatics: From Data Acquisition to Further Application, 31
Bianca Schoen-Phelan

4 Synergy in Astronomy and Geosciences, 39
Mikhail Minin, Angelo Pio Rossi

5 Surveys, Catalogues, Databases, and Archives of Astronomical Data, 57
Irina Vavilova, Ludmila Pakuliak, Iurii Babyk, Andrii Elyiv, Daria Dobrycheva, Olga Melnyk

6 Surveys, Catalogues, Databases/Archives, and State-of-the-Art Methods for Geoscience Data Processing, 103
Lachezar Filchev, Lyubka Pashova, Vasil Kolev, Stuart Frye

7 High-Performance Techniques for Big Data Processing, 137
Philipp Neumann, Julian Kunkel

8 Query Processing and Access Methods for Big Astro and Geo Databases, 159
Karine Zeitouni, Mariem Brahem, Laurent Yeh, Atanas Hristov

9 Real-Time Stream Processing in Astronomy, 173
Veljko Vujčić, Darko Jevremović

PART III: KNOWLEDGE

10 Time Series, 183
Ashish Mahabal

11 Advanced Time Series Analysis of Generally Irregularly Spaced Signals: Beyond the Oversimplified Methods, 191
Ivan L. Andronov

12 Learning in Big Data: Introduction to Machine Learning, 225
Khadija El Bouchefry, Rafael S. de Souza

13 Deep Learning – an Opportunity and a Challenge for Geo- and Astrophysics, 251
Christian Reimers, Christian Requena-Mesa

14 Astro- and Geoinformatics – Visually Guided Classification of Time Series Data, 267
Roman Kern, Tarek Al-Ubaidi, Vedran Sabol, Sarah Krebs, Maxim Khodachenko, Manuel Scherf

15 When Evolutionary Computing Meets Astro- and Geoinformatics, 283
Zaineb Chelly Dagdia, Miroslav Mirchev

PART IV: WISDOM

16 Multiwavelength Extragalactic Surveys: Examples of Data Mining, 307
Irina Vavilova, Daria Dobrycheva, Maksym Vasylenko, Andrii Elyiv, Olga Melnyk

17 Applications of Big Data in Astronomy and Geosciences: Algorithms for Photographic Images Processing and Error Elimination, 325
Ludmila Pakuliak, Vitaly Andruk

18 Big Astronomical Datasets and Discovery of New Celestial Bodies in the Solar System in Automated Mode by the CoLiTec Software, 331
Sergii Khlamov, Vadym Savanevych

19 Big Data for the Magnetic Field Variations in Solar-Terrestrial Physics and Their Wavelet Analysis, 347
Bozhidar Srebrov, Ognyan Kounchev, Georgi Simeonov

20 International Database of Neutron Monitor Measurements: Development and Applications, 371
D. Sapundjiev, T. Verhulst, S. Stankov

21 Monitoring the Earth Ionosphere by Listening to GPS Satellites, 385
Liubov Yankiv-Vitkovska, Stepan Savchuk

22 Exploitation of Big Real-Time GNSS Databases for Weather Prediction, 405
Nataliya Kablak, Stepan Savchuk

23 Application of Databases Collected in Ionospheric Observations by VLF/LF Radio Signals, 419
Aleksandra Nina

24 Influence on Life Applications of a Federated Astro-Geo Database, 435
Christian Muller

INDEX, 445
List of Contributors

Tarek Al-Ubaidi, MSc
DCCS – IT Business Solutions, Graz, Austria

Ivan L. Andronov, DSc, Prof
Department of Mathematics, Physics and Astronomy, Odessa National Maritime University, Odessa, Ukraine

Vitaly Andruk
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

Iurii Babyk, Dr
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

Peter Bednár, PhD
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Košice, Slovakia

Mariem Brahem, PhD
DAVID Lab., University of Versailles Saint-Quentin-en-Yvelines, Université Paris-Saclay, Versailles, France

Peter Butka, PhD
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Košice, Slovakia

Zaineb Chelly Dagdia, Dr
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
LARODEC, Institut Supérieur de Gestion de Tunis, Tunis, Tunisia

Rafael S. de Souza, PhD
Department of Physics & Astronomy, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States

Daria Dobrycheva, Dr
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

Khadija El Bouchefry, PhD
South African Radio Astronomy Observatory, Rosebank, JHB, South Africa

Andrii Elyiv, Dr
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

Lachezar Filchev, Assoc Prof, PhD
Space Research and Technology Institute, Bulgarian Academy of Sciences, Sofia, Bulgaria

Stuart Frye, MSc
National Aeronautics and Space Administration, Washington, DC, United States

Atanas Hristov, PhD
University of Information Science and Technology “St. Paul the Apostle”, Ohrid, North Macedonia

Juliana Ivančáková, MSc
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Košice, Slovakia

Darko Jevremović, Dr
Astronomical Observatory Belgrade, Belgrade, Serbia

Nataliya Kablak, DSc, Prof
Uzhhorod National University, Uzhhorod, Ukraine

Roman Kern, PhD
Institute of Interactive Systems and Data Science, Technical University of Graz, Graz, Austria

Sergii Khlamov, PhD
Institute of Astronomy, V. N. Karazin Kharkiv National University, Kharkiv, Ukraine
Main Astronomical Observatory of the NAS of Ukraine, Kyiv, Ukraine

Maxim Khodachenko, PhD
Space Research Institute, Austrian Academy of Sciences, Graz, Austria
Skobeltsyn Institute of Nuclear Physics, Moscow State University, Moscow, Russia

Vasil Kolev, MSc
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria

Ognyan Kounchev, Prof, Dr
Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria

Sarah Krebs, MSc
Know-Center, Graz, Austria

Julian Kunkel, Dr
University of Reading, Reading, United Kingdom

Ashish Mahabal, PhD
California Institute of Technology, Pasadena, CA, United States

Olga Melnyk, Dr
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

Mikhail Minin, MSc
Jacobs University Bremen, Bremen, Germany

Miroslav Mirchev, Dr
Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Skopje, North Macedonia

Christian Muller, Dr
Royal Belgian Institute for Space Aeronomy, Belgian Users Support and Operation Centre, Brussels, Belgium

Philipp Neumann, Prof, Dr
Helmut-Schmidt-Universität Hamburg, Hamburg, Germany

Aleksandra Nina, PhD
Institute of Physics Belgrade, University of Belgrade, Belgrade, Serbia

Ludmila Pakuliak, Dr
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

Lyubka Pashova, Assoc Prof, PhD
National Institute of Geophysics, Geodesy and Geography, Bulgarian Academy of Sciences, Sofia, Bulgaria

Christian Reimers, MSc
Computer Vision Group, Faculty of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
Climate Informatics Group, Institute of Data Science, German Aerospace Center, Jena, Germany

Christian Requena-Mesa, MSc
Computer Vision Group, Faculty of Mathematics and Computer Science, Friedrich Schiller University Jena, Jena, Germany
Climate Informatics Group, Institute of Data Science, German Aerospace Center, Jena, Germany
Department Biogeochemical Integration, Max Planck Institute for Biogeochemistry, Jena, Germany

Angelo Pio Rossi, Dr, PhD
Jacobs University Bremen, Bremen, Germany

Vedran Sabol, PhD
Know-Center, Graz, Austria

D. Sapundjiev, PhD
Royal Meteorological Institute (RMI), Brussels, Belgium

Vadym Savanevych, DSc
Main Astronomical Observatory of the NAS of Ukraine, Kyiv, Ukraine

Stepan Savchuk, DSc, Prof
Lviv Polytechnic National University, Lviv, Ukraine

Manuel Scherf, MSc
Space Research Institute, Austrian Academy of Sciences, Graz, Austria

Bianca Schoen-Phelan, PhD
Technical University Dublin, School of Computer Science, Dublin, Ireland

Georgi Simeonov, Assistant
Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria

Bozhidar Srebrov, Assoc Prof, Dr
Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, Sofia, Bulgaria

S. Stankov, PhD
Royal Meteorological Institute (RMI), Brussels, Belgium

Maksym Vasylenko, MSc
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

Irina Vavilova, Dr
Main Astronomical Observatory of the National Academy of Sciences of Ukraine, Kyiv, Ukraine

T. Verhulst, PhD
Royal Meteorological Institute (RMI), Brussels, Belgium

Veljko Vujčić
Astronomical Observatory Belgrade, Belgrade, Serbia

Liubov Yankiv-Vitkovska, Assoc Prof, Dr
Lviv Polytechnic National University, Lviv, Ukraine

Laurent Yeh, Assoc Prof, PhD
DAVID Lab., University of Versailles Saint-Quentin-en-Yvelines, Université Paris-Saclay, Versailles, France

Karine Zeitouni, Prof, PhD
DAVID Lab., University of Versailles Saint-Quentin-en-Yvelines, Université Paris-Saclay, Versailles, France
A Word from the BIG-SKY-EARTH Chair

In the summer of 2013, a small group of astronomers and Earth observation experts gathered around an intriguing idea – why not assemble a network of experts from those two disciplines that have so much in common, but essentially do not communicate much as professionals? An opportunity opened for this idea to be realized as the COST funding scheme opened a call for transdisciplinary networks. A proposal was put together and the Big Data Era in Sky and Earth Observation (BIG-SKY-EARTH) project was born.1 Funded by COST, the project officially started in January 2015 and ended in January 2019. In the end, the BIG-SKY-EARTH network included 28 COST countries and hundreds of researchers from all around the world. The project organized four training schools, two workshops and two conferences, exchange of numerous scientists between research groups, many working group meetings, and formal and informal gatherings. The lively exchange of ideas and a mixture of people of various expertise fueled a plethora of collaborations between participants. It has been a mixture of experts from academia and the private sector, with special attention given to supporting young researchers still in the early stages of their careers.

Toward the end of the COST project, a suggestion emerged to put together a book that would be an interesting reading for astronomers curious about remote sensing and vice versa. Essentially, the book would be about AstroGeoInformatics – an amalgam of computer science, astronomy, and Earth observation. The suggestion was enthusiastically supported by lots of participants and now, finally, you have it in front of you – the first book written for such transdisciplinary readers. The book was a challenge for authors to write, since the readers will surely come with a large range of background knowledge, and texts had to be balanced between the disciplines. But everyone should find at least some parts of the book intriguing and interesting, hopefully as much as we found our BIG-SKY-EARTH network inspiring and motivating.

Dejan Vinković
Chair of the BIG-SKY-EARTH network

This book is based upon work from COST Action TD1403 BIG-SKY-EARTH, supported by the European Cooperation in Science and Technology (COST).
COST2 is a funding agency for research and innovation networks. Our actions help connect research initiatives across Europe and enable scientists to grow their ideas by sharing them with their peers. This boosts their research, career, and innovation.

1 https://bigskyearth.eu.
2 www.cost.eu.

Preface

WHAT’S IN THIS BOOK?

This book has several parts reflecting various stages of Big Data processing and machine learning, following the Data–Information–Knowledge–Wisdom (DIKW) pyramid explained below. There are several main parts reflecting the particular stages of the pyramid.

Part I is an introductory section about the origin of Big Data and its history. It shows that Big Data is not an entirely new concept, as humans have tended for a long time to collect ever-growing amounts of data, and summarizes the general principles and recent developments of knowledge discovery.

Part II discusses the different stages of data acquisition, preprocessing, and interpretation, as well as data repositories. We want to point out what basic steps have to be made to reach end-to-end systems allowing an automated extraction of results based on instrument data and auxiliary databases.

Part III addresses the primary goal of this book – the extraction of new wisdom about the Universe (including our Earth) from Big Data. It presents various approaches of data analysis, such as genetic programming and machine learning techniques, including the commercially overhyped deep learning.

Part IV addresses the specific readership that we want to reach. We think of a wide range of experts, ranging from students of different disciplines to practitioners as well as theory-oriented researchers. Here, our aim is to acquaint these readers with the wide variety of tasks that can be solved by modern approaches of Big Data processing and give them some examples of an interdisciplinary approach using smart, untraditional methods and data sources. The advanced combination of apparently nonrelevant data may even help reveal unexpected correlations and relationships having an impact on our everyday life. So, for instance, the real-time monitoring of GPS signals helps in predicting the weather, and listening to very long-wave transmitters can be crucial in predictions of natural disasters such as hurricanes or earthquakes.

MOTIVATION AND SCOPE

Fundamental science at the beginning of our civilization was naturally treated as a multidisciplinary task. The most important ancient Greek scientists and philosophers focused on mathematics, physics, astronomy, geometry, biology, mineralogy, meteorology, and medicine, as well as on social and political matters. The discovery of nature per se was primarily driven by curiosity and not by the intention of exploiting it. Particularly famous are the many philosophical discussions in the Pythagorean school or the Platonic Academy. The principles of “Pansophism” as an educational goal were proclaimed by Comenius, and the didactic principles he introduced are probably the top achievement in teaching that seems to continue almost unchanged until these days. Our goal is to maintain some of his principles in this book as well. That is why we put emphasis on understandability and on the joy of gaining new knowledge, together with using (some) practical examples. We also accentuate (in agreement with Comenius) the role of color pictures that try to associate the read text with something familiar to the reader.

The problem with current science is mainly in breaking with all the principles given above. The contemporary scientific communities are very narrowly focused, having their own terminology which presents a high entry barrier for intruders into their “sacred land.” An enthusiastic researcher searching for similarities between his research and a completely different field of science is quickly redirected into the “proper” corridors in a number of ways (e.g., by funding agencies, referees of scientific journals, or leaders of their home institution). The result is that interdisciplinarity, although strongly proclaimed by science policymakers, is very difficult to practice in reality. Most researchers hardly get sufficient funds to visit essential conferences and workshops in their own narrow field, and there they usually meet the same colleagues they already know quite well, and everybody knows what topics (and even which objects) will be presented. The big symposia are more promising in promoting interdisciplinarity, at least within the boundaries of one branch of science (e.g., astronomy, space physics, and geophysics). However, the multitude of parallel sessions and the shortage of available time force the attendees again to visit only sessions about their own and closely related fields. So it seems that many scientists remain locked in their small, highly specialized communities. Fortunately, on the other hand, there are also quite large meetings based not on limited research subjects but on particular methodologies and technology.

So, for instance, in astronomy there is the traditional Astronomical Data Analysis Software and Systems (ADASS) conference, where the common uniting subject is modern computer technology in fields such as satellite and telescope control systems, advanced mathematical algorithms, high-performance computing and databases, software development methodology, or the analysis of Big Data. The same holds for Earth observation, where the International Geoscience and Remote Sensing Symposium (IGARSS) conferences are targeted at participants from numerous geoscientific and remote sensing disciplines.

Another example of a broad community from all fields of astronomy are the interoperability meetings of the International Virtual Observatory Alliance, where the main goal is to define data formats, models, and protocols allowing the global interoperability of all astronomical databases and archives. Their “terrestrial” counterpart are the many standards and conventions published by the Open Geospatial Consortium (OGC), national and international space agencies (e.g., NASA, ESA, and DLR), and even industrial consortia that are highly specialized in image processing, spectroscopy, precision farming, database technologies, computer networks, etc.

Nowadays, we can see that more and more quantitative physical and chemical parameters can be retrieved from advanced instruments and their processing chains; a typical criterion is the attainable signal-to-noise level of an instrument. Here we can see considerable improvements when we compare traditional instruments with innovative devices. However, we must also consider the specific requirements of astro- and geoscience applications, such as the long-term data availability and traceability of the results (data curation, reproducibility), proven scientific data quality with quantitative error levels, and appropriate tools to validate them. The importance of archives of historical records, namely photographic plates, is rising in accordance with the implementation of the FAIR principles of data management (striving to make data findable, accessible, interoperable, and reusable). Digitization of legacy sources brings about a considerable amount of Big Data volumes but also new, unexpected opportunities for their analysis. A typical astronomical example is the use of old astronomical plates with spectra of bright hot stars (secured at the beginning of the 20th century) to measure ozone concentrations in the Earth’s stratosphere at that time.

In the future, we expect that more complex scientific problems will lead to common “astro-/geo-” approaches applied in synergy. There are already several cases where astronomical instrumentation was used for analyzing terrestrial phenomena. For example, the Earth’s auroral X-ray radiation was observed in 2015 by the ESA Integral mission, ordinarily busy with observing black holes, during calibration of the diffuse cosmic X-ray background. Another example is the world’s largest radio telescope array, LOFAR, which, when monitoring the three-dimensional propagation of lightning 18 km above the Netherlands with microsecond time resolution, discovered interesting needle-shaped plasma structures, probably responsible for multiple strikes of the same lightning within seconds.

Astronomical observations also contribute to meteorology and vice versa. In astronomical spectroscopy, almost every spectrum is contaminated by atmospheric water vapor absorption lines, which provides a direct way to measure the line-of-sight water content in the atmosphere. This water contamination must be removed from the recorded spectra by complicated methods, although in many cases it is used to correct the wavelength calibration of the spectrograph, as a reference source with negligible radial velocity in comparison to stellar ones.

Current astronomy is making many new discoveries thanks to the possibility of aggregating observations over the whole electromagnetic spectrum (with recent extensions to astroparticle physics – cosmic rays, neutrinos, and also gravitational wave astronomy) and also by cross-matching gigantic, multi-petabyte scaled sky surveys. The new astronomical instruments LSST, SKA, and EUCLID are foreseen to produce tens of petabytes of raw data every year, and their future archives are expected to be operating at the edge of contemporary IT technology, with embedded state-of-the-art machine learning functionality.

As for the geosciences, we have to consider additional application tasks with specific constraints, such as real-time traffic monitoring and rapid disaster analysis (e.g., of flooding, lava flows, storm warnings, and volcanic plume dynamics) or the handling of the large data volumes that we encounter in weather predictions or crop yield calculations. In particular, some applications call for highly reliable and carefully calibrated input data, for instance, for climate research and risk assessments, while others put more emphasis on keeping up with the data volume or on access to cloud computing.

The geosciences benefit a lot from satellite observations and remote sensing, where the continuous flow of new hyperspectral data and motion and displacement measurements (based on speed and phase measurements derived from synthetic-aperture radars or optical instruments) have to be combined with accurate “ground truth” reference data obtained by geophysics, hydrology, and terrain investigations. This presents challenging opportunities for agriculture, forestry, water resource planning, and ore mining, as well as new approaches to infrastructure development and urban planning. The importance of aggregated Earth observation databases and so-called Virtual Earths is continuously increasing in the case of natural disasters for rescue and humanitarian aid purposes.

The current Fourth Paradigm of science, characterized by Big Data, is data-driven research and exploratory analysis exploiting innovative machine learning principles to obtain new knowledge, e.g., about a vast tsunami threatening a coastal area, thus offering qualitatively new scientific methodologies, e.g., simulation runs and comparisons with existing databases. The Big Data phenomenon is common to every scientific discipline and presents a serious obstacle to conducting efficient research by established methods. Therefore, a new multidisciplinary approach is needed, which integrates the particular discipline knowledge with advanced statistics, high-performance computing and data analytics, new computer software design principles, machine learning, and other fields belonging rather to computer science, artificial intelligence, and data science. This change also requires a new kind of career path, which is not yet fully established despite first experimental courses at renowned universities, where the graduates are “data scientists.”

The same term is used in business for the “sexiest job of the 21st century,” and is understood there rather as a commercial data analysis task – using statistics and simple machine learning for the analysis of data warehouses. The real scientific data scientist must, however, have a considerably broader knowledge of his scientific branch, in addition to excellent knowledge of statistics and machine learning.

The first successful integration happened in bioinformatics, which recently became an officially accepted branch of science. In astronomy, Astroinformatics originated around the year 2010 and is still being treated by the majority of astronomers like a kind of sorcery, despite the public outreach efforts of the International Astrostatistics Association, the working groups in astrostatistics and astroinformatics of both the International Astronomical Union and the American Astronomical Society, and also the recently established International AstroInformatics Association. In contrast, Geoinformatics for Earth observation suffers from the misunderstanding of being confounded with geomatics and geographical information systems; however, the importance of machine learning and Big Data processing is already well understood there, and it is applied routinely.

Despite the lack of a proper taxonomy for these newly established fields, there are attempts to transfer some methods that were successfully applied in one scientific field into a completely different one. Within this context, a key methodology for transferring digitized knowledge is transfer learning, allowing us to train a machine learning system in one domain and then apply the fully trained system within a completely different application area of data analytics. However, this approach calls for an appropriate design of the required databases. Currently, the first results could already be demonstrated for image content classification in remote sensing applications.

There are already world-renowned institutions where such research is being conducted. For instance, the Center for Data-Driven Discovery at Caltech and some groups at NASA have successfully applied methods developed for astronomical analysis in medicine (e.g., prediction of autism in EEG records or the identification of cancer metastases in histologic samples).

Another field where the methodological similarities with astronomy are even more striking is the complete range of geosciences and remote sensing disciplines. Image analysis using data recorded by CCD detectors, multispectral analysis, hyperdata cubes, time series measurements, data streams, various coordinate systems, deep learning, classification techniques, and federalization of resources – all these fields are applied in similar ways both in astronomy and geosciences. So it seems to be very useful for one community to learn from the other, and both should acquire a lot of practical skills from computer science and data science. We are convinced that both communities (“astro” and “geo”) will benefit from understanding the more comprehensive interdisciplinary view. An interesting topic could be advanced algorithms to identify (and remove) the varying atmospheric background of satellite images.

The idea of this book originated during very productive meetings of the transdisciplinary European COST Action TD1403 called BIG-SKY-EARTH. It would not have been possible without the great enthusiasm of many people who devoted a considerable amount of time to the preparation of the state-of-the-art reviews of typical topics that, according to their feelings, will be important for future data scientists.

The main goal of the book, which is a kind of experiment trying to show the potential synergy between astronomy and geosciences, is to give a first-time overview of the technologies and methodologies, together with references to the practical usage of many important fields related to knowledge discovery in astro- and geo-Big Data. Its purpose is to give a very general background and some ideas, with numerous references that could be helpful in the design of fully operational data analysis systems – and of an experimental data science book.

The book tries to follow the pyramid called Data (from data acquisition to distributed databases), Information (data processing), Knowledge (machine learning), and Wisdom (applied knowledge for applications) (DIKW). We intend to present a global picture, seen from history, data gathering, data processing, and knowledge extraction to inferences of new wisdom. This has to be understood in conjunction with the pros and cons of every processing step and its concatenations (including user interfaces).

In the last part of the book, we address a fundamental question, namely, what kind of new knowledge we could get if the data from both astronomical and geo-research were properly processed in close synergy, and what impact this would have on human health, environmental problems, economic prosperity, etc. In the following chapters, we present somewhat preliminary topics of a newly emerging scientific field which we try to define as AstroGeoInformatics.

As we are aware that there are limits in understanding the terminologies of other fields, we asked all authors to avoid complex mathematics as well as deep details and the common jargon of each discipline, and to try to treat their contribution more on an educational or outreach level (also by using more illustrations) while still maintaining a rigorous structure with proper references for every important statement. Thus, we can avoid detailed discussions about very specific problems such as the avoidance of overfitting during image classification; however, we have to be aware of typical decision making problems and imminent limitations: Shall we believe in earthquake prediction based on deep learning and evacuate a full region?

We hope that a wide scope of readers will find this book interesting, and that it will serve them as a starter for an interdisciplinary way of individual thinking. This should be an important characteristic of this book. The future of humankind is dependent on a close collaboration between many scientific disciplines in synergy. Big Data is one example of the global problems that must be overcome by changes of paradigms of how research was done in each discipline so far.

Data scientists will be indispensable leaders of these changes. We hope that our book will help educate new graduates in this emerging field of science.

Petr Škoda
Fathalrahman Adam
Editors

With great help from Gottfried Schwarz, DLR.
Acknowledgments

We want to thank all the people who participated in the preparation of this book. The book is not just another collection of papers like typical conference proceedings, but the result of long-term planning, inspiring discussions with experts in many fields of natural sciences, asking personally more than fifty people we knew from various conferences, making open calls in different discussion groups and communities, and exchanging a large number of e-mails with the world’s leaders in related fields. So in addition to the authors who finally wrote some chapters, we also thank those who established contacts with potential authors or contributed by providing links to interesting articles, as well as those who enthusiastically promised to write a chapter but later were not able to do so due to more urgent matters.

We are also grateful to the European Union for its funding of the COST Action TD1403 BIG-SKY-EARTH, which succeeded in attracting such a nice group of experts that benefited from the interdisciplinary nature of this action. This COST action has also shown the need for personal contacts in preparing exciting research ideas. The fruitful COST meetings helped amalgamate the initially heterogeneous group of researchers from different fields and countries into a real task force capable of accepting each other’s visions, methodologies, and technologies. This book is the result of such a new interdisciplinary collaboration that tries to present the benefits of synergetic empowerment in the natural sciences.

We are also indebted to Elsevier’s representatives, who were in direct contact with us, the editors, and the other authors during the whole process of book preparation. It was Marisa LaFleur who followed the book project from its start and managed to arrange things in a comfortable but determined way. Lots of people, including a part of the referees during the initial review phase of the book’s table of contents, did not believe in our vision of mixing astronomy, geosciences, and computer science in all chapters. They would have preferred the more classic approach – to split the book into two parts, one for astronomers and the other for geo-experts. Despite their skepticism, the Elsevier people trusted the strength of our vision and helped us finish the work successfully.

So we thank Ashwathi Aravindakshan, our copyrights coordinator, who was keeping an eye on the proper copyright status of each figure, as well as the tables and examples which were already published elsewhere, and Subramaniam Jaganathan, the contract coordinator, who helped us arrange the initial paperwork for signing the contract.

Finally, we would like to express our deep gratitude to Andrea Dulberger, the editorial project manager, as well as Sreejith Viswanathan, the project manager, for leading the project to a successful end.
PART I DATA

CHAPTER 1

Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics
PETER BUTKA, PHD • PETER BEDNÁR, PHD • JULIANA IVANČÁKOVÁ, MSC

1.1 INTRODUCTION
Whenever someone wants to apply data mining techniques to a specific problem or dataset, it is useful to see everything done in a broader and more organized way. Therefore, successful data science projects usually follow some methodology which provides the data scientist with basic guidelines on how to approach the problem and how to work with data, algorithms, or models. This methodology is then a structured way to describe the knowledge discovery process. Without a flexible structure of steps, data science projects can be unsuccessful, or at least it will be hard to achieve a result that can be easily applied and shared. A better understanding of at least an overview of the process is quite beneficial both to the data scientist and to anyone who needs to discuss the results or steps of the process (such as data engineers, customers, or managers). Moreover, in some domains, including those working with data from astronomy and geophysics, the steps used in preprocessing and analysis of data are crucial to understanding the provided products.

From the 1990s, research in this area started to define its terms more precisely, with the definition of knowledge discovery (or knowledge discovery in databases [KDD]) (Fayyad et al., 1996) as a synonym for the knowledge discovery process (KDP). It included data mining as one of the steps in the knowledge acquisition effort. KDD (or KDP) and data mining are even today often seen as equal terms, but data mining is a subpart (step) of the whole process dedicated to the application of algorithms able to extract patterns from data. Moreover, KDD also became the first description of the KDP as a formalized methodology. In the following years, new efforts brought more attempts which led to other methodologies and their applications. We will describe selected cases in more detail later.

For a better understanding of KDPs, we can shortly describe how the basic terms data, information, and knowledge are defined. We have to say that there are many attempts to explain them more precisely. One example is the DIKW pyramid (Rowley, 2007). This model represents and characterizes information-based levels (according to the area of information engineering) in the chain of grading informativeness known as Data–Information–Knowledge–Wisdom (see Fig. 1.1). Similar models often apply such chains, even if some parts are removed or combined. For example, very often such a model is simplified to Data–Information–Knowledge or even Data–Knowledge, but the semantics is usually the same as or similar to that of the DIKW pyramid. Moreover, there are many models which describe not only the objects but also the processes for their transitions, e.g., Bloom's taxonomy (Anderson and Krathwohl, 2001), decision process models (Bouyssou et al., 2010), or knowledge management – SECI models (Nonaka et al., 2000). The description of a methodology usually defines what we understand under the data, information, and knowledge levels.

While methodologies started from a more general view, logically more and more attempts were transformed into a more structured form. Also, many of them became more tool-specific. When we look at the evolution of the KDP, the main further steps after the creation of the more general methodologies are two basic concepts. First, in order to have more precise and formalized processes, many of them were transformed into standardized process-based definitions with the automation of their steps. Such an effort is logically achieved more easily by application in specific domains (such as industry, medicine, science), with clear standards for exchanging documents and often with the support of specific tools used for the automation of processes. Second, when we have several standardized processes in different domains, it is often not easy to apply methods from one area directly in another one. One of the solutions is to support better cross-domain understanding of steps using some shared terminology. This solution leads to the creation of formalized semantic models like ontologies that are helpful in better understanding of terminology between domains. Moreover, another step towards a new view of methodologies and sharing of information about them was proposed based on ontologies of KDPs, like OntoDM (Panov et al., 2013).

Therefore, to summarize, generalized methodologies are the basic concepts related to KDPs. More specific versions of them provide standards and automation in specific domains, while, on the other hand, cross-domain models share domain-specific knowledge between different domains. This basic overview also describes the structure of the sections of this chapter. In the next section, we provide some details on data–information–knowledge definitions and KDPs. In the following section, we describe the existing, more general methodologies. In Section 1.4 we look at methodologies in a more precise way, through standardization and automation efforts, as well as attempts to share knowledge in a cross-domain view. In the following section, the astro/geo context is discussed, mainly focusing on its specifics and shared aspects, and the possible transfer of knowledge.

FIG. 1.1 DIKW pyramid – understanding the difference between Data, Information, Knowledge, and Wisdom.

1.2 KNOWLEDGE DISCOVERY PROCESSES
Currently, we can store and access large amounts of data. One of the main problems is to transform raw data into some useful artifacts. Hence, the real benefit is in our ability to extract such useful artifacts, which can be in the form of reports, policies, decisions, or recommended actions. Before we provide more details on the processes that transform raw data into these artifacts, we can start with the basic notions of data, information, and knowledge.

As we already mentioned in the previous section, there are different definitions with a different scope, i.e., from the DIKW pyramid with a more granular view to simpler definitions with only two levels of data–knowledge relations. For our purposes we stay with a simpler version of DIKW, where we define the Data–Information–Knowledge relations in this way, adapted from the broader Beckman definitions (Beckman, 1997):
• Data – facts, numbers, pictures, recorded sound, or another raw source, usually describing real-world objects and their relations;
• Information – data with added interpretation and meaning, i.e., formatted, filtered, and summarized data;
• Knowledge – information with actions and applications, i.e., ideas, rules, and procedures, which lead to decisions and actions.
While there are also extended versions of such relations, this basic view is quite sufficient for all methodologies for KDPs. It is because raw data gathering (Data part), their processing and manipulation (Information part), and the creation of models that are suitable for supporting decisions and further actions (Knowledge part) are all necessary aspects of standard data analytical tasks. Hence, transformations in this Data–Information–Knowledge chain represent a very general understanding of the KDP, or a simple version of a methodology.
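To make the Data–Information–Knowledge distinction above a bit more concrete, here is a minimal Python sketch that is not taken from the chapter: the records, the age threshold, and the derived rule are all invented for illustration. It carries one tiny, fictitious dataset through the three levels – raw facts, summarized information, and an actionable rule.

```python
# Toy illustration of the Data–Information–Knowledge chain.
# All records, thresholds, and the derived rule are invented for this example.

# DATA: raw facts about real-world objects (here: fictitious survey answers).
raw_data = [
    {"customer": "A", "age": 34, "bought_product": True},
    {"customer": "B", "age": 51, "bought_product": False},
    {"customer": "C", "age": 29, "bought_product": True},
    {"customer": "D", "age": 62, "bought_product": False},
]

# INFORMATION: the same data filtered, formatted, and summarized,
# i.e., given interpretation and meaning.
def summarize(records):
    young = [r for r in records if r["age"] < 40]
    older = [r for r in records if r["age"] >= 40]
    return {
        "buy_rate_under_40": sum(r["bought_product"] for r in young) / len(young),
        "buy_rate_40_plus": sum(r["bought_product"] for r in older) / len(older),
    }

# KNOWLEDGE: information turned into a rule that supports a decision or action.
def derive_rule(info):
    if info["buy_rate_under_40"] > info["buy_rate_40_plus"]:
        return "Target the next campaign at customers under 40."
    return "Target the next campaign at customers 40 and older."

information = summarize(raw_data)
knowledge = derive_rule(information)
print(information)  # interpreted, summarized data
print(knowledge)    # actionable recommendation
```

The customer survey example discussed next follows exactly this pattern, only with real data and proper mining algorithms instead of a hard-coded rule.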
In such a process, we have the input dataset (raw sources – Data part), which is transferred using several steps (often including data manipulation to obtain more interpreted and meaningful data – Information part) to knowledge (models containing rules or patterns – Knowledge part).

For example, data from a customer survey are in the raw form of Yes/No answers, values on an ordinal scale, or numbers. If we put these data about customers in the context of the questions, combine them in infographics, and analyze their relations with each other, we transform raw data into information. In practice, we mine some rules on how these customers and their subgroups usually react in the specific cases discussed in the survey. We can try to understand their behavior (what they prefer or buy), predict their future reactions (whether they will be interested in a new product) in similar cases, and provide actionable knowledge in the form of a recommendation to the responsible actor (apply these rules to get higher income).

The presented view of Data–Information–Knowledge relations is also comparable to the view of business analytics. In this case, we have three options in analytics according to our expectations (Evans, 2015):
• Descriptive analytics – uses data aggregation and descriptive data mining techniques to see what happened in the system (business), so the question "What has happened?" is answered. The main idea is to use descriptive analytics if we want to understand at an aggregate level what is going on, summarize such information, and describe different aspects of the system in that way (to understand present and historical data). The methods here lead us to exploratory analysis, visualizations, periodic or ad hoc reporting, trend analysis, data warehousing, and the creation of dashboards.
• Predictive analytics – basically, tasks from this part examine the future of the system. They answer the question "What could happen according to historical data?" We can see this as a predictor of states according to all historical information. It is an estimation of the normal development of the characteristics of our system. This part of analytical tasks is closest to the traditional view of KDPs. The methods here are the same as in the case of any KDP methodology: statistical analysis and data mining methods.
• Prescriptive analytics – here belong all attempts where we select some model of the system and try to optimize its possible outcomes. It means that we analyze what we have to do if we want to get the best efficiency for some output model values. The name comes from the word prescribe, so it is a prescription or advice for actions to be taken. The set of methods applied here is large, including methods from data mining, machine learning (whenever output models are also applicable as actions), operations research, optimization, computational modeling, or expert (knowledge-based) systems.

A nice feature of business analytics is that every option can be applied separately, or we can combine them in a chain as a step-by-step process. In this case, we can see descriptive analytics as mainly responsible for the transformation between Data and Information. With the addition of predictive analytics, we can enhance the process of transformation to get Knowledge of our system. Our extracted knowledge is then applicable and actionable simply as is, or we can extend it and make it part of the decision making process using methods from the area of prescriptive analytics. Hence, we can see Data–Information–Knowledge in a narrow view as part of predictive analytics in, let us say, the traditional understanding (with KDPs as KDD), or we can see it in a broader scope with all analytics involved in the transformation.

Now we can show the differences in an example. Imagine that a company has several hotels with casinos, and they want to analyze customers and optimize their profit. Within descriptive analytics, they use data warehousing techniques to make reports about hotel occupancy in time, activities in the casino and its income, and infographics of profit according to different aspects. These methods will help them to understand what is happening in their casinos and hotels. Within predictive analytics, they can create a predictive model that forecasts hotel and casino occupancy in the future, or they can use data about customers and segment them into groups according to their behavior in the casinos. The result is a better understanding of what will happen in the future, what the occupancy of the hotel will be in different months, and what the expected behavior of customers is when they come to the casino. Moreover, within prescriptive analytics, they can identify which decision-based input setup optimizes their profit (and how). It means that, according to the prediction of hotel occupancy, they can change prices accordingly, set up the allocation of rooms, or provide benefits to some segments of customers. For example, if someone is playing a lot, we can provide him/her with some benefits to support his/her return, like a better apartment for a lower price or free food.
As we already mentioned, people often conflate the KDP with data mining, which is only one of its steps. Moreover, other names have also been used in the literature for knowledge discovery, like knowledge extraction, information harvesting, information discovery, data pattern processing, or even data archeology. The most frequently used synonym for the KDP is then obviously KDD, which is logical due to the beginnings of the KDP with the processing of structured data stored in standard databases. The basic properties are even nowadays the same as or similar to the KDD basics from the 1990s. Therefore, we can summarize them accordingly (Fayyad et al., 1996):
• The main objective of the KDP is to seek new knowledge in the selected application domain.
• Data are a set of facts. A pattern is an expression in some suitable language (part of the outcome model, e.g., a rule written in some rule-based language) about a subset of facts.
• The KDP is a nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. A process is simply a multistep approach of transformations from data to patterns.
The pattern (knowledge) mentioned before is:
• valid – the pattern should be true on new data with some certainty,
• novel – we did not know about this pattern before,
• useful – the pattern should lead to actions (the pattern is actionable),
• comprehensible – the process should produce patterns that lead to a better understanding of the underlying data for a human (or machine).
The KDP is easily generalized also to sources of data which are not in databases or not in structured form, which carries similar methodology aspects over to the areas of text mining, Big Data analysis, or data stream processing. Knowledge discovery involves the entire process, including storage and access of data, application of efficient and scalable data processing algorithms to analyze large datasets, interpretation and visualization of the outcome results, and support of the human–machine or human–computer interaction, as well as support for learning and analyzing the domain.

The KDP model, which is then called a methodology, consists of a set of processing steps followed by the data analyst or scientist to run a knowledge discovery project. The KDP methodology usually describes procedures for each step of such a project. The model helps organizations (represented by the data analyst) to understand the process and create a project roadmap. The main advantages are reduced costs for any ad hoc analysis, time savings, better understanding, and acceptance of the advice coming from the results of the analysis. While there are still data analysts who apply ad hoc steps to their projects, most of them apply some common framework with the help of (commercial or open source) software tools for particular steps or one unified analytical platform of tools.

Before we move to a description of selected methodologies in the next section, we summarize the motivation for the use of standardized KDP models (methodologies) (Kurgan and Musilek, 2006):
• The output product (knowledge) must be useful for the user, and ad hoc solutions have more often failed in yielding valid, novel, useful, and understandable results.
• Understanding of the process itself is important. Humans often lack a perception of large amounts of untapped and potentially valuable data. A process model that is well structured and logical will help to avoid these issues.
• An often underestimated factor is providing support for management problems (this also includes cases of larger projects in the science area, which need efficient management). Whenever KDP projects involve large teams requiring careful planning and scheduling, a management specialist in such projects is often unfamiliar with terms from the data mining area – a KDP methodology can then be helpful in managing the whole project.
• Standardization of the KDP provides a unified view of the current process description and allows an appropriate selection and usage of technology to solve current problems in practice, mostly on an industrial level.

1.3 METHODOLOGIES FOR KNOWLEDGE DISCOVERY PROCESSES
In this section, we provide more details on selected methodologies. From the 1990s, several of them were developed, starting basically from academic research, but they very quickly moved on to an industry level. As we already mentioned, the first more structured way was proposed as KDD in Fayyad et al. (1996). Their approach was later modified and improved by both the research and the industry community. The processes always share a multistep sequential way of processing input data, where each step starts after accessing the result of the successful completion of the previous step as its input. Also, it is common that the activities within the steps cover understanding of the task and data, preprocessing or preparation of data, analysis, evaluation, understanding of results, and their application. All methodologies also emphasize their iterative nature by introducing feedback loops throughout the process. Moreover, they are often carried out with a strong influence of human data scientists and therefore acknowledge their interactivity.
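The multistep, iterative character just described can be sketched as a generic pipeline skeleton. The step functions, the fake quality score, and the acceptance threshold below are placeholders invented for illustration; they stand for whatever domain understanding, preparation, mining, and evaluation a concrete methodology prescribes.

```python
# Schematic skeleton of an iterative knowledge discovery process.
# Every function body is a placeholder; only the control flow matters here.

def understand_problem():            # domain and goal understanding
    return {"goal": "classify objects", "min_quality": 0.9}

def prepare_data(task):              # selection, cleaning, transformation
    return {"task": task["goal"], "dataset": "cleaned feature table"}

def mine_patterns(data, attempt):    # the data mining step proper
    quality = 0.7 + 0.1 * attempt    # pretend each iteration improves the model
    return {"model": f"model v{attempt}", "quality": quality, "input": data}

def is_acceptable(result, task):     # interpretation and evaluation of patterns
    return result["quality"] >= task["min_quality"]

def deploy(result):                  # consolidation / use of the knowledge
    print("deploying", result["model"], "with quality", result["quality"])

task = understand_problem()
for attempt in range(1, 6):          # feedback loop: revisit earlier steps if needed
    data = prepare_data(task)
    result = mine_patterns(data, attempt)
    if is_acceptable(result, task):
        deploy(result)
        break
    # otherwise iterate: a real project would refine the data or the parameters
```

Each step consumes the output of the previous one, and the loop makes the feedback explicit; the methodologies discussed below differ mainly in how they name, split, and document these stages.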
The main differences between the methodologies are in the number and scope of the steps, the characteristics of their inputs and outputs, and the formats used.

Several studies have compared the existing methodologies, their advantages and disadvantages, the scope of their application, their relation to software tools and standards, and other aspects. Probably the most extensive comparisons of methodologies can be found in Kurgan and Musilek (2006) and Mariscal et al. (2010). Other papers also bring ideas and advice, including their applicability in different domains; see, for example, Cios et al. (2007), Ponce (2009), and Rogalewicz and Sika (2016).

Before we describe details of some selected methodologies, we provide some information on two aspects, i.e., the evolution of the methodologies and their practical usage by data analysts.

Regarding the history of methodologies, in Mariscal et al. (2010) one can find quite a thorough description of their evolution. As we already mentioned, the first attempts were fulfilled by Fayyad's KDD process between the years 1993–1996, which we will also describe in the next subsection. This approach inspired several other methodologies, which came in the years after the KDD process, like SEMMA (SAS Institute Inc., 2017), Human-Centered (Brachman and Anand, 1996), or the approaches described in Cabena et al. (1998) and Anand and Buchner (1998). On the other hand, also some other ideas evolved into methodologies, including the 5As or Six Sigma. Of course, some issues were identified during those years, and an answer to them was the development of the CRISP-DM standard methodology, which we will also describe in one of the following subsections. CRISP-DM became the leading methodology and quite a reasonable solution for a start in any data mining project, including new projects with Big Data and data stream processing. Any new methodology or standardized description of processes usually follows an approach similar to the one defined by CRISP-DM (some of them are covered in the review papers mentioned before).

The influential role of CRISP-DM is evident from the polls evaluated on KDnuggets,1 a well-known and widely accepted community-based web site related to knowledge discovery and data mining. Gregory Piatetsky-Shapiro, one of the authors of the KDD process methodology, showed in his article2 that, according to the results of polls from the years 2007 and 2014, more than 42% of data analysts (most of all votes) are using the CRISP-DM methodology in their analytics, data mining, or data science projects, and the usage of the methodology seems to be stable.

1 https://www.kdnuggets.com/.
2 https://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html.

1.3.1 First Attempt to Generalize Steps – Research-Based Methodology
Within the starting field of knowledge discovery in the 1990s, researchers defined a multistep process which guides users of data mining tools in their knowledge discovery effort. The main idea was to provide a sequence of steps that would help to go through the KDP in an arbitrary domain. As mentioned before, in Fayyad et al. (1996) the authors developed the model known as the KDD process.

In general, KDD provides a nine-step process, mainly considered as a research-based methodology. It involves both the evaluation and interpretation of the patterns (possibly knowledge) and the selection of preprocessing, sampling, and projections of the data before the data mining step. While some of these nine steps focus on decisions or analysis, other steps are data transitions within the data–information–knowledge chain. As mentioned before, KDD is a "nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al., 1996). The KDD process description also provides an outline of its steps, which is shown in Fig. 1.2.

The model of the KDD process consists of the following steps (the input of each step is the output of the previous one), performed in an iterative (analysts apply feedback loops if necessary) and interactive way:
1. Developing and understanding the application domain, learning relevant prior knowledge, and identifying the goals of the end-user (input: problem to be solved/our goal; output: understanding of the problem/domain/goal).
2. Creation of a target dataset – selection (querying) of the dataset, identification of subset variables (data attributes), and the creation of data samples for the KDP (output: target data/dataset).
3. Data cleaning and preprocessing – dealing with outliers and noise removal, handling missing data, collecting data on time sequences, and identifying known changes to the data (output: preprocessed data).
4. Data reduction and projection – finding useful features that represent the data (according to the goal), including dimension reductions and transformations (output: transformed data).
5. Selection of the data mining task – the decision on which methods to apply for classification, clustering, regression, or another task (output: selected method[s]).
6 PART I Data

FIG. 1.2 The KDD process.

ing, regression, or another task (output: selected 1.3.2 Industry-Based Standard – the
method[s]). Success of CRISP-DM
6. Selection of data mining algorithm(s) – select meth- Shortly after the KDD process definition, the indus-
od for pattern search, deciding on appropriate mod- try produced methodologies more suitable for their
els and their parameters, and matching methods needs. One of them is CRISP-DM (CRoss-Industry Stan-
with the goal of the process (output: selected algo- dard Process for Data Mining) (Chapman et al., 2000),
rithms). which became the standard for many years and is still
7. Data mining – searching for patterns of interest in widely used in both the industry and the research area.
specific form like classification rules, decision trees, CRISP-DM was originally developed by a project con-
regression models, trends, clusters, and associations sortium under the ESPRIT EU funding initiative in
(output: patterns). 1997. The project involved several large companies,
8. Interpretation of mined patterns – understanding which cooperated in its design: SPSS, Teradata, Daimler
and visualizations of patterns based on the extracted AG, NCR Corporation, and OHRA. Thanks to the differ-
models (output: interpreted patterns). ent knowledge of companies, the consortium was able
9. Consolidation of discovered knowledge – use of dis- to cover all aspects, like IT technologies, case studies,
data sources, and business understanding.
covered patterns into a system analyzed by the KDD
CRISP-DM is an open standard and is available for
process, documenting and reporting knowledge to
anyone to follow. Some of the software tools (like SPSS
end-users, and checking and resolving conflicts if
Modeler/SPSS Clementine) have CRISP-DM directly in-
needed (output: knowledge, actions/decisions based
corporated. As we already mentioned, CRISP-DM is the
on the results).
most widely used KDP methodology. While it still has
The authors of this model declared its iterative fash- some drawbacks, it became a part of the most success-
ion, but they gave no specific details. The KDD process ful story in the data mining industry. The central fact
is a simple methodology and quite a natural model for behind this success is that CRISP-DM is industry-based
the discussion of KDPs. There are two significant draw- and neutral according to tools and application. One of
backs of this model. First, lower levels are too abstract the drawbacks of this model is that it does not perform
and not explicit and formalized. This lack of detail was project management activities. One major factor behind
changed in later methodologies using more formalized the success of CRISP-DM is that it is an industry tool,
step descriptions (in some cases using standards, au- and it is application-neutral (Mariscal et al., 2010).
tomation of processes, or specific tools or platforms). The CRISP-DM model (see Fig. 1.3) consists of the
The second drawback is its lack of business aspects de- following six steps, which are then described in more
scription, which is logical due to the research-based idea details and can be iteratively applied, including feed-
at the start of its development. back in some places (where necessary):
CHAPTER 1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics 7

FIG. 1.3 Methodology CRISP-DM.

1. Business understanding – focuses on the under- tion of data, software tools, technical deploy-
standing of objectives and requirements from a busi- ment).
ness perspective, and also converts them into the 2. Data understanding – initial collection of data, un-
technical definition and prepares the first version of derstanding the data quality issues, exploration anal-
the project plan to achieve the objectives. Therefore, ysis, detection of interesting data subsets. If under-
substeps here are: standing shows a need to reconsider business under-
a. determination of business objectives – here it standing substeps, we can move back to the previous
is important to define what we expect as busi- step. Hence, the substeps of data understanding are:
ness goals (costs, profits, better support of cus- a. collection of initial data – the creation of the
tomers, and higher quality of the data prod- first versions of the dataset or its parts,
uct), b. description of data – understanding the mean-
b. assessment of the situation – understanding the ing of attributes in data, summary of the initial
actual situation within the objectives, defining dataset(s), extraction of basic characteristics,
the criteria of success for business goals, c. exploration of data – visualizations, descrip-
c. determination of technical (data mining) tions of relations between attributes, correla-
goals – business goals should be transformed tions, simple statistical analysis on attributes,
into technical goals, i.e., what data mining exploration of the dataset,
models we need to achieve business goals, what d. verification of data quality – analysis of missing
the technical details of these models are, how values, anomalies, or other issues in data.
we will measure it, 3. Data preparation – after finishing the first steps, the
d. generation of a project plan – the analyst cre- most important step is the preparation of data for
ates the first version of the plan, where details data mining (modeling), i.e., the preparation of the
on next steps are available. The analysts should final dataset for modeling using data manipulation
address different issues, from business aspects methods which can be applied. We can divide them
(how to discuss and transform data mining re- into:
sults, deployment issues from a management a. selection of data – a selection of tables, records,
point of view) to technical aspects (how to and attributes, according to goal needs and re-
achieve data, data formats, security, anonymiza- duction of dimensionality,
8 PART I Data

b. integration of data – identification of the same plication of KDP (lifecycle applications). This step
entities within more tables, aggregations from consists of:
more tables, redundancy checks, and processing a. plan deployment – the deployment strategy
of and detection of conflicts in data, is provided, including the necessary steps and
c. cleansing of data – processing of missing values how to perform them,
(remove records or imputation of values), pro- b. plan monitoring and maintenance – strategy
cessing of anomalies, removing inconsistencies, for the monitoring and maintenance of deploy-
d. construction (transformation) of data – the cre- ment,
ation of new attributes, aggregations of values, c. generation of the final report – preparation of
transformation of values, normalizations of val- the final report and final presentation (if ex-
ues, and discretization of attributes, pected),
e. formatting of data – preparation of data as in- d. review of the process substeps – summary of
put to the algorithm/software tool for the mod- experience from the project, unexpected prob-
eling step. lems, misleading approaches, interesting solu-
4. Modeling – various modeling techniques are ap- tions, and externalization of best practices.
plied, and usually more types of algorithms are used, CRISP-DM is relatively easy to understand and has
with different setup parameters (often with some good vocabulary and documentation. Thanks to its gen-
metaapproach for optimization of parameters). Be- eralized nature, this methodology is a very successful
cause methods have different formats of inputs and and extensively used model. In practice, many advanced
other needs, the previous step of data preparation analytic platforms are based on this methodology, even
could be repeated in a small feedback loop. In gen- if they do not call it the same way.
eral, this step consists of: In order to help in understanding the process, we
a. selection of modeling technique(s) – choose can provide a simple example. One of the possible ap-
the method(s) for modeling and examining plications of the CRISP-DM methodology is to provide
their assumptions, tools in support of clinical diagnosis in medicine. For
b. generation of test design – plan for training, example, our goal is to improve breast cancer diagnos-
testing, and evaluating the models, tics using data about patients. In terms of CRISP-DM
c. creation of models – running the selected meth- methodology we can describe the KDP in the following
ods, way:
d. assessment of generated models – analysis of 1. Business understanding – from a business perspec-
models and their qualities, revision of param- tive, our business objective goal is to improve the
eters, and rebuild. effectiveness of breast cancer diagnostics. Here we
5. Evaluation – with some high-quality models (ac- can provide some expectation in numbers related
cording to the data analysis goal), such models are to diagnostics effectiveness and costs of additional
evaluated from a business perspective. The analyst re- medical tests, in order to set up business goals –
views the process of model construction (to find in- for example, if our diagnosis using some basic setup
sufficiently covered business issues) and also decides will be more effective, it reduces the costs by 20%.
on the next usage of data mining results. Therefore, Then data mining goals are defined. In terms of data
we have: mining, it is a classification task with the binary tar-
a. evaluation of the results – assessment of results get attribute, which will be tested using a confusion
and identification of approved models, matrix, and according to business goals we want to
b. process review – summarize the process, iden- achieve at least 95% accuracy of the classifier to ful-
tify activities which need another iteration, fill the business goal. According to the project plan,
c. determination of the next step – a list of further we know that data are available in CSV format, and
actions is provided, including their advantages data and models are processed in R using RStudio,
and disadvantages, with the Rshiny web application (on available server
d. decision – describe the decision as to how to infrastructure) providing the interface for doctors in
proceed. their diagnostic process.
6. Deployment – discovered knowledge is organized 2. Data understanding – in this example, let us say we
and presented in the form of reports or some com- have data collected from the Wisconsin Diagnosis
plex deployment is done. Also, this can be a step that Breast Cancer (WDBC) database. We need to under-
finishes one of the cycles if we have an iterative ap- stand the data themselves, and what are their at-
CHAPTER 1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics 9

tributes and what is their meaning. In this case, we 1.3.3 Proprietary Methodologies – Usage of
have 569 records with 32 attributes, which mostly Specific Tools
describe original images with/without breast cancer. While the research or open standard methodologies are
The first attribute is ID and the second attribute is tar- more general and tool-free, some of the leaders in the
get class (binary – the result of diagnosis). The other area of data analysis also provide to their customers pro-
30 real-valued attributes describe different aspects of prietary solutions, usually based on the usage of their
cells in the image (shape, texture, radius). We also software tools.
find no missing values, and we do not need any pro- One of such examples is the SEMMA methodol-
cedure to clean or transform data. We also explore ogy from the SAS Institute, which provided a process
data, visualize them, and describe relations between description on how to follow its data mining tools.
attributes and correlations, in order to have enough SEMMA is a list of steps that guide users in the im-
information for the next steps. plementation of a data mining project. While SEMMA
3. Data preparation – any integration, cleaning, and provides still quite a general overview of KDP, authors
transformation issues are solved here. In our ex- claim that it is a most logical organization of their tools
ample, there are no missing values other issues in to cover core data mining tasks (known as SAS Enter-
WDBC. There is only one data table, we will select prise Miner). The main difference of SEMMA with the
traditional KDD overview is that the first steps of appli-
all records, and we will not remove/add an attribute.
cation domain understanding (or business understand-
The data format is CSV, suitable for input in RStu-
ing in CRISP-DM) are skipped. SEMMA also does not
dio for the modeling step. We can also select subsets
include the knowledge application step, so the business
of data according to expected modeling and evalua-
aspect is out of scope for this methodology (Azevedo
tion, in this case, let us say a simple hold-out method
and Santos, 2008). Both these steps are in the knowl-
with different ratios for the size of training and test
edge discovery community considered as crucial for the
samples (80:20, 70:30, 60:40). success of projects. Moreover, applying this methodol-
4. Modeling – data mining models are created. In our ogy outside SAS software tools is not easy. The phases
case, we want classification models (algorithms), of SEMMA and related tasks are the following:
i.e., C4.5, Random Forests, neural networks, k-NN, 1. Sample – the first step is data sampling – a selection
SVM, and naive Bayes. We create models for differ- of the dataset and data partitioning for modeling;
ent hold-out selections and parameters of algorithms the dataset should be large enough to contain rep-
to achieve the best models. Then we evaluate mod- resentative information and content, but still small
els according to test subsets and select the best of enough to be processed efficiently.
them for further deployment, i.e., the SVM-based 2. Explore – understanding the data, performing ex-
model with more than 97% accuracy with 70:30 ploration analysis, examining relations between the
hold-out. variables, and checking anomalies, all using simple
5. Evaluation – the best models are analyzed from a statistics and mostly visualizations.
business point of view, i.e., whether we can achieve 3. Modify – methods to select, create, and transform
the business goal using such a model and its suffi- variables (attributes) in preparation for data model-
ciency for application in the deployment phase. We ing.
decide on how to proceed with the best model, and 4. Model – the application of data mining techniques
what the advantages and disadvantages are. For ex- on the prepared variables, the creation of models
ample, in this case, the application of the selected with (possibly) the desired outcome.
model can support doctors and remove one intru- 5. Assess – the evaluation of the modeling results, and
sive and expensive test out of diagnostics, in some of analysis of reliability and usefulness of the created
the new cases. models.
6. Deployment – a web-based application (based on IBM Analytics Services have designed a new method-
Rshiny) is created and deployed on the server, which ology for data mining/predictive analytics named An-
contains an extracted model (SVM classifier) and a alytics Solutions Unified Method for Data Mining/Pre-
user interface for the doctor in order to input results dictive Analytics (also known as ASUM-DM),3 which is
a refined and extended CRISP-DM. While strong points
of image characteristics from new patients (records)
and provide him/her with a diagnosis of such new 3 https://developer.ibm.com/predictiveanalytics/2015/10/16/have-you-
samples. seen-asum-dm/.
10 PART I Data

of CRISP-DM are on the analytical part, due to its open is Architecture-centric Agile Big data Analytics (AABA)
standard nature CRISP-DM does not cover the infras- (Chen et al., 2016), which addresses technical and orga-
tructure or operations side of implementing data min- nizational challenges of Big Data with the application
ing projects, i.e., it has only few project management of agile delivery. It integrates Big Data system Design
activities, and has no templates or guidelines for such (BDD) and Architecture-centric Agile Analytics (AAA)
tasks. with the architecture-supported DevOps model for ef-
The primary goal of ASUM-DM creation was to solve fective value discovery and continuous delivery of value.
the disadvantages mentioned above. It means that this The authors validated the method based on case studies
methodology retained CRISP-DM and augmented some from different domains and summarized several recom-
of the substeps with missing activities, tasks, guidelines, mendations for Big Data analytics:
and templates. Therefore, ASUM-DM is an extension or
• Data analysts should be involved already in the busi-
refinement of CRISP-DM, mainly in the more detailed
ness analysis phase.
formalization of steps and application of (IBM-based)
• There should be continuous architecture support.
analytics tools. ASUM-DM is available in two versions –
• Agile steps are important and helpful due to fast
an internal IBM version and an external version. The in-
technology and requirements changes in this area.
ternal version is a full-scale version with attached assets,
and the external version is a scaled-down version with- • Whenever possible, it is better to follow the reference
out attached assets. Some of these ASUM-DM assets or a architecture to make development and evolution of
modified version are available through a service engage- data processing much easier.
ment with IBM Analytics Services. Like SEMMA, it is a • Feedback loops need to be open and should include
proprietary-based methodology, but more detailed and both technical and business aspects.
with a broad scope of covered steps within the analyti- As we already mentioned, processing of data and
cal project. their lifecycle is quite an important aspect in this area.
At the end of this section, we also mention that KDPs Moreover, the setup of processing architecture and tech-
can be easily extended using agile methods, initially nology stack is probably of the same importance in
developed for software development. The main appli- the Big Data context. One approach for solving such
cation of agile-based aspects is logically in larger teams issues is related to the Big Data Integrator (BDI)Plat-
in the industrial area. Many approaches are adapted form (Ermilov et al., 2017), developed within the Big
explicitly for some company and are therefore propri- Data Europe H2020 flagship project, which provides
etary. Generally, KDP is iterative, and the inclusion of distribution of Big Data components as one platform
more agile aspects is quite natural (Nascimento and with easy installation and setup. While there are several
de Oliveira, 2012). The AgileKDD method fulfills the other similar distributions, authors of this platform also
OpenUP lifecycle, which implements Agile Manifesto. provided to potential users a methodology for devel-
The project consists of sprints with fixed deadlines (usu- oping Big Data stack applications and several use cases
ally a few weeks). Each sprint must deliver incremental from different domains. One of their inspirations was to
value. Another example of an agile process description use the CRISP-DM structure and terminology and apply
is also ASUM-DM from IBM, which combines project
them to a Big Data context, like in Grady (2016), where
management and agility principles.
the author extends CRISP-DM to process scientific Big
1.3.4 Methodologies in Big Data Context Data. In the scope of the BDI Platform, authors pro-
posed a BDI Stack Lifecycle methodology, which sup-
Traditional methodologies are usually applied also in
Big Data projects. The problem here is that none of the ports the creation, deployment, and maintenance of the
traditional standards support the description of the ex- complex Big Data applications. The BDI Stack Lifecycle
ecution environment or workflow lifecycle aspects. In consists of the following steps (they developed docu-
the case of Big Data projects, it is an important issue mentation and tools for each of the steps):
due to the complex cluster of distributed services im- 1. Development – templates for technological frame-
plemented using the various technologies (distributed works, most common programming languages, dif-
databases, frameworks for distributed processing, mes- ferent IDEs applied, distribution formalized for the
sage queues, data provenance tools, coordination, and needs of users (data processing task).
synchronization tools). An interesting paper discussing 2. Packaging – dockerization and publishing of the de-
these aspects is Ponsard et al. (2017). One of the men- veloped or existing components, including best prac-
tioned methodologies related to Big Data in this paper tices that can help the user to decide.
CHAPTER 1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics 11

3. Composition – assembly of a BDI stack, integration The primary standards designed for the process mod-
of several components to address the defined data eling are flowcharting techniques which represent the
processing task. process using a graph diagram. Nodes of the diagram
4. Enhancement – an extension of BDI stack with en- correspond to the performed process activities, and
hancement tools (daemons, logging) that provides edges represent control flow. This flowchart represents
monitoring. the execution ordering of the activities or data flow,
5. Deployment – instantiation of a BDI stack on physi- i.e., how the data objects pass from one operation to
cal or virtual servers. another one. Examples of the standards based on the
6. Monitoring – observing the status of a running BDI graphical notation include the Business Process Model
stack, repetition of BDI components, and architec- and Notation (BPMN4 ) or the Unified Modeling Lan-
ture development when need. guage (UML5 ) notation. BPMN models consist of sim-
ple diagrams constructed from a limited set of graph-
ical elements with the flow objects (graph nodes) and
1.4 METHODOLOGIES IN ACTION connecting objects (graph edges). The flow objects rep-
resent activities and gateways which determine forking
In practice, when it is necessary to apply the method-
and merging of connection paths, depending on the
ology, specific views and needs are expected for users
conditions expressed. We can group flow objects using
(data analyst). The general data analysis methodologies
the swim lanes representing, for example, the organi-
are not very formalized, i.e., their direct application for
zation units or different roles in the process. A part of
machine-readable sharing or automation of data analy-
the process model for data processing can be additional
sis processes is not easy. We must look at ways how an-
annotations representing the data objects generated or
alysts brought methodologies in action within a more received by the activities. Activities can be atomic tasks,
precise context. This section will look at such aspects, or they can consist of further decomposed subprocesses.
especially on the automation of KDP and understand- BPMN is a graphical modeling notation, but version
ing of their steps through shared ontologies. 2.0 also specifies the basic execution semantics, and the
workflow engines can directly execute BPMN diagram
1.4.1 Standardization and Automation of modeling in order to automatize the processes. Addi-
Processes – Process Models tionally, BPMN models can be directly mapped to work-
The primary goal of process modeling is to represent flow execution languages, such as Web Services Business
the process in such a way that it can be analyzed, im- Process Execution Language (WS-BPEL6 ). The main dis-
proved, or automatized. In the scope of data analysis, advantage of the BPMN is the lack of direct support
the data analytical work is organized itself as the pro- for knowledge creation processes and support for deci-
cess consisting of the various steps, such as process and sion rules, and some ambiguity in the sharing of BPMN
data understanding, data preprocessing, and modeling. models.
Process modeling is also crucial for the data provenance, In comparison to BPMN, UML is a general purpose
where it is necessary to capture how the data were trans- modeling language which provides many types of dia-
formed using the sequence of operations represented as grams from two categories: types representing the struc-
the data flow process model. Additionally, the analyzed tural information and types representing the general
domain can be process-oriented, as is the case for ex- types of behavior, including types representing differ-
ample in the process industries, i.e., process models can ent aspects of interactions. The behavior types can be
be an essential part of the domain knowledge shared directly used for the process modeling using the activity
by the domain experts and data scientist. Depending diagrams or in some cases sequence diagrams. A UML
on the complexity of the model, the process model- activity diagram generally describes step-by-step opera-
ing is typically performed by the process analysts, who tional activities of the components in a modeled system
provide expertise in the modeling discipline in cooper- using a similar flowcharting technique like BPMN dia-
ation together with the domain experts. In the case of grams. The activity diagram consists of nodes represent-
data analysis, data scientists typically perform the pro- ing the activities or decision gateway with support for
cess analysis. Models based on some formalism, like in choice, iteration, and concurrency. For data flow mod-
the form of the sequence diagrams, can be designed di- eling, diagrams can be additionally annotated with the
rectly by the domain experts. For the alternative, the 4 http://www.bpmn.org.
process model can be derived directly from the observed 5 http://www.uml.org.
process events using the process mining tools. 6 http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html.
12 PART I Data

FIG. 1.4 Example of a PMML file (from DMG PMML examples).

references to the structural entities. The main diagram tive models such as Predictive Model Markup Language
for structural modeling is the class diagram, which al- (PMML7 ) or Portable Format for Analytics (PFA8 ). The
lows modeling classes and instances of entities together core of the PMML standard is a structural description of
with the data types of their properties and interdepen- the input and output data and parameters of the mod-
dencies. The main advantage of the UML is its general els, but the format also allows to specify a sequence
applicability to the different aspects which can be mod- of data processing transformation which allow for the
eled, including the structural aspects not included in the mapping of user data into a more desirable form to be
BPMN. used by the mining model. This sequence together with
The flowcharting techniques were also directly in- the data dictionary (structural specification of the in-
corporated into the tools for data analytics. Tools such put and output data) can be used to represent data flow
as IBM SPSS Modeler or RapidMiner provide a visual and data provenance. An example of a structure of one
interface which allows users to leverage statistical and PMML file representing a regression task is shown in
data mining algorithms without programming. The Fig. 1.4.
data processing and modeling process is represented In comparison to PMML, the PFA standard is a
as the graph chart with the nodes representing data more generic functional execution model which pro-
vides control structures, such as conditionals and loops
sources, transformation operations, machine learning
(like a typical programming language). In PFA, data
algorithms, or build models applied to the data. The
processing operations are represented as the function
data flowchart (or data stream) is stored using the
proprietary format, but most of the modeling tools 7 http://dmg.org/pmml/v4-3/GeneralStructure.html.
also support standards for exchanging of the predic- 8 http://dmg.org/pfa.
CHAPTER 1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics 13

with inputs and outputs. The standard provides the vo- the upper-level ontology and specific entities defined
cabulary of common operations for the basic data types in the domain ontologies. Upper-level and mid-level
and the language to specify user-defined functions. The ontologies are designed to be able to provide a mech-
data analysis process is then the composition of the anism for mapping subjects across domains. Mid-
functions. Although this approach is very flexible, it level ontologies usually provide a more specific ex-
lacks comprehensiveness of the graphical models. pression of abstract entities found in upper-level on-
tologies.
1.4.2 Understanding Each Other – Semantic • Domain ontologies – Domain ontologies specify en-
Models tities relevant to the domain and represent the most
During the data analysis process, domain experts have specific knowledge from the perspective of one do-
to share the domain knowledge with the data scientists main.
in order to understand business or research problem, The upper-level ontologies are especially important
identify goals of the data analysis tasks, identify relevant for the integration and sharing of the knowledge across
data, and understand relations between them. Data ana- multiple domains and provide a framework through
lysts also exchange knowledge in the opposite direction, which different systems can use a common base. The
i.e., they communicate with the domain experts during entities in upper-level ontologies are basic and univer-
the interpretation and the validation of the data analy- sal and are usually limited to meta, generic, abstract,
sis results. In order to capture the exchanged knowledge, and philosophical concepts. The following list describes
data analysts use various knowledge representations for some commonly used generic upper ontologies:
the externalization process. • Suggested Upper Merged Ontology (SUMO) –
Currently, the most elaborated knowledge represen- SUMO (Niles and Pease, 2001) was created by merg-
tation techniques are ontologies known from the Se- ing several public ontologies into one coherent struc-
mantic web area. Semantic web technologies cover the ture. Ontology is used for search and applied re-
whole stack for the representation of both knowledge search, linguistics, and logic. The SUMO core con-
about structure representing classes and instances of the tains approximately 1000 classes and 4000 axioms. It
entities and their relationship or procedural knowledge consists of SUMO core ontology, mid-level ontology,
in the form of the inference or production rules. The and a set of domain ontologies such as communica-
structural formalisms are based on the application of tion, economics, finance, and physical elements.
logic and were standardized as the ontology languages • CYC ontology – CYC (Lenat, 1995) provides a com-
such as Ontology Web Language (OWL9 ), developed by mon ontology, an upper-level language for defining
the World Wide Web Consortium (W3C). Its predeces- and creating arguments using ontology. CYC ontol-
sor is an RDF scheme developed as a standard language ogy is used in the field of natural language process-
for representing ontology. The highest priority during ing, word comprehensibility, answers to questions,
the design phase of OWL was to achieve better exten- and others.
sibility, modifiability, and interoperability. OWL is now • Descriptive Ontology for Linguistic and Cognitive
striving to achieve a good compromise between scala- Engineering (DOLCE) – A very significant top ontol-
bility and expressive power. ogy is DOLCE (Gangemi et al., 2002), which focuses
Semantic models based on the ontologies can be on capturing the ontological categories needed to
generally divided depending on the scope and level of represent the natural language and human reason.
specificity of the knowledge into the following three lev- Established upper-level categories are considered as
els:
cognitive artifacts that depend on human perception,
• Upper-level ontologies – Upper ontologies describe
cultural impulses, and social conventions. Categories
the most common entities, contain only general
include abstract quality, abstract area, physical ob-
specifications, and are used as a basis for specializa-
ject, the quantity of matter, physical quality, physical
tions. Typical entries in top ontology are, e.g., “en-
area, and process. DOLCE ontology applications in-
tity,” “object,” and “situation,” which include more
clude searching for multilingual information, web
specific concepts. Boundaries expressed by the top
systems and services, and e-learning.
levels of ontologies consist of general world knowl-
• Basic Formal Ontology (BFO) – The BFO focuses
edge that is not acquired by language.
on the role of providing a true upper-level ontol-
• Middle-level ontologies – Mid-level ontologies serve
ogy that can be used to support domain ontologies
as a bridge between the general entities defined in
developed, for example, for scientific research such
9 https://www.w3.org/OWL. as biomedicine (Smith et al., 2007). The BFO rec-
14 PART I Data

ognizes the basic distinction between the following tion of the proposed study, and documentation of the
two types of entities: the essential entities that per- results achieved. The OBI ontology uses rigid logic and
sist over time while preserving their identity and the semantics because it uses higher levels of ontology rela-
procedural entities that represent the entities that tionships to define higher levels and a set of relation-
are becoming and developing over time. The char- ships. OBI defines processes and contexts (materials,
acteristic feature of process entities is that they are tasks, tools, functions, properties) relevant to biomed-
expanded both in space and in time (Grenon and ical areas.
Smith, 2004). In comparison to OBI, EXPO (Soldatova and King,
• General Formal Ontology (GFO) – GFO (Herre et al., 2006) is a more generic ontology and is not specific
2006) is a basic ontology integrating objects and pro- to the biological domain. The EXPO ontology includes
cesses. GFO has a three-layer ontological architecture general knowledge of scientific experimental design,
consisting of an abstract top level, an abstract core methodology, and representation of results. The in-
level, and a basic level. This ontology involves ob- vestigator, method, outcome, and conclusion are the
jects as well as processes integrated into one coherent main results with which EXPO defines the types of two
framework. GFO ontology is designed to support in- main investigations, i.e., computational investigations
teroperability based on ontological mapping and re- and physical investigations. Ontology uses a subset of
duction principles. GFO is designed for applications, SUMO as the highest classes and minimizes the set of
especially in the medical, biological, and biomedical relationships to ensure compliance with existing for-
fields, but also in the field of economics and sociol- malisms. The EXPO ontology is a valuable resource for
ogy. a description of the experiments from various research
• Yet Another More Advanced Top-level Ontology
areas. The authors used EXPO ontology to describe high
(YAMATO) – The YAMATO ontology (Mizoguchi,
energy and phylogenetic experiments.
2010) was developed mainly to address the deficien-
The EXPO ontology was further extended to the
cies of other upper-level ontologies, such as DOLCE,
LABORS ontology, which defines research units such as
BFO, GFO, SUMO, and CYC. It concerns the solu-
investigation, study, testing, and repetition. These are
tions of qualities and quantities dealing with the
needed to describe the complex multilayer examina-
representation and content of things and the differ-
tions performed by the robot in a fully automatic way.
entiation of the processes of the process. The current
LABORS is used to create experimental robot survey de-
version of YAMATO has been widely used in several
scriptions that result in the formalization of more than
projects, such as the development of medical ontol-
ogy. 10,000 research units in a tree structure that is 10 levels
Regarding the mid-level ontologies, in recent years, deep. Formalization describes how the robot has con-
there is an increased need for formalized representa- tributed to the discovery of new science-related knowl-
tions of the data analytics processes and formal repre- edge through the process (Soldatova and King, 2006).
sentation of outcomes of research in general. Several Semantic technologies were also applied directly to
formalisms for describing scientific investigations and formalize knowledge about the data analytics processes
outcomes of research are available, but most of them are and KDD. The initial goal of this effort was to build an
specific for the particular domain (e.g., biomedicine). intelligent data mining assistant that combines plan-
Examples of such formalisms include Ontology of ning and metalearning for automatic design of data
Biomedical Investigations (OBI), or ontology of exper- mining workflows. The assistant relies on the formal-
iments (EXPO). These ontologies specify useful con- ized ontologies of data mining operators which spec-
cepts, which describe general processes producing out- ify constraints, required inputs, and provided outputs
put data given some input data and formalize outputs for various operations in data preprocessing, modeling,
and results of the data analytics investigations. and validation phases. Examples of the ontologies for
The goal of OBI ontology (Schober et al., 2007) is data mining/data analytics are the Data Mining OPti-
to provide a standard for the representation of biologi- mization Ontology (DMOP) (Hilario et al., 2011) and
cal and biomedical examinations. OBI is entirely in line the Ontology of Data Mining (OntoDM) (Panov et al.,
with existing formalizations in biomedical areas. On- 2013). The main concepts describe the following enti-
tology promotes consistent annotation of biomedical ties:
research regardless of the specific field of study. OBI de- • Datasets, consisting of data records of the specified
fines the investigation as a multipart process, including type, which can be primitive (nominal, Boolean, nu-
the design of a general design study, the implementa- meric) or structured (set, sequence, tree, graph).
CHAPTER 1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics 15

FIG. 1.5 Example of an EXPO ontology instance from the domain of high energy physics.

• Data mining tasks, which include predictive mod- discovery and traceability/reproducibility of the scien-
eling, pattern discovery, clustering, and probability tific experiments, EXPO also allows the reasoning (logi-
distribution estimation. cal inference) about the consistency and validity of the
• Generalization, the output of a data mining algo- conclusions stated in the articles or automatic genera-
rithm, which can be: predictive modeling, pattern tion of the new hypothesis for the further research.
discovery, clustering, and probability distribution es- The following example from the use of EXPO ontol-
timation. ogy, with the main properties of the structured record
• Data mining algorithms, which solve a data min- and their values, is illustrated in Fig. 1.5. The figure
ing task, produce generalizations from a dataset, and describes the fragment of the EXPO structured record
include components of algorithms such as distance (ontology instance) created by the annotation of the
functions, kernel functions, or refinement operators. scientific paper from the domain of high energy physics
describing the new estimate of the mass of the top quark
1.4.2.1 Example – EXPO (Mtop ) authored by the “D0 Collaboration” (approxi-
Scientific experiments and their results are usually de- mately 350 scientists). The experiment was unusual as
scribed in papers published in the scientific journals. In no new observational data were generated. Instead, it
these papers, the main aspects needed for the precise presented the results of applying a new statistical analy-
interpretation and reproducibility of the experiments sis method to existing data. No explicit hypothesis was
are presented in the natural language free text ambigu- put forward in the paper. However, the structured record
ously or implicitly. Therefore, it is difficult to search for includes the formalized description of the paper’s im-
the relevant experiments automatically, interpret their plicit experimental hypothesis, i.e., given the same ob-
results, or capture their reproducibility. The EXPO on- served data, the use of the new statistical method will
tology enables one to describe experiments in a more produce a more accurate estimate of Mtop than the orig-
explicit and unambiguous structured way. Besides the inal method.
16 PART I Data

The record consists of three main parts. The first specified algorithm parameter settings. All these aspects
part is the experiment classification according to the have to be covered in the description to achieve trace-
type (ComputationalExperiment: Simulation) and do- ability and reproducibility of the data mining process.
main. The classification terms are specified as the en- Fig. 1.6 presents the process aspect of a data mining al-
tries from the controlled vocabularies of existing clas- gorithm in more detail.
sification schemes or external ontologies (e.g., Dewey Each process has defined input and output entities
Decimal Classification [DDC]). This part of the EXPO which are linked to the process via has-specified-input
record allows efficient retrieval of the experiments rel- and has-specified-output relations. An input to an ap-
evant for the specified domain or problem. The next plication of a data mining algorithm is a dataset and
property describes the research hypothesis of the ex- parameter values, and as output, we get a generaliza-
periment in the natural language and (optionally) in tion, i.e., a data mining model such as a decision tree.
the artificial (logic) language which allows structural A dataset has as parts data items that are characterized
matching of the hypotheses during the retrieval/com- with a data type which can be primitive (i.e., nomi-
parison of experiments and automatic validation or nal values, numeric values, Boolean values) or com-
generation of new potential hypotheses for further re- bined into rows/ tuples. Besides the description of the
search. The third part describes the procedure or ap- building of data mining models, the OntoDM also sup-
plied method of the experiment, i.e., in this case, the ports the description of applying the model to the new
applied method was a statistical factored model. EXPO dataset (e.g., prediction using a decision tree). It al-
records explicitly describe its inputs, target variable, and lows us to describe very complex experimental processes
assumptions, which are necessary for the traceability built by the composition of multiple models and steps
and reproducibility of the experiment process. It also mixing the data mining method with other scientific
describes the conclusion of the experiment in the nat- methods, such as simulations.
ural language, but it is possible to use also an artificial
language for structural matching. 1.4.3 Knowledge Discovery Processes in
Astro/Geo Context
1.4.2.2 Example – OntoDM In research and practice of both domains related to sky
EXPO ontology provides concepts for the description and Earth observations, data analysis usually follows
of the scientific experiments on the upper level and similar steps as previously defined methodologies, even
describes hypotheses, assumptions, and results. It also if their application is in a more ad hoc way and termi-
defines elements for descriptions of the experimen- nology differs between particular cases. Usage of a stan-
tal methods for processing of the measured/simulated dard methodology (like CRISP-DM) is quite rare. Such
data, statistical testing of defined hypotheses, or pre- ad hoc implementation could bring problems when an-
diction of the expected results. The OntoDM ontology alysts do not recognize some of the issues usually ad-
extends the description of the experimental method for dressed by methodology. On the other hand, in recent
the application of data mining methods. years it looks like more experienced data scientists are
When applying the data mining method on the data, extending project teams (especially for large projects)
it is necessary to describe input datasets, data mining and their processes evolve closer to the standard appli-
tasks (i.e., if we are dealing with predictive or descriptive cation of KDP.
data mining), and the operation applied on data during While there are of course standard applications of
the preprocessing and methods used for the modeling data mining techniques, from data to extracted knowl-
(i.e., applied algorithms and their parameters settings). edge as a scientific result or engineering output for busi-
The description of data mining algorithms in OntoDM nesses, both domains share one specific type of projects.
covers three different aspects. The first aspect is the data In both astronomy and remote sensing, one of the spec-
mining algorithm specification (e.g., the C4.5 algorithm ified outputs of some data processing process can be a
for decision tree induction), which describes declarative data product. This concept is similar to the data ware-
elements of an algorithm, e.g., it specifies that analysts housing area (Ponniah, 2010), where the company pro-
can use the algorithm for solving a predictive modeling vides data from its infrastructure through data mart –
data mining task. The second aspect is the implementa- client-based structured access to a subpart of the avail-
tion of an algorithm in some tool or software library able data storage, often with rich query and search in-
(e.g., WekaJ48 algorithm implementation). The third terface.
aspect is the process aspect, which describes how to ap- Creation and maintenance of data product can be
ply a particular data mining algorithm to a dataset with a complex process and represents KDP itself, even if it
CHAPTER 1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics 17

FIG. 1.6 Example of OntoDM instantiation.

needs additional analysis to provide new knowledge. AstroGeoInformatics, it will be even more important to
Moreover, classic KDP is usually extended with lifecycle apply methods for querying data products, set up partic-
management aspects (see Section 1.3.4 for an example). ular processes, and share their understanding well. This
In both astronomy and remote sensing, new projects means that both process modeling and ontology-related
(new large telescopes and satellites with high-resolution aspects might be helpful in the combined effort of data
data) bring data-intensive research to a new level, and scientists from both areas. We will address some of them
even data product projects include almost all steps of in the following subsections.
KDP. For example, raw data of images are often pre-
processed using calibrations, denoising, and transfor- 1.4.3.1 Process Modeling Aspects
mations, and often some basic patterns are recognized From the process point of view, astronomy and remote
(in the modeling step using some machine learning) to sensing are going in a very similar way. The most cru-
provide better and more reliable data products for ana- cial aspect in process modeling is to support workflow-
lysts, who will use the product in data mining for new based approaches to Big Data produced by the instru-
knowledge. ments and their lifecycle management. In the 2020s, it
Therefore we have three different KDP instantiations will be the most critical data analytical process for both
in both domains, i.e., first, full KDP project from raw areas and it will affect any AstroGeoInformatics data
data to extracted knowledge (scientific result or business mining project.
output); second, data product projects, which reduce Regarding the process modeling aspects, there are
traditional KDP at the end of the process, i.e., the final three main lines in both domains:
result is the reliable data source for other analysts; third, • Some types of projects are supported by more indi-
slightly reduced KDP at the start, where some level of vidual software packages, suitable for querying data
data product is used, i.e., some of the preprocessing products, visualization, and modeling, thus provid-
steps are not needed, but usually querying the data is ing simple (often ad hoc or user-centric) process-
still an important step. It seems that the first case, which based analysis of particular data. This approach
splits into the second and third case, will be even more brings two types of tools:
rare with more data coming in the next years in both • general purpose packages – usually some
areas. In the projects combining data and methods for toolkit(s) with a connection to different data
18 PART I Data

products from different projects or data product development and land topography, weather forecast,
sites; a good example is an idea of the virtual ob- and security monitoring. In astronomy, the main focus
servatory, extensively applied in astronomy (see is on scientific results and the domain is quite narrow.
the CDS set of tools10 ), but also in the remote This leads to a more diverse view of tools and stan-
sensing area (see TELEIOS11 ); many of the pack- dards within remote sensing than in astronomy; also
ages support not only virtual observatory query- business-oriented applications force more business pro-
ing functions, but also analysis and visualization; cess modeling tools in remote sensing. On the other
• project-specific packages – especially with large hand, thanks to advancements in Big Data processing
projects, much effort is made to provide users architectures on the industry level, both areas will soon
not only with the data product but also with nec- share similar software and distributed systems as an in-
essary tools which help data scientists with the dustry. This effort will help in the development of any
analysis of the project data, e.g., Sentinel Tool- AstroGeoInformatics projects.
boxes12 in remote sensing.
• Instead of the previous case, which provides a more 1.4.3.2 Ontology-Related Aspects
classical KDP approach (without some standards of The usage of the ontologies for KDP in astronomy or re-
process modeling), an effort in data processing with mote sensing is sporadic, especially if we think about
grid/cloud services and high-performance comput- the complete process and the description of its steps
ing has been made. The most prominent examples as provided in ontologies like OntoDM. It is different
here were related to the execution of workflows, from some other domains, like bioinformatics, where
known under the term “scientific workflow systems” ontologies are also used to understand KDP steps in the
(e.g., DiscoveryNet, Apache Airavata, and Apache analysis itself. In both areas of astronomy and remote
Taverna). Here analysts were able to create (often sensing, ontologies are usually used in the following
through a graphical designer), maintain, and share
cases:
their workflows for processing of data. Again we have
• Light-weight taxonomies for classification of ob-
here both:
jects – the most frequent application of ontolo-
• general purpose systems – tools that are not con-
gies for a better understanding of objects in par-
nected to some specific domain(s); here we can
ticular domains. In astronomy, it means taxonomy
also put many of the classical business workflow
about astronomical objects, data of their observa-
systems;
tion and their properties. It is quite straightforward
• domain- or project-specific systems – again, some
and includes efforts like ontology for virtual obser-
scientific workflow systems more specific to some
domain(s) or project. vatory objects, or space object ontology, or others.
• As an evolution of scientific workflow systems, the In remote sensing, it is more diverse, with domain-
third line can be seen in more advanced platforms, specific semantic models related to the different ap-
which instead of only workflows also support ana- plication domains, e.g., an ontology for geoscience,
lysts with more features related to data science and agriculture, climate monitoring, and many others.
KDP. It means that more traditional scientific work- A more specific domain usually provides some se-
flow systems evolved into e-Science platforms, which mantic standard on how to share data, models,
assist scientists to see the evolution of data, their pro- methods, and application data for its purposes. The
cesses, intermediate results, and also final results and light-weight taxonomies bring specific data formats
potential publication of results. We can see this evo- (e.g., GeoJSON13 in remote sensing for geographic
lution as the advancement of both previous cases, or features), as well as controlled vocabulary for an-
simply as their enhanced combination. notations in order to achieve better understanding,
While previous aspects are shared, there are also composition, and execution of queries on real data.
some differences. Remote sensing provides more do- • Semantic models for processing architectures – some
mains of applications and also has a large number ontology-based standards can be used in the process-
of business-oriented applications, e.g., from science- ing of workflows, especially for the composition of
related topics to applied research in agriculture, urban services. It is not specific for these two areas, but it
generally follows the application of dynamic work-
10 Centre de Données astronomiques de Strasbourg – http://cds.u-
strasbg.fr/.
flow models in Big Data and data stream processing.
11 http://www.earthobservatory.eu/.
12 https://sentinel.esa.int/web/sentinel/toolboxes. 13 http://geojson.org/.
CHAPTER 1 Methodologies for Knowledge Discovery Processes in Context of AstroGeoInformatics 19

Especially in the remote sensing area, ontologies have now become interesting due to the effort to better understand hyperspectral imaging and the combination of different (and often dynamic) data products in applications. It is also crucial due to large initiatives towards open data; in geodata one of the most prominent initiatives is the Open Geospatial Consortium,14 which is responsible for the creation, maintenance, and provision of standards, including some semantic models for annotations. This effort for standardization and sharing of knowledge between application domains can be beneficial for any attempt to create an AstroGeoInformatics project. Moreover, any attempt to reuse approaches which try to share more KDP-like information (e.g., ontologies like EXPO, LABORS, OntoDM) can be beneficial for analysts and researchers from both domains in their work on such projects (because even terminology in KDP steps may differ).
In conclusion, we can say that KDP methodologies are a concept which can help especially in the preparation of cross-domain models and projects. Such ambition is also the case of the AstroGeoInformatics projects proposed in this book. In the next chapters, most of the methodology steps will be addressed both from an astronomy and a remote sensing point of view; many of them will bring shared ideas, methods, and tools, but also describe some differences and specifics. Moreover, some examples of AstroGeoInformatics ideas and projects will be shown at the end of the book on selected use cases. In the future, such synergy can be supported by the methodologies for KDPs with the ontologies used for better understanding of particular steps, data, methods, and models.

14 http://www.opengeospatial.org/.

REFERENCES
Anand, S.S., Buchner, A.G., 1998. Decision Support Using Data Mining. Financial Times Management, London, UK.
Anderson, L.W., Krathwohl, D.R. (Eds.), 2001. A Taxonomy for Learning, Teaching, and Assessing. A Revision of Bloom's Taxonomy of Educational Objectives, 2 edn. Allyn & Bacon, New York.
Azevedo, A., Santos, M.F., 2008. KDD, SEMMA and CRISP-DM: a parallel overview. In: Abraham, A. (Ed.), IADIS European Conf. Data Mining. IADIS, pp. 182–185.
Beckman, T., 1997. A Methodology for Knowledge Management. International Association of Science and Technology for Development (IASTED).
Bouyssou, D., Dubois, D., Prade, H., Pirlot, M., 2010. Decision Making Process: Concepts and Methods. ISTE, Wiley.
Brachman, R.J., Anand, T., 1996. The process of knowledge discovery in databases: a human-centered approach. In: Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.), Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, USA, pp. 37–58.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., Zanasi, A., 1998. Discovering Data Mining: From Concept to Implementation. Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R., 2000. CRISP-DM 1.0 step-by-step data mining guide. Technical report. The CRISP-DM Consortium.
Chen, H.-M., Kazman, R., Haziyev, S., 2016. Agile big data analytics development: an architecture-centric approach. In: Proceedings of the 2016 49th Hawaii International Conference on System Sciences (HICSS). HICSS '16. IEEE Computer Society, Washington, DC, USA, pp. 5378–5387.
Cios, K.J., Swiniarski, R.W., Pedrycz, W., Kurgan, L.A., 2007. The Knowledge Discovery Process. Springer US, Boston, MA, pp. 9–24.
Ermilov, I., Ngonga Ngomo, A.-C., Versteden, A., Jabeen, H., Sejdiu, G., Argyriou, G., Selmi, L., Jakobitsch, J., Lehmann, J., 2017. Managing lifecycle of big data applications. In: Rozewski, P., Lange, C. (Eds.), Knowledge Engineering and Semantic Web. Springer International Publishing, Cham, pp. 263–276.
Evans, J.R., 2015. Business Analytics. Pearson.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From data mining to knowledge discovery in databases. AI Magazine 17, 37–54.
Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L., 2002. Sweetening ontologies with DOLCE. In: Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 166–181.
Grady, N.W., 2016. KDD meets big data. In: 2016 IEEE International Conference on Big Data. Big Data, pp. 1603–1608.
Grenon, P., Smith, B., 2004. SNAP and SPAN: towards dynamic spatial ontology. Spatial Cognition and Computation 4 (1), 69–103.
Herre, H., et al., 2006. General Formal Ontology (GFO): A Foundational Ontology Integrating Objects and Processes. Part I: Basic Principles. OntoMed, Leipzig.
Hilario, M., Nguyen, P., Do, H., Woznica, A., Kalousis, A., 2011. Ontology-Based Meta-Mining of Knowledge Discovery Workflows. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 273–315.
Kurgan, L.A., Musilek, P., 2006. A survey of knowledge discovery and data mining process models. Knowledge Engineering Review 21 (1), 1–24.
Lenat, D.B., 1995. CYC: a large-scale investment in knowledge infrastructure. Communications of the ACM 38 (11), 33–38.
Mariscal, G., Marban, O., Fernandez, C., 2010. A survey of data mining and knowledge discovery process models and methodologies. Knowledge Engineering Review 25 (2), 137–166.
Mizoguchi, R., 2010. YAMATO: Yet Another More Advanced Top-level Ontology.
Nascimento, G.S., de Oliveira, A.A., 2012. An agile knowledge discovery in databases software process. In: Data and Knowledge Engineering. Springer Berlin Heidelberg, pp. 56–64.
Niles, I., Pease, A., 2001. Towards a standard upper ontology. In: Proceedings of the International Conference on Formal Ontology in Information Systems - Volume 2001. FOIS '01. ACM, New York, NY, USA, pp. 2–9.
Nonaka, I., Toyama, R., Konno, N., 2000. SECI, Ba and leadership: a unified model of dynamic knowledge creation. Long Range Planning 33, 5–34.
Panov, P., Soldatova, L.N., Dzeroski, S., 2013. OntoDM-KDD: ontology for representing the knowledge discovery process. In: Discovery Science - 16th International Conference. Proceedings. DS 2013, Singapore, October 6–9, 2013, pp. 126–140.
Ponce, Julio (Ed.), 2009. Data Mining and Knowledge Discovery in Real Life Applications. IntechOpen.
Ponniah, P., 2010. Data Warehousing Fundamentals for IT Professionals. Wiley, New Jersey, USA.
Ponsard, C., Touzani, M., Majchrowski, A., 2017. Combining process guidance and industrial feedback for successfully deploying big data projects. Open Journal of Big Data 3 (1), 26–41.
Rogalewicz, M., Sika, R., 2016. Methodologies of knowledge discovery from data and data mining methods in mechanical engineering. Management and Production Engineering Review 7 (4), 97–108.
Rowley, J., 2007. The wisdom hierarchy: representations of the DIKW hierarchy. Journal of Information Science 33 (2), 163–180.
SAS Institute Inc., 2017. SAS Enterprise Miner 14.3: Reference Help. SAS Institute Inc., Cary, NC, USA.
Schober, D., Kusnierczyk, W., Lewis, S., Lomax, J., 2007. Towards naming conventions for use in controlled vocabulary and ontology engineering. In: Proceedings of BioOntologies SIG ISMB. Oxford University Press, pp. 2–9.
Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S.-A., Scheuermann, R.H., Shah, N., Whetzel, P.L., Lewis, S., 2007. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25 (11), 1251–1255.
Soldatova, L.N., King, R.D., 2006. An ontology of scientific experiments. Journal of the Royal Society Interface 3 (11), 795–803.
CHAPTER 2
Historical Background of Big Data in Astro and Geo Context
CHRISTIAN MULLER, DR
2.1 HISTORY OF BIG DATA AND ASTRONOMY
2.1.1 Big Data Before Printing and the Computer Age
Astronomy began in prehistoric times with the observation of the apparent motions of the sun, moon, stars, and planets in the Earth's sky. Big Data was not easy to define before the computer age. According to Zhang and Zhao (2015), Big Data is defined by ten characteristics, each beginning with a V: volume, variety, velocity, veracity, validity, value, variability, venue, vocabulary, and vagueness. This list comes from previous studies of Big Data in other domains, such as marketing and quality control. They consider that the four terms volume, variety, velocity, and value apply to astronomy.
Velocity did not exist before the computer age, as the acquisition of observations and their recording and publishing were entirely manual until the printing press and the electric telegraph.
However, volume was already present in the early descriptions of the night sky: "as the stars of heaven" is a biblical equivalent of infinite, meaning "which cannot be counted."
Variety corresponds to the different observed objects; the ancients had only optical observations, but the conditions of these observations differed between observer sites. Their eyesight was also probably better trained. An ethnological study (Griaule and Dieterlen, 1950) revealed that the cosmogony of the Dogon people in present-day Mali indicated two companions of Sirius, four satellites of Jupiter, a ring around Jupiter, and knowledge of the phases of Venus. This kind of oral tradition will probably be difficult to verify in other ethnic groups, as more and more literacy is spreading over the entire world and oral transmission is disappearing.
Value corresponds to the progress in general knowledge associated with the observations and to their practical application in navigation or the calendar. For example, the heliacal rise of Sirius had an extreme importance for the Egyptian calendar related to its coincidence with the flooding of the Nile (Nickiforov and Petrova, 2012). In this respect, the corpus of Egyptian observations led to the Egyptian calendar, which was adopted by Julius Caesar when he became the ruler of Egypt and which, with a minor revision in the late 16th century, became our current calendar.
Big Data cannot exist without data preservation. The first steps were the compilation of star catalogues, which began in Hellenistic times when the infrastructure of the Alexandria museum and library were available (Thurmond, 2003). Star catalogues not only give the names of stars but also their positions. Hipparcos combined previous Babylonian observations, Greek geometrical knowledge, and his own observations in Rhodes, and he was the first to correct his data for precession, but as his manuscripts are lost, he is essentially known by his distant successor, Claudius Ptolemy from Alexandria, whose catalogue, the Almagest, has been preserved (Grasshoff, 1990). The number of stars of the original manuscript is not absolutely clear; Thurmond indicates a number of 1028. See Fig. 2.1.
The precision of the observations was sometimes unequal as different observation sites had been used and refraction was not corrected for. The Almagest became the main reference until the end of the Middle Ages, when several versions made their way to the Arab world and Arab astronomers both added new observations and adapted the book to their own epochs. Finally, the Almagest came back to the Western world through the Latin translation from an Arabic version by Gerard of Cremona in 1175. None of the Arabic versions increased the number of stars; some, due to the observation latitude, even mention fewer stars than Ptolemy. The first new catalogue to appear was undertaken by Ulugh Beg in Samarkand in the 15th century using only original observations from an especially designed large observatory, correcting the errors made by Ptolemy in converting the Hipparcos observations. This time, only 996 stars were observed. This catalogue was fully translated in Europe only in 1665, but it was known in the Arab, Persian, and Ottoman worlds.

FIG. 2.1 Almagest zodiac in the first complete printing by Petri Liechtenstein (1515), United States Library
of Congress. Printing had two advantages: the multiplication of copies and thus a better dissemination, and
protection against copyist’s errors or “improvements.”

2.1.2 The Printing and Technological Renaissance Revolution
The 16th century was marked by three important evolutions: the generalization of open sea navigation using astronomical positioning techniques, the appearance of accurate mechanical clocks, and the development of astrology. All these necessitated better star catalogues and planetary ephemerides. At the same time, the printing technology allowed the diffusion of the astronomical writings and was followed by a real explosion of
the number of publications (Houzeau and Lancaster, 1887). Printing secured two important elements of Big Data: preservation of controlled copies and availability to a larger number of users.
Astronavigation was already used in the 15th century by the Portuguese, Arab, and Chinese navigators, but proved to be very risky during the first intercontinental explorations. It is in this context that the Ottoman sultan Murad III ordered the construction of a large observatory in Constantinople superior to the Ulugh Beg observatory and equipped with mechanical clocks. The chief astronomer, Taqi ad-Din, wanted to correct the previous catalogues and ephemerides to promote improvement in cartography (Ayduz, 2004). He improved and designed new instruments much superior to previous versions. Unfortunately, the observatory was destroyed in 1570 due to a religious decree condemning astrology.
Almost simultaneously, Tycho Brahe equipped a huge observatory in Denmark with instruments and not only used up to 100 assistants, but also spent, for 30 years, about 1% of the total budget of Denmark (Couper et al., 2007). Tycho Brahe was the first to take refraction into account and to analyze observational errors. His huge accomplishments were transferred to Prague where he became the astronomer of emperor Rodolph II and was assisted by Johannes Kepler, who succeeded him. Kepler demonstrated the existence of the heliocentric system and determined the parameters of the planetary elliptical orbits using Tycho's data. The quantity of data measured and reduced by Tycho Brahe and their accuracy were an order of magnitude greater than what existed before, representing maybe the first instance of Big Data improving science.
Astrology was the main application of this scientific project and the tables produced by Kepler. The Rudolphine Tables are still used by present-day astrologers, who usually do not have the means to adapt the epoch. Astrology at the time was the equivalent of present-day business intelligence and was commonly used for any kind of forecasts. Galileo taught medical students the art of establishing the horoscopes of their patients. Galileo was in this respect accused in a first inquisition trial of fatalism, which is the catholic sin of believing that the future can be certainly known to human intelligence (Westman, 2011). At the same time, Lloyd's of London were determining marine insurance fees by the expected technique of inspecting the ships and crew records, but the last judgment was left to astrologers (Thomas, 2003). Astrology cannot be considered a precursor of Big Data and their role in business intelligence, as large-scale statistical treatments of economic data were first given by Adam Smith (1778) at the end of the 18th century. Astrologers would rely on a feeling based on their knowledge which they could not quantify for everything outside their analysis of the sky. Astrology became suspected of being linked to superstition during the English Reformation, but luckily, astronomy became a respected science in Great Britain. For example, the founder of the London stock exchange, Thomas Gresham, established Gresham College in the late 16th century for the education of the young bankers and traders with the following professorships: astronomy, divinity, geometry, law, music, physics, and rhetoric. "The astronomy reader is to read in his solemn lectures, first the principles of the sphere, and the theory of the planets, and the use of the astrolabe and the staff, and other common instruments for the capacity of mariners." This program did not make any mention of astrology and its use as a predictive tool in commodity trading.
Robert Hooke, who was professor at Gresham College, insisted on the use of telescopic observations in order to increase the number of stars and their positional accuracy, but this important progress was only initiated by John Flamsteed, the first Astronomer Royal, who exceeded the precision of Tycho Brahe's observations and published a catalogue of 2866 stars in 1712 (Thurmond, 2003). At that time, a marine chronometer accurate to one minute in six hours existed and an able seaman was for the first time able to determine an accurate position by using the sextant without any other information. Better marine chronometers were progressively developed (Landes, 1983), but due to their high price, their generalization had to wait until the 19th century. Flamsteed got a commission to build the Greenwich observatory in close connection with the British Admiralty; the accurate chronometers designed by John Harrison for this observatory were essential to the exploration of the Southern hemisphere oceans by Captain Cook and his followers.
Later, the 18th and 19th centuries saw the astronomical observations being extended to the Southern hemisphere. At the end of the 19th century, the photographic technique allowed another order of magnitude to be gained in the number of stars. At the beginning of the 20th century, about 500,000 stars had been identified and several catalogues were under development. The last catalogue before the space age was the Smithsonian Astrophysical Observatory catalogue in 1965, with 258,997 stars listed with 15 description elements for each. The SAO catalogue used electronic data treatment since the middle of the 1950s and is the first to fully meet the definition of Big Data given in the first paragraph.
It is now succeeded by the new efforts based on space age techniques and the massive use of large databases which constitute the basis of the BigSkyEarth COST action. See Fig. 2.2.

FIG. 2.2 Frontispiece of the Rudolphine Tables: Tabulae Rudolphinae, quibus Astronomicae scientiae, temporum longinquitate collapsae Restauratio continetur by Johannes Kepler (1571–1630) (Jonas Saur, Ulm, 1627).

2.2 BIG DATA AND METEOROLOGY: A LONG HISTORY
2.2.1 Early Meteorology
The study and comparison of large amounts of observations constituted the early base of meteorology. The repetition of phenomena proved very early to be less regular than astronomical events, and even extreme events were the unpredictable action of the gods. The Babylonians and Egyptians compiled a lot of observations without relating them. A big step forward was the classifications and typologies assembled by Aristotle and the structuring of these early sources. Aristotle was also the successor of the Greek natural philosophers who attempted to relate the observations to their causes so that they could explain them and even attempt forecasts. Aristotle was the first to describe the hydrologic cycle. His knowledge of prevailing winds as a function of season proved to be essential to the conquest of Greece by the Macedonian army, the Greek islands being unable to send troops to their allies in the continental cities in time due to contrary winds. The meteorology of Aristotle covered a wider context than now because it included everything in the terrestrial sphere up to the orbit of the moon and thus would have included geology and what is now called space weather (Frisinger, 1972).
Unfortunately, Aristotle's efforts were not continued for long. His successor Theophrastus compiled signs which in combination could lead to a weather forecast. These advances did not prevent most of the population from attributing weather to divine intervention, and when Christianity and Islam took over, the pagan gods were replaced by demons. No systematic records of weather were kept, and present climate historians have to resort to agricultural records or indications in chronicles. During the Renaissance, the revival of Hippocratic medicine led physicians to consider the relation between the environment and health and to record meteorological data again; similarly, the logbooks of the ships at sea became more systematic, leading in the 18th century to the first large set of meteorological data which began to follow a standardization process, as exemplified by the Societas Meteorologica Palatina (Meteorological Society of Mannheim) (Kington, 1974), which started in 1780 and established a network of 39 weather observation stations: 14 in Germany, and the rest in other countries, including four in the United States, all equipped with comparable and calibrated instruments: barometers, thermometers, hygrometers, and some with a wind vane. During the 19th century, more meteorological observatories were established in Europe, North America, and in the British Empire. The progress of telegraphic communications led to consideration of the establishment of a synoptic database of identical meteorological parameters measured at different observatories.

2.2.2 Birth of International Synoptic Meteorology
The breakthrough occurred with Leverrier in 1854. Leverrier was a French astronomer who reached celebrity by predicting the position of Neptune from perturbations of the Uranus orbit. Galle at the Berlin observatory was then able to observe the planet at the predicted position. Following a disastrous storm in the Black Sea during
the Anglo-French siege of Sebastopol, the French government commissioned Leverrier to determine whether, with an extensive network of stations, the storm could have been predicted. After analysis, he determined that the storm had originated in the Atlantic several days before the disaster and that a synoptic network would have allowed it to be followed and a rough forecast of its arrival in the Black Sea to be made (Lequeux, 2013). Unfortunately, Leverrier could never assemble the legions of laborers necessary to study the long-term physical causes of weather and climate. His efforts were however the first steps to the creation of an international synoptic network in parallel to the geomagnetic network already developed by Gauss, Sabine, and Quetelet (Kamide and Chian, 2007). The extension of the geophysical network to meteorology was rapid due to the establishment of meteorological services in most observatories and the development of the electric telegraph. These founding fathers made an unprecedented effort to internationalize the work, and most notably, the Dutch meteorologist Buys-Ballot, founder of the Royal Dutch Meteorological Institute, published the empirically discovered relation between cyclones, anticyclones, and wind direction, introducing fluid physics to meteorology and the basis of future forecasting models (WMO, 1973).
These early networks hardly fit the definition of Big Data: the telegraphic systems of the different countries were not standard, the archiving of the data was not uniform, and a lot of parameters were station- or operator-dependent. The exchange of processed data as hourly averages was not evident. However, around 1865, the generalization of the Morse telegraphic protocols together with the application of the newly discovered Maxwell equations improved the reliability of the telegraph, and regular exchange of data between stations became the norm. International meteorological conferences regularly met, beginning in 1853 at the initiative of the United States Naval Observatory, the first one presided over by the director of the Brussels Observatory, Adolphe Quetelet. Even though fewer than 15 countries were represented, no explicit resolutions came from this first meeting because any recommendation would have led to modifications of the practice of the signatories; the wording was very general, e.g., "that some uniformity in the logs for recording marine meteorological observations is desirable." Anyway, a process was started, which led in 1873 to the foundation of the International Meteorological Organization at a Vienna international conference led by Buys-Ballot (WMO, 1973). This new organization proved to be strong enough to standardize practices in the entire world. It established a permanent scientific committee which began by adopting common definitions of the meteorological parameters. See Fig. 2.3.

2.2.3 Next Step: Extension of Data Collection to the Entire Globe
The distribution of stations of this first network was heavily biased to Western Europe and the Eastern United States. It was clear at the beginning that a real network should extend to the entire world, including the Southern hemisphere. As a permanent extension was beyond the means of the early International Meteorological Organization, periodic campaigns for the study of polar regions were proposed by several countries, combining exploration and maritime observations. The first one, in 1883, was concentrated on the Arctic ocean and a few sub-Antarctic stations. The observations took place between 1881 and 1884 and demonstrated the feasibility of a network extension.
The success of the first campaign led to the second International Polar Year in 1932–1933. This campaign was initiated and led by the International Meteorological Organization and extended to geomagnetism and ionospheric studies; more countries participated, and the program included simultaneous observations at low latitudes. This campaign should have included a network of Antarctic stations, but the financial crisis of the time limited the funding means of the participating countries. The collection and use of Big Data was already envisaged by the establishment of World Data Centers centralizing data by themes.
The Second World War extended to the entire Northern hemisphere and parts of the Southern Pacific. Meteorological forecasts were essential, and the allies decided on a very wide synoptic network. This effort was led by the UK Met Office, which exfiltrated qualified meteorologists from Norway and several other occupied European countries. The Germans took a more theoretical approach, demanding fewer stations. The Anglo-American meteorological forecasts, with a better time resolution, were essential in planning successful amphibious operations at the end of the war, as well as air force support. After the war, the extension of this network to the Southern hemisphere led to the 1947 US Navy Highjump operation, combining the exploration of Antarctica and the establishment of stations. This expedition led to numerous accidents, which confirmed that military claim and occupation of Antarctica were beyond the means of any nation. Most of these accidents were related to errors in the positioning of ships and aircraft related to the proximity of the South Pole and weather conditions. The staff of this huge expedition included the ionospheric scientist Lloyd Berkner,
FIG. 2.3 Early synoptic map of Swedish origin (https://en.wikipedia.org/wiki/Timeline_of_meteorology#/media/File:Synoptic_chart_1874.png). Sea level pressure is indicated, as well as an indication of surface winds, demonstrating the success of the International Meteorological Organization at its foundation in 1873. Until the early 1970s, isobar lines were drafted by hand to fit the results of the stations; it was only in the last quarter of the 20th century that they were automatically plotted, integrating data from other origins such as airplanes and satellites.

who, after designing the radio communications of the 1929 Byrd Antarctica expedition, multiplied the executive roles in scientific unions while continuing research. He later played an important role in coordinating electronic operations for the US Navy in the Second World War. His positions as a rear admiral, a presidential adviser, and the president of the International Council of Scientific Unions (ICSU) helped him to initiate in 1950 the International Geophysical Year (IGY) project and to take the first steps of the Antarctic treaty. The purpose of IGY was to extend the observations to the entire globe with the cooperation of the Soviet Union and all other scientifically active countries (National Academy Press, 1992). See Fig. 2.5.
The Second World War had seen an increase in the number of weather ships, as these also supported transatlantic air traffic. This network was officialized, and these stations are shown in Fig. 2.4. Unfortunately, their high cost led to their progressive retirement after IGY, when their function was taken over by instrumented merchant ships and commercial airliners. Also, beginning in 1960, experimental satellites were devoted to meteorological observations until evolving into the present network of civilian operational meteorological satellites operated by both EUMETSAT and NOAA.
FIG. 2.4 The 12 Arctic stations of the 1883 International Polar Year, NOAA, https://www.pmel.noaa.gov/arctic-zone/ipy-1/index.htm.

FIG. 2.5 Photograph of one of the first preparatory meetings of the IGY at the US Naval Air Weapons
Station at China Lake (California) in 1950. The scientists present around Lloyd Berkner and Sydney
Chapman on this image represent three quarters of the world authorities on ionosphere and upper
atmosphere at the time. A similar group today would include much more than 10,000 participants
(Pr. Nicolet private archive).
FIG. 2.6 Extension of the network of WMO stations from a European network in the middle of the 19th
century to the current network. The stations are color-coded to indicate the first year in which they provided
12 months of data (Hashemi, 2009).

EUMETSAT is a consortium of meteorological organizations regrouping most of Western and Central Europe, including Turkey. It operates both its own network of geostationary METEOSAT satellites and METOP in polar orbit. Since the 2010s, it has collaborated with the COPERNICUS Sentinel satellites managed by ESA for the European Union. The data are used for forecasts by the European Centre for Medium Range Weather Forecasts (ECMWF) to produce forecast maps for the entire world. In 2019 these have a 20 km resolution, and should reach the 5 km resolution during the 2020s. See Fig. 2.6.
The total amount of data coming from all these sources is difficult to estimate as the definition of data covers all aspects of the raw and processed data. Currently, COPERNICUS, which is not yet in complete operation, generates about 10 petabytes per year; ECMWF claimed in 2017 to have archived more than 130 petabytes of meteorological data, beginning essentially in the 1980s, when EUMETSAT and NOAA data flows started their exponential increase.1
Big Data have clearly become a part of the observational database. More and more, Big Data enter the world of forecasts through techniques such as assimilation, where the model is tuned to minimize the gaps between observations and the forecast, and the ensemble techniques, in which a large number of instances of one or several models are run in parallel and in which the final analysis uses statistical techniques (WMO, 2012).

1 http://copernicus.eu/news/what-can-you-do-130-petabytes-data.

REFERENCES
Ayduz, S., 2004. Science and Related Institutions in the Ottoman Empire During the Classical Period. Foundation for Science, Technology and Civilisation, London.
Couper, H., Henbest, N., Clarke, A.C., 2007. The History of Astronomy. Firefly Books, Richmond Hill, Ontario.
Frisinger, H.H., 1972. Aristotle and his meteorology. Bulletin of the American Meteorological Society 53, 634–638. https://doi.org/10.1175/1520-0477(1972)053<0634:AAH>2.0.CO;2.
Grasshoff, G., 1990. The History of Ptolemy's Star Catalogue. Springer Verlag.
Griaule, M., Dieterlen, G., 1950. Un Système soudanais de Sirius. Journal de la Société des Africanistes 20 (2), 273–294.
Hashemi, K., 2009. http://homeclimateanalysis.blogspot.be/2009/12/station-distribution.html. Climate blog. Brandeis University.
Houzeau, J.C., Lancaster, A., 1887. Bibliographie Générale de L'Astronomie. Hayez, Bruxelles.
Kamide, Y., Chian, A., 2007. Handbook of the Solar-Terrestrial Environment. Springer Science & Business Media.
Kington, J.A., 1974. The Societas Meteorologica Palatina: an eighteenth-century meteorological society. Weather 29, 416–426. https://doi.org/10.1002/j.1477-8696.1974.tb04330.x.
Landes, David S., 1983. Revolution in Time. Belknap Press of Harvard University Press, Cambridge, Massachusetts. ISBN 0-674-76800-0.
Lequeux, J., 2013. Le Verrier and meteorology. In: Le Verrier—Magnificent and Detestable Astronomer. In: Astrophysics and Space Science Library, vol. 397. Springer, New York, NY.
National Academy Press, 1992. Biographical Memoirs, V.61. ISBN 978-0-309-04746-3.
Nickiforov, M.G., Petrova, A.A., 2012. Heliacal rising of Sirius and flooding of the Nile. Bulgarian Astronomical Journal 18 (3), 53.
Smith, A., 1778. An Inquiry Into the Nature and Causes of the Wealth of Nations. W. Strahan and T. Cadell, London.
Thomas, K., 2003. Religion and the Decline of Magic: Studies in Popular Beliefs in Sixteenth and Seventeenth-Century England. Penguin History.
Thurmond, R., 2003. A history of star catalogues. http://rickthurmond.com/HistoryOfStarCatalogs.pdf.
Westman, R.S., 2011. The Copernican Question, Prognostication, Skepticism and Celestial Order. University of California Press, Berkeley.
WMO, 1973. One hundred years of international co-operation in meteorology (1873-1973): a historical review. https://library.wmo.int/opac/doc_num.php?explnum_id=4121. World Meteorological Organization.
WMO, 2012. Guidelines on Ensemble Prediction Systems and Forecasting. http://www.wmo.int/pages/prog/www/Documents/1091_en.pdf. Publication 1091. World Meteorological Organization.
Zhang, Y., Zhao, Y., 2015. Astronomy in the big data era. Data Science Journal 14 (11), 1–9. https://doi.org/10.5334/dsj-2015-011.