Leonid Kalinichenko · Yannis Manolopoulos
Oleg Malkov · Nikolay Skvortsov
Sergey Stupnikov · Vladimir Sukhomlin (Eds.)

Communications in Computer and Information Science 822

Data Analytics
and Management
in Data Intensive Domains
XIX International Conference, DAMDID/RCDL 2017
Moscow, Russia, October 10–13, 2017
Revised Selected Papers

Communications in Computer and Information Science 822
Commenced Publication in 2007
Founding and Former Series Editors:
Phoebe Chen, Alfredo Cuzzocrea, Xiaoyong Du, Orhun Kara, Ting Liu,
Dominik Ślęzak, and Xiaokang Yang

Editorial Board
Simone Diniz Junqueira Barbosa
Pontifical Catholic University of Rio de Janeiro (PUC-Rio),
Rio de Janeiro, Brazil
Joaquim Filipe
Polytechnic Institute of Setúbal, Setúbal, Portugal
Igor Kotenko
St. Petersburg Institute for Informatics and Automation of the Russian
Academy of Sciences, St. Petersburg, Russia
Krishna M. Sivalingam
Indian Institute of Technology Madras, Chennai, India
Takashi Washio
Osaka University, Osaka, Japan
Junsong Yuan
University at Buffalo, The State University of New York, Buffalo, USA
Lizhu Zhou
Tsinghua University, Beijing, China
More information about this series at http://www.springer.com/series/7899
Editors

Leonid Kalinichenko
Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow, Russia

Yannis Manolopoulos
Open University of Cyprus, Latsia, Cyprus

Oleg Malkov
Institute of Astronomy, Russian Academy of Sciences, Moscow, Russia

Nikolay Skvortsov
Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow, Russia

Sergey Stupnikov
Federal Research Center “Computer Science and Control”, Russian Academy of Sciences, Moscow, Russia

Vladimir Sukhomlin
Moscow State University, Moscow, Russia

ISSN 1865-0929 ISSN 1865-0937 (electronic)


Communications in Computer and Information Science
ISBN 978-3-319-96552-9 ISBN 978-3-319-96553-6 (eBook)
https://doi.org/10.1007/978-3-319-96553-6

Library of Congress Control Number: 2018948633

© Springer International Publishing AG, part of Springer Nature 2018


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now
known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors
give a warranty, express or implied, with respect to the material contained herein or for any errors or
omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

This CCIS volume published by Springer contains the proceedings of the XIX International
Conference Data Analytics and Management in Data-Intensive Domains (DAMDID/RCDL
2017), which took place during October 9–13, 2017, at the Department of Computational
Mathematics and Cybernetics of Lomonosov Moscow State University. The DAMDID series
of conferences was planned as a multidisciplinary forum of researchers and practitioners
from various domains of science and research, promoting cooperation and the exchange of
ideas in the area of data analysis and management in domains driven by data-intensive
research. Approaches to data analysis and management developed in specific data-intensive
domains (DID) of X-informatics (such as X = astro, bio, chemo, geo, med, neuro, physics,
chemistry, material science, etc.), in the social sciences, as well as in various branches of
informatics, industry, new technologies, finance, and business contribute to the conference
content.
Traditionally, DAMDID/RCDL proceedings are published locally before the conference as
a collection of the full texts of all regular and short papers accepted by the Program
Committee, as well as abstracts of posters and demos. Soon after the conference, the texts
of the regular papers presented at the conference are submitted for online publication in a
volume of the European repository CEUR Workshop Proceedings, and the volume content is
indexed in DBLP and Scopus. Since 2016, a DAMDID/RCDL volume of post-conference
proceedings, containing up to one third of the submitted papers previously published in the
CEUR Workshop Proceedings, has been published by Springer in the Communications in
Computer and Information Science (CCIS) series. Each paper selected for the CCIS
post-conference volume has to be modified as follows: the paper must have a new title; it
must be significantly extended (with at least 30% new content); and it must refer to its
original version in the CEUR Workshop Proceedings. CCIS is abstracted/indexed in DBLP,
Google Scholar, EI-Compendex, Mathematical Reviews, SCImago, and Scopus.
The program of DAMDID/RCDL 2017, like those of the previous editions of the conference,
reflects, alongside the traditional data management topics, a rapid move towards data
science and data-intensive analytics. This year the program included carefully selected
invited keynote talks related to rapidly developing DID. The respective plenary sessions
were also aimed at attracting the attention of researchers in the selected DID. A
pre-conference plenary session on October 9 included two talks: the keynote talk by
Stefano Ceri, Professor of Database Systems at the Dipartimento di Elettronica,
Informazione e Bioingegneria (DEIB) of the Politecnico di Milano, and the invited talk by
Zoltan Szallasi, MD, senior research scientist at the Children’s Hospital Informatics
Program, Harvard Medical School. The session was devoted to the development of methods
and techniques for genome analysis and diagnostics in various application domains (from
health care to criminalistics). Stefano Ceri considered implementation issues of
new-generation DNA sequencing techniques in the European project GeCo, which applies big
data technologies; Zoltan Szallasi gave an overview of approaches to genomic-based
diagnostics in various application domains. In the tutorial given by Zoltan Szallasi on
October 10, the application of genomic diagnostics in cancer immunotherapy was presented
in more detail. The problems of the data deluge in astronomy and approaches to their
solution were considered in the keynote talk by Giuseppe Longo (Professor of Astrophysics
at the University of Naples Federico II). On the basis of their talks, Zoltan Szallasi,
Stefano Ceri with co-authors, and Giuseppe Longo with co-authors provided invited full
papers for this CCIS volume.
The conference Program Committee reviewed 75 submissions for the conference
and eight submissions for the PhD workshop. For the workshop, five papers were
accepted and three were rejected. For the conference, 47 submissions were accepted as
full papers, 12 as short papers, two as posters, and two as demos, whereas 12 sub-
missions were rejected. According to the conference program, these 59 oral presen-
tations (of the full and short papers) are structured into 19 sessions including: Data
Analysis Projects in Astronomy; Semantic Web Techniques in DID; Special
Purpose DID Infrastructures (two sessions); Distributed Computing; System Efficiency
Evaluation; Data Analysis Projects in Neuroscience; Specific Data Analysis Tech-
niques; Ontological Models and Applications (two sessions); Heterogeneous Database
Integration; Text Analysis in Humanities (two sessions); Data Analysis Projects in
Various DID; Organization of Experiments in Data-Intensive Research; Digital Library
Projects; Knowledge Representation and Discovery; Approaches for Problem Solving
in DID; and Applications of Machine Learning. Although most of the presentations are
dedicated to the results of research conducted in organizations in the territory of the
Russian Federation, including Kazan, Moscow, Novosibirsk, Obninsk, Omsk, Orel,
Pereslavl-Zalessky, Saint Petersburg, Tomsk, Yaroslavl, and Zvenigorod, the DAMDID/RCDL
2017 conference also had an international dimension. This is witnessed by 12 talks (four
of them invited) prepared by notable foreign researchers from countries such as Armenia
(Yerevan), Bahrain (Manama), Belarus (Minsk), Bulgaria (Sofia), Germany (Düsseldorf,
Kiel), the UK (Harwell), Greece (Thessaloniki), Italy (Milan, Naples), and the USA
(Harvard).
For this proceedings volume, 19 papers were selected by the Program Committee (16
peer-reviewed and three invited papers); after careful editing they are included here,
structured into seven sections: Data Analytics (two papers); Next-Generation Genomic
Sequencing: Challenges and Solutions (two papers); Novel Approaches to Analyzing and
Classifying of Various Astronomical Entities and Events (six papers); Ontology Population
in Data-Intensive Domains (three papers); Heterogeneous Data Integration Issues (four
papers); Data Curation and Data Provenance Support (one paper); and Temporal Summaries
Generation (one paper). Of these, eight papers (more than one third of the papers
selected) were prepared by foreign researchers (from Bulgaria, Germany, Greece, Italy, the
UK, and the USA).
DAMDID/RCDL 2017 would not have been possible without the support of the
Russian Foundation for Basic Research, the Federal Agency of Scientific Organizations
of the Russian Federation and the Federal Research Center Computer Science and
Control of the Russian Academy of Sciences. Finally, we thank Springer for publishing
this proceedings volume, containing the invited and selected research papers, in their
CCIS series. The Program Committee of the conference appreciates the possibility to
use the Conference Management Toolkit (CMT) sponsored by Microsoft Research,
which provided great support during various phases of the paper submission and
reviewing process.

May 2018

Leonid Kalinichenko
Yannis Manolopoulos
Oleg Malkov
Nikolay Skvortsov
Sergey Stupnikov
Vladimir Sukhomlin
Organization

General Chair
Igor Sokolov Federal Research Center Computer Science and Control
of RAS, Russia

Program Committee Co-chairs


Leonid Kalinichenko Federal Research Center Computer Science and Control
of RAS, Russia
Yannis Manolopoulos Aristotle University of Thessaloniki, Greece

PhD Workshop Co-chairs


Sergey Stupnikov Federal Research Center Computer Science and Control
of RAS, Russia
Sergey Gerasimov Lomonosov Moscow State University, Russia

Organizing Committee Co-chairs


Vladimir Sukhomlin Lomonosov Moscow State University, Russia
Victor Zakharov Federal Research Center Computer Science and Control
of RAS, Russia

Organizing Committee
Elena Zubareva Lomonosov Moscow State University, Russia
Dmitry Briukhov Federal Research Center Computer Science and Control
of RAS, Russia
Nikolay Skvortsov Federal Research Center Computer Science and Control
of RAS, Russia
Dmitry Kovalev Federal Research Center Computer Science and Control
of RAS, Russia
Evgeny Morkovin Lomonosov Moscow State University, Russia
Irina Karzalova Federal Research Center Computer Science and Control
of RAS, Russia
Yulia Trusova Federal Research Center Computer Science and Control
of RAS, Russia
Evgeniy Ilyushin Lomonosov Moscow State University, Russia
Dmitry Gouriev Lomonosov Moscow State University, Russia
Vladimir Romanov Lomonosov Moscow State University, Russia

Supporters

Russian Foundation for Basic Research


Federal Agency of Scientific Organizations of the Russian Federation
Federal Research Center “Computer Science and Control” of the Russian Academy
of Sciences (FRC CSC RAS)
Moscow ACM SIGMOD Chapter

Coordinating Committee
Igor Sokolov (Co-chair) Federal Research Center Computer Science and Control
of RAS, Russia
Nikolay Kolchanov (Co-chair) Institute of Cytology and Genetics, SB RAS,
Novosibirsk, Russia
Leonid Kalinichenko (Deputy Chair) Federal Research Center Computer Science and Control
of RAS, Russia
Arkady Avramenko Pushchino Radio Astronomy Observatory, RAS, Russia
Pavel Braslavsky Ural Federal University, SKB Kontur, Russia
Vasily Bunakov Science and Technology Facilities Council, Harwell,
Oxfordshire, UK
Alexander Elizarov Kazan (Volga Region) Federal University, Russia
Alexander Fazliev Institute of Atmospheric Optics, RAS, Siberian Branch,
Russia
Alexei Klimentov Brookhaven National Laboratory, USA
Mikhail Kogalovsky Market Economy Institute, RAS, Russia
Vladimir Korenkov JINR, Dubna, Russia
Mikhail Kuzminski Institute of Organic Chemistry, RAS, Russia
Sergey Kuznetsov Institute for System Programming, RAS, Russia
Vladimir Litvine Evogh Inc., California, USA
Archil Maysuradze Moscow State University, Russia
Oleg Malkov Institute of Astronomy, RAS, Russia
Alexander Marchuk Institute of Informatics Systems, RAS, Siberian Branch,
Russia
Igor Nekrestjanov Verizon Corporation, USA
Boris Novikov St. Petersburg State University, Russia
Nikolay Podkolodny ICaG, SB RAS, Novosibirsk, Russia
Aleksey Pozanenko Space Research Institute, RAS, Russia
Vladimir Serebryakov Computing Center of RAS, Russia
Yury Smetanin Russian Foundation for Basic Research, Moscow
Vladimir Smirnov Yaroslavl State University, Russia
Sergey Stupnikov Federal Research Center Computer Science and Control
of RAS, Russia
Konstantin Vorontsov Moscow State University, Russia
Viacheslav Wolfengagen National Research Nuclear University MEPhI, Russia
Victor Zakharov Federal Research Center Computer Science and Control
of RAS, Russia

Program Committee
Karl Aberer EPFL, Lausanne, Switzerland
Plamen Angelov Lancaster University, UK
Alexander Afanasyev Institute for Information Transmission Problems, RAS,
Russia
Arkady Avramenko Pushchino Observatory, Russia
Ladjel Bellatreche LIAS/ISAE-ENSMA, Poitiers, France
Pavel Braslavski Ural Federal University, Yekaterinburg, Russia
Vasily Bunakov Science and Technology Facilities Council, Harwell, UK
Evgeny Burnaev Skoltech, Russia
George Chernishev St. Petersburg State University, Russia
Yuri Demchenko University of Amsterdam, The Netherlands
Boris Dobrov Research Computing Center of MSU, Russia
Alexander Elizarov Kazan Federal University, Russia
Alexander Fazliev Institute of Atmospheric Optics, SB RAS, Russia
Sergey Gerasimov Lomonosov Moscow State University, Russia
Vladimir Golenkov Belarusian State University of Informatics and
Radioelectronics, Belarus
Vladimir Golovko Brest State Technical University, Belarus
Olga Gorchinskaya FORS, Moscow, Russia
Evgeny Gordov Institute of Monitoring of Climatic and Ecological
Systems SB RAS, Russia
Valeriya Gribova Institute of Automation and Control Processes FEBRAS,
Far Eastern Federal University, Russia
Maxim Gubin Google Inc., USA
Natalia Guliakina Belarusian State University of Informatics and
Radioelectronics, Belarus
Ralf Hofestadt University of Bielefeld, Germany
Leonid Kalinichenko FRC CSC RAS, Moscow, Russia
George Karypis University of Minnesota, Minneapolis, USA
Nadezhda Kiselyova IMET RAS, Russia
Alexei Klimentov Brookhaven National Laboratory, USA
Mikhail Kogalovsky Market Economy Institute, RAS, Russia
Vladimir Korenkov Joint Institute for Nuclear Research, Dubna, Russia
Sergey Kuznetsov Institute for System Programming, RAS, Russia
Sergei O. Kuznetsov National Research University Higher School
of Economics, Russia
Dmitry Lande Institute for Information Recording, NASU, Russia
Giuseppe Longo University of Naples Federico II, Italy
Natalia Loukachevitch Moscow State University, Russia
Ivan Lukovic University of Novi Sad, Serbia
Oleg Malkov Institute of Astronomy, RAS, Russia
Yannis Manolopoulos School of Informatics of the Aristotle University
of Thessaloniki, Greece
Manuel Mazzara Innopolis University, Russia
Alexey Mitsyuk National Research University Higher School
of Economics, Russia
Xenia Naidenova S. M. Kirov Military Medical Academy, Russia
Dmitry Namiot Lomonosov Moscow State University, Russia
Igor Nekrestyanov Verizon Corporation, USA
Gennady Ososkov Joint Institute for Nuclear Research, Russia
Dmitry Paley Yaroslavl State University, Russia
Nikolay Podkolodny Institute of Cytology and Genetics SB RAS, Russia
Natalia Ponomareva Scientific Center of Neurology of RAMS, Russia
Alexey Pozanenko Space Research Institute, RAS, Russia
Andreas Rauber Vienna TU, Austria
Roman Samarev Bauman Moscow State Technical University, Russia
Timos Sellis RMIT, Australia
Vladimir Serebryakov Computing Centre of RAS, Russia
Nikolay Skvortsov FRC CSC RAS, Russia
Vladimir Smirnov Yaroslavl State University, Russia
Manfred Sneps-Sneppe AbavaNet, Russia
Valery Sokolov Yaroslavl State University, Russia
Sergey Stupnikov FRC CSC RAS, Russia
Alexander Sychev Voronezh State University, Russia
Dmitry Tsarkov Google, USA
Bernhard Thalheim University of Kiel, Germany
Dmitry Tsarkov Manchester University, UK
Alexey Ushakov University of California, Santa Barbara, USA
Natalia Vassilieva Hewlett-Packard, Russia
Pavel Velikhov Finstar Financial Group, Russia
Alexey Vovchenko FRC CSC RAS, Moscow, Russia
Peter Wittenburg MPI for Psycholinguistics, Germany
Vladimir Zadorozhny University of Pittsburgh, USA
Yury Zagorulko Institute of Informatics Systems, SB RAS, Russia
Victor Zakharov FRC CSC RAS, Russia
Sergey Znamensky Institute of Program Systems, RAS, Russia
Contents

Data Analytics

Deep Model Guided Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . 3
Yannic Ole Kropp and Bernhard Thalheim

Data Mining and Analytics for Exploring Bulgarian Diabetic Register . . . . . . 19
Svetla Boytcheva, Galia Angelova, Zhivko Angelov, and Dimitar Tcharaktchiev

Next Generation Genomic Sequencing: Challenges and Solutions

An Introduction to the Computational Challenges in Next Generation Sequencing . . . . 37
Zoltan Szallasi

Overview of GeCo: A Project for Exploring and Integrating Signals
from the Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Stefano Ceri, Anna Bernasconi, Arif Canakoglu, Andrea Gulino,
Abdulrahman Kaitoua, Marco Masseroli, Luca Nanni, and Pietro Pinoli

Novel Approaches to Analyzing and Classifying of Various Astronomical
Entities and Events

Data Deluge in Astrophysics: Photometric Redshifts as a Template Use Case . . . . 61
Massimo Brescia, Stefano Cavuoti, Valeria Amaro, Giuseppe Riccio,
Giuseppe Angora, Civita Vellucci, and Giuseppe Longo

Fractal Paradigm and IT-Technologies for Processing, Analyzing
and Classifying Large Flows of Astronomical Data . . . . . . . . . . . . . . . 73
Alexei V. Myshev and Andrei V. Dunin

Neural Gas Based Classification of Globular Clusters . . . . . . . . . . . . . 86
Giuseppe Angora, Massimo Brescia, Stefano Cavuoti, Giuseppe Riccio,
Maurizio Paolillo, and Thomas H. Puzia

Matching and Verification of Multiple Stellar Systems in the Identification
List of Binaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Nikolay A. Skvortsov, Leonid A. Kalinichenko, Alexey V. Karchevsky,
Dana A. Kovaleva, and Oleg Yu. Malkov

Aggregation of Knowledge on Star Cluster Structure and Kinematics
in Data Intensive Astronomy . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Sergei V. Vereshchagin and Ekaterina S. Postnikova

Search for Short Transient Gamma-Ray Events in SPI Experiment Onboard
INTEGRAL: The Algorithm and Results . . . . . . . . . . . . . . . . . . . . . . 128
Pavel Minaev and Alexei Pozanenko

Ontology Population in Data Intensive Domains

Development of Ontologies of Scientific Subject Domains Using Ontology
Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Yury Zagorulko, Olesya Borovikova, and Galina Zagorulko

PROPheT – Ontology Population and Semantic Enrichment from Linked
Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Marina Riga, Panagiotis Mitzias, Efstratios Kontopoulos, and Ioannis Kompatsiaris

Ontological Description of Applied Tasks and Related Meteorological
and Climate Data Collections . . . . . . . . . . . . . . . . . . . . . . . . . 169
Andrey Bart, Vladislava Churuksaeva, Alexander Fazliev, Evgeniy Gordov,
Igor Okladnikov, Alexey Privezentsev, and Alexander Titov

Heterogeneous Data Integration Issues

Integration of Data on Substance Properties Using Big Data Technologies
and Domain-Specific Ontologies . . . . . . . . . . . . . . . . . . . . . . . . 185
Adilbek Erkimbaev, Vladimir Zitserman, Georgii Kobzev, and Andrey Kosinov

Rule-Based Specification and Implementation of Multimodel Data Integration . . . 198
Sergey Stupnikov

Approach to Forecasting the Development of Situations Based
on Event Detection in Heterogeneous Data Streams . . . . . . . . . . . . . . . 213
Ark Andreev, Dmitry Berezkin, and Ilya Kozlov

Integrating DBMS and Parallel Data Mining Algorithms for Modern
Many-Core Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Timofey Rechkalov and Mikhail Zymbler

Data Curation and Data Provenance Support

Data Curation Policies and Data Provenance in EUDAT Collaborative
Data Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
Vasily Bunakov, Alexander Atamas, Alexia de Casanove, Pascal Dugénie,
Rene van Horik, Simon Lambert, Javier Quinteros, and Linda Reijnhoudt

Temporal Summaries Generation

News Timeline Generation: Accounting for Structural Aspects
and Temporal Nature of News Stream . . . . . . . . . . . . . . . . . . . . . . 267
Mikhail Tikhomirov and Boris Dobrov

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281


Data Analytics
Deep Model Guided Data Analysis

Yannic Ole Kropp and Bernhard Thalheim

Department of Computer Science,
Christian Albrechts University Kiel, 24098 Kiel, Germany
{yk,thalheim}@is.informatik.uni-kiel.de

Abstract. Data mining is currently a well-established technique and is supported
by many algorithms. It depends on the data at hand, on properties of the
algorithms, on the technology developed so far, and on the expectations and
limits to be applied. It must thus be mature, predictable, optimisable, evolving,
adaptable and well-founded, similar to mathematics and SPICE/CMM-based
software engineering. Data mining must therefore be systematic if the results
are to fit their purpose. One basis of this systematic approach is model
management and model reasoning. We claim that systematic data mining is
nothing else than systematic modelling. The main notion is the notion of the
model in a variety of forms, abstractions and associations among models.

Keywords: Data mining · Modelling · Models · Framework · Deep model ·
Normal model · Modelling matrix

1 Introduction

Data mining and analysis is nowadays well understood from the algorithmic side:
thousands of algorithms have been proposed. The number of success stories is
overwhelming and has caused the big data hype. At the same time, brute-force
application of algorithms is still the standard. Nowadays data analysis and data mining
algorithms are taken for granted. They transform data sets and hypotheses into
conclusions. For instance, clustering algorithms check, for a given data set and a
portfolio of clustering requirements, whether this portfolio can be supported, and in the
positive case provide a set of clusters as output. The Hopkins index is one of the
criteria that allow one to judge whether clusters exist within a data set or not. A
systematic approach to data mining has already been proposed in [3, 16]. It is based on
mathematics and mathematical statistics and is thus also able to handle errors, biases
and the configuration of data mining. Our experience in large data mining projects in
archaeology, ecology, climate research, medical research, etc. has, however, shown that
the ad hoc and brute-force mining just described is the main approach there. The results
are taken for granted and believed despite the modelling, understanding, workflow and
data handling pitfalls. So the results often become dubious due to these misconceptions
and pitfalls.
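As a concrete illustration of the Hopkins index mentioned above, the following sketch
shows one common way to estimate it for a numeric data matrix. It is a minimal sketch,
not part of the cited systematic approach; the function name, parameters and the
sampling strategy are our assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_size=None, random_state=0):
    """Estimate the Hopkins statistic of a numeric data matrix X (sketch).

    Values close to 1 suggest a clustering tendency, values around 0.5
    indicate data that are essentially uniformly (randomly) distributed.
    """
    rng = np.random.default_rng(random_state)
    n, d = X.shape
    m = sample_size or min(50, n // 10 or 1)

    # Nearest-neighbour structure over the real data points.
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distances from uniformly sampled points to their nearest real point.
    lows, highs = X.min(axis=0), X.max(axis=0)
    uniform = rng.uniform(lows, highs, size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0].ravel()

    # w: distances from sampled real points to their nearest other real point.
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())
```

Such a check can precede the brute-force application of a clustering algorithm and is
one small example of making an implicit assumption (clusterability) explicit.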
Data are the main source of information in data mining and analysis. Their quality
properties have been neglected for a long time. At the same time, modern data
management allows these problems to be handled. In [15] we compare the critical findings or
pitfalls of [20] with resolution techniques that can be applied to overcome the crucial
pitfalls of data mining in environmental sciences reported there. The algorithms that are
typically used for solving data mining and analysis tasks are themselves another source
of pitfalls. It is often neglected that an algorithm also has an application area,
application restrictions, data requirements, and results of a certain granularity and
precision. These problems must be tackled systematically if we want to rely on the
results of mining and analysis; otherwise the analysis may become misleading, biased, or
impossible. Therefore, we explicitly treat properties of mining and analysis. A similar
observation can be made for data handling.
Data mining is typically based not on one model but rather on a model ensemble
or model suite. The associations among the models in a model suite must be explicitly
specified; model suites give these associations an explicit form. Reasoning
techniques combine methods from logics (deductive, inductive, abductive, counter-
inductive, etc.), from artificial intelligence (hypothetic, qualitative, concept-based,
adductive, etc.), computational methods (algorithmics [6], topology, geometry, reduc-
tion, etc.), and cognition (problem representation and solving, causal reasoning, etc.).
In this paper, we use a model-based approach towards data mining and data analysis.
The models function as mediators in investigation, reasoning, communication and
understanding processes. In this context, each model consists of a ‘normal model’ and a
‘deep model’. Normal models represent the parts of the models, which their users are
aware of, while deep models compound of the users implicit influences, assumptions,
prejudices and expert knowledge given by their culture, (academic) domain or personal
beliefs. Normally only normal models are in focus of presentations and discussions.
They are therefore mostly well-understood and elaborated. Deep models in contrast, are
addressed rarely. They form an implicit (typically discipline specific) consensus for the
ways of handling and interpreting the normal models. In effect, it can be hard for
researchers to interpret normal models of foreign domains in the way the models creators
expected. In addition, certain reasons for (design) decisions might be unreproducible for
people outside the domain. We therefore propose an approach for data mining and
analysis, which explicitly includes the handling of deep models in its design.
First, we introduce our notion of the model. Next we show how data mining can be
designed. We apply this investigation to systematic modelling and later to systematic
data mining. It is our goal to develop a holistic and systematic framework for data
mining and analysis. Many issues are left out of the scope of this paper such as a
literature review, a formal introduction of the approach, and a detailed discussion of
data mining application cases.
Remark. A previous version of this paper was presented at the 19th ‘Data
Analytics and Management in Data Intensive Domains’ conference (DAMDID) in
2017. Under the title ‘Data Mining Design and Systematic Modelling’, that version has
also already been published in the corresponding CEUR workshop proceedings [34].
In contrast to the previous version, this version has been revised: Section 5 was
added to strengthen the practical relevance of the approach, and other parts were
shortened to keep the paper at a comfortable length. As a result, the focus of
this version lies somewhat more on deep models.

2 Models and Modelling

Models are principal instruments in mathematics, data analysis, modern computer


engineering (CE), teaching any kind of computer technology, and also modern com-
puter science (CS). They are built, applied, revised and manufactured in many CE&CS
sub-disciplines in a large variety of application cases with different purposes and
context for different communities of practice. It is now well understood that models are
something different from theories. They are often intuitive, visualisable, and ideally
capture the essence of an understanding within some community of practice and some
context. At the same time, they are limited in scope, context and the applicability.
There is however a general notion of a model and of a conception of the model:
A model is a well-formed, adequate, and dependable instrument that represents origins
[9, 28, 29].

Its criteria of well-formedness, adequacy, and dependability must be commonly


accepted by its community of practice within some context and correspond to the
functions that a model fulfills in utilization scenarios.

2.1 Generic and Specific Models


The general notion of a model covers all aspects of adequateness, dependability, well-
formedness, scenario, functions and purposes, backgrounds (grounding and basis), and
outer directives (context and community of practice). The models themselves function
as instruments or tools. Typically, instruments come in a variety of forms and fulfill
many different functions. Instruments are partially independent or autonomous of the
thing they operate on. Models are however special instruments. They are used with a
specific intention within a utilization scenario. The quality of a model becomes
apparent in the context of this scenario.1
It might thus be better to start with generic models. A generic model [4, 25, 30, 31]
is a model which broadly satisfies the purpose and broadly functions in the given
utilization scenario. It is later tailored to suit the particular purpose and function. It
generally represents origins of interest, provides means to establish adequacy and
dependability of the model, and establishes focus and scope of the model. Generic
models should satisfy at least five properties: (i) they must be accurate; (ii) their
quality allows them to be used consciously; (iii) they should be descriptive, not
evaluative; (iv) they should be flexible so that they can be modified from time
to time; (v) they can be used as a first ‘best guess’.
The refinement of a (generic) model to a particular or special model provides
mechanisms for model transformation along the adequacy, the justification and the
sufficiency of a model. Refinement is based on specialization for better suitability of a
model, on removal of unessential elements, on combination of models to provide a
more holistic view, on integration that is based on binding of model components to

1 The quality can, for example, be characterised through quality characteristics [27] such as correctness,
generality, usefulness, comprehensibility, parsimony, robustness, novelty, etc.
other components, and on enhancement that typically improves a model to become
more adequate or dependable.

2.2 Model Suites


Most disciplines integrate a variety of models or a society of models, e.g. [7, 13].
Models used in CE&CS are mainly at the same level of abstraction. It has been well
known for threescore years that they form a model ensemble (e.g. [10, 22]) or horizontal
model suite (e.g. [8, 26]). Developed models vary in their scope, in the aspects and
facets they represent, and in their abstraction.
A model suite consists of a set of models {M1, …, Mn}, of an association or
collaboration schema among the models, of controllers that maintain consistency or
coherence of the model suite, of application schemata for explicit maintenance and
evolution of the model suite, and of tracers for the establishment of the coherence.
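To make this notion tangible, the following minimal sketch (ours, not taken from the
cited work) represents a model suite as a small data structure; the field and method
names are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Model:
    name: str
    content: object   # the normal model (e.g. a fitted estimator, a schema, ...)
    deep_model: dict  # grounding, basis and context shared with other models

@dataclass
class ModelSuite:
    models: Dict[str, Model]
    # Association/collaboration schema: which model feeds which, via which mapping.
    associations: List[Tuple[str, str, Callable]] = field(default_factory=list)
    # Controllers that maintain consistency or coherence of the suite.
    controllers: List[Callable[["ModelSuite"], bool]] = field(default_factory=list)

    def coherent(self) -> bool:
        """The suite is coherent only if every controller check passes."""
        return all(check(self) for check in self.controllers)

    def propagate(self) -> None:
        """Push changes along the association schema (source -> target)."""
        for source, target, mapping in self.associations:
            self.models[target].content = mapping(self.models[source].content)
```

The explicit association schema and controllers are exactly what is often missing when
models are only loosely coupled.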
Multi-modelling [11, 18, 23] became a culture in CE&CS. Maintenance of
coherence, co-evolution, and consistency among models has become a bottleneck in
development. Moreover, different languages with different capabilities have become an
obstacle similar to multi-language retrieval [19] and impedance mismatches. Models
are often loosely coupled. Their dependence and relationship is often not explicitly
expressed. This problem becomes more complex if models are used for different
purposes such as construction of systems, verification, optimization, explanation, and
documentation.

2.3 Deep Models and the Modelling Matrix


Model development is typically based on an explicit and rather quick description of the
‘surface’ or normal model and on the mostly unconditional acceptance of a deep
model. The latter one directs the modelling process and the surface or normal model.
Modelling itself is often understood as development and design of the normal model.
The deep model is taken for granted and accepted for a number of normal models.
The deep model can be understood as the common basis for a number of models. It
consists of the grounding for modelling (paradigms, postulates, restrictions, theories,
culture, foundations, conventions, authorities), the outer directives (context and com-
munity of practice), and the basis (assumptions, general concept space, practices,
language as carrier, thought community and thought style, methodology, pattern,
routines, commonsense) of modelling. It uses a collection of undisputable elements of
the background as grounding and additionally a disputable and adjustable basis which
is commonly accepted in the given context by the community of practice. Education on
modelling starts, for instance, directly with the deep model. In this case, the deep model
has to be accepted and is thus hidden and latent.
A (modelling) matrix is something within or from which something else originates,
develops, or takes form. The matrix is assumed to be correct for normal models. It
consists of the deep model and the modelling scenarios. The modelling agenda is
derived from the modelling scenario and the utilization scenarios. It is something like a
guideline for modelling activities and for model associations within a model suite. It
improves the quality of model outcomes by spending some effort to decide what and
how much reasoning to do as opposed to what activities to do. It balances resources
between the data-level actions and the reasoning actions.
The modelling scenario and the deep model serve as part of the definitional frame
within a model development process. They also define the capacity and potential of a
model whenever it is utilized.
Deep models and the modelling matrix also define some frame for adequacy and
dependability. This frame is enhanced for specific normal models. It is then used to
state in which cases a normal model represents the origins under consideration.

3 Data Mining Design

3.1 Conceptualization of Data Mining and Analysis


The data mining and analysis task must be enhanced by an explicit treatment of the
languages used for concepts and hypotheses, and by an explicit description of
knowledge that can be used. The algorithmic solution of the task is based on knowl-
edge on algorithms that are used and on data that are available and that are required for
the application of the algorithms. Typically, analysis algorithms are iterative and can
run forever. We are interested only in convergent ones and thus need termination
criteria. Therefore, conceptualization of the data mining and analysis task consists of a
detailed description of six main parameters (e.g. for inductive learning [32]):
a. The data analysis algorithm: Algorithm development is the main activity in data
mining research. Each of these algorithms transforms data and some algorithm-specific
parameters into a result.
b. The concept space: the concept space defines the concepts under consideration for
analysis, based on a certain language and a common understanding.
c. The data space: The data space typically consists of a multi-layered data set of
different granularity. Data sets may be enhanced by metadata that characterize the
data sets and associate the data sets to other data sets.
d. The hypotheses space: An algorithm is supposed to map evidence on the concepts
to be supported or rejected into hypotheses about it.
e. The prior knowledge space: Specifying the hypothesis space already provides some
prior knowledge. In particular, the analysis task starts with the assumption that the
target concept is representable in a certain way.
f. The acceptability and success criteria: Criteria for successful analysis allow to
derive termination criteria for the data analysis.
Each instantiation and refinement of the six parameters leads to specific data mining
tasks.
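A conceptualised task can be written down explicitly along the six parameters above,
for instance as a small configuration object. The following sketch is illustrative only;
all field names and example values are our assumptions, not a fixed API.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MiningTask:
    algorithm: Callable                 # (a) the data analysis algorithm
    concept_space: Sequence[str]        # (b) concepts under consideration
    data_space: dict                    # (c) data sets plus their metadata
    hypothesis_space: Sequence[str]     # (d) hypotheses the evidence is mapped to
    prior_knowledge: dict               # (e) assumptions, e.g. representability
    success_criteria: Callable[[object], bool]   # (f) acceptance / termination test

# Example instantiation for a clustering task (illustrative values only).
task = MiningTask(
    algorithm=lambda data, k: None,                  # placeholder clustering routine
    concept_space=["settlement type", "site density"],
    data_space={"sites": "sites.csv", "metadata": {"crs": "EPSG:4326"}},
    hypothesis_space=["k compact clusters exist for some k <= 10"],
    prior_knowledge={"distance": "euclidean", "clusterable": "Hopkins > 0.7"},
    success_criteria=lambda result: result is not None,
)
```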
The result of data mining and data analysis is described within the knowledge
space. The data mining and analysis task may thus be considered to be a transformation
of data sets, concept sets and hypothesis sets into chunks of knowledge through the
application of algorithms.

Problem solving and modelling, however, typically considers six aspects [15]:
1. Application, problems, and users: The domain consists of a model of the applica-
tion, a specification of problems under consideration, of tasks that are issued, and of
profiles of users.
2. Context: The context of a problem is anything what could support the problem
solution, e.g. the sciences’ background, theories, knowledge, foundations, and
concepts to be used for problem specification, problem background, and solutions.
3. Technology: Technology is the enabler and defines the methodology. It provides
[22] means for the flow of problem solving steps, the flow of activities, the dis-
tribution, the collaboration, and the exchange.
4. Techniques and methods: Techniques and methods can be given as algorithms.
Specific algorithms are data improvers and cleaners, data aggregators, data inte-
grators, controllers, checkers, acceptance determiners, and termination algorithms.
5. Data: Data have their own structuring, their quality and their life span. They are
typically enhanced by metadata. Data management is a central element of most
problem solving processes.
6. Solutions: The solutions to problem solving can be formally given, illustrated by
visual means, and presented by models. Models are typically only normal models.
The deep model and the matrix are already provided by the context and accepted by
the community of practice, depending on the needs of this community for the
given application scenario. Therefore, models may be the final result of a data
mining and analysis process, besides other means.
Comparing these six spaces with the six parameters, we discover that only four of the
spaces are considered so far in data mining. The user and application space as well as
the representation space are missing.

3.2 Meta-models of Data Mining


An abstraction layer approach separates the application domain, the model domain and
the data domain [16]. This separation is illustrated in Fig. 1.
The data mining design framework uses the inverse modelling approach. It starts
with the consideration of the application domain and develops models as mediators
between the data and the application domain worlds. In the sequel we combine the
three approaches of this section. The meta-model corresponds to other meta-models
such as inductive modelling or hypothetical reasoning (hypotheses development,
experimenting and testing, analysis of results, interim conclusions, reappraisal against
the real world).

Fig. 1. The V meta-model of model-based Data Mining Design

4 Data Mining: A Systematic Model-Based Approach

The approach presented so far allows us to revise and reformulate the model-oriented
data mining process on the basis of well-defined engineering [14, 24] or, alternatively,
of systematic mathematical problem solving [21]. Figure 2 displays this revision. We
realize that the first two phases are typically implicitly assumed and not considered. We
concentrate on the non-iterative form; iterative processes can be handled in a similar
way.

Fig. 2. The phases in Data Mining Design (Non-iterative form)



4.1 Setting the Deep Model and the Matrix


The problem to be tackled must be clearly stated in dependence on the utilization
scenario, the tasks to be solved, the community of practice involved, and the given
context. The result of this step is the deep model and its matrix. The former is based
on the background, on specific context parameters such as infrastructure and
environment, and on candidates for deep models.
The data mining tasks can now be formulated based on the matrix and the deep
model. We set up the context, the environment, and the general goal of the problem, and
also criteria for the adequateness and dependability of the solution, e.g. invariance
properties for the problem description and for the task setting and its mathematical
formulation, and faithfulness properties of the solution for its later application in the
given environment. What exactly is the problem, the expected benefit? What should a
solution look like? What is known about the application?
Deep models already use a background consisting of an undisputable grounding
and a selectable basis. The explicit statement of the background provides an under-
standing of the postulates, paradigms, assumptions, conceptions, practices, etc. Without
the background, the results of the analysis cannot be properly understood. Models have
their profile, i.e. goals, purposes and functions. These must be explicitly given. The
parameters of a generic model can be either order or slave parameters [12], either
primary or secondary or tertiary (also called genotypes or phenotypes or observables)
[1, 5], and either ruling (or order) or driven parameters [12]. Data mining can be
enhanced by knowledge management techniques.
Additionally, the concept space into which the data mining task is embedded must
be specified. This concept space is enhanced during data analysis.

4.2 Stereotyping the Process


The general flow of data mining activities is typically implicitly assumed on the basis
of stereotypes which form a set of tasks, e.g. proof tasks in some system,
transformation tasks, description tasks, and investigation tasks. Proofs can follow the
classical deductive or inductive setting. Also, abductive, adductive, hypothetical and
other reasoning techniques are applicable. Stereotypes typically use model suites as a
collection of associated models, are already biased by priming and orientation, follow
policies, data mining design constraints, and framing.
Data mining and analysis is rather stereotyped. For instance, mathematical culture
has already developed a good number of stereotypes for problem formulation. It is
based on a mathematical language for the formulation of analysis tasks, on selection
and instantiation of the best fitting variable space and the space of opportunities pro-
vided by mathematics.
Data mining uses generic models which are the basis of normal models. Models are
based on a separation of concern according to the problem setting: dependence-indicating,
dependence-describing, separation or partition spaces, pattern kinds, reasoning kinds,
etc. This separation of concern governs the classical data mining algorithmic classes:
association analysis, cluster analysis, data grouping with or without classification,
classifiers and rules, dependences among parameters and data subsets, predictor
analysis, synergetics, blind or informed or heuristic investigation of the search space,
and pattern learning.

4.3 Initialization of the Normal Data Models


Data mining algorithms have their capacity and potential [2]. Potential and capacity
can be based on SWOT (strengths, weaknesses, opportunities, and threats), SCOPE
(situation, core competencies, obstacles, prospects, expectation), and SMART (how
simple, meaningful, adequate, realistic, and trackable) analysis of methods and algo-
rithms. Each of the algorithm classes has its strengths and weaknesses, its satisfaction
of the tasks and the purpose, and its limits of applicability. Algorithm selection also
includes an explicit specification of the order of application of these algorithms and of
mapping parameters that are derived by means of one algorithm to those that are an
input for the others, i.e. an explicit association within the model suite. Additionally,
evaluation algorithms for the success criteria are selected. Algorithms have their own
obstinacy, their hypotheses and assumptions that must be taken into consideration.
Whether an algorithm can be considered depends on acceptance criteria derived in the
previous two steps.
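Such capacity and acceptance considerations can be made machine-checkable. The
following sketch filters candidate algorithms by declared data requirements and the
required result kind; the attribute names and the three example entries are assumptions
made for illustration only.

```python
# Candidate algorithms with an explicit capacity/potential profile (assumed attributes).
candidates = [
    {"name": "k-means", "needs": {"numeric", "no_missing"}, "yields": "partition"},
    {"name": "DBSCAN",  "needs": {"numeric", "distance"},   "yields": "density_clusters"},
    {"name": "apriori", "needs": {"transactions"},          "yields": "association_rules"},
]

def select_algorithms(data_properties: set, required_result: str, acceptance=None):
    """Keep only algorithms whose requirements the data satisfies and whose
    result kind matches the task; apply extra acceptance criteria if given."""
    selected = [
        a for a in candidates
        if a["needs"] <= data_properties and a["yields"] == required_result
    ]
    if acceptance:
        selected = [a for a in selected if acceptance(a)]
    return selected

# Example: numeric data without missing values, a partition is required.
print(select_algorithms({"numeric", "no_missing"}, "partition"))
```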
So, we ask: What kind of model suite architecture suits the problem best? What are
applicable development approaches for modelling? What is the best modelling tech-
nique to get the right model suite? What kind of reasoning is supported? What not?
What are the limitations? Which pitfalls should be avoided?
The result of the entire data mining process heavily depends on the appropriateness
of the data sets, their properties and quality, and more generally the data schemata with
essentially three components: application data schema with detailed description of data
types, metadata schema [17], and generated and auxiliary data schemata. The first
component is well investigated in data mining and data management monographs. The
second and third components inherit research results from database management, from
data marts or warehouses, and from the layering of data. An essential element is the
explicit specification of the quality of data. It allows one to derive algorithms for data
improvement and to derive limitations on the applicability of algorithms. Auxiliary data
support the performance of the algorithms.
Therefore, typical data-oriented questions are: What data do we have available? Are
the data relevant to the problem? Are they valid? Do they reflect our expectations? Are
data quality, quantity, and recency sufficient? Which data should we concentrate on?
How are the data transformed for modelling? How may we increase the quality of the data?
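Several of these data-oriented questions can be supported by an explicit, testable
quality specification from which applicability limits are derived. The following is a
minimal sketch, assuming tabular data in a pandas DataFrame and hypothetical requirement
names; it is not part of the framework itself.

```python
import pandas as pd

def data_quality_report(df: pd.DataFrame) -> dict:
    """Derive a few simple quality indicators that constrain algorithm choice."""
    return {
        "rows": len(df),
        "missing_ratio": float(df.isna().mean().mean()),
        "numeric_columns": list(df.select_dtypes("number").columns),
        "duplicated_rows": int(df.duplicated().sum()),
    }

def applicable(report: dict, algorithm_needs: dict) -> bool:
    """Check an algorithm's declared data requirements against the report."""
    return (report["rows"] >= algorithm_needs.get("min_rows", 0)
            and report["missing_ratio"] <= algorithm_needs.get("max_missing", 1.0))

# Example: a clustering routine that tolerates at most 5% missing values.
df = pd.DataFrame({"x": [1.0, 2.0, None], "y": [0.5, 0.7, 0.9]})
report = data_quality_report(df)
print(applicable(report, {"min_rows": 3, "max_missing": 0.05}))  # False: too many gaps
```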

4.4 The Data Mining Process Itself


The data mining process can be understood as a coherent and stepwise refinement of
the given model suite. The model refinement may use an explicit transformation or an
extract-transform-load process among models within the model suite. Evaluation and
termination algorithms are an essential element of any data mining algorithm. They can
be based on quality criteria for the finalized models in the model suite, e.g. generality,
error-proneness, stability, selection-proneness, validation, understandability,
repeatability, usability, usefulness, and novelty.
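A minimal sketch of such a stepwise refinement with explicit evaluation and termination
criteria is given below; the function signature and parameter names are our assumptions,
not a prescribed interface.

```python
def refine_model_suite(suite, refinement_steps, evaluate, good_enough, max_rounds=20):
    """Coherent stepwise refinement of a model suite (sketch).

    refinement_steps: functions suite -> suite (transformations / ETL between models)
    evaluate:         suite -> quality score (e.g. error, stability, generality)
    good_enough:      quality score -> bool, the termination criterion
    """
    for round_no in range(max_rounds):
        for step in refinement_steps:
            suite = step(suite)
        quality = evaluate(suite)
        if good_enough(quality):
            return suite, quality, round_no + 1
    # Termination by resource limit rather than by quality.
    return suite, evaluate(suite), max_rounds
```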
Typical questions to answer within this process are: How good is the model suite in
terms of the task setting? What have we really learned about the application domain?
What is the real adequacy and dependability of the models in the model suite? How
can these models best be deployed? How do we know that the models in the model suite
are still valid? Which data support which model in the model suite? Which kinds of data
errors are inherited by which part of which model?
The final result of the data mining process is then a combination of the deep model
and the normal model, where the former is a latent or hidden component in most
cases. If, however, we want to reason about the results, then the deep model must be
understood as well. Otherwise, the results may be surprising and may not be
convincing.

5 A Typical Application Case

The presented approach demands an explicit consideration of deep models, matrices


and typical (generic) model suites when preparing data mining. It also includes the
common steps of data and algorithm selection, initialization of the process and the data
mining process itself. It can therefore be used as a guideline or generic process for good
practice in executing data mining.
At first, the additional workload caused by this explicit consideration and the almost
bureaucratic, stepwise procedure might seem excessive for simple and non-recurring data
mining tasks. However, this approach brings huge benefits in research where recurring
data mining is performed, and especially in (collaborative) projects where multiple
disciplines are involved. Multiple deep models, matrices and typical (generic) model
suites are likely to exist in interdisciplinary projects with multiple (scientific)
domains. Different requirements on data, algorithms and results are also common.
Additionally, these deep models, matrices and other latent factors are normally
implicitly taken for granted by the domain experts, but are not known to the experts of
other domains. The experts of the other domains instead take the implicit agreements of
their own domains for granted, which easily results in communication problems.
Besides communication, the exchange of data among project members from different
(academic) disciplines might cause problems too. Data acquisition is also affected
by implicit assumptions. This concerns, for example, the granularity and quality of data.
Measuring methods might not be adequate for all purposes of all domains. A hypothetical
example is the difference in standard procedures, such as the computation of the average
of a set of numeric values. If the corresponding data are just labelled ‘average’, their
integration into data mining processes could negatively affect the integrity of the
analysis, as the data set might not contain the expected values. There can be huge
differences between mode, median and arithmetic mean. Within a discipline the usual
method is known, but other disciplines’ usual methods can differ. Applying our approach
and explicitly considering the domain-specific characteristics can help to reduce such
obstacles in communication.
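The following small illustration shows how strongly these three notions of ‘average’ can
diverge on the very same values, which is exactly the kind of implicit disagreement meant
above; the numbers are invented for the example.

```python
import statistics

values = [1, 1, 1, 2, 3, 10, 40]

print(statistics.mean(values))    # arithmetic mean: about 8.29
print(statistics.median(values))  # median: 2
print(statistics.mode(values))    # mode: 1
```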

5.1 Collaborative Research Cluster 1266


Let us consider an application case. The CRC 1266 investigates processes of
transformation from 15,000 BCE to 1 BCE. As we are currently involved in this
multi-domain, interdisciplinary collaborative research project, we notice many of these
implicit assumptions. They may ease the communication within a single domain, as they
represent the best practices and ‘normal’ procedures of this domain, but they cause
problems in cross-disciplinary collaborations. Still, it is utopian to try to force the
researchers to get rid of their implicit assumptions: first, it is unlikely that this
goal can be achieved, as many assumptions are latent and only subliminal; second, they
are directly connected to the academic fields and working environments of the
researchers; and third, there is mostly no perception of an urgent need to do so. It is
therefore crucial to explicitly identify those deep models, matrices and typical
(generic) model suites, and to enable unproblematic collaboration by making them public
to all project partners. When those latent structures become known, it is possible to
select adequate data for cross-disciplinary analyses. The same applies to data which can
be adapted to become adequate. In other words, awareness of one’s own latent assumptions
and their comparison with the latent assumptions of other project members eases
collaboration.
The deep models, matrices and (generic) model suites are (fortunately) quite stable.
In contrast to the dynamics of research itself, these latent structures change rarely.
Therefore, it is possible to reuse the collected information about them. The results of
the first steps of our approach can be stored and used for reruns or similar data mining
activities. Especially in research it is common to rerun certain procedures at different
points in time due to new or revised data, or with slightly different parameters due to
new knowledge. It is also common for researchers to investigate the same subject in
varying contexts; typical examples of such varying contexts in archaeology are different
regions or different time frames. In a project like the CRC 1266 it is therefore
reasonable to store, maintain and manage the results of the 'DM Initialisation' phase,
presented in Sect. 4.3, for later reuse. These results combine the information about the
disciplinary perspectives (deep models etc.), the connections between those perspectives
and the requirements for specific research intentions. In particular, such a result
contains information about which data is adequate for research in a specific context and
background. In other words, it summarizes, for a specific research (or data mining)
intention, which data within the project can be used and, if necessary, how it has to be
preprocessed beforehand. Storing this information saves much time for later and similar
research activities. It also constitutes valuable metadata concerning the origins and
provenance of the data used in a data mining process. In general this eases the usage of
the information and therefore encourages its consideration. We suspect that this has a
positive impact on the overall quality of interdisciplinary research.
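As an illustration, such stored results could be captured in a simple record structure.
The following Python sketch is purely hypothetical; the class and field names are our
assumptions and not part of the CRC 1266 implementation:

```python
# Hypothetical record structure for storing the results of the
# 'DM Initialisation' phase for later reuse; all names are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DisciplinaryPerspective:
    discipline: str                  # e.g. "archaeology", "palaeoecology"
    deep_model: str                  # reference to the documented deep model
    matrix: str                      # reference to the documented matrix
    model_suite: List[str] = field(default_factory=list)

@dataclass
class InitialisationResult:
    research_intention: str                        # the data mining intention
    perspectives: List[DisciplinaryPerspective]    # disciplines involved
    adequate_data_sources: List[str]               # data judged adequate
    preprocessing_notes: List[str]                 # required adaptations
```

Such records can be versioned and queried whenever a similar research intention arises
later, e.g. for a different region or time frame.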


In the CRC 1266 we are currently developing and deploying a method that supports the
approach of this paper by implementing such a storage facility. The next subsection
briefly describes its main idea, which is based on so-called 'viewpoints'.

5.2 Excursus: Viewpoints in CRC 1266


To manage the different perspectives within a project, multiple viewpoints are created.
A viewpoint represents a specific perspective on the available data in the project. It is
defined by the domains and the research intentions of its users. It basically models the
results of the first three phases of the systematic approach presented in Sect. 4, i.e.
the phases before the actual data mining process. The central questions addressed by a
viewpoint include: Is there relevant data for my research that was created by other
project members? And how does it have to be preprocessed to fit my perspective?
When the data pool of the project is accessed in the context of a viewpoint, only the
data relevant for this viewpoint is shown. The data is also structured according to the
viewpoint's specifications. On a technical level this is implemented by storing all
project data inside an abstract universal central database system (its structure is not
the topic of this paper, but is roughly based on the schema presented in [33]) and by
generating (database) views on top of it. Each viewpoint is represented by a set of
views, which are dynamically constructed according to its specifications. The members
of the project access the database and its data via those views. Interactions like
insert, update and delete are also performed via these views. As our community of
practice mostly does not interact with databases via raw SQL on a daily basis, a web
portal based on the Django web framework (https://www.djangoproject.com/) will offer an
interface for these services. The specifications for creating a viewpoint and its views
are derived from the results of the 'DM Initialisation' phase of Sect. 4.3 (which builds
on the results of the previous phases). In detail, the specification of a viewpoint
provides information about:
1. the label - The most basic piece of information provided is a name (or label) that
makes it possible to identify and address a viewpoint and its views.
2. the description - Another rather basic piece of information is a short description of
the viewpoint. It is not necessary for the generation of the viewpoint, but helps users
to put the viewpoint into its proper context, especially when the label is not
expressive. Typical information contained in the description would be the recommended
users, the domain, the focus of research and the research subject.
3. the projection (of structures) - A far more important piece of information is the set
of structural details about the kinds of data objects that belong to the viewpoint. This
includes the (relational) types and the properties/attributes of these types, as well as
the possible connections/relations between types (and their properties). In short, this
captures which views belong to the viewpoint and how they are composed. For example, the
experts for ancient DNA (aDNA) collect huge amounts of DNA sequence data, but these are
hard for non-physicians to interpret. In a viewpoint for archaeologists, this raw
sequence data might be needless and therefore not integrated. The other way round, the
exact stratigraphy of excavation sites might be needless in the research of aDNA experts.
Another example could be scientists who investigate distribution patterns of pottery.
They do not need all available information about the found pottery (like weight,
dimensions, description of ornaments etc.), but just the coordinates of its excavation
and its classification.
This description of the view structures may also include viewpoint-specific labels for
types, properties etc. Such a specification might be necessary as different academic
disciplines and domains may use the same labels for different objects/subjects, or
different labels for the same objects/subjects.
4. the selection (of instances) - Another quite important piece of information concerns
the data in the views. In contrast to the previous point, this one deals with instances.
Even if a type is adequate for a viewpoint because it fits the research focus, this does
not necessarily imply that all data tuples of this type are adequate as well. Further
filtering of the available data might be necessary. Such requirements are typically based
on thresholds on properties/attributes or on metadata. Common examples from archaeology
originate from restrictions of the research focus. Referring back to the previous example
about the scientists who investigate distribution patterns of pottery, these researchers
might, for a specific context, only be interested in pottery from a specific region or
(ancient) period.
5. the adaptation/adjustment (of values) - Finally, there can be a need for (minor)
transformations of the instance values. This can be necessary if the values from
different sources come in different forms and shall be mapped to the standard form of the
viewpoint's domain. Such mapping or transformation rules can, for instance, be used to
change the unit of a property by adapting its values (e.g. the unit of an attribute is
transformed from centimeter to meter by multiplying its values by 0.01). Another example
could be the transfer of coordinates to a different coordinate reference system.
To create such viewpoint specifications, it is useful to apply the systematic approach
presented in Sect. 4 for collecting the required information. This standardizes their
construction and eases the process, as illustrated by the sketch below.
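To make the five components of a viewpoint specification more tangible, the following
Python sketch shows how such a specification might be represented and how a database view
could be generated from it. The table, column and viewpoint names are purely hypothetical
and only serve to illustrate the projection, selection and adaptation steps described
above; this is not the actual CRC 1266 implementation.

```python
# Hypothetical viewpoint specification for researchers studying pottery
# distribution patterns; all table and column names are illustrative.
pottery_distribution = {
    "label": "pottery_distribution",
    "description": "Pottery finds reduced to what distribution analyses need",
    # projection: only the structural parts relevant for this perspective
    "projection": ["id", "classification", "longitude", "latitude"],
    # selection: restrict the instances to the current research focus
    "selection": "region = 'example_region' AND dating_bce BETWEEN 3000 AND 2000",
    # adaptation: map values to the viewpoint's standard form (e.g. cm -> m)
    "adaptation": {"diameter_m": "diameter_cm * 0.01"},
}

def build_view_sql(spec: dict, source_table: str = "find") -> str:
    """Generate a CREATE VIEW statement from a viewpoint specification."""
    columns = ", ".join(spec["projection"])
    adapted = ", ".join(f"{expr} AS {name}"
                        for name, expr in spec["adaptation"].items())
    return (f"CREATE VIEW {spec['label']} AS "
            f"SELECT {columns}, {adapted} FROM {source_table} "
            f"WHERE {spec['selection']};")

print(build_view_sql(pottery_distribution))
```

In a similar manner, a web portal could construct one such view per viewpoint and expose
it to the project members, so that inserts, updates and deletes are routed through the
views as described above.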

6 Conclusion

The literature on data mining is fairly rich. Mining tools have already reached the
maturity to support any kind of data analysis if the data mining problem is well
understood, the intentions behind the models are properly understood, and the problem is
professionally set up. Data mining aims at the development of model suites that allow
dependable and thus justifiable conclusions to be drawn from the given data set. Data
mining is a process that can be based on a framework for systematic modelling driven by a
deep model and a matrix. Textbooks on data mining typically treat the algorithms as blind
search. Data mining is a specific form of modelling; therefore, we can combine modelling
with data mining in a more sophisticated form.